## Handling Decoding Errors

The following cell raises an error while trying to read a data file. See if you can work out what the problem is.

In [1]:
import pandas as pd 

In [2]:
N = pd.read_csv('Nations.txt')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 7: invalid start byte

Sadly, this type of error is really common when you find a new source of data.

Notice that the error reports the offending byte: 0x82. It also reports a position, but that is an offset into a buffer (a temporary memory location) rather than into the file, so it doesn't help you locate the problem. Here's how you would create the byte in question:

In [3]:
byte = b'\x82'

Depending on what encoding we use, this byte could mean different things:

In [4]:
byte.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x82 in position 0: invalid start byte

In [5]:
byte.decode('windows-1250')

'‚'

In [6]:
byte.decode('macintosh')

'Ç'
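
The same comparison can be run over a list of candidate encodings in one pass. This is only a minimal sketch: the helper name `try_decodings` and the particular list of candidates are illustrative choices, not part of the cells above.

```python
def try_decodings(raw, encodings):
    """Decode raw bytes under each encoding; None marks a failed decode."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results

byte = b'\x82'
# Illustrative candidates; add whichever encodings seem plausible for your data.
candidates = ['utf-8', 'cp1250', 'mac_roman', 'cp852']

decoded = try_decodings(byte, candidates)
for enc, text in decoded.items():
    print(enc, repr(text))
```

An encoding that decodes without error is not necessarily the right one. It only tells you the byte is legal there, so you still have to judge whether the resulting character makes sense in context.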

If you know what kind of machine the file was made on, there's a chance you can guess the encoding. Sometimes the Unix `file` command can help.

In [7]:
!file Nations.txt
!file -I Nations.txt

Nations.txt: Non-ISO extended-ASCII text, with CRLF line terminators
Nations.txt: text/plain; charset=unknown-8bit


There's also a Python library, `chardet`, which tries to help you determine what encoding you have. The cell below imports `cchardet`, a faster implementation that offers the same `detect` function.

In [8]:
import cchardet as chardet
with open('Nations.txt', 'rb') as csv_file:
    print(chardet.detect(csv_file.read()))

{'encoding': 'IBM852', 'confidence': 0.7663981914520264}


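Passing the detected encoding to `read_csv`, with runs of whitespace as the separator, lets the file load:

In [11]: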
N = pd.read_csv('Nations.txt', encoding='IBM852', sep=r'\s+')
N.head()

Unnamed: 0,Country,fertility_rate,contraception,infant_mortality,GDP,region
0,Afghanistan,6.9,,154.0,2848.0,Asia
1,Albania,2.6,,32.0,863.0,Europe
2,Algeria,3.81,52.0,44.0,1531.0,Africa
3,American-Samoa,,,11.0,,Oceania
4,Andorra,,,,,Europe
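
The first five rows contain only ASCII characters, so they would look the same under any ASCII-compatible encoding and don't confirm the guess. One rough check, sketched below, is to decode with the guessed encoding and list the non-ASCII characters that come out. The helper name `non_ascii_chars` and the sample bytes `b'Tom\x82'` are assumptions standing in for the real file, and `cp852` is Python's name for IBM852.

```python
import unicodedata

def non_ascii_chars(raw, encoding):
    """Return the distinct non-ASCII characters produced by decoding raw."""
    text = raw.decode(encoding)
    return sorted({ch for ch in text if ord(ch) > 127})

# Hypothetical stand-in for a few bytes of the file; for the real thing,
# read the bytes with open('Nations.txt', 'rb').read().
sample = b'Tom\x82'

for enc in ['cp852', 'cp1250']:
    for ch in non_ascii_chars(sample, enc):
        print(enc, repr(ch), unicodedata.name(ch, 'UNKNOWN'))
```

An accented letter is plausible inside a country name, while a punctuation mark in the same spot suggests the encoding is wrong.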


Often, you just have to go into your file manually and find and correct the encoding errors. It can help to use the `open` function with the `errors` argument set to `"replace"`. This lets you read the file, and it marks each byte that can't be decoded with a special replacement character.

In [60]:
# Undecodable bytes become U+FFFD instead of raising an error
with open('Nations.txt', errors="replace") as f:
    N = pd.read_csv(f, sep=r'\s+')
N.head()

Unnamed: 0,Country,fertility_rate,contraception,infant_mortality,GDP,region
0,Afghanistan,6.9,,154.0,2848.0,Asia
1,Albania,2.6,,32.0,863.0,Europe
2,Algeria,3.81,52.0,44.0,1531.0,Africa
3,American-Samoa,,,11.0,,Oceania
4,Andorra,,,,,Europe
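
To see what the `errors` argument does at the byte level, the sketch below decodes a short byte string under three of Python's standard error handlers. The sample bytes are a hypothetical stand-in, not taken from the file.

```python
# Hypothetical bytes; 0x82 cannot start a UTF-8 sequence.
sample = b'Tom\x82'

replaced = sample.decode('utf-8', errors='replace')          # bad byte becomes U+FFFD
ignored = sample.decode('utf-8', errors='ignore')            # bad byte is silently dropped
escaped = sample.decode('utf-8', errors='backslashreplace')  # bad byte shown as an escape

print(repr(replaced))
print(repr(ignored))
print(repr(escaped))
```

`"replace"` is usually the safer choice for diagnosis because it leaves a visible marker, whereas `"ignore"` hides the problem.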


In [50]:
'\N{REPLACEMENT CHARACTER}'

'�'

In [54]:
N[N.apply(lambda row: row.str.contains('\N{REPLACEMENT CHARACTER}').any(), axis=1)]

Unnamed: 0,Country,fertility_rate,contraception,infant_mortality,GDP,region
161,S?o-Tom�,,,51.0,49.0,Africa
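
When you do need to fix the file by hand, it helps to know where the bad bytes actually sit in the file, which the original error message did not tell you. The following is a rough sketch that works on the raw bytes and reports a file offset and line number for each undecodable spot. The function name `find_undecodable` and the sample bytes are assumptions for illustration; for the real file you would pass in `open('Nations.txt', 'rb').read()`.

```python
def find_undecodable(raw, encoding='utf-8'):
    """Return (file_offset, line_number, byte_value) for each undecodable spot."""
    found = []
    pos = 0
    while pos < len(raw):
        try:
            raw[pos:].decode(encoding)
            break  # everything from pos onward decodes cleanly
        except UnicodeDecodeError as err:
            bad = pos + err.start                 # offset within the whole input
            line = raw.count(b'\n', 0, bad) + 1   # 1-based line number
            found.append((bad, line, raw[bad]))
            pos += err.end                        # resume just past the bad bytes
    return found

# Hypothetical sample with CRLF line endings and one bad byte on line 2.
sample = b'Andorra\r\nTom\x82\r\n'

for offset, line, value in find_undecodable(sample):
    print(f'offset {offset}, line {line}: byte {value:#04x}')
```

Re-slicing the input after every error keeps the sketch short but is slow for files with many bad bytes.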
