
### Avoid UnicodeDecodeErrors when loading CSV files

## What are encodings?

Character encodings are specific sets of rules for mapping from raw binary byte strings (that look like this: 0110100001101001) to characters that make up human-readable text (like "hi"). 
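As a small illustration of that mapping, the sketch below splits the bit string from the sentence above into 8-bit bytes and reads them as ASCII. The variable names are only for illustration.

```python
# The 16 bits from the text, split into two 8-bit bytes
bits = "0110100001101001"
raw = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

print(raw)                  # the two raw bytes
print(raw.decode("ascii"))  # the same bytes read as ASCII text
```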

Character encoding mismatches are less common today than they used to be, but they're definitely still a problem. There are lots of different character encodings, but the main one you need to know is UTF-8.

UTF-8 is the standard text encoding. Python 3 source code is UTF-8 by default and, ideally, all your data should be as well. It's when things aren't in UTF-8 that you run into trouble.
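To see why a mismatch causes trouble, it can help to look at how one character turns into different bytes under different encodings. The sketch below uses the euro sign and a handful of encodings chosen purely for illustration.

```python
symbol = "€"
encodings = ["utf-8", "utf-16-le", "cp1252", "iso8859-15"]

# The same character becomes a different byte sequence in each encoding
encoded = {name: symbol.encode(name) for name in encodings}
for name, data in encoded.items():
    print(f"{name:>10}: {data!r} ({len(data)} byte(s))")

# Latin-1 (ISO 8859-1) has no euro sign at all, so encoding fails
try:
    symbol.encode("latin-1")
except UnicodeEncodeError as err:
    print("latin-1 cannot encode it:", err.reason)
```

Bytes written under one of these rules only turn back into "€" if they are read under the same rule.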



In [1]:
# start with a string
before = "This is the euro symbol: €"

In [3]:
# check to see what datatype it is
type(before)

str

You can convert a string into bytes by specifying which encoding it's in:

In [5]:
# encode it as UTF-8 bytes; errors='replace' would substitute any character the target encoding can't represent
after = before.encode(encoding='utf-8', errors='replace')

In [6]:
# check the type
type(after)

bytes

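If you look at a bytes object, you'll see that it has a `b` in front of it, followed by some text. That's because Python prints bytes as if they were ASCII characters wherever it can.

Here you can see that our euro symbol shows up as "\xe2\x82\xac". These are escape sequences for the three bytes that UTF-8 uses to store "€", none of which has a printable ASCII equivalent.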

In [7]:
# take a look at what the bytes look like
print(after)

b'This is the euro symbol: \xe2\x82\xac'


In [8]:
# convert it back to utf-8
print(after.decode(encoding='utf-8'))

This is the euro symbol: €
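Decoding with the wrong encoding does not always raise an error. As a rough sketch, the example below decodes the same UTF-8 bytes as Windows-1252 (an encoding picked here only for illustration), which silently produces garbled text, the kind usually called mojibake.

```python
before = "This is the euro symbol: €"
after = before.encode("utf-8")

# Decoding UTF-8 bytes with the wrong single-byte encoding raises no error,
# but each of the three euro bytes becomes its own unrelated character
garbled = after.decode("cp1252")
print(garbled)

# Undoing the wrong step recovers the text, as long as nothing was lost on the way
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == before)
```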


In [9]:
# try to decode our bytes with the ascii encoding
print(after.decode(encoding="ascii"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)
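When a decode fails like this, the `errors` parameter of `bytes.decode` controls what happens to the bytes that can't be decoded. The sketch below shows two of the built-in handlers. Both throw information away, so they are a last resort rather than a fix.

```python
after = "This is the euro symbol: €".encode("utf-8")

# errors="replace" swaps each undecodable byte for U+FFFD (the replacement character)
replaced = after.decode("ascii", errors="replace")
print(replaced)

# errors="ignore" silently drops the undecodable bytes
ignored = after.decode("ascii", errors="ignore")
print(ignored)
```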

## Reading in files with encoding problems

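Most files you'll encounter will probably be encoded with UTF-8. This is what pandas' `read_csv` assumes by default, so most of the time you won't run into problems. When a file uses a different encoding, reading it fails with a `UnicodeDecodeError` like the one above, and you need to work out which encoding the file actually uses.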

In [None]:
# pandas for reading the CSV, chardet for guessing its character encoding
import pandas as pd
import chardet

# look at the first ten thousand bytes to guess the character encoding
with open("../input/kickstarter-projects/ks-projects-201612.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be (a dict with the guess and a confidence score)
print(result)

# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("../input/kickstarter-projects/ks-projects-201612.csv", encoding='Windows-1252')

# look at the first few lines
kickstarter_2016.head()
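If chardet isn't available, one rough alternative is to try a short list of likely encodings in order. The sketch below uses only the standard library. The helper `read_text_with_fallback`, the candidate list, and the tiny sample file are all hypothetical stand-ins, not part of the Kickstarter data.

```python
import tempfile
from pathlib import Path

def read_text_with_fallback(path, candidates=("utf-8", "cp1252")):
    """Hypothetical helper: return (text, encoding) for the first candidate that decodes."""
    raw = Path(path).read_bytes()
    for name in candidates:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {candidates} could decode {path}")

# Demo on a small stand-in file written as Windows-1252
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_bytes("name,price\ncafé,5€\n".encode("cp1252"))
    text, used = read_text_with_fallback(sample)

print(used)
print(text)
```

A successful decode doesn't prove the guess is right, since single-byte encodings accept almost any byte sequence, so it's still worth looking at the resulting text.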

## Saving your files with UTF-8 encoding

Finally, once you've gone through all the trouble of getting your file into UTF-8, you'll probably want to keep it that way. The easiest way to do that is to save your files with UTF-8 encoding. The good news is that pandas' `to_csv` writes UTF-8 by default, so you don't need to do anything extra:

In [None]:
# save our file (pandas writes UTF-8 by default!)
kickstarter_2016.to_csv("ks-projects-201801.csv")
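To double-check that a saved file really is UTF-8, you can try decoding its raw bytes. The sketch below does the same convert-and-verify round trip with only the standard library. The helper `is_valid_utf8` and the file names are hypothetical, and the small sample stands in for a real dataset.

```python
import tempfile
from pathlib import Path

def is_valid_utf8(path):
    """Hypothetical helper: True if the whole file decodes as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "legacy.csv"        # stand-in for a Windows-1252 file
    converted = Path(tmp) / "converted.csv"  # the UTF-8 copy

    legacy.write_bytes("name,price\ncafé,5€\n".encode("cp1252"))

    # Decode with the source encoding, then write back out as UTF-8
    text = legacy.read_text(encoding="cp1252")
    converted.write_text(text, encoding="utf-8")

    legacy_ok = is_valid_utf8(legacy)
    converted_ok = is_valid_utf8(converted)

print(legacy_ok, converted_ok)
```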