# Day 4: Character Encodings
<br>
<br>

Welcome to day 4 of the 5-Day Data Challenge. Today's topic: character encodings.

In [54]:
import pandas as pd
import numpy as np

# helpful character encoding module
import chardet

# set seed for reproducibility
np.random.seed(0)

## What are encodings?

Before we begin, let's discuss what encodings are. Character encodings are specific sets of rules for mapping raw binary byte strings (e.g. 0110100001101001) to the characters that make up human-readable text (e.g. "hi"). There are many different encodings, and if you try to read text with a different encoding than the one it was originally written in, you end up with scrambled text. You could also end up with "unknown characters". These are what gets printed when there's no mapping between a particular byte and a character in the encoding you're using to read your byte string, and they look like this:
<br>
����������
<br>
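To make the mapping concrete, here is a minimal sketch using only built-in Python. The word "café" and the variable names are illustrative choices, not part of the original notebook. It decodes the 16 bits from the example above, then shows both failure modes: scrambled text from reading with the wrong encoding, and unknown characters where no mapping exists.

```python
# The 16 bits from the example above, split into two 8-bit bytes
raw = bytes([0b01101000, 0b01101001])
decoded = raw.decode("ascii")
print(decoded)  # hi

# Write text in one encoding, then read it back with a different one
cafe_bytes = "café".encode("utf-8")
scrambled = cafe_bytes.decode("windows-1252")
print(scrambled)  # cafÃ©

# Ask for unknown characters instead of an error when a byte has no mapping
unknown = cafe_bytes.decode("ascii", errors="replace")
print(unknown)  # "caf" followed by two U+FFFD replacement characters
```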
Mismatches in character encoding are less common than they used to be, but they are still a problem. The main character encoding you need to know is UTF-8.

UTF-8 is the standard text encoding. Python 3 source code is UTF-8 by default and, ideally, all your data should be as well. It's when things aren't in UTF-8 that you run into trouble.

There are two main data types you'll encounter when working with text in Python 3. One is the string, which is what text is by default.

In [55]:
# start with string
before = "This is sanskrit text: नमस्ते"

#check datatype
type(before)

str

The other is the bytes data type, which is a sequence of integers. To convert a string into bytes, call encode() and specify which encoding to use:
<br>

In [56]:
# encode it as UTF-8, replacing characters that raise errors
after = before.encode("utf-8", errors = "replace")

# check the type
type(after)

bytes

If you look at a bytes object, you'll see that it has a b in front of it, and then maybe some text after. That's because bytes are printed out as if they were characters encoded in ASCII. (ASCII is an older character encoding that doesn't really work for writing any language other than English.) In the output below, the English part of our string is readable, but the Sanskrit word shows up as escape sequences like "\xe0\xa4\xa8", because those bytes have no ASCII character to stand in for them.

In [51]:
# let's look at what the bytes look like
after

b'This is sanskrit text: \xe0\xa4\xa8\xe0\xa4\xae\xe0\xa4\xb8\xe0\xa5\x8d\xe0\xa4\xa4\xe0\xa5\x87'

In [52]:
# let's decode the bytes back into a string using utf-8
print(after.decode("utf-8"))

This is sanskrit text: नमस्ते
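The difference between the two types is easier to see by poking at them. The following is a small sketch (variable names are illustrative): a string is indexed by character, a bytes object by integer, and one character can take several bytes in UTF-8.

```python
text = "This is sanskrit text: नमस्ते"
encoded = text.encode("utf-8")

# Indexing a string gives a character; indexing bytes gives an integer (0-255)
first_char = text[0]
first_byte = encoded[0]
print(first_char, first_byte)  # T 84

# The first four bytes are the ASCII codes for "This"
first_four = list(encoded[:4])
print(first_four)  # [84, 104, 105, 115]

# Each Devanagari character needs more than one byte in UTF-8,
# so the bytes object is longer than the string
print(len(text), len(encoded))

# Decoding with the same encoding recovers the original text exactly
round_trip = encoded.decode("utf-8")
```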




However, when we try to use a different encoding to map our bytes into a string, we get an error. This is because the encoding we're trying to use doesn't know what to do with the bytes we're trying to pass it. You need to tell Python the encoding that the byte string is actually supposed to be in.

You can think of different encodings as different ways of recording music. You can record the same music on a CD, cassette tape or 8-track. While the music may sound more-or-less the same, you need to use the right equipment to play the music from each recording format. The correct decoder is like a cassette player or a CD player. If you try to play a cassette in a CD player, it just won't work.



In [53]:
print(after.decode("ascii"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 23: ordinal not in range(128)
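If you would rather inspect the failure than have it stop the notebook, you can catch the exception. This is a sketch, not part of the original notebook; it relies on the standard `encoding`, `start`, and `object` attributes of `UnicodeDecodeError`, and the variable names are illustrative.

```python
after = "This is sanskrit text: नमस्ते".encode("utf-8")

try:
    after.decode("ascii")
except UnicodeDecodeError as err:
    # err.start is the index of the first byte the codec could not handle
    failed_at = err.start
    bad_byte = err.object[err.start]
    print(f"{err.encoding} failed at position {failed_at} on byte {hex(bad_byte)}")
```

The position and byte match the error message above: everything before position 23 is plain ASCII, and the first byte of the Sanskrit word is where decoding breaks.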

If we convert a string to bytes with encode() and ask for ASCII, we get the bytes the text would have if it were written in ASCII. Since our text isn't in ASCII, though, there will be some characters it can't handle. We can automatically replace the characters that ASCII can't handle. If we do that, however, every character not in ASCII is replaced with a placeholder (a question mark, when encoding). Then, when we convert the bytes back to a string, all we get back is the placeholder. The dangerous part about this is that there's no way to tell which character it should have been. That means we may have just made our data unusable!

In [None]:
# start with string
before = "This is sanskrit text: नमस्ते"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors = "replace")

# decode the bytes back into a string
print(after.decode("ascii"))





This is bad and we want to avoid doing it! It's far better to convert all our text to UTF-8 as soon as we can and keep it in that encoding. The best time to convert non-UTF-8 input into UTF-8 is when you read in files, which we'll talk about next.
<br>
First, however, try converting between bytes and strings with different encodings and see what happens. Notice what this does to your text. Would you want this to happen to data you were trying to analyze?
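One way to run that experiment is sketched below. The list of encodings is an arbitrary selection for illustration; each one is used to encode the string with `errors="replace"` and then decode it again, and the result is compared with the original.

```python
text = "This is sanskrit text: नमस्ते"

results = {}
for name in ["utf-8", "utf-16", "ascii", "latin-1"]:
    # Encode with replacement so unsupported characters don't raise an error
    round_trip = text.encode(name, errors="replace").decode(name)
    results[name] = (round_trip == text)
    print(f"{name:8} lossless={results[name]}  {round_trip}")
```

The Unicode encodings survive the round trip. ASCII and Latin-1 cannot represent Devanagari, so the Sanskrit word comes back as question marks.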
<br>
<br>
<br>

## Reading in files with encoding problems
<br>
Most files you encounter will be encoded with UTF-8. However, sometimes you'll run into an error like this:


In [58]:
# reading in files not in UTF-8 
policeKillingsUS = pd.read_csv("PoliceKillingsUS.csv")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 2: invalid start byte

As you can see, trying to read the CSV file with the default UTF-8 encoding gives you an error. It's the same kind of error we got when we tried to decode UTF-8 bytes as if they were ASCII. This means that the file isn't UTF-8 encoded text, and we don't yet know what it is. One way to figure this out is to let chardet guess from the raw bytes:

In [57]:
# look at the first ten thousand bytes to guess the character encoding
with open("PoliceKillingsUS.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

{'encoding': 'ascii', 'confidence': 1.0}


The result tells us that the right encoding is actually 'ascii' and chardet is 100% confident that this is correct. Let's see if this is true: 

In [59]:
# read in the file with the encoding detected by chardet
policeKillingsUS = pd.read_csv("PoliceKillingsUS.csv", encoding='ascii')


# look at first few lines
policeKillingsUS.head()

UnicodeDecodeError: 'ascii' codec can't decode byte 0x96 in position 2: ordinal not in range(128)

From the output, it looks like chardet was wrong. We used the ASCII encoding and it still gave us an error saying this isn't the right encoding. That's interesting, because chardet reported a confidence of 100%. What if we look past just the first 10,000 bytes? Let's see if we get a different result:
<br>

In [60]:
# look at the first fifty thousand bytes to guess the character encoding
with open("PoliceKillingsUS.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(50000))

# check what the character encoding might be
print(result)

{'encoding': 'windows-1252', 'confidence': 0.73}


Now we get a different result for the encoding used for this file. It appears that we weren't able to detect this earlier because we weren't looking far enough into the file. Let's see if 'windows-1252' works for this file:

In [61]:
# read in the file with the encoding detected by chardet
policeKillingsUS = pd.read_csv("PoliceKillingsUS.csv", encoding='windows-1252')


# look at first few lines
policeKillingsUS.head()

Unnamed: 0,id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
0,3,Tim Elliot,02/01/15,shot,gun,53.0,M,A,Shelton,WA,True,attack,Not fleeing,False
1,4,Lewis Lee Lembke,02/01/15,shot,gun,47.0,M,W,Aloha,OR,False,attack,Not fleeing,False
2,5,John Paul Quintero,03/01/15,shot and Tasered,unarmed,23.0,M,H,Wichita,KS,False,other,Not fleeing,False
3,8,Matthew Hoffman,04/01/15,shot,toy weapon,32.0,M,W,San Francisco,CA,True,attack,Not fleeing,False
4,9,Michael Rodriguez,04/01/15,shot,nail gun,39.0,M,H,Evans,CO,False,attack,Not fleeing,False


Finally, we got the file to read in without error. This is an important lesson to learn: if the encoding chardet suggests doesn't work, try giving it more of the file and see if that changes the result.
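When a detector isn't available or its guess fails, a cruder fallback is to try a short list of candidate encodings until one decodes cleanly. The sketch below is a swapped-in technique, not what chardet does internally; the helper name `first_working_encoding`, the candidate list, and the sample bytes are all hypothetical.

```python
def first_working_encoding(raw, candidates=("ascii", "utf-8", "windows-1252")):
    """Return the first candidate encoding that decodes raw without error."""
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None

# Sample bytes containing 0x96, the byte from the error messages above
# (0x96 is an en dash in windows-1252)
sample = "2015\u20132016".encode("windows-1252")
guess = first_working_encoding(sample)
print(guess)  # windows-1252
```

A clean decode only shows that the bytes are valid in that encoding, not that the encoding is the right one, so it's still worth looking at the decoded text before trusting it.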

## Saving your files with UTF-8 encoding
<br>
Now that we've gone through all the trouble of getting the file into UTF-8, you'll probably want to keep it that way. The easiest way to do this is to save your files with UTF-8 encoding.

In [None]:
# save file (will be saved as UTF-8 by default)
policeKillingsUS.to_csv("policeKillingsUS-utf8.csv")
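The same conversion can be done without pandas, which also makes it easy to check the result. This is a minimal sketch using only the standard library; the file names and contents are made up for illustration, and a temporary directory stands in for your working folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "legacy.csv")        # hypothetical input file
    dst = os.path.join(tmp, "legacy-utf8.csv")   # hypothetical output file

    # Create a small windows-1252 file to stand in for the real data
    with open(src, "w", encoding="windows-1252", newline="") as f:
        f.write("name,years\nTim,2015\u20132016\n")

    # Read with the original encoding, write back out as UTF-8
    with open(src, "r", encoding="windows-1252", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)

    # Check: the raw bytes of the new file must decode as UTF-8
    with open(src, "rb") as f:
        original_bytes = f.read()
    with open(dst, "rb") as f:
        converted_bytes = f.read()
    converted_text = converted_bytes.decode("utf-8")

print(converted_text)
```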


And that's it for today! We didn't do quite as much coding, but take my word for it: if you don't have the right tools, figuring out what encoding a file is in can be a huge time sink.