# Character Encodings 

Character encodings are specific sets of rules for mapping from raw binary byte strings (that look like this: 0110100001101001) to characters that make up human-readable text (like "hi"). There are many different encodings, and decoding text with the wrong one can lead to things like this:

- wrong encoding: æ–‡å—åŒ–ã??
- unknown encoding: ����������
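
To make the mapping concrete, here is a minimal sketch that turns the bit string above into text and then reproduces the first kind of garbling by decoding with the wrong encoding. The variable names and the choice of Windows-1252 as the "wrong" encoding are assumptions made for illustration.

```python
# The 16 bits from the example above, as a string of 0s and 1s
bits = "0110100001101001"

# Pack the bits into two raw bytes, then decode them as ASCII text
raw = int(bits, 2).to_bytes(len(bits) // 8, byteorder="big")
decoded = raw.decode("ascii")
print(decoded)

# Encode Japanese text as UTF-8, then decode it with the wrong encoding;
# bytes that Windows-1252 does not define are replaced instead of raising
original = "文字化け"
garbled = original.encode("utf-8").decode("cp1252", errors="replace")
print(garbled)
```

The first print shows "hi"; the second shows garbled text much like the "wrong encoding" line above.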

Character encoding mismatches are less common today than they used to be, but they're definitely still a problem. Of the many character encodings out there, the main one you need to know is UTF-8.

UTF-8 is the standard text encoding. Python 3 source code is UTF-8 by default and, ideally, all your data should be as well. It's when things aren't in UTF-8 that you run into trouble.

First, let's cover the relevant datatypes. There are two main datatypes in Python when dealing with text: strings and bytes.

In [1]:
string = "This is the euro symbol: €"
type(string)

str

In [7]:
string

'This is the euro symbol: €'

In [4]:
byte = string.encode("utf-8", errors = "replace")
type(byte)

bytes

In [5]:
byte

b'This is the euro symbol: \xe2\x82\xac'

Notice that the bytes object has a 'b' before it. The 'b' marks it as bytes rather than a string, and the bytes are printed out as if they were characters encoded in ASCII. (ASCII is an older character encoding that doesn't really work for writing any language other than English.) Bytes with no printable ASCII character are shown as hex escapes instead, so here our euro symbol appears as the three bytes that encode it in UTF-8: "\xe2\x82\xac". (Decoding bytes like these with the wrong encoding is what produces [mojibake](https://en.wikipedia.org/wiki/Mojibake).)
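
As a quick illustration of the difference between the two datatypes, the sketch below compares lengths and looks at the raw byte values. The variable names `text`, `data` and `euro_bytes` are just placeholders chosen for this example.

```python
text = "This is the euro symbol: €"
data = text.encode("utf-8")

# A str is measured in characters, a bytes object in bytes
print(len(text), len(data))

# Indexing or slicing bytes yields integers, not characters
euro_bytes = list(data[-3:])
print(euro_bytes)
print([hex(b) for b in euro_bytes])
```

The euro symbol counts as one character in the string but takes up three bytes once encoded as UTF-8, which is why the two lengths differ by two.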



When we convert our bytes back to a string with the correct encoding, we can see that our text is all there correctly, which is great!

In [8]:
byte.decode("utf-8")

'This is the euro symbol: €'

However, when we try to use a different encoding to map our bytes into a string, we get an error. This is because the encoding we're trying to use doesn't know what to do with the bytes we're trying to pass it. You need to tell Python the encoding that the byte string is actually supposed to be in.

In [9]:
byte.decode("ascii")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)
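
If you would rather not have the error stop your program, one option is to catch it, or to pass an `errors` argument to `decode`. This is only a sketch of those options, not a recommendation: both `"replace"` and `"ignore"` throw information away.

```python
data = "This is the euro symbol: €".encode("utf-8")

# Catch the error instead of letting it stop the program
try:
    data.decode("ascii")
except UnicodeDecodeError as err:
    print(f"could not decode: {err}")

# errors="replace" substitutes U+FFFD for each undecodable byte
replaced = data.decode("ascii", errors="replace")
# errors="ignore" silently drops the undecodable bytes
ignored = data.decode("ascii", errors="ignore")
print(replaced)
print(ignored)
```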

We can also lose data by converting to ASCII (English characters only) and back again, because the conversion is lossy.

In [10]:
before = "This is the euro symbol: €"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors = "replace")

# decode the ASCII bytes back into a string
print(after.decode("ascii"))

This is the euro symbol: ?
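
One way to check for this kind of loss before it happens is to test whether the text survives a round trip through the target encoding. The helper below, `survives_round_trip`, is a hypothetical name for this sketch and not part of any library.

```python
def survives_round_trip(text: str, encoding: str) -> bool:
    """Return True if text can be encoded and decoded without loss."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The encoding has no way to represent at least one character
        return False


sample = "This is the euro symbol: €"
print(survives_round_trip(sample, "ascii"))
print(survives_round_trip(sample, "utf-8"))
print(survives_round_trip(sample, "cp1252"))
```

ASCII fails the check because it has no euro symbol, while UTF-8 and Windows-1252 (`cp1252`) both pass.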


## Reading in a file: 

Often we get a UnicodeDecodeError when reading in a file, just like the one we got when we tried to decode UTF-8 bytes as if they were ASCII. This tells us that the file isn't actually in the encoding Python tried to read it with. We don't know what encoding it actually is, though. One way to figure it out is to test a bunch of different character encodings and see if any of them work.

In [33]:
# try to read in a file not in UTF-8
with open("files/ks-projects-201801.csv") as f: 
    print(f.readlines())

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 7466: character maps to <undefined>
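
The trial-and-error approach could look something like the sketch below. The function name `first_working_encoding`, the candidate list, and the small temporary file standing in for the real CSV are all assumptions for illustration.

```python
import os
import tempfile


def first_working_encoding(path, candidates):
    """Return the first candidate that decodes the whole file, or None."""
    for encoding in candidates:
        try:
            with open(path, encoding=encoding) as f:
                f.read()
            return encoding
        except UnicodeDecodeError:
            continue
    return None


# Build a small stand-in file: text saved as Windows-1252 bytes
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write("name,price\ncafé,5 €\n".encode("cp1252"))
    sample_path = tmp.name

guess = first_working_encoding(sample_path, ["utf-8", "cp1252", "latin-1"])
print(guess)
os.remove(sample_path)
```

Keep in mind that decoding without an error doesn't prove the encoding is right. Latin-1, for example, accepts any sequence of bytes, so it will always "work" even when the resulting text is garbled.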



### chardet

A better way, though, is to use the chardet module to try and automatically guess what the right encoding is. It's not 100% guaranteed to be right, but it's usually faster than just trying to guess.


In [34]:
import chardet

# look at the first ten thousand bytes to guess the character encoding
with open("files/ks-projects-201801.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}


So chardet is 73% confident that the right encoding is "Windows-1252". We can use this to correctly open the file (though I can't show that here, as Windows does the conversion for me by default). Here is a [good example](https://www.kaggle.com/rtatman/data-cleaning-challenge-character-encodings/notebook), though.
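
For a rough idea of what that looks like, here is a sketch that passes a guessed encoding to `open` and then saves a UTF-8 copy. To keep it self-contained it doesn't call chardet or touch the Kickstarter file: the `detected` dictionary just mirrors the result printed above, and the sample file and its contents are made up.

```python
import os
import tempfile

# Stand-in for the chardet result shown above
detected = {"encoding": "Windows-1252", "confidence": 0.73, "language": ""}

# Create a small sample file containing Windows-1252 bytes
workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, "sample-1252.csv")
with open(source_path, "wb") as f:
    f.write("name,price\ncafé,5 €\n".encode("cp1252"))

# Read the file using the guessed encoding
with open(source_path, encoding=detected["encoding"], newline="") as f:
    contents = f.read()

# Save a UTF-8 copy so later reads don't need a special encoding
utf8_path = os.path.join(workdir, "sample-utf8.csv")
with open(utf8_path, "w", encoding="utf-8", newline="") as f:
    f.write(contents)

with open(utf8_path, encoding="utf-8") as f:
    print(f.read())
```

Since the guess is only 73% confident, it's worth eyeballing the decoded text before trusting the UTF-8 copy.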