# Introduction
https://www.kaggle.com/alexisbcook/character-encodings
<br>Let's look at some different character encodings.


In [2]:
# modules we'll use
import pandas as pd
import numpy as np

# helpful character encoding module
import chardet


In [2]:
# start with a string
before = "This is the euro symbol: €"

# check to see what datatype it is
type(before)

str

The other data type is bytes, which is a sequence of integers in the range 0 to 255.
<br>You can convert a string into bytes by specifying which encoding to use:

In [3]:
# encode the string as UTF-8 (errors="replace" would substitute any character that cannot be encoded)
after = before.encode("utf-8", errors="replace")

# check the type
type(after)

bytes
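
To make the "sequence of integers" idea concrete, here is a minimal sketch (the variable names are illustrative and not part of the original notebook) that encodes the euro symbol and lists the integer value of each resulting byte.

```python
# Illustrative sketch: a bytes object is a sequence of integers from 0 to 255.
euro = "€"
euro_bytes = euro.encode("utf-8")

# One character can turn into several bytes in UTF-8.
byte_values = list(euro_bytes)
print(len(euro), len(euro_bytes))      # 1 3
print(byte_values)                     # [226, 130, 172]
print([hex(b) for b in euro_bytes])    # ['0xe2', '0x82', '0xac']
```

The hex values match the `\xe2\x82\xac` escapes Python shows when it prints the bytes object.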

### bytes object
If you look at a bytes object, you'll see that it has a b in front of it, followed by some text in quotes.<p>Bytes are printed out as if they were characters encoded in ASCII. (ASCII is an older character encoding that doesn't really work for writing any language other than English.)</p> <p>Here you can see that our euro symbol shows up as "\xe2\x82\xac". These are the three bytes that encode € in UTF-8; they have no ASCII equivalent, so Python prints them as hex escapes.</p>

In [4]:
# take a look at what the bytes look like
after

b'This is the euro symbol: \xe2\x82\xac'

We can convert our bytes back to a string with the correct encoding.
<br>We can see that our text is all there correctly.

In [5]:
# decode the bytes back into a string using utf-8
print(after.decode("utf-8"))

This is the euro symbol: €


However, when we try to use a different encoding to map our bytes into a string, we get an error.<p>This is because the encoding we're trying to use doesn't know what to do with the bytes we're passing it.</p><p>You need to tell Python the encoding that the byte string is actually in.</p>

In [6]:
# try to decode our bytes with the ascii encoding
print(after.decode("ascii"))

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 25: ordinal not in range(128)
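
A wrong encoding does not always raise an error. As a small sketch (the choice of `cp1252`, Python's name for Windows-1252, is just for illustration), decoding the same UTF-8 bytes with an encoding that happens to accept every byte succeeds silently and produces garbled text, often called mojibake.

```python
# Illustrative sketch: decoding with the wrong encoding can succeed but garble the text.
utf8_bytes = "This is the euro symbol: €".encode("utf-8")

# cp1252 assigns a character to each of the three euro bytes, so no error is raised.
wrong = utf8_bytes.decode("cp1252")
right = utf8_bytes.decode("utf-8")

print(wrong)  # This is the euro symbol: â‚¬
print(right)  # This is the euro symbol: €
```

So the absence of an error is not proof that the encoding is right; it is worth checking the decoded text by eye.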

Below we lose the original character!
<br>Encoding to ASCII with errors="replace" swaps the euro symbol for the byte of a question mark, so the original character can no longer be recovered from the bytes :(

In [7]:
# start with a string
before = "This is the euro symbol: €"

# encode it to a different encoding, replacing characters that raise errors
after = before.encode("ascii", errors="replace")

# decode it back into a string using ascii
print(after.decode("ascii"))

This is the euro symbol: ?
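
Besides `"replace"`, Python's `str.encode` accepts other standard error handlers. The sketch below (the sample text and variable names are illustrative) compares a few of them on a string that ASCII cannot represent. Only some of them keep enough information to reconstruct the original character.

```python
# Illustrative sketch: compare standard error handlers when encoding to ASCII.
text = "€5"
handlers = ("replace", "ignore", "backslashreplace", "xmlcharrefreplace")

results = {h: text.encode("ascii", errors=h) for h in handlers}

for handler, encoded in results.items():
    print(f"{handler:>18}: {encoded!r}")

# replace            -> b'?5'         (character lost)
# ignore             -> b'5'          (character dropped)
# backslashreplace   -> b'\\u20ac5'   (code point kept as an escape)
# xmlcharrefreplace  -> b'&#8364;5'   (code point kept as an XML reference)
```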


What happens when we read in data that is not in UTF-8 format?

In [3]:
# try to read in a file not in UTF-8
messi_data = pd.read_csv("data/ks-projects-201612.csv")
messi_data

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 7955: invalid start byte

Notice that we get the same UnicodeDecodeError we got when we tried to decode UTF-8 bytes as if they were ASCII! This tells us that this file isn't actually UTF-8. We don't know what encoding it actually is though.
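
The exception itself records where decoding failed. As a minimal sketch on made-up bytes (the `raw` sample is an assumption, not taken from the Kickstarter file), the `start`, `end`, and `reason` attributes of `UnicodeDecodeError` can be used to pull out the offending byte and some surrounding context.

```python
# Illustrative sketch: inspect a UnicodeDecodeError to locate the offending byte.
raw = b"Trademark\x99 sign"  # made-up sample containing the byte 0x99

failure = None
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    failure = err  # keep a reference; `err` is cleared when the except block ends

if failure is not None:
    print("encoding tried:", failure.encoding)
    print("reason:", failure.reason)
    print("position:", failure.start)
    print("offending bytes:", raw[failure.start:failure.end])
    # show a little context around the problem
    print("context:", raw[max(failure.start - 10, 0):failure.end + 10])
```

Looking at the bytes around the failure position often gives a hint about which language or encoding the file was written in.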

### How do we find the correct encoding?
<p>There are two options:</p>
<p>1 - We can test lots of different character encodings and see if any of them work</p>
<p>2 - The chardet module can give quite accurate estimates as to the encoding. It's not 100% guaranteed to be right, but it's usually faster than guessing by hand.</p>
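
The first option can be sketched with the standard library alone. The helper name `find_working_encodings`, the candidate list, and the sample bytes below are all assumptions made for illustration.

```python
# Illustrative sketch of option 1: try several candidate encodings in turn.
def find_working_encodings(raw_bytes, candidates):
    """Return the candidate encodings that decode raw_bytes without an error."""
    working = []
    for name in candidates:
        try:
            raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue  # this encoding cannot represent the bytes
        working.append(name)
    return working


# made-up sample: text saved as Windows-1252 (cp1252 in Python)
sample = "Brand™ café".encode("cp1252")
found = find_working_encodings(sample, ["ascii", "utf-8", "cp1252", "latin-1"])
print(found)  # ['cp1252', 'latin-1']
```

Note that more than one encoding can decode without error. Here `latin-1` also succeeds, but it maps the trademark byte to a control character, so the decoded text still needs to be checked by eye.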

Here we check the first ten thousand bytes of this file. This is usually enough for a good guess about what the encoding is, and it is much faster than looking at the whole file, which can be very slow for a large file. <p>Another reason to look only at the first part of the file is that the error message shows the first problem at byte position 7955, which falls inside the first ten thousand bytes. So we probably only need to look at the beginning of the file to figure out what's going on.</p>

In [14]:
# look at the first ten thousand bytes to guess the character encoding
with open("data/ks-projects-201612.csv", 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

# check what the character encoding might be
print(result)

{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}


chardet is 73% confident that the encoding is Windows-1252.
<br>We will use Windows-1252 to decode the file.

In [19]:
# read in the file with the encoding detected by chardet
kickstarter_2016 = pd.read_csv("data/ks-projects-201612.csv", encoding='Windows-1252')

# look at the first few lines
kickstarter_2016.head()


Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09 11:36:00,1000,2015-08-11 12:12:28,0,failed,0,GB,0,,,,
1,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26 00:20:50,45000,2013-01-12 00:20:50,220,failed,3,US,220,,,,
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16 04:24:11,5000,2012-03-17 03:24:11,1,failed,1,US,1,,,,
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29 01:00:00,19500,2015-07-04 08:35:03,1283,canceled,14,US,1283,,,,
4,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01 13:38:27,50000,2016-02-26 13:38:27,52375,successful,224,US,52375,,,,


### Saving your files with UTF-8 encoding
<p>Finally, once you have your data in UTF-8 format, you'll probably want to keep it that way.</p>
<p>The easiest way to do that is to save your files with UTF-8 encoding. UTF-8 is the default encoding for pandas' to_csv, so when you save a file this way it will be written as UTF-8 by default:</p>

In [None]:
# save our file (will be saved as UTF-8 by default!)
kickstarter_2016.to_csv("ks-projects-201801-utf8.csv")
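
The same conversion can be done without pandas. Below is a minimal sketch using only the standard library; the helper name `reencode_file` and the temporary sample file are assumptions for illustration, and the helper reads the whole file into memory, so it suits small files best.

```python
# Illustrative sketch: re-save a text file as UTF-8 using only the standard library.
import os
import tempfile


def reencode_file(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Read src_path using src_encoding and write the same text using dst_encoding."""
    # newline="" leaves line endings exactly as they are in the source file
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)


with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "sample-cp1252.csv")
    dst_path = os.path.join(tmp, "sample-utf8.csv")

    # create a small made-up file saved as Windows-1252
    with open(src_path, "wb") as f:
        f.write("name,price\nBrand™,€5\n".encode("cp1252"))

    reencode_file(src_path, dst_path, "cp1252")

    with open(dst_path, "rb") as f:
        utf8_bytes = f.read()

print(utf8_bytes.decode("utf-8"))
```

Passing the encoding explicitly to `open` avoids relying on the platform default, which is not UTF-8 on every system.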