In [1]:
import pandas as pd
import numpy as np

In [2]:
import chardet

In [4]:
np.random.seed(0)

### Character Encoding
 - a character encoding is a specific set of rules for converting raw binary byte strings (e.g. 1010110011) to human-readable text characters
 - UTF-8: the standard text encoding
 - Python 3 source code is read as UTF-8 by default
 - two text-related data types in Python 3: str (text) & bytes (a sequence of integers from 0 to 255)
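
A minimal sketch of the str / bytes distinction from the list above. The sample string is made up for illustration; only built-in methods are used.

```python
text = "€5"
raw = text.encode("utf-8")

# A str is a sequence of Unicode characters.
chars = list(text)            # ['€', '5']

# A bytes object is a sequence of integers in the range 0-255.
byte_values = list(raw)       # [226, 130, 172, 53]

print(type(text).__name__, chars)
print(type(raw).__name__, byte_values)
```

The single character '€' becomes three integers because UTF-8 uses a variable number of bytes per character.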

In [16]:
# note: '€' is actually the euro sign (U+20AC), not a dollar sign
textInStrFormat = "this is the dollar symbol: €"
type(textInStrFormat)

str

In [17]:
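# encode the str as UTF-8 --> a bytes object
# errors='replace' substitutes any character the target encoding cannot represent
# (nothing is replaced here, because UTF-8 can represent every character)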

textInBytesFormat = textInStrFormat.encode("utf-8", errors = 'replace')
type(textInBytesFormat)

bytes

In [18]:
textInBytesFormat

b'this is the dollar symbol: \xe2\x82\xac'

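- the 'b' prefix in the printed output marks a bytes object, and '\xe2\x82\xac' is the three bytes that UTF-8 uses for '€'
- they show up as hex escapes because bytes are printed as if they were ASCII text, and these three byte values have no ASCII character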
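
A short sketch of the printing behaviour described above, using only built-ins (the variable names are illustrative):

```python
euro_bytes = "€".encode("utf-8")

# One character, but three bytes in UTF-8.
n_chars = len("€")
n_bytes = len(euro_bytes)

# repr() shows ASCII-range bytes as characters and everything else as \xNN.
shown = repr(b"A" + euro_bytes)

print(n_chars, n_bytes, shown, euro_bytes.hex())
```

Here the byte for 'A' prints as a letter, while the three bytes of '€' print as `\xe2\x82\xac`.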

In [19]:
# decode the bytes as UTF-8 --> back to a str

textInBytesFormat.decode("utf-8")

'this is the dollar symbol: €'

__Don't try to play a cassette in a CD player: bytes must be decoded with the same encoding that produced them__

In [20]:
# try to decode the UTF-8 bytes with the ascii encoding
# raises UnicodeDecodeError: ASCII only covers byte values 0-127

textInBytesFormat.decode("ascii")

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 27: ordinal not in range(128)

In [21]:
textInStrFormat = "this is the dollar symbol: €"

# encode the string to bytes as ASCII
# any character that is not in ASCII is replaced with '?'
textInBytesFormat = textInStrFormat.encode("ascii", errors = 'replace')

# decode the bytes back to a string as ASCII
# the '€' was already replaced during encoding, so it cannot be recovered here
textInBytesFormat.decode("ascii")

'this is the dollar symbol: ?'

The dangerous part about the cell above is that there's no way to tell which character the '?' symbol should have been. That means we may have just made our data unusable!
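
To make the data loss concrete, here is a hedged sketch comparing a few of the standard `errors` handlers of `str.encode`. The sample string is invented; the handler names are the built-in ones.

```python
text = "price: €"

# The default handler, errors="strict", refuses to encode instead of guessing.
try:
    text.encode("ascii")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = exc

# Lossy handlers: the original character is gone for good.
replaced = text.encode("ascii", errors="replace")   # b'price: ?'
ignored = text.encode("ascii", errors="ignore")     # b'price: '

# Keeps the code point as an escape sequence instead of discarding it.
escaped = text.encode("ascii", errors="backslashreplace")  # b'price: \\u20ac'

print(replaced, ignored, escaped)
```

Leaving the default `strict` handler in place is usually the safer choice while exploring data, since a loud error is easier to notice than a silent '?'.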

### Reading in files with encoding problems

Most files you'll encounter will probably be encoded with UTF-8. This is what pandas' read_csv assumes by default, so most of the time you won't run into problems. However, sometimes you'll get a UnicodeDecodeError, as with the second file below:

In [22]:
# no error
ksp_2018 = pd.read_csv("ks-projects-201801.csv")

In [27]:
# raises UnicodeDecodeError: this file is not valid UTF-8
ksp_2016 = pd.read_csv("ks-projects-201612.csv")

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x99 in position 11: invalid start byte
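
A small sketch of why a lone 0x99 byte triggers this error. It assumes, for illustration only, that the byte came from Windows-1252 text, where 0x99 is the trademark sign.

```python
raw = b"\x99"

# In UTF-8, 0x99 (binary 10011001) is a continuation byte: it may only
# follow a lead byte, so it can never start a character on its own.
try:
    raw.decode("utf-8")
    utf8_ok = True
    reason = None
except UnicodeDecodeError as exc:
    utf8_ok = False
    reason = exc.reason

# The same byte is a perfectly valid character in Windows-1252.
as_cp1252 = raw.decode("windows-1252")

print(utf8_ok, reason, as_cp1252)
```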

Use the __chardet module__ to try to guess the right encoding automatically. It's not guaranteed to be right, but it's usually faster than testing a bunch of different character encodings by hand to see if any of them work.

In [26]:
# look at the first ten thousand bytes to guess the character encoding

with open('ks-projects-201612.csv', 'rb') as rawdata:
    result = chardet.detect(rawdata.read(10000))

result

{'confidence': 0.73, 'encoding': 'windows-1252'}

In [33]:
with open('ks-projects-201801.csv', 'rb') as rawdata:
    result2 = chardet.detect(rawdata.read(10000))
result2

{'confidence': 0.73, 'encoding': 'windows-1252'}

chardet is 73% confident that the right encoding is "Windows-1252"

In [None]:
# Open question: why run the detection on the 2018 file when the 2016 file is the one that fails to load?
# Open question: why is the 2018 file, which loads fine as UTF-8, also reported as Windows-1252?

In [28]:
# instead of using the default 'utf-8' encoding, use 'windows-1252'

ksp_2016 = pd.read_csv("ks-projects-201612.csv", encoding = 'windows-1252')

In [30]:
ksp_2016.sample()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,Unnamed: 13,Unnamed: 14,Unnamed: 15,Unnamed: 16
81184,1482271562,The Hero Project: Documentary on single father...,Photography,Photography,USD,2013-04-21 13:01:24,45000,2013-03-07 13:02:35,453,failed,17,US,453,,,,


Now the file can be read: pandas decoded the Windows-1252 bytes into regular Python strings. Chardet guessed the encoding correctly!
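
If chardet is unavailable or its confidence is low, the manual alternative is to try a short list of candidate encodings in order. The sketch below uses only the standard library; `read_text_with_fallback`, the candidate list and the demo file are all hypothetical names chosen for illustration.

```python
import os
import tempfile


def read_text_with_fallback(path, candidates=("utf-8", "windows-1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    with open(path, "rb") as f:
        raw = f.read()
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {candidates} could decode {path}")


# Demo on a throwaway file that contains the Windows-1252 byte 0x99.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as f:
        f.write(b"name\nAcme\x99\n")
    text, used = read_text_with_fallback(demo_path)

print(used, repr(text))
```

A clean decode is not proof that the encoding is right: some single-byte encodings accept almost any byte sequence, so it is still worth eyeballing a few rows afterwards.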

### Saving files with UTF-8 encoding

The easiest approach is to save files with UTF-8 encoding. Since UTF-8 is the default encoding for pandas' to_csv, the file will be saved as UTF-8 by default:

In [31]:
ksp_2016.to_csv("ksp_2016_utf8.csv")
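
Outside pandas, the same conversion can be sketched with the built-in `open()`. Unlike to_csv, `open()` may fall back to a platform-dependent default encoding, so this sketch passes `encoding` explicitly on both sides. The file names and contents are made up for the demo.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "legacy.csv")        # hypothetical input file
    dst = os.path.join(tmp, "legacy_utf8.csv")   # hypothetical output file

    # Create a small Windows-1252 file to convert.
    with open(src, "wb") as f:
        f.write(b"name\nAcme\x99\n")

    # Read with the source encoding, write with UTF-8.
    # newline="" leaves the line endings untouched.
    with open(src, encoding="windows-1252", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(fin.read())

    with open(dst, "rb") as f:
        utf8_bytes = f.read()

print(utf8_bytes)
```

The single byte 0x99 in the source becomes the three-byte UTF-8 sequence for the same character in the output.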