In [1]:
import numpy as np
import pandas as pd

# character encoding detection module
import charset_normalizer

np.random.seed(0)

#### Character encodings
Character encodings are specific sets of rules for mapping from raw binary byte strings to the characters that make up human-readable text.

UTF-8 is the standard text encoding. Python 3 reads source files as UTF-8 by default and, ideally, all your data should be in UTF-8 as well. It's when things aren't in UTF-8 that you run into trouble.
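To make the "rules for mapping" idea concrete, here is a minimal sketch that encodes one string with two different encodings and then decodes the bytes with the wrong one. The sample string "café" is an illustrative assumption, not part of the dataset used later.

```python
# The same text maps to different bytes under different encodings.
text = "café"  # illustrative sample string

utf8_bytes = text.encode("utf-8")      # é becomes two bytes in UTF-8
latin1_bytes = text.encode("latin-1")  # é becomes one byte in Latin-1

print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'

# Decoding with the wrong rules does not always fail; it can silently
# produce garbled text instead.
garbled = utf8_bytes.decode("latin-1")
print(garbled)       # cafÃ©
```

The last line shows why mismatched encodings are hard to spot: no error is raised, yet the text is wrong.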

### Encoding: Converting str to bytes

In [16]:
before = 'MyStringis: Here $'
type(before)

str

In [17]:
# encode the string as UTF-8 bytes, replacing any characters that can't be encoded
after = before.encode("utf-8", errors="replace")
type(after)

bytes

In [18]:
print(after)

b'MyStringis: Here $'


In [19]:
print(after.decode("utf-8"))

MyStringis: Here $


You can think of different encodings as different ways of recording music. You can record the same music on a CD, cassette tape or 8-track. While the music may sound more-or-less the same, you need to use the right equipment to play the music from each recording format. The correct decoder is like a cassette player or a CD player. If you try to play a cassette in a CD player, it just won't work.

In [20]:
print(after.decode("ascii"))

MyStringis: Here $
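The ASCII decode above succeeds only because every character in the string happens to be plain ASCII, and ASCII text has the same bytes in UTF-8. As a sketch of the "wrong player" case, the example below uses a hypothetical string containing a euro sign, which ASCII cannot represent.

```python
# Hypothetical string with a non-ASCII character (the euro sign).
before = "price: 10 €"
after = before.encode("utf-8")

# Decoding UTF-8 bytes as ASCII fails as soon as a non-ASCII byte appears.
try:
    after.decode("ascii")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print("decode failed:", err.reason)

# Encoding to ASCII with errors="replace" avoids the exception,
# but the original character is lost for good.
lossy = before.encode("ascii", errors="replace")
print(lossy)                  # b'price: 10 ?'
print(lossy.decode("ascii"))  # price: 10 ?
```

Note that errors="replace" trades an exception for data loss: the euro sign cannot be recovered from the replaced bytes.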


It's better to convert all our text to UTF-8 as soon as we can and keep it in that encoding. The best time to convert non-UTF-8 input into UTF-8 is when you read in files.

Most files are encoded in UTF-8. If a file is not and you try to read it with "pd.read_csv('...')" without specifying an encoding, pandas assumes UTF-8 and raises a UnicodeDecodeError.
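The failure can be reproduced on a small scale. The sketch below writes a tiny CSV in Windows-1252 to a temporary file, reads it with the default settings, and then reads it again with the right encoding. The file name and its contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample: a tiny CSV saved as Windows-1252 rather than UTF-8.
csv_text = "name,city\nZoë,Málaga\n"
path = os.path.join(tempfile.mkdtemp(), "sample_cp1252.csv")
with open(path, "w", encoding="cp1252", newline="") as f:
    f.write(csv_text)

# With no encoding argument, read_csv assumes UTF-8 and fails on these bytes.
try:
    pd.read_csv(path)
    failed = False
except UnicodeDecodeError as err:
    failed = True
    print("default read failed:", err.reason)

# Passing the encoding the file was actually written in fixes the read.
df = pd.read_csv(path, encoding="cp1252")
print(df)
```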

### charset_normalizer.detect

charset_normalizer.detect can be used to detect the encoding of data by looking at its first bytes (e.g. the first 10,000).

In [23]:
# read first 10000 bytes
with open("./NFL Play by Play 2009-2017 (v4).csv",'rb') as rawdata:
    result = charset_normalizer.detect(rawdata.read(10000))

print(result)

{'encoding': 'UTF-8-SIG', 'language': 'English', 'confidence': 1.0}


In [24]:
# now that the encoding is known, we can read the file correctly:
# read in the file with the encoding detected by charset_normalizer
nfl_data = pd.read_csv("./NFL Play by Play 2009-2017 (v4).csv", encoding='UTF-8-SIG')


In [25]:
nfl_data.head()

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2009-09-10,2009091000,1,1,,15:00,15,3600.0,0.0,TEN,...,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,2009-09-10,2009091000,1,1,1.0,14:53,15,3593.0,7.0,PIT,...,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,2009-09-10,2009091000,1,1,2.0,14:16,15,3556.0,37.0,PIT,...,,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,,,2009
3,2009-09-10,2009091000,1,1,3.0,13:35,14,3515.0,41.0,PIT,...,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2009-09-10,2009091000,1,1,4.0,13:27,14,3507.0,8.0,PIT,...,,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,,,2009


### Saving your data with UTF-8 encoding

If you don't specify an encoding when saving, pandas writes the file as UTF-8 by default.

In [None]:
# no encoding argument is given, so pandas writes the file as UTF-8
nfl_data.to_csv("NLF_data_")
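To check that a saved file really is UTF-8, one option is to read the raw bytes back and decode them. This is a minimal sketch on a made-up DataFrame and a temporary file, not on the NFL data; passing encoding="utf-8" just spells out the default.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data containing non-ASCII characters.
df = pd.DataFrame({"name": ["Zoë"], "city": ["Málaga"]})
out_path = os.path.join(tempfile.mkdtemp(), "sample_utf8.csv")

# encoding="utf-8" is already the default; it is written out here for clarity.
df.to_csv(out_path, index=False, encoding="utf-8")

# Inspect the raw bytes: the accented letters are stored as UTF-8 sequences.
with open(out_path, "rb") as f:
    raw = f.read()
print(raw)

# Reading back with the default (UTF-8) settings recovers the original text.
round_trip = pd.read_csv(out_path)
print(round_trip)
```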