In [65]:
import thinkplot
import thinkstats2
import pandas as pd
import numpy as np


##Seaborn for fancy plots. 
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = (8,8)

<h1>Encoding</h1>

Encoding is taking categorical data and transforming it into a numerical representation. Why? In short, you can't do math with words. 

For example, the linear regression algorithm generates a predictive model in the format: Y = slope * X + intercept. If that X value is a categorical value like hair color, it makes no sense to calculate slope * haircolor. (Note: categorical values don't really make sense in a regression with only one variable.)

By encoding these categorical variables, we change them into numbers, which can then be used in equations. 

There are several types of encoding, each translating the data into numbers in a different way. We'll focus on two here; the others are sometimes useful and usually simple, but rarely important for us. 

In [66]:
#Load some data
df = pd.read_csv("Student Performance new.csv")
df = df.rename(columns={"parental level of education":"parent education"})
df.head()

Unnamed: 0.1,Unnamed: 0,race/ethnicity,parent education,lunch,test preparation course,math percentage,reading score percentage,writing score percentage,sex
0,0,group B,bachelor's degree,standard,none,0.72,0.72,0.74,F
1,1,group C,some college,standard,completed,0.69,0.9,0.88,F
2,2,group B,master's degree,standard,none,0.9,0.95,0.93,F
3,3,group A,associate's degree,free/reduced,none,0.47,0.57,0.44,M
4,4,group C,some college,standard,none,0.76,0.78,0.75,M


There are a bunch of categorical values here. If we want to make a predictive model, we'll need to do some prep work. First, we'll look at the level of education. 

<h2>Ordinal Encoding</h2>

First, there are two basic data types of categorical variables - ordinal and nominal. The difference is that ordinal implies some order, e.g.:
<ul>
<li>A Likert scale (Strongly Agree, Agree, Neutral...)
<li>A ranking of weight classes for boxing/wrestling.
<li>A tier for rankings like Michelin Star restaurant ratings.
</ul>

Basically, if the value implies some kind of lesser/greater, first/last, or smaller/bigger relationship, then that categorical variable is likely ordinal. 

If that ordinal property is important - the order is meaningful to the problem being solved - we can capture it with ordinal encoding. We will do the education level as an ordinal value - we are assuming here that the amount of education has an order that matters for this problem.

The code is pretty simple. We:
<ul>
<li>Create a dictionary with the mappings. 
<li>Use the map function to assign the values. (We could loop through with if statements instead, but that's just more work.)
</ul>

In [67]:
#Values in parental education column
df["parent education"].value_counts()

some college          226
associate's degree    222
high school           196
some high school      179
bachelor's degree     118
master's degree        59
Name: parent education, dtype: int64

In [68]:
#Ordinal Encoding. 
ed_dict = {"some high school":1,
            "high school":2,
            "some college":3,
            "associate's degree":4,
            "bachelor's degree":5,
            "master's degree":6}
df["parent education ordinal"] = df["parent education"].map(ed_dict)
df.sample(10)

Unnamed: 0.1,Unnamed: 0,race/ethnicity,parent education,lunch,test preparation course,math percentage,reading score percentage,writing score percentage,sex,parent education ordinal
724,724,group B,some college,standard,none,0.47,0.43,0.41,M,3
483,483,group A,high school,standard,none,0.59,0.52,0.46,M,2
110,110,group D,associate's degree,free/reduced,completed,0.77,0.89,0.98,F,4
165,165,group C,bachelor's degree,standard,completed,0.96,1.0,1.0,F,5
385,385,group E,some college,standard,none,0.67,0.76,0.75,F,3
386,386,group E,bachelor's degree,standard,none,0.64,0.73,0.7,F,5
389,389,group D,master's degree,standard,none,0.73,0.7,0.75,M,6
384,384,group A,some high school,free/reduced,none,0.38,0.43,0.43,F,1
155,155,group C,some college,standard,completed,0.7,0.89,0.88,F,3
263,263,group E,high school,standard,none,0.99,0.93,0.9,F,2


Now that new ordinal column can replace the original for doing analysis. 

How you handle the new column is up to you. In general, I'd probably make a new dataframe with all the encoded columns in it and leave the original as is. You could also replace the column as you go. It doesn't matter much in the end. 
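
One thing worth checking after a dictionary mapping is whether every value was covered, since `map` fills anything missing from the dictionary with NaN. The sketch below uses a made-up toy column (the "doctorate" value is an assumption added for illustration, it is not in the dataset) and keeps the encoded column in a separate dataframe, as suggested above.

```python
import pandas as pd

# Toy stand-in for the education column; "doctorate" is deliberately
# absent from the mapping to show what happens to unmapped values.
toy = pd.DataFrame({"parent education": ["high school", "some college",
                                         "doctorate", "master's degree"]})

ed_dict = {"some high school": 1,
           "high school": 2,
           "some college": 3,
           "associate's degree": 4,
           "bachelor's degree": 5,
           "master's degree": 6}

# Leave the original untouched and collect encoded columns separately.
toy_encoded = pd.DataFrame(index=toy.index)
toy_encoded["parent education ordinal"] = toy["parent education"].map(ed_dict)

# map() returns NaN for any value that is not a key in the dictionary.
missing = toy_encoded["parent education ordinal"].isna()
unmapped = toy.loc[missing, "parent education"].unique()
print(unmapped)
```

If `unmapped` is not empty, the dictionary probably has a typo or the data has a category you didn't expect.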

Now, on to the rest...

<h1>One Hot Encoding</h1>

One hot encoding can be used in almost any situation and is very common. The basic idea is that it takes each distinct category in a categorical variable and converts it into a true/false column of its own (note: there is a slight twist on that to come), e.g.:
<ul>
<li>Suppose we have a column representing hair color, with black, blonde, and brown as the 3 choices:
<ul>
<li>Create a column for each value - black, brown, and blonde.
<li>For whatever hair color that row of data is, they get a 1 in that column. All others get 0. 
</ul>
</ul>
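
To make the idea concrete, here is a minimal hand-rolled sketch of the hair color example. The `hair` series is made-up data, and this is only an illustration of the concept, not how the library does it internally.

```python
import pandas as pd

# Made-up hair color column
hair = pd.Series(["black", "blonde", "brown", "black"], name="hair")

# One column per category: 1 where the row matches that color, 0 otherwise
one_hot = pd.DataFrame({f"hair_{color}": (hair == color).astype(int)
                        for color in sorted(hair.unique())})
print(one_hot)
```

Every row ends up with exactly one 1, which is where the name "one hot" comes from.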

We can do this with a pretty easy function in scikit-learn. We'll do the race/ethnicity as an example...

In [69]:
#List Values
df["race/ethnicity"].value_counts()

group C    319
group D    262
group B    190
group E    140
group A     89
Name: race/ethnicity, dtype: int64

In [70]:
#Load one-hot stuff. 
from sklearn.preprocessing import OneHotEncoder
ohc = OneHotEncoder()

In [71]:
#One Hot Encoding. 

#We need to feed it our data as an array
#And make the array 2D - however many rows by 1 column
tmpArray = np.array(df["race/ethnicity"])
print(tmpArray.shape)
tmpArray = tmpArray.reshape(-1,1)
print(tmpArray.shape)

#Use the one-hot functions to encode
#fit_transform returns a sparse matrix, so convert it to a regular array
encoder = ohc.fit_transform(tmpArray)
encoded = encoder.toarray()
print(encoded)

(1000,)
(1000, 1)
[[0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0.]
 ...
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 0.]]


We now have to put it back together with the original data. Here we'll put it back into a dataframe, but in other scenarios you might be making everything into an array anyway, so you may want to put them into that array. 
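
For the array route, a minimal sketch might look like the following. The `numeric` and `one_hot` arrays are small made-up stand-ins, not the real data.

```python
import numpy as np

# Stand-ins: two numeric feature columns and a one-hot block for three categories
numeric = np.array([[0.72, 0.74],
                    [0.69, 0.88],
                    [0.47, 0.44]])
one_hot = np.array([[0., 1., 0.],
                    [0., 0., 1.],
                    [1., 0., 0.]])

# Stack side by side; both arrays need the same number of rows, in the same order
features = np.hstack([numeric, one_hot])
print(features.shape)
```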

In [72]:
#Show the generated feature names
#(get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it)
ohc.get_feature_names_out(["race/ethnicity"])

array(['race/ethnicity_group A', 'race/ethnicity_group B',
       'race/ethnicity_group C', 'race/ethnicity_group D',
       'race/ethnicity_group E'], dtype=object)

In [73]:
#Make a new df from the encoded data. Concat it to the original 
dfEncoded = pd.DataFrame(encoded, columns=ohc.get_feature_names_out(["race/ethnicity"]))
dfComb = pd.concat([df,dfEncoded], axis=1)
dfComb.head()

Unnamed: 0.1,Unnamed: 0,race/ethnicity,parent education,lunch,test preparation course,math percentage,reading score percentage,writing score percentage,sex,parent education ordinal,race/ethnicity_group A,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E
0,0,group B,bachelor's degree,standard,none,0.72,0.72,0.74,F,5,0.0,1.0,0.0,0.0,0.0
1,1,group C,some college,standard,completed,0.69,0.9,0.88,F,3,0.0,0.0,1.0,0.0,0.0
2,2,group B,master's degree,standard,none,0.9,0.95,0.93,F,6,0.0,1.0,0.0,0.0,0.0
3,3,group A,associate's degree,free/reduced,none,0.47,0.57,0.44,M,4,1.0,0.0,0.0,0.0,0.0
4,4,group C,some college,standard,none,0.76,0.78,0.75,M,3,0.0,0.0,1.0,0.0,0.0


It is a good idea to double-check that the column names are accurate and that the encoded rows line up with the original rows. By default, everything should line up. If you've sorted or filtered your data, though, there may be a misalignment, because pd.concat matches rows by index and the new dataframe gets a fresh 0 to n-1 index. If your data hasn't been shuffled around, it should just work.

This one looks ok.
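
As a sketch of the misalignment mentioned above, the toy frame below has an index that is no longer 0 to n-1 (as can happen after filtering). All names and values here are made up for illustration. Passing the original index to the new dataframe is one way to keep the rows together.

```python
import pandas as pd

# Toy frame whose index is no longer 0..n-1, e.g. after filtering or sorting
df_toy = pd.DataFrame({"group": ["B", "C", "A"]}, index=[10, 20, 30])
encoded_toy = [[0.0, 1.0, 0.0],
               [0.0, 0.0, 1.0],
               [1.0, 0.0, 0.0]]
cols = ["group_A", "group_B", "group_C"]

# The default index (0, 1, 2) does not match 10, 20, 30, so concat pads with NaN
misaligned = pd.concat([df_toy, pd.DataFrame(encoded_toy, columns=cols)], axis=1)

# Reusing the original index keeps each encoded row next to its source row
aligned = pd.concat(
    [df_toy, pd.DataFrame(encoded_toy, columns=cols, index=df_toy.index)],
    axis=1)

print(misaligned.shape, aligned.shape)
```

A quick `isna().sum()` on the combined dataframe is a cheap way to catch the misaligned case.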

Now, back to that slight twist from above...

<h2>Dropping a Column</h2>

A common practice with one hot encoding is to drop one of the columns. There are a couple of reasons this makes sense:
<ul>
<li>All 0s can carry meaning, so the category we remove is still represented: it is the row with 0s in every remaining column. 
<li>This reduces multicollinearity (the full set of one-hot columns always sums to 1), which is a problem particularly for linear regression. 
</ul>

However, there are also scenarios where we don't want to do it:
<ul>
<li>With tree-based methods, which should generally have every column left in.  
<li>When using regularization, a common ML technique to fight overfitting. 
</ul>
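
The multicollinearity point can be checked numerically. The sketch below uses a made-up three-category variable and builds the one-hot block by hand with NumPy. With an intercept column included, the full one-hot block makes the design matrix rank deficient, while dropping one column restores full column rank.

```python
import numpy as np

# Made-up category codes for six rows (0, 1, 2 stand for three categories)
codes = np.array([0, 1, 2, 0, 1, 2])

full = np.eye(3)[codes]   # one-hot block with all 3 columns
dropped = full[:, 1:]     # same block with the first column dropped

intercept = np.ones((len(codes), 1))
X_full = np.hstack([intercept, full])        # 4 columns
X_dropped = np.hstack([intercept, dropped])  # 3 columns

# Rank below the column count means one column is a combination of the others
print(np.linalg.matrix_rank(X_full), X_full.shape[1])
print(np.linalg.matrix_rank(X_dropped), X_dropped.shape[1])
```

In the full version, the three one-hot columns add up to the intercept column, so one of them carries no new information.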

For now, just drop one, since we're doing regression, which is negatively impacted by the redundant column. Later on it can be a case-by-case decision, and we'll often leave the columns as is. Ideally, you'd drop or not depending on the exact algorithm you use. In code, the difference is simple:

In [77]:
#Drop One - we can specify it in one argument for the encoder. 
ohc2 = OneHotEncoder(drop="first")

#We need to feed it our data as an array
#And make the array 2D - however many rows by 1 column
tmpArray2 = np.array(df["race/ethnicity"])
tmpArray2 = tmpArray2.reshape(-1,1)
#Use the one-hot functions to encode
encoder2 = ohc2.fit_transform(tmpArray2)
encoded2 = encoder2.toarray()

dfEncoded2 = pd.DataFrame(encoded2, columns=ohc2.get_feature_names_out(["race/ethnicity"]))
dfComb2 = pd.concat([df,dfEncoded2], axis=1)
dfComb2.head()

Unnamed: 0.1,Unnamed: 0,race/ethnicity,parent education,lunch,test preparation course,math percentage,reading score percentage,writing score percentage,sex,parent education ordinal,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E
0,0,group B,bachelor's degree,standard,none,0.72,0.72,0.74,F,5,1.0,0.0,0.0,0.0
1,1,group C,some college,standard,completed,0.69,0.9,0.88,F,3,0.0,1.0,0.0,0.0
2,2,group B,master's degree,standard,none,0.9,0.95,0.93,F,6,1.0,0.0,0.0,0.0
3,3,group A,associate's degree,free/reduced,none,0.47,0.57,0.44,M,4,0.0,0.0,0.0,0.0
4,4,group C,some college,standard,none,0.76,0.78,0.75,M,3,0.0,1.0,0.0,0.0


Look at the data and find a row where the race/ethnicity is group A (row 3 above). It is now represented by 0s in all of the remaining race/ethnicity columns. 

From here on out (assuming you've dealt with other columns) you can proceed to do whatever you need to do. 
<br><br><br>
<h2>That's too much work, you dummy!</h2>

Lastly, as a shortcut, we can do all of this quickly with a pandas function, get_dummies. The main reason to care about the steps above is that later on we can build pipelines of steps to prep data, so getting a little familiarity doesn't hurt. For now this is just as good, so you can do it the easy way. 

In [75]:
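#Get dummies - encodes every text column at once
#(dtype=int keeps the 0/1 output on pandas 2.x, which defaults to True/False)
df2 = pd.get_dummies(df, dtype=int)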
df2.head()

Unnamed: 0.1,Unnamed: 0,math percentage,reading score percentage,writing score percentage,parent education ordinal,race/ethnicity_group A,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,...,parent education_high school,parent education_master's degree,parent education_some college,parent education_some high school,lunch_free/reduced,lunch_standard,test preparation course_completed,test preparation course_none,sex_F,sex_M
0,0,0.72,0.72,0.74,5,0,1,0,0,0,...,0,0,0,0,0,1,0,1,1,0
1,1,0.69,0.9,0.88,3,0,0,1,0,0,...,0,0,1,0,0,1,1,0,1,0
2,2,0.9,0.95,0.93,6,0,1,0,0,0,...,0,1,0,0,0,1,0,1,1,0
3,3,0.47,0.57,0.44,4,1,0,0,0,0,...,0,0,0,0,1,0,0,1,0,1
4,4,0.76,0.78,0.75,3,0,0,1,0,0,...,0,0,1,0,0,1,0,1,0,1


In [76]:
#And with dropping
#(dtype=int keeps the 0/1 output on pandas 2.x, which defaults to True/False)
df3 = pd.get_dummies(df, drop_first=True, dtype=int)
df3.head()

Unnamed: 0.1,Unnamed: 0,math percentage,reading score percentage,writing score percentage,parent education ordinal,race/ethnicity_group B,race/ethnicity_group C,race/ethnicity_group D,race/ethnicity_group E,parent education_bachelor's degree,parent education_high school,parent education_master's degree,parent education_some college,parent education_some high school,lunch_standard,test preparation course_none,sex_M
0,0,0.72,0.72,0.74,5,1,0,0,0,1,0,0,0,0,1,1,0
1,1,0.69,0.9,0.88,3,0,1,0,0,0,0,0,1,0,1,0,0
2,2,0.9,0.95,0.93,6,1,0,0,0,0,0,1,0,0,1,1,0
3,3,0.47,0.57,0.44,4,0,0,0,0,0,0,0,0,0,0,1,1
4,4,0.76,0.78,0.75,3,0,1,0,0,0,0,0,1,0,1,1,1
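
As a rough preview of the pipeline idea mentioned earlier, scikit-learn's `ColumnTransformer` can apply different encoders to different columns in one step. This is only a sketch on a made-up toy dataframe, and the variable names (`toy`, `ed_order`, `prep`) are assumptions for illustration. Note that `OrdinalEncoder` numbers categories from 0, not from 1 like the dictionary used earlier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Made-up rows that mimic a few columns of the student data
toy = pd.DataFrame({
    "race/ethnicity": ["group B", "group C", "group A", "group C"],
    "parent education": ["high school", "some college",
                         "master's degree", "some high school"],
    "math percentage": [0.72, 0.69, 0.47, 0.76],
})

# Explicit order for the ordinal column, lowest to highest
ed_order = ["some high school", "high school", "some college",
            "associate's degree", "bachelor's degree", "master's degree"]

prep = ColumnTransformer(
    transformers=[
        ("ordinal", OrdinalEncoder(categories=[ed_order]), ["parent education"]),
        ("onehot", OneHotEncoder(drop="first"), ["race/ethnicity"]),
    ],
    remainder="passthrough",  # keep the numeric column unchanged
    sparse_threshold=0,       # always return a regular (dense) array
)

# Columns come out in transformer order: ordinal, one-hot, then passthrough
X = prep.fit_transform(toy)
print(X.shape)
```

The output is a plain array with one ordinal column, two one-hot columns (group A is dropped), and the untouched math column.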
