# Preprocessing Data

## Dummy variables

Scikit-learn estimators generally expect numerical input. To turn a categorical variable into numerical ones, we create binary variables called dummy variables: each column represents one value the variable can take, and a 0 or 1 indicates the absence or presence of that value.
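To make the idea concrete, here is a minimal pure-Python sketch of the encoding. The `one_hot` helper and the colour values are made up for illustration; they are not part of scikit-learn or pandas.

```python
# Hypothetical helper for illustration only; not a library function.
def one_hot(values):
    """Return (categories, rows), where each row holds one 0/1 flag per category."""
    categories = sorted(set(values))  # one column per distinct value
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


categories, rows = one_hot(['red', 'green', 'red'])
print(categories)  # ['green', 'red']
print(rows)        # [[0, 1], [1, 0], [0, 1]]
```

Each row contains exactly one 1, in the column that matches the original value.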



In [1]:
import pandas as pd

data = {'kills': ['Drowner', 'Drowner', 'Harpy', 'Basilisk', 'Harpy'],
        'xp': [75, 84, 141, 603, 218]}

witcher = pd.DataFrame(data, columns=['kills', 'xp'])

witcher

|   | kills | xp |
|---|---|---|
| 0 | Drowner | 75 |
| 1 | Drowner | 84 |
| 2 | Harpy | 141 |
| 3 | Basilisk | 603 |
| 4 | Harpy | 218 |


To transform the kills column into three binary variables (one for each value), we can use either of the following:

- scikit-learn: OneHotEncoder
- pandas: get_dummies()

In [2]:
# dtype=int keeps the dummies as 0/1; pandas 2.0 and later default to True/False
witcher_dummies = pd.get_dummies(witcher, dtype=int)

witcher_dummies

|   | xp | kills_Basilisk | kills_Drowner | kills_Harpy |
|---|---|---|---|---|
| 0 | 75 | 0 | 1 | 0 |
| 1 | 84 | 0 | 1 | 0 |
| 2 | 141 | 0 | 0 | 1 |
| 3 | 603 | 1 | 0 | 0 |
| 4 | 218 | 0 | 0 | 1 |


Each row has a 1 in the dummy column that matches its value of kills and a 0 in the others. One of the dummy columns is redundant, because its value can be inferred from the rest, so we can do without the first one, kills_Basilisk. If all the remaining dummy columns are 0, the dropped column would have been 1; if one of them is 1, the dropped column would have been 0.
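The redundancy can be checked directly. The sketch below rebuilds the same data and recomputes kills_Basilisk from the other two dummy columns; the variable names `dummies` and `rebuilt` are chosen here for illustration.

```python
import pandas as pd

witcher = pd.DataFrame({'kills': ['Drowner', 'Drowner', 'Harpy', 'Basilisk', 'Harpy'],
                        'xp': [75, 84, 141, 603, 218]})

# dtype=int keeps the dummies as 0/1 regardless of the pandas version.
dummies = pd.get_dummies(witcher, dtype=int)

# Rebuild kills_Basilisk from the other two dummy columns:
# it is 1 exactly when neither of them is 1.
rebuilt = 1 - (dummies['kills_Drowner'] + dummies['kills_Harpy'])

print((rebuilt == dummies['kills_Basilisk']).all())  # True
```

Because the column can be recovered exactly, dropping it loses no information.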

In [3]:
# drop_first=True removes the first dummy column; dtype=int keeps the values as 0/1
witcher_dummies_drop_first = pd.get_dummies(witcher, drop_first=True, dtype=int)

witcher_dummies_drop_first

|   | xp | kills_Drowner | kills_Harpy |
|---|---|---|---|
| 0 | 75 | 1 | 0 |
| 1 | 84 | 1 | 0 |
| 2 | 141 | 0 | 1 |
| 3 | 603 | 0 | 0 |
| 4 | 218 | 0 | 1 |
