# Module 10 - Preprocessing - Encoding Categorical Variables with SKLearn

As always, we start by importing our usual packages and changing a setting so that each cell displays the output of every expression, not just the last one:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

We're going to work on encoding a basic dataset for cars and their features:

In [2]:
loc = "https://github.com/mhall-simon/python/blob/main/data/misc/basic-categories.xlsx?raw=true"

df = pd.read_excel(loc)
df.head()

Unnamed: 0,cylinders,origin,hybrid
0,4,Americas,Yes
1,4,Americas,Yes
2,4,Asia,No
3,6,Europe,No
4,6,Europe,No


The first step is to split our columns into groups. We can do this programmatically by detecting data types, or we can define the groups manually.

I'm going to cover manual definition for now.

We can store the column names for a group as a list of strings. Here we treat every column as categorical, so one list covers them all:

In [5]:
cols = list(df.columns.values)
cols

['cylinders', 'origin', 'hybrid']
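For comparison, the programmatic route could look something like the sketch below. It uses pandas' `select_dtypes` on a small hand-built frame (`sample` is a hypothetical stand-in that mirrors the first rows shown earlier, not the actual file). Notice that `cylinders` is stored as integers, so dtype-based detection treats it as numeric even though we encode it as a category in this module. That is one reason to start with manual definition.

```python
import pandas as pd

# Hypothetical stand-in for the cars dataset (first rows only).
sample = pd.DataFrame({
    "cylinders": [4, 4, 4, 6, 6],
    "origin": ["Americas", "Americas", "Asia", "Europe", "Europe"],
    "hybrid": ["Yes", "Yes", "No", "No", "No"],
})

# Numeric columns are detected by dtype.
detected_numeric = sample.select_dtypes(include=["number"]).columns.tolist()

# Everything that is not numeric is a candidate for categorical encoding.
detected_categorical = sample.select_dtypes(exclude=["number"]).columns.tolist()

print(detected_numeric)
print(detected_categorical)
```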

And now we can use SKLearn's built-in `OneHotEncoder()` to encode the categorical and binary columns all at once.

Let's look at it with (almost) no changes from the defaults.

**NOTE: I am only setting `sparse=False` so we get back a NumPy array instead of a SciPy sparse matrix, which is easier to read for a small dataset. I normally keep the default SciPy sparse matrix to improve performance when training models. In the first linear regression example I show you how to generate a DataFrame from a sparse matrix. In scikit-learn 1.2 and later this parameter is named `sparse_output`, and the old `sparse` name was removed in 1.4.**
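To make the sparse-versus-dense difference concrete, here is a minimal sketch on a small hypothetical frame (`sample` is made up for illustration). With the default settings the encoder hands back a SciPy sparse matrix, which can be converted with `.toarray()` when you want to inspect the values.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data, not the actual cars file.
sample = pd.DataFrame({
    "origin": ["Americas", "Asia", "Europe", "Asia"],
    "hybrid": ["Yes", "No", "No", "No"],
})

# With default settings the encoder returns a SciPy sparse matrix.
sparse_result = OneHotEncoder().fit_transform(sample)
is_sparse = sparse.issparse(sparse_result)

# Convert to a dense NumPy array only when you need to look at the values.
dense_result = sparse_result.toarray()

print(is_sparse)
print(dense_result.shape)
```

The sparse form stores only the nonzero entries, which matters once a feature has many categories and most of each row is zeros.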

In [12]:
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(sparse=False)

And now we can pass the DataFrame to the encoder to see what it produces:

In [13]:
encoder.fit_transform(df)

array([[1., 0., 0., 1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0., 0., 0., 1.],
       [1., 0., 0., 0., 1., 0., 1., 0.],
       [0., 1., 0., 0., 0., 1., 1., 0.],
       [0., 1., 0., 0., 0., 1., 1., 0.],
       [0., 1., 0., 0., 0., 1., 1., 0.],
       [0., 1., 0., 0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 1., 0., 1., 0.],
       [0., 0., 1., 0., 1., 0., 1., 0.],
       [0., 0., 1., 1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0., 0., 1., 0.]])

All of our categorical features are now an array of zeros and ones! This is good; however, we can't tell which column is which.

We can get the feature names by calling the `.get_feature_names()` method on the fitted encoder. Let's try it out:

In [14]:
encoder.get_feature_names(input_features=cols)

array(['cylinders_4', 'cylinders_6', 'cylinders_8', 'origin_Americas',
       'origin_Asia', 'origin_Europe', 'hybrid_No', 'hybrid_Yes'],
      dtype=object)
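Depending on the installed scikit-learn version, the method name differs: `get_feature_names_out()` was added in 1.0, and the older `get_feature_names()` was removed in 1.2. The sketch below tolerates both. The helper `feature_names` and the `sample` frame are hypothetical names introduced only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data, not the actual cars file.
sample = pd.DataFrame({
    "cylinders": [4, 6, 8],
    "hybrid": ["Yes", "No", "No"],
})

def feature_names(fitted_encoder, input_features):
    """Return encoded column names on both old and new scikit-learn releases."""
    if hasattr(fitted_encoder, "get_feature_names_out"):
        raw = fitted_encoder.get_feature_names_out(input_features)
    else:
        raw = fitted_encoder.get_feature_names(input_features)
    # Convert NumPy strings to plain Python strings.
    return [str(name) for name in raw]

encoder = OneHotEncoder().fit(sample)
names = feature_names(encoder, list(sample.columns))
print(names)
```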

Let's put this all together in a DataFrame so it's easier to read:

In [15]:
transformed = pd.DataFrame(encoder.fit_transform(df), columns=encoder.get_feature_names(input_features=cols))
transformed

Unnamed: 0,cylinders_4,cylinders_6,cylinders_8,origin_Americas,origin_Asia,origin_Europe,hybrid_No,hybrid_Yes
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
2,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
4,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
5,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
6,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0
7,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
8,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
9,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0


Isn't this a lot better than Excel? With a complex dataset, encoding by hand would be error-prone, and we would have to define a lot of column names manually!

Let's now modify our encoder a bit to see how some changes work.

Our first change is setting `drop='first'`, which drops the first category of each categorical feature.

In [16]:
encoder = OneHotEncoder(drop='first', sparse=False)

Let's see how our dataset looks now:

In [17]:
transformed = pd.DataFrame(encoder.fit_transform(df), columns=encoder.get_feature_names(input_features=cols))
transformed

Unnamed: 0,cylinders_6,cylinders_8,origin_Asia,origin_Europe,hybrid_Yes
0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,1.0
2,0.0,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0
5,1.0,0.0,0.0,1.0,0.0
6,1.0,0.0,1.0,0.0,0.0
7,0.0,1.0,1.0,0.0,0.0
8,0.0,1.0,1.0,0.0,0.0
9,0.0,1.0,0.0,0.0,0.0


`drop='first'` removes the collinear column from each feature: the dropped category is implied whenever all of that feature's remaining columns are zero!
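To see why one column per feature is redundant, here is a small NumPy check (a sketch with made-up rows, not part of the original notebook). The three `origin` indicator columns always sum to 1, so together with an intercept column they are linearly dependent. Dropping one indicator restores full column rank.

```python
import numpy as np

# One-hot columns for origin in the order Americas, Asia, Europe (made-up rows).
full = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
])
intercept = np.ones((full.shape[0], 1))

# All three indicators plus an intercept: 4 columns, but the rank is lower.
with_all = np.hstack([intercept, full])
rank_all = np.linalg.matrix_rank(with_all)

# Drop the first indicator (Americas): the remaining columns are independent.
with_drop = np.hstack([intercept, full[:, 1:]])
rank_drop = np.linalg.matrix_rank(with_drop)

print(with_all.shape[1], rank_all)
print(with_drop.shape[1], rank_drop)
```

This matters mostly for models that fit an intercept and are sensitive to collinearity, such as unregularized linear regression.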

Depending upon your dataset and algorithm, you may only want to drop a column for binary features (yes/no).

For that case we can set `drop='if_binary'`:

In [20]:
encoder = OneHotEncoder(drop='if_binary', sparse=False)

In [21]:
transformed = pd.DataFrame(encoder.fit_transform(df), columns=encoder.get_feature_names(input_features=cols))
transformed

Unnamed: 0,cylinders_4,cylinders_6,cylinders_8,origin_Americas,origin_Asia,origin_Europe,hybrid_Yes
0,1.0,0.0,0.0,1.0,0.0,0.0,1.0
1,1.0,0.0,0.0,1.0,0.0,0.0,1.0
2,1.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,0.0,1.0,0.0,0.0,0.0,1.0,0.0
5,0.0,1.0,0.0,0.0,0.0,1.0,0.0
6,0.0,1.0,0.0,0.0,1.0,0.0,0.0
7,0.0,0.0,1.0,0.0,1.0,0.0,0.0
8,0.0,0.0,1.0,0.0,1.0,0.0,0.0
9,0.0,0.0,1.0,1.0,0.0,0.0,0.0


We're now only dropping one column from the binary feature, as its two columns would be perfectly collinear! *(If we know the value is yes, the other is certainly no.)*
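One way to convince yourself that no information is lost is to round-trip the data with `inverse_transform`. This is a minimal sketch on a hypothetical `sample` frame, using the default sparse output and converting it with `.toarray()`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in data, not the actual cars file.
sample = pd.DataFrame({
    "origin": ["Americas", "Asia", "Europe", "Asia"],
    "hybrid": ["Yes", "No", "No", "Yes"],
})

encoder = OneHotEncoder(drop="if_binary")
encoded = encoder.fit_transform(sample).toarray()  # default output is sparse

# origin keeps all 3 columns; hybrid collapses to a single 0/1 column.
n_columns = encoded.shape[1]

# inverse_transform rebuilds the original labels, including the dropped "No".
recovered = encoder.inverse_transform(encoded).tolist()

print(n_columns)
print(recovered)
```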

That's it for a quick introduction to encoding categorical features! We will expand on this and bring it into the machine learning pipeline later!