# OneHot Encoding Revisited
## Dr Jose Albornoz
### February 2021

This notebook explores variations on the theme of one-hot encoding.

In [1]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# 1.- Naive use of OneHot encoding

In [2]:
enc = OneHotEncoder(handle_unknown='ignore')

In [3]:
X = [['Male', 1], ['Female', 3], ['Female', 2]]

In [4]:
enc.fit(X)

OneHotEncoder(handle_unknown='ignore')

In [5]:
enc.categories_

[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]

In [6]:
tr = enc.transform(X).toarray()

In [7]:
tr

array([[0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1.],
       [1., 0., 0., 1., 0.]])

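As seen above:
* the encoder accepts any array-like input of shape (n_samples, n_features); here a nested list was passed, so a numpy array is not required
* the dummy columns are created in the sorted order of the categories (Female first, Male second)
* the numerical feature was also encoded (one dummy column for each of the values 1, 2 and 3); we don't want this behaviour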
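One way to avoid encoding the numerical feature with the bare encoder is to fit it on the categorical column only and stack the numerical column back on afterwards. The following is a minimal sketch of that idea; the names `X_arr`, `gender_dummies`, `numeric` and `encoded` are introduced here for illustration and are not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]

# Keep the mixed types intact by storing the data as an object array.
X_arr = np.array(X, dtype=object)

# Fit the encoder on the first (categorical) column only.
enc = OneHotEncoder(handle_unknown='ignore')
gender_dummies = enc.fit_transform(X_arr[:, [0]]).toarray()

# Leave the second (numerical) column untouched and append it.
numeric = X_arr[:, [1]].astype(float)
encoded = np.hstack([gender_dummies, numeric])

print(enc.categories_)
print(encoded)
```

This manual slicing works but quickly becomes tedious with several columns, which motivates the approach in the next section.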

# 2.- A better alternative to OneHot encoding

In [8]:
df1 = pd.DataFrame([
['green', 'M', 10.1, 'class1'],
['red', 'L', 13.5, 'class2'],
['blue', 'XL', 15.3, 'class1']])

In [9]:
df1.columns = ['color', 'size', 'price', 'classlabel']

In [10]:
model_OHE1 = ColumnTransformer(
    [('OHE', OneHotEncoder(),['color','size'])],
    remainder = 'passthrough')

In [11]:
model_OHE1.fit(df1)

ColumnTransformer(remainder='passthrough',
                  transformers=[('OHE', OneHotEncoder(), ['color', 'size'])])

In [12]:
model_OHE1.transform(df1)

array([[0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 10.1, 'class1'],
       [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 13.5, 'class2'],
       [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 15.3, 'class1']], dtype=object)

This is much more convenient:
* we can pass a dataframe to the ColumnTransformer
* we can specify which columns will be encoded
* the dummy columns are created in the sorted order of the categories (blue, green, red for color; L, M, XL for size)
* with remainder = 'passthrough', the columns that are not encoded (e.g. numerical features) are left unchanged and appended after the encoded features
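The transformed array has no column labels, so it can help to rebuild them from the fitted encoder. The sketch below is one possible way to do this using the `categories_` attribute of the fitted `OneHotEncoder`; the `<column>_<category>` naming scheme and the variables `encoded_cols`, `dummy_names`, `passthrough_names` and `labelled` are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df1 = pd.DataFrame(
    [['green', 'M', 10.1, 'class1'],
     ['red', 'L', 13.5, 'class2'],
     ['blue', 'XL', 15.3, 'class1']],
    columns=['color', 'size', 'price', 'classlabel'])

encoded_cols = ['color', 'size']
model = ColumnTransformer(
    [('OHE', OneHotEncoder(), encoded_cols)],
    remainder='passthrough')
values = model.fit_transform(df1)

# The fitted encoder stores one sorted array of categories per encoded column.
ohe = model.named_transformers_['OHE']
dummy_names = [f'{col}_{cat}'
               for col, cats in zip(encoded_cols, ohe.categories_)
               for cat in cats]

# Passthrough columns keep their original order and come after the dummies.
passthrough_names = [c for c in df1.columns if c not in encoded_cols]

labelled = pd.DataFrame(values, columns=dummy_names + passthrough_names)
print(labelled)
```

Recent scikit-learn releases also provide built-in helpers for retrieving output feature names; the manual approach is used here because it relies only on attributes already shown in this notebook.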

# 3.- A production situation: unexpected categories

In [13]:
df2 = pd.DataFrame([
['blue', 'M', 10.1, 'class1'],
['black', 'L', 13.5, 'class2'],
['red', 'XL', 15.3, 'class1']])

In [14]:
df2.columns = ['color', 'size', 'price', 'classlabel']

In [15]:
model_OHE1.transform(df2)

ValueError: Found unknown categories ['black'] in column 0 during transform

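In this case the transform raises an error because 'black' was not among the categories seen during fitting. However, we can tell the encoder to ignore unknown categories in production by setting handle_unknown = 'ignore':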

In [16]:
model_OHE2 = ColumnTransformer(
    [('OHE', OneHotEncoder(handle_unknown = 'ignore'),['color','size'])],
    remainder = 'passthrough')

In [17]:
model_OHE2.fit(df1)

ColumnTransformer(remainder='passthrough',
                  transformers=[('OHE', OneHotEncoder(handle_unknown='ignore'),
                                 ['color', 'size'])])

In [18]:
model_OHE2.transform(df2)

array([[1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 10.1, 'class1'],
       [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 13.5, 'class2'],
       [0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 15.3, 'class1']], dtype=object)

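In this case the second column (the one for 'green') is all zeros because 'green' does not appear in df2. The row containing the unknown category 'black' has zeros in all three color columns: the unknown category was ignored, so there are no crashes in production.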
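Silently ignoring unknown categories avoids crashes, but it may still be useful to know which values were ignored. A minimal sketch, assuming the same data as above, compares the values in the new dataframe against the categories learned during fitting; the `unknown` dictionary and the `encoded_cols` list are illustrative names, not part of the original notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

columns = ['color', 'size', 'price', 'classlabel']
df1 = pd.DataFrame(
    [['green', 'M', 10.1, 'class1'],
     ['red', 'L', 13.5, 'class2'],
     ['blue', 'XL', 15.3, 'class1']],
    columns=columns)
df2 = pd.DataFrame(
    [['blue', 'M', 10.1, 'class1'],
     ['black', 'L', 13.5, 'class2'],
     ['red', 'XL', 15.3, 'class1']],
    columns=columns)

encoded_cols = ['color', 'size']
model = ColumnTransformer(
    [('OHE', OneHotEncoder(handle_unknown='ignore'), encoded_cols)],
    remainder='passthrough')
model.fit(df1)

# Compare each encoded column of the new data with the fitted categories.
ohe = model.named_transformers_['OHE']
unknown = {col: sorted(set(df2[col]) - set(cats))
           for col, cats in zip(encoded_cols, ohe.categories_)}
print(unknown)
```

A check like this could be logged alongside the transform so that new categories are noticed and the encoder can be refitted when appropriate.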
