# (Robust) One Hot Encoding

One-hot encoding is a common technique for working with categorical features. There are multiple tools available to facilitate this preprocessing step in `Python`, but it usually becomes much harder when you need your preprocessing code to work on new data that might have missing or additional values. That's the case if you want to deploy a model to production, for instance: sometimes you don't know what new values will appear in the data you receive.

In this tutorial I will present two ways of dealing with this problem. Each time, we will run one-hot encoding on our training set first and save a few attributes that we can reuse later on when we need to process new data. If you deploy a model to production, the best way of saving those values is to write your own class and define them as attributes that are set at training time, as an internal state.

If you're working in a notebook, it's fine to save them as simple variables.

# Let's create a dataset

Let's make up a dataset containing journeys that happened in different cities in the UK, using different modes of transport.

We'll create a new `DataFrame` that contains two categorical features, `city` and `transport` as well as a numerical feature `duration` for the duration of the journey in minutes.

In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame([["London", "car", 20],
                   ["Cambridge", "car", 10], 
                   ["Liverpool", "bus", 30]], 
                  columns=["city", "transport", "duration"])

In [3]:
df

Unnamed: 0,city,transport,duration
0,London,car,20
1,Cambridge,car,10
2,Liverpool,bus,30


Now let's create our "unseen" test data. To make it difficult, we will simulate the case where the test data has different values for the categorical features.

In [4]:
df_test = pd.DataFrame([["Manchester", "bike", 30], 
                        ["Cambridge", "car", 40], 
                        ["Liverpool", "bike", 10]], 
                       columns=["city", "transport", "duration"])

In [5]:
df_test

Unnamed: 0,city,transport,duration
0,Manchester,bike,30
1,Cambridge,car,40
2,Liverpool,bike,10


Here our column `city` does not have the value `London` but has a new value `Manchester`. Our column `transport` has no value `bus` but has the new value `bike`. Let's see how we can build one-hot encoded features for those datasets!
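To see exactly which values differ between the two datasets, a quick set comparison can help. This is a minimal sketch that rebuilds the two DataFrames from above so it runs on its own; the `diffs` dictionary is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([["London", "car", 20],
                   ["Cambridge", "car", 10],
                   ["Liverpool", "bus", 30]],
                  columns=["city", "transport", "duration"])
df_test = pd.DataFrame([["Manchester", "bike", 30],
                        ["Cambridge", "car", 40],
                        ["Liverpool", "bike", 10]],
                       columns=["city", "transport", "duration"])

diffs = {}
for col in ["city", "transport"]:
    train_values = set(df[col])
    test_values = set(df_test[col])
    unseen = sorted(test_values - train_values)   # only in the test data
    absent = sorted(train_values - test_values)   # only in the training data
    diffs[col] = (unseen, absent)
    print("{}: unseen in training = {}, absent from test = {}".format(col, unseen, absent))
```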

We'll show two different methods, one using the `get_dummies` function from `pandas`, and the other with the `OneHotEncoder` class from `sklearn`.

## Using pandas' `get_dummies`

### Process our training data

First we define the list of categorical features that we will want to process:

In [6]:
cat_columns = ["city", "transport"]

We can really quickly build dummy features with pandas by calling the `get_dummies` function. Let's create a new `DataFrame` for our processed data:

In [7]:
df_processed = pd.get_dummies(df, prefix_sep="__", columns=cat_columns)

In [8]:
df_processed

Unnamed: 0,duration,city__Cambridge,city__Liverpool,city__London,transport__bus,transport__car
0,20,0,0,1,0,1
1,10,1,0,0,0,1
2,30,0,1,0,1,0


That's it for the training set: you now have a `DataFrame` with one-hot encoded features. We will need to save a few things into variables to make sure that we build the exact same columns on the test dataset.

See how pandas created new columns with the following format: `<column>__<value>`. Let's build a list of those new columns and store it in a new variable `cat_dummies`.

In [9]:
cat_dummies = [col for col in df_processed if "__" in col and col.split("__")[0] in cat_columns]

In [10]:
cat_dummies

['city__Cambridge',
 'city__Liverpool',
 'city__London',
 'transport__bus',
 'transport__car']

Let's also save the full list of columns so we can enforce the column order later on:

In [11]:
processed_columns = list(df_processed.columns[:])

In [12]:
processed_columns

['duration',
 'city__Cambridge',
 'city__Liverpool',
 'city__London',
 'transport__bus',
 'transport__car']

### Process our unseen (test) data!

Now let's see how to ensure our test data has the same columns. First, let's call `get_dummies` on it:

In [13]:
df_test_processed = pd.get_dummies(df_test, prefix_sep="__", columns=cat_columns)

Let's look at our new dataset:

In [14]:
df_test_processed

Unnamed: 0,duration,city__Cambridge,city__Liverpool,city__Manchester,transport__bike,transport__car
0,30,0,0,1,1,0
1,40,1,0,0,0,1
2,10,0,1,0,1,0


As expected we have new columns (`city__Manchester`) and missing ones (`transport__bus`). But we can easily clean it up!

In [15]:
# Remove additional columns
for col in list(df_test_processed.columns):  # iterate over a copy since we drop columns inside the loop
    if ("__" in col) and (col.split("__")[0] in cat_columns) and col not in cat_dummies:
        print("Removing additional feature {}".format(col))
        df_test_processed.drop(col, axis=1, inplace=True)

Removing additional feature city__Manchester
Removing additional feature transport__bike


Now we need to add the missing columns. We can set all missing columns to a vector of 0s since those values did not appear in the test data.

In [16]:
for col in cat_dummies:
    if col not in df_test_processed.columns:
        print("Adding missing feature {}".format(col))
        df_test_processed[col] = 0

Adding missing feature city__London
Adding missing feature transport__bus


In [17]:
df_test_processed

Unnamed: 0,duration,city__Cambridge,city__Liverpool,transport__car,city__London,transport__bus
0,30,0,0,0,0,0
1,40,1,0,1,0,0
2,10,0,1,0,0,0


That's it, we now have the same features. Note that the order of the columns isn't preserved, though. If you need to reorder the columns, reuse the list of processed columns we saved earlier:

In [18]:
df_test_processed = df_test_processed[processed_columns]

In [19]:
df_test_processed

Unnamed: 0,duration,city__Cambridge,city__Liverpool,city__London,transport__bus,transport__car
0,30,0,0,0,0,0
1,40,1,0,0,0,1
2,10,0,1,0,0,0
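
The three steps above (dropping additional columns, adding missing ones, reordering) can also be expressed with a single call to the pandas `reindex` method, and the saved column list can live inside a small class as internal state. The following is only a sketch under a few assumptions: the class name `DummyEncoder` and its attribute `columns_` are hypothetical, and `dtype=int` is passed to `get_dummies` so that the dummies are 0/1 integers regardless of the pandas version.

```python
import pandas as pd


class DummyEncoder:
    """Hypothetical helper: one-hot encodes with pandas and remembers the training columns."""

    def __init__(self, cat_columns, prefix_sep="__"):
        self.cat_columns = cat_columns
        self.prefix_sep = prefix_sep
        self.columns_ = None  # set at training time

    def fit_transform(self, df):
        processed = pd.get_dummies(df, prefix_sep=self.prefix_sep,
                                   columns=self.cat_columns, dtype=int)
        self.columns_ = list(processed.columns)
        return processed

    def transform(self, df):
        if self.columns_ is None:
            raise RuntimeError("call fit_transform on the training data first")
        processed = pd.get_dummies(df, prefix_sep=self.prefix_sep,
                                   columns=self.cat_columns, dtype=int)
        # Drops columns unseen at training time, adds missing ones filled
        # with 0, and enforces the training column order in one step.
        return processed.reindex(columns=self.columns_, fill_value=0)


df = pd.DataFrame([["London", "car", 20],
                   ["Cambridge", "car", 10],
                   ["Liverpool", "bus", 30]],
                  columns=["city", "transport", "duration"])
df_test = pd.DataFrame([["Manchester", "bike", 30],
                        ["Cambridge", "car", 40],
                        ["Liverpool", "bike", 10]],
                       columns=["city", "transport", "duration"])

encoder = DummyEncoder(cat_columns=["city", "transport"])
train_encoded = encoder.fit_transform(df)
test_encoded = encoder.transform(df_test)
print(test_encoded)
```

Note that `reindex` drops every column that is not in the saved list, so this sketch assumes the non-categorical columns of the new data match those of the training data.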


All good! Now let's see how to do the same with sklearn and the `OneHotEncoder`.

## Using sklearn's `OneHotEncoder` and `LabelEncoder`

### Process our training data

Let's start by importing what we need: the `OneHotEncoder` to build one-hot features, and the `LabelEncoder` to transform strings into integer labels (a step required before using the `OneHotEncoder` in scikit-learn versions older than 0.20, which only accepted integer input).

In [20]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

We're starting again from our initial dataframe and our list of categorical features.

In [21]:
cat_columns = ["city", "transport"]
df

Unnamed: 0,city,transport,duration
0,London,car,20
1,Cambridge,car,10
2,Liverpool,bus,30


First let's create our `df_processed` DataFrame. We can take all the non-categorical features to start with:

In [22]:
df_processed = df[[col for col in df.columns if col not in cat_columns]].copy()  # copy so we can add columns safely

In [23]:
df_processed

Unnamed: 0,duration
0,20
1,10
2,30


Now we need to encode every categorical feature separately, meaning we need as many encoders as categorical features. Let's loop over all categorical features and build a dictionary that will map a feature to its encoder:

In [24]:
# For each categorical column
# We fit a label encoder, transform our column and 
# add it to our new dataframe
label_encoders = {}
for col in cat_columns:
    print("Encoding {}".format(col))
    new_le = LabelEncoder()
    df_processed[col] = new_le.fit_transform(df[col])
    label_encoders[col] = new_le

Encoding city
Encoding transport




In [25]:
df_processed

Unnamed: 0,duration,city,transport
0,20,2,1
1,10,0,1
2,30,1,0


Now that we have proper integer labels, we need to one hot encode our categorical features.

Unfortunately, the one hot encoder does not support passing the list of categorical features by their names but only by their indexes, so let's get a new list, now with indexes. We can use the `get_loc` method to get the index of each of our categorical columns:

In [26]:
cat_columns_idx = [df_processed.columns.get_loc(col) for col in cat_columns]

We'll need to specify `handle_unknown` as `ignore` so the `OneHotEncoder` can work later on with our unseen data. Note that the `categorical_features` argument used below only exists in older scikit-learn releases (it was removed in version 0.22).
The `OneHotEncoder` will build a numpy array for our data, replacing our original features with their one-hot encoded versions. Unfortunately it can be hard to rebuild the DataFrame with nice labels, but most algorithms work with numpy arrays, so we can stop there.

In [27]:
ohe = OneHotEncoder(categorical_features=cat_columns_idx, sparse=False, handle_unknown="ignore")

In [28]:
df_processed_np = ohe.fit_transform(df_processed)

In [29]:
df_processed_np

array([[ 0.,  0.,  1.,  0.,  1., 20.],
       [ 1.,  0.,  0.,  0.,  1., 10.],
       [ 0.,  1.,  0.,  1.,  0., 30.]])
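
If labelled columns are needed after all, one possible way to rebuild them is sketched below. It assumes the layout visible in the array above: the one-hot columns come first, one per class in the order of each label encoder's `classes_`, followed by the untouched features. The values are hard-coded copies of that output so the example runs on its own; `classes`, `other_columns` and `df_rebuilt` are illustrative names.

```python
import numpy as np
import pandas as pd

# In the notebook these would come from label_encoders[col].classes_
# and df_processed_np; here they are copied by hand.
classes = {"city": ["Cambridge", "Liverpool", "London"],
           "transport": ["bus", "car"]}
other_columns = ["duration"]
encoded = np.array([[0., 0., 1., 0., 1., 20.],
                    [1., 0., 0., 0., 1., 10.],
                    [0., 1., 0., 1., 0., 30.]])

# One name per (column, class) pair, then the remaining features
column_names = ["{}__{}".format(col, value)
                for col, values in classes.items()
                for value in values]
column_names += other_columns

df_rebuilt = pd.DataFrame(encoded, columns=column_names)
print(df_rebuilt)
```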

### Process our unseen (test) data

Now we need to apply the same steps on our test data; first create a new dataframe with our non-categorical features:

In [30]:
df_test_processed = df_test[[col for col in df_test.columns if col not in cat_columns]].copy()  # copy so we can add columns safely

In [31]:
df_test_processed

Unnamed: 0,duration
0,30
1,40
2,10


Now we need to reuse our `LabelEncoder`s to assign the same integer to the same values. Unfortunately, since our test dataset contains new, unseen values, we cannot use `transform`: it raises an error on labels it was not fitted on. Instead we will build a dictionary from the `classes_` attribute of each label encoder, where the position of a value in `classes_` is its integer label. If we then use `map` on our pandas `Series`, it sets the unseen values to `NaN` and converts the type to float.

Here we will add a new step that fills the `NaN` with a huge integer, say 9999, and converts the column to `int`.

In [32]:
for col in cat_columns:
    print("Encoding {}".format(col))
    label_map = {val: label for label, val in enumerate(label_encoders[col].classes_)}
    print(label_map)
    df_test_processed[col] = df_test[col].map(label_map)
    # fillna and convert to int
    df_test_processed[col] = df_test_processed[col].fillna(9999).astype(int)

Encoding city
{'Cambridge': 0, 'Liverpool': 1, 'London': 2}
Encoding transport
{'bus': 0, 'car': 1}




In [33]:
df_test_processed

Unnamed: 0,duration,city,transport
0,30,9999,9999
1,40,0,1
2,10,1,9999
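
The mapping step can be isolated in a small function to make its behaviour easier to check. This is a minimal sketch: the function name `encode_with_unknown` and its `unknown_label` parameter are hypothetical, and the list of classes is copied by hand from the label map printed above.

```python
import pandas as pd


def encode_with_unknown(series, classes, unknown_label=9999):
    """Map values to their position in `classes`; unseen values get `unknown_label`."""
    label_map = {value: label for label, value in enumerate(classes)}
    # map() yields NaN (and a float dtype) for values missing from the dict
    return series.map(label_map).fillna(unknown_label).astype(int)


cities = pd.Series(["Manchester", "Cambridge", "Liverpool"])
encoded_cities = encode_with_unknown(cities, ["Cambridge", "Liverpool", "London"])
print(encoded_cities.tolist())  # [9999, 0, 1]
```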


Looks good, now we can finally apply our fitted `OneHotEncoder` "out-of-the-box" by using the transform method:

In [34]:
df_test_processed_np = ohe.transform(df_test_processed)

In [35]:
df_test_processed_np

array([[ 0.,  0.,  0.,  0.,  0., 30.],
       [ 1.,  0.,  0.,  0.,  1., 40.],
       [ 0.,  1.,  0.,  0.,  0., 10.]])

Double check that it has the same columns as the `pandas` version!
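
As a point of comparison, more recent scikit-learn releases (0.20 and later, which is an assumption about your environment) let the `OneHotEncoder` work on string columns directly, so the `LabelEncoder` step and the 9999 placeholder are not needed there. The sketch below encodes only the categorical columns and calls `toarray()` on the default sparse output, which avoids depending on the name of the dense-output argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame([["London", "car", 20],
                   ["Cambridge", "car", 10],
                   ["Liverpool", "bus", 30]],
                  columns=["city", "transport", "duration"])
df_test = pd.DataFrame([["Manchester", "bike", 30],
                        ["Cambridge", "car", 40],
                        ["Liverpool", "bike", 10]],
                       columns=["city", "transport", "duration"])
cat_columns = ["city", "transport"]

# Unknown categories at transform time are encoded as all zeros
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(df[cat_columns])

test_onehot = ohe.transform(df_test[cat_columns]).toarray()
categories = [list(values) for values in ohe.categories_]
print(categories)
print(test_onehot)
```

The numerical `duration` column is left out of the encoder here and would have to be concatenated back separately.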