## One Hot Encoding

One hot encoding consists of encoding each categorical variable as a set of boolean variables (also called dummy variables), which take the value 0 or 1 to indicate whether a category is present in an observation.

For example, for the categorical variable "Gender", with labels 'female' and 'male', we can generate the boolean variable "female", which takes 1 if the person is 'female' or 0 otherwise, or we can generate the variable "male", which takes 1 if the person is 'male' and 0 otherwise.

For the categorical variable "colour" with values 'red', 'blue' and 'green', we can create 3 new variables called "red", "blue" and "green". These variables will take the value 1, if the observation is of the said colour or 0 otherwise. 
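As a quick illustration of the colour example, here is a minimal sketch on toy data (the series below is made up for illustration and is not part of the Titanic dataset). It uses `pd.get_dummies` with `dtype=int` so that the dummies show up as 0 and 1 regardless of the pandas version.

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
colours = pd.Series(["red", "blue", "green", "red"], name="colour")

# One binary column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(colours, dtype=int)

print(dummies)
```

Each row contains exactly one 1, in the column that matches the observation's colour.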


### Encoding into k-1 dummy variables

Note however, that for the variable "colour", by creating 2 binary variables, say "red" and "blue", we already encode **ALL** the information:

- if the observation is red, it will be captured by the variable "red" (red = 1, blue = 0)
- if the observation is blue, it will be captured by the variable "blue" (red = 0, blue = 1)
- if the observation is green, it will be captured by the combination of "red" and "blue" (red = 0, blue = 0)

We do not need to add a third variable "green" to capture that the observation is green.

More generally, a categorical variable should be encoded by creating k-1 binary variables, where k is the number of distinct categories. In the case of gender, k=2 (male / female), therefore we need to create only 1 (k - 1 = 1) binary variable. In the case of colour, which has 3 different categories (k=3), we need to create 2 (k - 1 = 2) binary variables to capture all the information.

One hot encoding into k-1 binary variables takes into account that we can use 1 less dimension and still represent the whole information: if the observation is 0 in all the binary variables, then it must be 1 in the final (not present) binary variable.

**When one hot encoding categorical variables, we create k - 1 binary variables**
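The sketch below checks this claim on toy data: the dropped dummy can always be rebuilt from the remaining k - 1. Note one difference from the text above: `pd.get_dummies(..., drop_first=True)` drops the first category in sorted order ("blue" here), whereas the text drops "green". The reasoning is the same whichever category is left out.

```python
import pandas as pd

# Toy data (hypothetical, for illustration only)
colours = pd.Series(["red", "blue", "green", "red"], name="colour")

full = pd.get_dummies(colours, dtype=int)                      # k = 3 columns
reduced = pd.get_dummies(colours, drop_first=True, dtype=int)  # k - 1 = 2 columns

# Name of the category that drop_first removed
dropped = full.columns.difference(reduced.columns)[0]

# The dropped dummy is 1 exactly when all the remaining dummies are 0
recovered = 1 - reduced.sum(axis=1)

print(dropped)
print(recovered.tolist())
```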


Most machine learning algorithms consider the entire data set while being fit. Therefore, encoding categorical variables into k - 1 binary variables is better, as it avoids introducing redundant information.
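One way to see the redundancy is to look at the rank of the feature matrix. The sketch below assumes a linear model with an intercept (an assumption made for illustration, the text does not name a specific model): with k dummies, the dummy columns add up to the intercept column, so one column carries no new information.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, for illustration only)
colours = pd.Series(["red", "blue", "green", "red", "green"])

def design_matrix(dummies):
    # Prepend a column of ones, as a linear model with an intercept would
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy(dtype=float)])

X_k = design_matrix(pd.get_dummies(colours))                           # 1 + 3 columns
X_k_minus_1 = design_matrix(pd.get_dummies(colours, drop_first=True))  # 1 + 2 columns

rank_k = np.linalg.matrix_rank(X_k)
rank_k_minus_1 = np.linalg.matrix_rank(X_k_minus_1)

print(X_k.shape[1], rank_k)                  # more columns than independent directions
print(X_k_minus_1.shape[1], rank_k_minus_1)  # every column is independent
```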


### Exception: One hot encoding into k dummy variables

There are a few occasions when it is better to encode variables into k dummy variables:

- when building tree based algorithms
- when doing feature selection by recursive algorithms
- when interested in determining the importance of each individual category

Tree based algorithms such as random forests, as opposed to the majority of machine learning algorithms, **do not** evaluate the entire set of features at once while being trained. They randomly extract a subset of features from the data set at each node for each tree. Therefore, if we want a tree based algorithm to consider **all** the categories, we need to encode categorical variables into **k binary variables**.

If we are planning to do feature selection by recursive elimination (or addition), or if we want to evaluate the importance of each single category of the categorical variable, then we will also need the entire set of binary variables (k) to let the machine learning model select which ones have the most predictive power.
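To make the point about category importance concrete, here is a small sketch on toy data in which the target depends only on the category "blue". It uses a single `DecisionTreeClassifier` from Scikit-learn as a stand-in for any model that reports feature importances; the data and the helper `category_importances` are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): the target is 1 only for "blue" observations
colour = pd.Series(["red", "blue", "green"] * 4, name="colour")
y = (colour == "blue").astype(int)

def category_importances(X):
    # Fit a small tree and report one importance value per dummy variable
    tree = DecisionTreeClassifier(random_state=0).fit(X, y)
    return pd.Series(tree.feature_importances_, index=X.columns)

importances_k = category_importances(pd.get_dummies(colour, dtype=int))
importances_k_minus_1 = category_importances(
    pd.get_dummies(colour, drop_first=True, dtype=int)
)

print(importances_k)          # "blue" has its own column
print(importances_k_minus_1)  # "blue" was dropped, so it has no importance of its own
```

With k dummies the model can attribute the signal directly to "blue". With k - 1 dummies the same signal is spread across "green" and "red", which makes the importance of the dropped category harder to read.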


### Advantages of one hot encoding

- Straightforward to implement
- Makes no assumption about the distribution or categories of the categorical variable
- Keeps all the information of the categorical variable
- Suitable for linear models

### Limitations

- Expands the feature space
- Does not add extra information while encoding
- Many dummy variables may be identical, introducing redundant information


### Notes

If our datasets contain a few high-cardinality variables, we will very quickly end up with datasets with thousands of columns, which may make the training of our algorithms slow and model interpretation hard.

In addition, many of these dummy variables may be similar to each other, since it is not unusual that 2 or more variables share the same combinations of 1 and 0s. Therefore one hot encoding may introduce redundant or duplicated information even if we encode into k-1.
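A possible way to spot such duplicated dummies is sketched below on toy data (the `city` and `country` columns are hypothetical). It transposes the encoded frame so that `DataFrame.duplicated` compares columns instead of rows.

```python
import pandas as pd

# Toy data (hypothetical): "Rome" always goes together with "Italy"
df = pd.DataFrame({
    "city": ["Paris", "Lyon", "Rome", "Paris"],
    "country": ["France", "France", "Italy", "France"],
})

dummies = pd.get_dummies(df, dtype=int)

# Transpose so that each dummy variable becomes a row, then flag repeats
is_duplicate = dummies.T.duplicated().to_numpy()
duplicate_columns = dummies.columns[is_duplicate].tolist()

print(duplicate_columns)
```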


## In this demo:

We will see how to perform one hot encoding with:
- pandas
- Scikit-learn
- Feature-Engine

We will also discuss the advantages and limitations of each implementation, using the Titanic dataset.

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OneHotEncoder

from feature_engine.encoding import OneHotEncoder as fe_OneHotEncoder

In [3]:
# Loading Titanic Dataset
data = pd.read_csv('../data/titanic.csv')

In [4]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
# Columns that have null values

data.columns[data.isna().any()]

Index(['Age', 'Cabin', 'Embarked'], dtype='object')

In [13]:
# Count of nulls in columns
data[data.columns[data.isna().any()]].isna().sum()

Age         177
Cabin       687
Embarked      2
dtype: int64

In [27]:
# Filtering only Object data types

data.select_dtypes(include='object').columns.to_list()

['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']

In [30]:
# Looking at count of unique values in each column
data.nunique()

PassengerId    891
Survived         2
Pclass           3
Name           891
Sex              2
Age             88
SibSp            7
Parch            7
Ticket         681
Fare           248
Cabin          147
Embarked         3
dtype: int64

In [32]:
# Keep the categorical columns we would like to encode, plus the target


data = data[['Sex','Survived','Cabin','Embarked']]
data.head()

Unnamed: 0,Sex,Survived,Cabin,Embarked
0,male,0,,S
1,female,1,C85,C
2,female,1,,S
3,female,1,C123,S
4,male,0,,S


In [34]:
# Count of values in each column

data.count()

Sex         891
Survived    891
Cabin       204
Embarked    889
dtype: int64

In [40]:
# Count of nulls in each columns

data.isna().sum()

Sex           0
Survived      0
Cabin       687
Embarked      2
dtype: int64

In [41]:
# Filtering only null columns

data.columns[data.isna().any()]

Index(['Cabin', 'Embarked'], dtype='object')

In [49]:
data.Cabin.value_counts().to_frame()

Unnamed: 0,Cabin
B96 B98,4
G6,4
C23 C25 C27,4
C22 C26,3
F33,3
...,...
E34,1
C7,1
C54,1
E36,1


In [52]:
data.Cabin[data.Cabin =='C23 C25 C27']

27     C23 C25 C27
88     C23 C25 C27
341    C23 C25 C27
438    C23 C25 C27
Name: Cabin, dtype: object

In [55]:
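# Capturing only the 1st letter of the Cabin.
# The direct assignment below (now commented out) raised a
# SettingWithCopyWarning, because data is a slice of the original DataFrame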

# data['Cabin']= data['Cabin'].str[0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data['Cabin']= data['Cabin'].str[0]


In [60]:
# Capturing only 1st letter of the Cabin 

data.loc[0:,'Cabin'] = data['Cabin'].str[0]

In [61]:
data.head()

Unnamed: 0,Sex,Survived,Cabin,Embarked
0,male,0,,S
1,female,1,C,C
2,female,1,,S
3,female,1,C,S
4,male,0,,S
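The warning seen above comes from assigning into a DataFrame that is itself a selection from another DataFrame. One common way to avoid it, sketched here on toy data (the frame `raw` is hypothetical), is to take an explicit `.copy()` of the selection before modifying it.

```python
import pandas as pd

# Toy data (hypothetical, mimicking the Cabin column)
raw = pd.DataFrame({
    "Cabin": ["C85", None, "C123", "B96 B98"],
    "Survived": [1, 0, 1, 0],
})

# An explicit copy makes it unambiguous that we are not editing `raw`
subset = raw[["Cabin", "Survived"]].copy()

# Keep only the first letter; missing cabins stay missing
subset["Cabin"] = subset["Cabin"].str[0]

print(subset)
```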


### Important: fit the encoding on the train set
Just like imputation, all methods of categorical encoding should be fitted on the training set, and then propagated to the test set.

Why?

Because these methods will "learn" patterns from the train data, and therefore you want to avoid leaking information and overfitting. But more importantly, because we don't know whether future / live data will contain all the categories present in the train data, or whether there will be more or fewer categories. We want to anticipate this uncertainty by setting up the right processes from the start: we create transformers that learn the categories from the train set, and use those learned categories to create the dummy variables in both the train and test sets.
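The idea of learning the categories on the train set and reusing them elsewhere can be sketched with plain pandas. The helper `encode` and the toy series are hypothetical, and this is only one possible approach: it pins the learned categories with `pd.CategoricalDtype`, so the output columns are the same for any input.

```python
import pandas as pd

# Toy data (hypothetical): "X" never appears in the train set
train = pd.Series(["S", "C", "Q", "S"], name="Embarked")
test = pd.Series(["C", "S", "X"], name="Embarked")

# "Fit": learn the categories from the train set only
learned_categories = sorted(train.dropna().unique())

def encode(series, categories):
    # "Transform": values outside the learned categories are set to missing,
    # then the categories are pinned so the output columns never change
    known = series.where(series.isin(categories))
    pinned = known.astype(pd.CategoricalDtype(categories=categories))
    return pd.get_dummies(pinned, prefix=series.name, dtype=int)

train_encoded = encode(train, learned_categories)
test_encoded = encode(test, learned_categories)

print(test_encoded)
```

The unseen category "X" ends up as a row of zeros, and train and test share exactly the same columns.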

In [63]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data[['Sex', 'Embarked', 'Cabin']],  # predictors
    data['Survived'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((623, 3), (268, 3))

In [64]:
X_train.head()

Unnamed: 0,Sex,Embarked,Cabin
857,male,S,E
52,female,C,D
386,male,S,
124,male,S,D
578,female,C,


In [70]:
# Let's explore the cardinality

for col in X_train.columns:
    print(col,'-',X_train[col].unique(),'-',X_train[col].nunique())

Sex - ['male' 'female'] - 2
Embarked - ['S' 'C' 'Q' nan] - 3
Cabin - ['E' 'D' nan 'B' 'C' 'A' 'F' 'G' 'T'] - 8


In [76]:
X_train.count()

Sex         623
Embarked    621
Cabin       152
dtype: int64

In [75]:
X_train.isna().sum()

Sex           0
Embarked      2
Cabin       471
dtype: int64

## One Hot Encoding With Pandas


### Advantages

- quick
- returns pandas dataframe
- returns feature names for the dummy variables

### Limitations of pandas:

- it does not preserve information from train data to propagate to test data


-----

The pandas method get_dummies() creates as many binary variables as there are categories in the variable it is given (or one fewer with drop_first=True):

If the variable colour has 3 categories in the train data, it will create 3 dummy variables. However, if the variable colour has 5 categories in the test data, it will create 5 binary variables. The train and test sets will therefore end up with a different number of features, and will be incompatible for training and scoring with Scikit-learn.

In practice, we shouldn't use get_dummies() in our machine learning pipelines. It is, however, useful for quick data exploration. Let's look at this with examples.

### Into k dummy variables

In [77]:
# Creating dummy variables with the pandas built-in method

tmp = pd.get_dummies(X_train['Sex'])
tmp.head()

Unnamed: 0,female,male
857,0,1
52,1,0
386,0,1
124,0,1
578,1,0


In [79]:
# for better visualisation let's put the dummies next
# to the original variable

pd.concat([X_train['Sex'],tmp],axis = 1)

Unnamed: 0,Sex,female,male
857,male,0,1
52,female,1,0
386,male,0,1
124,male,0,1
578,female,1,0
...,...,...,...
835,female,1,0
192,female,1,0
629,male,0,1
559,female,1,0


In [83]:
# In the similar manner we can do it for other variables or
# we can do for all categorical variables in one go

tmp = pd.get_dummies(X_train)
print(tmp.shape)
tmp.head()

(623, 13)


Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T
857,0,1,0,0,1,0,0,0,0,1,0,0,0
52,1,0,1,0,0,0,0,0,1,0,0,0,0
386,0,1,0,0,1,0,0,0,0,0,0,0,0
124,0,1,0,0,1,0,0,0,1,0,0,0,0
578,1,0,1,0,0,0,0,0,0,0,0,0,0


In [84]:
# the dummy variables themselves contain no missing values
tmp.isna().sum()

Sex_female    0
Sex_male      0
Embarked_C    0
Embarked_Q    0
Embarked_S    0
Cabin_A       0
Cabin_B       0
Cabin_C       0
Cabin_D       0
Cabin_E       0
Cabin_F       0
Cabin_G       0
Cabin_T       0
dtype: int64

In [86]:
# If a value is null, all the dummies for that column will be 0:
# the row with index label 386 has a null Cabin, hence all the Cabin dummies are 0.
# Note: we use .loc (label based) rather than .iloc (position based)
# to look up the row by its index label

tmp.loc[386]

Sex_female    0
Sex_male      1
Embarked_C    0
Embarked_Q    0
Embarked_S    1
Cabin_A       0
Cabin_B       0
Cabin_C       0
Cabin_D       0
Cabin_E       0
Cabin_F       0
Cabin_G       0
Cabin_T       0
Name: 386, dtype: uint8

In [87]:
# and now for all variables together: test set

tmp = pd.get_dummies(X_test)

print(tmp.shape)

tmp.head()

(268, 12)


Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G
495,0,1,1,0,0,0,0,0,0,0,0,0
648,0,1,0,0,1,0,0,0,0,0,0,0
278,0,1,0,1,0,0,0,0,0,0,0,0
31,1,0,1,0,0,0,1,0,0,0,0,0
255,1,0,1,0,0,0,0,0,0,0,0,0


Notice the positives of pandas get_dummies:
- dataframe returned with feature names

**And the limitations:**

The train set contains 13 dummy features, whereas the test set contains 12 features. This occurred because there was no category T in cabin in the test set.

This will cause problems when training and scoring models with Scikit-learn, because predictors require the train and test sets to have the same number of features.
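If get_dummies() has to be used anyway, one possible workaround is to align the test dummies to the train columns with `DataFrame.reindex`. This is a sketch on toy data (the frames below are hypothetical), not a replacement for a proper fitted encoder.

```python
import pandas as pd

# Toy data (hypothetical): "T" only appears in train, "Z" only in test
train = pd.DataFrame({"Cabin": ["A", "B", "T", None]})
test = pd.DataFrame({"Cabin": ["A", "B", "Z"]})

train_dummies = pd.get_dummies(train, dtype=int)

# Keep exactly the train columns: missing ones are filled with 0,
# and columns for categories unseen in train are discarded
test_dummies = pd.get_dummies(test, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(test_dummies)
```

The column list still has to be stored somewhere between training and scoring, which is exactly what a fitted encoder does for us.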

### Into k-1 dummy variables

In [89]:
# obtaining k-1 labels: we need to indicate get_dummies
# to drop the first binary variable

tmp = pd.get_dummies(X_train['Sex'], drop_first=True)

tmp.head()

Unnamed: 0,male
857,1
52,0
386,1
124,1
578,0


In [91]:
# obtaining k-1 labels: we need to indicate get_dummies
# to drop the first binary variable

tmp = pd.get_dummies(X_train['Embarked'], drop_first=True)

tmp.head()

Unnamed: 0,Q,S
857,0,1
52,0,0
386,0,1
124,0,1
578,0,0


For embarked, if an observation shows 0 for Q and S, then its value must be C, the remaining category.

Caveat: this variable has missing data. An observation with a missing value also shows 0 for Q and S, so unless we encode missing data as well, the dummy variables do not capture all the information contained in the variable.
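The ambiguity can be reproduced on a small toy series (hypothetical data): with k - 1 dummies and no missing-data indicator, a missing value and the dropped category "C" are encoded identically.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one "C" and one missing value
embarked = pd.Series(["C", "S", np.nan, "Q"], name="Embarked")

k_minus_1 = pd.get_dummies(embarked, drop_first=True, dtype=int)
with_na = pd.get_dummies(embarked, drop_first=True, dummy_na=True, dtype=int)

# Without the indicator, rows 0 ("C") and 2 (missing) look the same
print(k_minus_1.iloc[[0, 2]])

# With the indicator, the missing row is flagged in the extra column
print(with_na.iloc[[0, 2]])
```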

In [92]:
# altogether: train set

tmp = pd.get_dummies(X_train, drop_first=True)

print(tmp.shape)

tmp.head()

(623, 10)


Unnamed: 0,Sex_male,Embarked_Q,Embarked_S,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T
857,1,0,1,0,0,0,1,0,0,0
52,0,0,0,0,0,1,0,0,0,0
386,1,0,1,0,0,0,0,0,0,0
124,1,0,1,0,0,1,0,0,0,0
578,0,0,0,0,0,0,0,0,0,0


In [93]:
# altogether: test set

tmp = pd.get_dummies(X_test, drop_first=True)

print(tmp.shape)

tmp.head()

(268, 9)


Unnamed: 0,Sex_male,Embarked_Q,Embarked_S,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G
495,1,0,0,0,0,0,0,0,0
648,1,0,1,0,0,0,0,0,0
278,1,1,0,0,0,0,0,0,0
31,0,0,0,1,0,0,0,0,0
255,0,0,0,0,0,0,0,0,0


### get_dummies() can handle missing values

In [95]:
# we can add an additional dummy variable to indicate
# missing data

pd.get_dummies(X_train['Embarked'], drop_first=True, dummy_na=True).head()

Unnamed: 0,Q,S,NaN
857,0,1,0
52,0,0,0
386,0,1,0
124,0,1,0
578,0,0,0


In [100]:
X_train.Embarked[X_train['Embarked'].isna()]

61     NaN
829    NaN
Name: Embarked, dtype: object

In [108]:
emb = pd.get_dummies(X_train['Embarked'], drop_first=True, dummy_na=True)
emb.head(2)

Unnamed: 0,Q,S,NaN
857,0,1,0
52,0,0,0


In [127]:
emb.isna().sum()

Q      0
S      0
NaN    0
dtype: int64

In [128]:
embt = pd.get_dummies(X_test['Embarked'], drop_first=True, dummy_na=True)
embt.head(2)

Unnamed: 0,Q,S,NaN
495,0,0,0
648,0,1,0


In [129]:
embt.isna().sum()

Q      0
S      0
NaN    0
dtype: int64

In [134]:
pd.get_dummies(X_train, drop_first=True, dummy_na=True)

Unnamed: 0,Sex_male,Sex_nan,Embarked_Q,Embarked_S,Embarked_nan,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_T,Cabin_nan
857,1,0,0,1,0,0,0,0,1,0,0,0,0
52,0,0,0,0,0,0,0,1,0,0,0,0,0
386,1,0,0,1,0,0,0,0,0,0,0,0,1
124,1,0,0,1,0,0,0,1,0,0,0,0,0
578,0,0,0,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
835,0,0,0,0,0,0,0,0,1,0,0,0,0
192,0,0,0,1,0,0,0,0,0,0,0,0,1
629,1,0,1,0,0,0,0,0,0,0,0,0,1
559,0,0,0,1,0,0,0,0,0,0,0,0,1


## One hot encoding with Scikit-learn

### Advantages

- quick
- Creates the same number of features in train and test set

### Limitations

- by default, it returns a NumPy array instead of a pandas dataframe
- the returned array carries no variable names, which is inconvenient for variable exploration (the names have to be retrieved separately from the encoder)

In [135]:
# for one hot encoding with sklearn
from sklearn.preprocessing import OneHotEncoder

In [143]:
encoder = OneHotEncoder(categories='auto',
                        drop='first',  # to return k-1 dummies; use drop=None to return k dummies
                        sparse=False,  # return a dense array (renamed sparse_output in Scikit-learn 1.2)
                        handle_unknown='error')  # raise an error on categories not seen during fit

encoder.fit(X_train.fillna('Missing'))

OneHotEncoder(drop='first', sparse=False)
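The `handle_unknown` parameter decides what happens when transform() meets a category that was not present during fit(). The sketch below contrasts `'error'` with `'ignore'` on toy data (the arrays are hypothetical). To stay compatible across Scikit-learn versions, it keeps the default sparse output and converts it with `.toarray()`, and it does not combine `'ignore'` with `drop`.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): "T" is not present in the train data
train = np.array([["A"], ["B"], ["A"]], dtype=object)
unseen = np.array([["T"]], dtype=object)

# handle_unknown='error': transform fails on unseen categories
strict = OneHotEncoder(handle_unknown="error").fit(train)
try:
    strict.transform(unseen)
    raised = False
except ValueError:
    raised = True

# handle_unknown='ignore': unseen categories are encoded as all zeros
lenient = OneHotEncoder(handle_unknown="ignore").fit(train)
encoded = lenient.transform(unseen).toarray()

print(raised)
print(encoded)
```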

In [144]:
# Looking at the Learned Categories 
encoder.categories_

[array(['female', 'male'], dtype=object),
 array(['C', 'Missing', 'Q', 'S'], dtype=object),
 array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'Missing', 'T'], dtype=object)]

In [145]:
# transform the train set

tmp = encoder.transform(X_train.fillna('Missing'))

pd.DataFrame(tmp).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [148]:
# in Scikit-learn 1.0 and later
# we can retrieve the feature names as follows:

encoder.get_feature_names_out()

array(['Sex_male', 'Embarked_Missing', 'Embarked_Q', 'Embarked_S',
       'Cabin_B', 'Cabin_C', 'Cabin_D', 'Cabin_E', 'Cabin_F', 'Cabin_G',
       'Cabin_Missing', 'Cabin_T'], dtype=object)

In [151]:
# we can go ahead and transform the test set
# and then reconstitute it back to a pandas dataframe
# and add the feature names derived by OHE

tmp = encoder.transform(X_test.fillna('Missing'))

tmp = pd.DataFrame(tmp)
tmp.columns = encoder.get_feature_names_out()

tmp.head()

Unnamed: 0,Sex_male,Embarked_Missing,Embarked_Q,Embarked_S,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G,Cabin_Missing,Cabin_T
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
