#### One Hot encoding 

One hot encoding consists of encoding each categorical variable with a set of boolean variables (also called dummy variables) that take the value 0 or 1, indicating whether a category is present in an observation.

For example, for the categorical variable "Gender", with labels 'female' and 'male', we can generate the boolean variable "female", which takes 1 if the person is 'female' and 0 otherwise, or we can generate the variable "male", which takes 1 if the person is 'male' and 0 otherwise.

For the categorical variable "colour" with values 'red', 'blue' and 'green', we can create 3 new variables called "red", "blue" and "green". Each of these variables takes the value 1 if the observation is of that colour and 0 otherwise.

**Encoding into k-1 dummy variables**

Note however, that for the variable "colour", by creating 2 binary variables, say "red" and "blue", we already encode ALL the information:

- if the observation is red, it will be captured by the variable "red" (red = 1, blue = 0)
- if the observation is blue, it will be captured by the variable "blue" (red = 0, blue = 1)
- if the observation is green, it will be captured by the combination of "red" and "blue" (red = 0, blue = 0)
- We do not need to add a third variable "green" to capture that the observation is green.

More generally, a categorical variable should be encoded by creating k-1 binary variables, where k is the number of distinct categories. In the case of gender, k=2 (male / female), therefore we need to create only 1 (k - 1 = 1) binary variable. In the case of colour, which has 3 different categories (k=3), we need to create 2 (k - 1 = 2) binary variables to capture all the information.

One hot encoding into k-1 binary variables takes into account that we can use 1 less dimension and still represent the whole information: if the observation is 0 in all the binary variables, then it must be 1 in the final (not present) binary variable.
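As a minimal sketch of the colour example above (the toy series and variable names are made up for illustration), we can check that the dropped "green" column is fully recoverable from "red" and "blue":

```python
import pandas as pd

# toy data, invented for this illustration
colour = pd.Series(["red", "blue", "green", "red"], name="colour")

# k dummies: one column per category
k_dummies = pd.get_dummies(colour, dtype=int)

# k - 1 dummies: keep only "red" and "blue"
k_minus_1 = k_dummies[["red", "blue"]]

# an observation that is 0 in both remaining columns must be green
recovered_green = 1 - k_minus_1.sum(axis=1)

print(k_minus_1.assign(green_recovered=recovered_green))
```

The recovered column matches the "green" dummy that we left out, so no information is lost by encoding into k - 1 variables.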

**When one hot encoding categorical variables, we create k - 1 binary variables**

Most machine learning algorithms consider the entire data set while being fit. Therefore, encoding categorical variables into k - 1 binary variables is preferable, as it avoids introducing redundant information.
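One way to see the redundancy, sketched here under the assumption that the model includes an intercept column (as most linear models do): with k dummies, the dummy columns add up to the intercept, so the design matrix gains a column without gaining rank. The toy data is invented for illustration.

```python
import numpy as np
import pandas as pd

# toy data, invented for this illustration
colour = pd.Series(["red", "blue", "green", "red", "green"])
intercept = np.ones((len(colour), 1))

# design matrix with an intercept and k dummies
full = np.hstack(
    [intercept, pd.get_dummies(colour, dtype=float).to_numpy()])

# design matrix with an intercept and k - 1 dummies
reduced = np.hstack(
    [intercept, pd.get_dummies(colour, drop_first=True, dtype=float).to_numpy()])

rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)

print(f"k dummies:   {full.shape[1]} columns, rank {rank_full}")
print(f"k-1 dummies: {reduced.shape[1]} columns, rank {rank_reduced}")
```

Both matrices have the same rank, so the extra column in the k encoding carries no additional information for such a model.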

**Exception: One hot encoding into k dummy variables**

There are a few occasions when it is better to encode variables into k dummy variables:

- when building tree based algorithms
- when doing feature selection by recursive algorithms
- when we want to determine the importance of each individual category

Tree-based algorithms, as opposed to the majority of machine learning algorithms, do not evaluate the entire dataset while being trained. Many of them (random forests, for example) randomly extract a subset of features from the data set at each node of each tree. Therefore, if we want a tree-based algorithm to consider all the categories, we need to encode categorical variables into k binary variables.

If we are planning to do feature selection by recursive elimination (or addition), or if we want to evaluate the importance of each single category of the categorical variable, then we will also need the entire set of binary variables (k) to let the machine learning model select which ones have the most predictive power.

**Pros**

- Straightforward to implement
- Makes no assumption about the distribution or categories of the categorical variable
- Keeps all the information of the categorical variable
- Suitable for linear models


**Cons**

- Expands the feature space
- Does not add extra information while encoding
- Many dummy variables may be identical, introducing redundant information

**Notes**

If our datasets contain a few highly cardinal variables, we will soon end up with datasets with thousands of columns, which may make training our algorithms slow and model interpretation hard.

In addition, many of these dummy variables may be similar to each other, since it is not unusual that 2 or more variables share the same combinations of 1 and 0s. Therefore one hot encoding may introduce redundant or duplicated information even if we encode into k-1.
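A small sketch of how identical dummy variables can arise, using an invented toy dataset in which two categorical variables are perfectly aligned:

```python
import pandas as pd

# toy data, invented for this illustration: city fully determines country
df = pd.DataFrame({
    "city": ["Paris", "Paris", "Rome", "Rome"],
    "country": ["France", "France", "Italy", "Italy"],
})

# encode into k - 1 dummies per variable
dummies = pd.get_dummies(df, drop_first=True, dtype=int)

# transpose so that duplicated() compares columns instead of rows
duplicated_cols = dummies.columns[dummies.T.duplicated()].tolist()

print(dummies)
print("Duplicated columns:", duplicated_cols)
```

Here `city_Rome` and `country_Italy` share exactly the same pattern of 1s and 0s, so one of them is redundant even after encoding into k - 1.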

Dataset: Titanic 


In [37]:
import pandas as pd
import numpy as np
# to split the datasets
from sklearn.model_selection import train_test_split

# for one hot encoding with sklearn
from sklearn.preprocessing import OneHotEncoder

In [3]:
use_columns = ['sex', 'embarked', 'cabin', 'survived']

data = pd.read_csv('../datasets/titanic.csv', usecols=use_columns)
data.head()


Unnamed: 0,survived,sex,cabin,embarked
0,1,female,B5,S
1,1,male,C22,S
2,0,female,C22,S
3,0,male,C22,S
4,0,female,C22,S


In [4]:
# let's capture only the first letter of the 
# cabin for this demonstration

data['cabin'] = data['cabin'].str[0]

data.head()

Unnamed: 0,survived,sex,cabin,embarked
0,1,female,B,S
1,1,male,C,S
2,0,female,C,S
3,0,male,C,S
4,0,female,C,S


**Important**

Just like imputation, all methods of categorical encoding should be performed over the training set, and then propagated to the test set.

*Why?*

Because these methods will "learn" patterns from the train data, and therefore you want to avoid leaking information and overfitting. But more importantly, because we don't know whether future / live data will contain all the categories present in the train data, or whether there will be more or fewer categories. Therefore, we want to anticipate this uncertainty by setting up the right processes from the start. We want to create transformers that learn the categories from the train set, and use those learned categories to create the dummy variables in both train and test sets.
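The idea of learning categories on the train set and reusing them elsewhere can be sketched as follows. The toy frames are invented, and `handle_unknown="ignore"` is just one possible policy for categories that were not seen during fit (it encodes them as all zeros):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# toy data, invented for this illustration
train = pd.DataFrame({"colour": ["red", "blue", "green"]})
test = pd.DataFrame({"colour": ["blue", "purple"]})  # "purple" never seen in train

# the encoder learns the categories from the train set only
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)

# the default output is a sparse matrix, so we convert it to a dense array
train_enc = enc.transform(train).toarray()
test_enc = enc.transform(test).toarray()

print(enc.categories_)
print(test_enc)
```

Train and test end up with the same number of columns, and the unseen category "purple" becomes a row of zeros instead of a new column.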

In [5]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data[['sex', 'embarked', 'cabin']],  # predictors
    data['survived'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")

Train: (916, 3), Test: (393, 3)


Let's check the cardinality:

In [6]:
print(f"Sex: {X_train.sex.unique()} \n")
print(f"Embarked: {X_train.embarked.unique()} \n")
print(f"Cabin: {X_train.cabin.unique()} \n")

Sex: ['female' 'male'] 

Embarked: ['S' 'C' 'Q' nan] 

Cabin: [nan 'E' 'C' 'D' 'B' 'A' 'F' 'T' 'G'] 



**One hot encoding with pandas:**

Pros: 

- Quick 
- Returns pandas DF
- returns the feature names from the dummy variables 

Cons: 

- it does not preserve information from train data to propagate to test data

The pandas method get_dummies() will create as many binary variables as there are categories in the variable:

If the variable colour has 3 categories in the train data, it will create 3 dummy variables. However, if the variable colour has 5 categories in the test data, it will create 5 binary variables. Train and test sets will therefore end up with a different number of features and will be incompatible with training and scoring using Scikit-learn.

In practice, we shouldn't be using get_dummies in our machine learning pipelines. It is, however, useful for quick data exploration. Let's look at this with examples.

In [7]:
# we can create dummy variables with the built-in
# pandas method get_dummies

tmp = pd.get_dummies(X_train['sex'])

tmp.head()

Unnamed: 0,female,male
501,1,0
588,1,0
402,1,0
1193,0,1
686,1,0


In [8]:
# for better visualisation let's put the dummies next
# to the original variable

pd.concat([X_train['sex'],
           pd.get_dummies(X_train['sex'])], axis=1).head()

Unnamed: 0,sex,female,male
501,female,1,0
588,female,1,0
402,female,1,0
1193,male,0,1
686,female,1,0


In [9]:
# and now let's repeat for embarked

tmp = pd.get_dummies(X_train['embarked'])

tmp.head()

Unnamed: 0,C,Q,S
501,0,0,1
588,0,0,1
402,1,0,0
1193,0,1,0
686,0,1,0


In [10]:
# for better visualisation

pd.concat([X_train['embarked'],
           pd.get_dummies(X_train['embarked'])], axis=1).head()

Unnamed: 0,embarked,C,Q,S
501,S,0,0,1
588,S,0,0,1
402,C,1,0,0
1193,Q,0,1,0
686,Q,0,1,0


In [11]:
# and now for cabin

tmp = pd.get_dummies(X_train['cabin'])

tmp.head()

Unnamed: 0,A,B,C,D,E,F,G,T
501,0,0,0,0,0,0,0,0
588,0,0,0,0,0,0,0,0
402,0,0,0,0,0,0,0,0
1193,0,0,0,0,0,0,0,0
686,0,0,0,0,0,0,0,0


In [12]:
# and now for all variables together: train set

tmp = pd.get_dummies(X_train)

print(tmp.shape)

tmp.head()

(916, 13)


Unnamed: 0,sex_female,sex_male,embarked_C,embarked_Q,embarked_S,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T
501,1,0,0,0,1,0,0,0,0,0,0,0,0
588,1,0,0,0,1,0,0,0,0,0,0,0,0
402,1,0,1,0,0,0,0,0,0,0,0,0,0
1193,0,1,0,1,0,0,0,0,0,0,0,0,0
686,1,0,0,1,0,0,0,0,0,0,0,0,0


In [13]:
# and now for all variables together: test set

tmp = pd.get_dummies(X_test)

print(tmp.shape)

tmp.head()

(393, 12)


Unnamed: 0,sex_female,sex_male,embarked_C,embarked_Q,embarked_S,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G
1139,0,1,0,0,1,0,0,0,0,0,0,0
533,1,0,0,0,1,0,0,0,0,0,0,0
459,0,1,0,0,1,0,0,0,0,0,0,0
1150,0,1,0,0,1,0,0,0,0,0,0,0
393,0,1,0,0,1,0,0,0,0,0,0,0


Notice the positives of pandas get_dummies:

- dataframe returned with feature names

And the limitations:

- The train set contains 13 dummy features, whereas the test set contains 12 features. This occurred because there was no category T in cabin in the test set.

This will cause problems when training and scoring models with scikit-learn, because predictors require train and test sets to have the same columns.
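If we still want to use get_dummies for exploration, one possible workaround for the column mismatch is to align the test dummies to the train columns with `reindex`. This is only a sketch on invented toy data; it also silently discards categories that appear only in the test set.

```python
import pandas as pd

# toy data, invented for this illustration: category "T" is absent from test
train = pd.DataFrame({"cabin": ["A", "B", "T"]})
test = pd.DataFrame({"cabin": ["A", "B", "B"]})

train_d = pd.get_dummies(train, dtype=int)

# force the test dummies to have exactly the train columns,
# filling the columns that are missing in test with 0
test_d = pd.get_dummies(test, dtype=int).reindex(
    columns=train_d.columns, fill_value=0)

print(train_d.shape, test_d.shape)
print(test_d)
```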

**Encoding into k - 1**

In [14]:
# obtaining k-1 labels: we need to tell get_dummies
# to drop the first binary variable

tmp = pd.get_dummies(X_train['sex'], drop_first=True)

tmp.head()

Unnamed: 0,male
501,0
588,0
402,0
1193,1
686,0


In [15]:
# obtaining k-1 labels: we need to tell get_dummies
# to drop the first binary variable

tmp = pd.get_dummies(X_train['embarked'], drop_first=True)

tmp.head()

Unnamed: 0,Q,S
501,0,1
588,0,1
402,0,0
1193,1,0
686,1,0


For embarked, if an observation shows 0 for Q and S, then its value must be C, the remaining category.

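Caveat: this variable has missing data, and rows with a missing value also show 0 for both Q and S. Unless we encode missing data as well, those rows cannot be told apart from C, so not all the information contained in the variable is captured.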

In [16]:
# altogether: train set

tmp = pd.get_dummies(X_train, drop_first=True)

print(tmp.shape)

tmp.head()

(916, 10)


Unnamed: 0,sex_male,embarked_Q,embarked_S,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T
501,0,0,1,0,0,0,0,0,0,0
588,0,0,1,0,0,0,0,0,0,0
402,0,0,0,0,0,0,0,0,0,0
1193,1,1,0,0,0,0,0,0,0,0
686,0,1,0,0,0,0,0,0,0,0


get_dummies() can handle missing values

In [17]:
# we can add an additional dummy variable to indicate
# missing data

pd.get_dummies(X_train['embarked'], drop_first=True, dummy_na=True).head()

Unnamed: 0,Q,S,NaN
501,0,1,0
588,0,1,0
402,0,0,0
1193,1,0,0
686,1,0,0


**One hot encoding with Scikit-learn**

*Pros*
- quick
- Creates the same number of features in train and test set

*Limitations*

- it returns a numpy array instead of a pandas dataframe
- it does not return the variable names, therefore inconvenient for variable exploration



In [18]:
# we create and train the encoder

encoder = OneHotEncoder(categories='auto',
                        drop='first',  # to return k-1; use drop=None to return k dummies
                        sparse=False,  # dense output (named sparse_output in scikit-learn >= 1.2)
                        handle_unknown='error'  # raise an error on categories not seen during fit
                        )

encoder = encoder.fit(X_train.fillna('Missing'))

In [19]:
encoder.categories_

[array(['female', 'male'], dtype=object),
 array(['C', 'Missing', 'Q', 'S'], dtype=object),
 array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'Missing', 'T'], dtype=object)]

In [20]:
# transform the train set

tmp = encoder.transform(X_train.fillna('Missing'))

pd.DataFrame(tmp).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [21]:
# we can retrieve the feature names from the encoder as follows
# (from scikit-learn 1.0 onwards, use get_feature_names_out() instead):

encoder.get_feature_names()



array(['x0_male', 'x1_Missing', 'x1_Q', 'x1_S', 'x2_B', 'x2_C', 'x2_D',
       'x2_E', 'x2_F', 'x2_G', 'x2_Missing', 'x2_T'], dtype=object)

In [22]:
# we can go ahead and transform the test set
# and then reconstitute it back to a pandas dataframe
# and add the feature names derived by OHE

tmp = encoder.transform(X_test.fillna('Missing'))

tmp = pd.DataFrame(tmp)
tmp.columns = encoder.get_feature_names()

tmp.head()



Unnamed: 0,x0_male,x1_Missing,x1_Q,x1_S,x2_B,x2_C,x2_D,x2_E,x2_F,x2_G,x2_Missing,x2_T
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


#### One Hot Encoding of Frequent Categories

We learned in Section 3 that high cardinality and rare labels may result in certain categories appearing only in the train set, therefore causing over-fitting, or only in the test set, and then our models wouldn't know how to score those observations.

We also learned in the previous lecture on one hot encoding, that if categorical variables contain multiple labels, then by re-encoding them with dummy variables we will expand the feature space dramatically.

**In order to avoid these complications, we can create dummy variables only for the most frequent categories**

This procedure is also called one hot encoding of top categories.

OHE of frequent or top categories is equivalent to grouping all the remaining categories under a new category. We will have a better look at grouping rare values into a new category in a later notebook in this section.
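The equivalence can be checked with a short sketch on an invented toy variable: encoding only the top categories gives the same dummies as grouping the rest under a new label (here the hypothetical label "Other") and then leaving out the dummy of that label.

```python
import pandas as pd

# toy data, invented for this illustration
s = pd.Series(["a", "a", "a", "b", "b", "c", "d"], name="var")

# the 2 most frequent categories: 'a' and 'b'
top = s.value_counts().head(2).index.tolist()

# option 1: dummies only for the top categories
top_only = pd.DataFrame({f"var_{c}": (s == c).astype(int) for c in top})

# option 2: group the remaining categories, encode, drop the grouped dummy
grouped = s.where(s.isin(top), "Other")
via_group = pd.get_dummies(grouped, prefix="var", dtype=int).drop(
    columns="var_Other")

print(top_only)
print(via_group)
```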


**Advantages of OHE of top categories**

- Straightforward to implement
- Does not require hours of variable exploration
- Does not massively expand the feature space
- Suitable for linear models

**Limitations**

- Does not add any information that may make the variable more predictive
- Does not keep the information of the ignored labels

Often, categorical variables show a few dominating categories while the remaining labels add little information. Therefore, OHE of top categories is a simple and useful technique.

**Note**

The number of top categories to encode is a free choice. In the KDD competition the authors selected 10, but it could just as well have been 15 or 5. It can be picked arbitrarily or derived from data exploration.
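As one possible way of deriving the number from data exploration, we could keep the smallest set of categories that covers a given share of the observations. This is only a sketch: the toy variable and the 85% target are arbitrary assumptions, not a recommendation.

```python
import pandas as pd

# toy data, invented for this illustration
s = pd.Series(["a"] * 50 + ["b"] * 30 + ["c"] * 10 + ["d"] * 6 + ["e"] * 4)

# cumulative share of observations, from most to least frequent category
coverage = s.value_counts(normalize=True).cumsum()

target = 0.85  # hypothetical coverage threshold

# categories strictly below the target, plus the one that crosses it
n_top = min(int((coverage < target).sum()) + 1, len(coverage))
top_categories = coverage.index[:n_top].tolist()

print(coverage)
print(f"Keep {n_top} categories: {top_categories}")
```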

In [23]:

use_columns = ['Neighborhood', 'Exterior1st', 'Exterior2nd', 'SalePrice']
data = pd.read_csv('../datasets/houseprice.csv',usecols=use_columns)

data.head()

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd,SalePrice
0,CollgCr,VinylSd,VinylSd,208500
1,Veenker,MetalSd,MetalSd,181500
2,CollgCr,VinylSd,VinylSd,223500
3,Crawfor,Wd Sdng,Wd Shng,140000
4,NoRidge,VinylSd,VinylSd,250000


In [27]:
# let's have a look at how many labels each variable has

for col in data.columns:
    print(f"{col}: {len(data[col].unique())} Labels" )

Neighborhood: 25 Labels
Exterior1st: 15 Labels
Exterior2nd: 16 Labels
SalePrice: 663 Labels


In [30]:
# let's explore the unique categories
for col in data.columns:
    if not col == 'SalePrice':
        print(f"{col}: \n {data[col].unique()} \n" )

Neighborhood: 
 ['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste'] 

Exterior1st: 
 ['VinylSd' 'MetalSd' 'Wd Sdng' 'HdBoard' 'BrkFace' 'WdShing' 'CemntBd'
 'Plywood' 'AsbShng' 'Stucco' 'BrkComm' 'AsphShn' 'Stone' 'ImStucc'
 'CBlock'] 

Exterior2nd: 
 ['VinylSd' 'MetalSd' 'Wd Shng' 'HdBoard' 'Plywood' 'Wd Sdng' 'CmentBd'
 'BrkFace' 'Stucco' 'AsbShng' 'Brk Cmn' 'ImStucc' 'AsphShn' 'Stone'
 'Other' 'CBlock'] 



*Attention*

It is important to select the top or most frequent categories based on the train data. Then, we will use those top categories to encode the variables in the test data as well.

In [32]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data[['Neighborhood', 'Exterior1st', 'Exterior2nd']],  # predictors
    data['SalePrice'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0
)
print(f"Train: {X_train.shape}, X_test:{X_test.shape}")

Train: (1022, 3), X_test:(438, 3)


In [33]:
# let's first examine how OHE expands the feature space

pd.get_dummies(X_train, drop_first=True).shape

(1022, 53)

From the initial 3 categorical variables, we end up with 53 variables.

These numbers are still not huge, and in practice we could work with them relatively easily. However, in real-life datasets, categorical variables can be highly cardinal, and with OHE we can end up with datasets with thousands of columns.

**OHE with pandas and NumPy**

**Advantages**

- quick
- returns pandas dataframe
- returns feature names for the dummy variables

**Limitations:**

- it does not preserve information from train data to propagate to test data


In [34]:
# let's find the top 10 most frequent categories for each variable

for col in data.columns:
    if not col =='SalePrice':
        print(f"{col}: \n {X_train[col].value_counts().sort_values(ascending=False).head(10)} \n")

Neighborhood: 
 NAmes      151
CollgCr    105
OldTown     73
Edwards     71
Sawyer      61
Somerst     56
Gilbert     55
NridgHt     51
NWAmes      51
SawyerW     45
Name: Neighborhood, dtype: int64 

Exterior1st: 
 VinylSd    364
HdBoard    153
Wd Sdng    148
MetalSd    138
Plywood     86
CemntBd     39
BrkFace     35
WdShing     21
Stucco      17
AsbShng     15
Name: Exterior1st, dtype: int64 

Exterior2nd: 
 VinylSd    353
Wd Sdng    142
HdBoard    141
MetalSd    136
Plywood    112
CmentBd     39
Wd Shng     29
BrkFace     18
AsbShng     17
Stucco      16
Name: Exterior2nd, dtype: int64 



In [39]:
# let's make a list with the most frequent categories of the variable Neighborhood

top_10 = [
    x for x in X_train['Neighborhood'].value_counts().sort_values(ascending=False).head(10).index
]

top_10

['NAmes',
 'CollgCr',
 'OldTown',
 'Edwards',
 'Sawyer',
 'Somerst',
 'Gilbert',
 'NridgHt',
 'NWAmes',
 'SawyerW']

In [40]:
# and now we make the 10 binary variables

for label in top_10:
    X_train['Neighborhood' + '_' + label] = np.where(
        X_train['Neighborhood'] == label, 1, 0)
    
    X_test['Neighborhood' + '_' + label] = np.where(
        X_test['Neighborhood'] == label, 1, 0)

# let's visualise the result
X_train[['Neighborhood'] + ['Neighborhood'+'_'+c for c in top_10]].head(10)

Unnamed: 0,Neighborhood,Neighborhood_NAmes,Neighborhood_CollgCr,Neighborhood_OldTown,Neighborhood_Edwards,Neighborhood_Sawyer,Neighborhood_Somerst,Neighborhood_Gilbert,Neighborhood_NridgHt,Neighborhood_NWAmes,Neighborhood_SawyerW
64,CollgCr,0,1,0,0,0,0,0,0,0,0
682,ClearCr,0,0,0,0,0,0,0,0,0,0
960,BrkSide,0,0,0,0,0,0,0,0,0,0
1384,Edwards,0,0,0,1,0,0,0,0,0,0
1100,SWISU,0,0,0,0,0,0,0,0,0,0
416,Sawyer,0,0,0,0,1,0,0,0,0,0
1034,Crawfor,0,0,0,0,0,0,0,0,0,0
853,NAmes,1,0,0,0,0,0,0,0,0,0
472,Edwards,0,0,0,1,0,0,0,0,0,0
1011,Edwards,0,0,0,1,0,0,0,0,0,0


In [41]:
# we can turn the previous commands into 2 functions


def calculate_top_categories(df, variable, how_many=10):
    return [
        x for x in df[variable].value_counts().sort_values(
            ascending=False).head(how_many).index
    ]


def one_hot_encode(train, test, variable, top_x_labels):

    for label in top_x_labels:
        train[variable + '_' + label] = np.where(
            train[variable] == label, 1, 0)
        
        test[variable + '_' + label] = np.where(
            test[variable] == label,1, 0)

In [42]:
# and now we run a loop over all the categorical variables

for variable in data.columns:
    if not variable == 'SalePrice':
        top_categories = calculate_top_categories(X_train, variable, how_many=10)
        one_hot_encode(X_train, X_test, variable, top_categories)

In [43]:
X_train.head()

Unnamed: 0,Neighborhood,Exterior1st,Exterior2nd,Neighborhood_NAmes,Neighborhood_CollgCr,Neighborhood_OldTown,Neighborhood_Edwards,Neighborhood_Sawyer,Neighborhood_Somerst,Neighborhood_Gilbert,...,Exterior2nd_VinylSd,Exterior2nd_Wd Sdng,Exterior2nd_HdBoard,Exterior2nd_MetalSd,Exterior2nd_Plywood,Exterior2nd_CmentBd,Exterior2nd_Wd Shng,Exterior2nd_BrkFace,Exterior2nd_AsbShng,Exterior2nd_Stucco
64,CollgCr,VinylSd,VinylSd,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
682,ClearCr,Wd Sdng,Wd Sdng,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
960,BrkSide,Wd Sdng,Plywood,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1384,Edwards,WdShing,Wd Shng,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1100,SWISU,Wd Sdng,Wd Sdng,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
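The two functions above modify the train and test dataframes in place. As an alternative sketch, the same logic can be wrapped in a small fit / transform helper that learns the top categories once and returns new dataframes. The class `TopCategoryEncoder` and the toy data are hypothetical and written only for this illustration.

```python
import pandas as pd


class TopCategoryEncoder:
    """Hypothetical helper: one hot encodes only the most frequent categories."""

    def __init__(self, how_many=10):
        self.how_many = how_many
        self.top_ = {}

    def fit(self, df):
        # learn the top categories of every column from the given (train) data
        for col in df.columns:
            self.top_[col] = (
                df[col].value_counts().head(self.how_many).index.tolist())
        return self

    def transform(self, df):
        # work on a copy so the input dataframe is left untouched
        out = df.copy()
        for col, labels in self.top_.items():
            for label in labels:
                out[f"{col}_{label}"] = (df[col] == label).astype(int)
        # drop the original categorical columns
        return out.drop(columns=list(self.top_))


# toy data, invented for this illustration
train = pd.DataFrame({"n": ["x", "x", "x", "y", "y", "z"]})
test = pd.DataFrame({"n": ["y", "w"]})  # "w" never seen in train

enc = TopCategoryEncoder(how_many=2).fit(train)
train_t = enc.transform(train)
test_t = enc.transform(test)

print(test_t)
```

Because the categories are stored at fit time, train and test always receive the same dummy columns, and labels outside the learned top categories are encoded as all zeros.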
