## One Hot Encoding

**One hot encoding** consists in **encoding each categorical variable with a set of boolean variables** (also called **dummy variables**), which take the values 0 or 1 to indicate **whether a category is present in an observation**. For example, "Gender", with the labels 'female' and 'male', can be encoded into the binary variables 'female' and 'male', and 'color' into one binary variable per color label!

**When one hot encoding a categorical variable with k categories, k - 1 binary variables are enough.** For example, for the colors **blue, red and green**, only 2 binary variables (blue and red) are sufficient: **if the color is green, both blue and red are 0**, so there is no need for a green variable! If k=2 (male/female), only 1 variable (k-1=1) is sufficient!
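As a minimal sketch of the k versus k-1 idea, the toy example below builds k dummies with pandas, drops the green one, and checks that it can be recovered from the other two. The colors series is made up for illustration and is not part of the Titanic data used later.

```python
import pandas as pd

# Toy data, made up for illustration
colors = pd.Series(['blue', 'red', 'green', 'red', 'blue'], name='color')

# k dummies: one binary column per category (blue, green, red)
k_dummies = pd.get_dummies(colors, dtype=int)

# k-1 dummies: drop the green column, as in the text above
k_minus_1 = k_dummies.drop(columns='green')

# green is implied whenever both blue and red are 0
recovered_green = 1 - k_minus_1.sum(axis=1)

print(k_minus_1)
print((recovered_green == k_dummies['green']).all())
```

Any one of the k columns could be dropped; green is chosen here only to match the example in the text.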

But to build **tree based algorithms**, to do feature selection with **recursive algorithms**, or to determine **the importance of each single category**, we should do **one hot encoding into k dummy variables**.

It is **straightforward to implement!** It makes **no assumption** about the distribution of the categories! It **keeps all the information** of the categorical variable! It is suitable for linear models! But it **expands the feature space**! It does not add extra information while encoding! Many dummy variables may be identical, introducing redundant information!

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder  # one hot encoding with sklearn
from feature_engine.encoding import OneHotEncoder as fe_OneHotEncoder # with feature-engine

**Load titanic dataset!**

In [2]:
data = pd.read_csv('titanic.csv',
                   usecols=['sex', 'embarked', 'cabin', 'survived'])
data.head()

Unnamed: 0,survived,sex,cabin,embarked
0,1,female,B5,S
1,1,male,C22,S
2,0,female,C22,S
3,0,male,C22,S
4,0,female,C22,S


**Capture only the first letter of the cabin for this demonstration!**

In [3]:
data['cabin'] = data['cabin'].str[0]
data.head()

Unnamed: 0,survived,sex,cabin,embarked
0,1,female,B,S
1,1,male,C,S
2,0,female,C,S
3,0,male,C,S
4,0,female,C,S


**Separate into training and testing set!**

In [4]:
X_train, X_test, y_train, y_test = train_test_split(
    data[['sex', 'embarked', 'cabin']],  # predictors
    data['survived'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility
X_train.shape, X_test.shape

((916, 3), (393, 3))

### Let's explore the cardinality

**Sex has 2 labels!**

In [5]:
X_train['sex'].unique()

array(['female', 'male'], dtype=object)

**Embarked has 3 labels and missing data!**

In [6]:
X_train['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

**Cabin has 8 labels and missing data!**

In [7]:
X_train['cabin'].unique()

array([nan, 'E', 'C', 'D', 'B', 'A', 'F', 'T', 'G'], dtype=object)

## One hot encoding with pandas

It is **quick**, returns a pandas **dataframe** and returns the **feature names for the dummy variables**. But it does **not preserve information from the train data** to propagate to the test data! pandas.get_dummies() creates as many binary variables as there are categories in the variable!

### Into k dummy variables

**Create dummy variables with the built-in pandas method get_dummies!**

In [8]:
tmp = pd.get_dummies(X_train['sex'])
tmp.head()

Unnamed: 0,female,male
501,1,0
588,1,0
402,1,0
1193,0,1
686,1,0


**For better visualisation let's put the dummies next to the original variable!**

In [10]:
pd.concat([X_train['sex'],
           pd.get_dummies(X_train['sex'])], axis=1).head()

Unnamed: 0,sex,female,male
501,female,1,0
588,female,1,0
402,female,1,0
1193,male,0,1
686,female,1,0


**Repeat for embarked!**

In [15]:
tmp = pd.get_dummies(X_train['embarked'])
tmp.head()

Unnamed: 0,C,Q,S
501,0,0,1
588,0,0,1
402,1,0,0
1193,0,1,0
686,0,1,0


**For better visualisation!**

In [12]:
pd.concat([X_train['embarked'],
           pd.get_dummies(X_train['embarked'])], axis=1).head()

Unnamed: 0,embarked,C,Q,S
501,S,0,0,1
588,S,0,0,1
402,C,1,0,0
1193,Q,0,1,0
686,Q,0,1,0


**Now for cabin!**

In [21]:
tmp = pd.get_dummies(X_train['cabin'])
tmp.head()

Unnamed: 0,A,B,C,D,E,F,G,T
501,0,0,0,0,0,0,0,0
588,0,0,0,0,0,0,0,0
402,0,0,0,0,0,0,0,0
1193,0,0,0,0,0,0,0,0
686,0,0,0,0,0,0,0,0


**Now for all variables together: train set!**

In [23]:
tmp = pd.get_dummies(X_train)
print(tmp.shape)
tmp.head()

(916, 13)


Unnamed: 0,sex_female,sex_male,embarked_C,embarked_Q,embarked_S,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T
501,1,0,0,0,1,0,0,0,0,0,0,0,0
588,1,0,0,0,1,0,0,0,0,0,0,0,0
402,1,0,1,0,0,0,0,0,0,0,0,0,0
1193,0,1,0,1,0,0,0,0,0,0,0,0,0
686,1,0,0,1,0,0,0,0,0,0,0,0,0


**Now for all variables together: test set!**

In [24]:
tmp = pd.get_dummies(X_test)
print(tmp.shape)
tmp.head()

(393, 12)


Unnamed: 0,sex_female,sex_male,embarked_C,embarked_Q,embarked_S,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G
1139,0,1,0,0,1,0,0,0,0,0,0,0
533,1,0,0,0,1,0,0,0,0,0,0,0
459,0,1,0,0,1,0,0,0,0,0,0,0
1150,0,1,0,0,1,0,0,0,0,0,0,0
393,0,1,0,0,1,0,0,0,0,0,0,0


Notice the positives of pandas get_dummies: it returns a dataframe with **feature names!** But the train set contains **13 dummy features**, whereas the test set contains **12 features**. This occurred because there was no category T in cabin in the test set. Machine learning models require the train and test sets to contain the same features!
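One possible workaround, sketched below on made-up toy data, is to align the test dummies to the train columns with DataFrame.reindex. This is not what the notebook does; it only illustrates how the mismatch could be patched while staying with pandas.

```python
import pandas as pd

# Toy train and test sets, made up for illustration: 'T' appears only in train
train = pd.DataFrame({'cabin': ['A', 'B', 'T', 'A']})
test = pd.DataFrame({'cabin': ['B', 'A', 'B']})

train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# Align the test columns to the train columns:
# missing dummies are added and filled with 0, extra ones are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies.shape, test_dummies.shape, test_aligned.shape)
```

The train columns have to be stored somewhere so they can be reused at scoring time, which is the bookkeeping that the encoders in the following sections handle for us.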

### Into k-1 dummy variables

**To obtain k-1 dummy variables,** we need to tell **get_dummies to drop the first binary variable!**

In [11]:
tmp = pd.get_dummies(X_train['sex'], drop_first=True)
tmp.head()

Unnamed: 0,male
501,0
588,0
402,0
1193,1
686,0


In [12]:
tmp = pd.get_dummies(X_train['embarked'], drop_first=True)
tmp.head()

Unnamed: 0,Q,S
501,0,1
588,0,1
402,0,0
1193,1,0
686,1,0


For embarked, if an observation shows **0 for Q and S**, then **its value must be C**, the remaining category (or the value is missing, because get_dummies encodes missing values as all zeros by default)!

In [14]:
tmp = pd.get_dummies(X_train, drop_first=True)
print(tmp.shape)  # altogether: train set
tmp.head()

(916, 10)


Unnamed: 0,sex_male,embarked_Q,embarked_S,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T
501,0,0,1,0,0,0,0,0,0,0
588,0,0,1,0,0,0,0,0,0,0
402,0,0,0,0,0,0,0,0,0,0
1193,1,1,0,0,0,0,0,0,0,0
686,0,1,0,0,0,0,0,0,0,0


In [15]:
tmp = pd.get_dummies(X_test, drop_first=True)
print(tmp.shape)  # altogether: test set
tmp.head()

(393, 9)


Unnamed: 0,sex_male,embarked_Q,embarked_S,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G
1139,1,0,1,0,0,0,0,0,0
533,0,0,1,0,0,0,0,0,0
459,1,0,1,0,0,0,0,0,0
1150,1,0,1,0,0,0,0,0,0
393,1,0,1,0,0,0,0,0,0


### Bonus: get_dummies() can handle missing values

In [16]:
pd.get_dummies(X_train['embarked'], drop_first=True, dummy_na=True).head()

Unnamed: 0,Q,S,NaN
501,0,1,0
588,0,1,0
402,0,0,0
1193,1,0,0
686,1,0,0
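To make the effect of dummy_na easier to see, here is a small sketch on made-up data that contains one missing value. Without dummy_na, the missing row and the dropped category are both encoded as all zeros; with dummy_na=True, the missing row gets its own indicator column.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration, with one missing value at position 2
embarked = pd.Series(['S', 'C', np.nan, 'Q'], name='embarked')

# Default behaviour: NaN is encoded as all zeros, just like the dropped category C
without_na = pd.get_dummies(embarked, drop_first=True, dtype=int)

# dummy_na=True adds an extra column that flags missing values
with_na = pd.get_dummies(embarked, drop_first=True, dummy_na=True, dtype=int)

print(without_na)
print(with_na)
```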


## One hot encoding with Scikit-learn

It is **quick** and creates the **same number of features in the train and test sets**. But it returns a **numpy array** instead of a pandas dataframe and does **not attach the variable names** to its output, which makes it inconvenient for variable exploration!

**Create and train the encoder!**

In [17]:
encoder = OneHotEncoder(categories='auto',
                       drop='first', # to return k-1, use drop=None to return k dummies
                       sparse=False, # return a dense array (renamed sparse_output in scikit-learn >= 1.2)
                       handle_unknown='error') # raise an error on categories not seen during fit
encoder.fit(X_train.fillna('Missing'))

OneHotEncoder(drop='first', sparse=False)

**Observe the learned categories!**

In [18]:
encoder.categories_

[array(['female', 'male'], dtype=object),
 array(['C', 'Missing', 'Q', 'S'], dtype=object),
 array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'Missing', 'T'], dtype=object)]

**Transform the train set!**

In [19]:
tmp = encoder.transform(X_train.fillna('Missing'))
pd.DataFrame(tmp).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11
0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


**Now retrieve the feature names as follows (from Scikit-learn 1.2 onward, get_feature_names() has been removed in favor of get_feature_names_out()):**

In [20]:
encoder.get_feature_names()

array(['x0_male', 'x1_Missing', 'x1_Q', 'x1_S', 'x2_B', 'x2_C', 'x2_D',
       'x2_E', 'x2_F', 'x2_G', 'x2_Missing', 'x2_T'], dtype=object)
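Since the name of this method depends on the installed Scikit-learn version, the sketch below shows one way to retrieve the feature names that should work with both older and newer releases. The toy dataframe is made up for illustration, and the hasattr check is just one possible way to handle the version difference.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up for illustration
df = pd.DataFrame({'sex': ['female', 'male', 'male'],
                   'embarked': ['S', 'C', 'Q']})

enc = OneHotEncoder(drop='first').fit(df)

# get_feature_names_out() exists in scikit-learn >= 1.0;
# older releases only have get_feature_names()
if hasattr(enc, 'get_feature_names_out'):
    names = enc.get_feature_names_out()
else:
    names = enc.get_feature_names(df.columns)

# The default output is a sparse matrix, so convert it to a dense array
encoded = pd.DataFrame(enc.transform(df).toarray(),
                       columns=names, index=df.index)
print(encoded)
```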

**Transform the test set! Then convert it back to a dataframe and add the feature names derived by the encoder!**

In [21]:
tmp = encoder.transform(X_test.fillna('Missing'))
tmp = pd.DataFrame(tmp)
tmp.columns = encoder.get_feature_names()
tmp.head()

Unnamed: 0,x0_male,x1_Missing,x1_Q,x1_S,x2_B,x2_C,x2_D,x2_E,x2_F,x2_G,x2_Missing,x2_T
0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
3,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
4,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


We can see that **train and test contain the same number** of features!
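The encoder above uses handle_unknown='error', so a category that appears only in the test set would raise an error at transform time. As a hedged alternative, the sketch below uses handle_unknown='ignore' on made-up toy data: the unseen category is encoded as a row of zeros, and the number of columns stays the same as in the train set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data, made up for illustration
train = pd.DataFrame({'cabin': ['A', 'B', 'T']})
test = pd.DataFrame({'cabin': ['B', 'Z']})  # 'Z' was never seen during fit

enc = OneHotEncoder(handle_unknown='ignore').fit(train)

# The default output is sparse, so convert it to dense arrays for inspection
train_enc = enc.transform(train).toarray()
test_enc = enc.transform(test).toarray()

print(train_enc.shape, test_enc.shape)
print(test_enc)
```

Whether silently encoding unseen categories as zeros is acceptable depends on the use case; raising an error, as in the notebook, makes such cases visible instead.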


## One hot encoding with Feature-Engine

It is **quick**, returns a **dataframe**, returns **feature names** and allows us to **select the features to encode**! **Limitations:** none identified yet!

In [23]:
ohe_enc = fe_OneHotEncoder(
    top_categories=None,
    variables=['sex', 'embarked'],  # we can select which variables to encode
    drop_last=True)  # to return k-1, False to return k
ohe_enc.fit(X_train.fillna('Missing'))

OneHotEncoder(drop_last=True, variables=['sex', 'embarked'])

In [24]:
tmp = ohe_enc.transform(X_train.fillna('Missing'))
tmp.head()

Unnamed: 0,cabin,sex_female,embarked_S,embarked_C,embarked_Q
501,Missing,1,1,0,0
588,Missing,1,1,0,0
402,Missing,1,0,1,0
1193,Missing,0,0,0,1
686,Missing,1,0,0,1


Note how feature-engine returns the dummy variables with their names, and drops the original variable, leaving the dataset ready for further exploration or building machine learning models.

In [25]:
tmp = ohe_enc.transform(X_test.fillna('Missing'))
tmp.head()

Unnamed: 0,cabin,sex_female,embarked_S,embarked_C,embarked_Q
1139,Missing,0,1,0,0
533,Missing,1,1,0,0
459,Missing,0,1,0,0
1150,Missing,0,1,0,0
393,Missing,0,1,0,0


**Feature-Engine's one hot encoder also selects all categorical variables automatically!**

In [26]:
ohe_enc = fe_OneHotEncoder(
    top_categories=None,
    drop_last=True)  # to return k-1, False to return k
ohe_enc.fit(X_train.fillna('Missing'))

OneHotEncoder(drop_last=True)

In [27]:
ohe_enc.variables_

['sex', 'embarked', 'cabin']

In [28]:
tmp = ohe_enc.transform(X_train.fillna('Missing'))
tmp.head()

Unnamed: 0,sex_female,embarked_S,embarked_C,embarked_Q,cabin_Missing,cabin_E,cabin_C,cabin_D,cabin_B,cabin_A,cabin_F,cabin_T
501,1,1,0,0,1,0,0,0,0,0,0,0
588,1,1,0,0,1,0,0,0,0,0,0,0
402,1,0,1,0,1,0,0,0,0,0,0,0
1193,0,0,0,1,1,0,0,0,0,0,0,0
686,1,0,0,1,1,0,0,0,0,0,0,0


In [29]:
tmp = ohe_enc.transform(X_test.fillna('Missing'))
tmp.head()

Unnamed: 0,sex_female,embarked_S,embarked_C,embarked_Q,cabin_Missing,cabin_E,cabin_C,cabin_D,cabin_B,cabin_A,cabin_F,cabin_T
1139,0,1,0,0,1,0,0,0,0,0,0,0
533,1,1,0,0,1,0,0,0,0,0,0,0
459,0,1,0,0,1,0,0,0,0,0,0,0
1150,0,1,0,0,1,0,0,0,0,0,0,0
393,0,1,0,0,1,0,0,0,0,0,0,0


Note how this encoder returns a variable cabin_T for the test set as well, even though this category is not present in the test set. This allows integration with a Scikit-learn pipeline and scoring of the test set by the trained algorithm.

In fact, we can check that the sum of cabin_T is 0:

In [30]:
tmp['cabin_T'].sum()

0
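The all-zero cabin_T column appears because the categories are learned from the train set and then reused on the test set. The sketch below imitates that idea with pandas only, on made-up toy data. It is not Feature-Engine's actual implementation, and the encode helper is a hypothetical name introduced for illustration.

```python
import pandas as pd

# Toy data, made up for illustration: 'T' appears only in train
train = pd.Series(['A', 'B', 'T', 'A'], name='cabin')
test = pd.Series(['B', 'A', 'B'], name='cabin')

# "fit": learn the categories from the train set only
learned_categories = sorted(train.dropna().unique())


def encode(series, categories):
    """One hot encode a series using a fixed list of categories (hypothetical helper)."""
    fixed = series.astype(pd.CategoricalDtype(categories=categories))
    # get_dummies creates one column per declared category, even if unobserved
    return pd.get_dummies(fixed, prefix=series.name, dtype=int)


train_enc = encode(train, learned_categories)
test_enc = encode(test, learned_categories)

print(test_enc)
print(test_enc['cabin_T'].sum())
```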