## One Hot Encoding - Category encoders



We will see how to perform one hot encoding with Category Encoders using the Titanic dataset.



In [56]:
!pip install category_encoders



In [57]:
import pandas as pd
from sklearn.model_selection import train_test_split

from category_encoders.one_hot import OneHotEncoder

In [58]:
# load titanic dataset

usecols = ["pclass", "sibsp", "parch", "sex", "embarked", "cabin", "survived"]

data = pd.read_csv("../../Datasets/titanic.csv", usecols=usecols)
data["cabin"] = data["cabin"].str[0]

data.head()

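| | pclass | survived | sex | sibsp | parch | cabin | embarked |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | female | 0 | 0 | B | S |
| 1 | 1 | 1 | male | 1 | 2 | C | S |
| 2 | 1 | 0 | female | 1 | 2 | C | S |
| 3 | 1 | 0 | male | 1 | 2 | C | S |
| 4 | 1 | 0 | female | 1 | 2 | C | S |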


### Encoding important

Just like imputation, all methods of categorical encoding should be fitted on the training set, and then applied to the test set.

Why? 

Because these methods "learn" patterns from the training data, fitting them on the full dataset would leak information and encourage overfitting. More importantly, we don't know whether future or live data will contain all the categories present in the training data, or whether there will be more or fewer categories. We want to anticipate this uncertainty by setting up the right process from the start: we create transformers that learn the categories from the train set, and use those learned categories to create the dummy variables in both the train and test sets.
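
The idea of learning the categories on the train set and reusing them on the test set can be sketched with plain pandas. This is only an illustration on made-up toy data, not how Category Encoders is implemented; the frames `train` and `test` and the helper `encode_with_learned_categories` are hypothetical names.

```python
import pandas as pd

# Toy data (made up for illustration): the test set contains a
# category ("Q") that never appears in the training set.
train = pd.DataFrame({"embarked": ["S", "C", "S"]})
test = pd.DataFrame({"embarked": ["C", "Q"]})

# "Fit": learn the dummy columns from the training set only.
train_dummies = pd.get_dummies(train, columns=["embarked"], dtype=int)
learned_columns = train_dummies.columns


def encode_with_learned_categories(df):
    """Create dummies, then align them to the columns learned from train."""
    dummies = pd.get_dummies(df, columns=["embarked"], dtype=int)
    # Columns for unseen categories are dropped; learned categories
    # that are absent from df are added and filled with 0.
    return dummies.reindex(columns=learned_columns, fill_value=0)


test_dummies = encode_with_learned_categories(test)
print(test_dummies)
```

In this sketch the unseen category "Q" ends up as a row of zeros, and the test set has exactly the same columns as the train set.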

In [59]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop("survived", axis=1),  # predictors
    data["survived"],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0,
)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((916, 6), (393, 6))

## One hot encoding with Category encoders

### Advantages

- quick
- returns dataframe
- returns feature names
- allows selecting the features to encode
- appends dummies to original dataset

### Limitations

- No option for k-1 dummies
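
If k-1 dummies are needed, one possible workaround is to drop one dummy column per encoded variable after the transformation. The sketch below uses a made-up frame shaped like the encoder's output; the names `encoded`, `encoded_vars` and `to_drop` are hypothetical, and the prefix matching assumes that no other column name starts with the variable name followed by an underscore.

```python
import pandas as pd

# Toy encoded frame (made up), shaped like the encoder's output.
encoded = pd.DataFrame(
    {
        "pclass": [1, 3, 2],
        "sex_female": [1, 0, 1],
        "sex_male": [0, 1, 0],
        "embarked_S": [1, 0, 0],
        "embarked_C": [0, 1, 0],
        "embarked_Q": [0, 0, 1],
    }
)

encoded_vars = ["sex", "embarked"]  # variables that were one hot encoded

# For each encoded variable, pick its first dummy column to drop.
to_drop = []
for var in encoded_vars:
    dummy_cols = [c for c in encoded.columns if c.startswith(var + "_")]
    to_drop.append(dummy_cols[0])

k_minus_1 = encoded.drop(columns=to_drop)
print(k_minus_1.columns.tolist())
```

The list `to_drop` should be derived from the train set and reused on the test set, so that both end up with the same columns.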

In [60]:
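# To start, we fill missing values manually. Later on
# we add this step to a pipeline.
# Assigning the result (instead of inplace=True) avoids
# modifying a slice of the original dataframe.

X_train = X_train.fillna("Missing")
X_test = X_test.fillna("Missing")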

In [61]:
# set up encoder

encoder = OneHotEncoder(
    # If None, the encoder detects the categorical variables itself;
    # alternatively, we can pass a list of variables.
    cols=None,
    use_cat_names=True,  # name the dummies after the categories, e.g. sex_male
)

In [62]:
# fit the encoder (finds categories)

encoder.fit(X_train) # learn categorical variables

| Parameter | Value |
|---|---|
| verbose | 0 |
| cols | ['sex', 'cabin', ...] |
| drop_invariant | False |
| return_df | True |
| handle_missing | 'value' |
| handle_unknown | 'value' |
| use_cat_names | True |


In [63]:
# automatically found categorical variables

encoder.cols  # lists the categorical variables detected during fit

['sex', 'cabin', 'embarked']

In [64]:
# we observe the learned categories

encoder.mapping 

[{'col': 'sex',
  'mapping':     sex_female  sex_male
   1           1         0
   2           0         1
  -1           0         0
  -2           0         0},
 {'col': 'cabin',
  'mapping':     cabin_Missing  cabin_E  cabin_C  cabin_D  cabin_B  cabin_A  cabin_F  \
   1              1        0        0        0        0        0        0   
   2              0        1        0        0        0        0        0   
   3              0        0        1        0        0        0        0   
   4              0        0        0        1        0        0        0   
   5              0        0        0        0        1        0        0   
   6              0        0        0        0        0        1        0   
   7              0        0        0        0        0        0        1   
   8              0        0        0        0        0        0        0   
   9              0        0        0        0        0        0        0   
  -1              0        0        0

Here in the mapping:

{'col': 'sex',
 'mapping': 
          sex_female  sex_male
  1           1         0
  2           0         1
  -1          0         0
  -2          0         0}

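Means:
Index values: 1, 2 (these are not values from the original data; they are the ordinal codes that the encoder assigns internally to the learned categories, here 1 for female and 2 for male)
Encoded as: Two binary columns sex_female and sex_male
Mapping:

Code 1 → [1, 0] (female)

Code 2 → [0, 1] (male)

Special codes -1, -2 → [0, 0] (with handle_unknown='value' and handle_missing='value', as in the parameters shown above, -1 stands for categories not seen during fit and -2 for missing values not seen during fit; both are encoded as [0, 0])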
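
The two-step lookup described above can be emulated with plain pandas. This is a sketch of the mechanism, not the library's actual code; it assumes the convention of -1 for unseen categories and -2 for missing values, and the names `ordinal_codes` and `encode` are hypothetical.

```python
import pandas as pd

# Step 1 (emulated): ordinal codes learned from the training data.
ordinal_codes = {"female": 1, "male": 2}

# Step 2 (emulated): one row of dummies per code, as in the mapping above.
mapping = pd.DataFrame(
    {"sex_female": [1, 0, 0, 0], "sex_male": [0, 1, 0, 0]},
    index=[1, 2, -1, -2],
)


def encode(value):
    """Return the dummy row for a single category value."""
    if pd.isna(value):
        code = -2  # assumed code for a missing value
    else:
        code = ordinal_codes.get(value, -1)  # -1 for unseen categories
    return mapping.loc[code].tolist()


print(encode("female"))
print(encode("other"))
print(encode(None))
```

Both the unseen category and the missing value fall back to a row of zeros, which is why the corresponding observations have no dummy set to 1.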

In [65]:
# transform the data sets

X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)

X_train_enc.head()

# The numerical variables (pclass, sibsp, parch) are unchanged.
# Each categorical variable is replaced by one new column per category.
# For example, "sex" is now represented by two columns: "sex_female" and "sex_male".

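| | pclass | sex_female | sex_male | sibsp | parch | cabin_Missing | cabin_E | cabin_C | cabin_D | cabin_B | cabin_A | cabin_F | cabin_T | cabin_G | embarked_S | embarked_C | embarked_Q | embarked_Missing |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501 | 2 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 588 | 2 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 402 | 2 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1193 | 3 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 686 | 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |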


In [66]:
encoder.get_feature_names_out()  # names of the variables after encoding

array(['pclass', 'sex_female', 'sex_male', 'sibsp', 'parch',
       'cabin_Missing', 'cabin_E', 'cabin_C', 'cabin_D', 'cabin_B',
       'cabin_A', 'cabin_F', 'cabin_T', 'cabin_G', 'embarked_S',
       'embarked_C', 'embarked_Q', 'embarked_Missing'], dtype=object)