
## Comparison of Categorical Variable Encodings

In this lecture, we will compare the performance of the different categorical encoding techniques we have learned so far.

We will compare:

- One hot encoding
- Replacing labels by the count
- Ordering labels according to target
- Mean Encoding
- Weight of Evidence (WoE)


In [0]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import roc_auc_score

In [65]:
# let's load the titanic dataset

# we will only use these columns in the demo
cols = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
        'Sex', 'Cabin', 'Embarked', 'Survived']

data = pd.read_csv('titanic_train.csv', usecols=cols)

data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.25,,S
1,1,1,female,38.0,1,0,71.2833,C85,C
2,1,3,female,26.0,0,0,7.925,,S
3,1,1,female,35.0,1,0,53.1,C123,S
4,0,3,male,35.0,0,0,8.05,,S


In [66]:
# let's check for missing data

data.isnull().sum()

Survived      0
Pclass        0
Sex           0
Age         177
SibSp         0
Parch         0
Fare          0
Cabin       687
Embarked      2
dtype: int64

In [0]:
# Drop observations with NA in Fare and embarked

data.dropna(subset=['Fare', 'Embarked'], inplace=True)

In [68]:
# Now we extract the first letter of the cabin

data['Cabin'] = data['Cabin'].astype(str).str[0]

data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
0,0,3,male,22.0,1,0,7.25,n,S
1,1,1,female,38.0,1,0,71.2833,C,C
2,1,3,female,26.0,0,0,7.925,n,S
3,1,1,female,35.0,1,0,53.1,C,S
4,0,3,male,35.0,0,0,8.05,n,S


In [0]:
# drop observations with cabin = T, they are too few

data = data[data['Cabin'] != 'T']

In [74]:
# Let's divide into train and test set

X_train, X_test, y_train, y_test = train_test_split(
    data.drop(labels='Survived', axis=1),  # predictors
    data['Survived'],  # target
    test_size=0.3,
    random_state=0)

X_train.shape, X_test.shape


Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
352,3,male,15.0,1,1,7.2292,n,C
125,3,male,12.0,1,0,11.2417,n,C
579,3,male,32.0,0,0,7.9250,n,S
424,3,male,18.0,1,1,20.2125,n,S
119,3,female,2.0,4,2,31.2750,n,S
...,...,...,...,...,...,...,...,...
838,3,male,32.0,0,0,56.4958,n,S
193,2,male,3.0,1,1,26.0000,F,S
631,3,male,51.0,0,0,7.0542,n,S
561,3,male,40.0,0,0,7.8958,n,S


In [0]:
# Let's replace null values in numerical variables by the mean


def impute_na(df, variable, value):
    # assign the result back to the column; calling fillna(inplace=True)
    # on a column selection can fail to update df in recent pandas versions
    df[variable] = df[variable].fillna(value)


impute_na(X_test, 'Age', X_train['Age'].mean())
impute_na(X_train, 'Age', X_train['Age'].mean())
# note how I impute the test set first; this way the value of
# the mean used will be the same for both train and test

In [0]:
X_train.head()

In [0]:
# let's check that we have no missing data after NA imputation

X_train.isnull().sum(), X_test.isnull().sum()

### One Hot Encoding

In [0]:
def get_OHE(df):

    df_OHE = pd.concat(
        [df[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']],
         pd.get_dummies(df[['Sex', 'Cabin', 'Embarked']], drop_first=True)],
        axis=1)

    return df_OHE


X_train_OHE = get_OHE(X_train)
X_test_OHE = get_OHE(X_test)

X_train_OHE.head()

In [0]:
X_test_OHE.head()
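
Because `get_OHE` calls `pd.get_dummies` on the train and test sets separately, a category that is missing from one of them would produce different columns in each set. The sketch below shows one possible way to guard against this on toy data. The toy frames and the helper name `align_dummies` are assumptions made for illustration; they are not part of the notebook.

```python
import pandas as pd

# toy frames (hypothetical): category 'G' appears only in the train set
train = pd.DataFrame({'Cabin': ['n', 'C', 'G', 'n']})
test = pd.DataFrame({'Cabin': ['n', 'C', 'C']})


def align_dummies(df_train, df_test):
    # encode the train set with k-1 dummies, as in get_OHE
    train_ohe = pd.get_dummies(df_train, drop_first=True)
    # encode the test set with all k dummies, then keep exactly the
    # train columns: extra columns are dropped, missing ones are set to 0
    test_ohe = pd.get_dummies(df_test)
    test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
    return train_ohe, test_ohe


train_ohe, test_ohe = align_dummies(train, test)
print(list(train_ohe.columns))
print(list(test_ohe.columns))
```

The test set is encoded without `drop_first` on purpose: if the test set lacked the reference category, `drop_first=True` would drop a different column there than in the train set.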

### Count encoding

In [0]:
def categorical_to_counts(df_train, df_test):

    # make a temporary copy of the original dataframes
    df_train_temp = df_train.copy()
    df_test_temp = df_test.copy()

    for col in ['Sex', 'Cabin', 'Embarked']:

        # make dictionary mapping category to counts
        counts_map = df_train_temp[col].value_counts().to_dict()

        # remap the labels to their counts
        df_train_temp[col] = df_train_temp[col].map(counts_map)
        df_test_temp[col] = df_test_temp[col].map(counts_map)

    return df_train_temp, df_test_temp


X_train_count, X_test_count = categorical_to_counts(X_train, X_test)

X_train_count.head()
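
To make the mapping step concrete, here is a minimal sketch on toy data. The label `'X'` is a hypothetical category that never occurs in the train set; it shows that `map` returns NaN for labels missing from the dictionary. Filling those with 0 is only one possible choice, not something the notebook does.

```python
import pandas as pd

# toy data (hypothetical); 'X' is never seen in the train set
train = pd.Series(['S', 'S', 'C', 'S', 'Q'], name='Embarked')
test = pd.Series(['S', 'C', 'X'], name='Embarked')

# dictionary mapping each category to its count in the train set
counts_map = train.value_counts().to_dict()

# remap the labels to their counts, using the train counts for both sets
train_enc = train.map(counts_map)
test_enc = test.map(counts_map)

# labels absent from the train set become NaN; one option is a count of 0
test_enc_filled = test_enc.fillna(0)

print(counts_map)
print(test_enc_filled.tolist())
```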

### Ordered Integer Encoding

In [0]:
def categories_to_ordered(df_train, df_test, y_train, y_test):

    # make a temporary copy of the datasets
    df_train_temp = pd.concat([df_train, y_train], axis=1).copy()
    df_test_temp = pd.concat([df_test, y_test], axis=1).copy()

    for col in ['Sex', 'Cabin', 'Embarked']:

        # order categories according to target mean
        ordered_labels = df_train_temp.groupby(
            [col])['Survived'].mean().sort_values().index

        # create the dictionary to map the ordered labels to an ordinal number
        ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}

        # remap the categories  to these ordinal numbers
        df_train_temp[col] = df_train[col].map(ordinal_label)
        df_test_temp[col] = df_test[col].map(ordinal_label)

    # remove the target
    df_train_temp.drop(['Survived'], axis=1, inplace=True)
    df_test_temp.drop(['Survived'], axis=1, inplace=True)

    return df_train_temp, df_test_temp


X_train_ordered, X_test_ordered = categories_to_ordered(
    X_train, X_test, y_train, y_test)

X_train_ordered.head()
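
The ordering logic can be checked by hand on a small example. The toy frame below is an assumption made for illustration: its survival rates are 0 for `S`, 0.5 for `Q` and 1 for `C`, so the categories should receive the integers 0, 1 and 2 in that order.

```python
import pandas as pd

# toy data (hypothetical): target mean is S=0.0, Q=0.5, C=1.0
toy = pd.DataFrame({
    'Embarked': ['S', 'S', 'C', 'C', 'Q', 'Q'],
    'Survived': [0, 0, 1, 1, 0, 1],
})

# order categories from lowest to highest target mean
ordered_labels = toy.groupby('Embarked')['Survived'].mean().sort_values().index

# map each category to its position in that ordering
ordinal_label = {k: i for i, k in enumerate(ordered_labels, 0)}

print(ordinal_label)
```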

### Mean Encoding

In [76]:
def categories_to_mean(df_train, df_test, y_train, y_test):

    # make a temporary copy of the datasets
    df_train_temp = pd.concat([df_train, y_train], axis=1).copy()
    df_test_temp = pd.concat([df_test, y_test], axis=1).copy()

    for col in ['Sex', 'Cabin', 'Embarked']:

        # calculate mean target per category
        ordered_labels = df_train_temp.groupby(
            [col])['Survived'].mean().to_dict()

        # remap the categories to target mean
        df_train_temp[col] = df_train[col].map(ordered_labels)
        df_test_temp[col] = df_test[col].map(ordered_labels)

    # remove the target
    df_train_temp.drop(['Survived'], axis=1, inplace=True)
    df_test_temp.drop(['Survived'], axis=1, inplace=True)

    return df_train_temp, df_test_temp


X_train_mean, X_test_mean = categories_to_mean(
    X_train, X_test, y_train, y_test)

X_train_mean.head()

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Cabin,Embarked
352,3,0.199005,15.0,1,1,7.2292,0.308668,0.555556
125,3,0.199005,12.0,1,0,11.2417,0.308668,0.555556
579,3,0.199005,32.0,0,0,7.925,0.308668,0.347921
424,3,0.199005,18.0,1,1,20.2125,0.308668,0.347921
119,3,0.739726,2.0,4,2,31.275,0.308668,0.347921


### WoE

In [0]:
def categories_to_woe(df_train, df_test, y_train, y_test):
    # make a temporary copy of the datasets
    df_train_temp = pd.concat([df_train, y_train], axis=1).copy()
    df_test_temp = pd.concat([df_test, y_test], axis=1).copy()
    #print(df_train_temp.columns)
    #print(df_train_temp['Survived'].mean())
    for col in ['Sex', 'Cabin', 'Embarked']:
        # create df containing the different parts of the WoE equation
        # prob survived = 1
        prob_df = pd.DataFrame(df_train_temp.groupby([col])['Survived'].mean())
        # prob survived = 0
        prob_df['died'] = 1 - prob_df['Survived']
        # calculate WoE as the log of the ratio of the two probabilities
        prob_df['WoE'] = np.log(prob_df['Survived'] / prob_df['died'])
        # capture woe in dictionary
        woe = prob_df['WoE'].to_dict()
        # re-map the labels to WoE
        df_train_temp[col] = df_train[col].map(woe)
        df_test_temp[col] = df_test[col].map(woe)

    # drop the target
    df_train_temp.drop(['Survived'], axis=1, inplace=True)
    df_test_temp.drop(['Survived'], axis=1, inplace=True)

    return df_train_temp, df_test_temp


X_train_woe, X_test_woe = categories_to_woe(X_train, X_test, y_train, y_test)

X_train_woe.head()
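
As a quick sanity check of the calculation, here is the same log-ratio computed on toy data. The toy frame is an assumption made for illustration: with survival rates of 0.25 for `male` and 0.75 for `female`, the values should be ln(1/3) and ln(3). Note that a category whose survival rate is exactly 0 or 1 would give an infinite value with this formula.

```python
import numpy as np
import pandas as pd

# toy data (hypothetical): P(survived) is 0.25 for male, 0.75 for female
toy = pd.DataFrame({
    'Sex': ['male'] * 4 + ['female'] * 4,
    'Survived': [1, 0, 0, 0, 1, 1, 1, 0],
})

# probability of survived = 1 per category
prob_df = pd.DataFrame(toy.groupby('Sex')['Survived'].mean())
# probability of survived = 0 per category
prob_df['died'] = 1 - prob_df['Survived']
# log of the ratio of the two probabilities
prob_df['WoE'] = np.log(prob_df['Survived'] / prob_df['died'])

woe = prob_df['WoE'].to_dict()
print(woe)
```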

### Random Forest Performance

In [0]:
# create a function to build random forests and compare performance in train and test set


def run_randomForests(X_train, X_test, y_train, y_test):

    rf = RandomForestClassifier(n_estimators=50, random_state=39, max_depth=3)
    rf.fit(X_train, y_train)

    print('Train set')
    pred = rf.predict_proba(X_train)
    print(
        'Random Forests roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

    print('Test set')
    pred = rf.predict_proba(X_test)
    print(
        'Random Forests roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

In [0]:
# OHE
run_randomForests(X_train_OHE, X_test_OHE, y_train, y_test)

In [0]:
# counts
run_randomForests(X_train_count, X_test_count, y_train, y_test)

In [0]:
# ordered labels
run_randomForests(X_train_ordered, X_test_ordered, y_train, y_test)

In [0]:
# mean encoding
run_randomForests(X_train_mean, X_test_mean, y_train, y_test)

In [0]:
# woe
run_randomForests(X_train_woe, X_test_woe, y_train, y_test)

Comparing the roc_auc values on the test sets, we can see that one hot encoding has the worst performance. This makes sense because trees tend not to perform well on datasets with large feature spaces, and one hot encoding expands each categorical variable into several binary columns.

The remaining encodings returned similar performances. This also makes sense, because trees are non-linear models, so target-guided encodings do not necessarily improve model performance.

### Logistic Regression Performance

In [0]:
def run_logistic(X_train, X_test, y_train, y_test):

    # function to train and test the performance of logistic regression
    logit = LogisticRegression(random_state=44, C=0.01)
    logit.fit(X_train, y_train)

    print('Train set')
    pred = logit.predict_proba(X_train)
    print(
        'Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:, 1])))

    print('Test set')
    pred = logit.predict_proba(X_test)
    print(
        'Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:, 1])))

In [0]:
# OHE
run_logistic(X_train_OHE, X_test_OHE, y_train, y_test)

In [0]:
# counts
run_logistic(X_train_count, X_test_count, y_train, y_test)

In [0]:
# ordered labels
run_logistic(X_train_ordered, X_test_ordered, y_train, y_test)

In [0]:
# mean encoding
run_logistic(X_train_mean, X_test_mean, y_train, y_test)

In [0]:
# woe
run_logistic(X_train_woe, X_test_woe, y_train, y_test)

For logistic regression, the best performances are obtained with one hot encoding, as it preserves linear relationships between the variables and the target, and also with weight of evidence and ordered encoding.

Note, however, that count encoding returns the worst performance, as it does not create a monotonic relationship between the variables and the target. In this case, mean target encoding is probably causing over-fitting.
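
Comparing printed scores across many cells is error-prone, so it can help to collect them in one table. The sketch below is one possible way to do that. The helper name `score_encodings` and the synthetic data are assumptions made for illustration; the synthetic frames only stand in for the encoded Titanic sets and say nothing about the results above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score


def score_encodings(model, datasets, y_train, y_test):
    # datasets: dict mapping an encoding name to an (X_train, X_test) pair
    rows = []
    for name, (X_tr, X_te) in datasets.items():
        model.fit(X_tr, y_train)
        rows.append({
            'encoding': name,
            'train_auc': roc_auc_score(y_train, model.predict_proba(X_tr)[:, 1]),
            'test_auc': roc_auc_score(y_test, model.predict_proba(X_te)[:, 1]),
        })
    return pd.DataFrame(rows).set_index('encoding')


# synthetic stand-in data (hypothetical), only to exercise the helper
rng = np.random.RandomState(0)
X = pd.DataFrame({'a': rng.normal(size=200), 'b': rng.normal(size=200)})
y = (X['a'] + rng.normal(scale=0.5, size=200) > 0).astype(int)
X_tr, X_te = X.iloc[:140], X.iloc[140:]
y_tr, y_te = y.iloc[:140], y.iloc[140:]

results = score_encodings(
    LogisticRegression(),
    {'raw': (X_tr, X_te), 'doubled': (X_tr * 2, X_te * 2)},
    y_tr, y_te)
print(results)
```

In the notebook, the dictionary would instead hold pairs such as `(X_train_OHE, X_test_OHE)` and `(X_train_woe, X_test_woe)`, with `y_train` and `y_test` as the targets.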