# Target encoding with CV
* An encoding technique typically used for high-cardinality categorical features (e.g. postal code, occupation).
* Converts a categorical feature to a numerical one using information from the target.
* The simplest application is binary classification, where each class of a given categorical feature is replaced with the mean of the target over the rows in that class.
* If the cardinality is high, many classes have only a few rows, so their target means are noisy and each row's own target leaks into its encoding. The model can then overfit, especially when the target distribution differs between train and test. To reduce this, one can use K-fold cross-validation to compute the target means.
    * For each of the K folds, encode a class (say, A) with the mean of the target for A computed from the other K-1 folds.
    * This [article](https://medium.com/@pouryaayria/k-fold-target-encoding-dfe9a594874b) explains the approach well.
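
The fold-wise procedure above can be sketched with pandas and scikit-learn alone. This is only a minimal illustration, not the implementation used by either library in the sections below. The helper name `kfold_target_encode`, the `demo` data, and the fallback to the fold-level mean for categories missing from the K-1 folds are all assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold


def kfold_target_encode(x, y, n_splits=5, seed=42):
    """Out-of-fold mean target encoding of one categorical Series (hypothetical helper)."""
    encoded = pd.Series(np.nan, index=x.index, dtype=float)
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, oof_idx in cv.split(x, y):
        # Category means computed from the other K-1 folds only
        fold_means = y.iloc[fit_idx].groupby(x.iloc[fit_idx]).mean()
        # Assumption: a category absent from the K-1 folds gets their overall mean
        fallback = y.iloc[fit_idx].mean()
        encoded.iloc[oof_idx] = (
            x.iloc[oof_idx].map(fold_means).fillna(fallback).to_numpy()
        )
    return encoded


# Illustrative data with the same shape as the fake data generated below
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x": rng.choice(["a", "b", "c", "d"], size=1000, p=[.3, .05, .45, .2]),
    "y": rng.choice([0, 1], size=1000, p=[.3, .7]),
})

# Naive encoding: one value per category, computed from all rows
naive = demo["y"].groupby(demo["x"]).transform("mean")
# K-fold encoding: up to n_splits values per category, none using the row's own target
oof = kfold_target_encode(demo["x"], demo["y"])

print(naive.groupby(demo["x"]).nunique())
print(oof.groupby(demo["x"]).nunique())
```

The naive encoding gives exactly one value per category, while the out-of-fold encoding gives up to five (one per fold), which is a quick way to check whether cross-validation was actually applied.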


In [1]:
import random

import pandas as pd
import numpy as np
import category_encoders as ce
from kaggler.preprocessing import LabelEncoder, TargetEncoder
from sklearn.model_selection import StratifiedKFold

Using TensorFlow backend.


In [2]:
# Generate fake categorical data
N = 1000
y = np.random.choice(a=[0,1], size=N, p=[.3, .7])
x = np.random.choice(a=['a', 'b', 'c', 'd'], size=N, p=[.3, .05, .45, .2])
df = pd.DataFrame({'y': y, 'x': x})

In [3]:
display(df.x.value_counts())
display(df.y.describe())

c    441
a    314
d    198
b     47
Name: x, dtype: int64

count    1000.000000
mean        0.699000
std         0.458922
min         0.000000
25%         0.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: y, dtype: float64

# Using Kaggler

In [4]:
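y = df['y']
trn = df.drop('y',axis=1)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
te = TargetEncoder(cv=cv)
# NOTE: fit() followed by transform() yields a single value per category
# (see the four distinct values in the output of the next cell), so the CV
# splitter does not appear to be used on this path. For out-of-fold encodings
# of the training set, te.fit_transform(trn, y) is probably the call to use.
te.fit(trn,y)
trn = te.transform(trn)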

In [5]:
trn.x.value_counts()

0.732438    441
0.665904    314
0.707138    198
0.573758     47
Name: x, dtype: int64

# Using Sklearn
* Code adapted from https://www.kaggle.com/caesarlupum/2020-20-lines-target-encoding. It combines `TargetEncoder` from `category_encoders` with scikit-learn's `StratifiedKFold`.

In [6]:
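print("Train target encoder...")
# Start again from the raw categorical column: `trn` was overwritten with
# encoded values in the Kaggler section above.
trn = df.drop('y', axis=1)
cat_feat_to_encode = trn.columns.tolist()
smoothing = 0.20
oof_parts = []

# Target encoding for training set: each fold is encoded with statistics
# learned from the other K-1 folds only
stratified_idx = StratifiedKFold(n_splits=5,
                                 random_state=2020,
                                 shuffle=True).split(trn, y)
for tr_idx, oof_idx in stratified_idx:
    print("-------------")
    print(len(tr_idx), len(oof_idx))

    ce_target_encoder = ce.TargetEncoder(cols=cat_feat_to_encode,
                                         smoothing=smoothing)
    ce_target_encoder.fit(trn.iloc[tr_idx, :], y.iloc[tr_idx])
    # DataFrame.append was removed in pandas 2.0; collect the pieces instead
    oof_parts.append(ce_target_encoder.transform(trn.iloc[oof_idx, :]))
# Concatenate the out-of-fold pieces and restore the original row order
trn_encoded = pd.concat(oof_parts).sort_index()

# Target encoding for test set
# Test rows have no target of their own, so nothing can leak: the encoder is
# refit on the full raw training set and then applied to the test set.
# There is no test set in this notebook, so the code stays commented out.
# ce_target_encoder = ce.TargetEncoder(cols=cat_feat_to_encode,
#                                      smoothing=smoothing)
# ce_target_encoder.fit(trn, y)
# test = ce_target_encoder.transform(test)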

Train target encoder...
-------------
800 200
-------------
800 200
-------------
800 200
-------------
800 200
-------------
800 200
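
The notebook has no test set, so the test-set step is only described in the commented-out code. As a rough sketch of that idea in plain pandas (without smoothing, so not equivalent to `ce.TargetEncoder`), the means are learned from the full training set and mapped onto the test rows. The tiny `train` and `test` frames and the global-mean fallback for unseen categories are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data; "z" never appears in the training set
train = pd.DataFrame({
    "x": ["a", "a", "b", "b", "b", "c"],
    "y": [1, 0, 1, 1, 0, 1],
})
test = pd.DataFrame({"x": ["a", "b", "c", "z"]})

# Fit on the full training set: per-category target mean plus a global fallback
category_means = train.groupby("x")["y"].mean()
global_mean = train["y"].mean()

# Apply to the test set; unseen categories fall back to the global mean
test["x_encoded"] = test["x"].map(category_means).fillna(global_mean)
print(test)
```

Out-of-fold encoding is only needed for rows whose own target would otherwise contribute to their encoding, which is why the training set goes through the K-fold loop and the test set does not.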
