# What is it about?
It's a small workbook to try out mean encoding on the competition's dataset and to understand its regularisation techniques.

**Main questions**:

* What is mean encoding?
* What are the techniques to avoid overfitting?
* Is the cross-validation scheme worth the additional workload compared to smoothing?
* What uplift does mean encoding offer compared to one-hot encoding?

I publish it to save time for somebody who may pose similar questions.


# UPDATE 25.03.19
Examine encoding schemes on a synthetic set following p.14 of [vtreat: a data.frame Processor for Predictive Modeling](https://arxiv.org/abs/1611.09477)


# UPDATE 17.03.19

Thank you, Aditya Soni, for pointing out the data leakage in the impact_coding() function. This pushed me to read more about it, and I found this paper, which explains the issue in more detail:

[vtreat: a data.frame Processor for Predictive Modeling](https://arxiv.org/abs/1611.09477)



Now I see that the main purpose of the cross-validation loops in the impact_coding() function was to compute the mean for a given chunk of data based on the rest of the data, thus ensuring that we don't simply memorise the data (i.e. leak the target variable into the encoding of the categorical variable for that fold).

Encoding a categorical variable on the whole dataset, as encode_target_smooth() does, and then fitting the model on that same data causes the very leakage that the cross-validation technique aims to prevent.


# I Motivation
Maybe it's this time of year, but I've suddenly felt an urge to come to Kaggle for glory and fame and sat down to explore the present competition's dataset, only to find that a few categorical variables have pretty high cardinality.

In [None]:
import numpy as np 
import pandas as pd
import os

import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
import seaborn as sns

rcParams['figure.figsize'] = (8, 5)

train = pd.read_csv('../input/train/train.csv')
test  = pd.read_csv('../input/test/test.csv')
train.columns = train.columns.str.lower()
test.columns = test.columns.str.lower()

cat_vars = ['type', 'breed1', 'breed2', 'gender', 'color1', 'color2',
            'color3', 'vaccinated', 'dewormed', 'sterilized', 'health', 'state']

# no missing values in categorical variables
assert not train[cat_vars].isnull().any().any()
assert not test[cat_vars].isnull().any().any()

train[cat_vars].nunique()

One-hot encoding has been my solution before, but this time around I stopped and pondered: there must be a better way to handle it.



In [None]:
def encode_onehot(train, test, categ_variables):
    # encode train and test together so that both get the same dummy columns
    df_onehot = pd.get_dummies(pd.concat([train[categ_variables], test[categ_variables]]).astype(str))
    return df_onehot[:len(train)], df_onehot[len(train):]

train_onehot, test_onehot = encode_onehot(train, test, cat_vars)
train_onehot.shape

I've read a few articles on common encoding techniques, one of which caught my attention: **mean encoding (aka target, aka likelihood, aka impact)**, whereby each distinct value of a categorical variable is replaced with the average value of the target variable we're trying to predict.

After some googling and reading through amazing Kaggle kernels and forums, I've realized that: 

*  mean encoding as it is has high risk of overfitting, so some kind of regularisation is required
* there are several ways of doing this regularisation

Dissecting their code and applying it to a single dataset led to this kernel.

# II Main part

## 1. What is mean encoding?

Treating **adoptionspeed** as a continuous variable for simplicity, each unique value of **type**, **breed1** etc. will be replaced with the average **adoptionspeed** over the rows that share that value. For example:

In [None]:
encod_type = train.groupby('type')['adoptionspeed'].mean()
print(encod_type)
train.loc[:, 'type_mean_enc'] = train['type'].map(encod_type)
train[['type','type_mean_enc']].head()

## 2. What causes overfitting?
~~For many unique values of such  high-cardinality variable as **breed1**, we have only a handful available observations~~

Learning the mean encoding on the same data that we use to train the model. This is similar to tuning model parameters on the whole dataset.

As the authors of the **vtreat** package point out:

*Care must be taken when impact coding variables – or when using nested models in general, for example in model stacking or superlearning (van der Laan, Polley, and Hubbard (2007)): **the data used to do the impact coding should not be the same as the data used to fit the overall model**. This is because the impact coding (or the base models in superlearning) are relatively complex, high-degree-of-freedom models masquerading as low-degree-of-freedom single variables. As such, they may not be handled appropriately by downstream machine learning algorithms. In the case of impact-coded, high-cardinality categorical variables, the resulting impact coding may memorize patterns in the training data, making the variable appear more statistically significant than it really is to downstream modeling algorithms.* [p.13](https://arxiv.org/abs/1611.09477)
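To convince myself of the point made in the quote, here is a minimal sketch on made-up data (the names `toy` and `noise_cat` are mine and have nothing to do with the competition set). The categorical feature is pure noise, yet the naive in-sample mean encoding looks informative, while an encoding computed from other rows only does not:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_obs, n_levels = 2000, 500

# A high-cardinality feature that carries no information about the target
toy = pd.DataFrame({
    'noise_cat': rng.integers(0, n_levels, size=n_obs),
    'y': rng.normal(size=n_obs),
})

# Naive encoding: category means learned on the very rows they are applied to
naive_enc = toy['noise_cat'].map(toy.groupby('noise_cat')['y'].mean())

# Out-of-fold encoding: each row is encoded with means learned on the other folds
oof_enc = pd.Series(np.nan, index=toy.index, dtype=float)
for fit_idx, enc_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(toy):
    fit_part = toy.iloc[fit_idx]
    fold_means = fit_part.groupby('noise_cat')['y'].mean()
    values = toy['noise_cat'].iloc[enc_idx].map(fold_means)
    # categories unseen in the fitting folds fall back to the in-fold mean
    oof_enc.iloc[enc_idx] = values.fillna(fit_part['y'].mean()).to_numpy()

corr_naive = np.corrcoef(naive_enc, toy['y'])[0, 1]
corr_oof = np.corrcoef(oof_enc, toy['y'])[0, 1]
print(f'naive: {corr_naive:.2f}, out-of-fold: {corr_oof:.2f}')
```

With only a handful of rows per level, the naive encoding is mostly a noisy copy of each row's own target, which is exactly the memorisation the authors warn about.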

In [None]:
(train.groupby('breed1').size() / len(train)).nlargest(10)

~~E.g. more than half of pets in **train** were of either 307-th or 266-th breed, which means that for majority of **breed1** values we'll have only a handful of observations to estimate the mean. Memorising the training dataset in such a way is the definition of **overfitting**. To make use of our mean encoding of **breed1** nonetheless, we can employ **regularisation**~~

## 3. What regularisations are out there?

There seem to be two schools of thought on how to incorporate regularisation into mean encoding:
* additive smoothing
* cross-validation



### 3.1 Additive smoothing
**Main idea**: if there are few observations for a unique value of the categorical variable that we want to encode, rely more on the global average of the target variable.

The simplest version is implemented below and follows https://maxhalford.github.io/blog/target-encoding-done-the-right-way/

**Question: should it be applied on disjoint sets of data to avoid target leakage?**

In [None]:
def encode_target_smooth(data, target, categ_variables, smooth):
    """    
    Apply target encoding with smoothing.
    
    Parameters
    ----------
    data: pd.DataFrame
    target: str, dependent variable
    categ_variables: list of str, variables to encode
    smooth: int, number of observations to weigh global average with
    
    Returns
    --------
    encoded_dataset: pd.DataFrame
    code_map: dict, mapping to be used on validation/test datasets 
    default_map: dict, mapping to replace previously unseen values with
    """
    train_target = data.copy()
    code_map = dict()    # stores mapping between original and encoded values
    default_map = dict() # stores global average of each variable
    
    for v in categ_variables:
        prior = data[target].mean()
        n = data.groupby(v).size()
        mu = data.groupby(v)[target].mean()
        mu_smoothed = (n * mu + smooth * prior) / (n + smooth)
        
        train_target.loc[:, v] = train_target[v].map(mu_smoothed)        
        code_map[v] = mu_smoothed
        default_map[v] = prior        
    return train_target, code_map, default_map

If a value of the categorical variable has fewer observations *n* than *smooth*, the weighted average is dominated by the global average; as *n* grows well beyond *smooth*, the estimate approaches the category's own mean.
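A quick worked example of the same formula with made-up numbers (the prior, category mean and counts below are hypothetical, chosen only to show the two regimes):

```python
def smoothed_mean(n, mu, prior, smooth):
    """Blend a category mean with the global mean, giving the latter `smooth` pseudo-observations."""
    return (n * mu + smooth * prior) / (n + smooth)

prior = 2.5   # hypothetical global mean of the target
mu = 3.0      # hypothetical mean of the target within one category

rare = smoothed_mean(n=10, mu=mu, prior=prior, smooth=500)      # n much smaller than smooth
common = smoothed_mean(n=5000, mu=mu, prior=prior, smooth=500)  # n much larger than smooth
print(round(rare, 3), round(common, 3))
```

The rare category ends up almost at the prior, while the common one stays close to its own mean.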

The most popular Kernel that I've found implements a more elaborate version: https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features

It refers to  this  paper: https://dl.acm.org/citation.cfm?id=507538 Direct link: http://helios.mm.di.uoa.gr/~rouvas/ssi/sigkdd/sigkdd.vol3.1/barreca.ps
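As far as I understand that kernel and the paper, the fixed `smooth` weight is replaced by a sigmoid of the category count, so that the switch from global mean to category mean happens around a chosen count. Below is a rough sketch of that idea, not a copy of the kernel: the function name and the parameters `min_samples` and `steepness` are my own labels.

```python
import numpy as np
import pandas as pd

def encode_sigmoid_blend(cat, y, min_samples=20, steepness=10):
    """Blend category means with the global mean using a sigmoid weight on the category count."""
    stats = y.groupby(cat).agg(['mean', 'count'])
    prior = y.mean()
    # weight -> 0 for counts far below min_samples, -> 1 for counts far above it
    weight = 1 / (1 + np.exp(-(stats['count'] - min_samples) / steepness))
    blended = prior * (1 - weight) + stats['mean'] * weight
    return cat.map(blended), blended

# toy data: 'a' is frequent, 'b' has only two observations
toy_cat = pd.Series(['a'] * 100 + ['b'] * 2)
toy_y = pd.Series([1.0] * 100 + [0.0] * 2)
encoded, mapping = encode_sigmoid_blend(toy_cat, toy_y)
print(mapping)
```

The frequent category keeps its own mean almost unchanged, while the rare one is pulled most of the way towards the global mean.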


In [None]:
train_target_smooth, target_map, default_map = encode_target_smooth(train, 'adoptionspeed', cat_vars, 500)
test_target_smooth = test.copy()
for v in cat_vars:
    # categories unseen in train fall back to the global mean
    test_target_smooth.loc[:, v] = test_target_smooth[v].map(target_map[v]).fillna(default_map[v])

In [None]:
train_target_smooth[cat_vars].head()

### 3.2 Cross-validation
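**Main idea**: encode each held-out fold with means estimated on the remaining folds, so that no row's encoding is computed from its own target value. Averaging over several folds also introduces some variability into the encoding estimates.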

Source: https://www.kaggle.com/tnarik/likelihood-encoding-of-categorical-features, which seems to have been motivated by this discussion: https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/36136#201638

The scheme below does the following:
* splits the dataset into 20 folds,
* for each held-out fold, estimates the mean encoding from a further 10-fold split of the remaining (training) folds.
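Before the nested version, the core of the idea can be sketched with a single level of folds. This is a simplified illustration, not the scheme from the source kernel; the function name `encode_target_oof` and the toy data are my own.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def encode_target_oof(data, feature, target, n_folds=5, seed=0):
    """Encode `feature` so that every row only sees target means from the other folds."""
    encoded = pd.Series(np.nan, index=data.index, dtype=float)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(data):
        fit_part = data.iloc[fit_idx]
        fold_means = fit_part.groupby(feature)[target].mean()
        values = data[feature].iloc[enc_idx].map(fold_means)
        # categories unseen in the fitting folds fall back to the in-fold mean
        encoded.iloc[enc_idx] = values.fillna(fit_part[target].mean()).to_numpy()
    return encoded

toy = pd.DataFrame({'cat': list('aabbbccccc') * 10,
                    'y':   [0, 1, 1, 1, 0, 0, 0, 1, 0, 0] * 10})
toy['cat_oof'] = encode_target_oof(toy, 'cat', 'y')
print(toy.groupby('cat')['cat_oof'].agg(['mean', 'std']))
```

Unlike a plain groupby mean, the same category can receive slightly different values in different folds, which is where the extra variability comes from.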

In [None]:
def impact_coding_leak(data, feature, target, n_folds=20, n_inner_folds=10):
    '''
    ! Using oof_default_mean for encoding inner folds introduces leak.
    
    Source: https://www.kaggle.com/tnarik/likelihood-encoding-of-categorical-features
    
    Changelog:    
    a) Replaced KFold with StratifiedKFold due to class imbalance
    b) Rewrote .apply() with .map() for readability
    c) Removed redundant apply in the inner loop
    '''
    from sklearn.model_selection import StratifiedKFold
    impact_coded = pd.Series(dtype=float)
    
    oof_default_mean = data[target].mean() # Global mean to use by default (you could further tune this)
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True) # KFold in the original
    oof_mean_cv = pd.DataFrame()
    split = 0
    for infold, oof in kf.split(data[feature], data[target]):

        kf_inner = StratifiedKFold(n_splits=n_inner_folds, shuffle=True)
        inner_split = 0
        inner_oof_mean_cv = pd.DataFrame()
        oof_default_inner_mean = data.iloc[infold][target].mean()
        
        for infold_inner, oof_inner in kf_inner.split(data.iloc[infold], data.iloc[infold][target]):
            # kf_inner yields positions relative to data.iloc[infold]; map them back to rows of data
            # The mean to apply to the inner oof split (a 1/n_folds % based on the rest)
            oof_mean = data.iloc[infold[infold_inner]].groupby(by=feature)[target].mean()

            # Also populate mapping (this has all group -> mean for all inner CV folds)
            inner_oof_mean_cv = inner_oof_mean_cv.join(pd.DataFrame(oof_mean), rsuffix=inner_split, how='outer')
            inner_oof_mean_cv.fillna(value=oof_default_inner_mean, inplace=True)
            inner_split += 1

        # compute mean for each value of categorical value across oof iterations
        inner_oof_mean_cv_map = inner_oof_mean_cv.mean(axis=1)

        # Also populate mapping
        oof_mean_cv = oof_mean_cv.join(pd.DataFrame(inner_oof_mean_cv), rsuffix=split, how='outer')
        oof_mean_cv.fillna(value=oof_default_mean, inplace=True)
        split += 1

        feature_mean = data.loc[oof, feature].map(inner_oof_mean_cv_map).fillna(oof_default_mean)
        impact_coded = pd.concat([impact_coded, feature_mean])  # Series.append was removed in pandas 2.0
            
    return impact_coded, oof_mean_cv.mean(axis=1), oof_default_mean


def impact_coding(data, feature, target, n_folds=20, n_inner_folds=10):
    '''
    Variant of impact_coding_leak() without the global-mean leak: categories missing
    from a fold are filled with the in-fold mean instead of the global mean.
    
    Source: https://www.kaggle.com/tnarik/likelihood-encoding-of-categorical-features
    
    Changelog:    
    a) Replaced KFold with StratifiedKFold due to class imbalance
    b) Rewrote .apply() with .map() for readability
    c) Removed redundant apply in the inner loop
    d) Removed global average; use local mean to fill NaN values in out-of-fold set
    '''
    from sklearn.model_selection import StratifiedKFold
    impact_coded = pd.Series(dtype=float)
        
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True) # KFold in the original
    oof_mean_cv = pd.DataFrame()
    split = 0
    for infold, oof in kf.split(data[feature], data[target]):

        kf_inner = StratifiedKFold(n_splits=n_inner_folds, shuffle=True)
        inner_split = 0
        inner_oof_mean_cv = pd.DataFrame()
        oof_default_inner_mean = data.iloc[infold][target].mean()
        
        for infold_inner, oof_inner in kf_inner.split(data.iloc[infold], data.iloc[infold][target]):
            # kf_inner yields positions relative to data.iloc[infold]; map them back to rows of data
            # The mean to apply to the inner oof split (a 1/n_folds % based on the rest)
            oof_mean = data.iloc[infold[infold_inner]].groupby(by=feature)[target].mean()
            
            # Also populate mapping (this has all group -> mean for all inner CV folds)
            inner_oof_mean_cv = inner_oof_mean_cv.join(pd.DataFrame(oof_mean), rsuffix=inner_split, how='outer')
            inner_oof_mean_cv.fillna(value=oof_default_inner_mean, inplace=True)
            inner_split += 1

        # compute mean for each value of categorical value across oof iterations
        inner_oof_mean_cv_map = inner_oof_mean_cv.mean(axis=1)

        # Also populate mapping
        oof_mean_cv = oof_mean_cv.join(pd.DataFrame(inner_oof_mean_cv), rsuffix=split, how='outer')
        oof_mean_cv.fillna(value=oof_default_inner_mean, inplace=True) # <- local mean as default
        split += 1

        feature_mean = data.loc[oof, feature].map(inner_oof_mean_cv_map).fillna(oof_default_inner_mean)
        impact_coded = pd.concat([impact_coded, feature_mean])  # Series.append was removed in pandas 2.0
    
    oof_default_mean = data[target].mean() # Global mean to use by default (you could further tune this)
    return impact_coded, oof_mean_cv.mean(axis=1), oof_default_mean

def encode_target_cv(data, target, categ_variables, impact_coder=impact_coding):
    """Apply original function for each <categ_variables> in  <data>
    Reduced number of validation folds
    """
    train_target = data.copy() 
    
    code_map = dict()
    default_map = dict()
    for f in categ_variables:
        train_target.loc[:, f], code_map[f], default_map[f] = impact_coder(train_target, f, target)
        
    return train_target, code_map, default_map

In [None]:
train_target_cv, code_map, default_map = encode_target_cv(train, 'adoptionspeed', cat_vars, impact_coder=impact_coding)

In [None]:
train_target_cv[cat_vars].head()

## 4. How close are the results of the two regularisation methods?

Let's see how correlated the results of additive smoothing and cross-validation are.

In [None]:
corr_map = dict()
for v in cat_vars:
    corr_map[v] = np.corrcoef(train_target_cv[v], train_target_smooth[v])[0, 1]    
reg_correl = pd.Series(corr_map)

num_categories = train[cat_vars].nunique()

In [None]:
reg_correl.plot(kind='barh', color='green', alpha=0.3)
_ = plt.title('Correlation between mean-encoded-variables\n using smoothing and CV', fontsize=16)

For the majority of categorical variables, the results are pretty close. They differ most for high-cardinality variables like **breed1** and **breed2**.

In [None]:
fig = plt.figure(figsize=(10, 5))
_ = sns.kdeplot(train_target_smooth['breed1'], label='simple smoothing')
_ = sns.kdeplot(train_target_cv['breed1'], label='cross-validation')
_ = plt.title('Cross-validation regularisation introduced more variation than simple smoothing')

Surprisingly, the techniques deviate even for **health** and **state**, which have a relatively low number of unique values.

In [None]:
train[cat_vars].nunique().plot(kind='barh')
_ = plt.title('Number of unique categories')

In [None]:
fig, ax = plt.subplots()
_ = ax.scatter(num_categories, reg_correl)
_ = ax.set_xlabel('Number of unique categories in a variable', fontsize=14)
_ = ax.set_ylabel('Correlation between 2 regularisations', fontsize=14)
for i, txt in enumerate(num_categories.index):
    # use positional access; integer keys on a string-indexed Series are deprecated
    ax.annotate(txt, (num_categories.iloc[i], reg_correl.iloc[i]))

On second thought, it makes sense: **what matters is not only how many unique categories are in the feature, but how evenly the data is distributed across them.**

In [None]:
train.groupby('health').size()

In [None]:
def get_categor_spread(data, categ_variables):
    spread = dict()
    for v in categ_variables:
        dist = data.groupby(v).size()
        spread[v] = dist.max() / dist.min() / len(data)
    return spread

In [None]:
spread = pd.Series(get_categor_spread(train, cat_vars))
spread.plot(kind='barh')
_ = plt.title('Larger spread indicates bigger difference between value\n with highest and lowest number of observations')

In [None]:
fig, ax = plt.subplots()
_ = ax.scatter(spread, reg_correl)
_ = ax.set_xlabel('Spread of observations across categories', fontsize=14)
_ = ax.set_ylabel('Correlation between 2 regularisations', fontsize=14)
# label each point at the coordinates that were actually plotted (spread, not cardinality)
for i, txt in enumerate(spread.index):
    ax.annotate(txt, (spread.iloc[i], reg_correl.iloc[i]))

## 5. Do they actually improve predictions?
* Compare to one-hot encoding
* Use GBM


In [None]:
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.metrics import cohen_kappa_score, make_scorer

gbc = GradientBoostingClassifier(n_estimators=10, random_state=20190301)

skf = StratifiedKFold(n_splits=10)  # no shuffling, so the split is deterministic; random_state requires shuffle=True

kappa_scorer = make_scorer(cohen_kappa_score, weights='quadratic')

In [None]:
def cross_validate_encoder(X, target, categorical_vars, encoder,
                           model, n_splits=10, **kwargs):
    """Evaluate performance of encoding categorical variables with <encoder> by fitting 
    <model> and measuring quadratic-weighted kappa on <n_splits>-fold cross validation.
    
    The encoding is learned on the training (in-fold) part only and then mapped onto the held-out fold.
            
    Parameters
    ----------
    X: pd.DataFrame, train data
    target: str, response variable
    categorical_vars: list of str, categorical variables to encode
    encoder: custom function to apply
    model: sklearn model to fit
    n_splits: number of cross-validation folds
    **kwargs: key-word arguments to encoder
    
    Returns
    ----------
    metric_cvs: np.array of float, metrics computed on the held-out fold
    """     
    skf = StratifiedKFold(n_splits=n_splits)  # deterministic without shuffling; random_state requires shuffle=True
    metric_cvs = list()
    
    for fold_idx, val_idx in skf.split(X=X, y=X[target]):
        train_fold, valid_fold = X.loc[fold_idx].reset_index(drop=True), \
                                 X.loc[val_idx].reset_index(drop=True)
        
        # apply encoding to k-th fold and validation set 
        train_fold, code_map, default_map = encoder(train_fold, target, categorical_vars, **kwargs)
        for v in categorical_vars:
            valid_fold.loc[:, v] = valid_fold[v].map(code_map[v]).fillna(default_map[v])
        
        # fit model on training fold
        model.fit(train_fold[categorical_vars], train_fold[target])
        # predict out-of-fold
        oof_pred = model.predict(valid_fold[categorical_vars])
        metric_cvs.append(cohen_kappa_score(valid_fold[target], oof_pred, weights='quadratic'))        
    return np.array(metric_cvs)

In [None]:
print('Evaluating one-hot... ')
score_onehot_gbc = cross_val_score(estimator=gbc, X=train_onehot, y=train.adoptionspeed,
                                   cv=skf, scoring=kappa_scorer)

print('Evaluating mean encoding with smoothing... ')
score_target_smooth_gbc = cross_validate_encoder(train, 'adoptionspeed', cat_vars, 
                                                 encode_target_smooth, gbc, smooth=500)

print('Evaluating mean encoding with cross-validation...')
score_target_cv_gbc = cross_validate_encoder(train, 'adoptionspeed', cat_vars, 
                                             encode_target_cv, gbc)

In [None]:
summary_gb = pd.DataFrame({'kappa_cv10_mean': [score_onehot_gbc.mean(), 
                                               score_target_smooth_gbc.mean(), 
                                               score_target_cv_gbc.mean()],
                           'kappa_cv10_std': [score_onehot_gbc.std(), 
                                              score_target_smooth_gbc.std(),
                                              score_target_cv_gbc.std()]},
                          index=['GradientBoosting - one hot', 
                                 'GradientBoosting - mean (smoothing)',
                                 'GradientBoosting - mean (cross-validation)']
                         )

summary_gb

In [None]:
summary_gb['kappa_cv10_mean'].plot(kind='barh', color='lightblue', xerr=summary_gb.kappa_cv10_std, ecolor='red')
_ = plt.xlabel('10-CV kappa average', fontsize=14)
_ = plt.title('Comparison of encoding schemes', fontsize=16, y=1.01)

## Conclusion
(At least for GBM)

* Mean encoding indeed seems to work better than one-hot encoding,
* CV regularisation seems to out-perform smoothing

# 6. Verify encodings on synthetic dataset
[Reference, p.14](https://arxiv.org/abs/1611.09477)

Generate 4 categorical variables with 500 unique labels each; the target variable depends on only two of them.

In [None]:
from sklearn.metrics import roc_auc_score

In [None]:
def generate_data(n_obs=10000, n_lev=500, variance=1):
    np.random.seed(20190325)
    n_vars = 4
    df = np.empty([n_obs, n_vars])
    for i in range(n_vars):
        df[:, i] = np.random.choice(np.arange(n_lev), size=n_obs, replace=True)
    df = pd.DataFrame(df).astype(int)
    cat_cols = ['x_bad1', 'x_bad2', 'x_good1', 'x_good2']
    df.columns = cat_cols
    
    # y depends only on x_good1 and x_good2
    y = (0.2 * np.random.normal(size=n_obs) 
         + 0.5 * np.where(df.x_good1 > n_lev / 2, 1, -1) 
         + 0.3 * np.where(df.x_good2 > n_lev / 2, 1, -1)
         + np.random.normal(scale=variance, size=n_obs)
        )
    df.loc[:, 'y'] = y > 0
    df.loc[:, 'split_group'] = np.random.choice(('cal','train','test'), 
                                                size=n_obs, 
                                                replace=True, 
                                                p=(0.6, 0.2, 0.2))
    
    # replace the integer columns with string labels (plain assignment avoids an in-place dtype change)
    df[cat_cols] = df[cat_cols].astype(str) + '_level'
    return df

In [None]:
df = generate_data()
df.head()

In [None]:
df_train = df.loc[df.split_group!='test'].reset_index(drop=True)
df_test = df.loc[df.split_group=='test'].reset_index(drop=True)
cat_cols = ['x_bad1', 'x_bad2', 'x_good1', 'x_good2']

## 1) One-hot

In [None]:
def encode_onehot(train, test, categ_variables):
    df_onehot = pd.get_dummies(pd.concat([train[categ_variables], test[categ_variables]]).astype(str))
    return df_onehot[:len(train)], df_onehot[len(train):]

train_onehot, test_onehot = encode_onehot(df_train, df_test, cat_cols)
train_onehot.shape

In [None]:
gbc = GradientBoostingClassifier(n_estimators=10, random_state=20190325)
gbc.fit(train_onehot, df_train['y'])
print(gbc.classes_)

# train
print(roc_auc_score(df_train['y'], gbc.predict_proba(train_onehot)[:, 1])) # taking True class
# test
print(roc_auc_score(df_test['y'], gbc.predict_proba(test_onehot)[:, 1]))

# most important features
pd.Series(index=train_onehot.columns, data=gbc.feature_importances_).nlargest(10)

The model correctly picked the important feature; however, the AUC is quite poor.

## 2) Target encoding

### Naive

In [None]:
def encode_target_naive(train, test):
    df_train_naive = train.copy()
    df_test_naive = test.copy()
    
    # learn the encoding on the <train> argument rather than on the global df_train
    default = train['y'].mean()
    for v in cat_cols:    
        encod_map = train.groupby(v)['y'].mean()
        df_train_naive.loc[:, v] = df_train_naive[v].map(encod_map).fillna(default)
        df_test_naive.loc[:, v] = df_test_naive[v].map(encod_map).fillna(default)
    return df_train_naive, df_test_naive

In [None]:
df_train_naive, df_test_naive = encode_target_naive(df_train, df_test)

In [None]:
gbc = GradientBoostingClassifier(random_state=20190325, n_estimators=10)
gbc.fit(df_train_naive[cat_cols], df_train_naive['y'])
pd.Series(gbc.feature_importances_, index=cat_cols)

In [None]:
# train
roc_auc_score(y_true=df_train_naive['y'], 
              y_score = gbc.predict_proba(df_train_naive[cat_cols])[:, 1])

In [None]:
# test
roc_auc_score(y_true=df_test_naive['y'], 
              y_score = gbc.predict_proba(df_test_naive[cat_cols])[:, 1])

### Check default number of trees

In [None]:
gbc = GradientBoostingClassifier(random_state=20190325)
gbc.fit(df_train_naive[cat_cols], df_train_naive['y'])
pd.Series(gbc.feature_importances_, index=cat_cols)

In [None]:
# train
roc_auc_score(y_true=df_train_naive['y'], 
              y_score = gbc.predict_proba(df_train_naive[cat_cols])[:, 1])

In [None]:
# test
roc_auc_score(y_true=df_test_naive['y'], 
              y_score = gbc.predict_proba(df_test_naive[cat_cols])[:, 1])

It starts to overfit with more estimators.

## With regularisations

In [None]:
from sklearn.metrics import roc_auc_score
def cross_validate_encoder(X, target, categorical_vars, encoder,
                           model, n_splits=10, **kwargs):
    """Evaluate performance of encoding categorical variables with <encoder> by fitting 
    <model> and measuring ROC AUC on <n_splits>-fold cross validation.
    
    The encoding is learned on the training (in-fold) part only and then mapped onto the held-out fold.
            
    Parameters
    ----------
    X: pd.DataFrame, train data
    target: str, response variable
    categorical_vars: list of str, categorical variables to encode
    encoder: custom function to apply
    model: sklearn model to fit
    n_splits: number of cross-validation folds
    **kwargs: key-word arguments to encoder
    
    Returns
    ----------
    metric_cvs: np.array of float, metrics computed on the held-out fold
    """     
    skf = StratifiedKFold(n_splits=n_splits)  # deterministic without shuffling; random_state requires shuffle=True
    metric_cvs = list()
    
    for fold_idx, val_idx in skf.split(X=X, y=X[target]):
        train_fold, valid_fold = X.loc[fold_idx].reset_index(drop=True), \
                                 X.loc[val_idx].reset_index(drop=True)
        
        # apply encoding to k-th fold and validation set 
        train_fold, code_map, default_map = encoder(train_fold, target, categorical_vars, **kwargs)
        for v in categorical_vars:
            valid_fold.loc[:, v] = valid_fold[v].map(code_map[v]).fillna(default_map[v])
        
        # fit model on training fold
        model.fit(train_fold[categorical_vars], train_fold[target])
        # predict out-of-fold
        oof_pred = model.predict_proba(valid_fold[categorical_vars])[:, 1]
        metric_cvs.append(roc_auc_score(valid_fold[target], oof_pred))
    return np.array(metric_cvs)

In [None]:
gbc = GradientBoostingClassifier(n_estimators=10, random_state=20190325)

print('Evaluating mean encoding with smoothing... ')
score_target_smooth_gbc = cross_validate_encoder(df_train, 'y', cat_cols, encode_target_smooth, gbc, smooth=500)

print('Evaluating mean encoding with cross-validation...')
score_target_cv_gbc = cross_validate_encoder(df_train, 'y', cat_cols, encode_target_cv, gbc)

In [None]:
score_target_smooth_gbc.mean(), score_target_cv_gbc.mean()

**Observations**
* the model trained on target encoding has a much higher AUC than the one trained on one-hot encoding
* the hold-out AUC of the cross-validated and smoothed encodings is very similar, although the smoothed encoding was learned on the same data the model was trained on, which poses a higher risk of overfitting

## 3) Verify on whole train/test sets + examine feature importance

In [None]:
gbc = GradientBoostingClassifier(random_state=20190325, n_estimators=10)
df_train_smooth, code_map, default_map = encode_target_smooth(df_train, 'y', cat_cols, smooth=500)

df_test_smooth = df_test.copy()
for v in cat_cols:
    df_test_smooth.loc[:, v] = df_test_smooth[v].map(code_map[v]).fillna(default_map[v])
    
gbc.fit(df_train_smooth[cat_cols], df_train_smooth['y'])

In [None]:
pd.Series(gbc.feature_importances_, index=cat_cols)

In [None]:
# train
roc_auc_score(y_true=df_train_smooth['y'], 
              y_score = gbc.predict_proba(df_train_smooth[cat_cols])[:, 1])

In [None]:
# test
roc_auc_score(y_true=df_test_smooth['y'], y_score = gbc.predict_proba(df_test_smooth[cat_cols])[:, 1])

* It correctly picks the x_good1 feature
* AUC is much higher than with one-hot encoding

In [None]:
gbc = GradientBoostingClassifier(random_state=20190325, n_estimators=10)
df_train_cv, code_map, default_map = encode_target_cv(df_train, 'y', cat_cols)

df_test_cv = df_test.copy()
for v in cat_cols:
    df_test_cv.loc[:, v] = df_test_cv[v].map(code_map[v]).fillna(default_map[v])

gbc.fit(df_train_cv[cat_cols], df_train_cv['y'])    

In [None]:
pd.Series(gbc.feature_importances_, index=cat_cols)

In [None]:
roc_auc_score(y_true=df_train_cv['y'], y_score = gbc.predict_proba(df_train_cv[cat_cols])[:, 1])

In [None]:
roc_auc_score(y_true=df_test_cv['y'], y_score = gbc.predict_proba(df_test_cv[cat_cols])[:, 1])

Surprisingly, mean encoding with smoothing performs very close to the cross-validated one.

# III Further reading

0. More on comparing Laplace smoothing and cross-validation schemes: http://www.win-vector.com/blog/2016/11/laplace-noising-versus-simulated-out-of-sample-methods-cross-frames/?relatedposts_hit=1&relatedposts_origin=5231&relatedposts_position=1
1. Incorporating mean encoding into grand cross-validation: https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/discussion/44987
2. Comprehensive study of mean encoding: https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study

# IV References
## Additive Smoothing
* https://www.kaggle.com/ogrellier/python-target-encoding-for-categorical-features
* https://maxhalford.github.io/blog/target-encoding-done-the-right-way/

    
## Cross Validation
* https://www.kaggle.com/tnarik/likelihood-encoding-of-categorical-features
* https://www.kaggle.com/c/mercedes-benz-greener-manufacturing/discussion/36136#201638    