# Table of Contents
1. [Introduction](#introduction)
2. [Sources](#sources)
3. [Short Guideline](#short_guidelines)
4. [Notation](#notation)
5. [Import Dependencies and Data Loading](#import)
6. [Encoders](#encoders)
    1. [Label Encoder / Ordinal Encoder](#labelencoder)
    2. [One-Hot Encoder / Binary Encoder](#onehotencoder)
    3. [Sum Encoder](#sumencoder)
    4. [Helmert Encoder](#helmertencoder)
    5. [Frequency Encoder](#freqencoder)
    6. [Target Encoder / Mean Encoder](#targetencoder)
    7. [Leave-one-out Encoder](#looencoder)
    8. [M-Estimate Encoder](#meestimateencoder)
    9. [Weight Of Evidence Encoder](#woeencoder)
    10. [Probability Ratio Encoding](#probabilityratioencoder)
    11. [James-Stein Encoder](#jamessteinencoder) 
    12. [Catboost Encoder](#catboostencoder) 
    13. [Hashing Encoding](#hashingencoder)
    14. [Generalized Linear Mixed Model (GLMM)](#glmmencoder)
7. [FAQ](#faq)
8. [Conclusion](#conclusion)
9. [Validation](#validation)
10. [Output](#output)

# Introduction <a name="introduction"></a>

Recently, an article about the mRMR method for feature selection on
[Towards Data Science](https://towardsdatascience.com/mrmr-explained-exactly-how-you-wished-someone-explained-to-you-9cf4ed27458b) caught my attention. I started to investigate the [git repo from smazzanti](https://github.com/smazzanti/mrmr), which can
be installed with
``` python
pip install mrmr_selection
```

Because it can process even categorical variables, I was curious which method it uses. The source code relies on 3 categorical encoders from the `category-encoders` package: `Leave One Out`, `James-Stein`, and `Target-Encoder`. The feature selection itself worked quite
well on my work dataset, which I know inside out.

I created this notebook to serve both me and you as a summary of, and hands-on experience with, all the different kinds of encoders. There are quite a number of encoders I had never heard of, and you may not have either. Some of them are products of Kaggle challenges.

However, keep in mind that:
- There is no free lunch. You have to try several encoders to find the one that suits your data.
- Based on double validation, the most stable and accurate encoders so far seem to be the Catboost Encoder, James-Stein Encoder, and Target Encoder.
- Calling `encoder.fit_transform()` on the whole training set is a road to nowhere: Single Validation turned out to be a much better option than the commonly used None Validation. If you want a stable high score, Double Validation is your choice, but bear in mind that it takes much longer to train.
- Regularization is a **must** for target-based encoders.


# Sources  <a name="sources"></a>
I'll try to list all the resources I used for this tutorial. If I missed one, I apologize to
the author: there is a lot of information on the internet, and it is easy to forget a citation.

- [Benchmarking Categorical Encoders](https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8)

- [CategoricalEncodingBenchmark](https://github.com/DenisVorotyntsev/CategoricalEncodingBenchmark)

# Notation  <a name="notation"></a>

- $y$ and $y^+$: the total number of observations and the total number of positive observations (target equal to 1);
- $x_i$, $y_i$: the i-th value of the category and of the target;
- $n$ and $n^+$: the number of observations and the number of positive observations (target equal to 1) for a given value of a categorical column;
- $a$: a regularization hyperparameter (selected by the user);
- prior: the average value of the target.

### Example Dataset:

![example dataset](https://miro.medium.com/v2/resize:fit:628/format:webp/1*wl-W3sDYRWn3bjsymeUFhA.png)

- $y=10$, $y^+=5$;
- $x_i$="D", $y_i=1$ for the 9th line of the dataset (the last observation);
- For category B: $n=3$, $n^+=1$;
- prior $= y^+ / y = 5/10 = 0.5$.

# Import Dependencies and Data Loading  <a name="import"></a>

In [50]:
# http://contrib.scikit-learn.org/categorical-encoding/
# !pip install category-encoders

In [5]:
import os

import pandas as pd
import numpy as np
import warnings
import xgboost as xgb
import random

from category_encoders.ordinal import OrdinalEncoder
from category_encoders.woe import WOEEncoder
from category_encoders.target_encoder import TargetEncoder
from category_encoders.sum_coding import SumEncoder
from category_encoders.m_estimate import MEstimateEncoder
from category_encoders.leave_one_out import LeaveOneOutEncoder
from category_encoders.helmert import HelmertEncoder
from category_encoders.cat_boost import CatBoostEncoder
from category_encoders.james_stein import JamesSteinEncoder
from category_encoders.one_hot import OneHotEncoder
from category_encoders import HashingEncoder
from category_encoders import GLMMEncoder

warnings.filterwarnings('ignore')
TEST = True
SAVE_RESULTS = False

Read the CSV files and do some preprocessing.

In [3]:
%%time
train = pd.read_csv('./data/train.csv')
test = pd.read_csv('./data/test.csv')
target = train['target']
train_id = train['id']
test_id = test['id']
train.drop(['target', 'id'], axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

CPU times: total: 2.48 s
Wall time: 2.53 s


In [4]:
train.shape

(300000, 23)

In [5]:
target.shape

(300000,)

In [6]:
train.head()

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,T,Y,Green,Triangle,Snake,Finland,Bassoon,...,c389000ab,2f4cb3d51,2,Grandmaster,Cold,h,D,kr,2,2
1,0,1,0,T,Y,Green,Trapezoid,Hamster,Russia,Piano,...,4cd920251,f83c56c21,1,Grandmaster,Hot,a,A,bF,7,8
2,0,0,0,F,Y,Blue,Trapezoid,Lion,Russia,Theremin,...,de9c9f684,ae6800dd0,1,Expert,Lava Hot,h,R,Jc,7,2
3,0,1,0,F,Y,Red,Trapezoid,Snake,Canada,Oboe,...,4ade6ab69,8270f0d71,1,Grandmaster,Boiling Hot,i,D,kW,2,1
4,0,0,0,F,N,Red,Trapezoid,Lion,Canada,Oboe,...,cb43ab175,b164b72a7,1,Grandmaster,Freezing,a,R,qP,7,8


In [7]:
feature_list = list(train.columns) # you can customize this later.

# Encoders <a name="encoders"></a>

## 1. Label Encoder (LE), Ordinal Encoder (OE)  <a name="labelencoder"></a>

One of the most common encoding methods.

It converts categorical data into a column of numbers: a column with M categories becomes a single column containing M distinct integers.

The disadvantage is that the labels are numbered arbitrarily (for example, in the order in which they appear in the data),
which can add noise by imposing an unintended order on the labels. In other words, the data
becomes ordinal (ordered), which can lead to unintended consequences.

The difference between LabelEncoder and OrdinalEncoder is that LabelEncoder works on a 1D array of shape ```(n, )```, while OrdinalEncoder works on a 2D array of shape ``` (n_samples, m_columns)```. Historically, LabelEncoder was used to encode the target (hence the 1D shape).

The code is very simple, and when you encode a specific column you can proceed as follows:

``` python
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()

train[column_name] = label.fit_transform(train[column_name])
```

With OrdinalEncoder, you can also specify how particular column values are mapped via the [```mapping```](https://contrib.scikit-learn.org/category_encoders/ordinal.html) parameter.
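To see why the assigned order matters, here is a minimal, dependency-free sketch. It uses the five `ord_2` values from the `train.head()` output above; the ranking in `temperature_order` is my assumption about the intended order, not something stated in the data.

```python
# ord_2 values of the first five training rows (from train.head() above)
ord_2 = ["Cold", "Hot", "Lava Hot", "Boiling Hot", "Freezing"]

# Codes assigned in order of first appearance, starting at 1
first_seen = {}
for value in ord_2:
    first_seen.setdefault(value, len(first_seen) + 1)
by_appearance = [first_seen[v] for v in ord_2]

# Explicit mapping: assumed ranking from coldest to hottest (hypothetical)
temperature_order = {"Freezing": 1, "Cold": 2, "Hot": 3, "Boiling Hot": 4, "Lava Hot": 5}
by_meaning = [temperature_order[v] for v in ord_2]

print(by_appearance)
print(by_meaning)
```

With the appearance-based codes, "Freezing" ends up as the largest value even though it is the coldest level; the explicit mapping keeps the numeric order aligned with the meaning of the labels.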

In [8]:
%%time
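# Passing feature_list positionally would set `verbose`, not `cols`.
# With the default cols=None only the string (object) columns are encoded;
# numeric columns such as bin_0 or day pass through unchanged (see the output below).
LE_encoder = OrdinalEncoder()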
train_le = LE_encoder.fit_transform(train)
test_le = LE_encoder.transform(test)

CPU times: total: 7.27 s
Wall time: 7.29 s


In [9]:
train_le.shape

(300000, 23)

In [10]:
train_le

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,1,1,1,1,1,1,1,...,1,1,2,1,1,1,1,1,2,2
1,0,1,0,1,1,1,2,2,2,2,...,2,2,1,1,2,2,2,2,7,8
2,0,0,0,2,1,2,2,3,2,3,...,3,3,1,2,3,1,3,3,7,2
3,0,1,0,2,1,3,2,1,3,4,...,4,4,1,1,4,3,1,4,2,1
4,0,0,0,2,2,3,2,3,3,4,...,5,5,1,1,5,2,3,5,7,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,1,2,3,2,1,6,4,...,2110,5077,1,4,5,9,6,27,3,8
299996,0,0,0,2,1,1,2,3,2,2,...,1288,7421,2,3,5,1,21,150,3,2
299997,0,0,0,2,1,2,5,6,2,4,...,44,4958,3,3,4,13,2,75,7,9
299998,0,1,0,2,1,1,4,6,4,2,...,792,9383,1,5,4,1,21,146,3,8


## 2. One-Hot Encoder (OHE, dummy encoder)  <a name="onehotencoder"></a>


So what can you do to represent categories without imposing an order on them?

You can create a separate column for each category value. For M categories, OHE
creates M columns filled with zeroes, except for a 1 in the column that matches the row's category.

Since only one column per row is set to 1, this is called one-hot encoding. It is also
called dummy encoding, in the sense of creating dummy variables, and is sometimes loosely referred to as ```binary encoding``` (note that `BinaryEncoder` in `category_encoders` is a different scheme). You can drop
the first category (or the most frequent one) to reduce the dimensionality and avoid multicollinearity: some models cannot cope
with the full set of M columns.

We use pandas for this type of encoding, since sklearn ran into memory problems.

``` python
traintest = pd.concat([train, test])
dummies = pd.get_dummies(traintest, columns=traintest.columns, drop_first=True, sparse=True)
train_ohe = dummies.iloc[:train.shape[0], :]
test_ohe = dummies.iloc[train.shape[0]:, :]
train_ohe = train_ohe.sparse.to_coo().tocsr()
test_ohe = test_ohe.sparse.to_coo().tocsr()
```

With `Category-Encoders`, the code would look like this:

In [11]:
# %%time
# This method did not work because it ran out of RAM,
# so we have to use pd.get_dummies instead.
# OHE_encoder = OneHotEncoder(feature_list)
# train_ohe = OHE_encoder.fit_transform(train)
# test_ohe = OHE_encoder.transform(test)

In [12]:
%%time
traintest = pd.concat([train, test])
dummies = pd.get_dummies(traintest, columns=traintest.columns, drop_first=True, sparse=True)
train_ohe = dummies.iloc[:train.shape[0], :]
test_ohe = dummies.iloc[train.shape[0]:, :]
train_ohe = train_ohe.sparse.to_coo().tocsr()
test_ohe = test_ohe.sparse.to_coo().tocsr()

CPU times: total: 18.2 s
Wall time: 18.2 s


In [13]:
dummies.iloc[:train.shape[0], :]

Unnamed: 0,bin_0_1,bin_1_1,bin_2_1,bin_3_T,bin_4_Y,nom_0_Green,nom_0_Red,nom_1_Polygon,nom_1_Square,nom_1_Star,...,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
0,0,0,0,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,1,1,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
299996,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
299997,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
299998,0,1,0,0,1,1,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0


## 3. Sum Encoder (Deviation Encoder, Effect Encoder)  <a name="sumencoder"></a>

[**Sum Encoder**](https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8) compares the mean of the dependent variable (target) for a given level of a categorical column to the overall mean of the target. Sum Encoding is very similar to OHE, and both are commonly used in Linear Regression (LR) types of models. The difference between them lies in the interpretation of the LR coefficients: in an OHE model the intercept represents the mean for the baseline condition and the coefficients represent simple effects (the difference between one particular condition and the baseline), whereas in a Sum Encoder model the intercept represents the grand mean (across all conditions) and the coefficients can be interpreted directly as the main effects.

![sum encoding](https://miro.medium.com/v2/resize:fit:720/format:webp/1*6jGLBSyu93bpKs7xnnDD2g.png)
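As a rough illustration of what sum (deviation) coding produces, the sketch below builds the contrast rows by hand for three `nom_0` colours. Treating the last level as the reference is a convention I assume here, and a library may additionally emit an intercept column.

```python
def sum_contrast(levels):
    """Return {level: row} for sum (deviation) coding with M-1 columns."""
    m = len(levels)
    rows = {}
    for i, level in enumerate(levels):
        if i < m - 1:
            # level i gets a 1 in its own column and 0 elsewhere
            rows[level] = [1 if j == i else 0 for j in range(m - 1)]
        else:
            # the last level is the reference and is coded -1 in every column
            rows[level] = [-1] * (m - 1)
    return rows

coding = sum_contrast(["Red", "Green", "Blue"])
# every contrast column sums to zero across the levels
column_sums = [sum(col) for col in zip(*coding.values())]
print(coding, column_sums)
```

Because each column sums to zero over the levels, the regression intercept is no longer tied to one baseline category, which is what allows it to be read as a grand mean.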

With `Category-Encoders`, the code would look like this:

In [14]:
# %%time
# This method did not work because it ran out of RAM.
# SE_encoder =SumEncoder(feature_list)
# train_se = SE_encoder.fit_transform(train[feature_list], target)
# test_se = SE_encoder.transform(test[feature_list])

## 4. Helmert Encoder  <a name="helmertencoder"></a>

**Helmert Encoding** is a third commonly used type of categorical encoding for regression along with OHE and Sum Encoding. 

It compares each level of a categorical variable to the mean of the subsequent levels (so-called `forward` Helmert) or previous levels (so-called `reverse` Helmert). 

This type of encoding can be useful in situations where the levels of the categorical variable are ordered (which is not the case in this dataset).

The version in `category_encoders` is sometimes referred to as Reverse Helmert Coding.

![Helmert](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*I2C4uJP8lKmMVzmsREFv3g.png)
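To make the "compare each level to the mean of the previous levels" idea concrete, here is a small sketch that builds one common, unnormalised form of the reverse Helmert contrast matrix. The exact scaling and whether an intercept column is added differ between libraries, so treat the numbers as an assumption rather than the output of `HelmertEncoder`.

```python
def reverse_helmert_contrast(n_levels):
    """Rows are levels, columns are the n_levels - 1 contrasts."""
    matrix = [[0] * (n_levels - 1) for _ in range(n_levels)]
    for col in range(n_levels - 1):
        for row in range(col + 1):
            matrix[row][col] = -1        # all previous levels
        matrix[col + 1][col] = col + 1   # the level being compared
    return matrix

contrast = reverse_helmert_contrast(4)
for row in contrast:
    print(row)
```

Each column contrasts one level (positive weight) with all levels before it (weight -1 each), so every column sums to zero.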

In [15]:
# %%time
# This method did not work because it ran out of RAM.
# HE_encoder = HelmertEncoder(feature_list)
# train_he = HE_encoder.fit_transform(train[feature_list], target)
# test_he = HE_encoder.transform(test[feature_list])

## 5. Frequency Encoder  <a name="freqencoder"></a>

Frequency Encoding counts the number of a category's occurrences in the dataset. New categories in the test dataset are encoded either with "1" or with their counts in the test dataset, which makes this encoder a little bit tricky: the encoding may differ for different sizes of test batch. You should think about it beforehand and make the preprocessing of the train set as close to that of the test set as possible.

To avoid this problem, you might also consider a variation of the Frequency Encoder: the Rolling Frequency Encoder (RFE). RFE counts the number of a category's occurrences within the last dt timesteps before a given observation (for example, for dt = 24 hours).

Nevertheless, Frequency Encoding and RFE are especially efficient when your categorical column has "long tails", i.e. a few frequent values while the remaining ones have only a few examples in the dataset. In such a case, Frequency Encoding captures the similarity between rare categories.

![Frequency Encoder](https://miro.medium.com/v2/resize:fit:720/format:webp/1*WfnICPX4S2BrQ4WtZ3GqDQ.png)
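A minimal sketch of Frequency Encoding using only the standard library. The values are toy data, and encoding unseen test categories as 1 is just one of the two options mentioned above.

```python
from collections import Counter

# Toy data (hypothetical); "Freezing" never occurs in the train values
train_values = ["Hot", "Cold", "Very Hot", "Warm", "Hot",
                "Warm", "Warm", "Hot", "Hot", "Cold"]
test_values = ["Hot", "Freezing", "Warm"]

# Count how often each category occurs in the train data
counts = Counter(train_values)

train_encoded = [counts[v] for v in train_values]
# Unseen categories fall back to 1
test_encoded = [counts.get(v, 1) for v in test_values]

print(train_encoded)
print(test_encoded)
```

Note that with this fallback an unseen category gets the same code as a category seen exactly once in train ("Very Hot" here), which is the kind of detail worth deciding on beforehand.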

## 6. Target Encoder / Mean Encoding  <a name="targetencoder"></a>

Target Encoding has probably become the most popular encoding type because of Kaggle competitions. It takes information about the target to encode categories, which makes it extremely powerful. 
The basic idea is:
1. Select a categorical variable you would like to transform.
2. Group by the categorical variable and obtain aggregated sum over the “Target” variable. (total number of 1’s for each category in ‘category’ column)
3. Group by the categorical variable and obtain aggregated count over “Target” variable
4. Divide the step 2 / step 3 results and join it back with the train.

The advantages of mean target encoding are that it does not increase the volume of the data and that it helps the model learn faster. However, mean encoding is notorious for overfitting; thus, regularization with cross-validation or some other approach is a must-have on most occasions.

Sample code:
``` python
mean_encode = df.groupby('category')['target'].mean()
df.loc[:, 'category_representation'] = df['category'].map(mean_encode)
df
```

![Target Encoding](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*pRl2Vk_ZR72VhuHi9S3ZuA.png)

### Issues
One really important issue is *target leakage*. By using the probability of the target to encode the features, we are feeding them with information about the very variable we are trying to model. This is like "cheating", since the model will learn from a variable that contains the target itself.

You can think that this might not be an issue if this encoding reflects the real probability of the target, given a category. But if this is the case we might not even need a model. Instead, we can use this variable as a single powerful predictor for this target.

Also, the use of the mean as a predictor for the whole distribution is good, but not perfect. If all the different distributions and combinations of the data could be represented by a single mean, our life would be much easier.

Even if the mean is a good summary, we train models on a fraction of the data. The mean of this fraction may not be the mean of the full population (remember the central limit theorem?), so the encoding might not be correct. If the sample is different enough from the population, the model may even overfit the training data.

To reduce the effect of target leakage, you can:

- Increase regularization
- Add random noise to the representation of the category in train dataset (some sort of augmentation)
- Use Double Validation (using other validation)
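One common way to apply the validation idea is out-of-fold encoding: each row is encoded with category means computed only from rows in the other folds, so its own target never enters its encoding. The sketch below is a simplified illustration under my own assumptions (toy data, `n_folds = 5`, folds assigned by index modulo); it is not necessarily the exact Double Validation scheme referred to above.

```python
from collections import defaultdict

# Toy data (hypothetical)
categories = ["Hot", "Cold", "Very Hot", "Warm", "Hot",
              "Warm", "Warm", "Hot", "Hot", "Cold"]
target = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
n_folds = 5  # illustrative choice

prior = sum(target) / len(target)
encoded = [None] * len(target)

for fold in range(n_folds):
    # Category statistics from all rows outside the current fold
    sums, counts = defaultdict(float), defaultdict(int)
    for i, (c, y) in enumerate(zip(categories, target)):
        if i % n_folds != fold:
            sums[c] += y
            counts[c] += 1
    # Encode the rows of the current fold; unseen categories get the prior
    for i, c in enumerate(categories):
        if i % n_folds == fold:
            encoded[i] = sums[c] / counts[c] if counts[c] else prior

print(encoded)
```

The single "Very Hot" row never sees its own target and therefore falls back to the prior, which is exactly the behaviour that limits leakage for rare categories.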

### Target Encoding with prior smoothing
We can smooth the scores for each category by considering the mean of the whole population. 

$$ \text{encoding} = \alpha \cdot p(t = 1 | x = c_i) + (1 - \alpha) \cdot p(t=1) \quad ,$$
where $p(t=1)$ is the prior probability and $\alpha$ is the smoothing parameter, usually defined as
$$ \alpha = \frac{1}{1+\exp(-\frac{n-k}{f})} \quad ,$$ 
where 

- <i>n</i> is the number of observations of the category (see Notation)
- <i>k</i> means **'min_samples_leaf'**
- <i>f</i> means **'smooth parameter, power of regularization'**

Recommended values for <i>k</i> and <i>f</i> are in the range of 1 to 100. New category values and values that appear only once in the train dataset are replaced with the prior.

The library `category_encoders` uses the smoothed version.
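The smoothing formula can be sketched by hand as follows. This is a literal transcription of the two equations on toy data, with `k = 1` and `f = 1.0` chosen purely for illustration; it does not reproduce the library's special handling of categories that appear only once.

```python
import math
from collections import defaultdict

# Toy data (hypothetical)
categories = ["Hot", "Cold", "Very Hot", "Warm", "Hot",
              "Warm", "Warm", "Hot", "Hot", "Cold"]
target = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
k = 1    # min_samples_leaf (illustrative value)
f = 1.0  # smoothing (illustrative value)

prior = sum(target) / len(target)

# Per-category sum of the target and number of observations n
sums, counts = defaultdict(float), defaultdict(int)
for c, y in zip(categories, target):
    sums[c] += y
    counts[c] += 1

encoding = {}
for c, n in counts.items():
    alpha = 1 / (1 + math.exp(-(n - k) / f))
    # Blend the category mean with the prior
    encoding[c] = alpha * sums[c] / n + (1 - alpha) * prior

print(encoding)
```

The more observations a category has, the closer `alpha` gets to 1 and the more the encoding trusts the category mean; rare categories are pulled towards the prior.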

In [16]:
%%time

TE_encoder = TargetEncoder()
train_te = TE_encoder.fit_transform(train[feature_list], target)
test_te = TE_encoder.transform(test[feature_list])

train_te.head()

CPU times: total: 13 s
Wall time: 13 s


Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,0.302537,0.290107,0.327145,0.360978,0.307162,0.242813,0.237743,...,0.372694,0.335588,2,0.403885,0.257877,0.306993,0.208354,0.401186,2,2
1,0,1,0,0.302537,0.290107,0.327145,0.290054,0.359209,0.289954,0.304164,...,0.189202,0.229909,1,0.403885,0.326315,0.206599,0.186877,0.30388,7,8
2,0,0,0,0.309384,0.290107,0.24179,0.290054,0.293085,0.289954,0.353951,...,0.223022,0.210992,1,0.317175,0.403126,0.306993,0.351864,0.206843,7,2
3,0,1,0,0.309384,0.290107,0.351052,0.290054,0.307162,0.339793,0.329472,...,0.325123,0.233811,1,0.403885,0.360961,0.330148,0.208354,0.355985,2,1
4,0,0,0,0.309384,0.333773,0.351052,0.290054,0.293085,0.339793,0.329472,...,0.376812,0.219315,1,0.403885,0.225214,0.206599,0.351864,0.404345,7,8


### Multiclass Approach
Let's assume we have the following dataset.

In [17]:
df = pd.DataFrame({"genre" : ["Nonfiction", "Fantasy", "Nonfiction", "Nonfiction",
                             "Romance", "Nonfiction", "Nonfiction", "Fantasy", 
                             "Nonfiction", "Fantasy", "Nonfiction", "Fantasy", 
                             "Romance", "Fantasy", "Nonfiction", "Fantasy", 
                             "Romance", "Nonfiction", "Romance", "Fantasy"],
                  "target" : [0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 2, 1, 0, 2, 2, 2]})
df.head()

Unnamed: 0,genre,target
0,Nonfiction,0
1,Fantasy,0
2,Nonfiction,1
3,Nonfiction,1
4,Romance,0


By running the target encoder from the `category_encoders` package on it, we receive the following results:

In [18]:
te = TargetEncoder()
df['genre_encoded_sklearn'] = te.fit_transform(df['genre'], df['target'])
df.head()

Unnamed: 0,genre,target,genre_encoded_sklearn
0,Nonfiction,0,0.849948
1,Fantasy,0,0.781643
2,Nonfiction,1,0.849948
3,Nonfiction,1,0.849948
4,Romance,0,0.749606


In [19]:
df[["genre", "target", "genre_encoded_sklearn"]].drop_duplicates().sort_values("genre")

Unnamed: 0,genre,target,genre_encoded_sklearn
1,Fantasy,0,0.781643
9,Fantasy,1,0.781643
19,Fantasy,2,0.781643
0,Nonfiction,0,0.849948
2,Nonfiction,1,0.849948
14,Nonfiction,2,0.849948
4,Romance,0,0.749606
18,Romance,2,0.749606


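As you can see, each genre gets a single value: the encoder treats the target as a number and returns one smoothed mean per genre, so the result cannot be read as per-class probabilities that add up to 1. For multiclass classification we have to calculate the numbers ourselves.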

In [20]:
categories = df['genre'].unique()
targets = df['target'].unique()
cat_list = []

for cat in categories:
    aux_dict = {}
    aux_dict['category'] = cat
    aux_df = df[df['genre'] == cat]
    counts = aux_df['target'].value_counts()
    aux_dict['count'] = sum(counts)

    for t in targets:
        aux_dict['target_' + str(t)] = counts[t] if t in counts.keys() else 0

    cat_list.append(aux_dict)

cat_list = pd.DataFrame(cat_list)
for t in targets:
    cat_list['genre_encoded_target_' + str(t)] = cat_list['target_' + str(t)] / cat_list['count']
    
cat_list

Unnamed: 0,category,count,target_0,target_1,target_2,genre_encoded_target_0,genre_encoded_target_1,genre_encoded_target_2
0,Nonfiction,9,2,5,2,0.222222,0.555556,0.222222
1,Fantasy,7,3,3,1,0.428571,0.428571,0.142857
2,Romance,4,3,0,1,0.75,0.0,0.25


All the encodings now reflect correctly the posteriors, as expected. Even “Romance” will be encoded as “0” for target “1” since it never appeared for that category.

Now that we understand what needs to be done to use Target Encoding in multiclass problems, it is easy to write some simple code that uses the `category_encoders.TargetEncoder` object in this scenario:

In [21]:
from category_encoders import TargetEncoder

targets = df['target'].unique()
for t in targets:
    target_aux = df['target'].apply(lambda x: 1 if x == t else 0)
    encoder = TargetEncoder()
    df['genre_encoded_sklearn_target_' + str(t)] = encoder.fit_transform(df['genre'], target_aux)

In [22]:
df[["genre", "target", "genre_encoded_sklearn_target_0", 
    "genre_encoded_sklearn_target_1", "genre_encoded_sklearn_target_2"]].drop_duplicates().sort_values("genre")

Unnamed: 0,genre,target,genre_encoded_sklearn_target_0,genre_encoded_sklearn_target_1,genre_encoded_sklearn_target_2
1,Fantasy,0,0.406119,0.406119,0.187762
9,Fantasy,1,0.406119,0.406119,0.187762
19,Fantasy,2,0.406119,0.406119,0.187762
0,Nonfiction,0,0.355602,0.438848,0.20555
2,Nonfiction,1,0.355602,0.438848,0.20555
14,Nonfiction,2,0.355602,0.438848,0.20555
4,Romance,0,0.458794,0.332807,0.208399
18,Romance,2,0.458794,0.332807,0.208399


If you are familiar with One-Hot Encoding, you know that now you may remove any of the encoded columns to avoid multicollinearity.

## 7. Leave-one-out Encoder (LOO or LOOE)  <a name="looencoder"></a>

**Leave-one-out Encoding** is another example of target-based encoders.

This encoder calculates the mean target of category k for observation i as if observation i were removed from the dataset:

$$\hat{x}^k_i = \frac{\sum_{j \neq i}(y_j * (x_j == k) )}{\sum_{j \neq i} (x_j == k)}$$

While encoding the test dataset, a category is replaced with the mean target of the category k in the train dataset:

$$\hat{x}^k = \frac{\sum_{j} (y_j * (x_j == k))}{\sum_{j} (x_j == k)}$$

An alternative expression of the train-time equation, which is the one actually used in the [category_encoders](https://github.com/scikit-learn-contrib/category_encoders/blob/master/category_encoders/leave_one_out.py) package:

$$\hat{x}^k_i = \frac{\sum_{j}(y_j * (x_j == k) ) - y_i }{\sum_{j} (x_j == k) - 1}$$
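The train and test formulas can be sketched by hand as follows. This is a simplified illustration on toy data; falling back to the prior for categories with a single observation is my assumption, since the leave-one-out mean is undefined in that case.

```python
from collections import defaultdict

# Toy data (hypothetical)
categories = ["Hot", "Cold", "Very Hot", "Warm", "Hot",
              "Warm", "Warm", "Hot", "Hot", "Cold"]
target = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]

prior = sum(target) / len(target)

# Per-category sum of the target and number of observations
sums, counts = defaultdict(float), defaultdict(int)
for c, y in zip(categories, target):
    sums[c] += y
    counts[c] += 1

def loo_train(c, y):
    """Leave-one-out mean for a training row: (sum - y_i) / (n - 1)."""
    if counts[c] == 1:
        return prior  # assumption: no other rows to average over
    return (sums[c] - y) / (counts[c] - 1)

train_encoded = [loo_train(c, y) for c, y in zip(categories, target)]
# Test rows simply get the plain category mean from the train data
test_encoding = {c: sums[c] / counts[c] for c in counts}

print(train_encoded)
print(test_encoding)
```

For "Hot" the training rows receive 0.67 or 1.0 depending on their own target, while every test row receives 0.75, which illustrates both the leakage and the train/test shift.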

One of the problems with LOO, just like with all other target-based encoders, is target leakage. But when it comes to LOO, this problem gets really dramatic, because we may perfectly classify the training dataset by making a single split.

A possible solution to prevent overfitting is to add regularization and a randomness factor:
$$\hat{x}^k_i = \frac{\sum_{j \neq i}(y_j * (x_j == k) )}{\sum_{j} (x_j == k) - 1 + R} \cdot (1 + \epsilon_i) \quad ,$$

where `R` is a regularization factor to be determined on the validation dataset and $\epsilon_i$ is a randomness factor following a certain distribution (e.g. $N(0, \sigma^2)$); for example, for the Lesara dataset it is $\epsilon_i \sim N(0, 0.03)$.

[Implementation of LOO on Lesara Customers](https://medium.com/@o.xhelili/using-machine-learning-for-retaining-lesara-customers-f1b88f492e62).

Another problem with LOO is a shift between the values in the train and test samples, as you can observe in the picture below. Possible values for category "A" in the train sample are 0.67 and 0.33, while in the test sample the value is 0.5. This results from the different counts used for the train and test datasets: for category "A" the denominator is equal to n for the test dataset and n-1 for the train dataset. Such a shift may gradually reduce the performance of tree-based models.

![LOOE](https://miro.medium.com/v2/resize:fit:720/format:webp/1*Be0SlDIsHeP27EFmeyWdRg.png)

In [30]:
%%time
LOOE_encoder = LeaveOneOutEncoder() # with potential parameter sigma=0.05
train_looe = LOOE_encoder.fit_transform(train[feature_list], target)
test_looe = LOOE_encoder.transform(test[feature_list])

CPU times: total: 11 s
Wall time: 11.1 s


In [31]:
train_looe

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,0.302539,0.290108,0.327148,0.360990,0.307169,0.242820,0.237746,...,0.374074,0.388889,2,0.403890,0.257885,0.307005,0.208407,0.401980,2,2
1,0,1,0,0.302539,0.290108,0.327148,0.290057,0.359221,0.289957,0.304167,...,0.190909,0.083333,1,0.403890,0.326330,0.206605,0.186887,0.303997,7,8
2,0,0,0,0.309387,0.290108,0.241793,0.290057,0.293087,0.289957,0.353958,...,0.223827,0.178571,1,0.317188,0.403133,0.307005,0.351885,0.206923,7,2
3,0,1,0,0.309380,0.290103,0.351043,0.290047,0.307147,0.339780,0.329465,...,0.321782,0.209302,1,0.403877,0.360951,0.330124,0.208155,0.355736,2,1
4,0,0,0,0.309387,0.333776,0.351056,0.290057,0.293087,0.339800,0.329476,...,0.378641,0.205882,1,0.403890,0.225217,0.206605,0.351885,0.404487,7,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,0.302539,0.333776,0.351056,0.290057,0.307169,0.361323,0.329476,...,0.261905,0.348837,1,0.278540,0.225217,0.365225,0.266041,0.313116,3,8
299996,0,0,0,0.309387,0.290108,0.327148,0.290057,0.293087,0.289957,0.304167,...,0.289216,0.200000,2,0.242057,0.225217,0.307005,0.409526,0.225034,3,2
299997,0,0,0,0.309380,0.290103,0.241782,0.310612,0.318998,0.289947,0.329465,...,0.331034,0.275862,3,0.242049,0.360951,0.433607,0.186832,0.177994,7,9
299998,0,1,0,0.309380,0.290103,0.327140,0.338918,0.318998,0.314669,0.304155,...,0.275304,0.333333,1,0.355055,0.360951,0.306965,0.409417,0.437908,3,8


In [32]:
test_looe

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,1,0.302537,0.290107,0.241790,0.360978,0.319017,0.242813,0.304164,...,0.348315,0.181818,2,0.242055,0.288796,0.342476,0.324947,0.300588,5,11
1,0,0,0,0.302537,0.333773,0.351052,0.338932,0.293085,0.339793,0.304164,...,0.222707,0.288889,1,0.355078,0.403126,0.379277,0.186877,0.244795,7,5
2,1,0,1,0.309384,0.290107,0.241790,0.338932,0.245139,0.311724,0.304164,...,0.186667,0.090909,2,0.317175,0.225214,0.206599,0.236891,0.417726,1,12
3,0,0,1,0.302537,0.290107,0.351052,0.310627,0.335367,0.311724,0.304164,...,0.360656,0.320000,1,0.278533,0.403126,0.220460,0.336264,0.365151,2,3
4,0,1,1,0.309384,0.333773,0.351052,0.290054,0.245139,0.311724,0.304164,...,0.375000,0.294118,3,0.403885,0.403126,0.379277,0.409481,0.389864,4,11
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
199995,0,0,0,0.309384,0.333773,0.327145,0.338932,0.293085,0.339793,0.353951,...,0.333333,0.393939,1,0.242055,0.403126,0.342476,0.186877,0.219193,1,3
199996,1,0,0,0.309384,0.290107,0.327145,0.290054,0.293085,0.311724,0.304164,...,0.445122,0.276596,1,0.278533,0.403126,0.275018,0.359364,0.198573,2,2
199997,0,1,1,0.302537,0.290107,0.327145,0.290054,0.293085,0.339793,0.329472,...,0.330275,0.411765,1,0.242055,0.360961,0.289685,0.386269,0.273736,3,1
199998,1,0,0,0.302537,0.290107,0.241790,0.310627,0.359209,0.314688,0.237743,...,0.381295,0.277778,2,0.403885,0.360961,0.289685,0.407184,0.282507,2,1


## 8. M-Estimate Encoder  <a name="meestimateencoder"></a>

M-estimator encoding can be used in categorical encoding as a way to handle outliers or rare categories in a dataset. In this context, it can be used as a way to handle a class imbalance in a categorical variable. The idea is to assign a weight to each category based on its deviation from the overall class frequency. This weight is then used to adjust the encoding of the categorical variable, giving more importance to under-represented categories.

For example, suppose you have a categorical variable with 3 categories A, B, and C, and you want to encode it using one-hot encoding. The standard one-hot encoding will assign the same weight to each category. However, if category A is significantly under-represented compared to B and C, you should give it more weight in the encoding. In this case, you can use M-estimator encoding, which assigns weights to each category based on a weight function chosen to address the class imbalance.

It is worth noting that M-estimator encoding is just one of the many methods that can be used to handle a class imbalance in categorical variables, and it may not be the best method in every situation. Whether or not to use it, and how to use it, depends on the specific problem and dataset you are working with.

**M-Estimate Encoder** is **another version of Target Encoder** with a single hyperparameter `m`:

$$\hat{x}^k = \frac{n^+ + prior * m}{n + m}$$

A higher value of `m` results in stronger shrinking towards the prior. Recommended values for `m` are in the range of 1 to 100.

The formula for M-Estimate Encoder in the Categorical Encoders library [contained a bug](https://github.com/scikit-learn-contrib/categorical-encoding/issues/200): the denominator used `y+` instead of `n`:

$$\hat{x}^k = \frac{n^+ + prior * m}{y^+ + m}$$

However, both variants show pretty good scores.

![M-Estimate Encoder](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*8F5KJnekzeWYo_x6jYOBTQ.png)
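For reference, here is a small sketch that evaluates the m-estimate formula with `n` in the denominator directly on toy data. The value `m = 1.0` is an illustrative choice, not a recommendation.

```python
from collections import defaultdict

# Toy data (hypothetical)
categories = ["Hot", "Cold", "Very Hot", "Warm", "Hot",
              "Warm", "Warm", "Hot", "Hot", "Cold"]
target = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
m = 1.0  # strength of the shrinking (illustrative value)

prior = sum(target) / len(target)

# n_pos[c] is n+ and n_all[c] is n for category c
n_pos, n_all = defaultdict(float), defaultdict(int)
for c, y in zip(categories, target):
    n_pos[c] += y
    n_all[c] += 1

# (n+ + prior * m) / (n + m) for every category
encoding = {c: (n_pos[c] + prior * m) / (n_all[c] + m) for c in n_all}
print(encoding)
```

Setting `m = 0` gives the plain category mean, while a very large `m` pushes every category towards the prior.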

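The snippet below illustrates the frequency-based weighting idea described above. Note that it does not use the target, so it differs from the formula implemented by `MEstimateEncoder`: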

In [23]:
# sample data
data = {'Temperature': ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
        'Color': ['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
        'Target': [1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(data,columns = ['Temperature','Color','Target'])

# count the frequency of each category
category_counts = df['Temperature'].value_counts()

# calculate the weight for each category based on deviation from the mean frequency
weights = (category_counts - category_counts.mean()).abs()
weights = weights / weights.sum()

# create a dictionary to map each category to its weight
mapping = dict(zip(weights.index, weights.values))

# map the categories to their weights
df['weights'] = df['Temperature'].map(mapping)

# calculate weighted encoding for each category
encoded = df.groupby('Temperature')['weights'].sum()
encoded = encoded / encoded.sum()

# map the weighted encoding back to the categories
df['encoded_temperature'] = df['Temperature'].map(encoded)

df

Unnamed: 0,Temperature,Color,Target,weights,encoded_temperature
0,Hot,Red,1,0.375,0.6
1,Cold,Yellow,1,0.125,0.1
2,Very Hot,Blue,1,0.375,0.15
3,Warm,Blue,0,0.125,0.15
4,Hot,Red,1,0.375,0.6
5,Warm,Yellow,0,0.125,0.15
6,Warm,Red,1,0.125,0.15
7,Hot,Yellow,0,0.375,0.6
8,Hot,Yellow,1,0.375,0.6
9,Cold,Yellow,1,0.125,0.1


In this example, `weights` holds the weight assigned to each category based on its deviation from the mean frequency. The `encoded_temperature` column is the weighted encoding of the Temperature category, calculated as the sum of the weights for each category and normalized to sum to 1.

And by using the `category_encoders` package:

In [24]:
%%time
MEE_encoder = MEstimateEncoder()
train_mee = MEE_encoder.fit_transform(train[feature_list], target)
test_mee = MEE_encoder.transform(test[feature_list])

CPU times: total: 14.8 s
Wall time: 14.9 s


In [25]:
train_mee.head()

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,0.302537,0.290107,0.327145,0.360976,0.307162,0.242815,0.237744,...,0.372448,0.365294,2,0.403884,0.257879,0.306993,0.208379,0.400998,2,2
1,0,1,0,0.302537,0.290107,0.327145,0.290055,0.359207,0.289954,0.304164,...,0.190231,0.093277,1,0.403884,0.326314,0.206602,0.186884,0.303881,7,8
2,0,0,0,0.309384,0.290107,0.241791,0.290055,0.293085,0.289954,0.35395,...,0.223319,0.176863,1,0.317175,0.403125,0.306993,0.351861,0.206881,7,2
3,0,1,0,0.309384,0.290107,0.351051,0.290055,0.307162,0.339792,0.329472,...,0.325029,0.22902,1,0.403884,0.36096,0.330147,0.208379,0.355965,2,1
4,0,0,0,0.309384,0.333773,0.351051,0.290055,0.293085,0.339792,0.329472,...,0.376471,0.202941,1,0.403884,0.225215,0.206602,0.351861,0.40431,7,8


## 9. Weight of Evidence Encoder  <a name="woeencoder"></a>

**Weight Of Evidence** is a commonly used target-based encoder in credit scoring. It is a measure of the “strength” of a grouping for separating good and bad risk (default). 

It is calculated from the basic odds ratio:

``` python
a = Distribution of Good Credit Outcomes
b = Distribution of Bad Credit Outcomes
WoE = ln(a / b)
```

However, if we use formulas as is, it might lead to **target leakage** (and overfit).

To avoid that, a regularization parameter <i>a</i> is introduced and WoE is calculated in the following way:

$$\text{numerator} = \frac{n^+ + a}{y^+ + 2*a}$$

$$\text{denominator} = \frac{n - n^+ + a}{y - y^+ + 2*a}$$

$$\hat{x}^k = ln(\frac{\text{numerator}}{\text{denominator}})$$

![WoE](https://miro.medium.com/v2/resize:fit:1100/format:webp/1*wEFbHb-4a-2by1bVC_GTVQ.png)
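
The regularized formula above can be sketched in a few lines of pandas. This is a minimal illustration, not the `category_encoders` implementation: the function name `woe_encode` and the toy data are hypothetical, and the symbols are read as $n^+$ = positives in the category, $n$ = rows in the category, $y^+$ = positives overall, $y$ = rows overall.

```python
import numpy as np
import pandas as pd


def woe_encode(x: pd.Series, y: pd.Series, a: float = 1.0) -> pd.Series:
    """Return one regularized WoE value per category of x (hypothetical helper)."""
    total_pos = y.sum()                # y^+
    total_neg = len(y) - total_pos     # y - y^+
    stats = y.groupby(x).agg(["sum", "count"])
    pos = stats["sum"]                 # n^+
    neg = stats["count"] - stats["sum"]  # n - n^+
    numerator = (pos + a) / (total_pos + 2 * a)
    denominator = (neg + a) / (total_neg + 2 * a)
    return np.log(numerator / denominator)


# Toy data: category "a" is all positives, category "b" is all negatives.
x_toy = pd.Series(["a", "a", "b", "b"])
y_toy = pd.Series([1, 1, 0, 0])
woe_toy = woe_encode(x_toy, y_toy, a=1.0)
print(woe_toy)
```

Without the regularization term (`a = 0`), category "b" would give `log(0)` and category "a" a division by zero, which is the practical reason the smoothing is needed.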

In [26]:
%%time
WOE_encoder = WOEEncoder()
train_woe = WOE_encoder.fit_transform(train[feature_list], target)
test_woe = WOE_encoder.transform(test[feature_list])

CPU times: total: 15.1 s
Wall time: 15.3 s


In [27]:
train_woe

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,-0.015794,-0.075416,0.098327,0.248359,0.006058,-0.317803,-0.345614,...,0.302749,0.333932,2,0.430146,-0.237516,0.005297,-0.514545,0.420532,2,2
1,0,1,0,-0.015794,-0.075416,0.098327,-0.075660,0.240683,-0.076148,-0.008087,...,-0.600377,-1.052362,1,0.430146,0.094611,-0.526005,-0.650766,-0.008737,7,8
2,0,0,0,0.016454,-0.075416,-0.323420,-0.075660,-0.060990,-0.076148,0.217747,...,-0.417323,-0.607677,1,0.052724,0.426997,0.005297,0.208660,-0.523234,7,2
3,0,1,0,0.016454,-0.075416,0.205037,-0.075660,0.006058,0.155252,0.108884,...,0.096879,-0.338013,1,0.430146,0.248265,0.111980,-0.514545,0.227089,2,1
4,0,0,0,0.016454,0.128285,0.205037,-0.075660,-0.060990,0.155252,0.108884,...,0.321353,-0.468414,1,0.430146,-0.416062,-0.526005,0.208660,0.432324,7,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,-0.015794,0.128285,0.205037,-0.075660,0.006058,0.249803,0.108884,...,-0.192161,0.190831,1,-0.132258,-0.416062,0.266667,-0.195294,0.033989,3,8
299996,0,0,0,0.016454,-0.075416,0.098327,-0.075660,-0.060990,-0.076148,-0.008087,...,-0.076648,-0.502316,2,-0.321986,-0.416062,0.005297,0.453411,-0.415855,3,2
299997,0,0,0,0.016454,-0.075416,-0.323420,0.022286,0.061193,-0.076148,0.108884,...,0.146495,0.030982,3,-0.321986,0.248265,0.552543,-0.650766,-0.705175,7,9
299998,0,1,0,0.016454,-0.075416,0.098327,0.151411,0.061193,0.041196,-0.008087,...,-0.125022,0.288812,1,0.222693,0.248265,0.005297,0.453411,0.572812,3,8


## 10. Probability Ratio Encoder  <a name="probabilityratioencoder"></a>

Probability Ratio Encoding is similar to Weight Of Evidence (WoE), with the only difference that the ratio of good and bad probability is used directly, without taking the logarithm.

- For each label, we calculate the mean of target=1, that is, the probability of being 1 ( P(1) ), 
- Also the probability of the target=0 ( P(0) ). 
- And then, we calculate the ratio P(1)/P(0) and replace the labels with that ratio. 

To avoid a division by zero for a category that contains no target=0 rows, we replace P(0) with a small positive value in that case.

In [35]:
df = pd.concat([train["month"], target], axis = 1)
df

Unnamed: 0,month,target
0,2,0
1,8,0
2,2,0
3,1,1
4,8,0
...,...,...
299995,8,0
299996,2,0
299997,9,1
299998,8,1


In [36]:
pr_df = df.groupby("month")["target"].mean()
pr_df = pd.DataFrame(pr_df)

pr_df = pr_df.rename(columns = {"target" : "good"})
pr_df["bad"] = 1-pr_df.good

pr_df["bad"] = np.where(pr_df["bad"] == 0, 0.000001, pr_df["bad"])

pr_df["PR"] = pr_df.good / pr_df.bad

In [37]:
pr_df

Unnamed: 0_level_0,good,bad,PR
month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,0.255729,0.744271,0.343596
2,0.244432,0.755568,0.323508
3,0.280936,0.719064,0.390696
4,0.297352,0.702648,0.423187
5,0.317053,0.682947,0.464243
6,0.376554,0.623446,0.603989
7,0.344642,0.655358,0.525883
8,0.327496,0.672504,0.48698
9,0.345295,0.654705,0.527406
10,0.353157,0.646843,0.545969


In [38]:
df.loc[:, "PR_Encode"] = df["month"].map(pr_df["PR"])
df

Unnamed: 0,month,target,PR_Encode
0,2,0,0.323508
1,8,0,0.486980
2,2,0,0.323508
3,1,1,0.343596
4,8,0,0.486980
...,...,...,...
299995,8,0,0.486980
299996,2,0,0.323508
299997,9,1,0.527406
299998,8,1,0.486980


## 11. James-Stein Encoder  <a name="jamessteinencoder"></a>

[**James-Stein Encoder**](https://contrib.scikit-learn.org/category_encoders/jamesstein.html) is a target-based encoder. It is inspired by the James-Stein estimator, a technique named after Charles Stein and Willard James, who simplified Stein’s original 1956 method for estimating the mean of Gaussian random vectors. Stein and James proved that a better estimator than the “perfect” (i.e. mean) estimator exists, which seems to be somewhat of a paradox. However, the James-Stein estimator outperforms the sample mean only when there are several unknown population means, not just one.

The idea behind James-Stein Encoder is simple. Estimation of the mean target for category k could be calculated according to the following formula:

$$\hat{x}^k = (1-B) * \frac{n^+}{n} + B * \frac{y^+}{y} $$

One way to select B is to tune it like a hyperparameter via cross-validation, but Charles Stein came up with another solution to the problem:

$$B = \frac{Var[y^k]}{Var[y^k] + Var[y]}$$

This seems quite fair, but the James-Stein estimator has a big disadvantage: it is defined only for normally distributed targets (which is not the case for any classification task). If you want to apply it to binary classification, where the target takes only the values {0, 1}, it is better to first map the mean target value from the bounded interval [0, 1] onto an unbounded interval by replacing `mean(y)` = $\frac{y^+}{y}$ with the log-odds ratio: `log-odds_ratio_i = log(mean(y_i)/mean(y_not_i))`

To avoid that, we can either convert binary targets with a log-odds ratio as it was done in WoE Encoder (which is used by default because it is simple) or use beta distribution.

![James-Stein](https://miro.medium.com/v2/resize:fit:720/format:webp/1*7pm_k0Vq47JrWxFBwl2Tiw.png)
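
The shrinkage formula and the choice of B can be sketched directly in pandas. This is a minimal illustration under stated assumptions, not the `category_encoders` implementation: the name `james_stein_encode` is hypothetical, $Var[y^k]$ is taken to be the sample variance of the target within category k, and $Var[y]$ the sample variance of the whole target.

```python
import pandas as pd


def james_stein_encode(x: pd.Series, y: pd.Series) -> pd.Series:
    """Shrink each category mean towards the global mean (hypothetical helper)."""
    global_mean = y.mean()   # y^+ / y
    global_var = y.var()     # Var[y], assumed to be non-zero
    stats = y.groupby(x).agg(["mean", "var"])
    # A category with a single row has no sample variance; treat it as 0 here.
    cat_var = stats["var"].fillna(0.0)
    B = cat_var / (cat_var + global_var)
    return (1 - B) * stats["mean"] + B * global_mean


x_toy = pd.Series(["a", "a", "a", "b", "b", "b"])
y_toy = pd.Series([1, 1, 1, 0, 1, 0])
js_toy = james_stein_encode(x_toy, y_toy)
print(js_toy)
```

Category "a" has zero within-category variance, so B is 0 and its mean is kept as is; category "b" is noisier and is pulled from its own mean (1/3) towards the global mean (2/3).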

In [28]:
%%time
JSE_encoder = JamesSteinEncoder()
train_jse = JSE_encoder.fit_transform(train[feature_list], target)
test_jse = JSE_encoder.transform(test[feature_list])

CPU times: total: 18 s
Wall time: 18.1 s


In [29]:
train_jse

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,0.302537,0.290107,0.327145,0.343763,0.306777,0.260374,0.248201,...,0.337649,0.334882,2,0.377845,0.271531,0.306515,0.247588,0.351077,2,2
1,0,1,0,0.302537,0.290107,0.327145,0.294730,0.342564,0.294658,0.304449,...,0.238347,0.137804,1,0.377845,0.320078,0.243675,0.232549,0.304868,7,8
2,0,0,0,0.309384,0.290107,0.241790,0.294730,0.296876,0.294658,0.345642,...,0.260297,0.227178,1,0.314323,0.372130,0.306515,0.329955,0.249570,7,2
3,0,1,0,0.309384,0.290107,0.351052,0.294730,0.306777,0.329339,0.325462,...,0.315328,0.263300,1,0.377845,0.343752,0.319536,0.247588,0.330239,2,1
4,0,0,0,0.309384,0.333773,0.351052,0.294730,0.296876,0.329339,0.325462,...,0.339509,0.246247,1,0.377845,0.247048,0.243675,0.329955,0.352552,7,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,0.302537,0.333773,0.351052,0.294730,0.306777,0.343989,0.325462,...,0.279755,0.322701,1,0.285182,0.247048,0.338666,0.283590,0.309435,3,8
299996,0,0,0,0.309384,0.290107,0.327145,0.294730,0.296876,0.294658,0.304449,...,0.296697,0.242514,2,0.256848,0.247048,0.306515,0.358728,0.261030,3,2
299997,0,0,0,0.309384,0.290107,0.241790,0.309196,0.315031,0.294658,0.325462,...,0.320347,0.302973,3,0.256848,0.343752,0.374913,0.232549,0.229962,7,9
299998,0,1,0,0.309384,0.290107,0.327145,0.328749,0.315031,0.312025,0.304449,...,0.291684,0.331289,1,0.342313,0.343752,0.306515,0.358728,0.368007,3,8


## 12. Catboost Encoder  <a name="catboostencoder"></a>

**Catboost** is a recently created target-based categorical encoder. 

It is intended to overcome target leakage problems inherent in LOO. 

To do that, the authors of Catboost introduced the idea of “time”: the order of observations in the dataset. The value of the target statistic for each example relies only on the observed history. To calculate the statistic for observation i in the train dataset, we may use only the observations collected before observation i, i.e. j < i:

$$\hat{x}^k_i = \frac{\sum_{j < i}(y_j * (x_j == k)) + prior}{\sum_{j < i} (x_j == k) + 1}$$

To prevent overfitting, the process of target encoding for train dataset is repeated several times on shuffled versions of the dataset and results are averaged. Encoded values of the test data are calculated the same way as in LOO Encoder:

$$\hat{x}^k = \frac{\sum(y_j * (x_j == k)) + prior}{\sum (x_j == k) + 1}$$

Catboost “on the fly” Encoding is one of the core advantages of CatBoost — library for gradient boosting, which showed state of the art results on several tabular datasets when it was presented by Yandex.
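
The ordered statistic for the train data can be sketched with cumulative group operations. This is a minimal illustration, not the library code: the name `ordered_target_encode` is hypothetical, the prior is assumed to be the global target mean, and the smoothing weight `a` is assumed to be 1.

```python
import pandas as pd


def ordered_target_encode(x: pd.Series, y: pd.Series, a: float = 1.0) -> pd.Series:
    """Encode each row using only earlier rows of the same category (hypothetical helper)."""
    prior = y.mean()
    # Sum of targets over earlier rows of the same category (current row excluded).
    sum_before = y.groupby(x).cumsum() - y
    # Number of earlier rows of the same category.
    count_before = x.groupby(x).cumcount()
    return (sum_before + a * prior) / (count_before + a)


x_toy = pd.Series(["a", "b", "a", "a", "b"])
y_toy = pd.Series([1, 0, 0, 1, 1])
cb_toy = ordered_target_encode(x_toy, y_toy)
print(cb_toy)
```

The first occurrence of every category has no history, so it is encoded as the prior. The same pattern is visible in the first rows of the `train_cbe` output, where 0.305880 repeats.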

In [33]:
%%time
CBE_encoder = CatBoostEncoder()
train_cbe = CBE_encoder.fit_transform(train[feature_list], target)
test_cbe = CBE_encoder.transform(test[feature_list])

CPU times: total: 8.2 s
Wall time: 8.27 s


In [34]:
train_cbe

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,0.305880,0.305880,0.305880,0.305880,0.305880,0.305880,0.305880,...,0.305880,0.305880,2,0.305880,0.305880,0.305880,0.305880,0.305880,2,2
1,0,1,0,0.152940,0.152940,0.152940,0.305880,0.305880,0.305880,0.305880,...,0.305880,0.305880,1,0.152940,0.305880,0.305880,0.305880,0.305880,7,8
2,0,0,0,0.305880,0.101960,0.305880,0.152940,0.305880,0.152940,0.305880,...,0.305880,0.305880,1,0.305880,0.305880,0.152940,0.305880,0.305880,7,2
3,0,1,0,0.152940,0.076470,0.305880,0.101960,0.152940,0.305880,0.305880,...,0.305880,0.305880,1,0.101960,0.305880,0.305880,0.152940,0.305880,2,1
4,0,0,0,0.435293,0.305880,0.652940,0.326470,0.152940,0.652940,0.652940,...,0.305880,0.305880,1,0.326470,0.305880,0.152940,0.152940,0.305880,7,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,0.302539,0.333776,0.351056,0.290063,0.307169,0.361322,0.329468,...,0.262927,0.347861,1,0.278547,0.225222,0.365223,0.266043,0.313113,3,8
299996,0,0,0,0.309379,0.290102,0.327142,0.290060,0.293088,0.289953,0.304159,...,0.289297,0.202941,2,0.242051,0.225220,0.306977,0.409450,0.225089,3,2
299997,0,0,0,0.309377,0.290101,0.241786,0.310611,0.318979,0.289950,0.329465,...,0.330862,0.276863,3,0.242049,0.360939,0.433596,0.186839,0.178062,7,9
299998,0,1,0,0.309382,0.290105,0.327140,0.338918,0.318998,0.314669,0.304155,...,0.275427,0.332235,1,0.355053,0.360950,0.306965,0.409406,0.437765,3,8


## 13. Hashing Encoder  <a name="hashingencoder"></a>
Hashing maps categorical variables to a fixed-size vector of integers, where the distance between two vectors of categorical variables is approximately maintained in the transformed numerical space. With Hashing, the number of dimensions will be far smaller than with an encoding like One Hot Encoding. This method is advantageous when the cardinality of the categorical variable is very high. You can control the number of numerical columns produced by the process.

However, Hash Encoding has two significant weaknesses. First, because we transform the data into fewer features, there is some information loss. Second, since a large number of categorical values is represented by a smaller number of features, different categorical values can be represented by the same hash value; this is called a collision.

Still, many Kaggle competitors have used Hash Encoding successfully, so it is worth a try.

This encoding is used in production when the set of categories changes very frequently, for example on an e-commerce site where product categories keep changing as new products are added at regular intervals.
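
The idea can be sketched without any encoder library. This is a minimal illustration of the hashing trick, not the `HashingEncoder` implementation: the name `hash_encode` is hypothetical, and the choice of MD5 and of a plain modulo to pick the column are assumptions made for the example.

```python
import hashlib

import numpy as np


def hash_encode(values, n_components: int = 8) -> np.ndarray:
    """Map each value to one of n_components columns via a hash (hypothetical helper)."""
    out = np.zeros((len(values), n_components), dtype=int)
    for row, value in enumerate(values):
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        column = int(digest, 16) % n_components  # the same value always hits the same column
        out[row, column] += 1
    return out


hashed_toy = hash_encode(["Hot", "Cold", "Very Hot", "Warm", "Hot"], n_components=3)
print(hashed_toy)
```

No mapping has to be stored, so a category that was never seen during training still gets a valid column. With four distinct values and only three columns, at least two values must share a column, which is the collision described above.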

In [39]:
data = {'Temperature': ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
        'Color': ['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
        'Target': [1,1,1,0,1,0,1,0,1,1]}
df = pd.DataFrame(data,columns = ['Temperature','Color','Target'])
df

Unnamed: 0,Temperature,Color,Target
0,Hot,Red,1
1,Cold,Yellow,1
2,Very Hot,Blue,1
3,Warm,Blue,0
4,Hot,Red,1
5,Warm,Yellow,0
6,Warm,Red,1
7,Hot,Yellow,0
8,Hot,Yellow,1
9,Cold,Yellow,1


In [40]:
%%time
hencoder = HashingEncoder(cols='Temperature',n_components=3)
hash_res = hencoder.fit_transform(df['Temperature'])
hash_res.sample(5)

CPU times: total: 281 ms
Wall time: 54.6 s


Unnamed: 0,col_0,col_1,col_2
5,1,0,0
8,0,0,1
7,0,0,1
2,0,1,0
3,1,0,0


In [41]:
pd.concat([df, hash_res], axis = 1)

Unnamed: 0,Temperature,Color,Target,col_0,col_1,col_2
0,Hot,Red,1,0,0,1
1,Cold,Yellow,1,0,1,0
2,Very Hot,Blue,1,0,1,0
3,Warm,Blue,0,1,0,0
4,Hot,Red,1,0,0,1
5,Warm,Yellow,0,1,0,0
6,Warm,Red,1,1,0,0
7,Hot,Yellow,0,0,0,1
8,Hot,Yellow,1,0,0,1
9,Cold,Yellow,1,0,1,0


## 14. Generalized Linear Mixed Model (GLMM) Encoder  <a name="glmmencoder"></a>

This is a supervised encoder similar to TargetEncoder or MEstimateEncoder, but there are some advantages:

1. Solid statistical theory behind the technique. Mixed effects models are a mature branch of statistics.
2. No hyper-parameters to tune. The amount of shrinkage is automatically determined through the estimation process. In short, the fewer observations a category has and/or the more the outcome varies for a category, the higher the regularization towards "the prior" or "grand mean".
3. The technique is applicable for both continuous and binomial targets. If the target is continuous, the encoder returns regularized difference of the observation's category from the global mean.

Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
If the target is binomial, the encoder returns regularized log odds per category.
In comparison to JamesSteinEstimator, this encoder utilizes generalized linear mixed models from statsmodels library.
    
Roughly speaking, you can just regard this method as applying “linear regression” on target encoding.
The advantage of using GLMM is that we don’t need to adjust any hyperparameter, and we can still obtain a bunch of robust values.

However, this method is more time-consuming compared with other methods.

In [110]:
glmm_encoder = GLMMEncoder(cols=["month"], binomial_target=True)
# binomial_target = True (for Classification)
# binomial_target = False (for Regression)
glmm_encoder.fit(train["month"], target)
X_encoded = glmm_encoder.transform(train["month"])

In [111]:
pd.concat([train["month"], X_encoded], axis = 1)

Unnamed: 0,month,month.1
0,2,-0.372030
1,8,0.035854
2,2,-0.372030
3,1,-0.311873
4,8,0.035854
...,...,...
299995,8,0.035854
299996,2,-0.372030
299997,9,0.115373
299998,8,0.035854


# FAQ: <a name="faq"></a>

## Which method should I use?

There is no single method that works for every problem or dataset. You may have to try a few to see which gives a better result. The general guideline is to refer to the cheat sheet shown at the end of the article.

## How do I create categorical encoding for a situation where there is no info about target in test data?

We need to use the mapping values created at training time. This is the same as scaling or normalization, where the parameters learned from the train data are applied to the test data. The same mapped values are then used in test-time pre-processing. We can even create a dictionary that maps each category to its encoded value and reuse that dictionary at test time.
The encoder fitted on the train data must be used on the test / scoring data sets.
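
A minimal sketch of the dictionary approach, using a plain mean-target mapping. The column names, the toy data, and the choice to fall back to the global train mean for unseen categories are assumptions made for the example.

```python
import pandas as pd

# Toy train data with a target, and test data without one.
train_df = pd.DataFrame({"city": ["A", "A", "B", "B", "B"],
                         "target": [1, 0, 1, 1, 0]})
test_df = pd.DataFrame({"city": ["A", "B", "C"]})

# Learn the mapping on the train data only.
global_mean = train_df["target"].mean()
mapping = train_df.groupby("city")["target"].mean().to_dict()

# Reuse the stored mapping at test time; "C" was never seen, so it gets the fallback.
test_df["city_encoded"] = test_df["city"].map(mapping).fillna(global_mean)
print(test_df)
```

The encoders from `category_encoders` follow the same pattern through `fit` / `fit_transform` on the train data and `transform` on the test data, as in the cells above.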

# Conclusion <a name="conclusion"></a>

It is essential to understand that none of these encodings works well in every situation, for every dataset, and for every machine learning model. Data scientists still need to experiment and find out which works best for their specific case. If the test data contains classes that are absent from the train data, some of these methods won’t work, as the features won’t match. There are a few benchmark publications by research communities, but they are not conclusive about which encoding works best.

My recommendation is to try each of these on smaller datasets and then decide where to focus on tuning the encoding process.

# Validation <a name="validation"></a>

Validation is done with a single logistic regression (LR), with LR under cross-validation (CV), and with XGBoost under CV.

- More folds tend to give a better score (in my experience)
- You can try another solver and other parameters

### Single LR

`SumEncoder`, `HelmertEncoder`, `OneHotEncoder`, and `GLMMEncoder` cannot be used on the whole dataset because of insufficient RAM: the dimensionality explodes. For one-hot encoding, the code below falls back to a sparse `pd.get_dummies` matrix instead.

In [156]:
%%time
import gc
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score as auc
from sklearn.linear_model import LogisticRegression

encoder_list = [OrdinalEncoder(), WOEEncoder(), TargetEncoder(), MEstimateEncoder(), 
                JamesSteinEncoder(), LeaveOneOutEncoder(), CatBoostEncoder(), 
               HashingEncoder(), OneHotEncoder()]

X_train, X_val, y_train, y_val = train_test_split(train, target, test_size=0.2, random_state=97)

results_LR_no_CV = list()

for encoder in encoder_list:
    label = str(encoder).split('(')[0]
    print("Test {} : ".format(label), end=" ")
        
    if label == "OneHotEncoder":
        traintest = pd.concat([X_train, X_val])
        dummies = pd.get_dummies(traintest, columns=traintest.columns, drop_first=True, sparse=True)
        
        train_enc = dummies.iloc[:X_train.shape[0], :]
        train_enc = train_enc.sparse.to_coo().tocsr()
        
        val_enc = dummies.iloc[X_train.shape[0]:, :]
        val_enc = val_enc.sparse.to_coo().tocsr()
    else: 
        train_enc = encoder.fit_transform(X_train[feature_list], y_train)
        val_enc = encoder.transform(X_val[feature_list])
        

    lr = LogisticRegression(C=0.1, solver="lbfgs", max_iter=1000)
    lr.fit(train_enc, y_train)
    lr_pred = lr.predict_proba(val_enc)[:, 1]
    score = auc(y_val, lr_pred)
    print("score: ", score)
    
    result = {label : score}
    
    results_LR_no_CV.append(result)
    
    del train_enc
    del val_enc
    gc.collect()



Test OrdinalEncoder :  score:  0.6051903651089664
Test WOEEncoder :  score:  0.7814069711093441
Test TargetEncoder :  score:  0.7853818959187079
Test MEstimateEncoder :  score:  0.7791367306799639
Test JamesSteinEncoder :  score:  0.7720013931096443
Test LeaveOneOutEncoder :  score:  0.7957331830901235
Test CatBoostEncoder :  score:  0.7916683715676662
Test HashingEncoder :  score:  0.6111714781778492
Test OneHotEncoder :  score:  0.8023432336810412
CPU times: total: 12min 18s
Wall time: 8min 2s


In [205]:
results_LR_no_CV = pd.DataFrame([list(x.values()) for x in results_LR_no_CV], index = [list(x.keys())[0] for x in results_LR_no_CV])
results_LR_no_CV.columns = ["auc"]

In [208]:
results_LR_no_CV

Unnamed: 0,auc
OrdinalEncoder,0.60519
WOEEncoder,0.781407
TargetEncoder,0.785382
MEstimateEncoder,0.779137
JamesSteinEncoder,0.772001
LeaveOneOutEncoder,0.795733
CatBoostEncoder,0.791668
HashingEncoder,0.611171
OneHotEncoder,0.802343


### LR with CrossValidation

In [249]:
%%time
from sklearn.model_selection import KFold
import numpy as np

# CV function original : @Peter Hurford : Why Not Logistic Regression? https://www.kaggle.com/peterhurford/why-not-logistic-regression

def run_cv_model(train, target, model_fn, params={}, label='model'):
    kf = KFold(n_splits=5)
    fold_splits = kf.split(train, target)

    cv_scores = []
    pred_train = np.zeros((train.shape[0]))
    i = 1
    for dev_index, val_index in fold_splits:
#         print('Started {} fold {}/5'.format(label, i))
        
        if label == "OneHotEncoder":
            dev_X, val_X = train[dev_index], train[val_index]
        else: 
            dev_X, val_X = train.iloc[dev_index], train.iloc[val_index]

        dev_y, val_y = target[dev_index], target[val_index]
        pred_val_y = model_fn(dev_X, dev_y, val_X, val_y, params)
        pred_train[val_index] = pred_val_y
        cv_score = auc(val_y, pred_val_y)
        cv_scores.append(cv_score)
        print(label + ' cv score {}: {}'.format(i, cv_score))
        i += 1
        
#     print('{} cv scores : {}'.format(label, cv_scores))
    print('{} cv mean score : {}'.format(label, np.mean(cv_scores)))
#     print('{} cv std score : {}'.format(label, np.std(cv_scores)))
    results = {'label': label, 'train': pred_train, 'cv': cv_scores}
    return results


def runLR(train_X, train_y, test_X, test_y, params):
    model = LogisticRegression(**params)
    model.fit(train_X, train_y)
    pred_test_y = model.predict_proba(test_X)[:, 1]
    return pred_test_y


CPU times: total: 0 ns
Wall time: 0 ns


In [250]:
if TEST:

    lr_params = {'solver': 'lbfgs', 'C': 0.1}

    results = list()

    encoder_list = [OrdinalEncoder(), WOEEncoder(), TargetEncoder(), MEstimateEncoder(), 
                JamesSteinEncoder(), LeaveOneOutEncoder(), CatBoostEncoder(), 
               HashingEncoder(), OneHotEncoder()]
    
    for encoder in encoder_list:
        label = str(encoder).split('(')[0]
        print("Test {} : ".format(label), end=" ")
        
        if label == "OneHotEncoder":
            train_enc = pd.get_dummies(train[feature_list], columns=train[feature_list].columns, drop_first=True, sparse=True)
            train_enc = train_enc.sparse.to_coo().tocsr()
        else: 
            train_enc = encoder.fit_transform(train[feature_list], target)

        result = run_cv_model(train_enc, target, runLR, lr_params, str(encoder).split('(')[0])
        results.append(result)
        print("---------------------------------------")

Test OrdinalEncoder :  OrdinalEncoder cv score 1: 0.5463535542979797
OrdinalEncoder cv score 2: 0.5555460245407903
OrdinalEncoder cv score 3: 0.5794534024913255
OrdinalEncoder cv score 4: 0.5776819583084554
OrdinalEncoder cv score 5: 0.5401276800700991
OrdinalEncoder cv mean score : 0.5598325239417299
---------------------------------------
Test WOEEncoder :  WOEEncoder cv score 1: 0.8295985620614866
WOEEncoder cv score 2: 0.8276777900970927
WOEEncoder cv score 3: 0.8343622218375102
WOEEncoder cv score 4: 0.8315850232200183
WOEEncoder cv score 5: 0.8307571758201473
WOEEncoder cv mean score : 0.8307961546072511
---------------------------------------
Test TargetEncoder :  TargetEncoder cv score 1: 0.8205863085284983
TargetEncoder cv score 2: 0.822410137469694
TargetEncoder cv score 3: 0.8291619396291113
TargetEncoder cv score 4: 0.8211984279335196
TargetEncoder cv score 5: 0.8260040789298855
TargetEncoder cv mean score : 0.8238721784981419
---------------------------------------
Test ME

In [251]:
results = pd.DataFrame(results)
results['train_mean'] = results['train'].apply(lambda l : np.mean(l))
results['cv_mean'] = results['cv'].apply(lambda l : np.mean(l))
results['cv_std'] = results['cv'].apply(lambda l : np.std(l))

In [266]:
results.loc[:, ["label", "train_mean", "cv_mean", "cv_std"]]

Unnamed: 0,label,train_mean,cv_mean,cv_std
0,OrdinalEncoder,0.313107,0.559833,0.016074
1,WOEEncoder,0.305869,0.830796,0.002214
2,TargetEncoder,0.307218,0.823872,0.003244
3,MEstimateEncoder,0.307542,0.824774,0.002931
4,JamesSteinEncoder,0.303279,0.823625,0.005343
5,LeaveOneOutEncoder,0.306003,0.789534,0.00247
6,CatBoostEncoder,0.307326,0.772151,0.021332
7,HashingEncoder,0.30588,0.614298,0.002283
8,OneHotEncoder,0.30572,0.803166,0.002456


### XGBOOST with CrossValidation

Native XGB CV Method

In [253]:
%%time
import gc
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score as auc

encoder_list = [OrdinalEncoder(), WOEEncoder(), TargetEncoder(), MEstimateEncoder(), 
                JamesSteinEncoder(), LeaveOneOutEncoder(), CatBoostEncoder(), 
                HashingEncoder(), OneHotEncoder()]

X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.15, random_state=97)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.15, random_state=97)

params = {"booster": "gbtree",
         "objective": "binary:logistic",
         "eval_metric": 'logloss',
         "eta": 0.2,
         "max_depth": 6,
         "min_child_weight": 2,
         "subsample": 0.8,
         "colsample_bytree": 0.8
        }

results_xgb = []

for encoder in encoder_list:
    label = str(encoder).split('(')[0]
    print("Test {} : ".format(label), end=" ")
    
    if label == "OneHotEncoder":
        traintest = pd.concat([X_train, X_val, X_test])
        dummies = pd.get_dummies(traintest, columns=traintest.columns, drop_first=True, sparse=True)
        
        train_enc = dummies.iloc[:X_train.shape[0], :]
        train_enc = train_enc.sparse.to_coo().tocsr()
        
        val_enc = dummies.iloc[X_train.shape[0]:X_train.shape[0] + X_val.shape[0], :]
        val_enc = val_enc.sparse.to_coo().tocsr()
        
        test_enc = dummies.iloc[X_train.shape[0] + X_val.shape[0]:, :]
        test_enc = test_enc.sparse.to_coo().tocsr()
    else: 
        train_enc = encoder.fit_transform(X_train[feature_list], y_train)
        val_enc = encoder.transform(X_val[feature_list])
        test_enc = encoder.transform(X_test[feature_list])
        
    
    dtrain = xgb.DMatrix(train_enc, label=list(y_train))
    dval = xgb.DMatrix(val_enc, label=list(y_val))
    dtest = xgb.DMatrix(test_enc, label=list(y_test))
    
    evallist = [(dtrain, 'train'), (dval, 'eval')]

    xgb_cv = xgb.cv(dtrain=dtrain, 
                    params=params, 
                    nfold=5,
                    num_boost_round = 600,
                    early_stopping_rounds=10, 
                    metrics="auc", 
                    as_pandas=True, 
                    seed=123)
    
    score = xgb_cv.loc[np.argmax(xgb_cv["test-auc-mean"]), :]
    result = {'label': label}
    result.update(dict(score))

    print("score: ", score["test-auc-mean"])

    random.seed(1)
    bst = xgb.train(params,
                    dtrain,
                    num_boost_round = np.argmax(xgb_cv["test-auc-mean"]),
                    maximize = False)

    y_test_pred = bst.predict(dtest)
    score_on_test = {"score_on_test" : auc(y_test, y_test_pred)}
    
    result.update(score_on_test)
    results_xgb.append(result)
    
    del train_enc
    del dtrain
    del val_enc
    del dval
    del test_enc
    del dtest
    
    gc.collect()

Test OrdinalEncoder :  score:  0.7658873720901466
Test WOEEncoder :  score:  0.8332229230591096
Test TargetEncoder :  score:  0.8298533675227574
Test MEstimateEncoder :  score:  0.8342752269196391
Test JamesSteinEncoder :  score:  0.8346933985034319
Test LeaveOneOutEncoder :  score:  1.0
Test CatBoostEncoder :  score:  0.7812759348108153
Test HashingEncoder :  score:  0.6223358105968899
Test OneHotEncoder :  score:  0.7743959487540527
CPU times: total: 3h 33min 9s
Wall time: 37min 36s


In [256]:
results_xgb = pd.DataFrame(results_xgb)

In [257]:
results_xgb

Unnamed: 0,label,train-auc-mean,train-auc-std,test-auc-mean,test-auc-std,score_on_test
0,OrdinalEncoder,0.856113,0.000989,0.765887,0.002421,0.764593
1,WOEEncoder,0.864339,0.000446,0.833223,0.002625,0.772963
2,TargetEncoder,0.862202,0.000437,0.829853,0.002259,0.778668
3,MEstimateEncoder,0.86547,0.000552,0.834275,0.002081,0.770577
4,JamesSteinEncoder,0.866362,0.000638,0.834693,0.002009,0.767493
5,LeaveOneOutEncoder,1.0,0.0,1.0,0.0,0.518845
6,CatBoostEncoder,0.826444,0.000458,0.781276,0.00177,0.788501
7,HashingEncoder,0.63969,0.000745,0.622336,0.002469,0.616188
8,OneHotEncoder,0.857592,0.000232,0.774396,0.002149,0.773301


## Save Output <a name="output"></a>

Even cross-validation did not solve the overfitting problem of the target-based encoders.
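
A possible contributor, visible in the LR cross-validation cell above, is that each encoder is fitted on the whole training set before the folds are split, so every validation fold has already influenced its own encoded values. The sketch below shows the alternative of fitting the encoding inside each fold. It is an illustration under assumptions, not a rerun of the experiments: the name `out_of_fold_target_encode` is hypothetical, the folds are simple contiguous blocks, and the encoding is a smoothed target mean rather than one of the library encoders.

```python
import numpy as np
import pandas as pd


def out_of_fold_target_encode(x: pd.Series, y: pd.Series, n_splits: int = 5,
                              smoothing: float = 1.0) -> pd.Series:
    """Encode each fold with statistics computed on the other folds only (hypothetical helper)."""
    x = x.reset_index(drop=True)
    y = y.reset_index(drop=True)
    encoded = pd.Series(np.nan, index=x.index, dtype=float)
    for val_idx in np.array_split(np.arange(len(x)), n_splits):
        dev_mask = np.ones(len(x), dtype=bool)
        dev_mask[val_idx] = False  # rows of the current validation fold are hidden
        prior = y[dev_mask].mean()
        stats = y[dev_mask].groupby(x[dev_mask]).agg(["sum", "count"])
        mapping = (stats["sum"] + smoothing * prior) / (stats["count"] + smoothing)
        # Categories unseen in the other folds fall back to the prior.
        encoded.iloc[val_idx] = x.iloc[val_idx].map(mapping).fillna(prior).to_numpy()
    return encoded


x_toy = pd.Series(["a", "a", "b", "b", "c", "z"])
y_toy = pd.Series([1, 1, 0, 0, 1, 1])
oof_toy = out_of_fold_target_encode(x_toy, y_toy, n_splits=3)
print(oof_toy)
```

In this toy run, every category appears in a single fold, so no row can see its own target: each value is just the prior of the remaining folds.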

In [259]:
results_LR_no_CV

Unnamed: 0,auc
OrdinalEncoder,0.60519
WOEEncoder,0.781407
TargetEncoder,0.785382
MEstimateEncoder,0.779137
JamesSteinEncoder,0.772001
LeaveOneOutEncoder,0.795733
CatBoostEncoder,0.791668
HashingEncoder,0.611171
OneHotEncoder,0.802343


In [269]:
results.loc[:, ["label", "train_mean", "cv_mean", "cv_std"]]

Unnamed: 0,label,train_mean,cv_mean,cv_std
0,OrdinalEncoder,0.313107,0.559833,0.016074
1,WOEEncoder,0.305869,0.830796,0.002214
2,TargetEncoder,0.307218,0.823872,0.003244
3,MEstimateEncoder,0.307542,0.824774,0.002931
4,JamesSteinEncoder,0.303279,0.823625,0.005343
5,LeaveOneOutEncoder,0.306003,0.789534,0.00247
6,CatBoostEncoder,0.307326,0.772151,0.021332
7,HashingEncoder,0.30588,0.614298,0.002283
8,OneHotEncoder,0.30572,0.803166,0.002456


In [278]:
results[["label", "train_mean", "cv_mean", "cv_std"]].sort_values("cv_mean", ascending = False)

Unnamed: 0,label,train_mean,cv_mean,cv_std
1,WOEEncoder,0.305869,0.830796,0.002214
3,MEstimateEncoder,0.307542,0.824774,0.002931
2,TargetEncoder,0.307218,0.823872,0.003244
4,JamesSteinEncoder,0.303279,0.823625,0.005343
8,OneHotEncoder,0.30572,0.803166,0.002456
5,LeaveOneOutEncoder,0.306003,0.789534,0.00247
6,CatBoostEncoder,0.307326,0.772151,0.021332
7,HashingEncoder,0.30588,0.614298,0.002283
0,OrdinalEncoder,0.313107,0.559833,0.016074


In [284]:
results_xgb[["label", "train-auc-mean", "test-auc-mean", "score_on_test"]].sort_values("score_on_test", ascending = False)

Unnamed: 0,label,train-auc-mean,test-auc-mean,score_on_test
6,CatBoostEncoder,0.826444,0.781276,0.788501
2,TargetEncoder,0.862202,0.829853,0.778668
8,OneHotEncoder,0.857592,0.774396,0.773301
1,WOEEncoder,0.864339,0.833223,0.772963
3,MEstimateEncoder,0.86547,0.834275,0.770577
4,JamesSteinEncoder,0.866362,0.834693,0.767493
0,OrdinalEncoder,0.856113,0.765887,0.764593
7,HashingEncoder,0.63969,0.622336,0.616188
5,LeaveOneOutEncoder,1.0,1.0,0.518845


In [268]:
results.loc[:, ["label", "train_mean", "cv_mean", "cv_std"]].to_csv("LR_cv.csv")
results_xgb.to_csv("XGB_cv.csv")

In [None]:
# if SAVE_RESULTS:
#     for idx, label in enumerate(results['label']):
#         sub_df = pd.DataFrame({'id': test_id, 'target' : results.iloc[idx]['test']})
#         sub_df.to_csv("LR_{}.csv".format(label), index=False)