# Introduction

Recently, an article about the mRMR method for feature selection on
[Towards Data Science](https://towardsdatascience.com/mrmr-explained-exactly-how-you-wished-someone-explained-to-you-9cf4ed27458b) caught my attention. I started to investigate the [git repo from smazzanti](https://github.com/smazzanti/mrmr), which can
be installed with
``` bash
pip install mrmr_selection
```

Because it can process even categorical variables, I was curious which method it uses.
In the source code, there are three categorical encoders from the package
``` bash
pip install category-encoders
```

These are Leave-One-Out, James-Stein, and Target Encoder. The feature selection itself worked quite
well on my work dataset, which I know inside out. So I created this workshop to summarize
my knowledge about encoding categorical variables.

# Sources
I'll try to list all the resources I used for this tutorial; if I missed one, I apologize to
its author. There is a lot of information on the internet, and it is easy to forget a citation.

- [Benchmarking Categorical Encoders](https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8)

- [CategoricalEncodingBenchmark](https://github.com/DenisVorotyntsev/CategoricalEncodingBenchmark)

- no feature preprocessing
- Use KFold(5) for CV (more folds tend to give a better score)
- LR (C=0.1, solver=lbfgs)
- RF
- XGB

|Encoder|LB Score|
|-|-|
|TE|0.78018|
|WOE|0.78861|
|LOOE|0.79382|
|James-Stein|0.77843|
|Catboost|0.79164|
|One-Hot|0.77973|



### Category-Encoders

1. Label Encoder
2. One-Hot Encoder
3. Sum Encoder
4. Helmert Encoder
5. Frequency Encoder
6. Target Encoder
7. M-Estimate Encoder
8. Weight Of Evidence Encoder
9. James-Stein Encoder
10. Leave-one-out Encoder
11. Catboost Encoder
---
- Validation (Benchmark)
    - single LR
    - LR with Cross Validation


## Category-Encoders 

A set of scikit-learn-style transformers that encode categorical variables into numeric values using different techniques.

In [None]:
# If you want to test this on your local notebook
# http://contrib.scikit-learn.org/categorical-encoding/
# !pip install category-encoders

In [1]:
import os

import pandas as pd

from category_encoders.ordinal import OrdinalEncoder
from category_encoders.woe import WOEEncoder
from category_encoders.target_encoder import TargetEncoder
from category_encoders.sum_coding import SumEncoder
from category_encoders.m_estimate import MEstimateEncoder
from category_encoders.leave_one_out import LeaveOneOutEncoder
from category_encoders.helmert import HelmertEncoder
from category_encoders.cat_boost import CatBoostEncoder
from category_encoders.james_stein import JamesSteinEncoder
from category_encoders.one_hot import OneHotEncoder

TEST = False

Read the CSV files and do some preprocessing.

In [2]:
%%time
train = pd.read_csv('../../../data/cat-in-the-dat/train.csv')
test = pd.read_csv('../../../data/cat-in-the-dat/test.csv')
target = train['target']
train_id = train['id']
test_id = test['id']
train.drop(['target', 'id'], axis=1, inplace=True)
test.drop('id', axis=1, inplace=True)

CPU times: user 646 ms, sys: 98.6 ms, total: 745 ms
Wall time: 950 ms


In [3]:
train.shape

(300000, 23)

In [4]:
train

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,T,Y,Green,Triangle,Snake,Finland,Bassoon,...,c389000ab,2f4cb3d51,2,Grandmaster,Cold,h,D,kr,2,2
1,0,1,0,T,Y,Green,Trapezoid,Hamster,Russia,Piano,...,4cd920251,f83c56c21,1,Grandmaster,Hot,a,A,bF,7,8
2,0,0,0,F,Y,Blue,Trapezoid,Lion,Russia,Theremin,...,de9c9f684,ae6800dd0,1,Expert,Lava Hot,h,R,Jc,7,2
3,0,1,0,F,Y,Red,Trapezoid,Snake,Canada,Oboe,...,4ade6ab69,8270f0d71,1,Grandmaster,Boiling Hot,i,D,kW,2,1
4,0,0,0,F,N,Red,Trapezoid,Lion,Canada,Oboe,...,cb43ab175,b164b72a7,1,Grandmaster,Freezing,a,R,qP,7,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,T,N,Red,Trapezoid,Snake,India,Oboe,...,7508f4ef1,e027decef,1,Contributor,Freezing,k,K,dh,3,8
299996,0,0,0,F,Y,Green,Trapezoid,Lion,Russia,Piano,...,397dd0274,80f1411c8,2,Novice,Freezing,h,W,MO,3,2
299997,0,0,0,F,Y,Blue,Star,Axolotl,Russia,Oboe,...,5d7806f53,314dcc15b,3,Novice,Boiling Hot,o,A,Bn,7,9
299998,0,1,0,F,Y,Green,Square,Axolotl,Costa Rica,Piano,...,1f820c7ce,ab0ce192b,1,Master,Boiling Hot,h,W,uJ,3,8


In [5]:
feature_list = list(train.columns)  # you can customize this later

### Notation

- $y$ and $y^+$: the total number of observations and the total number of positive observations ($y=1$);
- $x_i$, $y_i$: the i-th value of the category and of the target;
- $n$ and $n^+$: the number of observations and the number of positive observations ($y=1$) for a given value of a categorical column;
- $a$: a regularization hyperparameter (selected by the user);
- prior: the average value of the target.
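
To make the notation concrete, here is a small sketch on a made-up column. The `toy` frame and its values are purely illustrative and are not taken from the competition data.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and a binary target.
toy = pd.DataFrame({
    "color": ["Red", "Red", "Blue", "Green", "Blue", "Red"],
    "target": [1, 0, 1, 0, 0, 1],
})

y_total = len(toy)                 # y: total number of observations
y_pos = int(toy["target"].sum())   # y+: total number of positive observations
prior = y_pos / y_total            # prior: average value of the target

# n and n+ for every value of the categorical column
stats = toy.groupby("color")["target"].agg(n="count", n_pos="sum")

print(y_total, y_pos, prior)
print(stats)
```

Every target-based encoder later in this workshop is some function of these few numbers.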

## 1. Label Encoder (LE), Ordinal Encoder (OE)

One of the most common encoding methods.

It converts categorical data into numbers: a column with M categories becomes
a single column holding M distinct integers.

The disadvantage is that the labels are assigned arbitrarily (in the order in which the values appear in the data),
which can add noise by imposing an unintended order between labels. In other words, the data
becomes ordinal (ordered) data, which can lead to unintended consequences.
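
The ordering problem can be illustrated with a few values of `ord_2`. This is only a sketch: the `temperature_order` list is my own assumption about the natural order of the categories, and I use plain pandas here rather than any of the encoders.

```python
import pandas as pd

# A few ord_2 values in the order they appear in the data.
ord_2 = pd.Series(["Cold", "Hot", "Lava Hot", "Boiling Hot", "Freezing"])

# Labels by order of appearance: Cold=0, Hot=1, ... (ignores temperature).
appearance_codes, _ = pd.factorize(ord_2)

# Assumed natural order, written by hand (an assumption, not read from the data).
temperature_order = ["Freezing", "Cold", "Warm", "Hot", "Boiling Hot", "Lava Hot"]
ordered_codes = pd.Categorical(
    ord_2, categories=temperature_order, ordered=True
).codes

print(list(appearance_codes))
print(list(ordered_codes))
```

With the order of appearance, "Freezing" ends up with the largest label, while the hand-written order gives it the smallest one. A linear model treats those two encodings very differently.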

LabelEncoder and OrdinalEncoder do the same thing; the difference is that
LabelEncoder works on a 1D array of shape `(n,)`, while OrdinalEncoder works on a 2D array of shape `(n_samples,
m_columns)`. Historically, LabelEncoder was meant for encoding the target (hence the 1D input).

The code is very simple, and when you encode a specific column you can proceed as follows:

``` python
from sklearn.preprocessing import LabelEncoder
label = LabelEncoder()

train[column_name] = label.fit_transform(train[column_name])
```

If you use `Category-Encoders` it will look like this code below.

In [6]:
%%time
LE_encoder = OrdinalEncoder(feature_list)
train_le = LE_encoder.fit_transform(train)
test_le = LE_encoder.transform(test)

CPU times: user 2.31 s, sys: 242 ms, total: 2.56 s
Wall time: 2.8 s


In [7]:
train_le.shape

(300000, 23)

In [8]:
train_le

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,1,1,1,1,1,1,1,...,1,1,2,1,1,1,1,1,2,2
1,0,1,0,1,1,1,2,2,2,2,...,2,2,1,1,2,2,2,2,7,8
2,0,0,0,2,1,2,2,3,2,3,...,3,3,1,2,3,1,3,3,7,2
3,0,1,0,2,1,3,2,1,3,4,...,4,4,1,1,4,3,1,4,2,1
4,0,0,0,2,2,3,2,3,3,4,...,5,5,1,1,5,2,3,5,7,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,1,2,3,2,1,6,4,...,2110,5077,1,4,5,9,6,27,3,8
299996,0,0,0,2,1,1,2,3,2,2,...,1288,7421,2,3,5,1,21,150,3,2
299997,0,0,0,2,1,2,5,6,2,4,...,44,4958,3,3,4,13,2,75,7,9
299998,0,1,0,2,1,1,4,6,4,2,...,792,9383,1,5,4,1,21,146,3,8


## 2. One-Hot Encoder (OHE, dummy encoder)


So what can you do to represent each category on its own instead of imposing an order?

You can create one column per category value. For M categories, OHE
creates M columns that are zero everywhere except in the rows that have the given category.

Since only the matching column is set to 1 in each row, it is called one-hot encoding. It is also
called dummy encoding, because it creates dummy variables. Alternatively, you can drop
the first category (or the most frequent one) to reduce the dimensionality. Some models cannot cope
with the full M columns, which are collinear (they always sum to 1).
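
A minimal sketch of the two variants on a made-up column (the `toy` frame is illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({"nom_0": ["Green", "Blue", "Red", "Green"]})

# Full one-hot encoding: M columns, exactly one of them set per row.
full = pd.get_dummies(toy, columns=["nom_0"])

# Dropping the first category leaves M - 1 columns;
# a row of all zeros then stands for the dropped category ("Blue").
reduced = pd.get_dummies(toy, columns=["nom_0"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Because the columns of `full` always sum to 1 in each row, they are collinear with an intercept term, which is what dropping one category avoids.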

I use pandas for this type of encoding because sklearn ran into memory problems.

``` python
traintest = pd.concat([train, test])
dummies = pd.get_dummies(traintest, columns=traintest.columns, drop_first=True, sparse=True)
train_ohe = dummies.iloc[:train.shape[0], :]
test_ohe = dummies.iloc[train.shape[0]:, :]
train_ohe = train_ohe.sparse.to_coo().tocsr()
test_ohe = test_ohe.sparse.to_coo().tocsr()
```

If you use `Category-Encoders` it will look like this code below.

In [None]:
# %%time
# This method did not work because it ran out of RAM,
# so we use pd.get_dummies instead.
# OHE_encoder = OneHotEncoder(feature_list)
# train_ohe = OHE_encoder.fit_transform(train)
# test_ohe = OHE_encoder.transform(test)

In [9]:
%%time
traintest = pd.concat([train, test])
dummies = pd.get_dummies(traintest, columns=traintest.columns, drop_first=True, sparse=True)
train_ohe = dummies.iloc[:train.shape[0], :]
test_ohe = dummies.iloc[train.shape[0]:, :]
train_ohe = train_ohe.sparse.to_coo().tocsr()
test_ohe = test_ohe.sparse.to_coo().tocsr()

CPU times: user 4.41 s, sys: 240 ms, total: 4.65 s
Wall time: 4.67 s


In [11]:
dummies.iloc[:train.shape[0], :]

Unnamed: 0,bin_0_1,bin_1_1,bin_2_1,bin_3_T,bin_4_Y,nom_0_Green,nom_0_Red,nom_1_Polygon,nom_1_Square,nom_1_Star,...,month_3,month_4,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12
0,0,0,0,1,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,1,0,1,1,1,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,1,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,1,0,0,1,0,0,0,...,0,0,0,0,0,1,0,0,0,0
299996,0,0,0,0,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
299997,0,0,0,0,1,0,0,0,0,1,...,0,0,0,0,0,0,1,0,0,0
299998,0,1,0,0,1,1,0,0,1,0,...,0,0,0,0,0,1,0,0,0,0


## 3. Sum Encoder (Deviation Encoder, Effect Encoder)

**Sum Encoder** compares the mean of the dependent variable (target) for a given level of a categorical column to the overall mean of the target. 

Sum Encoding is very similar to OHE and both of them are commonly used in Linear Regression (LR) types of models.
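
To show how close Sum Encoding is to OHE, here is a hand-rolled sketch that derives a sum coding from one-hot columns. It is not the `SumEncoder` implementation (which may, for example, add an intercept column or name columns differently); the `levels` series is made up.

```python
import pandas as pd

levels = pd.Series(["A", "B", "C", "A"], name="level")  # hypothetical column

dummies = pd.get_dummies(levels).astype(int)  # one-hot columns A, B, C

# Drop the last level and subtract its indicator from the remaining columns,
# so rows of the last level ("C") become -1 in every column.
sum_coded = dummies.iloc[:, :-1].sub(dummies.iloc[:, -1], axis=0)

print(sum_coded)
```

With this coding, a linear model's intercept plays the role of the overall level, and each coefficient measures how far one level deviates from it.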

If you use `Category-Encoders` it will look like this code below.

In [None]:
# %%time
# This method did not work because it ran out of RAM.
# SE_encoder =SumEncoder(feature_list)
# train_se = SE_encoder.fit_transform(train[feature_list], target)
# test_se = SE_encoder.transform(test[feature_list])

## 4. Helmert Encoder

**Helmert Encoding** is a third commonly used type of categorical encoding for regression along with OHE and Sum Encoding. 

It compares each level of a categorical variable to the mean of the subsequent levels. 

This type of encoding can be useful in certain situations where levels of the categorical variable are ordered. (not this dataset)
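
As a sketch of "each level versus the mean of the subsequent levels", the function below builds a forward Helmert contrast matrix by hand. The function name `forward_helmert` is my own, and the library may use a different scaling or the reverse variant, so treat this as an illustration of the idea only.

```python
from fractions import Fraction


def forward_helmert(n_levels):
    """Return one row per level, each with n_levels - 1 contrast columns."""
    matrix = []
    for level in range(n_levels):
        row = []
        for col in range(n_levels - 1):
            remaining = n_levels - col  # number of levels from `col` onward
            if level == col:
                # the level being compared
                row.append(Fraction(remaining - 1, remaining))
            elif level > col:
                # one of the subsequent levels it is compared against
                row.append(Fraction(-1, remaining))
            else:
                # earlier levels do not take part in this contrast
                row.append(Fraction(0))
        matrix.append(row)
    return matrix


helmert_4 = forward_helmert(4)
for row in helmert_4:
    print([str(value) for value in row])
```

Each column sums to zero, which is what makes it a contrast between one level and the levels after it.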

If you use `Category-Encoders` it will look like this code below.

In [None]:
# %%time
# This method did not work because it ran out of RAM.
# HE_encoder = HelmertEncoder(feature_list)
# train_he = HE_encoder.fit_transform(train[feature_list], target)
# test_he = HE_encoder.transform(test[feature_list])

## 5. Frequency Encoder

This method encodes each category by its frequency.

It creates a new feature holding the number of times each category occurs in the training data.

I will not apply it separately to this dataset.
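
Even though I skip it for this dataset, a frequency encoding takes only a few lines in pandas. The series below are made up, and mapping unseen test categories to 0 is my own choice for the sketch.

```python
import pandas as pd

train_col = pd.Series(["Red", "Blue", "Red", "Green", "Red", "Blue"])
test_col = pd.Series(["Blue", "Red", "Purple"])  # "Purple" never occurs in train

# Number of occurrences of each category in the training data.
counts = train_col.value_counts()

train_freq = train_col.map(counts)
# Categories that were not seen during training get a count of 0.
test_freq = test_col.map(counts).fillna(0).astype(int)

print(train_freq.tolist())
print(test_freq.tolist())
```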

## 6. Target Encoder

This encoder is widely used in many kernels.

The encoded category values are calculated according to the following formulas:

$$s = \frac{1}{1+\exp\left(-\frac{n-mdl}{a}\right)}$$

$$\hat{x}^k = prior * (1-s) + s * \frac{n^{+}}{n}$$

- $mdl$ means **'min data in leaf'**
- $a$ means **'smoothing parameter, strength of regularization'**
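
The two formulas above can be sketched in plain Python. The function name and the default values `mdl=1` and `a=1.0` are illustrative choices of mine, not necessarily the library defaults.

```python
import math


def target_encode_value(n, n_pos, prior, mdl=1, a=1.0):
    """Blend the category mean n_pos / n with the prior using the weight s."""
    s = 1.0 / (1.0 + math.exp(-(n - mdl) / a))
    return prior * (1.0 - s) + s * (n_pos / n)


prior = 0.3
# A category seen once: s = 0.5, so the value sits halfway between
# the prior and the category mean.
rare = target_encode_value(n=1, n_pos=1, prior=prior)
# A category seen 1000 times: s is practically 1, so the category mean wins.
frequent = target_encode_value(n=1000, n_pos=400, prior=prior)

print(rare, frequent)
```

Rare categories are pulled toward the prior, while frequent categories keep (almost exactly) their own mean target.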

Target Encoder is powerful, but it has a huge disadvantage:

> **target leakage**: it uses information about the target. 

To reduce the effect of target leakage, you can:

- Increase regularization
- Add random noise to the representation of the category in the train dataset (some sort of augmentation)
- Use double validation (fit the encoder on different data than the rows it encodes)
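
One way to read the last point is out-of-fold encoding: every training row is encoded with statistics computed on the other folds only. The helper below is a sketch under that assumption; `out_of_fold_target_mean` is a hypothetical name and it uses a plain category mean rather than any of the smoothed formulas.

```python
import numpy as np
import pandas as pd


def out_of_fold_target_mean(categories, target, n_splits=5, seed=0):
    """Encode each row with category means computed on the other folds only."""
    categories = pd.Series(categories).reset_index(drop=True)
    target = pd.Series(target, dtype=float).reset_index(drop=True)

    # Assign every row to one of n_splits folds at random.
    rng = np.random.default_rng(seed)
    fold_of_row = rng.permutation(len(target)) % n_splits

    encoded = pd.Series(np.nan, index=target.index, dtype=float)
    for fold in range(n_splits):
        in_fold = fold_of_row == fold
        # Prior and category means come from rows outside the current fold.
        prior = target[~in_fold].mean()
        means = target[~in_fold].groupby(categories[~in_fold]).mean()
        # Categories that are unseen outside the fold fall back to the prior.
        encoded[in_fold] = categories[in_fold].map(means).fillna(prior)
    return encoded


toy_cat = ["a", "a", "b", "b", "a", "b", "c", "a", "b", "a"]
toy_y = [1, 0, 1, 1, 0, 0, 1, 1, 0, 1]
oof = out_of_fold_target_mean(toy_cat, toy_y, n_splits=5)
print(oof.round(3).tolist())
```

No row ever sees its own target, so a model trained on `oof` cannot simply read the answer back from the encoded feature.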

Let's use it while being careful about overfitting.

If you use `Category-Encoders` it will look like this code below.

In [9]:
%%time

TE_encoder = TargetEncoder()
train_te = TE_encoder.fit_transform(train[feature_list], target)
test_te = TE_encoder.transform(test[feature_list])

train_te.head()

CPU times: user 10.2 s, sys: 1.67 s, total: 11.9 s
Wall time: 10.2 s


Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,0.302537,0.290107,0.327145,0.360978,0.307162,0.242813,0.237743,...,0.372694,0.368421,2,0.403885,0.257877,0.306993,0.208354,0.401186,2,2
1,0,1,0,0.302537,0.290107,0.327145,0.290054,0.359209,0.289954,0.304164,...,0.189189,0.076924,1,0.403885,0.326315,0.206599,0.186877,0.30388,7,8
2,0,0,0,0.309384,0.290107,0.24179,0.290054,0.293085,0.289954,0.353951,...,0.223022,0.172414,1,0.317175,0.403126,0.306993,0.351864,0.206843,7,2
3,0,1,0,0.309384,0.290107,0.351052,0.290054,0.307162,0.339793,0.329472,...,0.325123,0.227273,1,0.403885,0.360961,0.330148,0.208354,0.355985,2,1
4,0,0,0,0.309384,0.333773,0.351052,0.290054,0.293085,0.339793,0.329472,...,0.376812,0.2,1,0.403885,0.225214,0.206599,0.351864,0.404345,7,8


## 7. M-Estimate Encoder

**M-Estimate Encoder** is a **simplified version of Target Encoder**. It has only one hyperparameter, $m$. (Wrong formula, but it still did good work?!)

$$\hat{x}^k = \frac{n^+ + prior * m}{y^+ + m}$$

A higher value of $m$ results in stronger shrinking. Recommended values for $m$ are in the range of 1 to 100.

If you use `Category-Encoders` it will look like this code below.

In [10]:
%%time
MEE_encoder = MEstimateEncoder()
train_mee = MEE_encoder.fit_transform(train[feature_list], target)
test_mee = MEE_encoder.transform(test[feature_list])

CPU times: user 9.54 s, sys: 1.11 s, total: 10.7 s
Wall time: 9.51 s


In [11]:
train_mee.head()

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,0.302537,0.290107,0.327145,0.360976,0.307162,0.242815,0.237744,...,0.372448,0.365294,2,0.403884,0.257879,0.306993,0.208379,0.400998,2,2
1,0,1,0,0.302537,0.290107,0.327145,0.290055,0.359207,0.289954,0.304164,...,0.190231,0.093277,1,0.403884,0.326314,0.206602,0.186884,0.303881,7,8
2,0,0,0,0.309384,0.290107,0.241791,0.290055,0.293085,0.289954,0.35395,...,0.223319,0.176863,1,0.317175,0.403125,0.306993,0.351861,0.206881,7,2
3,0,1,0,0.309384,0.290107,0.351051,0.290055,0.307162,0.339792,0.329472,...,0.325029,0.22902,1,0.403884,0.36096,0.330147,0.208379,0.355965,2,1
4,0,0,0,0.309384,0.333773,0.351051,0.290055,0.293085,0.339792,0.329472,...,0.376471,0.202941,1,0.403884,0.225215,0.206602,0.351861,0.40431,7,8


- UPDATED: an error was found in the library, see https://github.com/scikit-learn-contrib/category_encoders/issues/200. The corrected formula is:

$$\hat{x}^k = \frac{n^+ + prior * m}{n + m}$$

Thanks to [@ansh422](https://www.kaggle.com/ansh422)
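
Here is a small sketch of the M-estimate, where I read the denominator of the corrected formula as the category count $n$ plus $m$. The function name and the numbers are illustrative.

```python
def m_estimate(n, n_pos, prior, m=1.0):
    """(n_pos + prior * m) / (n + m): the category mean shrunk toward the prior."""
    return (n_pos + prior * m) / (n + m)


prior = 0.3
# A category with 20 rows, 12 of them positive (category mean 0.6).
shrunk = {m: m_estimate(n=20, n_pos=12, prior=prior, m=m) for m in (1, 10, 100)}

for m, value in shrunk.items():
    print(m, round(value, 4))
```

As $m$ grows, the encoded value moves from the category mean (0.6) toward the prior (0.3), which is the "stronger shrinking" mentioned above.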

## 8. Weight of Evidence Encoder 

**Weight Of Evidence** is a commonly used target-based encoder in credit scoring. 

It is a measure of the “strength” of a grouping for separating good and bad risk (default). 

It is calculated from the basic odds ratio:

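``` python
import math

a = 0.30  # share of all good credit outcomes that fall in this group (illustrative value)
b = 0.20  # share of all bad credit outcomes that fall in this group (illustrative value)
WoE = math.log(a / b)
```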

However, using the formula as is might lead to **target leakage** (and overfitting).

To avoid that, a regularization parameter $a$ is introduced and WoE is calculated in the following way:

$$nominator = \frac{n^+ + a}{y^+ + 2*a}$$

$$denominator = \frac{n - n^+ + a}{y - y^+ + 2*a}$$

$$\hat{x}^k = ln(\frac{nominator}{denominator})$$
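
As a sketch, the regularized WoE can be written as the log of two smoothed shares: the category's share of all positives, $(n^+ + a)/(y^+ + 2a)$, over its share of all negatives, $(n - n^+ + a)/(y - y^+ + 2a)$. The function name and the default `a=1.0` are my own choices.

```python
import math


def woe(n, n_pos, y, y_pos, a=1.0):
    """Regularized weight of evidence for one category."""
    nominator = (n_pos + a) / (y_pos + 2 * a)            # share of positives
    denominator = (n - n_pos + a) / (y - y_pos + 2 * a)  # share of negatives
    return math.log(nominator / denominator)


# Overall: 1000 rows, 300 positives. Each category below has 100 rows.
same_rate = woe(n=100, n_pos=30, y=1000, y_pos=300, a=0)  # same rate as overall
more_pos = woe(n=100, n_pos=60, y=1000, y_pos=300)        # more positives
fewer_pos = woe(n=100, n_pos=10, y=1000, y_pos=300)       # fewer positives

print(same_rate, more_pos, fewer_pos)
```

A category with the same positive rate as the whole dataset gets a WoE of zero; the sign tells you whether it leans toward the positive or the negative class.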

If you use `Category-Encoders` it will look like this code below.

In [12]:
%%time
WOE_encoder = WOEEncoder()
train_woe = WOE_encoder.fit_transform(train[feature_list], target)
test_woe = WOE_encoder.transform(test[feature_list])

CPU times: user 9.59 s, sys: 1.16 s, total: 10.8 s
Wall time: 9.6 s


In [13]:
train_woe

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,-0.015794,-0.075416,0.098327,0.248359,0.006058,-0.317803,-0.345614,...,0.302749,0.333932,2,0.430146,-0.237516,0.005297,-0.514545,0.420532,2,2
1,0,1,0,-0.015794,-0.075416,0.098327,-0.075660,0.240683,-0.076148,-0.008087,...,-0.600377,-1.052362,1,0.430146,0.094611,-0.526005,-0.650766,-0.008737,7,8
2,0,0,0,0.016454,-0.075416,-0.323420,-0.075660,-0.060990,-0.076148,0.217747,...,-0.417323,-0.607677,1,0.052724,0.426997,0.005297,0.208660,-0.523234,7,2
3,0,1,0,0.016454,-0.075416,0.205037,-0.075660,0.006058,0.155252,0.108884,...,0.096879,-0.338013,1,0.430146,0.248265,0.111980,-0.514545,0.227089,2,1
4,0,0,0,0.016454,0.128285,0.205037,-0.075660,-0.060990,0.155252,0.108884,...,0.321353,-0.468414,1,0.430146,-0.416062,-0.526005,0.208660,0.432324,7,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,-0.015794,0.128285,0.205037,-0.075660,0.006058,0.249803,0.108884,...,-0.192161,0.190831,1,-0.132258,-0.416062,0.266667,-0.195294,0.033989,3,8
299996,0,0,0,0.016454,-0.075416,0.098327,-0.075660,-0.060990,-0.076148,-0.008087,...,-0.076648,-0.502316,2,-0.321986,-0.416062,0.005297,0.453411,-0.415855,3,2
299997,0,0,0,0.016454,-0.075416,-0.323420,0.022286,0.061193,-0.076148,0.108884,...,0.146495,0.030982,3,-0.321986,0.248265,0.552543,-0.650766,-0.705175,7,9
299998,0,1,0,0.016454,-0.075416,0.098327,0.151411,0.061193,0.041196,-0.008087,...,-0.125022,0.288812,1,0.222693,0.248265,0.005297,0.453411,0.572812,3,8


## 9. James-Stein Encoder

**James-Stein Encoder** is a target-based encoder.

The idea behind James-Stein Encoder is simple. The estimate of the mean target for category $k$ is a weighted average of the category mean and the global mean:

$$\hat{x}^k = (1-B) * \frac{n^+}{n} + B * \frac{y^+}{y} $$

One way to select B is to tune it like a hyperparameter via cross-validation, but Charles Stein came up with another solution to the problem:

$$B = \frac{Var[y^k]}{Var[y^k] + Var[y]}$$

This seems quite fair, but the James-Stein estimator has a big disadvantage: it is defined only for normally distributed targets (which is not the case for any classification task).

To work around that, we can either convert binary targets with a log-odds ratio, as is done in the WoE Encoder (this is the default because it is simple), or use a beta distribution.
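
A literal reading of the two formulas above can be sketched as follows. I take $Var[y^k]$ as the sample variance of the target inside the category and $Var[y]$ as the sample variance of the whole target; the library may apply further corrections, so this is only an illustration and the function name is hypothetical.

```python
from statistics import mean, variance


def james_stein_encode(category_targets, all_targets):
    """Return (encoded value, B) for one category; needs at least two rows."""
    category_targets = [float(v) for v in category_targets]
    all_targets = [float(v) for v in all_targets]
    var_k = variance(category_targets)  # Var[y^k]: spread inside the category
    var_y = variance(all_targets)       # Var[y]: spread of the whole target
    b = var_k / (var_k + var_y)
    encoded = (1 - b) * mean(category_targets) + b * mean(all_targets)
    return encoded, b


all_y = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]            # global mean 0.4
encoded, b = james_stein_encode([1, 1, 1, 0], all_y)  # category mean 0.75
pure_encoded, pure_b = james_stein_encode([1, 1, 1], all_y)

print(round(encoded, 4), round(b, 4))
print(pure_encoded, pure_b)
```

A noisy category is pulled toward the global mean, while a category whose targets are all identical has zero variance and keeps its own mean.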

If you use `Category-Encoders` it will look like this code below.

In [14]:
%%time
JSE_encoder = JamesSteinEncoder()
train_jse = JSE_encoder.fit_transform(train[feature_list], target)
test_jse = JSE_encoder.transform(test[feature_list])

CPU times: user 9.64 s, sys: 1.07 s, total: 10.7 s
Wall time: 9.65 s


In [15]:
train_jse

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,0.302537,0.290107,0.327145,0.343763,0.306777,0.260374,0.248201,...,0.337649,0.334882,2,0.377845,0.271531,0.306515,0.247588,0.351077,2,2
1,0,1,0,0.302537,0.290107,0.327145,0.294730,0.342564,0.294658,0.304449,...,0.238347,0.137804,1,0.377845,0.320078,0.243675,0.232549,0.304868,7,8
2,0,0,0,0.309384,0.290107,0.241790,0.294730,0.296876,0.294658,0.345642,...,0.260297,0.227178,1,0.314323,0.372130,0.306515,0.329955,0.249570,7,2
3,0,1,0,0.309384,0.290107,0.351052,0.294730,0.306777,0.329339,0.325462,...,0.315328,0.263300,1,0.377845,0.343752,0.319536,0.247588,0.330239,2,1
4,0,0,0,0.309384,0.333773,0.351052,0.294730,0.296876,0.329339,0.325462,...,0.339509,0.246247,1,0.377845,0.247048,0.243675,0.329955,0.352552,7,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,0.302537,0.333773,0.351052,0.294730,0.306777,0.343989,0.325462,...,0.279755,0.322701,1,0.285182,0.247048,0.338666,0.283590,0.309435,3,8
299996,0,0,0,0.309384,0.290107,0.327145,0.294730,0.296876,0.294658,0.304449,...,0.296697,0.242514,2,0.256848,0.247048,0.306515,0.358728,0.261030,3,2
299997,0,0,0,0.309384,0.290107,0.241790,0.309196,0.315031,0.294658,0.325462,...,0.320347,0.302973,3,0.256848,0.343752,0.374913,0.232549,0.229962,7,9
299998,0,1,0,0.309384,0.290107,0.327145,0.328749,0.315031,0.312025,0.304449,...,0.291684,0.331289,1,0.342313,0.343752,0.306515,0.358728,0.368007,3,8


## 10. Leave-one-out Encoder (LOO or LOOE)

**Leave-one-out Encoding** is another example of target-based encoders.

This encoder calculates the mean target of category $k$ for observation $i$ as if observation $i$ were removed from the dataset:

$$\hat{x}^k_i = \frac{\sum_{j \neq i} y_j * (x_j == k)}{\sum_{j \neq i} (x_j == k)}$$

While encoding the test dataset, a category is replaced with the mean target of the category k in the train dataset:

$$\hat{x}^k = \frac{\sum y_j * (x_j == k)  }{\sum x_j == k}$$
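
Both formulas can be sketched with a pandas `groupby` on a made-up frame. Removing row $i$ from its category is the same as subtracting $y_i$ from the category sum and 1 from the category count.

```python
import pandas as pd

toy = pd.DataFrame({
    "cat": ["a", "a", "a", "b", "b"],
    "y":   [1, 0, 1, 1, 0],
})

group = toy.groupby("cat")["y"]
cat_sum = group.transform("sum")      # sum of targets in the row's category
cat_count = group.transform("count")  # number of rows in the row's category

# Train rows: category mean computed without the row's own target.
toy["loo_train"] = (cat_sum - toy["y"]) / (cat_count - 1)

# Test rows: plain category mean from the training data.
test_encoding = toy.groupby("cat")["y"].mean()

print(toy)
print(test_encoding)
```

Note that a category with a single row would divide by zero here; a real encoder needs a fallback (for example the prior) for that case.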

If you use `Category-Encoders` it will look like this code below.

In [16]:
%%time
LOOE_encoder = LeaveOneOutEncoder()
train_looe = LOOE_encoder.fit_transform(train[feature_list], target)
test_looe = LOOE_encoder.transform(test[feature_list])

CPU times: user 9.96 s, sys: 1.06 s, total: 11 s
Wall time: 9.53 s


In [17]:
train_looe

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,0.302539,0.290108,0.327148,0.360990,0.307169,0.242820,0.237746,...,0.374074,0.388889,2,0.403890,0.257885,0.307005,0.208407,0.401980,2,2
1,0,1,0,0.302539,0.290108,0.327148,0.290057,0.359221,0.289957,0.304167,...,0.190909,0.083333,1,0.403890,0.326330,0.206605,0.186887,0.303997,7,8
2,0,0,0,0.309387,0.290108,0.241793,0.290057,0.293087,0.289957,0.353958,...,0.223827,0.178571,1,0.317188,0.403133,0.307005,0.351885,0.206923,7,2
3,0,1,0,0.309380,0.290103,0.351043,0.290047,0.307147,0.339780,0.329465,...,0.321782,0.209302,1,0.403877,0.360951,0.330124,0.208155,0.355736,2,1
4,0,0,0,0.309387,0.333776,0.351056,0.290057,0.293087,0.339800,0.329476,...,0.378641,0.205882,1,0.403890,0.225217,0.206605,0.351885,0.404487,7,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,0.302539,0.333776,0.351056,0.290057,0.307169,0.361323,0.329476,...,0.261905,0.348837,1,0.278540,0.225217,0.365225,0.266041,0.313116,3,8
299996,0,0,0,0.309387,0.290108,0.327148,0.290057,0.293087,0.289957,0.304167,...,0.289216,0.200000,2,0.242057,0.225217,0.307005,0.409526,0.225034,3,2
299997,0,0,0,0.309380,0.290103,0.241782,0.310612,0.318998,0.289947,0.329465,...,0.331034,0.275862,3,0.242049,0.360951,0.433607,0.186832,0.177994,7,9
299998,0,1,0,0.309380,0.290103,0.327140,0.338918,0.318998,0.314669,0.304155,...,0.275304,0.333333,1,0.355055,0.360951,0.306965,0.409417,0.437908,3,8


## 11. Catboost Encoder

**Catboost** is a recently created target-based categorical encoder. 

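It is intended to overcome the target leakage problems inherent in LOO. It does so by computing each row's value only from the rows that come before it in the dataset, which is why the first rows of the output below all equal the prior.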
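
The idea can be sketched as a running ("ordered") target statistic: each row is encoded from the targets of earlier rows in the same category, smoothed with the prior. The function name and the smoothing weight `a=1.0` are assumptions of mine, chosen because they reproduce the pattern in the output below (prior, then prior/2, and so on).

```python
def ordered_target_encode(categories, targets, prior, a=1.0):
    """Encode each row from the targets of earlier rows in the same category."""
    running_sum = {}
    running_count = {}
    encoded = []
    for cat, y in zip(categories, targets):
        s = running_sum.get(cat, 0.0)
        c = running_count.get(cat, 0)
        encoded.append((s + prior * a) / (c + a))
        # Only now does the row's own target become visible to later rows.
        running_sum[cat] = s + y
        running_count[cat] = c + 1
    return encoded


cats = ["T", "T", "F", "F", "F"]
ys = [0, 0, 0, 0, 1]
prior = sum(ys) / len(ys)
cbe = ordered_target_encode(cats, ys, prior)
print(cbe)
```

Because the result depends on the row order, shuffling the training data changes the encoding, which is worth keeping in mind when the data is sorted in some meaningful way.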

If you use `Category-Encoders` it will look like this code below.

In [18]:
%%time
CBE_encoder = CatBoostEncoder()
train_cbe = CBE_encoder.fit_transform(train[feature_list], target)
test_cbe = CBE_encoder.transform(test[feature_list])

CPU times: user 13.9 s, sys: 1.24 s, total: 15.1 s
Wall time: 13.3 s


In [19]:
train_cbe

Unnamed: 0,bin_0,bin_1,bin_2,bin_3,bin_4,nom_0,nom_1,nom_2,nom_3,nom_4,...,nom_8,nom_9,ord_0,ord_1,ord_2,ord_3,ord_4,ord_5,day,month
0,0,0,0,0.305880,0.305880,0.305880,0.305880,0.305880,0.305880,0.305880,...,0.305880,0.305880,2,0.305880,0.305880,0.305880,0.305880,0.305880,2,2
1,0,1,0,0.152940,0.152940,0.152940,0.305880,0.305880,0.305880,0.305880,...,0.305880,0.305880,1,0.152940,0.305880,0.305880,0.305880,0.305880,7,8
2,0,0,0,0.305880,0.101960,0.305880,0.152940,0.305880,0.152940,0.305880,...,0.305880,0.305880,1,0.305880,0.305880,0.152940,0.305880,0.305880,7,2
3,0,1,0,0.152940,0.076470,0.305880,0.101960,0.152940,0.305880,0.305880,...,0.305880,0.305880,1,0.101960,0.305880,0.305880,0.152940,0.305880,2,1
4,0,0,0,0.435293,0.305880,0.652940,0.326470,0.152940,0.652940,0.652940,...,0.305880,0.305880,1,0.326470,0.305880,0.152940,0.152940,0.305880,7,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
299995,0,0,0,0.302539,0.333776,0.351056,0.290063,0.307169,0.361322,0.329468,...,0.262927,0.347861,1,0.278547,0.225222,0.365223,0.266043,0.313113,3,8
299996,0,0,0,0.309379,0.290102,0.327142,0.290060,0.293088,0.289953,0.304159,...,0.289297,0.202941,2,0.242051,0.225220,0.306977,0.409450,0.225089,3,2
299997,0,0,0,0.309377,0.290101,0.241786,0.310611,0.318979,0.289950,0.329465,...,0.330862,0.276863,3,0.242049,0.360939,0.433596,0.186839,0.178062,7,9
299998,0,1,0,0.309382,0.290105,0.327140,0.338918,0.318998,0.314669,0.304155,...,0.275427,0.332235,1,0.355053,0.360950,0.306965,0.409406,0.437765,3,8


## Validation

Validation is done with a single LR and with LR with CV.

- I will add OneHotEncoder and others later.
- More folds give a better score (in my experience).
- You can try other solvers and other parameters.

### Single LR

In [20]:
%%time
import gc
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score as auc
from sklearn.linear_model import LogisticRegression

encoder_list = [ OrdinalEncoder(), WOEEncoder(), TargetEncoder(), MEstimateEncoder(), JamesSteinEncoder(), LeaveOneOutEncoder() ,CatBoostEncoder()]

X_train, X_val, y_train, y_val = train_test_split(train, target, test_size=0.2, random_state=97)

for encoder in encoder_list:
    print("Test {} : ".format(str(encoder).split('(')[0]), end=" ")
    train_enc = encoder.fit_transform(X_train[feature_list], y_train)
    #test_enc = encoder.transform(test[feature_list])
    val_enc = encoder.transform(X_val[feature_list])
    lr = LogisticRegression(C=0.1, solver="lbfgs", max_iter=1000)
    lr.fit(train_enc, y_train)
    lr_pred = lr.predict_proba(val_enc)[:, 1]
    score = auc(y_val, lr_pred)
    print("score: ", score)
    del train_enc
    del val_enc
    gc.collect()



Test OrdinalEncoder :  score:  0.6058952245541614
Test WOEEncoder :  score:  0.7814115175120097
Test TargetEncoder :  score:  0.7790232741835907
Test MEstimateEncoder :  score:  0.7792025608680453
Test JamesSteinEncoder :  score:  0.7719623937439537
Test LeaveOneOutEncoder :  score:  0.79576549204064
Test CatBoostEncoder :  score:  0.7917094405644212
CPU times: user 3min 52s, sys: 2.48 s, total: 3min 55s
Wall time: 3min 5s


### LR with CrossValidation

In [21]:
%%time
from sklearn.model_selection import KFold
import numpy as np

# CV function original : @Peter Hurford : Why Not Logistic Regression? https://www.kaggle.com/peterhurford/why-not-logistic-regression

def run_cv_model(train, test, target, model_fn, params={}, label='model'):
    kf = KFold(n_splits=5)
    fold_splits = kf.split(train, target)

    cv_scores = []
    pred_full_test = 0
    pred_train = np.zeros((train.shape[0]))
    i = 1
    for dev_index, val_index in fold_splits:
        print('Started {} fold {}/5'.format(label, i))
        dev_X, val_X = train.iloc[dev_index], train.iloc[val_index]
        dev_y, val_y = target[dev_index], target[val_index]
        pred_val_y, pred_test_y = model_fn(dev_X, dev_y, val_X, val_y, test, params)
        pred_full_test = pred_full_test + pred_test_y
        pred_train[val_index] = pred_val_y
        cv_score = auc(val_y, pred_val_y)
        cv_scores.append(cv_score)
        print(label + ' cv score {}: {}'.format(i, cv_score))
        i += 1
        
    print('{} cv scores : {}'.format(label, cv_scores))
    print('{} cv mean score : {}'.format(label, np.mean(cv_scores)))
    print('{} cv std score : {}'.format(label, np.std(cv_scores)))
    pred_full_test = pred_full_test / 5.0
    results = {'label': label, 'train': pred_train, 'test': pred_full_test, 'cv': cv_scores}
    return results


def runLR(train_X, train_y, test_X, test_y, test_X2, params):
    model = LogisticRegression(**params)
    model.fit(train_X, train_y)
    pred_test_y = model.predict_proba(test_X)[:, 1]
    pred_test_y2 = model.predict_proba(test_X2)[:, 1]
    return pred_test_y, pred_test_y2


CPU times: user 37 µs, sys: 1e+03 ns, total: 38 µs
Wall time: 41 µs


In [23]:
TEST = True

In [24]:
if TEST:

    lr_params = {'solver': 'lbfgs', 'C': 0.1}

    results = list()

    for encoder in  [ OrdinalEncoder(), WOEEncoder(), TargetEncoder(), MEstimateEncoder(), JamesSteinEncoder(), LeaveOneOutEncoder() ,CatBoostEncoder()]:
        train_enc = encoder.fit_transform(train[feature_list], target)
        test_enc = encoder.transform(test[feature_list])
        result = run_cv_model(train_enc, test_enc, target, runLR, lr_params, str(encoder).split('(')[0])
        results.append(result)
    results = pd.DataFrame(results)
    results['cv_mean'] = results['cv'].apply(lambda l : np.mean(l))
    results['cv_std'] = results['cv'].apply(lambda l : np.std(l))
    results[['label','cv_mean','cv_std']].head(8)

Started OrdinalEncoder fold 1/5




OrdinalEncoder cv score 1: 0.5478087630083479
Started OrdinalEncoder fold 2/5




OrdinalEncoder cv score 2: 0.5785368145198041
Started OrdinalEncoder fold 3/5




OrdinalEncoder cv score 3: 0.5794711812720389
Started OrdinalEncoder fold 4/5




OrdinalEncoder cv score 4: 0.5776419118948795
Started OrdinalEncoder fold 5/5




OrdinalEncoder cv score 5: 0.5530765797041557
OrdinalEncoder cv scores : [0.5478087630083479, 0.5785368145198041, 0.5794711812720389, 0.5776419118948795, 0.5530765797041557]
OrdinalEncoder cv mean score : 0.5673070500798453
OrdinalEncoder cv std score : 0.01388216518854039
Started WOEEncoder fold 1/5




WOEEncoder cv score 1: 0.8296010840892494
Started WOEEncoder fold 2/5




WOEEncoder cv score 2: 0.8276778463251253
Started WOEEncoder fold 3/5




WOEEncoder cv score 3: 0.8343609323413513
Started WOEEncoder fold 4/5




WOEEncoder cv score 4: 0.8315878216378239
Started WOEEncoder fold 5/5




WOEEncoder cv score 5: 0.8307581723870281
WOEEncoder cv scores : [0.8296010840892494, 0.8276778463251253, 0.8343609323413513, 0.8315878216378239, 0.8307581723870281]
WOEEncoder cv mean score : 0.8307971713561155
WOEEncoder cv std score : 0.002213045618434726
Started TargetEncoder fold 1/5




TargetEncoder cv score 1: 0.8228825225372465
Started TargetEncoder fold 2/5




TargetEncoder cv score 2: 0.819640037288339
Started TargetEncoder fold 3/5




TargetEncoder cv score 3: 0.8270373654953843
Started TargetEncoder fold 4/5




TargetEncoder cv score 4: 0.8279529191167299
Started TargetEncoder fold 5/5




TargetEncoder cv score 5: 0.8270456093348808
TargetEncoder cv scores : [0.8228825225372465, 0.819640037288339, 0.8270373654953843, 0.8279529191167299, 0.8270456093348808]
TargetEncoder cv mean score : 0.8249116907545162
TargetEncoder cv std score : 0.003169511807334321
Started MEstimateEncoder fold 1/5




MEstimateEncoder cv score 1: 0.8239353389041686
Started MEstimateEncoder fold 2/5




MEstimateEncoder cv score 2: 0.8225460353941083
Started MEstimateEncoder fold 3/5




MEstimateEncoder cv score 3: 0.8250872944298472
Started MEstimateEncoder fold 4/5




MEstimateEncoder cv score 4: 0.8290057049907481
Started MEstimateEncoder fold 5/5




MEstimateEncoder cv score 5: 0.8257572319360416
MEstimateEncoder cv scores : [0.8239353389041686, 0.8225460353941083, 0.8250872944298472, 0.8290057049907481, 0.8257572319360416]
MEstimateEncoder cv mean score : 0.8252663211309826
MEstimateEncoder cv std score : 0.002164601755855072
Started JamesSteinEncoder fold 1/5




JamesSteinEncoder cv score 1: 0.825785651337012
Started JamesSteinEncoder fold 2/5




JamesSteinEncoder cv score 2: 0.8213704510704314
Started JamesSteinEncoder fold 3/5




JamesSteinEncoder cv score 3: 0.8279921606609264
Started JamesSteinEncoder fold 4/5




JamesSteinEncoder cv score 4: 0.8144299433590861
Started JamesSteinEncoder fold 5/5




JamesSteinEncoder cv score 5: 0.8265248336422055
JamesSteinEncoder cv scores : [0.825785651337012, 0.8213704510704314, 0.8279921606609264, 0.8144299433590861, 0.8265248336422055]
JamesSteinEncoder cv mean score : 0.8232206080139323
JamesSteinEncoder cv std score : 0.004918616364474788
Started LeaveOneOutEncoder fold 1/5




LeaveOneOutEncoder cv score 1: 0.7870271221339217
Started LeaveOneOutEncoder fold 2/5




LeaveOneOutEncoder cv score 2: 0.7883871851867641
Started LeaveOneOutEncoder fold 3/5




LeaveOneOutEncoder cv score 3: 0.7940299539122266
Started LeaveOneOutEncoder fold 4/5




LeaveOneOutEncoder cv score 4: 0.7901332335441938
Started LeaveOneOutEncoder fold 5/5




LeaveOneOutEncoder cv score 5: 0.7883777050213145
LeaveOneOutEncoder cv scores : [0.7870271221339217, 0.7883871851867641, 0.7940299539122266, 0.7901332335441938, 0.7883777050213145]
LeaveOneOutEncoder cv mean score : 0.7895910399596842
LeaveOneOutEncoder cv std score : 0.0024287055632732923
Started CatBoostEncoder fold 1/5




CatBoostEncoder cv score 1: 0.7297166731892779
Started CatBoostEncoder fold 2/5




CatBoostEncoder cv score 2: 0.7805421729205533
Started CatBoostEncoder fold 3/5




CatBoostEncoder cv score 3: 0.7839670066128355
Started CatBoostEncoder fold 4/5




CatBoostEncoder cv score 4: 0.7795226775861617
Started CatBoostEncoder fold 5/5
CatBoostEncoder cv score 5: 0.786081368408643
CatBoostEncoder cv scores : [0.7297166731892779, 0.7805421729205533, 0.7839670066128355, 0.7795226775861617, 0.786081368408643]
CatBoostEncoder cv mean score : 0.7719659797434942
CatBoostEncoder cv std score : 0.021255246501015294




## Submit

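Even CV did not solve the overfitting problem of the target-based encoders. Note that in the loop above each encoder is fitted on the full training set before the folds are split, so every validation fold has already contributed its targets to the encoding. This is probably why WOE, Target, M-Estimate and James-Stein score about 0.82 to 0.83 in CV but only about 0.77 to 0.78 on the single hold-out split.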

In [None]:
if TEST:
    for idx, label in enumerate(results['label']):
        sub_df = pd.DataFrame({'id': test_id, 'target' : results.iloc[idx]['test']})
        sub_df.to_csv("LR_{}.csv".format(label), index=False)



In [25]:
results

Unnamed: 0,label,train,test,cv,cv_mean,cv_std
0,OrdinalEncoder,"[0.48505482333487315, 0.4917464196614169, 0.42...","[0.3380805366198249, 0.3306798714776061, 0.334...","[0.5478087630083479, 0.5785368145198041, 0.579...",0.567307,0.013882
1,WOEEncoder,"[0.3576322095435018, 0.01605304315035642, 0.02...","[0.23103000402027601, 0.5454760958772756, 0.03...","[0.8296010840892494, 0.8276778463251253, 0.834...",0.830797,0.002213
2,TargetEncoder,"[0.40135938852395303, 0.027724824015846892, 0....","[0.24365513471906883, 0.4716806571929366, 0.05...","[0.8228825225372465, 0.819640037288339, 0.8270...",0.824912,0.00317
3,MEstimateEncoder,"[0.43698261221403517, 0.023900335752031814, 0....","[0.24434920814860037, 0.4873201437201871, 0.05...","[0.8239353389041686, 0.8225460353941083, 0.825...",0.825266,0.002165
4,JamesSteinEncoder,"[0.38073758773828714, 0.01792667770063288, 0.0...","[0.26601283716262103, 0.4008785887852655, 0.04...","[0.825785651337012, 0.8213704510704314, 0.8279...",0.823221,0.004919
5,LeaveOneOutEncoder,"[0.39689094271812525, 0.06536588173953757, 0.0...","[0.32221034919543445, 0.5207705882316336, 0.11...","[0.7870271221339217, 0.7883871851867641, 0.794...",0.789591,0.002429
6,CatBoostEncoder,"[0.27363762814222814, 0.10252561953328132, 0.0...","[0.33701225943510055, 0.4841749143409252, 0.16...","[0.7297166731892779, 0.7805421729205533, 0.783...",0.771966,0.021255
