## <font color='darkblue'>Preface</font>
([article source](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)) <font size='3ptx'>**Convert a categorical variable to number for Machine Learning Model Building**</font>

**Most machine learning algorithms cannot handle categorical variables unless we convert them to numerical values.** The performance of many algorithms varies depending on how the categorical variables are encoded.

Categorical variables can be divided into two categories: Nominal (<font color='brown'>no particular order</font>) and Ordinal (<font color='brown'>with an inherent order</font>):
![1.png](images/1.png)
<br/>
A few examples of nominal variables:
* Red, Yellow, Pink, Blue
* Singapore, Japan, USA, India, Korea
* Cow, Dog, Cat, Snake

Examples of ordinal variables:
* High, Medium, Low
* Strongly Agree, Agree, Neutral, Disagree, Strongly Disagree
* Excellent, Okay, Bad

There are many ways to encode these categorical variables as numbers and use them in an algorithm. **I will cover most of them in this post, from basic to more advanced ones. The encodings covered are:**
1. <font size='3ptx'>[**One Hot Encoding**](#sect1)</font>
2. <font size='3ptx'>[**Label Encoding**](#sect2)</font>
3. <font size='3ptx'>[**Ordinal Encoding**](#sect3)</font>
4. <font size='3ptx'>[**Helmert Encoding**](#sect4)</font>
5. <font size='3ptx'>[**Binary Encoding**](#sect5)</font>
6. <font size='3ptx'>[**Frequency Encoding**](#sect6)</font>
7. <font size='3ptx'>[**Mean Encoding**](#sect7)</font>
8. <font size='3ptx'>[**Weight of Evidence Encoding**](#sect8)</font>
9. <font size='3ptx'>[**Probability Ratio Encoding**](#sect9)</font>
10. <font size='3ptx'>[**Hashing Encoding**](#sect10)</font>
11. <font size='3ptx'>[**Backward Difference Encoding**](#sect11)</font>
12. <font size='3ptx'>[**Leave One Out Encoding**](#sect12)</font>
13. <font size='3ptx'>[**James-Stein Encoding**](#sect13)</font>
14. <font size='3ptx'>[**M-estimator Encoding**](#sect14)</font>
15. <font size='3ptx'>[**Thermometer Encoder (To be updated)**](#sect15)</font>

### <font color='darkgreen'>Experimental Data</font>
For explanation, I will use this data-frame, which has two independent variables or features (<font color='brown'>`Temperature` and `Color`</font>) and one `label` (<font color='brown'>Target</font>). The row index serves as Rec-No, the sequence number of the record. There are 10 records in total in this data-frame. The Python code looks as below.

In [19]:
#!pip install category_encoders

In [1]:
import pandas as pd
import numpy as np

In [2]:
def get_data():
    data = {
        'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
        'Color': ['Red', 'Yellow', 'Blue', 'Blue', 'Red', 'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
        'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
    }
    df = pd.DataFrame(data=data, columns=['Temperature', 'Color', 'Target'])
    return df

We will use Pandas, scikit-learn, and [**category_encoders**](https://contrib.scikit-learn.org/category_encoders/) (<font color='brown'>a scikit-learn contribution library</font>) to show different encoding methods in Python.

<a id='sect1'></a>
## <font color='darkblue'>One Hot Encoding</font>
**In this method, we map each category to a vector of 1s and 0s denoting the presence or absence of the category. The number of vectors depends on the number of categories of the feature**. This method produces many columns, which slows down learning significantly if the feature has a very high number of categories. Pandas has the [get_dummies](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html) function, which is quite easy to use. For the sample data-frame, the code is as below:

In [4]:
df = get_data()
df_with_dummies = pd.get_dummies(df, prefix=['Temp'], columns=['Temperature'])
df_with_dummies

Unnamed: 0,Color,Target,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,Red,1,0,1,0,0
1,Yellow,1,1,0,0,0
2,Blue,1,0,0,1,0
3,Blue,0,0,0,0,1
4,Red,1,0,1,0,0
5,Yellow,0,0,0,0,1
6,Red,1,0,0,0,1
7,Yellow,0,0,1,0,0
8,Yellow,1,0,1,0,0
9,Yellow,1,1,0,0,0


Scikit-learn has [**OneHotEncoder**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) for this purpose, but it returns an array rather than adding feature columns to the data-frame (<font color='brown'>extra code is needed, as shown in the code sample below</font>).

In [5]:
from sklearn.preprocessing import OneHotEncoder

ohc = OneHotEncoder()
ohe = ohc.fit_transform(df.Temperature.values.reshape(-1, 1)).toarray()
ohe

array([[0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.]])

In [6]:
df_onehot = pd.DataFrame(data=ohe, columns = ["Temp_"+str(ohc.categories_[0][i]) for i in range(len(ohc.categories_[0]))])
df_onehot

Unnamed: 0,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,0.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,1.0,0.0,0.0
5,0.0,0.0,0.0,1.0
6,0.0,0.0,0.0,1.0
7,0.0,1.0,0.0,0.0
8,0.0,1.0,0.0,0.0
9,1.0,0.0,0.0,0.0


In [7]:
df_with_onehot = pd.concat([df, df_onehot], axis=1)
# df_with_onehot = df_with_onehot.drop(['Temperature'],axis=1)
df_with_onehot

Unnamed: 0,Temperature,Color,Target,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,Hot,Red,1,0.0,1.0,0.0,0.0
1,Cold,Yellow,1,1.0,0.0,0.0,0.0
2,Very Hot,Blue,1,0.0,0.0,1.0,0.0
3,Warm,Blue,0,0.0,0.0,0.0,1.0
4,Hot,Red,1,0.0,1.0,0.0,0.0
5,Warm,Yellow,0,0.0,0.0,0.0,1.0
6,Warm,Red,1,0.0,0.0,0.0,1.0
7,Hot,Yellow,0,0.0,1.0,0.0,0.0
8,Hot,Yellow,1,0.0,1.0,0.0,0.0
9,Cold,Yellow,1,1.0,0.0,0.0,0.0


One Hot Encoding is very popular. We can represent all categories with `N-1` columns (<font color='brown'>`N` = number of categories</font>), as that is sufficient to encode the one that is not included: a row with 0 in every remaining column must belong to the dropped category. Usually, for regression, we use `N-1` columns (<font color='brown'>drop the first or last column of the one-hot encoded feature</font>), but for classification, the recommendation is to use all `N` columns, as most tree-based algorithms build a tree based on all available variables. **One hot encoding with N-1 binary variables should be used in linear regression to ensure the correct number of degrees of freedom** (`N-1`). Linear regression has access to all of the features as it is being trained, and therefore examines the whole set of dummy variables together. This means that `N-1` binary variables give complete information about (<font color='brown'>represent completely</font>) the original categorical variable to the linear regression. This approach can be adopted for any machine learning algorithm that looks at ALL the features at the same time during training, for example support vector machines, neural networks, and clustering algorithms.

In tree-based methods, the dropped category is never considered on its own, because no column exists for the tree to split on. Thus, if we use categorical variables in a tree-based learning algorithm, it is good practice to encode them into `N` binary variables and not drop any.
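
To illustrate the `N-1` variant, here is a minimal sketch using the `drop_first` option of `pd.get_dummies`. The data-frame is a shortened copy of the sample data, and the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot'],
    'Target': [1, 1, 1, 0, 1],
})

# drop_first=True removes the first dummy column (here Temp_Cold),
# leaving N-1 = 3 columns for the 4 categories.
df_n_minus_1 = pd.get_dummies(df, prefix=['Temp'], columns=['Temperature'], drop_first=True)

dummy_columns = [c for c in df_n_minus_1.columns if c.startswith('Temp_')]
print(dummy_columns)
```

The `Cold` row ends up with 0 in all three remaining columns, which is how the dropped category stays recoverable.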

<a id='sect2'></a>
## <font color='darkblue'>Label Encoding</font>
In this encoding, each category is assigned an integer, 0 through `N-1` in scikit-learn (<font color='brown'>here `N` is the number of categories for the feature</font>). **One major issue with this approach is that there is no relation or order between these classes, but the algorithm might treat the integers as ordered or otherwise related.** In the example below, it may look like (<font color='brown'>Cold < Hot < Very Hot < Warm, that is, 0 < 1 < 2 < 3</font>). Scikit-learn code for the data-frame is as follows:

In [9]:
from sklearn.preprocessing import LabelEncoder

df['Temp_Label_Encoder'] = LabelEncoder().fit_transform(df['Temperature'])
df

Unnamed: 0,Temperature,Color,Target,Temp_Label_Encoder
0,Hot,Red,1,1
1,Cold,Yellow,1,0
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,1
5,Warm,Yellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,1
8,Hot,Yellow,1,1
9,Cold,Yellow,1,0


Pandas [factorize](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.factorize.html) performs the same function, except that it numbers the categories in order of first appearance rather than alphabetically.

In [10]:
df = get_data()
df['Temp_factorize_encode'] = pd.factorize(df['Temperature'])[0]  # integer codes in order of first appearance
df

Unnamed: 0,Temperature,Color,Target,Temp_factorize_encode
0,Hot,Red,1,0
1,Cold,Yellow,1,1
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,0
5,Warm,Yellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,0
8,Hot,Yellow,1,0
9,Cold,Yellow,1,1


<a id='sect3'></a>
## <font color='darkblue'>Ordinal Encoding</font>
**We do Ordinal encoding to ensure the encoding retains the ordinal nature of the variable**. This is reasonable only for ordinal variables, as mentioned at the beginning of this article. This encoding looks similar to Label Encoding, with one difference: Label Encoding does not consider whether the variable is ordinal or not, and assigns a sequence of integers:
* As per the order of data (Pandas assigned Hot (0), Cold (1), “Very Hot” (2) and Warm (3)) or
* As per alphabetical sorted order (scikit-learn assigned Cold(0), Hot(1), “Very Hot” (2) and Warm (3)).

If we take the temperature scale as the order, then the ordinal values should run from Cold to Very Hot. Ordinal encoding will assign values as (<font color='brown'>Cold(1) < Warm(2) < Hot(3) < Very Hot(4)</font>). **Usually, Ordinal Encoding is done starting from 1.**
    
Refer to this code using Pandas: first, we assign the order of the variable through a dictionary; then we map each row of the variable as per the dictionary.

In [15]:
temp_dict = {
    "Cold": 1,
    "Warm": 2,
    "Hot": 3,
    "Very Hot": 4
}

df = get_data()
df['temperature_ordinal'] = df.Temperature.map(temp_dict)
df

Unnamed: 0,Temperature,Color,Target,temperature_ordinal
0,Hot,Red,1,3
1,Cold,Yellow,1,1
2,Very Hot,Blue,1,4
3,Warm,Blue,0,2
4,Hot,Red,1,3
5,Warm,Yellow,0,2
6,Warm,Red,1,2
7,Hot,Yellow,0,3
8,Hot,Yellow,1,3
9,Cold,Yellow,1,1


Though it is very straightforward, it **requires code that spells out the ordinal values and the actual mapping from text to integer as per the order**.
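
As an alternative to a hand-written dictionary, the order can be declared once with an ordered `pd.Categorical`. This is only a sketch of one possible approach; the variable names are illustrative and not part of the original article.

```python
import pandas as pd

temperature = pd.Series(
    ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold']
)

# Declare the order once; the codes are 0-based, so add 1 to start from 1.
ordered = pd.Categorical(
    temperature,
    categories=['Cold', 'Warm', 'Hot', 'Very Hot'],
    ordered=True,
)
temperature_ordinal = pd.Series(ordered.codes + 1, index=temperature.index)
print(temperature_ordinal.tolist())
```

Values that are missing from `categories` get the code -1 (so 0 after adding 1), which makes unexpected labels easy to spot.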

<a id='sect4'></a>
## <font color='darkblue'>Helmert Encoding</font>
**In this encoding, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.**

**The version in [category_encoders](https://github.com/scikit-learn-contrib/category_encoders) is sometimes referred to as Reverse Helmert Coding**. Forward Helmert coding instead compares each level with the mean of the subsequent levels; hence, the name ‘reverse’ is used to differentiate the two.

In [27]:
import category_encoders as ce

df = get_data()
encoder = ce.HelmertEncoder(cols=['Temperature'], drop_invariant=True)
dfh = encoder.fit_transform(df['Temperature'])
encoded_df = pd.concat([df, dfh], axis=1)
encoded_df


Unnamed: 0,Temperature,Color,Target,Temperature_0,Temperature_1,Temperature_2
0,Hot,Red,1,-1.0,-1.0,-1.0
1,Cold,Yellow,1,1.0,-1.0,-1.0
2,Very Hot,Blue,1,0.0,2.0,-1.0
3,Warm,Blue,0,0.0,0.0,3.0
4,Hot,Red,1,-1.0,-1.0,-1.0
5,Warm,Yellow,0,0.0,0.0,3.0
6,Warm,Red,1,0.0,0.0,3.0
7,Hot,Yellow,0,-1.0,-1.0,-1.0
8,Hot,Yellow,1,-1.0,-1.0,-1.0
9,Cold,Yellow,1,1.0,-1.0,-1.0
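
The pattern in the table above can be reproduced with a small contrast matrix. The sketch below is reconstructed from that output only, not from the library's source; the function name is hypothetical, and the levels are assumed to be numbered in order of first appearance (Hot, Cold, Very Hot, Warm).

```python
def reverse_helmert_matrix(n_levels):
    """One row per level, n_levels - 1 contrast columns."""
    matrix = []
    for level in range(n_levels):
        row = []
        for col in range(n_levels - 1):
            if level <= col:
                row.append(-1)        # one of the "previous" levels
            elif level == col + 1:
                row.append(col + 1)   # the level being compared
            else:
                row.append(0)         # later levels do not take part
        matrix.append(row)
    return matrix

levels = ['Hot', 'Cold', 'Very Hot', 'Warm']
contrast = dict(zip(levels, reverse_helmert_matrix(len(levels))))
print(contrast)
```

Each column sums to zero, so column `j` contrasts one level against the levels that come before it.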


<a id='sect5'></a>
## <font color='darkblue'>Binary Encoding</font>
**Binary encoding converts a category into binary digits**. Each binary digit creates one feature column. **If there are `n` unique categories, then binary encoding results in only about $\log_2 n$ features**. In this example, we have four categories; because the numbering starts from 1, the largest code is 4 (binary 100), so there are three binary encoded features. Compared to One Hot Encoding, this requires fewer feature columns (<font color='brown'>for 100 categories, One Hot Encoding will have 100 features, while Binary encoding needs just seven</font>).

Binary encoding follows these steps:
1. The categories are first converted to numeric order starting from 1 (<font color='brown'>order is created as categories appear in a dataset and do not mean any ordinal nature</font>)
2. Then those integers are converted into binary code, so for example 3 becomes 011, 4 becomes 100
3. Then the digits of the binary number form separate columns.

Refer to the below diagram for better intuition:
![2.png](images/2.png)
<br/>
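
The three steps above can be sketched in plain Python. This is an illustration of the idea rather than the implementation used by any library; the helper `to_bits` is hypothetical.

```python
temperatures = ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                'Warm', 'Warm', 'Hot', 'Hot', 'Cold']

# Step 1: number the categories from 1, in order of first appearance.
ordinal = {}
for value in temperatures:
    ordinal.setdefault(value, len(ordinal) + 1)

# Step 2: the largest integer decides how many binary digits are needed.
n_bits = max(ordinal.values()).bit_length()

def to_bits(number, width):
    """Return the binary digits of `number`, zero-padded to `width`."""
    return [int(digit) for digit in format(number, f'0{width}b')]

# Step 3: each binary digit becomes its own column.
binary_rows = [to_bits(ordinal[value], n_bits) for value in temperatures]
print(ordinal, n_bits, binary_rows[:4])
```

For the sample data this gives Hot = 001, Cold = 010, Very Hot = 011, and Warm = 100.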

We will use the [**category_encoders**](https://github.com/scikit-learn-contrib/category_encoders) package for this, and the function name is BinaryEncoder.

In [28]:
import category_encoders as ce

df = get_data()
encoder = ce.BinaryEncoder(cols=['Temperature'])
bin_encode = encoder.fit_transform(df['Temperature'])
encoded_df = pd.concat([df, bin_encode], axis=1)
encoded_df


Unnamed: 0,Temperature,Color,Target,Temperature_0,Temperature_1,Temperature_2
0,Hot,Red,1,0,0,1
1,Cold,Yellow,1,0,1,0
2,Very Hot,Blue,1,0,1,1
3,Warm,Blue,0,1,0,0
4,Hot,Red,1,0,0,1
5,Warm,Yellow,0,1,0,0
6,Warm,Red,1,1,0,0
7,Hot,Yellow,0,0,0,1
8,Hot,Yellow,1,0,0,1
9,Cold,Yellow,1,0,1,0


<a id='sect6'></a>
## <font color='darkblue'>Frequency Encoding</font>
**It is a way to utilize the frequency of the categories as labels.** In cases where the frequency is somewhat related to the target variable, it helps the model understand and assign weight in direct or inverse proportion, depending on the nature of the data. There are three steps:
* Select a categorical variable you would like to transform
* Group by the categorical variable and obtain counts of each category
* Join it back with the training dataset

Pandas code can be constructed as below:

In [29]:
df = get_data()
fe = df.groupby('Temperature').size() / len(df)
fe

Temperature
Cold        0.2
Hot         0.4
Very Hot    0.1
Warm        0.3
dtype: float64

In [30]:
df.loc[:, 'temperature_freq_encode'] = df['Temperature'].map(fe)
df

Unnamed: 0,Temperature,Color,Target,temperature_freq_encode
0,Hot,Red,1,0.4
1,Cold,Yellow,1,0.2
2,Very Hot,Blue,1,0.1
3,Warm,Blue,0,0.3
4,Hot,Red,1,0.4
5,Warm,Yellow,0,0.3
6,Warm,Red,1,0.3
7,Hot,Yellow,0,0.4
8,Hot,Yellow,1,0.4
9,Cold,Yellow,1,0.2


<a id='sect7'></a>
## <font color='darkblue'>Mean Encoding</font>
**Mean Encoding or Target Encoding is a very popular encoding approach among Kagglers**. There are many variations of this. Here I will cover the basic version and the smoothing version. Mean encoding is similar to label encoding, except that here the labels are correlated directly with the target. **In mean target encoding, the label for each category of the feature is the mean value of the target variable on the training data**. This encoding method brings out the relation between similar categories, but the connections are bounded within the categories and the target itself. The advantages of mean target encoding are that it **does not affect the volume of the data and helps in faster learning**. However, <font color='darkred'>**mean encoding is notorious for over-fitting; thus, regularization with cross-validation or some other approach is a must on most occasions**</font>. The mean encoding approach is as below:
1. Select a categorical variable you would like to transform
2. Group by the categorical variable and obtain aggregated sum over the "Target" variable. (<font color='brown'>total number of 1’s for each category in `Temperature`</font>)
3. Group by the categorical variable and obtain aggregated count over “Target” variable
4. Divide the step 2 / step 3 results and join it back with the train.

![3.png](images/3.png)
<br/>

Sample code for the data-frame :

In [31]:
df = get_data()
mean_encode = df.groupby('Temperature')['Target'].mean()
mean_encode

Temperature
Cold        1.000000
Hot         0.750000
Very Hot    1.000000
Warm        0.333333
Name: Target, dtype: float64

In [32]:
df.loc[:, 'temperature_mean_enc'] = df['Temperature'].map(mean_encode)
df

Unnamed: 0,Temperature,Color,Target,temperature_mean_enc
0,Hot,Red,1,0.75
1,Cold,Yellow,1,1.0
2,Very Hot,Blue,1,1.0
3,Warm,Blue,0,0.333333
4,Hot,Red,1,0.75
5,Warm,Yellow,0,0.333333
6,Warm,Red,1,0.333333
7,Hot,Yellow,0,0.75
8,Hot,Yellow,1,0.75
9,Cold,Yellow,1,1.0


**Mean encoding can embody the target in the label, whereas label encoding does not correlate with the target**. In the case of a large number of features, mean encoding could prove to be a much simpler alternative. Mean encoding tends to group the classes, whereas the grouping is random in case of label encoding.
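
One way to apply the cross-validation regularization mentioned earlier is out-of-fold encoding: each row is encoded with means computed from the other folds only. The sketch below is a plain-Python illustration under assumptions of my own, namely 5 folds assigned by row position and a fallback to the training-fold mean for categories that do not occur in the training folds.

```python
temperatures = ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                'Warm', 'Warm', 'Hot', 'Hot', 'Cold']
targets = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
n_folds = 5  # hypothetical choice

oof_encoded = [None] * len(targets)
for fold in range(n_folds):
    held_out = [i for i in range(len(targets)) if i % n_folds == fold]
    train = [i for i in range(len(targets)) if i % n_folds != fold]

    # Category sums and counts computed on the training folds only.
    sums, counts = {}, {}
    for i in train:
        sums[temperatures[i]] = sums.get(temperatures[i], 0) + targets[i]
        counts[temperatures[i]] = counts.get(temperatures[i], 0) + 1
    train_mean = sum(targets[i] for i in train) / len(train)

    # Rows in the held-out fold never see their own target value.
    for i in held_out:
        cat = temperatures[i]
        oof_encoded[i] = sums[cat] / counts[cat] if cat in counts else train_mean

print(oof_encoded)
```

The single `Very Hot` row has no other row to learn from, so it receives the training-fold mean instead of its own target.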

There are many variations of target encoding in practice, such as smoothing. Smoothing can be implemented as below:

In [36]:
df = get_data()

# Compute the global mean
mean = df['Target'].mean()
mean

0.7

In [37]:
# Count the number of values and mean of each group
agg = df.groupby('Temperature')['Target'].agg(['count', 'mean'])
agg

Unnamed: 0_level_0,count,mean
Temperature,Unnamed: 1_level_1,Unnamed: 2_level_1
Cold,2,1.0
Hot,4,0.75
Very Hot,1,1.0
Warm,3,0.333333


In [38]:
counts = agg['count']
means = agg['mean']
weight = 100

# Compute the smoothed mean
smooth = (counts * means + weight * mean) / (counts + weight)
smooth

Temperature
Cold        0.705882
Hot         0.701923
Very Hot    0.702970
Warm        0.689320
dtype: float64

In [40]:
df.loc[:, 'temp_smean_enc'] = df['Temperature'].map(smooth)
df

Unnamed: 0,Temperature,Color,Target,temp_smean_enc
0,Hot,Red,1,0.701923
1,Cold,Yellow,1,0.705882
2,Very Hot,Blue,1,0.70297
3,Warm,Blue,0,0.68932
4,Hot,Red,1,0.701923
5,Warm,Yellow,0,0.68932
6,Warm,Red,1,0.68932
7,Hot,Yellow,0,0.701923
8,Hot,Yellow,1,0.701923
9,Cold,Yellow,1,0.705882
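
The smoothed mean computed above can also be written as a formula. The symbols are my own notation, not from the original article: $n_c$ is the count of category $c$, $\bar{y}_c$ its target mean, $\bar{y}$ the global mean, and $w$ the `weight`.

```latex
\[
\text{smooth}_c = \frac{n_c\,\bar{y}_c + w\,\bar{y}}{n_c + w},
\qquad
\text{smooth}_{\text{Hot}} = \frac{4 \times 0.75 + 100 \times 0.7}{4 + 100}
= \frac{73}{104} \approx 0.701923
\]
```

With `weight = 100` and only 10 records, every category is pulled almost all the way to the global mean of 0.7, which is why the encoded values are so close to each other.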


<a id='sect8'></a>
## <font color='darkblue'>Weight of Evidence Encoding</font>
**Weight of Evidence** (<font color='brown'>WoE</font>) **is a measure of the “strength” of a grouping technique to separate good and bad**. This method was developed primarily to build a predictive model to evaluate the risk of loan default in the credit and financial industry. **Weight of evidence** (<font color='brown'>WoE</font>) **is a measure of how much the evidence supports or undermines a hypothesis:** ([reference](https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html))
> The weight of evidence tells the predictive power of an independent variable in relation to the dependent variable. Since it evolved from credit scoring world, it is generally described as a measure of the separation of good and bad customers. "Bad Customers" refers to the customers who defaulted on a loan. and "Good Customers" refers to the customers who paid back loan.

It is computed as below:
![4.png](images/4.png)
<br/>
WoE will be 0 if the `P(Goods) / P(Bads) = 1`. That is if the outcome is random for that group. If `P(Bads) > P(Goods)` the odds ratio will be < 1 and the WoE will be < 0; if, on the other hand, `P(Goods) > P(Bads)` in a group, then `WoE > 0`.

WoE is well suited for Logistic Regression because the Logit transformation is simply the log of the odds, i.e., $ln(P(Goods)/P(Bads))$. Therefore, by using WoE-coded predictors in Logistic Regression, the predictors are all prepared and coded to the same scale. The parameters in the linear logistic regression equation can be directly compared.

The WoE transformation has (<font color='brown'>at least</font>) three advantages:
* **It can transform an independent variable so that it establishes a monotonic relationship to the dependent variable**. It does more than this — to secure monotonic relationship it would be enough to “recode” it to any ordered measure (<font color='brown'>for example 1,2,3,4…</font>), but the WoE transformation orders the categories on a “logistic” scale which is natural for Logistic Regression
* For variables with too many (<font color='brown'>sparsely populated</font>) discrete values, these can be grouped into categories (<font color='brown'>densely populated</font>), and the WoE can be used to express information for the whole category
* **The** (<font color='brown'>univariate</font>) **effect of each category on the dependent variable can be compared across categories and variables because WoE is a standardized value** (<font color='brown'>for example you can compare WoE of married people to WoE of manual workers</font>)

It also has (<font color='brown'>at least</font>) three drawbacks:
* Loss of information (variation) due to binning to a few categories
* It is a “univariate” measure, so it does not take into account the correlation between independent variables
* It is easy to manipulate (over-fit) the effect of variables according to how categories are created

The code snippets below show how to calculate WoE.

In [49]:
# We calculate the probability of target = 1 
df = get_data()
woe_df = pd.DataFrame(df.groupby('Temperature')['Target'].mean())
woe_df

Unnamed: 0_level_0,Target
Temperature,Unnamed: 1_level_1
Cold,1.0
Hot,0.75
Very Hot,1.0
Warm,0.333333


In [50]:
# Rename 'Target' to 'Good' for better understanding of formula
woe_df = woe_df.rename(columns={'Target': 'Good'})

In [51]:
# Calculate the 'Bad' Probability
woe_df['Bad'] = 1 - woe_df['Good']
woe_df

Unnamed: 0_level_0,Good,Bad
Temperature,Unnamed: 1_level_1,Unnamed: 2_level_1
Cold,1.0,0.0
Hot,0.75,0.25
Very Hot,1.0,0.0
Warm,0.333333,0.666667


In [52]:
# To avoid divide by zero issue
woe_df['Bad'] = np.where(woe_df['Bad'] == 0, pow(10, -6), woe_df['Bad'])
woe_df['Bad']

Temperature
Cold        0.000001
Hot         0.250000
Very Hot    0.000001
Warm        0.666667
Name: Bad, dtype: float64

In [53]:
# Compute WoE
woe_df['WoE'] = np.log(woe_df['Good'] / woe_df['Bad'])
woe_df

Unnamed: 0_level_0,Good,Bad,WoE
Temperature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Cold,1.0,1e-06,13.815511
Hot,0.75,0.25,1.098612
Very Hot,1.0,1e-06,13.815511
Warm,0.333333,0.666667,-0.693147


Once we calculate WoE for each group, we can map it back to the data-frame.

In [55]:
df.loc[:, 'temperature_woe_enc'] = df['Temperature'].map(woe_df['WoE'])
df

Unnamed: 0,Temperature,Color,Target,temperature_woe_enc
0,Hot,Red,1,1.098612
1,Cold,Yellow,1,13.815511
2,Very Hot,Blue,1,13.815511
3,Warm,Blue,0,-0.693147
4,Hot,Red,1,1.098612
5,Warm,Yellow,0,-0.693147
6,Warm,Red,1,-0.693147
7,Hot,Yellow,0,1.098612
8,Hot,Yellow,1,1.098612
9,Cold,Yellow,1,13.815511


<a id='sect9'></a>
## <font color='darkblue'>Probability Ratio Encoding</font>
**Probability Ratio Encoding is similar to Weight of Evidence (WoE), the only difference being that the plain ratio of good and bad probability is used, without the logarithm.** For each label, we calculate the mean of target=1, that is, the probability of being 1 ( `P(1)` ), and also the probability of target=0 ( `P(0)` ). Then we calculate the ratio `P(1)/P(0)` and replace the labels by that ratio. **We need to substitute a minimal value for `P(0)` when it is zero, to avoid division by zero for any category that has no target=0**.

In [58]:
# Calculate P(1) 
df = get_data()
pr_df = pd.DataFrame(df.groupby('Temperature')['Target'].mean())
pr_df = pr_df.rename(columns={'Target':'Good'})
pr_df['Bad'] = 1 - pr_df['Good']
pr_df['Bad'] = np.where(pr_df['Bad'] == 0, pow(10, -6), pr_df['Bad'])
pr_df['PR'] = pr_df['Good'] / pr_df['Bad']
pr_df

Unnamed: 0_level_0,Good,Bad,PR
Temperature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Cold,1.0,1e-06,1000000.0
Hot,0.75,0.25,3.0
Very Hot,1.0,1e-06,1000000.0
Warm,0.333333,0.666667,0.5


In [60]:
df['temperature_pr_enc'] = df['Temperature'].map(pr_df['PR'])
df

Unnamed: 0,Temperature,Color,Target,temperature_pr_enc
0,Hot,Red,1,3.0
1,Cold,Yellow,1,1000000.0
2,Very Hot,Blue,1,1000000.0
3,Warm,Blue,0,0.5
4,Hot,Red,1,3.0
5,Warm,Yellow,0,0.5
6,Warm,Red,1,0.5
7,Hot,Yellow,0,3.0
8,Hot,Yellow,1,3.0
9,Cold,Yellow,1,1000000.0


<a id='sect10'></a>
## <font color='darkblue'>Hashing</font>
**[Hashing](https://contrib.scikit-learn.org/category_encoders/hashing.html) converts categorical variables to a higher dimensional space of integers, where the distance between two vectors of categorical variables is approximately maintained in the transformed numerical space**. With Hashing, the number of dimensions will be far less than with an encoding like [**One Hot Encoding**](#sect1). This method is advantageous when the cardinality of the categorical variable is very high.

In [62]:
import category_encoders as ce

df = get_data()
encoder = ce.hashing.HashingEncoder(cols=['Temperature'], drop_invariant=True)
dfh = encoder.fit_transform(df['Temperature'])
encoded_df = pd.concat([df, dfh], axis=1)
encoded_df


Unnamed: 0,Temperature,Color,Target,col_0,col_1,col_2
0,Hot,Red,1,1,0,0
1,Cold,Yellow,1,0,0,1
2,Very Hot,Blue,1,0,1,0
3,Warm,Blue,0,0,1,0
4,Hot,Red,1,1,0,0
5,Warm,Yellow,0,0,1,0
6,Warm,Red,1,0,1,0
7,Hot,Yellow,0,1,0,0
8,Hot,Yellow,1,1,0,0
9,Cold,Yellow,1,0,0,1
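
The idea behind hashing can be sketched in plain Python. The choice of MD5 and of 8 output columns are assumptions made for illustration, and the function name is hypothetical; the exact columns produced by `HashingEncoder` may differ.

```python
import hashlib

def hash_encode(value, n_components=8):
    """Return a row with a 1 in the bucket chosen by the hash of `value`."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    bucket = int(digest, 16) % n_components
    row = [0] * n_components
    row[bucket] = 1
    return row

temperatures = ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot']
hashed_rows = [hash_encode(value) for value in temperatures]
print(hashed_rows)
```

The number of columns is fixed in advance and does not grow with the number of categories. The price is that two different categories can land in the same bucket.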


<a id='sect11'></a>
## <font color='darkblue'>Backward Difference Encoding</font>
In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.

This technique falls under the contrast coding system for categorical features. A feature of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables.
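
As a rough illustration, the sketch below builds the textbook backward difference contrast matrix for `K` levels. The function name is hypothetical. The category_encoders package provides a `BackwardDifferenceEncoder`, but whether its output matches this matrix column for column is an assumption on my part.

```python
def backward_difference_matrix(n_levels):
    """Contrast matrix with one row per level and n_levels - 1 columns."""
    matrix = []
    for level in range(1, n_levels + 1):
        row = []
        for col in range(1, n_levels):
            if level <= col:
                row.append(-(n_levels - col) / n_levels)
            else:
                row.append(col / n_levels)
        matrix.append(row)
    return matrix

contrast = backward_difference_matrix(4)
for row in contrast:
    print(row)
```

Consecutive rows differ by exactly 1 in a single column, which is why the regression coefficient of column `j` measures the difference between level `j+1` and level `j`.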

<a id='sect12'></a>
## <font color='darkblue'>Leave One Out Encoding</font>
[**LeaveOneOutEncoder**](https://contrib.scikit-learn.org/category_encoders/leaveoneout.html) is very similar to target encoding but excludes the current row’s target when calculating the mean target for a level to reduce the effect of outliers.

In [68]:
import category_encoders as ce

df = get_data()
encoder = ce.leave_one_out.LeaveOneOutEncoder(cols=['Temperature'], drop_invariant=True)
encoder.fit(df['Temperature'], df['Target'])
dfh = encoder.transform(df['Temperature']).rename(columns={'Temperature': 'temperature_loo_enc'})
encoded_df = pd.concat([df, dfh], axis=1)
encoded_df


Unnamed: 0,Temperature,Color,Target,temperature_loo_enc
0,Hot,Red,1,0.75
1,Cold,Yellow,1,1.0
2,Very Hot,Blue,1,0.7
3,Warm,Blue,0,0.333333
4,Hot,Red,1,0.75
5,Warm,Yellow,0,0.333333
6,Warm,Red,1,0.333333
7,Hot,Yellow,0,0.75
8,Hot,Yellow,1,0.75
9,Cold,Yellow,1,1.0
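
In the table above, every row of a category has the same value, which suggests it shows plain category means, as one would expect when `transform` is called without the target. The row-wise exclusion itself can be sketched in plain Python; the fallback to the global mean for a category with a single row is an assumption for this illustration.

```python
temperatures = ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                'Warm', 'Warm', 'Hot', 'Hot', 'Cold']
targets = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]

global_mean = sum(targets) / len(targets)

# Per-category target sum and row count.
sums, counts = {}, {}
for cat, y in zip(temperatures, targets):
    sums[cat] = sums.get(cat, 0) + y
    counts[cat] = counts.get(cat, 0) + 1

loo_encoded = []
for cat, y in zip(temperatures, targets):
    if counts[cat] > 1:
        # Remove the current row's own target before averaging.
        loo_encoded.append((sums[cat] - y) / (counts[cat] - 1))
    else:
        # Assumption: a category seen once falls back to the global mean.
        loo_encoded.append(global_mean)

print(loo_encoded)
```

Two rows of the same category can now receive different values, depending on their own target.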


<a id='sect13'></a>
## <font color='darkblue'>James-Stein Encoding</font>
For each feature value, the [**James-Stein**](https://contrib.scikit-learn.org/category_encoders/jamesstein.html) estimator returns a weighted average of:
1. The mean target value for the observed feature value.
2. The mean target value (regardless of the feature value).

The [**James-Stein encoder**](https://contrib.scikit-learn.org/category_encoders/jamesstein.html) shrinks the average toward the overall average. It is a target based encoder. **James-Stein estimator has, however, one practical limitation — it was defined only for normal distributions**.

In [66]:
import category_encoders as ce

df = get_data()
encoder = ce.james_stein.JamesSteinEncoder(cols=['Temperature'], drop_invariant=True)
encoder.fit(df['Temperature'], df['Target'])
dfh = encoder.transform(df['Temperature']).rename(columns={'Temperature': 'temperature_js_enc'})
encoded_df = pd.concat([df, dfh], axis=1)
encoded_df

Unnamed: 0,Temperature,Color,Target,temperature_js_enc
0,Hot,Red,1,0.741379
1,Cold,Yellow,1,1.0
2,Very Hot,Blue,1,1.0
3,Warm,Blue,0,0.405229
4,Hot,Red,1,0.741379
5,Warm,Yellow,0,0.405229
6,Warm,Red,1,0.405229
7,Hot,Yellow,0,0.741379
8,Hot,Yellow,1,0.741379
9,Cold,Yellow,1,1.0


<a id='sect14'></a>
## <font color='darkblue'>M-estimator Encoding</font>
M-Estimate Encoder is a simplified version of Target Encoder. It has only one hyper-parameter, `m`, which represents the power of regularization. **The higher the value of `m`, the stronger the shrinking**. Recommended values for `m` are in the range of 1 to 100.
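
A minimal sketch of the idea, assuming the commonly used m-estimate formula `(sum + prior * m) / (count + m)`, where `prior` is the global target mean. The value `m = 1.0` and the variable names are illustrative choices, not taken from the article.

```python
temperatures = ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot',
                'Warm', 'Warm', 'Hot', 'Hot', 'Cold']
targets = [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
m = 1.0  # hypothetical regularization strength

prior = sum(targets) / len(targets)  # global target mean

# Per-category target sum and row count.
sums, counts = {}, {}
for cat, y in zip(temperatures, targets):
    sums[cat] = sums.get(cat, 0) + y
    counts[cat] = counts.get(cat, 0) + 1

# Shrink each category mean toward the prior; a larger m shrinks more.
m_estimate = {cat: (sums[cat] + prior * m) / (counts[cat] + m) for cat in sums}
m_encoded = [m_estimate[cat] for cat in temperatures]
print(m_estimate)
```

Categories with few rows, such as `Very Hot`, move furthest toward the global mean, while well-populated categories stay close to their own mean.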

## <font color='darkblue'>Conclusion</font>
**It is essential to understand that none of these encodings works well in all situations, for every machine learning model, or for every dataset. Data Scientists still need to experiment and find out which works best for their specific case.** If the test data contains classes that are not in the training data, some of these methods won’t work, as the features won’t match. There are a few benchmark publications by research communities, but they are not conclusive about which works best. My recommendation is to try each of these with smaller datasets and then decide where to put more focus on tuning the encoding process. You can use the cheat-sheet below as a guiding tool.
![5.png](images/5.png)
<br/>