# Encoding Categorical Variables

## Converting categorical variables to numbers for machine learning model building



Most machine learning algorithms cannot handle categorical variables unless they are converted to numerical values, and the performance of many algorithms varies with how the categorical variables are encoded.

Categorical variables can be divided into two categories:
* _Nominal_: no particular order
    * Red, Yellow, Pink, Blue
    * Singapore, Japan, USA, India, Korea
    * Cow, Dog, Cat, Snake
* _Ordinal_: some kind of order
    * High, Medium, Low
    * “Strongly Agree”, Agree, Neutral, Disagree and “Strongly Disagree”
    * Excellent, Okay, Bad

There are many ways to encode these categorical variables as numbers and use them in an algorithm:

    1) One Hot Encoding
    2) Label Encoding
    3) Ordinal Encoding
    4) Helmert Encoding
    5) Binary Encoding
    6) Frequency Encoding
    7) Mean Encoding
    8) Weight of Evidence Encoding
    9) Probability Ratio Encoding
    10) Hashing Encoding
    11) Backward Difference Encoding
    12) Leave One Out Encoding
    13) James-Stein Encoding
    14) M-estimator Encoding
    
For the purpose of explanation, I will use a data-frame that has two independent variables or features (Temperature and Color) and one label (Target). The data-frame index plays the role of Rec-No, the sequence number of the record. There are 10 records in total. The Python code looks as below.


In [8]:
import pandas as pd
import numpy as np
data = {'Temperature':['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
        'Color':['Red', 'Yellow', 'Blue', 'Blue', 'Red', 'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
        'Target':[1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
       }
df = pd.DataFrame(data=data, columns = ['Temperature', 'Color', 'Target'])
df

Unnamed: 0,Temperature,Color,Target
0,Hot,Red,1
1,Cold,Yellow,1
2,Very Hot,Blue,1
3,Warm,Blue,0
4,Hot,Red,1
5,Warm,Yellow,0
6,Warm,Red,1
7,Hot,Yellow,0
8,Hot,Yellow,1
9,Cold,Yellow,1


We will use _Pandas_, _Scikit-learn_ and _category_encoders_ to show the different encoding methods in _Python_.

# 1. One Hot Encoding

In this method, we map each category to a vector of 1s and 0s denoting the presence or absence of the category. The number of vectors depends on the number of categories of the feature. This method produces many columns, which slows down learning significantly if the feature has a very high number of categories. Pandas has a get_dummies function which is quite easy to use; it is applied to the sample data-frame after the discussion that follows.

One Hot Encoding is very popular. We can represent all categories with N-1 columns (N = number of categories), as that is sufficient to identify the one that is not included. Usually for regression we use N-1 columns (dropping the first or last column of the one-hot encoded feature), but for classification the recommendation is to use all N columns without dropping any, as most tree-based algorithms build trees based on all available variables.

One hot encoding with N-1 binary variables should be used in linear regression, to ensure the correct number of degrees of freedom (N-1). Linear regression has access to all of the features as it is being trained, and therefore examines the whole set of dummy variables together. This means that N-1 binary variables carry the complete information about the original categorical variable. This approach can be adopted for any machine learning algorithm that looks at ALL the features at the same time during training, for example support vector machines, neural networks and clustering algorithms.

Tree-based methods, on the other hand, will never consider the additional label if it is dropped. Thus, if the categorical variable will be used in a tree-based learning algorithm, it is good practice to encode it into N binary variables and not drop any.

In [32]:
df0 = pd.get_dummies(df, prefix=['Temp'], columns=['Temperature'])
df0[0:len(df0)]

Unnamed: 0,Color,Target,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,Red,1,0,1,0,0
1,Yellow,1,1,0,0,0
2,Blue,1,0,0,1,0
3,Blue,0,0,0,0,1
4,Red,1,0,1,0,0
5,Yellow,0,0,0,0,1
6,Red,1,0,0,0,1
7,Yellow,0,0,1,0,0
8,Yellow,1,0,1,0,0
9,Yellow,1,1,0,0,0
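
The N versus N-1 discussion above can be tried out directly with get_dummies. The following is a minimal sketch, assuming the `drop_first` flag of get_dummies is an acceptable way to obtain the N-1 variant; the names `full` and `reduced` are illustrative and not part of the original notebook.

```python
import pandas as pd

temperature = pd.Series(
    ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    name='Temperature',
)

# All N = 4 dummy columns (the variant suggested above for tree-based models).
full = pd.get_dummies(temperature, prefix='Temp')

# N - 1 = 3 dummy columns: drop_first=True removes the first category in
# sorted order ("Cold"), which becomes the implicit baseline.
reduced = pd.get_dummies(temperature, prefix='Temp', drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

In the reduced frame a "Cold" record is the one whose three remaining columns are all zero, which is why no information is lost.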


Scikit-learn has OneHotEncoder for this purpose, but it returns an array rather than adding feature columns to the data-frame, so additional code is needed, as shown in the code sample below.

In [144]:
X_ = df[['Temperature', 'Target']].values.reshape(1,-1)
X_

array([['Hot', 1, 'Cold', 1, 'Very Hot', 1, 'Warm', 0, 'Hot', 1, 'Warm',
        0, 'Warm', 1, 'Hot', 0, 'Hot', 1, 'Cold', 1]], dtype=object)

In [145]:
type(X_)

numpy.ndarray

In [201]:
X_list = X_.tolist()
X_list


[['Hot',
  1,
  'Cold',
  1,
  'Very Hot',
  1,
  'Warm',
  0,
  'Hot',
  1,
  'Warm',
  0,
  'Warm',
  1,
  'Hot',
  0,
  'Hot',
  1,
  'Cold',
  1]]

In [179]:
# here the list of [Temperature, Target] rows is written out by hand; this step could be automated
X = [['Hot', 1],
     ['Cold', 1],
     ['Very Hot', 1],
     ['Warm', 0],
     ['Hot', 1],
     ['Warm', 0],
     ['Warm', 1],
     ['Hot', 0],
     ['Hot', 1],
     ['Cold', 1]
    ]

In [180]:
type(X)

list

In [None]:
# OneHotEncoder expects 2-D input of shape (n_samples, n_features), such as the list X above

In [181]:
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')

In [182]:
enc.fit(X)

OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='ignore',
       n_values=None, sparse=True)

In [148]:
enc.categories_

[array(['Cold', 'Hot', 'Very Hot', 'Warm'], dtype=object),
 array([0, 1], dtype=object)]

In [183]:
X_enconded = enc.transform(X).toarray()
X_enconded

array([[0., 1., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 1.],
       [0., 0., 0., 1., 1., 0.],
       [0., 1., 0., 0., 0., 1.],
       [0., 0., 0., 1., 1., 0.],
       [0., 0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1.]])

In [186]:
# the same in one step
X_enconded_ = enc.fit_transform(X).toarray()
X_enconded_

array([[0., 1., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 0., 1.],
       [0., 0., 0., 1., 1., 0.],
       [0., 1., 0., 0., 0., 1.],
       [0., 0., 0., 1., 1., 0.],
       [0., 0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 1.]])

In [187]:
enc.inverse_transform(X_enconded)

array([['Hot', 1],
       ['Cold', 1],
       ['Very Hot', 1],
       ['Warm', 0],
       ['Hot', 1],
       ['Warm', 0],
       ['Warm', 1],
       ['Hot', 0],
       ['Hot', 1],
       ['Cold', 1]], dtype=object)

In [178]:
columns_ = ["Temp_"+str(enc.categories_[0][i]) for i in range(len(enc.categories_[0]))]
columns_

['Temp_Cold', 'Temp_Hot', 'Temp_Very Hot', 'Temp_Warm']

In [195]:
dfOneHot = pd.DataFrame(X_enconded[:,0:4], columns = columns_ )
dfOneHot

Unnamed: 0,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,0.0,1.0,0.0,0.0
1,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0
4,0.0,1.0,0.0,0.0
5,0.0,0.0,0.0,1.0
6,0.0,0.0,0.0,1.0
7,0.0,1.0,0.0,0.0
8,0.0,1.0,0.0,0.0
9,1.0,0.0,0.0,0.0


In [197]:
df0_new = pd.concat([df[['Color','Target']], dfOneHot], axis=1)
df0_new

Unnamed: 0,Color,Target,Temp_Cold,Temp_Hot,Temp_Very Hot,Temp_Warm
0,Red,1,0.0,1.0,0.0,0.0
1,Yellow,1,1.0,0.0,0.0,0.0
2,Blue,1,0.0,0.0,1.0,0.0
3,Blue,0,0.0,0.0,0.0,1.0
4,Red,1,0.0,1.0,0.0,0.0
5,Yellow,0,0.0,0.0,0.0,1.0
6,Red,1,0.0,0.0,0.0,1.0
7,Yellow,0,0.0,1.0,0.0,0.0
8,Yellow,1,0.0,1.0,0.0,0.0
9,Yellow,1,1.0,0.0,0.0,0.0
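
As a hedged alternative to writing the list X by hand, the sketch below passes a one-column data-frame straight to OneHotEncoder and encodes only Temperature. It assumes a scikit-learn version in which the encoder returns a sparse matrix by default (hence `.toarray()`); the names `df_onehot`, `df_new` and the category "Freezing" are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Color': ['Red', 'Yellow', 'Blue', 'Blue', 'Red', 'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

enc = OneHotEncoder(handle_unknown='ignore')

# df[['Temperature']] is already 2-D (10 rows, 1 column).
encoded = enc.fit_transform(df[['Temperature']]).toarray()

# Build readable column names from the categories the encoder learned.
columns = ['Temp_' + str(cat) for cat in enc.categories_[0]]
df_onehot = pd.DataFrame(encoded, columns=columns, index=df.index)
df_new = pd.concat([df[['Color', 'Target']], df_onehot], axis=1)

# With handle_unknown='ignore', a category not seen during fit
# is encoded as a row of zeros instead of raising an error.
unseen = enc.transform(pd.DataFrame({'Temperature': ['Freezing']})).toarray()
print(df_new.head())
print(unseen)
```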


# 2. Label Encoding

In this encoding, each category is assigned an integer value from 0 through N-1 (here N is the number of categories of the feature). One major issue with this approach is that there is no relation or order between these classes, but the algorithm might treat the integers as ordered or otherwise related. In the example below it may look like Cold < Hot < Very Hot < Warm (0 < 1 < 2 < 3). The Scikit-learn code for the data-frame is as follows:

In [209]:
from sklearn.preprocessing import LabelEncoder
df['Temp_label_encoded'] = LabelEncoder().fit_transform(df.Temperature)
df

Unnamed: 0,Temperature,Color,Target,Temp_label_encoded
0,Hot,Red,1,1
1,Cold,Yellow,1,0
2,Very Hot,Blue,1,2
3,Warm,Blue,0,3
4,Hot,Red,1,1
5,Warm,Yellow,0,3
6,Warm,Red,1,3
7,Hot,Yellow,0,1
8,Hot,Yellow,1,1
9,Cold,Yellow,1,0


__Pandas factorize performs a similar function, numbering the categories in order of appearance__

In [212]:
# factorize returns (codes, uniques); the codes are already 1-D, so no reshape is needed
df['Temp_factorize_encoded'] = pd.factorize(df['Temperature'])[0]
df

Unnamed: 0,Temperature,Color,Target,Temp_label_encoded,Temp_factorize_encoded
0,Hot,Red,1,1,0
1,Cold,Yellow,1,0,1
2,Very Hot,Blue,1,2,2
3,Warm,Blue,0,3,3
4,Hot,Red,1,1,0
5,Warm,Yellow,0,3,3
6,Warm,Red,1,3,3
7,Hot,Yellow,0,1,0
8,Hot,Yellow,1,1,0
9,Cold,Yellow,1,0,1
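
The two outputs above differ only in how the categories are numbered. As a small sketch (the variable names are illustrative), the `sort` argument of factorize switches between numbering by order of appearance and numbering by sorted order, the latter matching the LabelEncoder column above:

```python
import pandas as pd

temps = pd.Series(
    ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold']
)

# Default: categories are numbered in order of first appearance.
codes_seen, uniques_seen = pd.factorize(temps)

# sort=True: categories are numbered in sorted (alphabetical) order.
codes_sorted, uniques_sorted = pd.factorize(temps, sort=True)

print(list(uniques_seen), codes_seen.tolist())
print(list(uniques_sorted), codes_sorted.tolist())
```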


# 3. Ordinal Encoding

Ordinal encoding is done to ensure that the encoding retains the ordinal nature of the variable. It is reasonable only for ordinal variables, as mentioned at the beginning of this article.

This encoding looks similar to _Label Encoding_ but differs slightly: label encoding does not consider whether the variable is ordinal or not, and assigns a sequence of integers either in order of appearance in the data (Pandas assigned Hot (0), Cold (1), “Very Hot” (2) and Warm (3)) or in alphabetically sorted order (scikit-learn assigned Cold (0), Hot (1), “Very Hot” (2) and Warm (3)).

If we take the temperature scale as the order, then the ordinal values should run from Cold to “Very Hot”. Ordinal encoding will assign the values Cold (1) < Warm (2) < Hot (3) < “Very Hot” (4). Usually ordinal encoding starts from 1.

Refer to the code below using Pandas, where we first define the real order of the variable through a dictionary and then map each row of the variable as per the dictionary.


In [222]:
Temp_dict = {'Cold':1, 
             'Warm':2,
             'Hot':3,
             'Very Hot':4 
            }

In [223]:
data = {'Temperature':['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
        'Color':['Red', 'Yellow', 'Blue', 'Blue', 'Red', 'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
        'Target':[1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
       }
df = pd.DataFrame(data=data, columns = ['Temperature', 'Color', 'Target'])

In [224]:
df['Temp_ordinal'] = df.Temperature.map(Temp_dict)
df

Unnamed: 0,Temperature,Color,Target,Temp_ordinal
0,Hot,Red,1,3
1,Cold,Yellow,1,1
2,Very Hot,Blue,1,4
3,Warm,Blue,0,2
4,Hot,Red,1,3
5,Warm,Yellow,0,2
6,Warm,Red,1,2
7,Hot,Yellow,0,3
8,Hot,Yellow,1,3
9,Cold,Yellow,1,1


__Though it is very straightforward, it requires coding to specify the ordinal values and the actual mapping from text to integer as per the order.__
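
As a hedged alternative to a hand-written dictionary, the order can be declared once with an ordered pandas Categorical and the integer codes read off from it. This is only a sketch; the names `order` and `temp_ordinal` are illustrative.

```python
import pandas as pd

temps = pd.Series(
    ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    name='Temperature',
)

# The real-world order, from coldest to hottest.
order = ['Cold', 'Warm', 'Hot', 'Very Hot']
ordered = pd.Categorical(temps, categories=order, ordered=True)

# .codes are 0-based positions in `order`; add 1 so the encoding starts at 1.
# A value missing from `order` would get code -1, i.e. 0 after the shift.
temp_ordinal = pd.Series(ordered.codes + 1, index=temps.index, name='Temp_ordinal')
print(temp_ordinal.tolist())
```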

# 4. Helmert Encoding

In this encoding, the mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels.

The version in category_encoders is sometimes referred to as Reverse Helmert Coding. The name ‘reverse’ is used to differentiate it from forward Helmert coding, in which each level is compared to the mean of the subsequent levels.

In [2]:
import category_encoders as ce

In [3]:
encoder = ce.HelmertEncoder(cols=['Temperature'], drop_invariant=True)
dfh = encoder.fit_transform(df['Temperature'])
df = pd.concat([df,dfh], axis=1)
df
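
To make the idea of comparing a level with all previous levels concrete, the sketch below builds a (reverse) Helmert contrast matrix by hand with NumPy. It follows one common convention for the matrix and is not taken from category_encoders; the function name, the column names and the sorted level order are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def helmert_contrasts(n_levels):
    """Contrast matrix of shape (n_levels, n_levels - 1)."""
    contrasts = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        contrasts[: j + 1, j] = -1    # all previous levels
        contrasts[j + 1, j] = j + 1   # the level being compared
    return contrasts

temps = pd.Series(
    ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold']
)

levels = sorted(temps.unique())       # assumed level order: alphabetical
matrix = helmert_contrasts(len(levels))
lookup = pd.DataFrame(
    matrix,
    index=levels,
    columns=[f'Temperature_{j}' for j in range(len(levels) - 1)],
)

# Replace every record by the contrast row of its level.
encoded = lookup.loc[temps.tolist()].reset_index(drop=True)
print(lookup)
print(encoded.head())
```

Every column of the matrix sums to zero, so in a regression with an intercept each coefficient reflects the difference between one level and the average of the levels before it.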

# 5. Binary Encoding

Binary encoding converts a category into binary digits. Each binary digit creates one feature column. If there are n unique categories, then binary encoding results in only about log₂(n) feature columns. In this example we have 4 categories, numbered 1 to 4, so the binary encoding needs 3 feature columns. Compared to One Hot Encoding this requires fewer feature columns (for 100 categories One Hot Encoding will have 100 features, while Binary Encoding requires just 7).

For Binary encoding one has to follow these steps:

    1) The categories are first converted to numeric order starting from 1 (order is created as categories appear in dataset and does not mean any ordinal nature)
    2) Then those integers are converted into binary code, so for example 3 becomes 011 , 4 becomes 100
    3) Then digits of the binary number form separate columns.
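
The three steps above can be sketched by hand with pandas and NumPy. This is an illustrative implementation under stated assumptions (numbering by order of appearance starting at 1, column names `Temperature_0`, `Temperature_1`, ...), not the code used inside category_encoders.

```python
import pandas as pd

temps = pd.Series(
    ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    name='Temperature',
)

# Step 1: number the categories from 1 in order of appearance.
codes = pd.factorize(temps)[0] + 1          # Hot=1, Cold=2, Very Hot=3, Warm=4

# Step 2: find how many binary digits the largest code needs (4 -> 100 -> 3 digits).
n_bits = int(codes.max()).bit_length()

# Step 3: one column per binary digit, most significant digit first.
binary = pd.DataFrame({
    f'Temperature_{i}': (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
})
print(pd.concat([temps, binary], axis=1))
```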

Refer to below diagram for better intuition. 

We will use the category_encoders package for this; the class name is BinaryEncoder.

In [4]:
import category_encoders as ce 
encoder = ce.BinaryEncoder(cols = ['Temperature'])
dfbin = encoder.fit_transform(df['Temperature'])
df = pd.concat([df, dfbin], axis=1)
df

# 6. Frequency Encoding
It is a way to utilize the frequency of the categories as labels. In cases where the frequency is somewhat related to the target variable, it helps the model to understand and assign weight in direct or inverse proportion, depending on the nature of the data.

The three steps are:

    1) Select a categorical variable you would like to transform
    2) Group by the categorical variable and obtain counts of each category
    3) Join it back with the train dataset
    
Pandas code can be constructed as below:

In [9]:
data = {'Temperature':['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
        'Color':['Red', 'Yellow', 'Blue', 'Blue', 'Red', 'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
        'Target':[1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
       }
df = pd.DataFrame(data=data, columns = ['Temperature', 'Color', 'Target'])

In [10]:
fe = df.groupby('Temperature').size()/len(df)
df.loc[:,'Temp_freq_encode'] = df['Temperature'].map(fe)
df

Unnamed: 0,Temperature,Color,Target,Temp_freq_encode
0,Hot,Red,1,0.4
1,Cold,Yellow,1,0.2
2,Very Hot,Blue,1,0.1
3,Warm,Blue,0,0.3
4,Hot,Red,1,0.4
5,Warm,Yellow,0,0.3
6,Warm,Red,1,0.3
7,Hot,Yellow,0,0.4
8,Hot,Yellow,1,0.4
9,Cold,Yellow,1,0.2


# 7. Mean Encoding

Mean Encoding or Target Encoding is a very popular encoding approach among Kagglers. There are many variations of it; here we cover the basic version and the smoothing version.

Mean encoding is similar to label encoding, except that here the labels are correlated directly with the target. For example, in mean target encoding the label for each category of the feature is the mean value of the target variable on the training data. This encoding method brings out the relation between similar categories, but the relations are bounded within the categories and the target itself.

The advantages of mean target encoding are that it does not affect the volume of the data and helps in faster learning. However, Mean Encoding is notorious for over-fitting, so regularization with cross-validation or some other approach is a must on most occasions.

The mean encoding approach is as below:

    1) Select a categorical variable you would like to transform
    2) Group by the categorical variable and obtain the aggregated sum over the “Target” variable (the total number of 1’s for each category in ‘Temperature’)
    3) Group by the categorical variable and obtain the aggregated count over the “Target” variable
    4) Divide the step 2 result by the step 3 result and join it back with the training data.

In [15]:
data = {'Temperature':['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
        'Color':['Red', 'Yellow', 'Blue', 'Blue', 'Red', 'Yellow', 'Red', 'Yellow', 'Yellow', 'Yellow'],
        'Target':[1, 1, 1, 0, 1, 0, 1, 0, 1, 1]
       }
df = pd.DataFrame(data=data, columns = ['Temperature', 'Color', 'Target'])

In [30]:
mean_encode = df.groupby('Temperature')['Target'].mean()
print(mean_encode)

Temperature
Cold        1.000000
Hot         0.750000
Very Hot    1.000000
Warm        0.333333
Name: Target, dtype: float64


In [31]:
df.loc[:,'Temperature_mean_enc'] = df['Temperature'].map(mean_encode)
df

Unnamed: 0,Temperature,Color,Target,Temperature_mean_enc
0,Hot,Red,1,0.75
1,Cold,Yellow,1,1.0
2,Very Hot,Blue,1,1.0
3,Warm,Blue,0,0.333333
4,Hot,Red,1,0.75
5,Warm,Yellow,0,0.333333
6,Warm,Red,1,0.333333
7,Hot,Yellow,0,0.75
8,Hot,Yellow,1,0.75
9,Cold,Yellow,1,1.0
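
The plain version above uses every row's own target value in its encoding, which is one source of the over-fitting mentioned earlier. One possible form of the cross-validation regularization is an out-of-fold encoding, sketched below. The fold assignment, the number of folds and the fallback to the training-fold mean are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

n_folds = 5
fold_id = np.arange(len(df)) % n_folds      # deterministic folds: 0,1,2,3,4,0,...

oof = pd.Series(np.nan, index=df.index)
for k in range(n_folds):
    train = df[fold_id != k]                # rows used to compute the means
    valid_idx = df.index[fold_id == k]      # rows that receive the encoding
    fold_means = train.groupby('Temperature')['Target'].mean()
    encoded_fold = df.loc[valid_idx, 'Temperature'].map(fold_means)
    # A category absent from the training folds falls back to their overall mean.
    oof.loc[valid_idx] = encoded_fold.fillna(train['Target'].mean())

df['Temperature_oof_enc'] = oof
print(df)
```

Each row is encoded with means computed only from the other folds, so its own target never leaks into its feature value.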


Mean encoding can embody the target in the label, whereas label encoding has no correlation with the target.

In the case of a large number of features, mean encoding could prove to be a much simpler alternative.

Mean encoding tends to group the classes together, whereas the grouping is random in the case of label encoding.

There are many variations of target encoding in practice, such as smoothing. Smoothing can be implemented as below:

In [36]:
#compute the global mean
mean_ = df['Target'].mean()
mean_

0.7

In [37]:
# compute the number of values and the mean of each group
agg = df.groupby('Temperature')['Target'].agg(['count','mean'])
agg

Unnamed: 0_level_0,count,mean
Temperature,Unnamed: 1_level_1,Unnamed: 2_level_1
Cold,2,1.0
Hot,4,0.75
Very Hot,1,1.0
Warm,3,0.333333


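$$
encode = \frac{n \cdot \bar{x} + w \cdot \mu}{n + w} = \frac{n}{n + w} \, \bar{x} + \frac{w}{n + w} \, \mu
$$

Here $n$ is the number of records in the category, $\bar{x}$ is the mean of the target within the category, $\mu$ is the global mean of the target and $w$ is the smoothing weight.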

In [39]:
# compute the smooth mean
counts = agg['count']
means = agg['mean']
weight = 100
smooth_encode = (counts*means+weight*mean_)/(counts+weight)  # mean_ is the global mean computed above
print(smooth_encode)

Temperature
Cold        0.705882
Hot         0.701923
Very Hot    0.702970
Warm        0.689320
dtype: float64


In [40]:
df.loc[:, 'Temperature_smean_enc'] = df['Temperature'].map(smooth_encode)
df

Unnamed: 0,Temperature,Color,Target,Temperature_mean_enc,Temperature_smean_enc
0,Hot,Red,1,0.75,0.701923
1,Cold,Yellow,1,1.0,0.705882
2,Very Hot,Blue,1,1.0,0.70297
3,Warm,Blue,0,0.333333,0.68932
4,Hot,Red,1,0.75,0.701923
5,Warm,Yellow,0,0.333333,0.68932
6,Warm,Red,1,0.333333,0.68932
7,Hot,Yellow,0,0.75,0.701923
8,Hot,Yellow,1,0.75,0.701923
9,Cold,Yellow,1,1.0,0.705882


# 8. Weight of Evidence Encoding

Weight of Evidence (WoE) is a measure of the “strength” of a grouping technique to separate good and bad. This method was developed primarily to build predictive models that evaluate the risk of loan default in the credit and financial industry.

More generally, Weight of Evidence is a measure of how much the evidence supports or undermines a hypothesis.

It is computed as below:

$$
WoE = \ln\left(\frac{\text{Distr Goods}}{\text{Distr Bads}}\right) \cdot 100
$$

WoE will be 0 if $P(Goods) / P(Bads) = 1$, that is, if the outcome is random for that group.
If $P(Bads) > P(Goods)$, the odds ratio will be < 1 and the WoE will be < 0; if, on the other hand, $P(Goods) > P(Bads)$ in a group, then $WoE > 0$.

WoE is well suited for Logistic Regression, because the Logit transformation is simply the log of the odds, i.e., $ln(P(Goods)/P(Bads))$. Therefore, by using WoE-coded predictors in logistic regression, the predictors are all prepared and coded to the same scale, and the parameters in the linear logistic regression equation can be directly compared.
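
The formula above can be sketched in pandas as follows. The sketch reads "Distr Goods" as the share of all goods that fall into a category and "Distr Bads" as the share of all bads, which is an assumption about the intended definition; the smoothing constant `eps`, used to avoid taking the log of zero, is likewise a hypothetical choice.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

# Count goods (Target == 1) and bads (Target == 0) per category.
counts = df.groupby('Temperature')['Target'].agg(goods='sum', total='count')
counts['bads'] = counts['total'] - counts['goods']

eps = 0.5  # hypothetical smoothing constant so that no share is exactly zero
n_cat = len(counts)
distr_goods = (counts['goods'] + eps) / (counts['goods'].sum() + eps * n_cat)
distr_bads = (counts['bads'] + eps) / (counts['bads'].sum() + eps * n_cat)

counts['WoE'] = np.log(distr_goods / distr_bads) * 100
print(counts)
```

With this reading, a category gets a positive WoE when it holds a larger share of the goods than of the bads (Hot, with 75% goods against 70% overall) and a negative one otherwise (Warm).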

The WoE transformation has (at least) three advantages:

    1) It can transform an independent variable so that it establishes monotonic relationship to the dependent variable. Actually it does more than this — to secure monotonic relationship it would be enough to “recode” it to any ordered measure (for example 1,2,3,4…) but the WoE transformation actually orders the categories on a “logistic” scale which is natural for logistic regression
    2) For variables with too many (sparsely populated) discrete values, these can be grouped into categories (densely populated) and the WoE can be used to express information for the whole category
    3) The (univariate) effect of each category on dependent variable can be simply compared across categories and across variables because WoE is standardized value (for example you can compare WoE of married people to WoE of manual workers)
    
It also has (at least) three drawbacks:

    a) Loss of information (variation) due to binning to few categories
    b) It is a “univariate” measure so it does not take into account correlation between independent variables
    c) It is easy to manipulate (over-fit) the effect of variables according to how categories are created
    
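The code snippets below show how one can calculate WoE. Note that they take the log of the ratio of the good and bad proportions within each category and omit the factor of 100.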

In [46]:
woe_df = pd.DataFrame(df.groupby('Temperature')['Target'].mean())
woe_df 

Unnamed: 0_level_0,Target
Temperature,Unnamed: 1_level_1
Cold,1.0
Hot,0.75
Very Hot,1.0
Warm,0.333333


In [49]:
woe_df = woe_df.rename(columns = {'Target':'Good'})
woe_df['Bad'] = 1-woe_df.Good
woe_df

Unnamed: 0_level_0,Good,Bad
Temperature,Unnamed: 1_level_1,Unnamed: 2_level_1
Cold,1.0,0.0
Hot,0.75,0.25
Very Hot,1.0,0.0
Warm,0.333333,0.666667


In [50]:
woe_df['Bad'] = np.where(woe_df['Bad'] == 0, 0.000001, woe_df['Bad'])
woe_df

Unnamed: 0_level_0,Good,Bad
Temperature,Unnamed: 1_level_1,Unnamed: 2_level_1
Cold,1.0,1e-06
Hot,0.75,0.25
Very Hot,1.0,1e-06
Warm,0.333333,0.666667


In [51]:
woe_df['WoE'] = np.log(woe_df.Good/woe_df.Bad)
woe_df

Unnamed: 0_level_0,Good,Bad,WoE
Temperature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Cold,1.0,1e-06,13.815511
Hot,0.75,0.25,1.098612
Very Hot,1.0,1e-06,13.815511
Warm,0.333333,0.666667,-0.693147


Once we have calculated the WoE for each group, we can map it back to the data-frame.

In [45]:
df.loc[:, 'WoE_Encode'] = df['Temperature'].map(woe_df['WoE'])
df

Unnamed: 0,Temperature,Color,Target,Temperature_mean_enc,Temperature_smean_enc,WoE_Encode
0,Hot,Red,1,0.75,0.701923,1.098612
1,Cold,Yellow,1,1.0,0.705882,13.815511
2,Very Hot,Blue,1,1.0,0.70297,13.815511
3,Warm,Blue,0,0.333333,0.68932,-0.693147
4,Hot,Red,1,0.75,0.701923,1.098612
5,Warm,Yellow,0,0.333333,0.68932,-0.693147
6,Warm,Red,1,0.333333,0.68932,-0.693147
7,Hot,Yellow,0,0.75,0.701923,1.098612
8,Hot,Yellow,1,0.75,0.701923,1.098612
9,Cold,Yellow,1,1.0,0.705882,13.815511


# 9. Probability Ratio Encoding

Probability Ratio Encoding is similar to Weight of Evidence (WoE), the only difference being that just the ratio of the good and bad probabilities is used, without taking the logarithm.

For each label, we calculate the mean of target=1, that is the probability of being 1 ( P(1) ), and also the probability of target=0 ( P(0) ). We then calculate the ratio P(1)/P(0) and replace the labels by that ratio. We need to substitute a very small value for P(0) when it is zero, to avoid a divide-by-zero scenario when a particular category has no target=0.

In [57]:
pr_df = pd.DataFrame(df.groupby('Temperature')['Target'].mean())
pr_df = pr_df.rename(columns = {'Target':'Good'})
pr_df

Unnamed: 0_level_0,Good
Temperature,Unnamed: 1_level_1
Cold,1.0
Hot,0.75
Very Hot,1.0
Warm,0.333333


In [58]:
pr_df['Bad'] = 1-pr_df.Good
pr_df['Bad'] = np.where(pr_df['Bad']==0, 0.000001, pr_df['Bad'])
pr_df['PR'] = pr_df.Good/pr_df.Bad
pr_df

Unnamed: 0_level_0,Good,Bad,PR
Temperature,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Cold,1.0,1e-06,1000000.0
Hot,0.75,0.25,3.0
Very Hot,1.0,1e-06,1000000.0
Warm,0.333333,0.666667,0.5


In [59]:
df.loc[:, 'PR_Encode'] = df['Temperature'].map(pr_df['PR'])
df

Unnamed: 0,Temperature,Color,Target,Temperature_mean_enc,Temperature_smean_enc,WoE_Encode,PR_Encode
0,Hot,Red,1,0.75,0.701923,1.098612,3.0
1,Cold,Yellow,1,1.0,0.705882,13.815511,1000000.0
2,Very Hot,Blue,1,1.0,0.70297,13.815511,1000000.0
3,Warm,Blue,0,0.333333,0.68932,-0.693147,0.5
4,Hot,Red,1,0.75,0.701923,1.098612,3.0
5,Warm,Yellow,0,0.333333,0.68932,-0.693147,0.5
6,Warm,Red,1,0.333333,0.68932,-0.693147,0.5
7,Hot,Yellow,0,0.75,0.701923,1.098612,3.0
8,Hot,Yellow,1,0.75,0.701923,1.098612,3.0
9,Cold,Yellow,1,1.0,0.705882,13.815511,1000000.0


# 10. Hashing
Hashing converts categorical variables to a higher dimensional space of integers, where the distance between two vectors of categorical variables is approximately maintained in the transformed numerical space. With Hashing, the number of dimensions will be far less than with an encoding like One Hot Encoding.
This method is very effective when the cardinality of the categorical variable is very high.
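
A minimal sketch of the idea, using only the standard library and pandas: each category is hashed into one of a fixed number of buckets, and the bucket is marked with a 1. The function name `hash_encode`, the use of MD5 and the default of 8 output columns are assumptions for illustration and do not reproduce any particular library.

```python
import hashlib
import pandas as pd

def hash_encode(values, n_components=8):
    """Map each value to one of n_components buckets via a stable hash."""
    rows = []
    for value in values:
        digest = hashlib.md5(str(value).encode('utf-8')).hexdigest()
        bucket = int(digest, 16) % n_components
        row = [0] * n_components
        row[bucket] = 1
        rows.append(row)
    return pd.DataFrame(rows, columns=[f'col_{i}' for i in range(n_components)])

temps = ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold']
encoded = hash_encode(temps)
print(encoded)
```

The number of columns is fixed in advance, no matter how many categories exist, and unseen categories need no special handling. The price is that two different categories may collide in the same bucket.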

# 11. Backward Difference Encoding
In backward difference coding, the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a nominal or an ordinal variable.
This technique falls under the contrast coding system for categorical features. A feature of K categories, or levels, usually enters a regression as a sequence of K-1 dummy variables.
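
The sketch below builds a backward difference contrast matrix by hand and checks, with an ordinary least-squares fit, that each coefficient equals the difference between the mean target of a level and that of the prior level. The function name, the chosen level order (Cold, Warm, Hot, Very Hot) and the use of `np.linalg.lstsq` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def backward_difference_contrasts(k):
    """Contrast matrix of shape (k, k - 1) for k ordered levels."""
    contrasts = np.zeros((k, k - 1))
    for j in range(k - 1):
        contrasts[: j + 1, j] = -(k - (j + 1)) / k   # levels up to j
        contrasts[j + 1 :, j] = (j + 1) / k          # levels after j
    return contrasts

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

levels = ['Cold', 'Warm', 'Hot', 'Very Hot']          # assumed level order
C = backward_difference_contrasts(len(levels))
level_index = df['Temperature'].map({lvl: i for i, lvl in enumerate(levels)}).to_numpy()

# Design matrix: intercept plus the K-1 contrast columns of each record's level.
X = np.column_stack([np.ones(len(df)), C[level_index]])
coef, *_ = np.linalg.lstsq(X, df['Target'].to_numpy(dtype=float), rcond=None)

level_means = df.groupby('Temperature')['Target'].mean().reindex(levels)
print(coef[1:])                           # fitted contrast coefficients
print(level_means.diff().dropna().values) # mean(level) - mean(prior level)
```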

# 12. Leave One Out Encoding
This is very similar to target encoding but excludes the current row’s target when calculating the mean target for a level to reduce the effect of outliers.
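
A minimal pandas sketch of this idea follows. For every row, the target sum and count of its category are reduced by the row's own contribution before the mean is taken. Falling back to the global mean for categories with a single record is an assumption, since no other rows are left to average over.

```python
import pandas as pd

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

grouped = df.groupby('Temperature')['Target']
sums = grouped.transform('sum')       # per-row: target sum of the row's category
counts = grouped.transform('count')   # per-row: size of the row's category
global_mean = df['Target'].mean()

# Remove the current row from both sum and count; categories of size 1
# have nothing left, so they become NaN and fall back to the global mean.
loo = (sums - df['Target']) / (counts - 1).where(counts > 1)
df['Temperature_loo_enc'] = loo.fillna(global_mean)
print(df)
```

Two rows of the same category can therefore receive different values: a "Hot" row with Target 1 gets 2/3, while the "Hot" row with Target 0 gets 1.0.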

# 13. James-Stein Encoding
For a given feature value, the James-Stein estimator returns a weighted average of:

* the mean target value for the observed feature value, and
* the mean target value regardless of the feature value.

The James-Stein encoder shrinks the category average toward the overall average. It is a target-based encoder.
The James-Stein estimator has, however, one practical limitation: it was defined only for normal distributions.
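
Written as a formula, the weighted average described above takes the following form. The notation is assumed for illustration and does not appear in the original: $\bar{y}_i$ is the mean target for feature value $i$, $\bar{y}$ is the overall mean target, and $B$ is a shrinkage weight between 0 and 1 whose estimation from the data is left open here.

$$
JS_i = (1 - B) \cdot \bar{y}_i + B \cdot \bar{y}, \qquad 0 \le B \le 1
$$

With $B = 0$ this reduces to plain mean encoding, and with $B = 1$ every category is encoded as the overall mean.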

# 14. M-estimator Encoding
M-Estimate Encoder is a simplified version of Target Encoder. It has only one hyper-parameter, m, which represents the power of regularization.
A higher value of m results in stronger shrinking. Recommended values for m are in the range of 1 to 100.
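
The effect of m can be sketched with pandas. The formula used here, (sum of the target in the category + m × global mean) / (category count + m), is my understanding of the usual m-estimate and should be read as an assumption; the function name `m_estimate_encode` is illustrative.

```python
import pandas as pd

def m_estimate_encode(frame, column, target, m):
    """Per-category target mean shrunk toward the global mean with strength m."""
    prior = frame[target].mean()
    stats = frame.groupby(column)[target].agg(['sum', 'count'])
    return (stats['sum'] + m * prior) / (stats['count'] + m)

df = pd.DataFrame({
    'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot', 'Warm', 'Warm', 'Hot', 'Hot', 'Cold'],
    'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1],
})

weak = m_estimate_encode(df, 'Temperature', 'Target', m=1)      # light shrinking
strong = m_estimate_encode(df, 'Temperature', 'Target', m=100)  # heavy shrinking
print(pd.DataFrame({'m=1': weak, 'm=100': strong}))
```

With m = 100 all categories end up very close to the global mean of 0.7, while with m = 1 they stay closer to their own means.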

# Conclusion

It is important to understand that none of these encodings works well for every machine learning model, in every situation or for every dataset. Data scientists still need to experiment and find out which works best for their specific case. If the test data contains classes that were not present in the training data, some of these methods won’t work, as the features won’t be comparable. There are a few benchmark publications by research communities, but they are not conclusive about which works best. My recommendation is to try each of these on a smaller dataset and then decide where to put more focus for tuning the encoding process. You can use the cheat-sheet below as a guiding tool.

| No| Encoding   |      Advantage      |  Disadvantage | Regression | Classification |
| :---:|:----------:|:-------------:|------:|:---:|:---:|
| 1| One Hot  |  --- | --- |ok| No|
| 2| Label  |    ---   |   --- |ok| No|
| 3| Ordinal Encoding | --- |    --- |ok| No|
| 4| Helmert Encoding | --- |    --- |ok| No|
| 5| Binary Encoding | --- |    --- |ok| No|
| 6| Frequency Encoding | --- |    --- |ok| No|
| 7| Mean Encoding | --- |    --- |ok| No|
| 8| Weight of Evidence Encoding | --- |    --- |ok| No|
| 9| Probability Ratio Encoding | --- |    --- |ok| No|
| 10| Hashing Encoding | --- |    --- |ok| No|
| 11| Backward Difference Encoding | --- |    --- |ok| No|
| 12| Leave One Out Encoding | --- |    --- |ok| No|
| 13| James-Stein Encoding | --- |    --- |ok| No|
| 14| M-estimator Encoding | --- |    --- |ok| No|
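
The remark in the conclusion about test data containing different classes can be illustrated with frequency encoding. In the sketch below the mapping is learned on the training data only, so a class that appears only in the test data has no entry. The sample values and the choice of 0.0 as a fallback are assumptions for illustration.

```python
import pandas as pd

train = pd.Series(['Hot', 'Cold', 'Warm', 'Hot'])
test = pd.Series(['Hot', 'Freezing'])     # 'Freezing' never occurs in train

# Learn the frequency mapping on the training data only.
freq = train.value_counts(normalize=True)

# Unseen classes map to NaN, which most models cannot consume.
test_encoded = test.map(freq)

# One possible fallback: treat an unseen class as having frequency 0.
test_encoded_filled = test_encoded.fillna(0.0)
print(test_encoded.tolist())
print(test_encoded_filled.tolist())
```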

### References

[1] [Baijayanta Roy's Article in Towards Data Science, July 2019](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)