# Categorical Variable Encoding

* **Individuals** are the people or things described by a data set.
* **Variables** are characteristics of the individuals that we measure or observe.
* **Categorical variables** take on values that are labels or categories, whereas **quantitative variables** take on numerical values.
* Categorical variables fall into two types: **nominal** (no particular order) and **ordinal** (some inherent order).

<img src="https://miro.medium.com/max/2560/1*wYbTRM0dgnRzutwZq63xCg.png"
     alt="Categories of categorical variables"
     style="align:middle"
     height="50%" width="50%"/>
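
The nominal/ordinal distinction can be made explicit in pandas. The following is a minimal sketch, assuming the color and temperature labels used later in this notebook; the variable names `color` and `temperature` are illustrative only.

```python
import pandas as pd

# Nominal: the categories have no inherent order.
color = pd.Categorical(['Red', 'Yellow', 'Blue'], ordered=False)

# Ordinal: the categories are declared with an explicit order,
# here from coldest to hottest.
temperature = pd.Categorical(
    ['Hot', 'Cold', 'Very Hot', 'Warm'],
    categories=['Cold', 'Warm', 'Hot', 'Very Hot'],
    ordered=True,
)

print(color.ordered)        # False
print(temperature.ordered)  # True

# min/max are only defined for the ordered variable.
print(temperature.min(), temperature.max())  # Cold Very Hot
```

Declaring the order once on the variable keeps it from being lost when the labels are later turned into numbers.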

There are many ways to encode these categorical variables as numbers so that an algorithm can use them:
1. One Hot Encoding
2. Label Encoding
3. Ordinal Encoding
4. Helmert Encoding
5. Binary Encoding
6. Frequency Encoding
7. Mean Encoding
8. Weight of Evidence Encoding
9. Probability Ratio Encoding
10. Hashing Encoding
11. Backward Difference Encoding
12. Leave One Out Encoding
13. James-Stein Encoding
14. M-estimator Encoding

For the purpose of explanation, I will use this data frame, which has two independent variables or features (Temperature and Color) and one label (Target). It also has Rec-No, the sequence number of the record. There are 10 records in total. Note that in the code below the Color feature is stored in a column named `Cold`:

<img src="https://miro.medium.com/max/3104/1*xYNQbmdziUenEQYXZKXoCg.png"
     alt="Sample data frame with Temperature, Color and Target columns"
     style="align:middle"
     height="40%" width="40%"/>

In [1]:
import pandas as pd
import numpy as np

data = {'Temperature': ['Hot','Cold','Very Hot','Warm','Hot','Warm','Warm','Hot','Hot','Cold'],
        'Cold': ['Red','Yellow','Blue','Blue','Red','Yellow','Red','Yellow','Yellow','Yellow'],
        'Target': [1, 1, 1, 0, 1, 0, 1, 0, 1, 1]}

df = pd.DataFrame(data, columns = ['Temperature', 'Cold', 'Target'])
df

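| | Temperature | Cold | Target |
|---|---|---|---|
| 0 | Hot | Red | 1 |
| 1 | Cold | Yellow | 1 |
| 2 | Very Hot | Blue | 1 |
| 3 | Warm | Blue | 0 |
| 4 | Hot | Red | 1 |
| 5 | Warm | Yellow | 0 |
| 6 | Warm | Red | 1 |
| 7 | Hot | Yellow | 0 |
| 8 | Hot | Yellow | 1 |
| 9 | Cold | Yellow | 1 |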


## One Hot Encoding

In this method, we map each category to a vector of 1s and 0s, where a 1 denotes the presence of that category and a 0 its absence.

In [2]:
# using get_dummies in pandas

ohe = pd.get_dummies(df, prefix = ['Temp'], columns = ['Temperature'])
ohe

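| | Cold | Target | Temp_Cold | Temp_Hot | Temp_Very Hot | Temp_Warm |
|---|---|---|---|---|---|---|
| 0 | Red | 1 | 0 | 1 | 0 | 0 |
| 1 | Yellow | 1 | 1 | 0 | 0 | 0 |
| 2 | Blue | 1 | 0 | 0 | 1 | 0 |
| 3 | Blue | 0 | 0 | 0 | 0 | 1 |
| 4 | Red | 1 | 0 | 1 | 0 | 0 |
| 5 | Yellow | 0 | 0 | 0 | 0 | 1 |
| 6 | Red | 1 | 0 | 0 | 0 | 1 |
| 7 | Yellow | 0 | 0 | 1 | 0 | 0 |
| 8 | Yellow | 1 | 0 | 1 | 0 | 0 |
| 9 | Yellow | 1 | 1 | 0 | 0 | 0 |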


In [3]:
# using sklearn

from sklearn.preprocessing import OneHotEncoder

ohc = OneHotEncoder()
ohe = ohc.fit_transform(df.Temperature.values.reshape((-1, 1))).toarray()
# name each new column after the category it represents
dfOneHot = pd.DataFrame(ohe, columns = ['Temp_' + str(c) for c in ohc.categories_[0]])
dfh = pd.concat([df, dfOneHot], axis = 1)
dfh

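| | Temperature | Cold | Target | Temp_Cold | Temp_Hot | Temp_Very Hot | Temp_Warm |
|---|---|---|---|---|---|---|---|
| 0 | Hot | Red | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 1 | Cold | Yellow | 1 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | Very Hot | Blue | 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | Warm | Blue | 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | Hot | Red | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 5 | Warm | Yellow | 0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 6 | Warm | Red | 1 | 0.0 | 0.0 | 0.0 | 1.0 |
| 7 | Hot | Yellow | 0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 8 | Hot | Yellow | 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 9 | Cold | Yellow | 1 | 1.0 | 0.0 | 0.0 | 0.0 |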


**Note**:
1. For regression, we usually keep N-1 columns (drop the first or last one-hot encoded column), because the N columns always sum to 1 and are therefore perfectly collinear.
2. For classification, the recommendation is to keep all N columns, as most tree-based algorithms build trees based on all available variables.
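
The N versus N-1 choice from the note can be sketched with the `drop_first` parameter of `pd.get_dummies`. This is a small illustration on a hypothetical five-row frame named `temps`, not the full data frame used above.

```python
import pandas as pd

# Small illustrative frame (hypothetical; not the notebook's df).
temps = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot']})

# All N columns: one indicator per category.
full = pd.get_dummies(temps, columns=['Temperature'], prefix='Temp')

# N-1 columns: the first category (alphabetically, Cold) is dropped
# and becomes the implicit baseline.
reduced = pd.get_dummies(temps, columns=['Temperature'], prefix='Temp',
                         drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, a row that has 0 in every remaining column belongs to the dropped category, so no information is lost.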

## Label Encoding

1. Each category is assigned an integer from 0 through N-1, where N is the number of categories for the feature.
2. The major issue with this approach is that there is no relation or order between these classes, but an algorithm might treat the integers as ordered or otherwise related.

For example, scikit-learn's alphabetical codes imply Cold < Hot < Very Hot < Warm (0 < 1 < 2 < 3), which is not the real temperature order.

In [4]:
from sklearn.preprocessing import LabelEncoder

df['Temp_label_encoded'] = LabelEncoder().fit_transform(df.Temperature)
df

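| | Temperature | Cold | Target | Temp_label_encoded |
|---|---|---|---|---|
| 0 | Hot | Red | 1 | 1 |
| 1 | Cold | Yellow | 1 | 0 |
| 2 | Very Hot | Blue | 1 | 2 |
| 3 | Warm | Blue | 0 | 3 |
| 4 | Hot | Red | 1 | 1 |
| 5 | Warm | Yellow | 0 | 3 |
| 6 | Warm | Red | 1 | 3 |
| 7 | Hot | Yellow | 0 | 1 |
| 8 | Hot | Yellow | 1 | 1 |
| 9 | Cold | Yellow | 1 | 0 |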


In [5]:
df.drop(['Temp_label_encoded'], axis = 1, inplace = True)

In [6]:
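# pandas factorize does the same job, but numbers categories in order of first appearance.
# factorize returns (codes, uniques); the codes are already 1-D, so no reshape is needed.
df['Temp_factorize_encode'] = pd.factorize(df['Temperature'])[0]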
df

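| | Temperature | Cold | Target | Temp_factorize_encode |
|---|---|---|---|---|
| 0 | Hot | Red | 1 | 0 |
| 1 | Cold | Yellow | 1 | 1 |
| 2 | Very Hot | Blue | 1 | 2 |
| 3 | Warm | Blue | 0 | 3 |
| 4 | Hot | Red | 1 | 0 |
| 5 | Warm | Yellow | 0 | 3 |
| 6 | Warm | Red | 1 | 3 |
| 7 | Hot | Yellow | 0 | 0 |
| 8 | Hot | Yellow | 1 | 0 |
| 9 | Cold | Yellow | 1 | 1 |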


In [7]:
df.drop(['Temp_factorize_encode'], axis = 1, inplace = True)
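
The two label-encoding cells above assign different codes to the same labels. The sketch below makes both mappings visible side by side; the `sklearn_mapping` and `pandas_mapping` names are illustrative, and the five-element series is a shortened stand-in for the Temperature column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

temperature = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm', 'Hot'])

# scikit-learn sorts the labels, so codes follow alphabetical order.
encoder = LabelEncoder().fit(temperature)
sklearn_mapping = {label: code for code, label in enumerate(encoder.classes_)}

# pandas numbers the labels in order of first appearance.
codes, uniques = pd.factorize(temperature)
pandas_mapping = {label: code for code, label in enumerate(uniques)}

print(sklearn_mapping)
print(pandas_mapping)
```

Neither mapping reflects the actual temperature order, which is the motivation for an explicit ordinal mapping.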

## Ordinal Encoding

Ordinal encoding ensures that the encoded values retain the ordinal nature of the variable. It looks similar to label encoding, but label encoding does not consider whether the variable is ordinal; it simply assigns a sequence of integers:

* in order of first appearance in the data (pandas assigned Hot (0), Cold (1), “Very Hot” (2) and Warm (3)), or
* in alphabetically sorted order (scikit-learn assigned Cold (0), Hot (1), “Very Hot” (2) and Warm (3)).

Ordinal encoding instead uses an explicit mapping that follows the real order, here Cold < Warm < Hot < Very Hot.

In [8]:
# using pandas and an ordinal dictionary

Temp_dict = {'Cold': 1,
             'Warm': 2,
             'Hot': 3,
             'Very Hot': 4}
df['Temp_Ordinal'] = df.Temperature.map(Temp_dict)
df

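| | Temperature | Cold | Target | Temp_Ordinal |
|---|---|---|---|---|
| 0 | Hot | Red | 1 | 3 |
| 1 | Cold | Yellow | 1 | 1 |
| 2 | Very Hot | Blue | 1 | 4 |
| 3 | Warm | Blue | 0 | 2 |
| 4 | Hot | Red | 1 | 3 |
| 5 | Warm | Yellow | 0 | 2 |
| 6 | Warm | Red | 1 | 2 |
| 7 | Hot | Yellow | 0 | 3 |
| 8 | Hot | Yellow | 1 | 3 |
| 9 | Cold | Yellow | 1 | 1 |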


In [9]:
df.drop(['Temp_Ordinal'], axis = 1, inplace = True)
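
As an alternative to the dictionary used above, the same ordinal codes can be derived from an ordered `pd.Categorical`. This is a sketch under the assumption that the order is Cold < Warm < Hot < Very Hot, as in `Temp_dict`; the names `order`, `ordered` and `codes` are illustrative.

```python
import pandas as pd

temps = pd.Series(['Hot', 'Cold', 'Very Hot', 'Warm'])

# Declare the order once, from coldest to hottest.
order = ['Cold', 'Warm', 'Hot', 'Very Hot']
ordered = pd.Categorical(temps, categories=order, ordered=True)

# .codes is 0-based; add 1 to match the 1..4 values of the dictionary approach.
codes = ordered.codes + 1

print(list(codes))  # [3, 1, 4, 2]
```

A label that is missing from `order` gets the code -1 before the shift, so it is worth checking for unexpected labels before encoding.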

## Helmert Encoding

The mean of the dependent variable for a level is compared to the mean of the dependent variable over all previous levels. Helmert coding and its inverse, difference coding, really only make sense when the variable is ordinal.

In [10]:
import category_encoders as ce

encoder = ce.HelmertEncoder(cols = ['Temperature'], drop_invariant = True)
dfh = encoder.fit_transform(df['Temperature'])
dfhe = pd.concat([df, dfh], axis = 1)
dfhe

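| | Temperature | Cold | Target | Temperature_0 | Temperature_1 | Temperature_2 |
|---|---|---|---|---|---|---|
| 0 | Hot | Red | 1 | -1.0 | -1.0 | -1.0 |
| 1 | Cold | Yellow | 1 | 1.0 | -1.0 | -1.0 |
| 2 | Very Hot | Blue | 1 | 0.0 | 2.0 | -1.0 |
| 3 | Warm | Blue | 0 | 0.0 | 0.0 | 3.0 |
| 4 | Hot | Red | 1 | -1.0 | -1.0 | -1.0 |
| 5 | Warm | Yellow | 0 | 0.0 | 0.0 | 3.0 |
| 6 | Warm | Red | 1 | 0.0 | 0.0 | 3.0 |
| 7 | Hot | Yellow | 0 | -1.0 | -1.0 | -1.0 |
| 8 | Hot | Yellow | 1 | -1.0 | -1.0 | -1.0 |
| 9 | Cold | Yellow | 1 | 1.0 | -1.0 | -1.0 |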
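
To see where the values in the output above come from, the contrast matrix can be rebuilt by hand with NumPy. This is only a sketch that reproduces the pattern in that output, assuming levels are taken in order of first appearance (Hot, Cold, Very Hot, Warm); `helmert_contrasts` is a hypothetical helper, not part of `category_encoders`.

```python
import numpy as np

def helmert_contrasts(n_levels):
    """Build an (n_levels x n_levels-1) matrix matching the output above."""
    contrasts = np.zeros((n_levels, n_levels - 1))
    for j in range(n_levels - 1):
        contrasts[: j + 1, j] = -1.0   # every earlier level gets -1
        contrasts[j + 1, j] = j + 1    # the level being compared gets j+1
    return contrasts

# Assumed level order: first appearance in the Temperature column.
levels = ['Hot', 'Cold', 'Very Hot', 'Warm']
matrix = helmert_contrasts(len(levels))

for level, row in zip(levels, matrix):
    print(level, row)
```

Each column sums to zero, which is what makes it a contrast: it weighs one level against the levels that came before it.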


To-Do:

* [ ] Rest of encoding methods
* [ ] Read: https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis/
* [ ] Learn Linear Regression
* [ ] Re-read: https://stats.stackexchange.com/questions/411134/how-to-calculate-helmert-coding

## References

1. [Towards Data Science: Categorical Variable Encoding](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)
2. [Khan Academy: Analyzing Categorical Variables](https://www.khanacademy.org/math/statistics-probability/analyzing-categorical-data)
3. [Stack Exchange: How to calculate Helmert Coding](https://stats.stackexchange.com/questions/411134/how-to-calculate-helmert-coding)