# Encoding of categorical variables

## Sources

* [Encoding categorical variables](https://kiwidamien.github.io/encoding-categorical-variables.html)
* [Benchmarking Categorical Encoders](https://towardsdatascience.com/benchmarking-categorical-encoders-9c322bd77ee8)
* [Smarter Ways to Encode Categorical Data for Machine Learning](https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159)
* [Encoding Categorical Variables in Practice](https://medium.com/epfl-extension-school/encoding-categorical-variables-in-practice-a536907f2013)
* [Types of Categorical Data Encoding Schemes](https://medium.com/analytics-vidhya/types-of-categorical-data-encoding-schemes-a5bbeb4ba02b)
* [Guide to Encoding Categorical Values in Python](https://pbpython.com/categorical-encoding.html)
* [All about Categorical Variable Encoding](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02)
* [Encoding categorical variables: one-hot and beyond](http://www.win-vector.com/blog/2017/04/encoding-categorical-variables-one-hot-and-beyond/)

## Code

In [1]:
import requests
import category_encoders as ce
import numpy as np
import pandas as pd

In [2]:
download_data = False

In [3]:
if download_data:
    url = "http://mlr.cs.umass.edu/ml/machine-learning-databases/autos/imports-85.data"
    r = requests.get(url)
    with open('imports-85.data', 'wb') as f:
        f.write(r.content)

In [4]:
# Define the headers since the data does not have any
headers = ["symboling", "normalized_losses", "make", "fuel_type", "aspiration",
           "num_doors", "body_style", "drive_wheels", "engine_location",
           "wheel_base", "length", "width", "height", "curb_weight",
           "engine_type", "num_cylinders", "engine_size", "fuel_system",
           "bore", "stroke", "compression_ratio", "horsepower", "peak_rpm",
           "city_mpg", "highway_mpg", "price"]
# Read in the CSV file and convert "?" to NaN
df = pd.read_csv("imports-85.data",
                  header=None, names=headers, na_values="?" )

In [5]:
df.head()

Unnamed: 0,symboling,normalized_losses,make,fuel_type,aspiration,num_doors,body_style,drive_wheels,engine_location,wheel_base,...,engine_size,fuel_system,bore,stroke,compression_ratio,horsepower,peak_rpm,city_mpg,highway_mpg,price
0,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,13495.0
1,3,,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111.0,5000.0,21,27,16500.0
2,1,,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154.0,5000.0,19,26,16500.0
3,2,164.0,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102.0,5500.0,24,30,13950.0
4,2,164.0,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115.0,5500.0,18,22,17450.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
symboling            205 non-null int64
normalized_losses    164 non-null float64
make                 205 non-null object
fuel_type            205 non-null object
aspiration           205 non-null object
num_doors            203 non-null object
body_style           205 non-null object
drive_wheels         205 non-null object
engine_location      205 non-null object
wheel_base           205 non-null float64
length               205 non-null float64
width                205 non-null float64
height               205 non-null float64
curb_weight          205 non-null int64
engine_type          205 non-null object
num_cylinders        205 non-null object
engine_size          205 non-null int64
fuel_system          205 non-null object
bore                 201 non-null float64
stroke               201 non-null float64
compression_ratio    205 non-null float64
horsepower           203 non-

# Unsupervised Encoding Methods

In [7]:
def encode_var(var, encoder, y=None):
    if y is None:
        encoder.fit(var)
    else:
        encoder.fit(var, y)
    new_var = encoder.transform(var)
    if isinstance(new_var, pd.DataFrame):
        new_var.insert(0, 'original', var)
        return new_var
    else:
        return pd.DataFrame({'original': var, 'encoder': new_var})

In [8]:
def print_res(res, rows_per_level=2):
    # Show the first rows of every level, most frequent level first.
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    levels = res.original.value_counts().index
    return pd.concat([res[res.original == lvl].head(rows_per_level) for lvl in levels])

## Label Encoding

**Description**

Label encoding replaces the *n* labels with values from *0* to *n-1*, in lexicographical order.

**When to use**

Never (it is only suitable for ordinal variables, for which there is the `OrdinalEncoder` class).

In [13]:
df["body_style"].value_counts()

sedan          96
hatchback      70
wagon          25
hardtop         8
convertible     6
Name: body_style, dtype: int64

In [15]:
from sklearn.preprocessing import LabelEncoder

In [18]:
encoder = LabelEncoder()
res = encode_var(df["body_style"], encoder)

In [19]:
res.head()

Unnamed: 0,original,encoder
0,convertible,0
1,convertible,0
2,hatchback,2
3,sedan,3
4,sedan,3


## Ordinal Encoding

**Description**

Same as label encoding, but the integers follow the natural order of the levels rather than their lexicographical order.

**When to use**

For ordinal variables where the distance between any two adjacent levels is the same.

In [52]:
df["num_cylinders"].value_counts(sort=False)

two         4
three       1
four      159
five       11
six        24
eight       5
twelve      1
Name: num_cylinders, dtype: int64

In [51]:
df["num_cylinders"] = df["num_cylinders"].astype('category').cat.reorder_categories(ordered=True, new_categories=['two', 'three', 'four', 'five', 'six', 'eight', 'twelve'])

In [None]:
df["num_cylinders"].value_counts(sort=False)

In [57]:
encoder = ce.ordinal.OrdinalEncoder()

In [64]:
res = encode_var(df[["num_cylinders"]], encoder)
res.head()

Unnamed: 0,original,num_cylinders
0,four,3
1,four,3
2,six,5
3,four,3
4,five,4


## Dummy/One Hot Encoding

**Description**

Creates a binary indicator for each level.

**When to use**

When the main interest is in differences in average values for each level and there are sufficient observations for each level.

In [36]:
df['drive_wheels'].value_counts()

fwd    120
rwd     76
4wd      9
Name: drive_wheels, dtype: int64

In [66]:
encoder = ce.one_hot.OneHotEncoder()
res = encode_var(df[["drive_wheels"]], encoder)
print_res(res)

Unnamed: 0,original,drive_wheels_1,drive_wheels_2,drive_wheels_3
3,fwd,0,1,0
5,fwd,0,1,0
0,rwd,1,0,0
1,rwd,1,0,0
4,4wd,0,0,1
9,4wd,0,0,1


## Simple Encoder

**Description**

Compares each level to the reference level, with the intercept as the grand mean.

**Construction**

For each level, other than the reference level, the coding has (n-1)/n for the level and -1/n for each other level.
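A minimal NumPy sketch of this construction, assuming the first level is the reference level. The helper name `simple_contrast` is made up for illustration and is not part of any library.

```python
import numpy as np

def simple_contrast(n_levels):
    """Simple-coding contrast matrix with the first level as reference.

    Rows are levels, columns are the n-1 non-reference levels.
    """
    # Dummy coding: the reference row is all zeros, the other rows are unit vectors
    dummy = np.vstack([np.zeros(n_levels - 1), np.eye(n_levels - 1)])
    # Subtracting 1/n turns 1 into (n-1)/n and 0 into -1/n
    return dummy - 1.0 / n_levels

contrast = simple_contrast(3)
print(contrast)
```

Every column sums to zero, which is why the intercept of a regression on these columns is the mean of the level means.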

**When to use**

Same as dummy encoding, but when the intercept should represent the grand mean rather than the mean of the reference level.

### Not available in Python

In [17]:
from simple_coding import SimpleEncoder

In [19]:
encoder = SimpleEncoder()
res = encode_var(df[["drive_wheels"]], encoder)
print_res(res)

Unnamed: 0,original,intercept,drive_wheels_0,drive_wheels_1
3,fwd,1,0.666667,-0.333333
5,fwd,1,0.666667,-0.333333
0,rwd,1,-0.333333,-0.333333
1,rwd,1,-0.333333,-0.333333
4,4wd,1,-0.333333,0.666667
9,4wd,1,-0.333333,0.666667


## Binary Encoding

**Description**

Similar to One-Hot Encoding, but the values of the categories are stored as binary bitstrings. Each binary digit creates one encoding column, i.e. if there are $n$ levels then there are about $\log_2 n$ features.

**Construction**

Each level is associated with its ordinal index, which is then translated into a bitstring. The encoding variables are the individual digits of the bitstring.
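The translation from ordinal index to bit columns can be sketched with pandas. This is an illustration only: the helper `binary_encode` is hypothetical, and numbering the levels from 1 in order of first appearance is an assumption.

```python
import pandas as pd

def binary_encode(series):
    """Encode a categorical series as the bits of each level's ordinal index."""
    # Ordinal index per level, starting at 1 in order of first appearance (assumption)
    codes = pd.factorize(series)[0] + 1
    n_bits = max(int(codes.max()).bit_length(), 1)
    # Column j holds one bit of the index, most significant bit first
    bits = {f"{series.name}_{j}": (codes >> (n_bits - 1 - j)) & 1 for j in range(n_bits)}
    return pd.DataFrame(bits, index=series.index)

wheels = pd.Series(["rwd", "rwd", "fwd", "4wd"], name="drive_wheels")
encoded = binary_encode(wheels)
print(encoded)
```

Three levels need only two bit columns here, compared with three indicator columns for one-hot encoding.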

**When to use**

For categorical variables with many levels.

In [16]:
encoder = ce.binary.BinaryEncoder()
res = encode_var(df[["drive_wheels"]], encoder)
print_res(res)

Unnamed: 0,original,drive_wheels_0,drive_wheels_1,drive_wheels_2
3,fwd,0,1,0
5,fwd,0,1,0
0,rwd,0,0,1
1,rwd,0,0,1
4,4wd,0,1,1
9,4wd,0,1,1


## BaseN Encoding

**Description**

The Base-N encoder encodes the categories into arrays of their base-N representation. A base of $1$ is equivalent to one-hot encoding (not really base-1, but useful) and a base of $2$ is equivalent to binary encoding. A base of $n$, the number of levels, is equivalent to vanilla ordinal encoding.

**Construction**

Each level is associated with its ordinal index, which is then written as a string of base-N digits. The encoding variables are the individual digits of that string.
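As an illustration of the digit expansion, the following plain-Python sketch converts an ordinal index into its base-N digits. The function name `to_base_n` and the fixed `width` argument are assumptions made for the example.

```python
def to_base_n(index, base, width):
    """Return the digits of a non-negative integer in `base`, padded to `width` digits."""
    digits = []
    for _ in range(width):
        index, digit = divmod(index, base)
        digits.append(digit)
    return digits[::-1]  # most significant digit first

# Index 13 in base 4 is 0*16 + 3*4 + 1
print(to_base_n(13, base=4, width=3))  # [0, 3, 1]
```

Each position of the returned list would become one encoded column.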

**When to use**

For categorical variables with many levels.

In [22]:
encoder = ce.basen.BaseNEncoder(base=4)
res = encode_var(df[["make"]], encoder)
print_res(res)

Unnamed: 0,original,make_0,make_1,make_2,make_3
150,toyota,0,1,1,0
151,toyota,0,1,1,0
89,nissan,0,0,3,1
90,nissan,0,0,3,1
50,mazda,0,0,2,1
51,mazda,0,0,2,1
76,mitsubishi,0,0,3,0
77,mitsubishi,0,0,3,0
30,honda,0,0,1,2
31,honda,0,0,1,2


## Sum encoding / Deviation Encoding / Effect Encoding

**Description**

Sum encoding compares each group effect to the grand mean, i.e. the mean of group means (which is not the overall mean). Each encoded column is the indicator variable of one level minus the indicator variable of a single baseline group. Sum encoding is similar to One-Hot encoding, but the interpretation of effects is different (effect relative to the grand mean vs. effect relative to a baseline group).
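A small NumPy sketch of the resulting contrast matrix, assuming the last level is the baseline group. The helper name `sum_contrast` is hypothetical.

```python
import numpy as np

def sum_contrast(n_levels):
    """Sum (deviation) contrast matrix with the last level as baseline."""
    # Non-baseline levels keep their own indicator column; the baseline row is all -1
    return np.vstack([np.eye(n_levels - 1), -np.ones(n_levels - 1)])

contrast = sum_contrast(3)
print(contrast)
```

The columns sum to zero across levels, which is what shifts the reference point from a baseline group to the grand mean.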

**When to use**

When the main interest is in differences in average values for each level compared to the grand mean and there are sufficient observations for each level.

In [9]:
encoder = ce.sum_coding.SumEncoder()
res = encode_var(df[["drive_wheels"]], encoder)
print_res(res)

Unnamed: 0,original,intercept,drive_wheels_0,drive_wheels_1
3,fwd,1,0.0,1.0
5,fwd,1,0.0,1.0
0,rwd,1,1.0,0.0
1,rwd,1,1.0,0.0
4,4wd,1,-1.0,-1.0
9,4wd,1,-1.0,-1.0


## (Orthogonal) Polynomial Encoding

**Description**

Orthogonal polynomial coding is a form of trend analysis in that it is looking for the linear, quadratic and cubic trends in the categorical variable.

**Construction**


**When to use**

This type of coding system should be used only with an ordinal variable in which the levels are equally spaced.
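One way to obtain such trend columns is to orthonormalize the powers of equally spaced level scores. The sketch below does this with a QR decomposition. It is an illustration under that assumption, not necessarily how `category_encoders` computes its values, and `polynomial_contrast` is a hypothetical helper name.

```python
import numpy as np

def polynomial_contrast(n_levels):
    """Orthonormal polynomial contrasts for equally spaced levels."""
    # Centred, equally spaced scores, e.g. -1, 0, 1 for three levels
    scores = np.arange(n_levels) - (n_levels - 1) / 2
    # Columns 1, x, x^2, ..., x^(n-1)
    vander = np.vander(scores, n_levels, increasing=True)
    q, r = np.linalg.qr(vander)
    # Fix the sign of each column so that the leading coefficient is positive
    q = q * np.sign(np.diag(r))
    # Drop the constant column; the rest are the linear, quadratic, ... trends
    return q[:, 1:]

contrast = polynomial_contrast(7)
print(contrast.round(4))
```

The first column is the linear trend, the second the quadratic trend, and so on; all columns are mutually orthogonal and sum to zero.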

In [24]:
encoder = ce.polynomial.PolynomialEncoder()
res = encode_var(df[["num_cylinders"]], encoder)
print_res(res)

Unnamed: 0,original,intercept,num_cylinders_0,num_cylinders_1,num_cylinders_2,num_cylinders_3,num_cylinders_4,num_cylinders_5
0,four,1,-0.5669467,0.5455447,-0.4082483,0.241747,-0.1091089,0.032898
1,four,1,-0.5669467,0.5455447,-0.4082483,0.241747,-0.1091089,0.032898
2,six,1,-0.3779645,9.521795000000001e-17,0.4082483,-0.564076,0.4364358,-0.197386
12,six,1,-0.3779645,9.521795000000001e-17,0.4082483,-0.564076,0.4364358,-0.197386
4,five,1,-0.1889822,-0.3273268,0.4082483,0.080582,-0.5455447,0.493464
5,five,1,-0.1889822,-0.3273268,0.4082483,0.080582,-0.5455447,0.493464
71,eight,1,0.5669467,0.5455447,0.4082483,0.241747,0.1091089,0.032898
72,eight,1,0.5669467,0.5455447,0.4082483,0.241747,0.1091089,0.032898
55,two,1,0.3779645,1.1951220000000001e-17,-0.4082483,-0.564076,-0.4364358,-0.197386
56,two,1,0.3779645,1.1951220000000001e-17,-0.4082483,-0.564076,-0.4364358,-0.197386


## Helmert Encoding

**Description**

Helmert coding, in the reversed form used here, compares each level of a categorical variable to the mean of the previous levels. The first contrast compares the mean of the dependent variable for the second level with the mean of the dependent variable for the first level. The second contrast compares the mean for the third level with the mean for the first two levels, etc.

**Construction**

The contrast column for level k, starting at *k=2* up to *k=n*, has -1 for each level before k, *k-1* for level k itself, and 0 for each later level.
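A minimal NumPy sketch of this contrast matrix, with rows as levels and one column per compared level. The helper name `helmert_contrast` is hypothetical; for three levels the result has the same pattern as the encoder output shown in this section.

```python
import numpy as np

def helmert_contrast(n_levels):
    """Reverse Helmert contrast matrix: rows are levels, columns are contrasts."""
    contrast = np.zeros((n_levels, n_levels - 1))
    for j in range(1, n_levels):
        contrast[:j, j - 1] = -1  # every level before the compared one
        contrast[j, j - 1] = j    # the compared level itself
    return contrast

contrast = helmert_contrast(3)
print(contrast)
```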

**When to use**

When levels of a categorical variable are ordered from lowest to highest or from smallest to largest. 

In [12]:
encoder = ce.helmert.HelmertEncoder()
res = encode_var(df[["drive_wheels"]], encoder)
print_res(res)

Unnamed: 0,original,intercept,drive_wheels_0,drive_wheels_1
3,fwd,1,1.0,-1.0
5,fwd,1,1.0,-1.0
0,rwd,1,-1.0,-1.0
1,rwd,1,-1.0,-1.0
4,4wd,1,0.0,2.0
9,4wd,1,0.0,2.0


## Backward Difference Encoding

**Description**

Similar to Helmert encoding, but differences are taken with respect to the prior adjacent level (not all prior levels).

**Construction**

Contrast column k, for *k=1* up to *k=n-1*, has -(n-k)/n for each level up to and including k and k/n for each subsequent level.
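The rule can be written out as a short NumPy sketch, reading it as one contrast column per step between adjacent levels. The helper name `backward_difference_contrast` is hypothetical.

```python
import numpy as np

def backward_difference_contrast(n_levels):
    """Backward-difference contrast matrix: rows are levels, columns are steps."""
    contrast = np.empty((n_levels, n_levels - 1))
    for k in range(1, n_levels):
        contrast[:k, k - 1] = -(n_levels - k) / n_levels  # levels up to and including k
        contrast[k:, k - 1] = k / n_levels                # levels after k
    return contrast

contrast = backward_difference_contrast(3)
print(contrast.round(6))
```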


**When to use**

When levels of a categorical variable are ordered from lowest to highest or from smallest to largest, but the interest is in step-wise differences.

In [13]:
encoder = ce.backward_difference.BackwardDifferenceEncoder()
res = encode_var(df[["drive_wheels"]], encoder)
print_res(res)

Unnamed: 0,original,intercept,drive_wheels_0,drive_wheels_1
3,fwd,1,0.333333,-0.333333
5,fwd,1,0.333333,-0.333333
0,rwd,1,-0.666667,-0.333333
1,rwd,1,-0.666667,-0.333333
4,4wd,1,0.333333,0.666667
9,4wd,1,0.333333,0.666667


In [15]:
df['drive_wheels'].value_counts()

fwd    120
rwd     76
4wd      9
Name: drive_wheels, dtype: int64

## Frequency / Count Encoding

**Description**

Frequency encoding replaces the levels of a categorical variable with their absolute or relative frequency.

**Construction**

$$x_k = \frac{n_k}{n}$$
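The relative-frequency variant can be reproduced with plain pandas, as in this small sketch on made-up data:

```python
import pandas as pd

levels = pd.Series(["fwd", "fwd", "rwd", "fwd", "4wd"])
# Relative frequency n_k / n of every level
frequencies = levels.value_counts(normalize=True)
# Replace each level with its relative frequency
encoded = levels.map(frequencies)
print(encoded.tolist())
```

Note that levels with the same frequency (here `rwd` and `4wd`) become indistinguishable after encoding.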

**When to use**

When the frequency of a level is itself informative, i.e. when similarly common (or similarly uncommon) levels have a similar influence on the target.

In [27]:
encoder = ce.count.CountEncoder(normalize=True)
res = encode_var(df[["drive_wheels"]], encoder)
print_res(res)

Unnamed: 0,original,drive_wheels
3,fwd,0.585366
5,fwd,0.585366
0,rwd,0.370732
1,rwd,0.370732
4,4wd,0.043902
9,4wd,0.043902


## Hashing

**Description**

Hashing converts categorical variables to a higher-dimensional space of integers, where the distance between two vectors of categorical variables is approximately maintained in the transformed numerical space.

**Construction**

Any hash function from the `hashlib` package can be used.
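A rough sketch of the idea using only the standard library: each label is hashed and the digest, taken modulo the number of output columns, selects the column that is set. The names `hash_bucket`, `hash_encode`, `n_components` and `hash_method` are chosen for illustration, and the defaults are assumptions.

```python
import hashlib

def hash_bucket(value, n_components=8, hash_method="md5"):
    """Map a category label to one of `n_components` columns via a hashlib digest."""
    digest = hashlib.new(hash_method, str(value).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_components

def hash_encode(values, n_components=8):
    """One row per value, with a 1 in the hashed column and 0 elsewhere."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] = 1
        rows.append(row)
    return rows

rows = hash_encode(["fwd", "rwd", "4wd", "fwd"])
```

The number of columns is fixed in advance, so unseen levels can still be encoded, at the price of possible collisions between different levels.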

**When to use**

This method is advantageous when the cardinality of the categorical variable is very high.

In [None]:
encoder = ce.hashing.HashingEncoder()
res = encode_var(df[["drive_wheels"]], encoder)
print_res(res)

# Supervised Encoding Methods / Bayesian Encoders

## Target / Mean Encoding

**Description**

Target encoding replaces each category with the average target value of all observations for that category, mixed with a prior.

For the case of *categorical target*: features are replaced with a blend of posterior probability of the target given particular categorical value and the prior probability of the target over all the training data.

For the case of *continuous target*: features are replaced with a blend of the expected value of the target given particular categorical value and the expected value of the target over all the training data.

**Construction**

The encoded value $x_k$ for category $k$ is calculated as follows:
$$x_k = x_p (1-s) + s \, \bar{y}_k$$
where $x_p$ is the prior calculated as the average value of the target over all observations, $\bar{y}_k$ is the average value of the target for level $k$, and $s$ is a smoothing weight calculated as:
$$s=\frac{1}{1+\exp\left(-\frac{n_k-mdl}{a}\right)}$$
where $n_k$ is the number of observations of level $k$, 'mdl' stands for 'min data in leaf', and $a$ is a regularization parameter.
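A minimal pandas sketch of this blend, assuming that the per-level target mean is what gets mixed with the prior and that the sigmoid weight depends on the level count. The function `target_encode` and its parameter names are hypothetical and only mirror the symbols used here.

```python
import numpy as np
import pandas as pd

def target_encode(x, y, min_data_in_leaf=1, a=1.0):
    """Blend each level's target mean with the global prior using a sigmoid weight."""
    prior = y.mean()
    stats = y.groupby(x).agg(["mean", "count"])
    # Sigmoid weight: close to 1 for well-populated levels, close to 0 for rare ones
    s = 1.0 / (1.0 + np.exp(-(stats["count"] - min_data_in_leaf) / a))
    level_values = prior * (1.0 - s) + s * stats["mean"]
    return x.map(level_values)

x = pd.Series(["a", "a", "a", "b"])
y = pd.Series([1.0, 2.0, 3.0, 10.0])
encoded = target_encode(x, y)
print(encoded)
```

Level `b` has a single observation, so its weight is exactly 0.5 and its encoded value sits halfway between its own target value and the prior.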

**When to use**

This encoding method places categories with a similar relation to the target close together, but the information it carries is limited to the relation between the categories and the target itself. The advantages of mean target encoding are that it does not add columns to the data and that it helps in faster learning. Target encoding is powerful in prediction tasks, but runs the risk of target leakage. The leakage can be controlled via regularization, data augmentation by adding noise to the encoding representation, or through double validation.

In [74]:
encoder = ce.target_encoder.TargetEncoder()
res = encode_var(df[["drive_wheels"]], encoder, df['price'])
print_res(res)

Unnamed: 0,original,drive_wheels
3,fwd,9244.779661
5,fwd,9244.779661
0,rwd,19757.613333
1,rwd,19757.613333
4,4wd,10243.702296
9,4wd,10243.702296


Unnamed: 0,original,drive_wheels
3,fwd,9244.779661
5,fwd,9244.779661
0,rwd,19757.613333
1,rwd,19757.613333
4,4wd,10241.0
9,4wd,10241.0


## James-Stein Encoding

**Description**

Target encoding based on the James-Stein mean estimator, rather than the sample mean. The idea is to improve the estimate of each category's mean target by shrinking it towards the overall mean.

**Construction**

$$x_k = (1-B)\,\bar{y}_k + B\,\bar{y}$$
where $\bar{y}_k$ is the mean target of level $k$, $\bar{y}$ is the overall mean target, and $B$ is a shrinkage parameter. A common value is
$$B=\frac{Var\left[y_k\right]}{Var\left[y_k\right]+Var\left[y\right]}$$
but the value could also be set via cross-validation. The intuition behind the equation is that if the mean estimate of a category is uncertain (high variance), then stronger shrinkage should be applied.
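A minimal sketch of the shrinkage idea, assuming the category mean is weighted by $1-B$ and the overall target mean by $B$, so that high-variance categories are pulled more strongly towards the overall mean. The helper `james_stein_encode` is hypothetical, and the treatment of single-observation levels is an assumption.

```python
import pandas as pd

def james_stein_encode(x, y):
    """Shrink each level's target mean towards the overall mean."""
    overall_mean, overall_var = y.mean(), y.var()
    stats = y.groupby(x).agg(["mean", "var"])
    # Assumption: levels with one observation (undefined variance) get B = 0.5
    level_var = stats["var"].fillna(overall_var)
    shrinkage = level_var / (level_var + overall_var)
    level_values = (1 - shrinkage) * stats["mean"] + shrinkage * overall_mean
    return x.map(level_values)

x = pd.Series(["a", "a", "a", "b", "b", "b"])
y = pd.Series([1.0, 2.0, 3.0, 10.0, 20.0, 30.0])
encoded = james_stein_encode(x, y)
print(encoded)
```

Level `b` has the larger within-level variance, so its encoded value moves further away from its sample mean than the value of level `a` does.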

**When to use**

Same as for target encoder.

In [83]:
encoder = ce.james_stein.JamesSteinEncoder()
res = encode_var(df[["drive_wheels"]], encoder, df['price'])
print_res(res)

Unnamed: 0,original,drive_wheels
3,fwd,9244.779661
5,fwd,9244.779661
0,rwd,19757.613333
1,rwd,19757.613333
4,4wd,10241.0
9,4wd,10241.0


## M-estimator Encoding

**Description**
 
A simplified version of the target encoder, as it has only one hyperparameter.

**Construction**

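$$x_k = \frac{n_k \bar{y}_k + m \times prior}{n_k + m}$$
where $m$ is a smoothing parameter, $n_k$ is the number of observations of level $k$, $\bar{y}_k$ is their average target value (so $n_k \bar{y}_k$ is the sum of the target within level $k$), and $prior$ is the average target value over all observations.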
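A short pandas sketch of an m-estimate, assuming the numerator holds the sum of the target within the level and the denominator the number of observations of the level. The helper `m_estimate_encode` is hypothetical.

```python
import pandas as pd

def m_estimate_encode(x, y, m=1.0):
    """Additively smoothed level means: m acts like m pseudo-observations at the prior."""
    prior = y.mean()
    stats = y.groupby(x).agg(["sum", "count"])
    level_values = (stats["sum"] + m * prior) / (stats["count"] + m)
    return x.map(level_values)

x = pd.Series(["a", "a", "a", "b"])
y = pd.Series([1.0, 2.0, 3.0, 10.0])
encoded = m_estimate_encode(x, y, m=1.0)
print(encoded.tolist())  # [2.5, 2.5, 2.5, 7.0]
```

Larger values of `m` pull every level closer to the prior; `m=0` reduces to the plain level mean.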

**When to use**

Same as for target encoder.

In [77]:
encoder = ce.m_estimate.MEstimateEncoder()
res = encode_var(df[["drive_wheels"]], encoder, df['price'])
print_res(res)

Unnamed: 0,original,drive_wheels
3,fwd,9278.076717
5,fwd,9278.076717
0,rwd,19671.422755
1,rwd,19671.422755
4,4wd,10570.569928
9,4wd,10570.569928


## Leave One Out Encoding

**Description**

The encoding is calculated for each observation $j$ as the average value of the target over all observations with the same level as observation $j$, excluding $j$ itself.

**Construction**

$$ x_k^{(j)} = \frac{\sum_{i\neq j} y_i \, \mathbf{1}[x_i = k]}{\sum_{i\neq j} \mathbf{1}[x_i = k]}$$
where $\mathbf{1}[x_i = k]$ is 1 if observation $i$ has level $k$ and 0 otherwise.
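The leave-one-out average can be computed without an explicit loop by subtracting each row's own target from the sum of its level. A minimal sketch on made-up data follows; the helper name `leave_one_out_encode` is hypothetical.

```python
import pandas as pd

def leave_one_out_encode(x, y):
    """Mean target of all other rows that share the row's level."""
    level_sum = y.groupby(x).transform("sum")
    level_count = y.groupby(x).transform("count")
    # Levels with a single observation give 0/0 and therefore NaN
    return (level_sum - y) / (level_count - 1)

x = pd.Series(["a", "a", "a", "b", "b"])
y = pd.Series([1.0, 2.0, 3.0, 10.0, 20.0])
encoded = leave_one_out_encode(x, y)
print(encoded.tolist())  # [2.5, 2.0, 1.5, 20.0, 10.0]
```

Unlike plain target encoding, rows of the same level receive different values during training, because each row excludes its own target.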

**When to use**

Same as for target encoder.

In [86]:
encoder = ce.leave_one_out.LeaveOneOutEncoder()
res = encode_var(df[["drive_wheels"]], encoder, df['price'])
print_res(res)

Unnamed: 0,original,drive_wheels
3,fwd,9244.779661
5,fwd,9244.779661
0,rwd,19757.613333
1,rwd,19757.613333
4,4wd,10241.0
9,4wd,10241.0


## Catboost Encoding

**Description**

CatBoost encoding is an improvement over the leave-one-out encoder. It is intended to overcome target leakage problems.

**Construction**

**When to use**

In [87]:
encoder = ce.cat_boost.CatBoostEncoder()
res = encode_var(df[["drive_wheels"]], encoder, df['price'])
print_res(res)

Unnamed: 0,original,drive_wheels
3,fwd,9278.076717
5,fwd,9278.076717
0,rwd,19671.422755
1,rwd,19671.422755
4,4wd,10570.569928
9,4wd,10570.569928


## Weight of Evidence Encoder

**Description**

The encoding is a measure of how strongly a level separates effect from no effect. The encoding can only be used for binary targets.

**Construction**

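The encoding is calculated from a modified odds ratio, in which the regularization term $a$ is intended to prevent target leakage:
$$ \begin{align}
numerator =& \ \frac{y_k^+ + a}{y^+ + 2a} \\
denominator =& \ \frac{y_k^- + a}{y^- + 2a} \\
x_k =& \ \log\left(\frac{numerator}{denominator}\right)
\end{align}
$$
where $y_k^+$ and $y_k^-$ are the numbers of positive and negative observations of level $k$, and $y^+$ and $y^-$ are the numbers of positive and negative observations overall.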
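A minimal pandas sketch of a regularized weight of evidence, assuming the numerator counts the positive and the denominator the negative observations of a level, each relative to the overall class count. The helper `woe_encode` is hypothetical.

```python
import numpy as np
import pandas as pd

def woe_encode(x, y, a=1.0):
    """Weight of evidence per level for a boolean target, with regularization `a`."""
    y = y.astype(int)
    positives = y.groupby(x).sum()
    negatives = y.groupby(x).count() - positives
    numerator = (positives + a) / (y.sum() + 2 * a)
    denominator = (negatives + a) / ((1 - y).sum() + 2 * a)
    return x.map(np.log(numerator / denominator))

x = pd.Series(["a", "a", "a", "b", "b", "b"])
y = pd.Series([True, True, False, True, False, False])
encoded = woe_encode(x, y)
print(encoded)
```

Positive values mark levels in which the positive class is over-represented, negative values mark levels in which it is under-represented.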

**When to use**

For binary targets.

In [23]:
encoder = ce.woe.WOEEncoder()
res = encode_var(df[["drive_wheels"]], encoder, df['fuel_type']=='gas')
print_res(res)

Unnamed: 0,original,drive_wheels
3,fwd,0.275848
5,fwd,0.275848
0,rwd,-0.435318
1,rwd,-0.435318
4,4wd,0.162519
9,4wd,0.162519


## Probability Ratio Encoding

**Description**

Probability Ratio Encoding is similar to Weight Of Evidence encoding, but the encoding is based on the ratio of probabilities of the positive to the negative class.

**Construction**

$$x_k = \frac{y_k^+/y_k}{y_k^-/y_k}=\frac{y_k^+}{y_k^-}$$
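This is closely related to WoE encoding with no regularization, i.e. $a=0$, which additionally normalizes by the overall class counts and takes the logarithm of the ratio.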
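The raw ratio of positive to negative counts per level can be sketched directly in pandas. The helper `probability_ratio_encode` is hypothetical; a level without negative observations yields `inf`.

```python
import pandas as pd

def probability_ratio_encode(x, y):
    """Ratio of positive to negative observations per level for a boolean target."""
    y = y.astype(int)
    positives = y.groupby(x).sum()
    negatives = y.groupby(x).count() - positives
    # Division by zero gives inf for levels that contain no negatives
    return x.map(positives / negatives)

x = pd.Series(["a", "a", "a", "b", "b", "b"])
y = pd.Series([True, True, False, True, False, False])
encoded = probability_ratio_encode(x, y)
print(encoded.tolist())  # [2.0, 2.0, 2.0, 0.5, 0.5, 0.5]
```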

**When to use**

Same as weight of evidence encoding.

In [25]:
encoder = ce.woe.WOEEncoder(regularization=0.)
res = encode_var(df[["drive_wheels"]], encoder, df['fuel_type']=='gas')
print_res(res)

Unnamed: 0,original,drive_wheels
3,fwd,0.287682
5,fwd,0.287682
0,rwd,-0.448132
1,rwd,-0.448132
4,4wd,inf
9,4wd,inf


# Other encodings

* DRACuLa
* Reverse Helmert
* Forward Differences
* Thermometer Encoder

# Cheat Sheet

![Categorical Encoding Methods Cheat-Sheet](./images/decision_encoding.png)