## Handling missing values

Replacing using kNN (k nearest neighbors)
- use similar observations
- predict missing values using known ones

Pros: May be pretty accurate  
Cons: Computationally expensive
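
A minimal sketch of kNN imputation in plain Python, to make the idea concrete. The function name `knn_impute`, the use of `None` as the missing-value marker, and the distance (root mean squared difference over the columns both rows have observed) are assumptions made for this illustration. In practice a library implementation such as scikit-learn's `KNNImputer` would normally be used.

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of that column over the k nearest rows."""
    def distance(a, b):
        # Compare only the columns that are observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]  # copy, the input is not modified
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows where column j is known.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            candidates.sort(key=lambda pair: pair[0])
            nearest = [v for _, v in candidates[:k]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

data = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 11.0],
    [5.0, 6.0, 50.0],
    [1.05, 2.05, None],  # missing value to fill
]
filled = knn_impute(data, k=2)
print(filled[3][2])
```

Every missing value triggers a distance computation against all other rows, which is where the computational cost comes from.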

__example__

>Suppose that you have an ordinal feature which takes the following values: NaN, 0, 1, 2, 3, 4. Suppose that you want to fill missing values, using a special category. Which value should you use if you don’t want to change feature values distribution?

__The feature value distribution will change anyway, whichever value is used__

## Basic encoding methods

### Label encoding

- Good for ordinal features.
- Bad for nominal features (introduces an artificial order).

### One-hot encoding

- Bad when there are many nominal features with many categories (results in too many columns).

### Frequency encoding

- Bad if frequencies are similar (categories with the same frequency become indistinguishable)
- Helps to merge rare categories into one
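
The three basic encodings above can be sketched on one toy feature. The `colors` data and all variable names are made up for illustration, and label encoding here simply uses alphabetical order, which is exactly the kind of artificial order that hurts nominal features.

```python
from collections import Counter

colors = ["red", "green", "blue", "green", "red", "green"]

# Label encoding: map each category to an integer (here in sorted order).
categories = sorted(set(colors))
label_map = {c: idx for idx, c in enumerate(categories)}
label_encoded = [label_map[c] for c in colors]

# One-hot encoding: one binary column per category.
one_hot = [[int(c == cat) for cat in categories] for c in colors]

# Frequency encoding: replace each category by its relative frequency.
counts = Counter(colors)
freq_encoded = [counts[c] / len(colors) for c in colors]

print(label_encoded)
print(one_hot[0])
print(freq_encoded)
```

Note that one-hot encoding already needs three columns for three categories, while the other two keep a single column.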

## Target encoding

What is the main idea of target encoding? Encode each category with the mean target value of the observations that belong to this category.

Feature $x_j$ takes values $U_j=\{u_1, \dots, u_m\}$

__For each category, estimate:__

$u_p \to s_p \cong \mathbb{E}(y | x=u_p)$ (regression)

$u_p \to s_p \cong \mathbb{P}(y = k | x=u_p)$ (classification, one value per class $k$)

__For each category, compute count:__

$count(j, u_p) = \sum^l_{i=1}[x_{ij}=u_p]$

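Here $j$ is the index of the attribute, $x_{ij}$ is the value of attribute $j$ for observation $i$, $l$ is the number of observations, and the bracket $[\cdot]$ equals 1 when the condition inside holds and 0 otherwise. So $count(j, u_p)$ is the number of observations whose attribute $j$ equals the category $u_p$.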

__For each category, compute target sum:__

- regression:

$target(j, u_p) = \sum^{l}_{i=1}[x_{ij}=u_p]y_i$

- classification:

$target_k(j, u_p) = \sum^{l}_{i=1}[x_{ij}=u_p][y_i=k]$

__Encode__

- regression:

$\widetilde{x_{ij}} = \frac{target(j, x_{ij})}{count(j,x_{ij})}$

- classification:

$\widetilde{x_{ij}} = (\frac{target_1(j, x_{ij})}{count(j,x_{ij})}, \dots, \frac{target_K(j, x_{ij})}{count(j,x_{ij})} )$

- Stores information about the target variable, so it may improve model performance significantly.
- __Leads to overfitting (target leakage)__

(basically, we calculated features from target)
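
The regression formulas above (count, target sum, ratio) can be sketched in a few lines. The function name `target_encode` and the toy `city`/`price` data are assumptions for this illustration.

```python
from collections import defaultdict

def target_encode(categories, targets):
    """Return {category: mean target}, i.e. target(j, u_p) / count(j, u_p)."""
    count = defaultdict(int)
    target_sum = defaultdict(float)
    for c, y in zip(categories, targets):
        count[c] += 1
        target_sum[c] += y
    return {c: target_sum[c] / count[c] for c in count}

city = ["A", "B", "A", "C", "B", "A"]
price = [10.0, 20.0, 14.0, 30.0, 22.0, 12.0]

encoding = target_encode(city, price)
encoded_feature = [encoding[c] for c in city]
print(encoding)
```

Category `C` appears only once, so its encoding is exactly its own target value, which is the leakage problem in its purest form. For classification, the same function could be called once per class $k$ with the indicator $[y_i = k]$ as the target.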

__example__

>Suppose that you are solving a regression task and you want to encode one categorical feature. Five observations in the data correspond to the category A and have the following target values: 2, 5, 6, 10, 12. What will be the number to encode the category A if you perform target encoding?

sum / count = 35/5 = 7

> Suppose that you have an ID feature, where each category appears only one time. Is it a good idea to use target encoding to encode it?

No - it will lead to target leakage. Due to the structure of this feature, you will simply get a target column as a new feature.

## Target encoding: modifications

__How to prevent overfitting?__

__Solution 1: add noise__

$\widetilde{x_{ij}} = \widetilde{x_{ij}} + \varepsilon$

- "Hides" the target signal by degrading the encoding
- Need to tune the amount of noise
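
A minimal sketch of the noise approach, assuming zero-mean Gaussian noise with a tunable standard deviation `sigma`. The function name `add_noise`, the choice of Gaussian noise, and the fixed seed are assumptions; the notes only specify that some $\varepsilon$ is added.

```python
import random

def add_noise(encoded, sigma, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma to each value."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [value + rng.gauss(0.0, sigma) for value in encoded]

# Target-encoded training feature (values made up for the example).
encoded = [12.0, 21.0, 12.0, 30.0, 21.0, 12.0]
noisy = add_noise(encoded, sigma=0.5)
print(noisy)
```

The noise is typically applied to the training rows only; `sigma` is the amount of noise that has to be tuned.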

__Solution 2: smoothing__

Use a weighted sum of the target encoding and the overall target mean

$\widehat{x_{ij}} = \lambda(n) \cdot \widetilde{x_{ij}} + (1 - \lambda(n)) \cdot y_{mean}$

- $n = count(j, x_{ij})$
- $\lambda(n)$ - monotonically increasing function, bounded between 0 and 1
- e.g. $\lambda(n) = \frac{1}{1+exp(-\frac{n-k}{f})}$

$n \to \infty: \lambda(n) \to 1, \widehat{x_{ij}} \to \widetilde{x_{ij}}$

$n \to 0: \lambda(n) \to 0, \widehat{x_{ij}} \to y_{mean}$

If a category is frequent, $\lambda(n)$ is close to 1 and we use the regular target encoding almost unchanged.  
If a category is rare, its count goes to zero, $\lambda(n)$ goes to zero, and the encoding approaches the average target over the whole dataset.  
This matters because rare categories suffer most from overfitting.
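
The smoothing formula with the sigmoid weight can be sketched as follows. The function names and the default values `k=5.0`, `f=1.0` are assumptions chosen for the example, not recommended settings.

```python
import math
from collections import defaultdict

def smoothing_weight(n, k=5.0, f=1.0):
    """lambda(n) = 1 / (1 + exp(-(n - k) / f))."""
    return 1.0 / (1.0 + math.exp(-(n - k) / f))

def smoothed_target_encode(categories, targets, k=5.0, f=1.0):
    count = defaultdict(int)
    target_sum = defaultdict(float)
    for c, y in zip(categories, targets):
        count[c] += 1
        target_sum[c] += y
    y_mean = sum(targets) / len(targets)
    encoding = {}
    for c, n in count.items():
        lam = smoothing_weight(n, k, f)
        # Blend the category mean with the global mean.
        encoding[c] = lam * target_sum[c] / n + (1.0 - lam) * y_mean
    return encoding

# One frequent category and one category seen a single time.
categories = ["common"] * 20 + ["rare"]
targets = [1.0] * 20 + [100.0]
encoding = smoothed_target_encode(categories, targets)
print(encoding)
```

The frequent category keeps essentially its own mean, while the single `rare` observation with target 100 is pulled most of the way towards the global mean instead of being encoded as 100.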

__Solution 3: cross validation__

Compute $\widetilde{x_{ij}}$ for the observations of each fold from the targets of the other folds. This gives k different encoders for k folds.

For test data, encode it with each of the k encoders, compute the model output (e.g. class probabilities) for each encoded version, and average these outputs to obtain the final prediction.
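
A minimal sketch of the out-of-fold encoding for the training rows. The round-robin fold assignment, the function name `kfold_target_encode`, and the fallback to the mean of the fitting rows for categories unseen in the other folds are all assumptions made for this illustration; the test-time averaging described above is not shown.

```python
from collections import defaultdict

def kfold_target_encode(categories, targets, n_folds=3):
    """Encode each row using target means computed on the other folds only."""
    n = len(categories)
    fold_of = [i % n_folds for i in range(n)]  # simple deterministic split
    encoded = [None] * n
    for fold in range(n_folds):
        count = defaultdict(int)
        target_sum = defaultdict(float)
        # Fit the encoder on every row outside the current fold.
        fit_rows = [i for i in range(n) if fold_of[i] != fold]
        for i in fit_rows:
            count[categories[i]] += 1
            target_sum[categories[i]] += targets[i]
        fit_mean = sum(targets[i] for i in fit_rows) / len(fit_rows)
        # Apply it to the rows inside the current fold.
        for i in range(n):
            if fold_of[i] == fold:
                c = categories[i]
                # Assumed fallback for categories not seen in the other folds.
                encoded[i] = target_sum[c] / count[c] if count[c] else fit_mean
    return encoded

categories = ["A", "A", "A", "B", "B", "B"]
targets = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
encoded = kfold_target_encode(categories, targets, n_folds=3)
print(encoded)
```

No row's own target contributes to its encoding, which is what removes the direct leakage. The sketch needs at least two folds.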

## Advanced encoding methods: M-estimate, Leave-One-Out

__M-estimate encoding:__

Use a specific function for smoothing target encoding

$\lambda(n) = \frac{n}{m+n}$

$\widehat{x_{ij}} = \frac{target(j, x_{ij}) + m * y_{mean}}{count(j, x_{ij}) + m}$
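
The M-estimate formula can be checked with a short sketch. The function name `m_estimate_encode`, the toy data, and `m=2.0` are assumptions for the example; the last lines verify that the closed form agrees with the general smoothing formula using $\lambda(n) = \frac{n}{m+n}$.

```python
from collections import defaultdict

def m_estimate_encode(categories, targets, m=2.0):
    """(target sum + m * global mean) / (count + m) for every category."""
    count = defaultdict(int)
    target_sum = defaultdict(float)
    for c, y in zip(categories, targets):
        count[c] += 1
        target_sum[c] += y
    y_mean = sum(targets) / len(targets)
    return {c: (target_sum[c] + m * y_mean) / (count[c] + m) for c in count}

categories = ["A", "A", "A", "A", "B"]
targets = [1.0, 1.0, 1.0, 1.0, 6.0]
encoding = m_estimate_encode(categories, targets, m=2.0)

# Same value for category B via lambda(n) * mean_B + (1 - lambda(n)) * y_mean.
n, m, mean_b, y_mean = 1, 2.0, 6.0, 2.0
lam = n / (m + n)
via_lambda = lam * mean_b + (1 - lam) * y_mean
print(encoding, via_lambda)
```

A larger `m` pulls every category harder towards the global mean; `m = 0` gives plain target encoding.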

__Leave-one-out encoding:__

Here we compute the count of the category not on the whole dataset but excluding the current observation; the target sum is computed the same way.

How do we represent the category when it appears in a test observation? We average all training encoding values of this category and use the average as its representation for the test observation.

$count(i, j, u_p) = \sum^l_{k=1}[x_{kj} = u_p, k\ne i]$

$target(i, j, u_p) = \sum^l_{k=1}[x_{kj} = u_p, k\ne i]y_k$

$\widetilde{x_{ij}} = \frac{target(i,j,x_{ij})}{count(i,j,x_{ij})}$

$\widetilde{x_{ij}^{test}} = \frac{\sum^l_{k=1}[x_{kj} = x_{ij}^{test}] \widetilde{x_{kj}}}{count(j, x_{ij}^{test})}$

- Helps to deal with outliers
- Induces shift for rare categories
- __Actually it does not prevent target leakage__
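
A minimal sketch of leave-one-out encoding for training rows plus the averaged value used for test rows. The function name and the fallback to the global mean for categories with a single observation (where the leave-one-out count is zero) are assumptions for this illustration.

```python
from collections import defaultdict

def leave_one_out_encode(categories, targets):
    count = defaultdict(int)
    target_sum = defaultdict(float)
    for c, y in zip(categories, targets):
        count[c] += 1
        target_sum[c] += y
    global_mean = sum(targets) / len(targets)

    # Training rows: category mean computed without the row's own target.
    train_encoded = []
    for c, y in zip(categories, targets):
        n_others = count[c] - 1
        if n_others > 0:
            train_encoded.append((target_sum[c] - y) / n_others)
        else:
            train_encoded.append(global_mean)  # assumed fallback for singletons

    # Test rows: average of the training encodings of the same category.
    encoded_sum = defaultdict(float)
    for c, e in zip(categories, train_encoded):
        encoded_sum[c] += e
    test_encoding = {c: encoded_sum[c] / count[c] for c in count}
    return train_encoded, test_encoding

categories = ["A", "A", "A", "B", "B"]
targets = [1.0, 2.0, 6.0, 10.0, 20.0]
train_encoded, test_encoding = leave_one_out_encode(categories, targets)
print(train_encoded, test_encoding)
```

For a category with at least two observations, the average of its leave-one-out values equals the plain category mean, so test rows effectively receive the ordinary target encoding.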

__Leave-one-out encoding: overfitting__

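For $x_{ij} = u_p$: $target(j, u_p) = \sum^l_{k=1}[x_{kj} = u_p]y_k = target(i,j,u_p) + y_i$

So the leave-one-out encoding is $\widetilde{x_{ij}} = \frac{target(j, x_{ij}) - y_i}{count(i,j,x_{ij})}$, which still depends on $y_i$.

Specify $t=\frac{target(j, x_{ij}) - 0.5}{count(i,j,x_{ij})}$ (threshold, e.g. for a binary target)

- $y_i \ge 0.5 \iff \widetilde{x_{ij}} \le t$
- $y_i < 0.5 \iff \widetilde{x_{ij}} > t$

The threshold depends only on the category, so a model can recover $y_i$ from the encoded feature.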
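
The threshold argument can be verified numerically. This sketch assumes a binary target and at least two observations per category; the data and names are made up for the example.

```python
from collections import defaultdict

categories = ["A"] * 5 + ["B"] * 4
targets = [1, 0, 1, 1, 0, 0, 0, 1, 0]

count = defaultdict(int)
total = defaultdict(float)
for c, y in zip(categories, targets):
    count[c] += 1
    total[c] += y

# Leave-one-out encoding of every training row.
encoded = [(total[c] - y) / (count[c] - 1) for c, y in zip(categories, targets)]

# Recover the target from the encoded feature with a per-category threshold.
recovered = []
for c, e in zip(categories, encoded):
    t = (total[c] - 0.5) / (count[c] - 1)
    recovered.append(1 if e <= t else 0)

print(recovered == targets)
```

Within each category the encoded feature takes only two values, one per class, so a single split per category separates the classes perfectly on the training data.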

## Advanced encoding methods: Catboost, WoE

__Catboost encoding:__

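$count(i, j, u_p) = \sum^l_{k=1}[x_{kj} = u_p, k < i]$

$target(i, j, u_p) = \sum^l_{k=1}[x_{kj} = u_p, k < i]y_k$

$\widetilde{x_{ij}} = \frac{target(i,j,x_{ij})}{count(i,j,x_{ij})}$

Only observations that precede observation $i$ are used, so $y_i$ never enters its own encoding. For the first occurrence of a category the count is zero, so in practice a prior value is added to the numerator and the denominator.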

$\widetilde{x_{ij}^{test}} = \frac{\sum^l_{k=1}[x_{kj} = x_{ij}^{test}] \widetilde{x_{kj}}}{count(j, x_{ij}^{test})}$

Requires shuffling the data several times, because the encoding depends on the row order. Probably the encodings computed for the different permutations are averaged, which prevents some of the target leakage.
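
A minimal sketch of an ordered encoding in the spirit of the formulas above, for a single pass over the data. It makes two assumptions that should be read as such: each row uses only strictly earlier rows, and a prior (here the global target mean with weight 1) is added so that the first occurrence of a category is defined. The function name and parameters are hypothetical and are not the CatBoost API.

```python
from collections import defaultdict

def ordered_target_encode(categories, targets, prior, prior_weight=1.0):
    """Encode row i from the targets of earlier rows with the same category."""
    count = defaultdict(int)
    target_sum = defaultdict(float)
    encoded = []
    for c, y in zip(categories, targets):
        # Statistics so far exclude the current row, so y is not leaked.
        encoded.append((target_sum[c] + prior_weight * prior) / (count[c] + prior_weight))
        count[c] += 1
        target_sum[c] += y
    return encoded

categories = ["A", "B", "A", "A", "B"]
targets = [1.0, 0.0, 0.0, 1.0, 1.0]
prior = sum(targets) / len(targets)  # assumed prior: global target mean
encoded = ordered_target_encode(categories, targets, prior)
print(encoded)
```

Early rows are encoded from very little data, which is one reason to repeat the procedure over several random permutations of the rows.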

__WoE encoding__

Weight of Evidence encoding (for binary classification):
- Binary classification: bad(0) vs good(1)
- $\mathbb{P}(x_{ij} | y=1) = \frac{count(y=1 | x_{ij})}{count(y=1)}$
- $\mathbb{P}(x_{ij} | y=0) = \frac{count(y=0 | x_{ij})}{count(y=0)}$
- $\widetilde{x_{ij}} = \ln(\frac{\mathbb{P}(x_{ij} | y=1)}{\mathbb{P}(x_{ij} | y=0)}) * 100$

The idea is:

- $\widetilde{x_{ij}} = 0 \iff \mathbb{P}(x_{ij} | y=1) = \mathbb{P}(x_{ij} | y=0)$: random (the category doesn't have much value)
- $\widetilde{x_{ij}} > 0 \iff \mathbb{P}(x_{ij} | y=1) > \mathbb{P}(x_{ij} | y=0)$: goods
- $\widetilde{x_{ij}} < 0 \iff \mathbb{P}(x_{ij} | y=1) < \mathbb{P}(x_{ij} | y=0)$: bads

Smoothing:

$\mathbb{P}(x_{ij} | y=1) = \frac{count(y=1 | x_{ij}) + \alpha}{count(y=1) + 2\alpha}$

$\mathbb{P}(x_{ij} | y=0) = \frac{count(y=0 | x_{ij}) + \alpha}{count(y=0) + 2\alpha}$
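
The smoothed WoE formulas can be sketched as follows. The function name `woe_encode`, the default `alpha=0.5`, and the toy data are assumptions for this illustration; the factor 100 follows the formula above.

```python
import math
from collections import defaultdict

def woe_encode(categories, targets, alpha=0.5):
    """Smoothed Weight of Evidence per category for a 0/1 target."""
    good = defaultdict(int)  # count(y=1 | category)
    bad = defaultdict(int)   # count(y=0 | category)
    for c, y in zip(categories, targets):
        if y == 1:
            good[c] += 1
        else:
            bad[c] += 1
    n_good = sum(good.values())
    n_bad = sum(bad.values())
    encoding = {}
    for c in set(categories):
        p_good = (good[c] + alpha) / (n_good + 2 * alpha)
        p_bad = (bad[c] + alpha) / (n_bad + 2 * alpha)
        encoding[c] = math.log(p_good / p_bad) * 100
    return encoding

categories = ["A", "A", "A", "B", "B", "B", "C", "C"]
targets = [1, 1, 1, 0, 0, 0, 1, 0]
encoding = woe_encode(categories, targets)
print(encoding)
```

Category `A` contains only goods and gets a positive value, `B` contains only bads and gets a negative value, and `C` is balanced and gets 0. Without smoothing (`alpha = 0`) the all-good and all-bad categories would require the logarithm of zero or a division by zero.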