# Week 3 notes: Metrics optimization, advanced feature engineering


# Metrics optimization

Metrics: Why are there so many? There are many ways to evaluate the quality of a model, and each problem needs its own measure of quality. Example: A company will usually decide for itself which aspects of its sales data matter most. 

If your model is scored with some metric, you should optimize exactly that metric to get the best results. 

## Regression metrics review

Notation for this lesson:

- $N$ - number of objects  
- $y \in R^N$ - target values  
- $\hat{y} \in R^N$ - predictions  
- $\hat{y}_i \in R$ - prediction for i-th object  
- $y_i \in R$ - target for i-th object

Most common metric: **MSE Mean Square Error**

MSE = $\frac{1}{N} \sum_{i=1}^{N}(y_i-\hat{y}_i)^2$  

For a constant prediction $\alpha$: MSE = $\frac{1}{N} \sum_{i=1}^{N}(y_i-\alpha)^2$.   
The best constant predictor, the $\alpha$ that minimizes this, is the target mean.  

Grid search: Vary $\alpha$ and see how MSE varies. Find $\alpha$ with lowest MSE.  
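A minimal sketch of that grid search, using made-up target values. The names `mse_constant`, `alphas`, and `best_alpha` are only for illustration.

```python
import numpy as np

# Hypothetical target values, chosen only for illustration
y = np.array([1.0, 2.0, 3.0, 4.0, 10.0])

def mse_constant(y, alpha):
    """MSE when every object is predicted with the same constant alpha."""
    return np.mean((y - alpha) ** 2)

# Grid search: evaluate MSE on a grid of candidate constants (step 0.01)
alphas = np.linspace(y.min(), y.max(), 901)
scores = np.array([mse_constant(y, a) for a in alphas])
best_alpha = alphas[np.argmin(scores)]

print(best_alpha, y.mean())  # the grid minimum lands on the target mean
```
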

To optimize: Most libraries support MSE as a loss out of the box, so all you have to do is select it. An example on GitHub will show how it's set up for a given library. 

Another metric: **RMSE Root Mean Square Error**

The square root is introduced to put the errors on the same scale as the target, which makes the score easier to interpret. 

RMSE = $\sqrt{\frac{1}{N} \sum_{i=1}^{N}(y_i-\hat{y}_i)^2}$  

Similar to MSE:  
* Minimizers: every minimizer of MSE is also a minimizer of RMSE, because the square root is monotone  
* If our target metric is RMSE, we can still compare models by MSE or optimize MSE  
* MSE is often easier to work with  

Differences from MSE:  
* Look at the gradient: Gradient of RMSE is the gradient of MSE times a value  
* This means that traveling along the RMSE gradient is the same as traveling along the MSE gradient but with a different flow rate. The flow rate depends on the MSE score itself. Thus, the two are not immediately interchangeable for gradient-based methods; you may need to adjust parameters such as the learning rate.  
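The value in question follows from the chain rule:

$$\frac{\partial\,\text{RMSE}}{\partial \hat{y}_i} = \frac{\partial \sqrt{\text{MSE}}}{\partial \hat{y}_i} = \frac{1}{2\sqrt{\text{MSE}}}\,\frac{\partial\,\text{MSE}}{\partial \hat{y}_i}$$

The factor $\frac{1}{2\sqrt{\text{MSE}}}$ is positive and the same for every object, so the direction is unchanged. Only the step length scales, and it grows as the MSE shrinks.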

Good for comparing model performance: **$R^2$**

It's hard to determine if the model is good or not by looking at MSE or RMSE. We want to look at how much better the model is than the baseline. For this reason, $R^2$ is often used.

$R^2 = 1-\frac{\frac{1}{N} \sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{\frac{1}{N} \sum_{i=1}^{N}(y_i-\bar{y})^2} = 1-\frac{MSE}{\frac{1}{N} \sum_{i=1}^{N}(y_i-\bar{y})^2}$

To optimize $R^2$, we can optimize MSE. This is because $R^2$ is basically MSE divided by a constant and subtracted from another constant. These constants don't matter for optimization. 
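A small sketch of $R^2$ computed from MSE, with made-up targets. The function name `r2_from_mse` is hypothetical. It shows the two reference points: 1 for perfect predictions and 0 for the constant-mean baseline.

```python
import numpy as np

def r2_from_mse(y, y_pred):
    """R^2 as 1 - MSE / (MSE of the constant-mean baseline)."""
    mse = np.mean((y - y_pred) ** 2)
    baseline_mse = np.mean((y - y.mean()) ** 2)
    return 1.0 - mse / baseline_mse

# Hypothetical targets
y = np.array([3.0, 5.0, 7.0, 9.0])

r2_perfect = r2_from_mse(y, y)                            # perfect predictions
r2_baseline = r2_from_mse(y, np.full_like(y, y.mean()))   # always predict the mean
print(r2_perfect, r2_baseline)
```
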

Another metric: **MAE Mean Absolute Error**

The average of absolute differences between target values and the predictions.  
$MAE = \frac{1}{N}\sum_{i=1}^N|y_i-\hat{y}_i|$

It penalizes huge errors less than MSE does, so it is less sensitive to outliers. Best constant predictor is the target median.
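A quick check of the median claim on made-up data with one outlier. The helper `mae_constant` is hypothetical; the grid search is the same idea as for MSE.

```python
import numpy as np

# Hypothetical targets with one outlier
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

def mae_constant(y, alpha):
    """MAE when every object is predicted with the same constant alpha."""
    return np.mean(np.abs(y - alpha))

alphas = np.linspace(0.0, 100.0, 10001)  # step 0.01
mae_scores = np.array([mae_constant(y, a) for a in alphas])
best_mae_alpha = alphas[np.argmin(mae_scores)]

# The MAE minimizer sits at the median; the mean is dragged toward the outlier
print(best_mae_alpha, np.median(y), y.mean())
```
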

Different applications. MAE is widely used in finance, where a 10 dollar error is twice as bad as a 5 dollar error. (Compared to MSE which would make it four times as bad)

To implement: MAE is not available as a loss in every model, but you can use similar losses such as Huber loss or quantile loss.

**MAE vs. MSE**

* Do you have outliers in the data? Use MAE  
* Are you sure they are outliers? Use MAE  
* Or are they just unexpected values we should still care about? Use MSE  

(outliers result from measurement error, mistakes, etc.)

Metrics with relative error: **(R)MSPE, MAPE, (R)MSLE**

(R)MSPE: Like a weighted version of MSE  
MSPE = $\frac{100\%}{N}\sum_{i=1}^N( \frac{y_i-\hat{y}_i}{y_i}) ^2$  
Best constant prediction is the weighted mean of the target values  

MAPE: Like a weighted version of MAE  
MAPE = $\frac{100\%}{N}\sum_{i=1}^N |\frac{y_i-\hat{y}_i}{y_i}|$  
Best constant prediction is the weighted median of the target values  

(R)MSLE: Root mean square logarithmic error, RMSE calculated on a logarithmic scale.  
RMSLE = $\sqrt{ \frac{1}{N}\sum_{i=1}^N \left(\log(y_i+1)-\log(\hat{y}_i+1)\right)^2 }$  
Take log of target values and predictions and calculate RMSE between them.  
Error curves are asymmetric.  
Best constant prediction is target mean in log space
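A sketch of two of these best constants on made-up positive targets. For MSPE, the weights $1/y_i^2$ follow from rewriting the loss as a weighted MSE; the function and variable names are hypothetical.

```python
import numpy as np

# Hypothetical positive targets
y = np.array([1.0, 2.0, 4.0, 8.0])

def mspe_constant(y, alpha):
    return 100.0 * np.mean(((y - alpha) / y) ** 2)

def rmsle_constant(y, alpha):
    return np.sqrt(np.mean((np.log1p(y) - np.log1p(alpha)) ** 2))

# MSPE: weighted mean of the targets with weights 1 / y_i^2
w = 1.0 / y ** 2
best_mspe = np.sum(w * y) / np.sum(w)

# RMSLE: mean in log space, mapped back to the original scale
best_rmsle = np.expm1(np.mean(np.log1p(y)))

# Both constants are pulled below the plain mean, toward the small targets
print(best_mspe, best_rmsle, y.mean())
```
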

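To implement: For (R)MSPE and MAPE, pass per-sample weights (for example the `sample_weight` argument in scikit-learn) and optimize MSE or MAE; the weights are proportional to $1/y_i^2$ for MSPE and $1/|y_i|$ for MAPE. Or, resample the training set with these weights. For RMSLE, transform the target with $\log(y+1)$, fit the model with MSE loss, and then transform the predictions back with $\exp(\hat{z})-1$. 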
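A sketch of the RMSLE recipe, assuming a plain least-squares linear model stands in for "a model with MSE loss". The data is synthetic and every name here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a positive target that grows exponentially with one feature
x = rng.uniform(0.0, 3.0, size=200)
y = np.exp(1.0 + 0.8 * x + rng.normal(0.0, 0.1, size=200))

def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept

# 1. transform the target, 2. fit with MSE loss (ordinary least squares)
z = np.log1p(y)
coef, *_ = np.linalg.lstsq(X, z, rcond=None)

# 3. map the predictions back to the original scale
y_pred_log = np.expm1(X @ coef)

# For comparison: the same model fitted with MSE on the raw target
coef_raw, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred_raw = np.clip(X @ coef_raw, 0.0, None)  # keep log1p defined

print(rmsle(y, y_pred_log), rmsle(y, y_pred_raw))
```

On this data the log-space fit should score the lower RMSLE of the two, since it optimizes that metric directly.
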

## Classification metrics review

Notation:

$N$- number of objects  
$L$- number of classes  
$y$ - ground truth  
$\hat{y}$ - predictions  
$[a=b]$ - indicator function (1 if $a=b$, else 0)  
"soft predictions" - classifier's scores  
"hard predictions" - a function of soft predictions, ex: a thresholding function

**Accuracy score**: How frequently our class prediction is correct.  
Accuracy = $\frac{1}{N}\sum_{i=1}^N [\hat{y}_i = y_i ]$  
The best constant is to predict the most frequent class. 
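A short illustration with made-up labels: accuracy as the mean of the indicator, plus the accuracy of the best constant prediction.

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1, 2])       # hypothetical ground truth
y_pred = np.array([0, 0, 1, 1, 1, 0])  # hypothetical hard predictions

# Accuracy is the mean of the indicator [y_pred_i = y_i]
accuracy = np.mean(y_pred == y)

# Best constant: always predict the most frequent class
classes, counts = np.unique(y, return_counts=True)
most_frequent = classes[np.argmax(counts)]
constant_accuracy = np.mean(y == most_frequent)

print(accuracy, most_frequent, constant_accuracy)
```
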

**Logarithmic loss (logloss)**: Pushes the classifier to output posterior probabilities of the objects belonging to each class.  
(Binary) LogLoss = $-\frac{1}{N}\sum_{i=1}^N \left[ y_i \log (\hat{y}_i) + (1-y_i)\log (1-\hat{y}_i) \right]$  
(Multiclass) LogLoss = $-\frac{1}{N}\sum_{i=1}^N\sum_{l=1}^L y_{il} \log(\hat{y}_{il})$  
The elements $\hat{y}_{il}$ are the predicted probabilities of object $i$ belonging to class $l$. Each $\hat{y}_i$ is a vector of length $L$ that sums to 1.   
Logloss heavily penalizes confidently wrong answers: it prefers many small mistakes to one severe mistake.
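A minimal NumPy sketch of both versions, with made-up predictions. The clipping constant `eps` is an assumption added to avoid `log(0)`, and the function names are hypothetical.

```python
import numpy as np

def binary_logloss(y, p, eps=1e-15):
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def multiclass_logloss(Y, P, eps=1e-15):
    """Y is one-hot (N x L); each row of P holds the L class probabilities."""
    P = np.clip(P, eps, 1.0)
    return -np.mean(np.sum(Y * np.log(P), axis=1))

y = np.array([1, 0, 1, 1])
many_small = np.array([0.7, 0.3, 0.7, 0.7])    # four mild mistakes
one_severe = np.array([1.0, 0.0, 1.0, 0.001])  # three perfect, one confidently wrong

# The single severe mistake costs more than four small ones
print(binary_logloss(y, many_small), binary_logloss(y, one_severe))
```
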

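In [9]:
import numpy as np
fx = 0.5  # predicted probability
yi = 0.5  # target value (a hard label would be 0 or 1)
# logloss of a single object: natural log, with the leading minus sign
-(yi*np.log(fx)+(1-yi)*np.log(1-fx))

0.6931471805599453

In [10]:
-np.log(0.5)

0.6931471805599453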

To implement: Most libraries implement logloss; you just need to figure out which arguments to pass. 

**Area under curve (AUC ROC)**  
- only for binary tasks  
- Depends on ordering of the predictions, not on absolute values  
- Performance measured by area under ROC curve  
- Any constant prediction scores 0.5, the same as random guessing, so there is no best constant  
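One way to see the dependence on ordering is the pairwise view of AUC: the fraction of (positive, negative) pairs in which the positive object gets the higher score. A sketch with made-up scores; the function name is hypothetical.

```python
import numpy as np

def auc_by_pairs(y, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = scores[y == 1]
    neg = scores[y == 0]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

y = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

auc = auc_by_pairs(y, scores)
auc_rescaled = auc_by_pairs(y, scores ** 3)      # monotone transform, same ordering
auc_constant = auc_by_pairs(y, np.full(6, 0.5))  # constant prediction, all ties

print(auc, auc_rescaled, auc_constant)
```
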

**Cohen's Kappa motivation**

* For accuracy of 1, it returns 1  
* For baseline prediction, it returns 0  
* This way you can't get misleadingly high scores for just guessing the baseline prediction  

Cohen's Kappa = $1 - \frac{1-\text{accuracy}}{1-p_e}$

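* $p_e$: what the accuracy would be on average if we randomly permuted our predictions  
* $p_e = \frac{1}{N^2}\sum_k n_{k1}n_{k2}$, where $n_{k1}$ and $n_{k2}$ count how often class $k$ appears in the predictions and in the targets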

Cohen's Kappa = $1 - \frac{\text{error}}{\text{baseline error}}$
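A sketch of the formula, assuming $n_{k1}$ and $n_{k2}$ are the per-class counts in the predictions and in the targets. The labels are made up so that the baseline prediction reaches 90% accuracy and still scores 0.

```python
import numpy as np

def cohen_kappa(y, y_pred, n_classes):
    N = len(y)
    accuracy = np.mean(y == y_pred)
    # Per-class counts in the predictions and in the targets
    n_pred = np.bincount(y_pred, minlength=n_classes)
    n_true = np.bincount(y, minlength=n_classes)
    p_e = np.sum(n_pred * n_true) / N ** 2
    return 1.0 - (1.0 - accuracy) / (1.0 - p_e)

y = np.array([0] * 9 + [1])            # hypothetical labels, 90% class 0
always_zero = np.zeros(10, dtype=int)  # baseline: always predict class 0

kappa_baseline = cohen_kappa(y, always_zero, n_classes=2)
kappa_perfect = cohen_kappa(y, y, n_classes=2)
print(kappa_baseline, kappa_perfect)
```
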

*Weighted Kappa*

For weighted error, multiply confusion matrix $C$ and weight matrix $W$ element-wise and sum the products. Weighted error = $\frac{1}{\text{const}}\sum_{i,j} C_{i,j} W_{i,j}$  
Weighted kappa = $1 - \frac{\text{weighted error}}{\text{weighted baseline error}}$
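A sketch of weighted kappa for ordered classes. The linear and quadratic weight matrices are a common choice but an assumption here, since these notes don't fix $W$. The normalizing constant cancels in the ratio, so it is omitted.

```python
import numpy as np

def weighted_kappa(y, y_pred, n_classes, weights="quadratic"):
    # Confusion matrix C[i, j]: objects of true class i predicted as class j
    C = np.zeros((n_classes, n_classes))
    np.add.at(C, (y, y_pred), 1)

    # Weight matrix W: zero on the diagonal, growing with the class distance
    idx = np.arange(n_classes)
    dist = np.abs(idx[:, None] - idx[None, :])
    W = dist ** 2 if weights == "quadratic" else dist

    # Expected confusion matrix if the predictions were randomly permuted
    E = np.outer(C.sum(axis=1), C.sum(axis=0)) / C.sum()

    return 1.0 - np.sum(C * W) / np.sum(E * W)

y = np.array([0, 0, 1, 1, 2, 2])
near_miss = np.array([0, 1, 1, 1, 2, 2])  # one error, off by one class
far_miss = np.array([0, 2, 1, 1, 2, 2])   # one error, off by two classes

print(weighted_kappa(y, near_miss, 3), weighted_kappa(y, far_miss, 3))
```

Both predictions have the same accuracy, but the far miss is penalized more.
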

## General approaches for metrics optimization

* *Target metric* is what we want to optimize  
* *Optimization loss* is what the model optimizes  
* It's easy to define a custom loss function
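As an illustration of a custom loss: gradient boosting libraries typically ask for the first and second derivatives of the loss with respect to the raw prediction. The sketch below computes them for binary logloss in plain NumPy. The function names are hypothetical, and the exact callback signature a given library expects is not shown.

```python
import numpy as np

def logloss_objective(raw_scores, labels):
    """Gradient and hessian of binary logloss w.r.t. the raw (pre-sigmoid) scores."""
    p = 1.0 / (1.0 + np.exp(-raw_scores))
    grad = p - labels
    hess = p * (1.0 - p)
    return grad, hess

def logloss_sum(raw_scores, labels):
    p = 1.0 / (1.0 + np.exp(-raw_scores))
    return -np.sum(labels * np.log(p) + (1 - labels) * np.log(1 - p))

raw = np.array([-1.0, 0.5, 2.0])
labels = np.array([0.0, 1.0, 1.0])
grad, hess = logloss_objective(raw, labels)

# Sanity check: compare the analytic gradient with a central finite difference
h = 1e-6
numeric_grad = np.array([
    (logloss_sum(raw + h * e, labels) - logloss_sum(raw - h * e, labels)) / (2 * h)
    for e in np.eye(3)
])
print(grad, numeric_grad)
```
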

# Mean Encoding
 
a.k.a "target encoding" or "likelihood encoding"  

General idea: Add new variables based on some feature related to target. The simplest case is to encode each level of categorical variables with the corresponding target mean.  

Compared with label encoding: Instead of assigning an arbitrary label, such as "0,1,2,...", to each level, encode each level of a categorical feature with the corresponding target mean. Ex: If the target values for all rows with city "Moscow" are (0,1,1,0,0), you would encode "Moscow" with the target mean 0.4. It's an easy concept, though there are pitfalls to be wary of.  
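The Moscow example in pandas, as a naive sketch with no regularization. The second city, "Tver", and its targets are made up to give the column more than one level.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Moscow"] * 5 + ["Tver"] * 3,  # "Tver" is a made-up second level
    "target": [0, 1, 1, 0, 0, 1, 1, 0],
})

# Target mean per category, mapped back onto every row
city_means = df.groupby("city")["target"].mean()
df["city_mean_target"] = df["city"].map(city_means)

print(city_means)
```

Each row's encoding here includes its own target value, which is the source of the overfitting that regularization has to address.
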

Mean encoding helps to separate 0s from 1s, so classes look far more separable. In general, the more complicated and nonlinear the feature-target dependency is, the more effective mean encoding is.  

Datasets that could benefit from mean encoding often have categorical variables with a lot of levels. Regression tasks are more flexible for mean encoding than classification.    

Ways to use target variable:

* Likelihood = (goods)/(goods+bads) = mean(target)  
* Weight of evidence = ln(goods/bads) x 100  
* Count = Goods = sum(target)  
* Diff = Goods - Bads  

These can't be used as-is. We need to deal with overfitting first. Mean encodings require some kind of regularization.
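For a single category, the four quantities listed above can be computed as follows. This sketch applies no regularization and reuses the Moscow targets (0,1,1,0,0), treating 1 as "good" and 0 as "bad".

```python
import numpy as np

target = np.array([0, 1, 1, 0, 0])  # targets of one category

goods = int(target.sum())
bads = len(target) - goods

likelihood = goods / (goods + bads)  # same as target.mean()
woe = np.log(goods / bads) * 100     # undefined if goods or bads is 0
count = goods
diff = goods - bads

print(likelihood, woe, count, diff)
```
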

# Regularization

Four kinds of regularization:  

**1. CV (cross-validation) loop inside training data**  
* Separate the data into k folds; usually 4-5 are enough  
* To get mean encoding values for some fold, we don't use data points from that fold; we estimate the encoding only on the rest of the data  
* Iteratively walk through all of the data subsets  
    
Take a look at their example:

![CV loop](fig/reg_CV_loop_snippet.png)

- `y_tr` is the target values in a Series  
- `skr` generates the 5 folds. Each is a pair of indices for the `tr` and `val` arrays    
- The loop iterates through the 5 folds. Within each fold, loop through columns  
    - First, generate `X_tr`, used to estimate the encoding, and `X_val`, used to apply the encoding  
    - For each column, compute the target mean of every category on `X_tr`, map it onto `X_val`, and record it in a new column  
    - Fill the `train_new` data frame with the encoded result  
    
In effect, each fold is encoded with target means computed on the other four folds. 
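A self-contained sketch of the same idea. It is not a transcription of the snippet in the figure: the data is synthetic, the folds are assigned with a simple modulo instead of a fold generator, and filling unseen categories with the global mean is an assumption.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical training frame: one categorical column and a binary target
df_tr = pd.DataFrame({
    "city": np.tile(["a", "b", "c"], 10),
    "target": rng.integers(0, 2, size=30),
})

n_folds = 5
fold_id = np.arange(len(df_tr)) % n_folds  # simple deterministic fold assignment
encoded = pd.Series(np.nan, index=df_tr.index)

for k in range(n_folds):
    val_mask = fold_id == k
    X_tr, X_val = df_tr[~val_mask], df_tr[val_mask]
    # Estimate the encoding on the other folds only ...
    means = X_tr.groupby("city")["target"].mean()
    # ... and apply it to the held-out fold
    encoded.loc[X_val.index] = X_val["city"].map(means).to_numpy()

# Categories never seen in the other folds fall back to the global mean
df_tr["city_mean_target"] = encoded.fillna(df_tr["target"].mean())
```
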
    

**2. Smoothing**

* If a category is large, we can trust the encoding because we have a lot of samples. But if the category is rare, then we can't.  
* $\alpha$ controls the amount of regularization  
* Only works together with some other regularization method; it can be combined with the CV loop

Formula:  
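$\frac{\bar{x}_T \cdot n_{\text{ROWS}} + \bar{x}_G \cdot \alpha}{n_{\text{ROWS}} + \alpha}$

$\bar{x}_T$: Target mean within the category  
$\bar{x}_G$: Global target mean  
$n_{\text{ROWS}}$: Number of rows in the category  
With $\alpha = 0$ there is no smoothing; as $\alpha$ grows, the encoding moves toward the global mean.  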
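The smoothing formula in pandas, on made-up data with one large and one rare category. The value `alpha = 10` is an arbitrary choice for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["big"] * 100 + ["rare"] * 2,
    "target": [1] * 30 + [0] * 70 + [1, 1],
})

alpha = 10  # hypothetical regularization strength
global_mean = df["target"].mean()
stats = df.groupby("city")["target"].agg(["mean", "count"])

smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (stats["count"] + alpha)

# The large category barely moves; the rare one is pulled toward the global mean
print(stats["mean"], smoothed, global_mean)
```
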

**3. Adding random noise**

Very finicky! Hard to make it work.

**4. Sorting and calculating expanding mean**

* Fix some sorting order for the data, and use only rows from zero to (n-1) to calculate the encoding for row n  
* Upside: Least amount of leakage, no hyperparameter tuning   
* Downside: Irregular encoding quality  
* Built into CatBoost, which performs great on datasets with categorical features    

Code snippet:

![Expanding mean](fig/expanding_mean_snippet.png)


* `cumsum` stores the sum of the target variable up to the given row   
* `cumcnt` stores the cumulative count  
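A runnable sketch along the same lines, on a tiny made-up frame. It assumes the current row is excluded from both the sum and the count, so that only earlier rows contribute. Filling the first occurrence of each category with the global mean is also an assumption.

```python
import pandas as pd

df_tr = pd.DataFrame({
    "city": ["a", "a", "b", "a", "b", "a"],
    "target": [1, 0, 1, 1, 0, 0],
})

# Sum of the target over earlier rows of the same category (current row excluded)
cumsum = df_tr.groupby("city")["target"].cumsum() - df_tr["target"]
# Number of earlier rows of the same category
cumcnt = df_tr.groupby("city").cumcount()

# 0/0 gives NaN for the first row of each category
expanding_mean = cumsum / cumcnt
df_tr["city_mean_target"] = expanding_mean.fillna(df_tr["target"].mean())

print(df_tr)
```
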