## Introduction

In classification tasks, propensity models return a **probability**, not a class label.

Predicting a probability is a harder task because the model works on a continuous scale. It is not enough to get the decision boundary right - the output has to be accurate everywhere along the scale.

- Well-calibrated models estimate class probabilities accurately
- Poorly calibrated models often miss: they can <u>overestimate</u> or <u>underestimate</u> probabilities

**Model calibration** = the process of correcting probability estimates to make them more precise.

When probability calibration is NOT necessary:

- When you do not score examples and return class labels directly
- When you need probabilities only to rank the output

When it is necessary:

- When you care about accurate absolute values
    
For example, there could be some kind of probability threshold - you might react only to high probability cases.

#### How to test calibration quality

Apply the trained model to a validation dataset and compare the predicted probabilities with the observed outcomes. If the model is well calibrated, the two should agree closely.

First, predicted probabilities are grouped into buckets. Then, for each bucket, we compute the observed positive class rate.
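The bucketing step can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `reliability_table`, the choice of equal-width buckets and the toy data are all assumptions made for the example.

```python
def reliability_table(probs, labels, n_buckets=10):
    """Group predictions into equal-width buckets and compare them with outcomes.

    Returns one (mean_predicted, positive_rate, count) tuple per non-empty bucket.
    """
    sums = [[0.0, 0, 0] for _ in range(n_buckets)]  # [sum of p, positives, count]
    for p, y in zip(probs, labels):
        # p == 1.0 would map to index n_buckets, so clamp it into the last bucket
        idx = min(int(p * n_buckets), n_buckets - 1)
        sums[idx][0] += p
        sums[idx][1] += y
        sums[idx][2] += 1
    return [(s / n, pos / n, n) for s, pos, n in sums if n > 0]


# Toy validation data (made up for illustration)
probs = [0.05, 0.15, 0.12, 0.55, 0.58, 0.92, 0.95, 0.97]
labels = [0, 0, 1, 1, 0, 1, 1, 1]

for mean_pred, pos_rate, count in reliability_table(probs, labels):
    print(f"predicted={mean_pred:.2f}  observed={pos_rate:.2f}  n={count}")
```

Each printed row corresponds to one point of a reliability plot: the first value goes on the X axis, the second on the Y axis.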

**Reliability (calibration) plots** show how the error depends on the predicted probability
- X = probability bucket
- Y = observed positive class rate in the corresponding bucket

If the probability estimates are perfect, the positive class rate lies exactly on the dotted diagonal line.

<img src="img/calibration_curve.png" width=500>









## Calibration metrics

Let's denote

- acc = observed positive class rate in bucket $B_m$
- conf = average predicted probability in bucket $B_m$

Then the calibration error of bucket $B_m$ = |acc - conf|

----------

**Expected Calibration Error** is the average calibration error weighted by the share of examples in each bucket ($N$ is the total number of examples)

$$ ECE = \sum_{m=1}^{10} \frac{|B_m|}{N} |acc(B_m) - conf(B_m)| $$

This is the sample estimate. The population version looks like this:

$$ECE = E_{p}\big[\, |P(Y=1 \mid \hat{P}=p) - p| \,\big]$$

----------

**Maximum Calibration Error** is the largest calibration error among all buckets

$$ MCE = \max_{m=1..10} \big( |acc(B_m) - conf(B_m)| \big) $$
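Both metrics can be computed in one pass over the buckets. The sketch below assumes binary labels and equal-width buckets; the name `calibration_errors` and the toy data are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def calibration_errors(probs, labels, n_buckets=10):
    """Return (ECE, MCE) for binary predictions using equal-width buckets."""
    buckets = [[] for _ in range(n_buckets)]
    for p, y in zip(probs, labels):
        buckets[min(int(p * n_buckets), n_buckets - 1)].append((p, y))

    n_total = len(probs)
    ece, mce = 0.0, 0.0
    for bucket in buckets:
        if not bucket:
            continue  # empty buckets contribute nothing
        conf = sum(p for p, _ in bucket) / len(bucket)  # average predicted probability
        acc = sum(y for _, y in bucket) / len(bucket)   # observed positive class rate
        gap = abs(acc - conf)
        ece += len(bucket) / n_total * gap  # weight by the share of examples
        mce = max(mce, gap)
    return ece, mce


# Toy data: the 0.2 bucket is calibrated, the 0.8 bucket is overconfident
probs = [0.2] * 5 + [0.8] * 5
labels = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

ece, mce = calibration_errors(probs, labels)
print(f"ECE={ece:.2f}  MCE={mce:.2f}")
```

In this toy case only one of the two equally sized buckets is off (by 0.2), so ECE is half of MCE.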

----------

**Brier Score** is the mean squared error of the predicted probabilities

Let's denote

- $p_i$ = predicted probability of target class (interval [0.0,1.0])
- $o_i$ = observed class label (0 for "negative" class and 1 for "target" class)

$$BS = \frac{1}{N}\sum_{i=1}^N (p_i - o_i)^2$$

Note that the Brier score is a simpler alternative to the standard negative log-likelihood (log-loss).
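To make the comparison concrete, here is a small sketch that computes both scores on the same made-up predictions. The helper names and the data are assumptions for illustration. The last example is a confident miss, which the Brier score penalizes by at most 1 while log-loss has no upper bound.

```python
import math


def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, labels)) / len(probs)


def log_loss(probs, labels, eps=1e-15):
    """Average negative log-likelihood of the observed labels."""
    total = 0.0
    for p, o in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= o * math.log(p) + (1 - o) * math.log(1 - p)
    return total / len(probs)


# Toy data: the last prediction (0.01 for a positive example) is a confident miss
probs = [0.9, 0.8, 0.3, 0.01]
labels = [1, 1, 0, 1]

print(f"Brier score = {brier_score(probs, labels):.4f}")
print(f"Log-loss    = {log_loss(probs, labels):.4f}")
```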

### How different models behave



Poorly calibrated models often have a sigmoid-shaped reliability curve - they tend to 
- overestimate LOW probabilities 
- underestimate HIGH probabilities

Though this is not a universal rule.

#### Research

A 2005 [study](https://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf) examined how well different models are calibrated. The authors tested a number of models that were state of the art at the time on different datasets.

<img src="img/calibration_models.png" width=700>

#### Poorly-calibrated examples

**Boosted Trees** show a clear sigmoid-shaped curve.
Suppose we have an observation we are confident about. Since boosting is an ensemble of models, ALL of its classifiers would have to return the highest score to produce a probability near 1, which rarely happens in practice. Predictions are therefore pushed away from 0 and 1.

**SVM** is a margin maximization method => it focuses on HARD examples, the ones with probabilities around 0.5. It does not pay much attention to easier examples.

#### Well-calibrated examples

Because they optimize log-loss directly, **Logistic Regression** models are usually well calibrated.

Older and simpler **neural networks** were also well calibrated. Modern deep networks, however, are often poorly calibrated.

## How to calibrate

### Platt Scaling

[paper](https://www.researchgate.net/publication/2594015_Probabilistic_Outputs_for_Support_Vector_Machines_and_Comparisons_to_Regularized_Likelihood_Methods), 1999, Microsoft

__Idea:__ train and apply an auxiliary output-correcting model (specifically a 1-d logistic regression) that maps raw outputs to [0,1] and makes them more logistic-like

$$f(x) \rightarrow \frac {1}{1+e^{a f(x)+b}}$$

The process is the same as applying the logistic function to a logit (the linear output of a model) in logistic regression.

$$t=\beta _{0}+\beta _{1}x \rightarrow \frac {1}{1+e^{-t}}$$

The only difference is the additional scaling parameters $a$ and $b$, which are fitted on held-out data.

A sigmoid correction works well when the reliability curve itself is sigmoid-shaped => it fixes that kind of distortion, but not arbitrary ones.

The motivation was the poor quality of SVM probability outputs (SVMs were popular at the time).

```
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()  # uncalibrated base classifier (scikit-learn)
```
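The fitting step itself can be sketched without any library. The code below is a simplified illustration, not Platt's exact procedure: it fits $a$ and $b$ by plain gradient descent on log-loss and skips the target smoothing described in the paper. The function names, learning rate, iteration count and toy scores are all assumptions.

```python
import math


def platt_prob(score, a, b):
    """Platt's mapping from a raw score to a probability: 1 / (1 + exp(a*f + b))."""
    s = a * score + b
    if s >= 0:  # numerically stable evaluation for large positive s
        e = math.exp(-s)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(s))


def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit a and b by gradient descent on the average log-loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for f, y in zip(scores, labels):
            p = platt_prob(f, a, b)
            # with s = a*f + b, the derivative of log-loss w.r.t. s is (y - p)
            grad_a += (y - p) * f
            grad_b += (y - p)
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy held-out data: raw classifier scores (e.g. SVM margins) and true labels
scores = [-3.0, -2.0, -1.5, -1.0, -0.5, 0.2, 0.5, 1.0, 1.5, 2.0, 3.0]
labels = [0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
print(f"a={a:.3f}  b={b:.3f}")
for f in (-3.0, 0.0, 3.0):
    print(f"score={f:+.1f} -> calibrated probability {platt_prob(f, a, b):.3f}")
```

With this parametrization $a$ comes out negative whenever higher scores mean a higher chance of the positive class.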

### Isotonic regression

Isotonic regression is a piecewise constant regression that is used for modeling monotonically increasing data.

<img src="img/isotonic.png" width=500>

It is the same approach as in Platt scaling, but isotonic regression does not assume a sigmoid dependency => it is a more general approach. The drawback - it is more susceptible to overfitting on small datasets.

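The fit minimizes squared error under a monotonicity constraint; the number and position of the constant intervals follow from the data
$$ \min_{\hat{y}} \sum_{i=1}^N (o_i - \hat{y}_i)^2 \quad \text{subject to} \quad \hat{y}_i \le \hat{y}_j \ \text{whenever} \ f(x_i) \le f(x_j) $$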
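One common way to fit an isotonic regression is the pool-adjacent-violators algorithm, sketched below in plain Python. This is an illustrative version under simplifying assumptions: scores are distinct, prediction is a pure step function without interpolation, and the names `isotonic_fit` and `isotonic_predict` are hypothetical. In this sketch the intervals come out of the merging step rather than being set in advance.

```python
def isotonic_fit(scores, labels):
    """Pool Adjacent Violators: least-squares fit of a non-decreasing step function.

    Returns a list of (largest_score_in_block, fitted_value) sorted by score.
    """
    blocks = []  # each block: [sum of labels, count, largest score in block]
    for s, y in sorted(zip(scores, labels)):
        blocks.append([float(y), 1, s])
        # merge backwards while the previous block's mean exceeds the last one's
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def isotonic_predict(model, score):
    """Return the value of the first block whose upper bound covers the score."""
    for upper, value in model:
        if score <= upper:
            return value
    return model[-1][1]  # scores above the training range get the last value


# Toy held-out data: raw model scores and true labels
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 0, 1, 1, 0, 1]

model = isotonic_fit(scores, labels)
for upper, value in model:
    print(f"score <= {upper:.1f} -> calibrated probability {value:.3f}")
```

Every merged block is replaced by the mean label of its points, which is what makes the fitted values monotone.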

### Temperature Scaling

[paper](https://arxiv.org/pdf/1706.04599.pdf), 2017

Similar approach, but here it can be embedded right in the neural network.

T (temperature) is a single scalar parameter that divides the logit output **z** of the network (right before applying softmax).

$$softmax(z)_k = \frac{e^{z_k/T}}{\sum_i e^{z_i/T}}$$

<img src="img/temperature_scaling.png" width=500>


It acts as a modifier (smoother or sharpener) for the softmax function. A fully smoothed softmax ($T \to \infty$) is a constant (uniform) distribution. A fully sharpened softmax ($T \to 0$) is just a regular argmax (indicator function). Somewhere in between there is a value of T that gives the best approximation of the output probabilities.

The optimal T is found by a standard optimization of the log-loss on a held-out validation set, with the weights of the trained network kept fixed.
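A minimal sketch of the search for T, assuming the validation logits are already available as plain lists. A simple grid search stands in for a gradient-based optimizer here; the function names, the grid bounds and the toy logits are assumptions made for the example.

```python
import math


def softmax(logits, T=1.0):
    """Softmax of logits divided by temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, T):
    """Average negative log-likelihood of the true classes at temperature T."""
    total = 0.0
    for row, y in zip(logit_rows, labels):
        scaled = [z / T for z in row]
        m = max(scaled)
        log_norm = m + math.log(sum(math.exp(z - m) for z in scaled))
        total -= scaled[y] - log_norm
    return total / len(labels)


def fit_temperature(logit_rows, labels, lo=0.05, hi=10.0, n_grid=200):
    """Pick T from a log-spaced grid by minimizing validation NLL."""
    best_T, best_loss = 1.0, nll(logit_rows, labels, 1.0)
    for i in range(n_grid):
        T = lo * (hi / lo) ** (i / (n_grid - 1))
        loss = nll(logit_rows, labels, T)
        if loss < best_loss:
            best_T, best_loss = T, loss
    return best_T


# Toy validation logits from an overconfident 3-class network:
# rows 4 and 6 are confident mistakes
logit_rows = [[4.0, 0.0, -1.0], [5.0, 1.0, 0.0], [0.5, 4.5, 0.0],
              [6.0, 0.0, 1.0], [0.0, 1.0, 5.0], [4.0, 0.5, 0.0]]
labels = [0, 0, 1, 1, 2, 2]

T = fit_temperature(logit_rows, labels)
print(f"T = {T:.2f}")
print("before:", [round(p, 3) for p in softmax(logit_rows[0])])
print("after: ", [round(p, 3) for p in softmax(logit_rows[0], T)])
```

Dividing all logits by the same positive T never changes which class has the largest probability, so the accuracy of the network stays the same and only its confidence changes.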

### Probability Calibration Trees

2017

