# Boosting Algorithms

Boosting is the process of combining a set of weak classifiers into a strong classifier.

The problem was first posed in PAC (probably approximately correct) learning.

Schapire (AT&T) came up with the first boosting algorithm in 1989.



### Forward Stagewise Additive Modeling

The boosting process is based on the [additive modeling](https://en.wikipedia.org/wiki/Additive_model#:~:text=In%20statistics%2C%20an%20additive%20model,class%20of%20nonparametric%20regression%20models) concept, described by Friedman in 1981.

On each iteration $m$ a new (weak) estimator $b_m(x)$ is fitted so as to minimize the loss of the whole ensemble $F(x)$.
We optimize jointly over the parameters of the estimator $b_m(x)$ and its relative weight in the ensemble, $\beta_m$.

The loss function is arbitrary at this point, so we do not specify the optimization procedure yet.

<img src="img/additive_model.png" width=500>
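As a rough illustration of the forward stagewise idea, here is a minimal sketch that assumes squared loss and a hypothetical 1-D stump $b(x; t)$ as the weak estimator. The names `stump` and `fit_stagewise` and the toy data are made up for this example. On each round it searches jointly over the stump parameter $t$ and the weight $\beta$, then adds the best pair to the ensemble without touching earlier terms.

```python
# Forward stagewise additive modeling, sketched for squared loss only.
# The weak estimator is a hypothetical 1-D stump: +1 if x > t, else -1.

def stump(x, t):
    return 1.0 if x > t else -1.0

def fit_stagewise(xs, ys, thresholds, n_rounds):
    F = [0.0] * len(xs)   # current ensemble predictions F_{m-1}(x_i)
    model = []            # list of (beta_m, t_m) pairs
    for _ in range(n_rounds):
        residuals = [y - f for y, f in zip(ys, F)]
        best = None
        for t in thresholds:
            b = [stump(x, t) for x in xs]
            # For squared loss, the optimal beta for a fixed b is <r, b> / <b, b>.
            beta = sum(r * bi for r, bi in zip(residuals, b)) / sum(bi * bi for bi in b)
            loss = sum((r - beta * bi) ** 2 for r, bi in zip(residuals, b))
            if best is None or loss < best[0]:
                best = (loss, beta, t)
        _, beta, t = best
        model.append((beta, t))
        # Earlier terms stay fixed; only the new term is added.
        F = [f + beta * stump(x, t) for f, x in zip(F, xs)]
    return model

def predict(model, x):
    return sum(beta * stump(x, t) for beta, t in model)

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 1.0, 1.0, 3.0, 3.0]
thresholds = [0.5, 1.5, 2.5, 3.5, 4.5]

def sse(model):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys))

# Training loss after 0, 1, ..., 5 rounds; it should never increase.
losses = [sse(fit_stagewise(xs, ys, thresholds, m)) for m in range(6)]
print(losses)
```

With a different loss the inner search for $\beta$ no longer has this closed form, which is why the optimization step is left open in the general scheme.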

## AdaBoost

AdaBoost was described in 1995 by Freund and Schapire.

Here we assume:
- $h(x)$ is a binary classifier with 2 classes: {-1,+1}
- the loss function is exponential

In that case the additive model above turns into the following algorithm:

<img src="img/adaboost.png" width=500>

### In plain English

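0. Initialize the weight of each training case: $(\frac{1}{n},\frac{1}{n}, \cdots, \frac{1}{n})$

While the convergence criterion is not met:
1. Train a new "weak" classifier $h_{m}$ that minimizes the weighted loss function, and compute its weight $\alpha_m$ from its weighted error
2. Update the weights in the dataset: increase them for misclassified cases, decrease them for correctly classified ones
3. Renormalize the weights so that they sum to 1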
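The steps above can be sketched in plain Python. This is an illustrative toy, not a reference implementation: the weak classifier is a hypothetical 1-D decision stump, the names `stump_predict`, `fit_adaboost` and `predict` are invented here, and the stopping rule (a fixed number of rounds, or a stump no better than chance) is an assumption.

```python
import math

def stump_predict(x, threshold, polarity):
    """Hypothetical 1-D stump: `polarity` if x > threshold, else -polarity."""
    return polarity if x > threshold else -polarity

def fit_adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n                          # step 0: uniform weights
    pts = sorted(set(xs))
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    ensemble = []                                    # (alpha, threshold, polarity)
    for _ in range(n_rounds):
        # Step 1: pick the stump with the smallest weighted error.
        best = None
        for t in thresholds:
            for p in (1, -1):
                err = sum(w for x, y, w in zip(xs, ys, weights)
                          if stump_predict(x, t, p) != y)
                if best is None or err < best[0]:
                    best = (err, t, p)
        err, t, p = best
        if err >= 0.5:                               # no better than chance: stop
            break
        err = max(err, 1e-12)                        # guard against log(1/0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, t, p))
        # Step 2: up-weight mistakes, down-weight correct cases.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, p))
                   for x, y, w in zip(xs, ys, weights)]
        # Step 3: renormalize so the weights sum to 1.
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
ensemble = fit_adaboost(xs, ys, n_rounds=3)
predictions = [predict(ensemble, x) for x in xs]
print(predictions == ys)
```

On this toy set the best single stump misclassifies three of the ten points, while the weighted combination of three stumps fits all of them.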

### Derivation

Let's consider the decision function as an additive model:

$$ F_m(x) = \alpha_1 h_1(x) + \alpha_2 h_2(x) + \alpha_3 h_3(x) + \cdots + \alpha_{m-1} h_{m-1}(x) $$

We can also rewrite it as an iterative process. Suppose we already have $F_{m-1}$:

$$F_m(x) = F_{m-1}(x) + \alpha_{m-1}h_{m-1}(x)$$

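For labels $y \in \{-1,+1\}$, the sign of $-y(x)F_m(x)$ tells us whether the prediction $\operatorname{sign}(F_m(x))$ is correct: $$\operatorname{sign}\big(-y(x)F_m(x)\big) = 
\begin{cases}
    -1,& \text{when prediction is correct}\\
    1,& \text{when prediction is incorrect}
\end{cases}
$$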

Minimizing the loss means that on each step we train a "weak" classifier $h_{m-1}(x)$ that reduces the loss of the whole ensemble.

The misclassification loss is not smooth, so let's replace it with an exponential loss, which is easier to optimize:

$$L = \sum e^{-y(x) F_m(x)}$$

Plug in the decision function:
$$L = \sum e^{-yF_{m-1}(x) - y\alpha_{m-1}h_{m-1}(x)}$$

Let's factorize the exponent:
$$L = \sum e^{-yF_{m-1}(x)} e^{-y\alpha_{m-1}h_{m-1}}$$

Let's denote $e^{-yF_{m-1}}$ as the weight $w^m$:
$$L = \sum w^m e^{-y\alpha_{m-1}h_{m-1}} $$

Now we need to optimize for $(\alpha, h(x|\theta))$. 

Luckily, we can do it separately: first for $h(x)$, then for $\alpha$. Notice that for any fixed $\alpha > 0$, $L$ is minimized by the classifier with the smallest weighted error,

$$h(x) = \arg\min_{h} \sum_i w_i \, I(h(x_i) \neq y_i)$$

regardless of the value of $\alpha$.

This is equivalent to just fitting h(x) on a weighted dataset.

After that we can optimize for $\alpha$.

Notice that $yh = 1$ for correct classifications and $yh = -1$ for incorrect ones, so we can rewrite the loss as:

$$L = \sum_{correct} w^m e^{-\alpha_{m-1}} + \sum_{incorrect} w^m e^{\alpha_{m-1}}$$

$$L = \sum_{all} \big( w^m e^{-\alpha_{m-1}} \big) + \sum_{incorrect} {\big(w^m e^{\alpha_{m-1}} - w^m e^{-\alpha_{m-1}}\big)}$$

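The $\alpha$ that minimizes the loss function is given by the formula below, where $\epsilon$ is the weighted error rate (the share of the total weight that falls on misclassified cases):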

$$
\hat{\alpha} = \frac{1}{2} \log{ \frac{\sum_{correct}w^m}{\sum_{incorrect}w^m}} = \frac{1}{2} \log{\frac{1-\epsilon}{\epsilon}}
$$
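The step from the loss to this closed form can be filled in by setting the derivative with respect to $\alpha$ to zero. In the sketch below, $W_c = \sum_{correct} w^m$ and $W_i = \sum_{incorrect} w^m$ are shorthand introduced only for this derivation, and the index on $\alpha$ is dropped:

$$
\begin{aligned}
L(\alpha) &= e^{-\alpha} W_c + e^{\alpha} W_i \\
\frac{\partial L}{\partial \alpha} &= -e^{-\alpha} W_c + e^{\alpha} W_i = 0 \\
e^{2\alpha} &= \frac{W_c}{W_i} \\
\hat{\alpha} &= \frac{1}{2} \log \frac{W_c}{W_i} = \frac{1}{2} \log \frac{1-\epsilon}{\epsilon}, \qquad \epsilon = \frac{W_i}{W_c + W_i}
\end{aligned}
$$

Since $\partial^2 L / \partial \alpha^2 = e^{-\alpha} W_c + e^{\alpha} W_i > 0$, this stationary point is a minimum. Note that $\hat{\alpha} > 0$ exactly when $\epsilon < \frac{1}{2}$, that is, when the weak classifier beats random guessing.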

So the final prediction is obtained by weighted voting over the weak classifiers, with weights $\alpha=(\alpha_1,\alpha_2, \cdots, \alpha_m)$
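A small sketch of what weighted voting means in practice, with made-up weights and votes (the function name `weighted_vote` is hypothetical). It shows that a single confident classifier can outvote a majority of less reliable ones:

```python
def weighted_vote(alphas, votes):
    """Sign of the alpha-weighted sum of {-1, +1} votes."""
    score = sum(a * v for a, v in zip(alphas, votes))
    return 1 if score >= 0 else -1

alphas = [1.5, 0.4, 0.4]                        # illustrative weights only
outvoted = weighted_vote(alphas, [1, -1, -1])   # score = 1.5 - 0.8 = 0.7
flipped = weighted_vote(alphas, [-1, 1, 1])     # score = -1.5 + 0.8 = -0.7
print(outvoted, flipped)
```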


A decision tree stump (decision stump) is a 1-level decision tree.

## Multi-class AdaBoost

### SAMME

[paper](https://web.stanford.edu/~hastie/Papers/samme.pdf)

SAMME = **S**tagewise **A**dditive **M**odeling with **M**ulticlass **E**xponential loss function

SAMME is an extension of AdaBoost to multiple classes.

There are two flavours of the algorithm: 
- SAMME (for discrete estimators, which output class labels)
- SAMME.R (for real-valued estimators, which output class probabilities)

When $K = 2$, SAMME reduces to AdaBoost.
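To make the $K = 2$ statement concrete, here is a hedged sketch of the estimator weight from the SAMME paper, $\alpha = \log\frac{1-err}{err} + \log(K-1)$, next to the binary AdaBoost weight derived earlier. The function names are invented for this example, and the `learning_rate` argument is an assumption that mirrors the $\eta$ multiplier rather than something taken from the paper.

```python
import math

def samme_alpha(err, n_classes, learning_rate=1.0):
    """SAMME estimator weight for weighted error `err` and K = n_classes."""
    return learning_rate * (math.log((1.0 - err) / err) + math.log(n_classes - 1.0))

def adaboost_alpha(err):
    """Binary AdaBoost weight: 0.5 * log((1 - err) / err)."""
    return 0.5 * math.log((1.0 - err) / err)

# K = 2: the extra log(K - 1) term vanishes, so the two weights differ
# only by a constant factor of 2, which does not change the weighted vote.
ratio = samme_alpha(0.3, 2) / adaboost_alpha(0.3)

# K = 3: a weak learner only needs err < (K - 1) / K = 2/3 to get alpha > 0.
alpha_ok = samme_alpha(0.5, 3)
alpha_bad = samme_alpha(0.7, 3)
print(ratio, alpha_ok, alpha_bad)
```

The $\log(K-1)$ term is what lets SAMME accept weak learners that are better than random guessing among $K$ classes, rather than requiring an error below 0.5.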

### Important algorithm parameters
- num_estimators
- learning rate ($\eta$) - a multiplier ($0 < \eta \le 1$) that shrinks the contribution ($\alpha$) of each new weak classifier

A low learning rate helps fight overfitting, at the cost of needing more estimators.

https://stats.stackexchange.com/questions/82323/shrinkage-parameter-in-adaboost/355632#355632

# Implementations

## Sklearn

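There are 3 versions:
- AdaBoost (`AdaBoostClassifier`)
- GradientBoosting (`GradientBoostingClassifier`)
- HistogramBoosting (`HistGradientBoostingClassifier`)

Sklearn AdaBoost also exposes feature importances through the `feature_importances_` attribute. These are impurity-based and come from the underlying weak estimators; permutation-based importance is computed separately with `sklearn.inspection.permutation_importance`.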

### AdaBoost classifier
This is an implementation of SAMME AdaBoost.

Basic Parameters:
- weak estimator (default = decision stump)
- n_estimators
- learning_rate

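Values of the `algorithm` parameter:
- SAMME
- SAMME.R (deprecated in scikit-learn 1.4 and removed in later releases)

You can call <u>staged_predict_proba()</u>, which yields the class probabilities of the ensemble after each boosting iteration.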
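A minimal usage sketch, assuming scikit-learn is installed. The synthetic dataset and the parameter values are arbitrary choices for illustration, and the weak estimator and algorithm are left at their defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = AdaBoostClassifier(n_estimators=50, learning_rate=0.5, random_state=0)
clf.fit(X_train, y_train)

# One (n_samples, n_classes) probability array per boosting iteration.
staged = list(clf.staged_predict_proba(X_test))
final_proba = clf.predict_proba(X_test)

print(len(staged), staged[-1].shape)
print("test accuracy:", clf.score(X_test, y_test))
```

Plotting a metric over `staged` is a cheap way to see how many estimators are actually needed, since the model does not have to be refitted for each ensemble size.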

### GradientBoostingClassifier

Basic params
- estimator = Decision Tree
- n_estimators
- learning_rate


- loss - loss function
    - deviance = negative log-likelihood, for classification (named `log_loss` in recent scikit-learn versions)
    - exponential, for binary classification (same as in AdaBoost)

Stochastic params
- subsample
- max_features

Tree params
- max_depth
- max_leaf_nodes
- min_impurity_split (removed in recent scikit-learn versions in favour of min_impurity_decrease)

Process params
- warm_start - allows continuing training by adding more estimators to an already fitted model
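The parameter groups above can be combined as in the following sketch. It assumes scikit-learn is installed; the dataset is synthetic and all values are arbitrary illustrations, not recommended settings. The loss is left at its default to avoid depending on a version-specific name.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbm = GradientBoostingClassifier(
    n_estimators=50,     # basic params
    learning_rate=0.1,
    subsample=0.8,       # stochastic params: row subsampling per tree
    max_features=5,      # ... and feature subsampling per split
    max_depth=3,         # tree params
    warm_start=True,     # process params: keep fitted trees between fit() calls
    random_state=0,
)
gbm.fit(X_train, y_train)
n_before = len(gbm.estimators_)

# With warm_start=True, raising n_estimators and refitting adds 50 new trees
# on top of the existing 50 instead of starting from scratch.
gbm.set_params(n_estimators=100)
gbm.fit(X_train, y_train)
n_after = len(gbm.estimators_)

print(n_before, n_after, gbm.score(X_test, y_test))
```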

### HistGradientBoostingClassifier

Inspired by LightGBM




### GBM Regularization

Ways to reduce deviance on the test set:
- set a lower learning rate to enable shrinkage
- use a random subsample of the rows for training each tree (as in bagging). This is what is done in Stochastic Gradient Boosting
- use random feature selection at each split (as in random forests)

<img src="img/regularization.png" width=500>

[Code here](https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_regularization.html)

### XGBoost

2014

### LightGBM

2016

Microsoft's implementation of GBM

https://papers.nips.cc/paper/2017/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf

### CatBoost, Yandex

2017

Basic params:
- n_estimators
- learning_rate
- max_depth

Metric params:
- loss_function
- eval_metric

One of the main features of CatBoost is its ability to use categorical features directly, without manual encoding.

