```
______                 _   _
| ___ \               | | (_)
| |_/ / ___   ___  ___| |_ _ _ __   __ _
| ___ \/ _ \ / _ \/ __| __| | '_ \ / _` |
| |_/ / (_) | (_) \__ \ |_| | | | | (_| |
\____/ \___/ \___/|___/\__|_|_| |_|\__, |
                                    __/ |
                                   |___/
```

# Motivation
Boosting is the approach to use when ensembling is not suitable. The goal of
boosting is to build a model on top of a low-complexity hypothesis space. This
is done by combining a series of base models (a.k.a. weak learners) drawn from
that low-complexity space.

# Boosting as fitting residuals
Whereas ensembles are built from independent base models, boosting builds its
base models sequentially, each one depending on those before it. One way to
approach boosting is to see it as growing a big model one base model at a time,
where each new base model is fitted to the residuals of the current big model.

More formally, at stage $t$ the following optimization problem is solved:
> $$f_{[t]}, w_{[t]} = \arg\min_{f, w} \sum_{i=1}^n \ell \left(y_i, \hat{y}_{[t-1]}(x_i) + w f(x_i) \right)$$

where $\{(x_i, y_i)\}_{i=1}^n$ is the training set, $\ell$ is the loss, $f$
ranges over the low-complexity hypothesis space, and $\hat{y}_{[t-1]}$ is the
big model built during the previous stages. A new simple model $f_{[t]}$ is
added, together with its weight $w_{[t]}$ (sometimes the weight is omitted).

Therefore the model after $t$ stages looks like
> $$\hat{y}_{[t]}(x) = \hat{y}_{[0]}(x) + \sum_{\tau=1}^t \lambda w_{[\tau]} f_{[\tau]}(x)$$

$0 < \lambda \leq 1$ is called the learning rate. Setting $\lambda < 1$ shrinks
the contribution of each stage, which serves to regularize learning.
$\hat{y}_{[0]}(x)$ is usually either $0$ or the best constant over the training
set.
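
The stage-wise procedure above can be sketched in a few lines of plain Python.
This is only an illustrative sketch under assumptions that are not part of the
text: inputs are 1-D, the loss is the squared error, base models are
least-squares stumps, the weight $w_{[t]}$ is fixed to 1, and the names
`fit_stump` and `boost` are hypothetical.

```python
import math


def fit_stump(xs, targets):
    """Least-squares regression stump (depth-1 tree) on 1-D inputs."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    mean = sum(targets) / len(targets)
    # (sum of squared errors, threshold, left value, right value)
    best = (float("inf"), None, mean, mean)
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue  # no threshold can separate equal inputs
        left = [targets[i] for i in order[:k]]
        right = [targets[i] for i in order[k:]]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if sse < best[0]:
            best = (sse, (lo + hi) / 2, lv, rv)
    _, thr, lv, rv = best
    if thr is None:  # all inputs identical: fall back to a constant
        return lambda x: mean
    return lambda x: lv if x <= thr else rv


def boost(xs, ys, n_stages=50, learning_rate=0.5):
    """Boosting as residual fitting (assumption: squared error, weight w = 1)."""
    y0 = sum(ys) / len(ys)  # best constant under the squared error
    preds = [y0] * len(xs)
    stumps = []
    for _ in range(n_stages):
        # Residuals of the current big model on the training set.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add the new base model, shrunk by the learning rate.
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: y0 + learning_rate * sum(s(x) for s in stumps)


# Toy data (made up for illustration): a sine curve sampled on a grid.
xs = [i / 20 for i in range(40)]
ys = [math.sin(3 * x) for x in xs]

model = boost(xs, ys)
y_mean = sum(ys) / len(ys)
mse_constant = sum((y - y_mean) ** 2 for y in ys) / len(ys)
mse_boosted = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_constant, mse_boosted)
```

Because a stump's leaf values are already chosen by least squares, rescaling it
by an extra weight would not lower the squared error, which is why the weight
is dropped here.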


# Boosting in practice
Boosting as described in the previous section is a generic framework. To deploy
it in practice we need to
- choose the low-complexity hypothesis space;
- define the loss more precisely (and come up with a way to solve the
  optimization problem).

Most often, stumps (decision trees of depth 1) are chosen as base learners.
The choice of the loss depends on whether the task is a classification or a
regression problem.

## AdaBoost
[AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html)
is a specific instance of the boosting framework with stumps where the loss is
the `exponential loss`, $\ell(y, \hat{y}) = \exp(-y \hat{y})$ for labels
$y \in \{-1, +1\}$. This classification loss leads to a closed-form solution of
the optimization problem.
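
As a rough illustration of that closed form, here is a minimal sketch of
discrete AdaBoost in plain Python. It assumes 1-D inputs, labels in
$\{-1, +1\}$ and threshold stumps; the helper names `fit_weighted_stump` and
`adaboost` and the toy data are hypothetical. The stump weight is the usual
closed-form value $\frac{1}{2} \ln \frac{1 - \epsilon}{\epsilon}$, where
$\epsilon$ is the weighted training error of the stump.

```python
import math


def fit_weighted_stump(xs, ys, weights):
    """Return (weighted error, stump) for the best threshold rule on 1-D inputs."""
    values = sorted(set(xs))
    # Candidate thresholds: one below all points, then the midpoints.
    candidates = [values[0] - 1.0] + [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_err, best_rule = None, None
    for thr in candidates:
        for sign in (1, -1):
            # The rule predicts `sign` when x > thr and `-sign` otherwise.
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if (sign if x > thr else -sign) != y)
            if best_err is None or err < best_err:
                best_err, best_rule = err, (thr, sign)
    thr, sign = best_rule
    return best_err, (lambda x: sign if x > thr else -sign)


def adaboost(xs, ys, n_stages):
    n = len(xs)
    weights = [1.0 / n] * n  # start from uniform sample weights
    ensemble = []            # list of (alpha, stump) pairs
    for _ in range(n_stages):
        err, stump = fit_weighted_stump(xs, ys, weights)
        if err >= 0.5:
            break  # no stump beats chance on the reweighted data
        err = max(err, 1e-12)  # guard against log of infinity for a perfect stump
        alpha = 0.5 * math.log((1 - err) / err)  # closed-form stump weight
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified points, then renormalize.
        weights = [w * math.exp(-alpha * y * stump(x))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return lambda x: 1 if sum(a * s(x) for a, s in ensemble) >= 0 else -1


# Toy data (made up): the positive class is an interval, which no single
# stump can separate from the rest.
xs = [i / 10 for i in range(30)]
ys = [1 if 10 <= i < 20 else -1 for i in range(30)]


def accuracy(model):
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


acc_single = accuracy(adaboost(xs, ys, n_stages=1))
acc_boosted = accuracy(adaboost(xs, ys, n_stages=3))
print(acc_single, acc_boosted)
```

On this toy problem a single stump cannot isolate the interval, while the
weighted vote of three stumps can, which is the point of reweighting the
training set between stages.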

> TODO example

## Gradient boosting
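Gradient boosting is a generic method that works with any loss whose gradient
can be computed: at each stage, the new base model is fitted to the negative
gradient of the loss with respect to the current predictions. For regression
under the squared error loss, this negative gradient is exactly the residual,
and the method becomes `least-squares boosting`.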
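
To make the role of the gradient concrete, the sketch below keeps the boosting
loop generic and plugs in the loss only through its negative gradient. It is a
simplified illustration, not a reference implementation: inputs are 1-D, the
initial model is $0$, the step is a fixed learning rate, and all function names
are hypothetical. Full implementations typically also tune the size of each
step, which this sketch skips.

```python
import math


def fit_stump(xs, targets):
    """Least-squares regression stump (depth-1 tree) on 1-D inputs."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    mean = sum(targets) / len(targets)
    best = (float("inf"), None, mean, mean)  # (sse, threshold, left, right)
    for k in range(1, len(order)):
        lo, hi = xs[order[k - 1]], xs[order[k]]
        if lo == hi:
            continue
        left = [targets[i] for i in order[:k]]
        right = [targets[i] for i in order[k:]]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lv) ** 2 for t in left) + sum((t - rv) ** 2 for t in right)
        if sse < best[0]:
            best = (sse, (lo + hi) / 2, lv, rv)
    _, thr, lv, rv = best
    if thr is None:
        return lambda x: mean
    return lambda x: lv if x <= thr else rv


def squared_error_negative_gradient(y, pred):
    # loss = 0.5 * (y - pred)**2, so the negative gradient is the residual.
    return y - pred


def absolute_error_negative_gradient(y, pred):
    # loss = |y - pred|, so the negative gradient is sign(y - pred).
    return (y > pred) - (y < pred)


def gradient_boost(xs, ys, negative_gradient, n_stages=100, learning_rate=0.1):
    preds = [0.0] * len(xs)  # initial model: the constant 0
    stumps = []
    for _ in range(n_stages):
        # Pseudo-residuals: negative gradient of the loss at the current predictions.
        pseudo_residuals = [negative_gradient(y, p) for y, p in zip(ys, preds)]
        stump = fit_stump(xs, pseudo_residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: learning_rate * sum(s(x) for s in stumps)


# Toy data (made up for illustration).
xs = [i / 20 for i in range(40)]
ys = [math.sin(3 * x) for x in xs]

ls_model = gradient_boost(xs, ys, squared_error_negative_gradient)
lad_model = gradient_boost(xs, ys, absolute_error_negative_gradient)

mse_before = sum(y ** 2 for y in ys) / len(ys)  # error of the initial model 0
mse_after = sum((y - ls_model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_before, mse_after)
```

Swapping `squared_error_negative_gradient` for
`absolute_error_negative_gradient` changes the loss being minimized without
touching the loop, which is what makes the method generic.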

> TODO example
