## Ensembling methods

### Ensembling

Base algorithms $b_1, \dots, b_N$ (some machine learning models)

Regression:

$a(x)=\frac{1}{N}\sum^N_{n=1}b_n(x)$

Classification:

$a(x)=\text{argmax}_{y\in \mathbb{Y}}\sum^N_{n=1}[b_n(x) = y] = \text{mode}(b_1(x),\dots,b_N(x))$
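A minimal sketch of the two aggregation rules above, for a single object $x$. The function names and the toy predictions are hypothetical, chosen only for illustration.

```python
from collections import Counter
from statistics import mean

def ensemble_regression(predictions):
    """Average the base algorithms' outputs b_1(x), ..., b_N(x)."""
    return mean(predictions)

def ensemble_classification(predictions):
    """Return the most frequent predicted label (the mode)."""
    # most_common(1) returns [(label, count)]; ties go to the label seen first.
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs of three base algorithms for one object x.
print(ensemble_regression([2.0, 3.0, 4.0]))            # 3.0
print(ensemble_classification(["cat", "dog", "cat"]))  # cat
```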

__Why does ensembling work?__

Independent base algorithms $b_1, b_2, b_3$

Binary classification task

Probability that any single $b_i$ makes a mistake: $p$

Probability that the ensemble (majority vote) makes a mistake, i.e. at least two of the three are wrong: $3p^2(1-p) + p^3 = p^2(3-2p)$

If $p \le 0.5$, then $p^2(3-2p) \le p$
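The majority-vote error can be checked by enumerating every pattern of mistakes. The sketch below assumes the three classifiers err independently with the same probability $p$; `majority_error` is a hypothetical helper name. For three classifiers the result agrees with $3p^2(1-p) + p^3 = p^2(3-2p)$.

```python
from itertools import product

def majority_error(p, n=3):
    """Probability that a majority of n independent classifiers is wrong."""
    total = 0.0
    # Enumerate every pattern of mistakes (1 = mistake, 0 = correct).
    for pattern in product([0, 1], repeat=n):
        k = sum(pattern)
        if k > n / 2:  # more than half of the classifiers are wrong
            total += p**k * (1 - p) ** (n - k)
    return total

for p in (0.1, 0.3, 0.5):
    # Enumerated value next to the closed form p^2 (3 - 2p).
    print(p, majority_error(p), p**2 * (3 - 2 * p))
```

For $p > 0.5$ the inequality flips: the ensemble is then worse than a single base algorithm.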

## Bagging and random subspaces

We want to obtain independent base algorithms

Train each $b_i$ on its own subset of the training sample

Bagging: use bootstrap (sampling observations with replacement) to form the subsets

Random subspaces: use a random subset of features for each algorithm
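Both sampling schemes can be sketched with the standard library alone. The helper names, the sizes, and the fixed seed are assumptions made for this illustration.

```python
import random

def bootstrap_indices(n_objects, rng):
    """Bagging: draw n_objects row indices with replacement."""
    return [rng.randrange(n_objects) for _ in range(n_objects)]

def random_subspace(n_features, k, rng):
    """Random subspaces: pick k distinct feature indices."""
    return sorted(rng.sample(range(n_features), k))

rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
rows = bootstrap_indices(10, rng)  # indices may repeat; some rows may be left out
cols = random_subspace(6, 3, rng)  # 3 distinct feature indices out of 6
print(rows)
print(cols)
```

Each base algorithm would then be trained on the rows in `rows` (bagging) or on the columns in `cols` (random subspaces).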

## Random forest

__Main idea: use ensembling to reduce variance, while bias stays the same__ (smooth border)

Base algorithm: decision trees

Use bagging to build a different training subset for each tree

At each split in a tree's nodes, consider only a random subset of features

Adding more trees is unlikely to cause overfitting
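A toy sketch of the recipe above for regression, under a strong simplifying assumption: each tree is a depth-1 stump, so it has exactly one split. Real random forests grow deep trees. All names and the toy data are hypothetical.

```python
import random
from statistics import mean

def fit_stump(X, y, features):
    """Fit a depth-1 regression tree, searching splits only over `features`."""
    best = None  # (sse, feature, threshold, left_value, right_value)
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            lv, rv = mean(left), mean(right)
            sse = sum((v - lv) ** 2 for v in left) + sum((v - rv) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lv, rv)
    if best is None:  # no valid split: predict the mean everywhere
        m = mean(y)
        return lambda row: m
    _, j, t, lv, rv = best
    return lambda row: lv if row[j] <= t else rv

def fit_forest(X, y, n_trees=25, n_sub_features=1, seed=0):
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]    # bagging: bootstrap the rows
        feats = rng.sample(range(d), n_sub_features)  # random features for the split
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    # The forest prediction is the average over all trees.
    return lambda row: mean(tree(row) for tree in trees)

# Toy data: feature 0 determines the target, feature 1 is noise.
X = [[i / 10, (i * 7 % 10) / 10] for i in range(20)]
y = [1.0 if row[0] > 0.95 else 0.0 for row in X]
forest = fit_forest(X, y)
print(forest([0.0, 0.0]), forest([1.9, 0.0]))
```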

## Boosting

__Main idea: construct ensemble iteratively, correcting the mistakes of previous models__

Thus we reduce bias, and by averaging base algorithms we might also reduce variance

Unlike random forest, boosting can easily overfit as more trees are added

### Boosting: training

Boosting model:

$a_N(x) = \sum^N_{n=1}b_n(x)$

At step $N$, we train the model $b_N$ by minimizing the loss $L$ of the updated ensemble over the $l$ training objects:

$\sum^l_{i=1}L(y_i, a_{N-1}(x_i) + b_N(x_i))\to\text{min}_{b_N(x)}$

### Boosting: MSE

Mean Squared Error loss function:

$L(y, \hat{y}) = (y-\hat{y})^2$

Training boosting with MSE:

$\sum^l_{i=1}(b_N(x_i) - (y_i - a_{N-1}(x_i)))^2 \to\text{min}_{b_N(x)}$
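This objective follows from a single substitution. A short derivation, assuming $L$ is the MSE loss defined above and the prediction at step $N$ is $a_{N-1}(x_i) + b_N(x_i)$:

$L(y_i, a_{N-1}(x_i) + b_N(x_i)) = (y_i - a_{N-1}(x_i) - b_N(x_i))^2 = (b_N(x_i) - (y_i - a_{N-1}(x_i)))^2$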

How to construct $b_N(x)$?

Residuals: $s_i^{(N)} = y_i-a_{N-1}(x_i)$

Fit the decision tree $b_N$ on residuals and train with MSE:

$\sum^l_{i=1}(b_N(x_i) - s_i^{(N)})^2 \to\text{min}_{b_N(x)}$

At step $N$ we already have the output of the composition $a_{N-1}$ built so far, and we want to know how to move from that output towards the target value. This is why we take the residuals as the target for the next decision tree and fit the tree to them. This residual tree can be trained with plain Mean Squared Error.

The key point is that each of these decision trees solves a regression task. Whether the original task was regression or classification, the boosting ensemble is built out of regression trees.
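The residual-fitting loop can be sketched as follows, under several assumptions: inputs are one-dimensional with at least two distinct values, each $b_N$ is a depth-1 regression tree (a stump), $a_0(x) = 0$, and no learning rate is used. All names and the toy data are hypothetical.

```python
from statistics import mean

def fit_stump(x, s):
    """Fit a depth-1 regression tree on 1-D inputs x to targets s by minimizing MSE."""
    best = None  # (sse, threshold, left_value, right_value)
    for t in sorted(set(x))[:-1]:  # the largest value would leave the right side empty
        left = [si for xi, si in zip(x, s) if xi <= t]
        right = [si for xi, si in zip(x, s) if xi > t]
        lv, rv = mean(left), mean(right)
        sse = sum((v - lv) ** 2 for v in left) + sum((v - rv) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    _, t, lv, rv = best
    return lambda xi: lv if xi <= t else rv

def fit_boosting(x, y, n_steps=20):
    """Build a_N(x) = b_1(x) + ... + b_N(x), fitting each b_N to the current residuals."""
    trees = []
    current = [0.0] * len(x)  # a_0(x_i) = 0
    for _ in range(n_steps):
        residuals = [yi - ai for yi, ai in zip(y, current)]  # s_i = y_i - a_{N-1}(x_i)
        tree = fit_stump(x, residuals)
        trees.append(tree)
        current = [ai + tree(xi) for ai, xi in zip(current, x)]
    return lambda xi: sum(tree(xi) for tree in trees)

def mse(model, x, y):
    return mean((yi - model(xi)) ** 2 for xi, yi in zip(x, y))

# Toy regression problem: y = x^2 on a grid.
x = [i / 10 for i in range(30)]
y = [xi ** 2 for xi in x]
print(mse(fit_boosting(x, y, n_steps=1), x, y))   # training error after one tree
print(mse(fit_boosting(x, y, n_steps=20), x, y))  # lower after twenty trees
```

The training error shrinks as trees are added, which is also why boosting can overfit when the number of steps grows too large.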