## Boosting idea

- Let us use very simple base models (weak learners)
- We will construct an ensemble successively and greedily
- Each successive model will try to correct the errors of the previous ones

$a_M(x) = \sum^{M}_{m=1}b_m(x)$

Training the __first__ base model:

$min_{b_1(x)}L(b_1, X) = min_{b_1(x)}\frac{1}{N} \sum^{N}_{i=1}l(y_i, b_1(x_i))$

Training the __M__th base model:

$min_{b_M(x)}\frac{1}{N} \sum^{N}_{i=1}l(y_i, a_{M-1}(x_i)+b_M(x_i))$

The main question is how to train a tree with such a loss in the general case, since the loss function can be very complex.

## Boosting for MSE Loss

Training the __M__th base model:

$min_{b_M(x)}\frac{1}{N} \sum^{N}_{i=1}(b_M(x_i) - (y_i - a_{M-1}(x_i)))^2$

Residuals: $s^{(M)}_i = y_i - a_{M-1}(x_i)$

__First iteration__

$\frac{1}{N}\sum^{N}_{i=1}(b_1(x_i)-y_i)^2 \to min_{b_1(x)}$

__Second iteration__

$\frac{1}{N}\sum^{N}_{i=1}(b_2(x_i)-(y_i - b_1(x_i)))^2 \to min_{b_2(x)}$ (minimize the loss with respect to $b_2$)

__Third iteration__

$\frac{1}{N}\sum^{N}_{i=1}(b_3(x_i)-(y_i - b_1(x_i) - b_2(x_i)))^2 \to min_{b_3(x)}$

- Boosting is prone to overfitting, so it is essential to use a validation dataset.
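
The three iterations above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes a single numeric feature, depth-1 trees (stumps) as base models, and a starting prediction $a_0(x) = 0$, so the first stump is fitted to $y_i$ itself. The names `fit_stump` and `boost_mse` and the toy data are made up for this example.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on one feature by minimizing squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    distinct = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct feature values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [r for x, r in zip(xs, targets) if x <= t]
        right = [r for x, r in zip(xs, targets) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def boost_mse(xs, ys, n_models):
    """Greedily add stumps, each fitted to the residuals of the current sum."""
    models = []
    preds = [0.0] * len(xs)          # a_0(x) = 0 (an assumption of this sketch)
    history = []                     # training MSE after each added model
    for _ in range(n_models):
        residuals = [y - p for y, p in zip(ys, preds)]   # y_i - a_{M-1}(x_i)
        b = fit_stump(xs, residuals)
        models.append(b)
        preds = [p + b(x) for p, x in zip(preds, xs)]    # a_M = a_{M-1} + b_M
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return models, history


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x * x for x in xs]             # toy target, for illustration only
models, history = boost_mse(xs, ys, n_models=3)
print(history)                       # training MSE after 1, 2 and 3 stumps
```

The training error can only go down here, which is exactly why a separate validation set is needed to notice overfitting.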

## Difficulties with the General Loss Function

## Gradient Boosting: General Form

__Training a base model__

$M$th base model:

$\frac{1}{N}\sum^{N}_{i=1}l(y_i, a_{M-1}(x_i)+b_M(x_i)) \to min_{b_M(x)}$

How do we decide in which direction, and by how much, to change $a_{M-1}(x_i)$ in order to reduce the error?

- Calculate the gradient

$s^{(M)}_i = -\frac{\partial}{\partial z}l(y_i, z)|_{z=a_{M-1}(x_i)}$ - residuals

- The sign shows the direction in which we should change the prediction for the object $x_i$ to reduce the error of the composition on it.

- The absolute value shows how large the decrease in the error will be for a small step in that direction.

- If there is no decrease in the error, there is no point in changing the prediction.

Training $M$th base model:

$\frac{1}{N}\sum^{N}_{i=1}(b_M(x_i)-s_i^{(M)})^2 \to min_{b_M(x)}$

To sum up, we calculate the residuals of the current composition for each training point as the negative gradient of the loss. Then we train the next base model to fit these residuals. For this fitting step we can always use mean squared error, because the specifics of the loss function are already included in the calculation of the residuals.

We can think of this idea as gradient descent in the space of answers on the training dataset.

Each new base model will correct the prediction of the algorithm so that the total error of our composition will become smaller in terms of the loss function that we want.

Residuals take into account the specific loss function.
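
The "gradient descent in the space of answers" view can be tried out directly. The sketch below is an idealized illustration: instead of training a base model, it moves each training prediction along its own residual. It approximates the derivative by central finite differences rather than an analytic formula, so any loss of the form `loss(y, z)` can be plugged in. The names `pseudo_residuals` and `half_squared_error`, the step of 0.5 and the toy targets are assumptions of this example.

```python
def pseudo_residuals(loss, ys, preds, eps=1e-6):
    """Approximate s_i = -d loss(y_i, z) / dz at z = current prediction.

    Uses central finite differences, so no analytic derivative is required.
    """
    return [-(loss(y, p + eps) - loss(y, p - eps)) / (2 * eps)
            for y, p in zip(ys, preds)]


def half_squared_error(y, z):
    return 0.5 * (z - y) ** 2


ys = [3.0, -1.0, 2.0]            # toy targets, for illustration only
preds = [0.0, 0.0, 0.0]          # current composition a_{M-1}(x_i)

# For the half squared error the residuals are (numerically) y_i - a_{M-1}(x_i).
first_residuals = pseudo_residuals(half_squared_error, ys, preds)

mean_losses = []
for _ in range(5):
    s = pseudo_residuals(half_squared_error, ys, preds)
    # Idealized step: shift every prediction along its own negative gradient.
    preds = [p + 0.5 * s_i for p, s_i in zip(preds, s)]
    mean_losses.append(
        sum(half_squared_error(y, p) for y, p in zip(ys, preds)) / len(ys)
    )

print(first_residuals)
print(mean_losses)               # the mean loss shrinks at every step
```

In real boosting a base model fitted to `s` replaces the direct update, so the step is only approximately along the gradient, but it also applies to unseen objects.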

__Gradient boosting for MSE__

$s^{(M)}_i=-\frac{\partial}{\partial z}l(y_i,z)|_{z=a_{M-1}(x_i)} = -\frac{\partial}{\partial z}\frac{1}{2}(z-y_i)^2|_{z=a_{M-1}(x_i)} = -(a_{M-1}(x_i) - y_i) = y_i - a_{M-1}(x_i)$

$\frac{1}{N}\sum^{N}_{i=1}(b_M(x_i)-(y_i - a_{M-1}(x_i)))^2 \to min_{b_M(x)}$ (same as discussed before)

__Gradient boosting for logistic loss__

$s^{(M)}_i=-\frac{\partial}{\partial z}l(y_i,z)|_{z=a_{M-1}(x_i)} = -\frac{\partial}{\partial z} \log(1+\exp(-y_iz))|_{z=a_{M-1}(x_i)} = \frac{y_i}{1+\exp(y_i a_{M-1}(x_i))}$ (logistic loss)

$\frac{1}{N}\sum^{N}_{i=1}(b_M(x_i) - \frac{y_i}{1+\exp(y_i a_{M-1}(x_i))} )^2 \to min_{b_M(x)}$ (after calculating the residuals, we apply MSE to train the new base model to fit them)

- Large positive margin $y_i a_{M-1}(x_i)$: $\frac{y_i}{1+\exp(y_i a_{M-1}(x_i))} \approx 0$  
($y_i$ and $a_{M-1}(x_i)$ have the same sign and the denominator is large, which means the previous prediction is accurate and we don't want to change it much)

- Large negative margin: $\frac{y_i}{1+\exp(y_i a_{M-1}(x_i))} \approx \pm1$  
(the composition is confidently wrong, so we want to change it by training the base model to move the prediction toward the positive or the negative side, depending on the sign of $y_i$)
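
The two regimes can be checked numerically. The snippet below simply evaluates the residual formula above for labels $y_i \in \{-1, +1\}$ at a few hand-picked scores. The function name `logistic_residual` and the chosen scores are illustrative only.

```python
import math


def logistic_residual(y, a):
    """Residual s = y / (1 + exp(y * a)) for a label y in {-1, +1} and score a."""
    return y / (1.0 + math.exp(y * a))


cases = [(+1, 5.0), (-1, -5.0),   # large positive margin: confident and right
         (+1, -5.0), (-1, 5.0),   # large negative margin: confident and wrong
         (+1, 0.0)]               # zero margin: undecided
for y, a in cases:
    print(f"y={y:+d}  a={a:+.1f}  margin={y * a:+.1f}  "
          f"residual={logistic_residual(y, a):+.4f}")
```

Confidently correct points contribute almost nothing to the next base model's target, while confidently wrong points pull it toward their true label with nearly full weight.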

When we use gradient boosting, we incorporate the properties of a specific loss function by calculating its gradient at all training points.

Then we fit the base model so that it improves the total composition in terms of this loss. The base model itself can be trained using mean squared error with the residuals as the target variable.

## Boosting Hyperparameters

Tree depth

- Gradient boosting reduces the bias of base models
- Variance may increase
- Hence, it is worth using simple base models, e.g. trees with small depth:
    - Fix the depth
    - Fix maximal number of leaves
    
- Residuals show the desired direction on the whole training dataset

- Base models are supposed to be simple and may not be powerful enough to fit the residuals.

    __Solution__: add new trees to the composition with a small weight. In gradient descent, the learning rate controls the size of each step. The same applies here: when we add a new base model to the composition, we can multiply it by a step size $\mu$, usually less than one, which regularizes the composition.
    
    __Step size__
    
    $a_M(x) = a_{M-1}(x) + \mu b_M(x)$
    
    - $\mu \in (0,1]$ step size
    - One can think it is a regularization of the composition
    - We reduce the influence of each model
    - The smaller the step size is, the more trees are needed
    
    If the step size is large, the composition quickly reaches a good solution, but if we keep adding trees, the error on the test set may start to increase. If the step size is small, it takes more trees to reach good performance, but in the end we may arrive at a better solution.
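
    A worked special case may make the link between step size and number of trees concrete. Assume, purely for illustration, MSE loss and a base model that reproduces the residuals exactly on the training set, $b_M(x_i) = y_i - a_{M-1}(x_i)$ (real weak learners only approximate this). Then every step shrinks the remaining training error by the same factor:

    $y_i - a_M(x_i) = y_i - a_{M-1}(x_i) - \mu b_M(x_i) = (1-\mu)(y_i - a_{M-1}(x_i))$

    so after $M$ steps, starting from an initial prediction $a_0$:

    $y_i - a_M(x_i) = (1-\mu)^M (y_i - a_0(x_i))$

    Under this assumption, $\mu = 1$ removes the training error in one step, while $\mu = 0.1$ needs $M = 22$ steps to bring it below 10% of its initial value, since $0.9^{21} \approx 0.109$ and $0.9^{22} \approx 0.098$.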
    
    __Feature randomization__ (another way to regularize gradient boosting)
    
    - We can train trees on random subspaces
    - Boosting reduces bias, therefore the final composition will still be good enough
    - At the same time, it may reduce overfitting.
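
Training on random subspaces can be sketched as follows. This covers only the sampling step, under the assumption that each tree receives its own uniformly drawn subset of feature indices. The helper names `sample_subspace` and `project`, the seed and the toy matrix are hypothetical.

```python
import random


def sample_subspace(n_features, subspace_size, rng):
    """Pick `subspace_size` distinct feature indices uniformly at random."""
    return sorted(rng.sample(range(n_features), subspace_size))


def project(rows, feature_ids):
    """Keep only the selected feature columns of every row."""
    return [[row[j] for j in feature_ids] for row in rows]


rng = random.Random(0)                 # fixed seed so the run is repeatable
X = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]             # toy data: 2 objects, 4 features

# One subspace per tree: tree m would be trained on project(X, subspaces[m]).
subspaces = [sample_subspace(4, 2, rng) for _ in range(3)]
views = [project(X, ids) for ids in subspaces]
print(subspaces)
```

Each tree must be given the same columns at prediction time as it saw during training, so the sampled indices need to be stored alongside the tree.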
    
### Hyperparameters
- Tree depth
- Total number of trees
- Step size
- Size of the subset (used to train a tree)
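
Of these, the total number of trees is the one most naturally chosen on a validation set, since the validation error may start to rise once the composition overfits. The sketch below assumes the validation loss has already been recorded after each added tree. The helper name `best_num_trees` is hypothetical, and the loss values are made-up numbers used only to show the selection step.

```python
def best_num_trees(val_losses):
    """Return the number of trees (1-based) with the lowest validation loss.

    val_losses[k] is the validation loss of the composition with k + 1 trees.
    Ties are resolved in favour of the smaller composition.
    """
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


# Hypothetical validation losses after 1, 2, ..., 8 trees (illustrative only).
val_losses = [0.90, 0.61, 0.48, 0.41, 0.39, 0.40, 0.43, 0.47]
n_trees = best_num_trees(val_losses)
print(n_trees)
```

The same validation curve can be recomputed for several step sizes or tree depths and the best combination kept.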
