# Gradient Boosting

Gradient Boosting is an ensemble learning method that combines the predictions of multiple weak learners to create a strong predictive model. The main idea is to iteratively fit new models to the residuals of the previous models, minimizing the residual error at each step.

## Gradient Boosting Mathematical Formulation

Consider a differentiable loss function $L(y, z)$, where $y$ is the true target and $z$ is the model's prediction. We build a weighted sum of base algorithms:

$$ a_N(x) = \sum_{n=0}^{N} \gamma_n b_n(x) $$

What about $b_0$? There are several common ways to define $b_0$ and $\gamma_0$:  
 - zero: $$b_0(x) = 0$$ 
 - most frequent class (in case of classification task): $$b_0(x) = \arg \max_{y \in \mathbb{Y}} \sum_{i=1}^{\ell} [y_i=y]$$
 - average response (in case of regression task): $$b_0(x) = \frac{1}{\ell} \sum_{i=1}^{\ell} y_i$$
 - $\gamma_0$ is usually set to one, so that the initial model enters the sum unscaled: $\gamma_0 = 1$
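
The initialization choices listed above can be sketched in a few lines. This is a minimal illustration assuming NumPy label arrays; the function names `init_zero`, `init_mean`, and `init_most_frequent` are hypothetical and not part of any library.

```python
import numpy as np

def init_zero(y):
    """b_0(x) = 0 for every x."""
    return 0.0

def init_mean(y):
    """Average response, a common choice for regression."""
    return float(np.mean(y))

def init_most_frequent(y):
    """Most frequent class label, a common choice for classification."""
    classes, counts = np.unique(y, return_counts=True)
    return classes[np.argmax(counts)]

y_reg = np.array([1.0, 2.0, 6.0])
y_clf = np.array([0, 1, 1, 2, 1])

print(init_mean(y_reg))           # 3.0
print(init_most_frequent(y_clf))  # 1
```

Each function returns a single constant, so $b_0$ predicts the same value for every input $x$.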

Assume we have already built a composition $a_{N-1}(x)$ and want to choose the next base algorithm $b_N(x)$ and its weight $\gamma_N$ so that the loss decreases as much as possible:

$$ \sum_{i=1}^{\ell}L(y_i, a_{N-1}(x_i) + \gamma_N b_N(x_i)) \rightarrow \min_{b_N, \gamma_N}$$  

In other words, we want to know which numbers $s_1, \ldots, s_{\ell}$ we need to pick to solve this minimization task:

$$ \sum_{i=1}^{\ell}L(y_i, a_{N-1}(x_i) + s_i) \rightarrow \min_{s_1, \ldots, s_{\ell}}$$ 

A natural approach is to pick each $s_i$ opposite to the derivative of the loss function at the point $z=a_{N-1}(x_i)$, because the loss decreases fastest in that direction:

$$s_i = -\frac{\partial L(y_i, z)}{\partial z}\Bigr|_{z=a_{N-1}(x_i)}$$

Note that the vector $s =(s_1, \ldots, s_{\ell})$ coincides with the negative gradient of the total loss with respect to the predictions on the training set:

$$\left(-\frac{\partial L(y_i, z)}{\partial z}\Bigr|_{z=a_{N-1}(x_i)} \right)_{i=1}^{\ell} = - \nabla_z \sum_{i=1}^{\ell} L(y_i, z_i)\Bigr|_{z_i=a_{N-1}(x_i)}$$
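
As a concrete case, take the squared loss $L(y, z) = \frac{1}{2}(y - z)^2$, which is an assumption made here for illustration. Its negative gradient is simply the residual $y - z$. The sketch below computes it analytically and checks it against a finite-difference approximation; the function names are illustrative.

```python
import numpy as np

def squared_loss(y, z):
    """L(y, z) = 0.5 * (y - z)^2, evaluated elementwise."""
    return 0.5 * (y - z) ** 2

def negative_gradient(y, z):
    """dL/dz = -(y - z), so the negative gradient is the residual y - z."""
    return y - z

y = np.array([3.0, -1.0, 2.0])   # true targets
z = np.array([2.5, 0.0, 2.0])    # current predictions a_{N-1}(x_i)
s = negative_gradient(y, z)

# Central finite differences approximate dL/dz for each sample independently.
eps = 1e-6
numeric = -(squared_loss(y, z + eps) - squared_loss(y, z - eps)) / (2 * eps)

print(s)
print(np.allclose(s, numeric, atol=1e-6))
```

For other differentiable losses only `negative_gradient` changes; the rest of the procedure stays the same.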

### Boosting Procedure

1. **Initialization**: We start by creating an initial model $b_0(x)$, which could be a simple constant value like the mean of the target variable.

2. **Iteration**: At each iteration $m$, we compute the negative gradient of the loss function with respect to the previous ensemble's prediction at every training point:

   $$ s_{im} = -\frac{\partial L(y_i, a_{m-1}(x_i))}{\partial a_{m-1}(x_i)} $$

   Here, $s_{im}$ is the pseudo-residual of the $i$-th sample at iteration $m$. For the squared loss it is proportional to the ordinary residual $y_i - a_{m-1}(x_i)$.

3. **Fitting New Model**: We fit a new model $b_m(x)$ to the pseudo-residuals $s_{im}$, so that it predicts the error left by the previous models.

4. **Update**: We update the ensemble model $a_m(x)$ by adding the new model $b_m(x)$:

   $$ a_m(x) = a_{m-1}(x) + \eta \cdot b_m(x) $$

   where $\eta$ is the learning rate, which controls the step size at each iteration. It plays the role of the weights $\gamma_n$ in the formulation above, fixed to a single constant.

5. **Termination**: Repeat the iteration process for a predefined number of iterations or until a convergence criterion is met.
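
The five steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the squared loss, uses scikit-learn's `DecisionTreeRegressor` as the base algorithm, and the names `fit_boosting` and `predict_boosting` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosting(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Gradient boosting for the squared loss L(y, z) = 0.5 * (y - z)^2."""
    b0 = float(np.mean(y))                 # 1. initialization: constant model
    a = np.full(len(y), b0)                # current ensemble prediction a_{m-1}(x_i)
    trees = []
    for _ in range(n_estimators):          # 5. stop after a fixed number of iterations
        s = y - a                          # 2. negative gradient (pseudo-residuals)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, s)                     # 3. fit the new base model to s
        a = a + learning_rate * tree.predict(X)  # 4. update the ensemble
        trees.append(tree)
    return b0, learning_rate, trees

def predict_boosting(model, X):
    """Evaluate b_0 plus the scaled sum of all fitted trees."""
    b0, learning_rate, trees = model
    a = np.full(X.shape[0], b0)
    for tree in trees:
        a = a + learning_rate * tree.predict(X)
    return a

# Synthetic data, made up for this example.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1] + 0.1 * rng.normal(size=200)

model = fit_boosting(X, y)
mse_constant = np.mean((y - np.mean(y)) ** 2)
mse_boosted = np.mean((y - predict_boosting(model, X)) ** 2)
print(mse_constant, mse_boosted)
```

On the training set the boosted error should fall below the error of the constant model, since every update moves the predictions toward the fitted pseudo-residuals.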

### Final Ensemble

The final ensemble prediction $a_M(x)$ is the initial model plus the sum of the scaled individual models:

$$ a_M(x) = b_0(x) + \sum_{m=1}^{M} \eta \cdot b_m(x) $$
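
To see this sum in a fitted model, the sketch below rebuilds the prediction of scikit-learn's `GradientBoostingRegressor` by hand from its fitted attributes `init_` and `estimators_`. It assumes the default squared-error loss and counts the initial constant model as part of the sum; the data is synthetic.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
model = GradientBoostingRegressor(n_estimators=20, learning_rate=0.1, random_state=0)
model.fit(X, y)

# Initial model: by default a constant predictor fitted to the training targets.
manual = np.ravel(model.init_.predict(X)).astype(float)

# Add eta * b_m(x) for every fitted tree; estimators_ has shape (n_estimators, 1).
for tree in model.estimators_[:, 0]:
    manual += model.learning_rate * tree.predict(X)

print(np.allclose(manual, model.predict(X)))
```

The manual sum and `model.predict` should agree up to floating-point error, which shows that the learning rate scales every tree but not the initial model.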

## Summary

Gradient Boosting iteratively builds a strong predictive model by sequentially fitting new models to the negative gradients of the loss function. This approach helps improve the overall performance of the ensemble by correcting the errors made by previous models.

Because the algorithm works with any differentiable loss function, the same procedure applies to regression and classification alike, which has made it a popular choice for many machine learning tasks.


```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor as SklearnGradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

from mllib.gradient_boosting import GradientBoostingRegressor

# Synthetic regression problem. No random seed is fixed, so the MSE values vary between runs.
X, y = make_regression(n_samples=100, n_features=10, n_informative=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Custom implementation from mllib.
gb = GradientBoostingRegressor(n_estimators=100, weighted_estimators=False, learning_rate=0.1)
gb.fit(X_train, y_train)
y_hat = gb.predict(X_test)

# scikit-learn reference model with default hyperparameters.
sk_gb = SklearnGradientBoostingRegressor().fit(X_train, y_train)
y_hat_sk = sk_gb.predict(X_test)

print(f"model: {gb} \nMSE: {mean_squared_error(y_test, y_hat)}")
print(f"model: {sk_gb} \nMSE: {mean_squared_error(y_test, y_hat_sk)}")
```

```
model: <mllib.gradient_boosting.GradientBoostingRegressor object at 0x7f2dbc05f460> 
MSE: 11392.946135772683
model: GradientBoostingRegressor() 
MSE: 12578.609069448028
```
