## Boosting implementations

**Boosting** is a general learning approach that combines multiple weak learners, built sequentially so that the bias of the ensemble is reduced at each iteration. The defining feature of boosting is therefore that each new component learns from the mistakes made by the ensemble fitted so far. A wide range of algorithms follow the boosting principle, but two are currently the most widely used: AdaBoost and gradient boosted trees.

Unlike random forests, which reduce variance by averaging the trees in the ensemble, boosting methods focus on controlling the bias. Random forests combine estimators with low bias in a way that minimizes the overall variance, whereas boosting works with individual estimators that have low variance, and the construction of the entire ensemble seeks to minimize the bias. Consequently, boosting takes a collection of simple models and combines them into a complex final model.

This notebook, in addition to the theoretical discussion above, provides code that implements boosted models from scratch. The main reference is this [article](https://towardsdatascience.com/statistical-machine-learning-gradient-boosting-adaboost-from-scratch-8c4b5a9db9ed), but [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf) is essential reading for understanding the inner workings of this (and many other) learning algorithms. As presented in that book, the boosting principle can be understood through **AdaBoost**, an algorithm for binary classification (with labels $y_i \in \{-1, 1\}$) in which the instances that the current estimator predicts incorrectly receive more weight when the next estimator is built. The following algorithm (AdaBoost.M1) is one way of implementing AdaBoost, and its steps illustrate the boosting principle:
1. Define initial weights $w_i = 1/N$ for each data point $i \in \{1, 2, ..., N\}$.
2. For each step $m \in \{1, 2, ..., M\}$:
  1. Fit a classifier $G_m(x)$ using data points weighted by $w_i$.
  2. Calculate the following weighted classification error:
  \begin{equation}
      \displaystyle err_m = \frac{\sum_{i=1}^Nw_iI(y_i \neq G_m(x_i))}{\sum_{i=1}^N w_i}
  \end{equation}

  3. Calculate the following quantity:
  \begin{equation}
      \alpha_m = \log((1 - err_m)/err_m)
  \end{equation}

  Note that the higher $err_m$ is, the lower $\alpha_m$ will be.
  4. Redefine the weights:
  \begin{equation}
      \displaystyle w_i \leftarrow w_i \cdot \exp(\alpha_m I(y_i \neq G_m(x_i)))
  \end{equation}

  Note that more weight is put on misclassified observations.

3. Define the output from the algorithm by:
  \begin{equation}
      \displaystyle g(x) = sign\Big(\sum_{m=1}^M\alpha_mG_m(x)\Big)
  \end{equation}
  The AdaBoost.M1 algorithm above must be adapted if the objective is to compute class probabilities instead of a class prediction.
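The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used later in this notebook: the function names `adaboost_m1_fit` and `adaboost_m1_predict`, the synthetic circular dataset, and the choice of depth-1 trees as weak learners are all assumptions made for the example. The early stop when $err_m \geq 0.5$ and the renormalization of the weights are numerical safeguards that are not part of the listed steps (renormalizing does not change $err_m$, which is already a ratio).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def adaboost_m1_fit(X, y, n_rounds=50, max_depth=1):
    """Fit AdaBoost.M1 for labels in {-1, +1}; returns (classifiers, alphas)."""
    w = np.full(len(y), 1.0 / len(y))              # step 1: uniform weights
    classifiers, alphas = [], []
    for _ in range(n_rounds):                      # step 2
        clf = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
        clf.fit(X, y, sample_weight=w)             # step 2.1: weighted fit
        miss = clf.predict(X) != y
        err = np.sum(w * miss) / np.sum(w)         # step 2.2: weighted error
        if err >= 0.5:                             # safeguard: learner is no better than chance
            break
        err = max(err, 1e-10)                      # safeguard: avoid log of infinity
        alpha = np.log((1.0 - err) / err)          # step 2.3
        classifiers.append(clf)
        alphas.append(alpha)
        if not miss.any():                         # perfect learner: nothing left to reweight
            break
        w = w * np.exp(alpha * miss)               # step 2.4: upweight mistakes
        w = w / np.sum(w)                          # safeguard: keep weights bounded
    return classifiers, alphas


def adaboost_m1_predict(classifiers, alphas, X):
    """Step 3: sign of the alpha-weighted vote."""
    score = sum(a * clf.predict(X) for clf, a in zip(classifiers, alphas))
    return np.where(score >= 0, 1, -1)


# Hypothetical toy data: points outside a circle are labeled +1.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.4, 1, -1)

classifiers, alphas = adaboost_m1_fit(X, y)
stump_acc = np.mean(classifiers[0].predict(X) == y)
ensemble_acc = np.mean(adaboost_m1_predict(classifiers, alphas, X) == y)
print(f"first stump: {stump_acc:.3f}, ensemble: {ensemble_acc:.3f}")
```

A single stump can only cut the plane along one axis, so comparing `stump_acc` with `ensemble_acc` shows how the weighted vote builds a more complex boundary out of simple pieces.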

Although very similar to AdaBoost, the **gradient boosted trees** algorithm is another implementation that explicitly reproduces a fundamental principle of boosting: each estimator is constructed using information on the amount of error the current ensemble has made. The following algorithm implements a boosting model in a regression setting:
1. Define the initial guess:
\begin{equation}
    \displaystyle f_0(x) = \underset{\gamma}{\mathrm{argmin}} \sum_{i=1}^N L(y_i, \gamma)
\end{equation}
2. For each step $m \in \{1, 2, ..., M\}$:
  1. For each observation, compute:
  \begin{equation}
      \displaystyle r_{mi} = -\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\Big|_{f = f_{m-1}}
  \end{equation}
  Note that $f_{m-1}(\cdot)$ is the current ensemble, i.e., the one built up to step $m-1$.

  2. Estimate a regression tree, where the response variable is given by $r_{mi}$, providing regions $R_{mj}$, with $j \in \{1, 2, ..., J_m\}$.
  3. For each $j \in \{1, 2, ..., J_m\}$, calculate:
  \begin{equation}
      \displaystyle \gamma_{mj} = \underset{\gamma}{\mathrm{argmin}} \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + \gamma)
  \end{equation}
  4. Update the gradient boosting estimate:
  \begin{equation}
      \displaystyle f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J_m} \gamma_{mj}I(x \in R_{mj})
  \end{equation}

3. Define the final estimator by $\hat{f}(x) = f_M(x)$.
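The algorithm above can be sketched for a loss where every step is non-trivial. The example below assumes the absolute-error loss $L(y, f) = |y - f|$, for which the initial guess is the median of $y$, the pseudo-residuals are $\mathrm{sign}(y_i - f_{m-1}(x_i))$, and each $\gamma_{mj}$ is the median of the current residuals in region $R_{mj}$. The function names `gbt_fit_lad` and `gbt_predict`, the toy data, and the tree size are illustrative assumptions, not part of the source.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def gbt_fit_lad(X, y, n_rounds=50, n_leaves=4):
    """Gradient tree boosting for L(y, f) = |y - f|; returns (f0, stages)."""
    y = np.asarray(y, dtype=float)
    f0 = np.median(y)                                # step 1: argmin of sum |y_i - gamma|
    f = np.full(len(y), f0)
    stages = []
    for _ in range(n_rounds):                        # step 2
        r = np.sign(y - f)                           # step 2.1: negative gradient of |y - f|
        tree = DecisionTreeRegressor(max_leaf_nodes=n_leaves, random_state=0)
        tree.fit(X, r)                               # step 2.2: leaves define the regions R_mj
        leaf = tree.apply(X)                         # leaf index of each training point
        gamma = {j: np.median((y - f)[leaf == j])    # step 2.3: per-region minimizer
                 for j in np.unique(leaf)}
        f = f + np.array([gamma[j] for j in leaf])   # step 2.4: update the ensemble
        stages.append((tree, gamma))
    return f0, stages                                # step 3: f_M is defined by (f0, stages)


def gbt_predict(f0, stages, X):
    """Evaluate f_M(x) by adding the leaf constant of every stage."""
    f = np.full(X.shape[0], f0)
    for tree, gamma in stages:
        f = f + np.array([gamma[j] for j in tree.apply(X)])
    return f


# Hypothetical toy data: a noisy sine curve.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=300)

f0, stages = gbt_fit_lad(X, y)
mae_start = np.mean(np.abs(y - f0))
mae_end = np.mean(np.abs(y - gbt_predict(f0, stages, X)))
print(f"MAE of initial guess: {mae_start:.3f}, MAE of ensemble: {mae_end:.3f}")
```

The tree is only used to find the regions; the constants added to the ensemble come from step 2.3, which is why the leaf values of the fitted tree are discarded in favor of the `gamma` dictionary.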

The gradient tree boosting algorithm must be adapted for classification. Mainly, the four sub-steps of step 2 are repeated for each class $k \in \{1, 2, ..., K\}$, producing final functions $f_{kM}(x)$. The notation $r_{mi}$ follows from the definition of *generalized*, or *pseudo*, *residuals*: for the squared-error loss $\frac{1}{2}(y_i - f(x_i))^2$, the negative gradient is equal to the current residual $y_i - f(x_i)$, where $f(x) = f_{m-1}(x)$.
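As a worked illustration of why $r_{mi}$ generalizes the residual, assume the squared-error loss is scaled by $\frac{1}{2}$ (a convention adopted here only to avoid a factor of 2). The pseudo-residual of step 2.1 then reduces to the ordinary residual:
\begin{equation}
    \displaystyle r_{mi} = -\frac{\partial}{\partial f(x_i)} \frac{1}{2}\big(y_i - f(x_i)\big)^2 \Big|_{f = f_{m-1}} = y_i - f_{m-1}(x_i)
\end{equation}
Under the same loss, the initial guess is the sample mean and each region constant is the mean residual in that region:
\begin{equation}
    \displaystyle f_0(x) = \frac{1}{N}\sum_{i=1}^N y_i, \qquad \gamma_{mj} = \frac{1}{|R_{mj}|}\sum_{x_i \in R_{mj}} \big(y_i - f_{m-1}(x_i)\big)
\end{equation}
Here $|R_{mj}|$ denotes the number of training points in region $R_{mj}$. In this special case each boosting step amounts to fitting a regression tree to the current residuals.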

Since gradient boosted trees are the most popular implementation of boosting, it is worth highlighting the main hyper-parameters to tune when modeling with GBM from scikit-learn, LightGBM, XGBoost or any other library. First, the size of the trees in the ensemble ($J_m = J$ $\forall m$, usually controlled through the maximum depth) regulates how weak the elementary learners are. Second, we can introduce a learning rate $\nu \in (0, 1]$ that shrinks the contribution of each new component added to $f_{m-1}(x)$. Third, unlike in random forests, a large number of trees $M$ can lead to overfitting, so tuning $M$ is more relevant here. Finally, we can implement stochastic gradient boosting, where only a fraction $\eta \leq 1$ of the training data is used for fitting each component in the ensemble.
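As a hedged illustration of how these hyper-parameters appear in practice, the sketch below uses scikit-learn's `GradientBoostingRegressor`, where `max_depth` controls the tree size, `learning_rate` the shrinkage, `n_estimators` the number of trees $M$, and `subsample` the fraction of data used per tree. The synthetic data and the specific values are assumptions chosen only for the example, not recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical toy regression data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = X[:, 0] ** 2 + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

gbm = GradientBoostingRegressor(
    max_depth=3,        # size of each tree (how weak the learners are)
    learning_rate=0.1,  # shrinkage applied to each new tree
    n_estimators=300,   # number of trees M
    subsample=0.8,      # fraction of training data used per tree
    random_state=0,
)
gbm.fit(X_train, y_train)

# Validation error after each added tree, used to pick M.
val_mse = [mean_squared_error(y_val, pred) for pred in gbm.staged_predict(X_val)]
best_m = int(np.argmin(val_mse)) + 1
print(f"best number of trees on the validation split: {best_m}")
```

Tracking the validation error along `staged_predict` is one simple way to choose $M$ without refitting the model for every candidate value.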

Another approach to the boosting principle views its models as implementations of **forward stagewise additive modeling**, where $b(x; \gamma)$ is a basis function (such as a tree) with parameters $\gamma$:
1. Define $f_0(x) = 0$.
2. For each step $m \in \{1, 2, ..., M\}$:
  1. Compute:
  \begin{equation}
      \displaystyle (\beta_m, \gamma_m) = \underset{\beta, \gamma}{\mathrm{argmin}} \sum_{i=1}^N L(y_i, f_{m-1}(x_i) + \beta b(x_i; \gamma))
  \end{equation}

  2. Update $f_m(x) = f_{m-1}(x) + \beta_m b(x; \gamma_m)$.

It is especially clear how gradient boosted trees relate to forward stagewise additive modeling. Additionally, AdaBoost is equivalent to the procedure above when the loss function is the exponential loss, $L(y, f(x)) = \exp(-y f(x))$, instead of the misclassification error.
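The link with AdaBoost can be made explicit with a short derivation, following the argument in The Elements of Statistical Learning. Assume the exponential loss $L(y, f(x)) = \exp(-y f(x))$ and basis functions that are classifiers $G(x) \in \{-1, 1\}$. The stagewise problem at step $m$ becomes a weighted problem:
\begin{equation}
    \displaystyle (\beta_m, G_m) = \underset{\beta, G}{\mathrm{argmin}} \sum_{i=1}^N w_i^{(m)} \exp(-\beta y_i G(x_i)), \qquad w_i^{(m)} = \exp(-y_i f_{m-1}(x_i))
\end{equation}
For any $\beta > 0$, the minimizing classifier is the one with the smallest weighted error, and the optimal coefficient follows in closed form:
\begin{equation}
    \displaystyle G_m = \underset{G}{\mathrm{argmin}} \sum_{i=1}^N w_i^{(m)} I(y_i \neq G(x_i)), \qquad \beta_m = \frac{1}{2}\log\Big(\frac{1 - err_m}{err_m}\Big)
\end{equation}
Since $-y_i G_m(x_i) = 2I(y_i \neq G_m(x_i)) - 1$, the weights for the next step are:
\begin{equation}
    \displaystyle w_i^{(m+1)} = w_i^{(m)} \exp(-\beta_m y_i G_m(x_i)) = w_i^{(m)} \exp(\alpha_m I(y_i \neq G_m(x_i))) \exp(-\beta_m)
\end{equation}
with $\alpha_m = 2\beta_m$. The factor $\exp(-\beta_m)$ multiplies every weight equally and has no effect, so this is the weight update of AdaBoost.M1.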

The [article](https://towardsdatascience.com/statistical-machine-learning-gradient-boosting-adaboost-from-scratch-8c4b5a9db9ed) mentioned above provides a different, but consistent and well-constructed, theoretical foundation for both AdaBoost and gradient boosted trees. Even so, its solutions resemble the algorithms found in [The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf). For instance, its solution to the AdaBoost optimization problem indicates that each new component of the ensemble should minimize the sum of the weights of the misclassified observations. For gradient boosting in a regression task, each new component should minimize the sum of squared differences between its predictions and the pseudo-residuals defined previously.

The **gradient boosting implementation from scratch** found below first initializes its parameters (the step size *alpha* and the loss function) and then iterates the following until a termination criterion is met: a decision tree is fitted to the current outcome variable and its predicted values are calculated; then the alpha value for that estimator is tuned (it is shrunk until the loss function decreases); finally, the outcome variable is updated to the new pseudo-residuals.

The **AdaBoost implementation from scratch** reproduced here initializes the weights (and the ensemble prediction), and then iterates over the following: a weak decision tree is fitted with the current weights and predictions are made; the *alpha* parameter of that tree is set to its optimal value, $\frac{1}{2}\log((1 - err_m)/err_m)$ (half the $\alpha_m$ of the AdaBoost algorithm above, which leaves the sign of the ensemble prediction unchanged); then the ensemble predictions and the weights are updated. The iteration ends when the termination criterion is met.

**References**
<br>
[Statistical Machine Learning: Gradient Boosting & AdaBoost from Scratch](https://towardsdatascience.com/statistical-machine-learning-gradient-boosting-adaboost-from-scratch-8c4b5a9db9ed).
<br>
[Gradient Boosted Decision Trees Explained with a Real-Life Example and Some Python Code](https://towardsdatascience.com/gradient-boosted-decision-trees-explained-with-a-real-life-example-and-some-python-code-77cee4ccf5e?gi=25ec6e2c8c4a)
<br>
[The Elements of Statistical Learning](https://web.stanford.edu/~hastie/Papers/ESLII.pdf).

----------------

This notebook first imports all relevant libraries, and then presents an implementation and its demonstration.

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [First implementation](#first_implementation)<a href='#first_implementation'></a>.

<a id='libraries'></a>

## Libraries

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [2]:
cd "/content/gdrive/MyDrive/Studies/tree_based/Codes"

/content/gdrive/MyDrive/Studies/tree_based/Codes


In [3]:
import pandas as pd
import numpy as np
np.random.seed(123456789)
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier

<a id='first_implementation'></a>

## First implementation

This first implementation follows from this [article](https://towardsdatascience.com/statistical-machine-learning-gradient-boosting-adaboost-from-scratch-8c4b5a9db9ed) ([Github](https://github.com/atrothman/Gradient-Boosting-AdaBoost-Simulation) page of reference).

<a id='dataset'></a>

### Simulated dataset

In [4]:
def simulate_df(n=100, seed=123456, binary_flag=False):
    np.random.seed(seed)
    
    ## specify dataframe
    df = pd.DataFrame()

    ## specify variables L1 through L6
    L1_split = 0.52
    L2_split = 0.23
    L3_split = 0.38
    df['L1'] = np.random.choice([0, 1], size=n, replace=True, p=[L1_split, (1-L1_split)])
    df['L2'] = np.random.choice([0, 1], size=n, replace=True, p=[L2_split, (1-L2_split)])
    df['L3'] = np.random.choice([0, 1], size=n, replace=True, p=[L3_split, (1-L3_split)])
    df['L4'] = np.random.normal(0, 1, df.shape[0])
    df['L5'] = np.random.normal(0, 0.75, df.shape[0])
    df['L6'] = np.random.normal(0, 2, df.shape[0])
    
    theta_0 = 5.5
    theta_1 = 1.28
    theta_2 = 0.42
    theta_3 = 2.32
    theta_4 = -3.15
    theta_5 = 3.12
    theta_6 = -4.29
    theta_7 = -1.23
    theta_8 = -10.18
    theta_9 = 2.21
    theta_10 = 10.3
    
    # Signal shared by both tasks: main effects, two interactions and non-linear terms in L5
    Z = theta_0 + (theta_1*df['L1']) + (theta_2*df['L2']) + (theta_3*df['L3']) + (theta_4*df['L4']) + (theta_5*df['L5']) + (theta_6*df['L6']) + (theta_7*df['L2']*df['L4']) + (theta_8*df['L3']*df['L6']) + (theta_9*df['L5']*df['L5']) + (theta_10*np.sin(df['L5']))

    if binary_flag:
        # Binary outcome in {-1, 1}, drawn from a logistic model on Z
        p = 1 / (1 + np.exp(-Z))
        df['Y'] = np.random.binomial(1, p)
        df.loc[df['Y']==0, 'Y'] = -1
    else:
        # Continuous outcome: signal plus Gaussian noise
        df['Y'] = Z + np.random.normal(0, 0.1, df.shape[0])

    return df

<a id='gradient_boosting'></a>

### Gradient boosting

In [5]:
# Creating the simulated dataset for a regression task:
df = simulate_df(n=1000, seed=123456)
df['Y_original'] = df['Y']

# Initializing parameters:
alpha = 100000
current_loss = sum((df['Y'])**2)

# Fitting a gradient boosting model:
while(current_loss > 1):
    # Creating and fitting a weak learner:
    model = DecisionTreeRegressor(random_state=0, max_depth=3)
    model.fit(df[['L1', 'L2', 'L3','L4', 'L5', 'L6']], df['Y'])

    # Predicted outcomes, rescaled to unit Euclidean norm (equivalent to Y_hat / ||Y_hat||):
    df['Y_hat'] = model.predict(df[['L1', 'L2', 'L3','L4', 'L5', 'L6']])
    df['Y_hat_squared'] = df['Y_hat']**2
    df['Y_hat_scaled'] = np.sqrt(df['Y_hat_squared'] / df['Y_hat_squared'].sum()) * np.sign(df['Y_hat'])

    # Optimizing the learning rate:
    loss_not_lowered_flag = True
    while(loss_not_lowered_flag):
        # Calculating the loss function (sum of squared residuals):
        new_loss = sum((df['Y'] - (alpha*df['Y_hat_scaled']))**2)

        # Checking whether the current learning rate reduces the loss function:
        if(new_loss < current_loss):
            loss_not_lowered_flag = False
            current_loss = new_loss
            print('Current Loss: ' + str(current_loss))

            # Updating the outcome variable for training models (pseudo-residuals):
            df['Y'] = df['Y'] - (alpha*df['Y_hat_scaled'])
        else:
            # Redefining the learning rate:
            alpha = 0.99*alpha

    del model
print('model converged')

Current Loss: 710652.9617096438
Current Loss: 704697.9593626335
Current Loss: 687852.7483979535
Current Loss: 681537.2980095611
Current Loss: 665478.7445945848
Current Loss: 659586.5364438506
Current Loss: 640014.5760520245
Current Loss: 638497.2551319164
Current Loss: 618245.6441627133
Current Loss: 618203.3564164563
Current Loss: 616505.1593726957
Current Loss: 598494.8103541062
Current Loss: 587837.59235525
Current Loss: 579409.3972249583
Current Loss: 574747.8100380449
Current Loss: 561192.4950622892
Current Loss: 543560.0029280061
Current Loss: 543207.4725473331
Current Loss: 536458.6003377557
Current Loss: 526373.448554166
Current Loss: 506577.88892480097
Current Loss: 493605.09762763977
Current Loss: 479834.3592400863
Current Loss: 478361.1843650924
Current Loss: 464437.13313002617
Current Loss: 463945.9471018524
Current Loss: 453269.89872302604
Current Loss: 449970.7331680531
Current Loss: 446826.93110305385
Current Loss: 436211.3148086319
Current Loss: 436021.46326745563
[output truncated]
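The loop above discards each tree after updating the residuals, so the fitted ensemble cannot be applied to new data. The sketch below mirrors the same procedure (unit-norm tree predictions, alpha shrunk by 1% until the loss decreases) while keeping every tree together with its effective step. The function names, the fixed number of rounds used instead of the loss threshold, the lower bound on alpha, and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_scaled_boosting(X, y, n_rounds=30, alpha=100000.0, max_depth=3):
    """Boost regression trees on residuals, keeping (tree, step) pairs."""
    residual = np.asarray(y, dtype=float).copy()    # f_0 = 0, so residual starts at y
    current_loss = np.sum(residual ** 2)
    ensemble = []
    for _ in range(n_rounds):
        tree = DecisionTreeRegressor(random_state=0, max_depth=max_depth)
        tree.fit(X, residual)
        pred = tree.predict(X)
        norm = np.sqrt(np.sum(pred ** 2))
        if norm == 0:                               # tree predicts zero everywhere
            break
        direction = pred / norm                     # unit-norm predictions
        new_loss = np.sum((residual - alpha * direction) ** 2)
        while new_loss >= current_loss and alpha > 1e-12:
            alpha *= 0.99                           # shrink until the loss decreases
            new_loss = np.sum((residual - alpha * direction) ** 2)
        if new_loss >= current_loss:                # no improving step was found
            break
        residual = residual - alpha * direction     # new pseudo-residuals
        current_loss = new_loss
        ensemble.append((tree, alpha / norm))       # step to apply to raw tree output
    return ensemble, current_loss


def predict_scaled_boosting(ensemble, X):
    """Sum of the stored trees, each scaled by its step."""
    return sum(step * tree.predict(X) for tree, step in ensemble)


# Hypothetical toy data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] ** 2 + rng.normal(0, 0.1, size=200)

ensemble, final_loss = fit_scaled_boosting(X, y)
fitted = predict_scaled_boosting(ensemble, X)
print(f"initial loss: {np.sum(y ** 2):.2f}, final loss: {final_loss:.2f}")
```

Storing `alpha / norm` rather than `alpha` means prediction on new data only needs the raw output of each tree, without recomputing the training-set norm.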

<a id='ada_boost'></a>

### AdaBoost

In [6]:
# Creating the simulated dataset for a binary classification task:
df = simulate_df(n=1000, seed=123456, binary_flag=True)
df = df.rename(columns={'Y':'Y_original'})

# Initializing weights (same for all training instances):
df['w'] = df.shape[0]*[1/df.shape[0]]

# Initializing ensemble predictions:
df['Y'] = 0

# Fitting an AdaBoost model:
while True:
    # Creating and fitting a weak learner (using weights for training instances):
    model = DecisionTreeClassifier(random_state=0, max_depth=2)
    model.fit(df[['L1', 'L2', 'L3','L4', 'L5', 'L6']], df['Y_original'], sample_weight=df['w'])

    # Predicted outcomes (individual learner):
    df['Y_hat'] = model.predict(df[['L1', 'L2', 'L3','L4', 'L5', 'L6']])

    # Zeroing the weights of correctly predicted observations, so that the remaining
    # weights sum to the weighted error (all weights are rebuilt at the end of the step):
    df.loc[df['Y_hat']==df['Y_original'], 'w'] = 0

    # Defining the alpha parameter by its optimal value under the exponential loss:
    epsilon = sum(df['w'])
    alpha = 0.5*(np.log((1-epsilon)/epsilon))
    
    # Updating ensemble predictions:
    df['Y'] = df['Y'] + (alpha*df['Y_hat'])
    
    # Calculating loss function:
    current_loss = sum(np.exp(-df['Y_original']*df['Y']))
    print('Current Loss: ' + str(current_loss))

    # Updating weights:
    psi = np.exp(-df['Y_original']*df['Y'])
    df['w'] = psi / current_loss
    
    # Predicted outcomes (ensemble):
    df['Y_final'] = 1
    df.loc[df['Y']<0, 'Y_final'] = -1
    
    # Termination criterion (every training point correctly classified):
    if(df.loc[df['Y_original']!=df['Y_final'], :].shape[0] == 0):
        print('results converged, all datapoints correctly classified')
        break

Current Loss: 659.1631057636724
Current Loss: 487.7655215433816
Current Loss: 385.43015602209204
Current Loss: 309.6683993321115
Current Loss: 255.5542037237089
Current Loss: 218.3147055733867
Current Loss: 198.66608204748306
Current Loss: 171.9175494217112
Current Loss: 151.6839290289567
Current Loss: 138.20003547506386
Current Loss: 131.45019790958239
Current Loss: 126.25445993343435
Current Loss: 122.78411083637391
Current Loss: 116.59354372531038
Current Loss: 108.28050371146264
Current Loss: 104.39126436345835
Current Loss: 101.02340988337336
Current Loss: 97.78481006804954
Current Loss: 95.31392819906642
Current Loss: 92.54905329566284
Current Loss: 87.00935770165631
Current Loss: 85.1447829237424
Current Loss: 83.34377736632857
Current Loss: 81.5389811738938
Current Loss: 77.92521488496631
Current Loss: 74.76450253594427
Current Loss: 71.78607286865692
Current Loss: 69.08618323068403
Current Loss: 67.14548022606846
Current Loss: 66.48569498215033
Current Loss: 65.60058832576865
