## Gradient Boosting

- [Gradient boosting](https://en.wikipedia.org/wiki/Gradient_boosting) is another ensemble technique for classification and regression. It can be viewed as a "series circuit" of base learners.


- The idea of gradient boosting originates from [Leo Breiman](https://en.wikipedia.org/wiki/Leo_Breiman) and [Jerome Friedman](https://en.wikipedia.org/wiki/Jerome_H._Friedman) (1999).


- The diversity of the base learners is achieved by training them on different targets.


- The base learners are regressors, both for classification and regression.


- Usually, the base learners are decision tree regressors, but in theory they could be any regression algorithm.


- Gradient Boosted Decision Trees (or Gradient Boosting Machine) is a "swiss army knife" method in machine learning. It is invariant to the scale of the feature values and performs well on a wide variety of problems.

### Pseudo Code of Training (w/o Learning Rate)
<img src="../_img/gradient_boosting_algorithm.png" width="600px">

### Learning Rate

- instead of step size $\gamma_m$, we use $\eta \cdot \gamma_m$, where $\eta \in (0, 1]$
- $\eta<1$ implements the "slow cooking" idea, and in practice leads to better ensembles than $\eta=1$

<img src="../_img/slow_cooking.jpg" width="250px">

### Special Case: Gradient Boosting for Regression

- the loss function is the squared loss: $L(y, F(x)) = \frac{1}{2} \left(y - F(x)\right)^2$
- the initial model is the average target: $F_0(x) = \frac{1}{n} \sum_{i=1}^n y_i$
- pseudo-residuals: $r_{im} = y_i - F_{m-1}(x_i)$
- $m$-th weak learner: train model $h_m$ on $\{(x_i, r_{im})\}_{i=1}^n$
- weight of the $m$-th weak learner: $w_m = \eta \left[\sum_{i=1}^n h_m(x_i)r_{im}\right] / \left[\sum_{i=1}^n \left(h_m(x_i)\right)^2\right]$

**Exercise 1**: Implement a tree based gradient boosting regressor and evaluate it on the Boston Housing data set using 3-fold cross-validation! Use a maximal tree depth of 3! The metric should be RMSE!

In [1]:
# Load the Boston Housing data set.
import pandas as pd
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
         'DIS', 'RAD', 'TAX', 'PTRATIO', 'LSTAT', 'MEDV']
df = pd.read_csv('../_data/housing_data.txt', delim_whitespace=True, names=names)
df = df.sample(len(df), random_state=42) # data shuffling
X = df.values[:, :-1] # input matrix
y = df['MEDV'].values # target vector

In [16]:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGradientBoostingRegressor:
    def __init__(self, eta=0.1, n_estimators=100, max_depth=3):
        self.eta = eta
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        
    def fit(self, X, y):
        self.yhat0 = y.mean()
        r = y - self.yhat0
 
        self.trees = []
        for k in range(self.n_estimators):
            tree = DecisionTreeRegressor(max_depth=self.max_depth, random_state=42)
            tree.fit(X, r)
            h = tree.predict(X)
            w = eta * (h @ r) / (h @ h)
            r -= w * h
            self.trees.append((w, tree))
        return self
    
    def predict(self, X):
        yhat = np.ones(len(X)) * self.yhat0
        for w, tree in self.trees:
            yhat += w * tree.predict(X)
        return yhat

In [19]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

def evaluate(re, X, y):
    cv = KFold(3, shuffle=True, random_state=42)
    scores = []
    for tr, te in cv.split(X):
        re.fit(X[tr], y[tr])
        yhat = re.predict(X)
        rmse = mean_squared_error(y[te], yhat[te])**0.5
        scores.append(rmse)
    return np.mean(scores)

evaluate(SimpleGradientBoostingRegressor(), X, y)

3.375684135715931

**Exercise 2**: Repeat the previous experiment using scikit-learn!

In [26]:
from sklearn.ensemble import GradientBoostingRegressor
evaluate(GradientBoostingRegressor(random_state=42), X, y)

3.36820868174328

**Exercise 3**: Which tree depth gives the most accurate ensemble?

In [28]:
res = []
for max_depth in range(1, 13):
    print(max_depth, end=' ')
    rmse = evaluate(GradientBoostingRegressor(random_state=42, max_depth=max_depth), X, y)
    res.append({
        'max_depth': max_depth,
        'rmse': rmse
    })

1 2 3 4 5 6 7 8 9 10 11 12 

**Exercise 3/B**: How the training and test RMSE changes with the number of trees? (Use a simple train-test split for this experiment!)

**Exercise 4**: Apply a random forest and a gradient boosting classifier on the Wisconsin Breast Cancer data set! Use stratified 10-fold cross-validation! The evaluation metric should be the ratio of correct classifications. For both ensemble methods, determine the maximal tree depth that gives the highest accuracy!

In [None]:
# Load the Wisconsin Breast Cancer data set.
import pandas as pd
names = [
    'Sample_code_number', 'Clump_Thickness', 'Uniformity_of_Cell_Size',
    'Uniformity_of_Cell_Shape', 'Marginal_Adhesion', 'Single_Epithelial_Cell_Size',
    'Bare_Nuclei', 'Bland_Chromatin', 'Normal_Nucleoli', 'Mitoses', 'Class'
]
df = pd.read_csv('../_data/wisconsin_data.txt', sep=',', names=names, na_values='?')
df = df.sample(len(df), random_state=42) # data shuffling
df['Bare_Nuclei'].fillna(0, inplace=True)
X = df[df.columns[1: -1]].values
y = (df['Class'].values / 2 - 1).astype('int')

### Gradient Boosting on Steroids

- [XGBoost](https://en.wikipedia.org/wiki/XGBoost) and [LightGBM](https://en.wikipedia.org/wiki/LightGBM) are a highly efficient and flexible implementations of gradient boosting.
- XGBoost started as a research project by Tianqi Chen (in 2014).
- LightGBM was introduced by Microsoft Research (in 2016).

**Exercise 5**: Compare XGBoost, LightGBM and scikit-learn's GradientBoostingClassifier on the Wisconsin Breast Cancer problem, in terms of speed and accuracy!