##### Machine Learning 2: predictive models, deep learning, neural network 2022Z

# Gradient Boosting (GRB)

## AdaBoost
- AdaBoost was the first successful boosting algorithm [Freund et al., 1996, Freund and Schapire, 1997]
- the most successful form of the AdaBoost algorithm for binary classification problems is called AdaBoost.M1
- belongs to the group of algorithms called Arcing (Adaptive Reweighting and Combining)
  - each round performs a weighted minimization, then recomputes the classifier and the weights of the input instances
  
<img src="Schapire.png">

#### Main idea
- model is a combination of many "weak learners" - decision trees with a single split <B>(stumps)</B>

#### Key features
- puts more weight on instances that are difficult to classify and less on those already handled well
- the order of the stumps is important: the error of the first stump influences how the second stump is made
- some stumps have greater weights (votes) than others
- the final classification is made by the total weighted vote for each output
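
The features above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the helper names (`fit_stump`, `adaboost_fit`, `adaboost_predict`) and the toy data are assumptions, labels are coded as -1/+1, and the stump vote uses the common ½·ln((1 - err) / err) form.

```python
import numpy as np

def stump_predict(X, feature, thr, polarity):
    # polarity=+1: predict -1 on the left (x <= thr) and +1 on the right
    return np.where(X[:, feature] <= thr, -polarity, polarity)

def fit_stump(X, y, w):
    """Return (weighted error, feature, threshold, polarity) of the best single split."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for polarity in (1, -1):
                pred = stump_predict(X, j, thr, polarity)
                err = np.sum(w[pred != y])          # weight of misclassified instances
                if err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost_fit(X, y, n_rounds=3):
    w = np.full(len(y), 1.0 / len(y))               # start with equal instance weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, pol = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)        # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)       # vote of this stump
        pred = stump_predict(X, j, thr, pol)
        w = w * np.exp(-alpha * y * pred)           # raise weights of mistakes, lower the rest
        w = w / w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def adaboost_predict(X, ensemble):
    votes = sum(a * stump_predict(X, j, t, p) for a, j, t, p in ensemble)
    return np.sign(votes)                           # sign of the total weighted vote

# Hypothetical 1-D toy data that no single stump can classify perfectly
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.array([1, 1, 1, -1, -1, -1, 1, 1, 1, -1])

ensemble = adaboost_fit(X, y, n_rounds=3)
train_accuracy = np.mean(adaboost_predict(X, ensemble) == y)
print("stump votes:", [round(a, 3) for a, *_ in ensemble])
print("training accuracy:", train_accuracy)
```

Each later stump is fitted on reweighted data, so changing the order of the rounds changes the stumps themselves.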

## Example AdaBoost

### Data set

<img src="adaBoost1.png">

### Stumps
<img src="adaBoost2.png">

### Algorithm
<img src="adaBoost0.png">

<img src="adaBoost3.png">

<img src="adaBoost4.png">

<img src="adaBoost5.png">

## Gradient Boosting
- ML technique for both regression and classification problems
- develops a strong learner by iteratively combining weak learners

#### History:
- AdaBoost was formulated as gradient descent with a special loss function [Breiman et al., 1998, Breiman, 1999]
- AdaBoost was generalized to Gradient Boosting in order to handle a variety of loss functions [Friedman et al., 2000, Friedman, 2001]

#### Friedman
<img src="Friedman.png">

#### Main idea
- minimize the loss of the model by adding weak learners using a gradient-descent-like procedure

#### Why is GRB a stage-wise additive model?
- one new weak learner is added at a time
- existing weak learners in the model are frozen and left unchanged

### Gradient boosting involves three elements
1. A loss function to be optimized.
2. A weak learner to make predictions.
3. An additive model to add weak learners to minimize the loss function.
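
One common way to write how the three elements fit together is given below. The notation is an assumption made for illustration: $L$ is the loss, $h_m$ the weak learner fitted at stage $m$ to the pseudo-residuals $r_{im}$, $\nu$ the learning rate and $M$ the number of stages.

```latex
\begin{aligned}
F_0(x) &= \arg\min_{c} \sum_{i=1}^{n} L(y_i, c) \\
r_{im} &= -\left[ \frac{\partial L\big(y_i, F(x_i)\big)}{\partial F(x_i)} \right]_{F = F_{m-1}} \\
F_m(x) &= F_{m-1}(x) + \nu \, h_m(x), \qquad m = 1, \dots, M
\end{aligned}
```

Only $h_m$ is fitted at stage $m$; $F_{m-1}$ stays fixed, which is what makes the model stage-wise.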

#### Loss Function
- depends on the type of problem being solved
- must be differentiable
- regression may use a squared error and classification may use logarithmic loss
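
Differentiability matters because each new weak learner is fitted to the negative gradient of the loss. A small sketch for the two losses mentioned above (function names and numbers are hypothetical):

```python
import numpy as np

def squared_error_negative_gradient(y, pred):
    # L = 0.5 * (y - pred)^2  ->  -dL/dpred = y - pred (the ordinary residual)
    return y - pred

def log_loss_negative_gradient(y, raw_score):
    # L = -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(raw_score)
    # -dL/draw_score = y - p
    p = 1.0 / (1.0 + np.exp(-raw_score))
    return y - p

# Regression: start from the mean prediction
y_reg = np.array([3.0, 5.0, 8.0])
res_reg = squared_error_negative_gradient(y_reg, np.full(3, y_reg.mean()))

# Classification: start from a raw score (log-odds) of 0, i.e. p = 0.5
y_clf = np.array([1.0, 0.0, 1.0])
res_clf = log_loss_negative_gradient(y_clf, np.zeros(3))

print(res_reg)
print(res_clf)
```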

#### Weak Learner
- subsequent regression trees are used to correct the residuals of the current predictions
- regression trees usually have 4 to 8 levels
- possible parameters are the maximum number of layers, nodes, splits or leaf nodes
- trees are constructed in a greedy manner (best split points are chosen using purity scores like Gini for classification trees, or squared-error reduction for regression trees)

#### Additive Model
- trees are added one at a time
- existing trees in the model are not changed
- a gradient descent procedure is used to minimize the loss when adding trees (reducing the residual loss)

<img src="grbDiagram1.png">


# Example Gradient Boosting - Regression
<img src="grbXLS1.png">

## 1st tree
    a) Predicted_weight=Avg(Weights)=71.2
    b) Move the residuals into the leaves
    c) Calculate average residual for each leaf

<img src="grbXLS2.png">
<img src="grbDiagram2a.png">

## 2nd tree

    a) Predicted_weight=Avg(Weights)+learningRate*1st_Tree_AvgRes
    b) Move the residuals into the leaves
    c) Calculate average residual for each leaf

<img src="grbXLS3.png">
<img src="grbDiagram2b.png">

## 3rd tree

    a) Predicted_weight=Avg(Weights)+learningRate*(1st_Tree_AvgRes+2nd_Tree_AvgRes)
    b) Move the residuals into the leaves
    c) Calculate average residual for each leaf

 <img src="grbXLS4.png"> 
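
The three steps repeated for each tree can be sketched as follows. This is only an illustration under assumptions: the height/weight numbers are hypothetical (they are not the data from the tables above), each tree is reduced to a single-split stump, and the helper names are made up.

```python
import numpy as np

def fit_stump(x, residuals):
    """Best single split by squared error; each leaf stores its average residual."""
    best = None
    for thr in np.unique(x)[:-1]:                    # the largest value cannot split
        left, right = residuals[x <= thr], residuals[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]                                  # (threshold, left value, right value)

def stump_predict(x, stump):
    thr, left_value, right_value = stump
    return np.where(x <= thr, left_value, right_value)

# Hypothetical data
height = np.array([1.4, 1.5, 1.6, 1.7, 1.8, 1.9])
weight = np.array([50.0, 55.0, 62.0, 70.0, 80.0, 85.0])
learning_rate = 0.1

# a) the first prediction is the average weight
prediction = np.full(len(weight), weight.mean())
mse_history = [np.mean((weight - prediction) ** 2)]

for _ in range(3):
    residuals = weight - prediction
    # b) + c) move the residuals into the leaves and average them per leaf
    stump = fit_stump(height, residuals)
    # next prediction = previous prediction + learningRate * average residual of the leaf
    prediction = prediction + learning_rate * stump_predict(height, stump)
    mse_history.append(np.mean((weight - prediction) ** 2))

print([round(m, 2) for m in mse_history])
```

With a small learning rate each tree removes only part of the remaining residual, so the training error shrinks gradually from tree to tree.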
 
# Example Gradient Boosting - Classification
 <img src="grbXLS5.png"> 
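
For classification with logarithmic loss, one boosting step is usually carried out on the log-odds scale. The sketch below assumes hypothetical labels, a hypothetical leaf assignment from one small tree, and an assumed learning rate; the leaf output uses the common formula sum(residuals) / sum(p·(1 - p)).

```python
import numpy as np

# Hypothetical labels and leaf assignment (leaf 0 or leaf 1) produced by one small tree
y = np.array([1, 1, 0, 1, 0, 1], dtype=float)
leaf_id = np.array([0, 0, 0, 0, 1, 1])
learning_rate = 0.8                                   # assumed value for illustration

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y, p):
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Initial prediction: log-odds of the positive class, identical for every sample
initial_log_odds = np.log(y.mean() / (1 - y.mean()))
log_odds = np.full(len(y), initial_log_odds)
prob = sigmoid(log_odds)

# Pseudo-residuals: observed label minus predicted probability
residuals = y - prob

def leaf_value(res, p):
    # output of a leaf on the log-odds scale
    return res.sum() / (p * (1 - p)).sum()

leaf_values = {k: leaf_value(residuals[leaf_id == k], prob[leaf_id == k]) for k in (0, 1)}

# Update the log-odds with the scaled leaf outputs, then convert back to probabilities
new_log_odds = log_odds + learning_rate * np.array([leaf_values[int(k)] for k in leaf_id])
new_prob = sigmoid(new_log_odds)

print("leaf values:", leaf_values)
print("log loss before:", round(log_loss(y, prob), 4), "after:", round(log_loss(y, new_prob), 4))
```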

### Choosing hyper-parameters:
- the number of stages M: a higher number increases the accuracy on the training set, but ....
- the learning rate (also known as shrinkage)
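
One way to look at the effect of the number of stages is to track accuracy after every stage with scikit-learn's `staged_predict`. The sketch below uses synthetic data from `make_classification`; the dataset and all parameter values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only to illustrate the procedure
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=0)

model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                   max_depth=2, random_state=0)
model.fit(X_tr, y_tr)

# staged_predict yields the predictions after 1, 2, ..., M stages
train_acc = [np.mean(pred == y_tr) for pred in model.staged_predict(X_tr)]
val_acc = [np.mean(pred == y_va) for pred in model.staged_predict(X_va)]

best_m = int(np.argmax(val_acc)) + 1                  # stages are counted from 1
print("best number of stages on validation data:", best_m)
print("training accuracy at M=200:", round(train_acc[-1], 3))
print("validation accuracy at M=200:", round(val_acc[-1], 3))
```

Comparing the two curves shows whether additional stages still help on held-out data or only on the training set.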


#### Remark
Unlike Random Forests, a GBM model requires hyper-parameter tuning to perform decently.


### Disadvantages
- as a greedy algorithm, GRB can quickly overfit the training dataset


### Improvements to Basic Gradient Boosting
- Tree Constraints
- Shrinkage
- Random Sampling
- Penalized Learning

#### Tree Constraints
- weak learners have skill but remain weak
- the more constrained tree creation is, the more trees you will need in the model
- shorter trees are preferred
- constraints: maximum number of layers, nodes, splits or leaf nodes

#### Shrinkage (learning rate)
- scales the contribution of each tree to the total sum
- slows down the learning of the algorithm
- use small values, e.g. in the range 0.1-0.3

#### Stochastic Gradient Boosting
- allows trees to be greedily created from subsamples of the training dataset
- reduces the correlation between the trees in the sequence in gradient boosting models
- at each iteration a subsample of the training data is drawn at random (without replacement) from the full training dataset
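
The row subsampling itself is simple; a minimal NumPy sketch with assumed values for the sample count and the subsample fraction is shown below. In scikit-learn's `GradientBoostingClassifier` the same idea is controlled by the `subsample` parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10
subsample = 0.5                                       # assumed fraction of rows per iteration

subsamples = []
for iteration in range(3):
    # draw row indices without replacement; each iteration gets its own random subset
    idx = rng.choice(n_samples, size=int(subsample * n_samples), replace=False)
    subsamples.append(np.sort(idx))
    # the tree of this iteration would be fitted only on the rows in idx
    print(iteration, subsamples[-1])
```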

#### Penalized Gradient Boosting
- uses regression trees (numeric values in the leaf nodes)
- the values in the leaves of the trees can be called weights and regularized using popular regularization functions:
  - L1 regularization of weights
  <img src="L1.png">
  - L2 regularization of weights
  <img src="L2.png">
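
The effect of both penalties on a single leaf weight can be sketched as follows. The closed form used here is the one commonly given for XGBoost-style second-order boosting (L2 adds to the denominator, L1 soft-thresholds the gradient sum); the function name and the numbers are hypothetical.

```python
import numpy as np

def leaf_weight(grad_sum, hess_sum, reg_lambda=0.0, reg_alpha=0.0):
    # L1 (reg_alpha): shrink the gradient sum towards zero by a fixed amount
    shrunk = np.sign(grad_sum) * max(abs(grad_sum) - reg_alpha, 0.0)
    # L2 (reg_lambda): enlarge the denominator, which scales the weight down
    return -shrunk / (hess_sum + reg_lambda)

G, H = -4.0, 2.0   # hypothetical sums of gradients and hessians in one leaf

w_plain = leaf_weight(G, H)                           # no regularization
w_l2 = leaf_weight(G, H, reg_lambda=2.0)              # L2 shrinks the weight
w_l1 = leaf_weight(G, H, reg_alpha=1.0)               # L1 shrinks the weight
w_l1_strong = leaf_weight(G, H, reg_alpha=5.0)        # strong L1 sets the weight to zero

print(w_plain, w_l2, w_l1, w_l1_strong)
```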

## Extreme gradient boosting (XGBoost)
- refined and customized version of a gradient boosting decision tree system


### What is XGBoost?
- one of the most popular implementations of the Gradient Boosted Trees algorithms, created by Tianqi Chen
- part of a broader collection of tools under the umbrella of the Distributed Machine Learning Community (DMLC), the creators of the mxnet deep learning library
- regularized form of Gradient Boosting
- fast compared to other Gradient Boosting implementations (http://datascience.la/benchmarking-random-forest-implementations/)
- C++ library

### XGBoost supports the following main interfaces:
- Command Line Interface (CLI).
- C++ (the language in which the library is written).
- Python interface as well as a model in scikit-learn.
- R interface as well as a model in the caret package.
- Julia support.
- Java and JVM languages like Scala and platforms like Hadoop.

### Model Features
- gradient boosting machine including the learning rate
- Stochastic Gradient Boosting with sub-sampling at the row, column and column per split levels
- regularized gradient boosting

#### When to use XGBoost?
- When there is a large number of training samples: ideally more than 1000 training samples and fewer than 100 features, or more generally when the number of features < number of training samples.
- When there is a mixture of categorical and numeric features or just numeric features.

#### Main idea 
- minimize the objective function
- apply Taylor's Theorem so that the objective (loss) function can be used as a simple function of the newly added learner
- build a learner that achieves the maximum possible reduction of loss
- we can't enumerate all the possible tree structures, so we must use the "Exact Greedy Algorithm"
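
The Taylor step is usually written as a second-order approximation of the objective at boosting round $t$. The notation below is an assumption for illustration: $\hat{y}_i^{(t-1)}$ is the prediction of the model so far, $f_t$ the newly added learner and $\Omega$ the regularization term.

```latex
\begin{aligned}
\mathrm{Obj}^{(t)} &= \sum_{i=1}^{n} L\big(y_i,\ \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) \\
&\approx \sum_{i=1}^{n} \Big[ L\big(y_i, \hat{y}_i^{(t-1)}\big) + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \Big] + \Omega(f_t) \\
g_i &= \frac{\partial L\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \hat{y}_i^{(t-1)}}, \qquad
h_i = \frac{\partial^2 L\big(y_i, \hat{y}_i^{(t-1)}\big)}{\partial \big(\hat{y}_i^{(t-1)}\big)^2}
\end{aligned}
```

The first term inside the sum does not depend on $f_t$, so the new learner only has to minimize a quadratic function of its own outputs.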

#### How to build the new learner

- Start with a single root (contains all the training examples)
- Iterate over all features and all values per feature, and evaluate the loss reduction of each possible split:
  - gain = loss(parent instances) - (loss(left branch) + loss(right branch))
- The gain of the best split must be positive (and > min_split_gain parameter); otherwise we stop growing the branch.
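
The split search above can be sketched in NumPy. Assumptions made for illustration: `g` and `h` are the per-instance first and second derivatives of the loss, the loss of a leaf is taken as -½·G² / (H + λ) with G and H the sums of `g` and `h` in that leaf, and the function names and toy data are hypothetical.

```python
import numpy as np

def leaf_loss(g, h, reg_lambda):
    # second-order loss of a leaf that uses its optimal weight
    return -0.5 * g.sum() ** 2 / (h.sum() + reg_lambda)

def best_split(X, g, h, reg_lambda=1.0, min_split_gain=0.0):
    """Try every feature and every threshold; return (gain, feature, threshold) or None."""
    parent = leaf_loss(g, h, reg_lambda)
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j])[:-1]:          # the largest value cannot split
            left = X[:, j] <= thr
            children = (leaf_loss(g[left], h[left], reg_lambda)
                        + leaf_loss(g[~left], h[~left], reg_lambda))
            gain = parent - children
            if gain > min_split_gain and (best is None or gain > best[0]):
                best = (gain, j, thr)
    return best                                      # None means: stop growing this branch

# Hypothetical toy data with squared-error loss: g = prediction - y, h = 1
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 1.0, 5.0, 5.0])
prediction = np.full(len(y), y.mean())
g = prediction - y
h = np.ones(len(y))

split = best_split(X, g, h, reg_lambda=1.0)
no_split = best_split(X, g, h, reg_lambda=1.0, min_split_gain=10.0)
print(split)
print(no_split)
```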

### Toy example  1a (Gradient Boosting)

#### Load data

In [3]:
#survivors on Titanic
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import GradientBoostingClassifier

train_data = pd.read_csv(".\\titanic\\train.csv")
test_data = pd.read_csv(".\\titanic\\test.csv")

#### Prepare data

In [None]:
# create output vector
y_train = train_data["Survived"]
train_data.drop(labels="Survived", axis=1, inplace=True)

# combine train and test so both get identical preprocessing
# (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
full_data = pd.concat([train_data, test_data])

# remove unnecessary columns
drop_columns = ["Name", "Age", "SibSp", "Ticket", "Cabin", "Parch", "Embarked"]
full_data.drop(labels=drop_columns, axis=1, inplace=True)

# one-hot encode "Sex" as floats and replace missing values with 0
full_data = pd.get_dummies(full_data, columns=["Sex"], dtype=float)
full_data.fillna(value=0.0, inplace=True)

# split back into the original train and test rows
n_train = len(train_data)
X_train = full_data.values[:n_train]
X_test = full_data.values[n_train:]

# scale data
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# divide into training and validation data
state = 12
test_size = 0.30

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=test_size, random_state=state)

#### Build Gradient Boosting classifier - learning rate optimization

In [5]:
lr_list = [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1]

for learning_rate in lr_list:
    gb_clf = GradientBoostingClassifier(n_estimators=20, learning_rate=learning_rate, max_features=2, max_depth=2, random_state=0)
    gb_clf.fit(X_train, y_train)

    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb_clf.score(X_train, y_train)))
    print("Accuracy score (validation): {0:.3f}".format(gb_clf.score(X_val, y_val)))

# Based on the results above, is a higher learning rate always better?

#### Generate predictions

In [None]:
# use the GRB model with learning_rate=0.5 to generate predictions
gb_clf2 = GradientBoostingClassifier(n_estimators=20, learning_rate=0.5, max_features=2, max_depth=2, random_state=0)
gb_clf2.fit(X_train, y_train)
predictions = gb_clf2.predict(X_val)

print("Confusion Matrix:")
print(confusion_matrix(y_val, predictions))

print("Classification Report")
print(classification_report(y_val, predictions))

### Toy example 1b (XGBoost) 

In [None]:
#survivors on Titanic

#!pip install xgboost
from xgboost import XGBClassifier

xgb_clf = XGBClassifier()
xgb_clf.fit(X_train, y_train)

score = xgb_clf.score(X_val, y_val)
print(score)

### Toy example 2a

In [None]:
# k-fold cross validation evaluation of xgboost model
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# load data
dataset = loadtxt('diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# CV model
model = XGBClassifier()
# random_state only has an effect (and is only accepted) when shuffle=True
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

### Toy example 2b

In [None]:
# stratified k-fold cross validation evaluation of xgboost model
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# load data
dataset = loadtxt('diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# CV model
model = XGBClassifier()
# random_state only has an effect (and is only accepted) when shuffle=True
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
results = cross_val_score(model, X, Y, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))