## Boosting
Boosting refers to an ensemble method in which several models are trained sequentially with each model learning from the errors of its predecessors. In this chapter, you'll be introduced to the two boosting methods of AdaBoost and Gradient Boosting.
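
To make "learning from the errors of its predecessors" concrete, here is a minimal sketch of the AdaBoost re-weighting loop written in plain Python. It is not scikit-learn's implementation: the helper names (`fit_weighted_stump`, `adaboost`, `predict`, `training_errors`) and the 1-D toy data are assumptions made for illustration. Each round fits a decision stump on weighted data, then increases the weights of the points that stump got wrong, so the next stump concentrates on them.

```python
import math

def stump_predict(x, thr, polarity):
    # A decision stump: one threshold, one side predicts +1 and the other -1.
    return polarity if x <= thr else -polarity

def fit_weighted_stump(X, y, w):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for thr in sorted(set(X)):
        for polarity in (1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if stump_predict(xi, thr, polarity) != yi)
            if best is None or err < best[0]:
                best = (err, thr, polarity)
    return best

def adaboost(X, y, n_rounds):
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        err, thr, polarity = fit_weighted_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)        # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)      # this stump's vote weight
        ensemble.append((alpha, thr, polarity))
        # Misclassified points (y * prediction == -1) get larger weights.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, thr, polarity))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(x, thr, pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

def training_errors(ensemble, X, y):
    return sum(predict(ensemble, xi) != yi for xi, yi in zip(X, y))

# Hypothetical toy data: no single threshold separates the two classes.
X = list(range(10))
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 3):
    print(rounds, "round(s):", training_errors(adaboost(X, y, rounds), X, y), "training errors")
```

On this toy data a single stump cannot fit all the points, while three re-weighted stumps voting together can.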

### Define the AdaBoost classifier
In the following exercises, you'll revisit the Indian Liver Patient dataset, which was introduced in a previous chapter. Your task is to predict whether a patient suffers from liver disease using 10 features, including Albumin, age, and gender. This time, however, you'll train an AdaBoost ensemble to perform the classification task. Because this dataset is imbalanced, you'll use the ROC AUC score as the metric instead of accuracy.

As a first step, you'll start by instantiating an AdaBoost classifier.

In [None]:
# Some notes:
# learning rate vs. number of estimators:
# a smaller learning rate must be compensated with more estimators
# boosting is popular with CART predictors, due to their high variance

# Import pandas to read csv
import pandas as pd
# Import utility functions
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score 
from sklearn.ensemble import AdaBoostClassifier

SEED = 1

# Load data
data = pd.read_csv('Wisconsin Breast Cancer.csv')
# Separate the target and the features
y = data['diagnosis']
X = data.iloc[:,3:]

# split the data
X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size = 0.3, 
                                                    random_state = SEED)

# classification tree
dt = DecisionTreeClassifier(max_depth = 1, random_state = SEED)

# Instantiate an AdaBoost classifier
# Note: the keyword is `estimator` in scikit-learn >= 1.2;
# on older versions it is `base_estimator`
adb_clf = AdaBoostClassifier(estimator = dt, 
                             n_estimators = 100)

### Train the AdaBoost classifier
Now that you've instantiated the AdaBoost classifier adb_clf, it's time to train it. You will also predict the probabilities of obtaining the positive class in the test set.

Once the classifier adb_clf is trained, call the .predict_proba() method, passing X_test as a parameter, and extract these probabilities by slicing all the values in the second column as follows:

adb_clf.predict_proba(X_test)[:,1]

The dataset was split into 70% train and 30% test in the cell above (test_size = 0.3). The feature matrices X_train and X_test, the arrays of labels y_train and y_test, and the instantiated model adb_clf are all available from the previous exercise.

In [None]:
# Fit the model, then predict positive-class probabilities (used for the ROC AUC score)
adb_clf.fit(X_train, y_train)
y_pred_proba = adb_clf.predict_proba(X_test)[:,1]

### Evaluate the AdaBoost classifier
Now that you're done training adb_clf and predicting the probabilities of obtaining the positive class in the test set, it's time to evaluate its ROC AUC score. Recall that the ROC AUC score of a binary classifier can be determined using the roc_auc_score() function from sklearn.metrics.

The arrays y_test and y_pred_proba that you computed in the previous exercise are available in your workspace.

In [10]:
adb_clf_roc_auc_score = roc_auc_score(y_test, y_pred_proba)
print('ROC AUC score:{:.2f}'.format(adb_clf_roc_auc_score))

ROC AUC score:0.99
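
One way to read the ROC AUC score is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute-force pairwise comparison in plain Python. The function name `roc_auc_pairwise` and the toy labels and scores are assumptions for illustration; in practice you would keep using roc_auc_score().

```python
def roc_auc_pairwise(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted positive-class probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_pairwise(y_true, y_score))   # 0.75: three of the four pairs are ordered correctly

# A model that gives every example the same score ranks no better than chance.
print(roc_auc_pairwise([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))   # 0.5
```

Because the score depends only on how positives are ranked relative to negatives, it is less affected by class imbalance than accuracy.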


In [16]:
# Gradient Boosting (GB)

# In GB, each tree is trained on the residual errors of its predecessor

# Import functions
from sklearn.metrics import mean_squared_error as MSE
from sklearn.ensemble import GradientBoostingRegressor

# Load data
data = pd.read_csv('Auto-mpg.csv')
# Separate the target and the features
y = data['mpg']
X = data.iloc[:,1:]
# split the data
X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size = 0.3, 
                                                    random_state = SEED)

# Instantiate a GB regressor
gbt = GradientBoostingRegressor(n_estimators = 300, max_depth = 1,
                               random_state = SEED)

# Fit, predict, and compute the test set RMSE
gbt.fit(X_train, y_train)
y_pred = gbt.predict(X_test)
rmse_test = MSE(y_test, y_pred)**(1/2)
print('Test set RMSE:{:.2f}'.format(rmse_test))

Test set RMSE:4.01
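
The idea that each tree is fit to the residuals of the ensemble so far can be sketched without scikit-learn. The code below is a simplified illustration for squared-error loss on one feature, not the library's algorithm: `fit_regression_stump`, `gradient_boost`, `rmse`, and the toy data are hypothetical names and values. It also shows the trade-off between the learning rate (shrinkage) and the number of estimators.

```python
def fit_regression_stump(X, residuals):
    """Least-squares stump on a 1-D feature: returns (threshold, left_mean, right_mean)."""
    best = None
    for thr in sorted(set(X))[:-1]:          # keep both sides non-empty
        left = [r for x, r in zip(X, residuals) if x <= thr]
        right = [r for x, r in zip(X, residuals) if x > thr]
        lv, rv = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, lv, rv)
    return best[1:]

def gradient_boost(X, y, n_estimators, learning_rate):
    base = sum(y) / len(y)                   # initial prediction: the mean of y
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # Each new stump is trained on what the current ensemble still gets wrong.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        thr, lv, rv = fit_regression_stump(X, residuals)
        stumps.append((thr, lv, rv))
        # Add a shrunken version of the stump's output to the running prediction.
        pred = [pi + learning_rate * (lv if x <= thr else rv)
                for pi, x in zip(pred, X)]
    return base, stumps, pred

def rmse(y, pred):
    return (sum((a - b) ** 2 for a, b in zip(y, pred)) / len(y)) ** 0.5

# Hypothetical toy data: y = x^2 on eight points.
X = list(range(1, 9))
y = [x * x for x in X]

for n, lr in [(10, 0.1), (100, 0.1), (10, 0.5)]:
    _, _, pred = gradient_boost(X, y, n_estimators=n, learning_rate=lr)
    print("n_estimators={:3d} learning_rate={:.1f} train RMSE={:.3f}".format(n, lr, rmse(y, pred)))
```

With only 10 stumps, the small learning rate leaves a larger training error than the larger one; raising the number of stumps to 100 brings the error down, which is why a smaller learning rate is usually paired with more estimators.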


### Define the GB regressor
You'll now revisit the Bike Sharing Demand dataset that was introduced in the previous chapter. Recall that your task is to predict the bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C. For this purpose, you'll be using a gradient boosting regressor.

As a first step, you'll start by instantiating a gradient boosting regressor which you will train in the next exercise.

In [None]:
# Load data
data = pd.read_csv('Bike Sharing Demand.csv')
# Separate the target and the features
y = data['cnt']
X = data.iloc[:,1:]
# split the data
X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size = 0.3, 
                                                    random_state = 2)
# Instantiate a GB regressor
gb = GradientBoostingRegressor(n_estimators = 200, max_depth = 4,
                               random_state = 2)

### Train the GB regressor
You'll now train the gradient boosting regressor gb that you instantiated in the previous exercise and predict test set labels.

The dataset was split into 70% train and 30% test in the cell above (test_size = 0.3). The feature matrices X_train and X_test, as well as the arrays y_train and y_test, are available in your workspace, along with the model instance gb that you defined in the previous exercise.

In [None]:
# Fit gb to the training set
gb.fit(X_train, y_train)
# Predict test set labels
y_pred = gb.predict(X_test)

### Evaluate the GB regressor
Now that the test set predictions are available, you can use them to evaluate the test set Root Mean Squared Error (RMSE) of gb.

y_test and predictions y_pred are available in your workspace.

In [25]:
# Compute MSE
mse_test = MSE(y_test, y_pred)
# Compute RMSE
rmse_test = mse_test**(1/2)
# Print RMSE
print('Test set RMSE of gb: {:.3f}'.format(rmse_test))

Test set RMSE of gb: 49.392
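
As a reminder of what this number measures, the RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same units as the target and weighs large errors more heavily than small ones. Below is a small hand-rolled sketch; the function name `rmse_by_hand` and the toy values are assumptions for illustration only.

```python
import math

def rmse_by_hand(y_true, y_pred):
    """Square root of the mean squared error."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical targets and two sets of predictions with the same total absolute error (8).
y_true = [10, 10, 10, 10]
pred_spread = [12, 12, 12, 12]      # four errors of 2
pred_outlier = [10, 10, 10, 18]     # one error of 8

print(rmse_by_hand(y_true, pred_spread))    # 2.0
print(rmse_by_hand(y_true, pred_outlier))   # 4.0
```

The single large miss doubles the RMSE even though the total absolute error is unchanged.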


### Stochastic Gradient Boosting (SGB)
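Stochastic gradient boosting trains each individual learner on a
random subsample of the training dataset, drawn without replacement,
instead of on the full training set. This reduces the correlation
between the individual learners, and combining less correlated
learners adds diversity to the ensemble, which can improve the overall result.
In scikit-learn, subsample sets the fraction of rows used for each tree and
max_features sets the fraction of features considered at each split.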
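
The subsampling described above can be sketched in a few lines of plain Python. This is only an illustration of the sampling step, not scikit-learn's code: the helper `draw_tree_inputs` is hypothetical, and the row and feature counts are made up. Rows are drawn once per tree and candidate features are drawn for a split, both without replacement.

```python
import random

def draw_tree_inputs(n_rows, n_features, subsample, max_features, rng):
    """Pick row indices for one tree and candidate feature indices for one split."""
    n_sub = max(1, int(subsample * n_rows))
    n_feat = max(1, int(max_features * n_features))
    rows = sorted(rng.sample(range(n_rows), n_sub))          # no row appears twice
    feats = sorted(rng.sample(range(n_features), n_feat))    # no feature appears twice
    return rows, feats

rng = random.Random(2)
for tree in range(3):
    # Assumed sizes: 10 training rows and 4 features.
    rows, feats = draw_tree_inputs(10, 4, subsample=0.9, max_features=0.75, rng=rng)
    print("tree", tree, "rows:", rows, "features:", feats)
```

Each tree sees a slightly different view of the data, which is where the reduced correlation between learners comes from.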

In [None]:
data = pd.read_csv('Bike Sharing Demand.csv')
# Separate the target and the features
y = data['cnt']
X = data.iloc[:,1:]
# split the data
X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                                    test_size = 0.3, 
                                                    random_state = 2)

### Train the SGB regressor
In this exercise, you'll instantiate the stochastic gradient boosting regressor sgbt, train it, and predict the test set labels.

The Bike Sharing Demand dataset is already loaded and split into 70% train and 30% test in the cell above (test_size = 0.3). The feature matrices X_train and X_test and the arrays of labels y_train and y_test are available in your workspace.

In [None]:
# Instantiate an SGB regressor:
# same settings as gb, plus the subsample and max_features parameters
sgbt = GradientBoostingRegressor(n_estimators = 200, max_depth = 4,
                                 random_state = 2, subsample = 0.9,
                                 max_features = 0.75)
# Fit sgbt to the training set
sgbt.fit(X_train, y_train)
# Predict test set labels
y_pred = sgbt.predict(X_test)

### Evaluate the SGB regressor
With the test set predictions in hand, you can now evaluate the test set RMSE of sgbt.

y_pred and y_test are available in your workspace.

In [23]:
# Compute MSE
mse_test = MSE(y_test, y_pred)
# Compute RMSE
rmse_test = mse_test**(1/2)
# Print RMSE
print('Test set RMSE of sgb: {:.3f}'.format(rmse_test))

Test set RMSE of sgb: 49.522
