# Boosting
Boosting refers to an ensemble method in which several models are trained sequentially with each model learning from the errors of its predecessors. 
### Boosting
- Boosting: Ensemble method combining several weak learners to form a strong learner.
- Weak learner: Model doing slightly better than random guessing.
- Example of weak learner: Decision stump (CART whose maximum depth is 1).
- Train an ensemble of predictors sequentially.
- Each predictor tries to correct its predecessor.
- Most popular boosting methods:
    - AdaBoost,
    - Gradient Boosting.
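
To make the idea of a weak learner concrete, here is a minimal sketch of a decision stump in plain Python. The toy data and the `fit_stump` and `predict_stump` helpers are illustrative assumptions; they are not part of the notebook or of scikit-learn.

```python
# Toy 1-D binary classification data (made-up values for illustration).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [0, 0, 1, 0, 1, 1, 0, 1]


def fit_stump(X, y):
    """Return (threshold, left_label, right_label) with the fewest training errors."""
    best = None
    xs = sorted(set(X))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for a, b in zip(xs, xs[1:]):
        t = (a + b) / 2
        for left, right in ((0, 1), (1, 0)):
            errors = sum((left if x <= t else right) != label
                         for x, label in zip(X, y))
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    return best[1:]


def predict_stump(stump, x):
    """Predict with a single split: one label on each side of the threshold."""
    t, left, right = stump
    return left if x <= t else right


stump = fit_stump(X, y)
accuracy = sum(predict_stump(stump, x) == label for x, label in zip(X, y)) / len(y)
print(stump, accuracy)
```

A single split cannot fit this data perfectly, but it still beats the 0.5 accuracy expected from random guessing on balanced labels, which is all boosting asks of each ensemble member.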

### AdaBoost
- Stands for Adaptive Boosting.
- Each predictor pays more attention to the instances wrongly predicted by its predecessor.
- Achieved by changing the weights of training instances.
- Each predictor is assigned a coefficient α.
- α depends on the predictor's training error.
- AdaBoost: Prediction
    - Classification:
        - Weighted majority voting.
        - In sklearn: `AdaBoostClassifier`.
    - Regression:
        - Weighted average.
        - In sklearn: `AdaBoostRegressor`.
- Define the AdaBoost classifier
    - Dataset: the Indian Liver Patient dataset.
    - Task:
        - Predict whether a patient suffers from a liver disease using 10 features, including albumin, age and gender.
        - Use an AdaBoost ensemble to perform the classification.
        - Evaluate with the ROC AUC score instead of accuracy.
    - Steps:
        - Load the dataset and split it into training and test sets.
        - Instantiate `dt` and `ada`.
        - Fit `ada` to the training set.
        - Predict the probabilities of the positive class on the test set.
        - Extract these probabilities by slicing the second column of the `predict_proba()` output.
        - Evaluate `ada`'s ROC AUC score with `roc_auc_score()`, which scores a binary classifier from its predicted probabilities.
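
The reweighting idea described in this section can be sketched for a single boosting round. The sketch assumes the classic discrete AdaBoost formulation with labels in {-1, +1}, where $\alpha = \tfrac{1}{2}\ln\frac{1-\varepsilon}{\varepsilon}$ and $\varepsilon$ is the weighted training error. That this is the exact variant the notes have in mind is an assumption, and scikit-learn's implementation differs in details. All values are made up for illustration.

```python
import math

# True labels and one weak learner's predictions (illustrative values).
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, -1, 1]   # the second instance is misclassified

# Start from uniform instance weights.
n = len(y_true)
weights = [1.0 / n] * n

# Weighted training error: total weight of the misclassified instances.
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# Predictor coefficient: the lower the error, the larger alpha.
alpha = 0.5 * math.log((1 - error) / error)

# Increase the weights of misclassified instances, decrease the others,
# then renormalize so the weights sum to 1.
new_weights = [w * math.exp(-alpha * t * p)
               for w, t, p in zip(weights, y_true, y_pred)]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print('alpha = {:.3f}'.format(alpha))
print('weights:', [round(w, 3) for w in new_weights])
```

After this round the misclassified instance carries half of the total weight, so the next predictor is pushed to get it right.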

In [27]:
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import DecisionTreeClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier as KNN

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import mean_squared_error as MSE

from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingRegressor

# Seed used for reproducible splits
SEED = 1




In [12]:
# Dataset
liver = pd.read_csv('indian_liver_patient/indian_liver_patient_preprocessed.csv', index_col = 0)
X = liver.drop('Liver_disease', axis = 1)
y = liver['Liver_disease']
# Split data into 80% train and 20% test, stratified on the target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y,
                                                    random_state=SEED)

liver.head()

| | Age_std | Total_Bilirubin_std | Direct_Bilirubin_std | Alkaline_Phosphotase_std | Alamine_Aminotransferase_std | Aspartate_Aminotransferase_std | Total_Protiens_std | Albumin_std | Albumin_and_Globulin_Ratio_std | Is_male_std | Liver_disease |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.247403 | -0.42032 | -0.495414 | -0.42887 | -0.355832 | -0.319111 | 0.293722 | 0.203446 | -0.14739 | 0 | 1 |
| 1 | 1.062306 | 1.218936 | 1.423518 | 1.675083 | -0.093573 | -0.035962 | 0.939655 | 0.077462 | -0.648461 | 1 | 1 |
| 2 | 1.062306 | 0.640375 | 0.926017 | 0.816243 | -0.115428 | -0.146459 | 0.478274 | 0.203446 | -0.178707 | 1 | 1 |
| 3 | 0.815511 | -0.372106 | -0.388807 | -0.449416 | -0.36676 | -0.312205 | 0.293722 | 0.329431 | 0.16578 | 1 | 1 |
| 4 | 1.679294 | 0.093956 | 0.179766 | -0.395996 | -0.295731 | -0.177537 | 0.755102 | -0.930414 | -1.713237 | 1 | 1 |


In [13]:
# Import DecisionTreeClassifier
#from sklearn.tree import DecisionTreeClassifier

# Import AdaBoostClassifier
#from sklearn.ensemble import AdaBoostClassifier

# Instantiate dt
dt = DecisionTreeClassifier(max_depth=2, random_state=1)

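# Instantiate ada
# Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`, and the old
# name was removed in 1.4; use `estimator=dt` on recent versions.
ada = AdaBoostClassifier(base_estimator=dt, n_estimators=180, random_state=1)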

# Fit ada to the training set
ada.fit(X_train, y_train)

# Compute the probabilities of obtaining the positive class
y_pred_proba = ada.predict_proba(X_test)[:,1]

# Import roc_auc_score
#from sklearn.metrics import roc_auc_score

# Evaluate test-set roc_auc_score
ada_roc_auc = roc_auc_score(y_test, y_pred_proba)

# Print roc_auc_score
print('ROC AUC score: {:.2f}'.format(ada_roc_auc))

ROC AUC score: 0.71
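
The weighted majority vote mentioned in the prediction notes can be illustrated with a tiny example. This is a conceptual sketch only: the coefficients and votes below are hypothetical, labels are assumed to be in {-1, +1}, and it does not reproduce how `AdaBoostClassifier` computes its output internally.

```python
# Hypothetical coefficients of three fitted weak learners and their votes
# for a single instance (labels in {-1, +1}).
alphas = [0.9, 0.4, 0.3]
votes = [-1, 1, 1]

# Weighted vote: each learner's vote counts in proportion to its alpha.
score = sum(a * v for a, v in zip(alphas, votes))
weighted_prediction = 1 if score > 0 else -1

# Plain majority vote for comparison: every learner counts equally.
plain_prediction = 1 if sum(votes) > 0 else -1

print('weighted:', weighted_prediction, 'plain:', plain_prediction)
```

The two accurate-looking minority learners are outvoted here: one predictor with a large α overrides two predictors with small α, which is the difference between weighted and plain majority voting.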


### Gradient Boosting
#### Gradient Boosted Trees
- Sequential correction of predecessor's errors.
- Does not tweak the weights of training instances.
- Each predictor is trained using its predecessor's residual errors as labels.
- Gradient Boosted Trees: a CART is used as a base learner.
#### Gradient Boosted Trees: Prediction
- Regression:
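    - $y_{pred} = y_1 + \eta r_1 + \dots + \eta r_N$, where $y_1$ is the first tree's prediction, $r_1, \dots, r_N$ are the residuals predicted by the subsequent trees, and $\eta$ is the learning rate (shrinkage).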
    - In sklearn: `GradientBoostingRegressor`.
- Classification:
    - In sklearn: `GradientBoostingClassifier`.
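
The residual-fitting scheme can be sketched from scratch with regression stumps in plain Python. The first stump is fit to the targets, each later stump is fit to the current residuals, and every residual prediction is shrunk by a learning rate `eta` before being added. The toy data, `eta`, `n_rounds` and the helper functions are illustrative assumptions, and scikit-learn's `GradientBoostingRegressor` differs in details such as how the initial prediction is chosen.

```python
def fit_stump(X, targets):
    """Least-squares regression stump on 1-D inputs: (threshold, left_mean, right_mean)."""
    best = None
    xs = sorted(set(X))
    # Candidate thresholds: midpoints between consecutive distinct values,
    # so both sides of every split are non-empty.
    for a, b in zip(xs, xs[1:]):
        t = (a + b) / 2
        left = [r for x, r in zip(X, targets) if x <= t]
        right = [r for x, r in zip(X, targets) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def rmse(y_true, y_hat):
    return (sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / len(y_true)) ** 0.5


# Toy regression data (made-up values).
X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1]
eta = 0.5       # learning rate (shrinkage)
n_rounds = 20   # number of residual-fitting stumps

# First predictor is trained on the targets themselves.
first = fit_stump(X, y)
pred = [predict_stump(first, x) for x in X]
rmse_start = rmse(y, pred)

# Each subsequent predictor is trained on the current residuals.
stumps = []
for _ in range(n_rounds):
    residuals = [t - p for t, p in zip(y, pred)]
    stump = fit_stump(X, residuals)
    stumps.append(stump)
    pred = [p + eta * predict_stump(stump, x) for p, x in zip(pred, X)]
rmse_end = rmse(y, pred)


def predict(x):
    """First prediction plus the shrunken residual predictions of all later stumps."""
    return predict_stump(first, x) + sum(eta * predict_stump(s, x) for s in stumps)


print('training RMSE: {:.3f} -> {:.3f}'.format(rmse_start, rmse_end))
```

Note that the instance weights are never touched: only the labels change from round to round, which is the key difference from AdaBoost.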
#### Example
- Dataset: Bike Sharing Demand.
- Task:
    - Predict the bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C.
    - Use a gradient boosting regressor.
- Steps:
    - Load the dataset.
    - Instantiate a gradient boosting regressor.
    - Fit it to the training set.
    - Evaluate the GB regressor on the test set.

In [48]:
#Dataset
bike = pd.read_csv('bikes.csv')

X = bike[['hr', 'holiday', 'workingday', 'temp', 'hum', 'windspeed', 'instant',
       'mnth', 'yr', 'Clear to partly cloudy', 'Light Precipitation', 'Misty']]
y = bike['cnt']

# Split data into 80% train and 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

#### GB regressor

In [51]:
# Import GradientBoostingRegressor
#from sklearn.ensemble import GradientBoostingRegressor

# Instantiate gb
gb = GradientBoostingRegressor(max_depth=4,
                               n_estimators=200,
                               random_state=2)
# Fit gb to the training set
gb.fit(X_train, y_train)

# Predict test set labels
y_pred = gb.predict(X_test)

# Compute MSE
mse_test = MSE(y_test, y_pred)

# Compute RMSE
rmse_test_gb = mse_test**(1/2)

# Print RMSE
print('Test set RMSE of gb: {:.3f}'.format(rmse_test_gb))

Test set RMSE of gb: 50.726


### Stochastic Gradient Boosting (SGB)

#### Gradient Boosting
- GB involves an exhaustive search procedure.
- Each CART is trained to **find the best split points and features**.
- May lead to CARTs using the same split points and maybe the same features.

#### Stochastic Gradient Boosting
- Each tree is trained on **a random subset of rows** of the training data.
- Rows are sampled without replacement, typically 40%-80% of the training set.
- Features are sampled (without replacement) when choosing split points.
- Result: further ensemble diversity.
- Effect: adds further variance to the ensemble of trees.
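
The two sampling steps can be sketched with the standard library. The fractions below play the role of the `subsample` and `max_features` arguments of `GradientBoostingRegressor`; the row count, the short feature list and the rounding rule are illustrative assumptions, not a description of scikit-learn's internals.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable

n_rows = 10                                          # toy training-set size
feature_names = ['hr', 'temp', 'hum', 'windspeed']   # a few of the bike columns
subsample = 0.8       # fraction of rows given to each tree
max_features = 0.75   # fraction of features considered at each split

# Rows for one tree: drawn WITHOUT replacement, so no row appears twice.
n_sampled_rows = int(subsample * n_rows)
row_idx = rng.sample(range(n_rows), n_sampled_rows)

# Features for one split: also drawn without replacement.
n_split_features = max(1, int(max_features * len(feature_names)))
split_features = rng.sample(feature_names, n_split_features)

print('rows for this tree:', sorted(row_idx))
print('features for this split:', split_features)
```

Because each tree sees a different subset of rows and each split a different subset of features, the trees are less likely to reuse the same split points, which is the source of the extra diversity.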

#### Example: Regression with SGB
- Dataset: Bike Sharing Demand.
- Task: solve this bike count regression problem using stochastic gradient boosting.
- Steps:
    - Reuse the train/test split from the gradient boosting example.
    - Train the SGB regressor.
    - Predict the test set labels.
    - Evaluate on the test set.

In [59]:
# Import GradientBoostingRegressor
#from sklearn.ensemble import GradientBoostingRegressor

# Instantiate sgbr
sgbr = GradientBoostingRegressor(max_depth=4,
                                 subsample=0.9,
                                 max_features=0.75,
                                 n_estimators=200,
                                 random_state=2)
# Fit sgbr to the training set
sgbr.fit(X_train, y_train)

# Predict test set labels
y_pred = sgbr.predict(X_test)

# Import mean_squared_error as MSE
#from sklearn.metrics import mean_squared_error as MSE

# Compute test set MSE
mse_test = MSE(y_test, y_pred)

# Compute test set RMSE
rmse_test_sgb = mse_test**(1/2)

# Print both RMSEs
print('Test set RMSE of sgbr: {:.3f}'.format(rmse_test_sgb))
print('Test set RMSE of gb: {:.3f}'.format(rmse_test_gb))

# Report which model did better instead of hard-coding the conclusion
better = 'stochastic gradient boosting' if rmse_test_sgb < rmse_test_gb else 'gradient boosting'
print('\nLower test set RMSE in this run: {}'.format(better))

Test set RMSE of sgbr: 54.604
Test set RMSE of gb: 50.726

Lower test set RMSE in this run: gradient boosting
