### Module 12: Learning Notebook: The Gradient Boosting Algorithm

As we start to use more sophisticated algorithms, we won't spend too much time trying to explain how they work. You can investigate the details later as you learn more about ML.<P>

How Boosting Works:<P>

Boosting is a sequential ensemble technique that typically uses shallow decision trees as weak learners. It combines many weak learners, each trained to correct the errors of the ones before it, and delivers improved prediction accuracy. At any step t, the new learner is built from the outcomes of the previous step t-1. In weight-based boosting such as AdaBoost, samples predicted correctly are given a lower weight and the misclassified ones are weighted higher. Gradient boosting, the variant used in this notebook, instead fits each new tree to the gradient of the loss (for squared error, the residuals) of the ensemble so far. The same idea applies to both classification and regression.

Visually:

<img src="images/gb.png" alt="Gradient Boosting" style="width: 400px;"/><P>
 
https://en.wikipedia.org/wiki/Gradient_boosting
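
The sequential idea described above can be sketched in a few lines. This is a minimal illustration of gradient boosting for regression with squared-error loss, where each new tree is fit to the residuals of the ensemble so far. It is not scikit-learn's implementation; the helper names `fit_boosted_trees` and `predict_boosted_trees` and the toy sine data are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Toy gradient boosting regressor (hypothetical helper, squared-error loss)."""
    base = float(np.mean(y))                 # start from the mean of y
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_estimators):
        residuals = y - pred                 # negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)               # weak learner fits what is still wrong
        pred = pred + learning_rate * tree.predict(X)   # shrink each contribution
        trees.append(tree)
    return {'base': base, 'learning_rate': learning_rate, 'trees': trees}

def predict_boosted_trees(model, X):
    """Sum the base value and the shrunken prediction of every tree."""
    pred = np.full(X.shape[0], model['base'])
    for tree in model['trees']:
        pred = pred + model['learning_rate'] * tree.predict(X)
    return pred

# Toy data (an assumption for illustration): a noisy sine curve
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(0, 0.1, size=200)

toy_model = fit_boosted_trees(X_toy, y_toy)
mse_baseline = np.mean((y_toy - y_toy.mean()) ** 2)
mse_boosted = np.mean((y_toy - predict_boosted_trees(toy_model, X_toy)) ** 2)
print('Training MSE of mean-only model:', mse_baseline)
print('Training MSE after boosting:', mse_boosted)
```

The training error drops as trees are added because every tree targets only the part of `y` that the earlier trees failed to explain.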

In [1]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay  # replaces plot_confusion_matrix, which was removed in scikit-learn 1.2
import matplotlib.pyplot as plt
import boto3
import pandas as pd
import numpy as np
import pickle
import time

### 1. Load and investigate the data
We prepared this Titanic data. It should be all ready to go for classification.<P>

Remember:
- Gender: 0 = male, 1 = female
- HasCabin: 0 = no assigned cabin on the ship (cheap ticket), 1 = has cabin (more expensive ticket)
- C, Q, S: Dummy variables representing the city (port) where the passenger got on the ship: Cherbourg, Queenstown, or Southampton
- Survived: 0 = died, 1 = survived

In [2]:
# Setup boto3
sess = boto3.session.Session()
s3 = sess.client('s3') 
# Define the bucket & file you want to load
source_bucket = 'machinelearning-read-only'
source_key = 'data/titanic_clean.pkl'
# Get the file from S3 
response = s3.get_object(Bucket = source_bucket, Key = source_key)
#
# Read the 'Body' part of the response into a variable. This is where the DataFrame data exists in the response.
body = response['Body'].read()
#
# Create a new pandas DataFrame using the pickle.loads() function
titanic_df = pickle.loads(body)
titanic_df.head(6)

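|   | Gender | Age | HasCabin | C | Q | S | Survived |
|---|--------|-----|----------|---|---|---|----------|
| 0 | 0 | 22.0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 38.0 | 1 | 1 | 0 | 0 | 1 |
| 2 | 1 | 26.0 | 0 | 0 | 0 | 1 | 1 |
| 3 | 1 | 35.0 | 1 | 0 | 0 | 1 | 1 |
| 4 | 0 | 35.0 | 0 | 0 | 0 | 1 | 0 |
| 5 | 0 | 29.699118 | 0 | 0 | 1 | 0 | 0 |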


In [3]:
# Verify data types and no missing values
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Gender    889 non-null    int64  
 1   Age       889 non-null    float64
 2   HasCabin  889 non-null    int64  
 3   C         889 non-null    uint8  
 4   Q         889 non-null    uint8  
 5   S         889 non-null    uint8  
 6   Survived  889 non-null    int64  
dtypes: float64(1), int64(3), uint8(3)
memory usage: 37.3 KB


### 2. Isolate the X and y variables

In [4]:
y = titanic_df['Survived']
X = titanic_df.drop(['Survived'], axis = 1)

### 3. Split the data into training and test sets

In [5]:
# Split into train/test
# Reserve 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
# Verify the sizes of the split datasets
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

X_train: (711, 6)
y_train: (711,)
X_test: (178, 6)
y_test: (178,)
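
The split above is random on every run, so the shapes stay the same but the rows in each set, and therefore later scores, can change. As a hedged sketch, passing `random_state` makes the split repeatable and `stratify` keeps the class balance similar in both sets. The stand-in data below is hypothetical (random values, not the Titanic file), and the seed 42 is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same number of rows as titanic_df
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({'Gender': rng.integers(0, 2, 889),
                       'Age': rng.uniform(1, 80, 889)})
y_demo = pd.Series((rng.random(889) < 0.38).astype(int), name='Survived')

# Two identical calls: the fixed random_state makes them return the same rows
Xa_train, Xa_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo)

print('Test set shape:', Xa_test.shape)
print('Same test rows on both runs:', Xa_test.index.equals(Xb_test.index))
print('Positive rate overall / train / test:',
      y_demo.mean(), ya_train.mean(), ya_test.mean())
```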


### 4. Create and train a Gradient Boosting Classifier model
Read about the hyperparameters here:<P>
    https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

In [6]:
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=None, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

### 5. Evaluate and show performance metrics

In [7]:
# Predict using the trained model and X_test data
y_pred = gbc.predict(X_test)
# Show the first 4 from predicted/actual
print('Predicted survival:', y_pred[0:4]) # Show a few of the predicted classes
print('Actual survival:', list(y_test[0:4])) # Show a few of the actual classes
# To get the accuracy, call the score() function on the trained model
acc = gbc.score(X_test, y_test) # Number predicted correctly divided by total in data set
print('Our model accuracy:', acc)
print('Confusion Matrix:\n', confusion_matrix(y_test, y_pred))

Predicted survival: [0 1 1 1]
Actual survival: [0, 1, 1, 0]
Our model accuracy: 0.8033707865168539
Confusion Matrix:
 [[92 21]
 [14 51]]
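
Accuracy alone hides which kind of mistake the model makes. As a small worked example, the confusion matrix printed above can be turned into accuracy, precision, and recall by hand. In scikit-learn's `confusion_matrix`, rows are the actual classes and columns are the predicted classes; the variable names here are chosen for this sketch.

```python
import numpy as np

# Confusion matrix from the run above: rows = actual, columns = predicted
cm = np.array([[92, 21],
               [14, 51]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()      # 143 correct out of 178
precision = tp / (tp + fp)           # of predicted survivors, how many survived
recall = tp / (tp + fn)              # of actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'accuracy={accuracy:.4f} precision={precision:.4f} '
      f'recall={recall:.4f} f1={f1:.4f}')
# prints: accuracy=0.8034 precision=0.7083 recall=0.7846 f1=0.7445
```

The accuracy matches the value returned by `gbc.score()` above, and the other numbers show that the model finds about 78% of the actual survivors in this particular split.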


### 6. Perform Hyperparameter Tuning
Here is an example of hyperparameter tuning on the Gradient Boosting Classifier algorithm.<P>
    
3 interesting parameters to tune:<P>
**n_estimators: int, default=100**

    The number of boosting stages to perform. Gradient boosting is fairly robust to over-fitting so a large number usually results in better performance. Values must be in the range [1, inf).
    
**max_depth: int, default=3**

    The maximum depth of the individual regression estimators. The maximum depth limits the number of nodes in the tree. Tune this parameter for best performance; the best value depends on the interaction of the input variables. Values must be in the range [1, inf).

**learning_rate: float, default=0.1**

    Learning rate shrinks the contribution of each tree by learning_rate. There is a trade-off between learning_rate and n_estimators. Values must be in the range (0.0, inf).
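
One way to explore the trade-off between `learning_rate` and `n_estimators` is `staged_predict`, which yields the model's predictions after each boosting stage, so a single fit shows how accuracy changes as trees are added. The sketch below uses synthetic data from `make_classification` (an assumption, not the Titanic data), so the numbers it prints say nothing about this notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical)
X_syn, y_syn = make_classification(n_samples=600, n_features=6, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0)

staged_scores = {}
for lr in (0.01, 0.1, 1.0):
    model = GradientBoostingClassifier(
        n_estimators=200, learning_rate=lr, max_depth=3, random_state=0)
    model.fit(Xs_train, ys_train)
    # staged_predict yields predictions after 1, 2, ..., n_estimators trees
    scores = [accuracy_score(ys_test, pred)
              for pred in model.staged_predict(Xs_test)]
    staged_scores[lr] = scores
    best_n = scores.index(max(scores)) + 1
    print(f'learning_rate={lr}: best test accuracy {max(scores):.3f} '
          f'first reached at {best_n} trees')
```

Smaller learning rates generally need more trees to reach their best score, which is why the two parameters are usually tuned together.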



In [8]:
# Define a function to show the results from the search.
# We'll use this below
def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    for mean, std, param_set in zip(mean_score, std_score, params):
        print(f'{round(mean,3)} + or - {round(std,3)} for the {param_set}')

In [9]:
# Create a default model
gbc = GradientBoostingClassifier()
#
# Define the range of parameters to evaluate
parameters = {
    "n_estimators":[5,50,250,500],
    "max_depth":[1,3,5,7,9],
    "learning_rate":[0.01,0.1,1,10,100]
}
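
Before launching a grid search it helps to know how much work the grid implies. A quick sketch using scikit-learn's `ParameterGrid` counts the combinations in the grid defined above; the fold count of 5 is taken from the `cv=5` argument used in the next cell.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    "n_estimators": [5, 50, 250, 500],
    "max_depth": [1, 3, 5, 7, 9],
    "learning_rate": [0.01, 0.1, 1, 10, 100],
}
n_candidates = len(ParameterGrid(parameters))   # 4 * 5 * 5 combinations
n_folds = 5                                     # k for k-fold cross validation

print('Candidate combinations:', n_candidates)              # 100
print('Model fits during the search:', n_candidates * n_folds)  # 500
```

`GridSearchCV` also refits the best combination once on all the data it was given (its `refit` option defaults to `True`), so the total is 501 fits.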

In [10]:
# Import a very powerful class called GridSearchCV (Grid Search Cross Validation).
# It does the hyperparameter tuning for us
#   https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
from sklearn.model_selection import GridSearchCV
#
# Create the cv object using our model, the parameters and our k-value for k-fold cross validation
cv = GridSearchCV(gbc,parameters,cv=5)
#
# Perform the grid search. This will take a while, like 3 minutes on our default instance type
start = time.time()
print("Executing...")
# Here it goes...
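# Note: fitting on all of (X, y) lets the search see the rows held out in X_test.
# Use X_train, y_train here if the test set must stay unseen for the final evaluation.
cv.fit(X,y.values.ravel()) # Fit with the whole dataset (X,y)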
end = time.time()
print(end - start, 'seconds')

Executing...
177.02284145355225 seconds


In [11]:
# Show the results of the search
print('Best Score:', cv.best_score_)
print('Best parameter values:', cv.best_params_)

Best Score: 0.7952961340697009
Best parameter values: {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 250}


In [13]:
# If you want, show the entire results from the search
#display(cv)
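
As an alternative to printing every line of the search results, `cv_results_` can be loaded into a pandas DataFrame and sorted by rank. The sketch below runs a deliberately tiny search on synthetic data so it finishes in seconds; the grid, the data, and the name `search` are assumptions for illustration, and the same DataFrame step should work on the `cv` object above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a tiny grid (both hypothetical)
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)
small_grid = {"n_estimators": [10, 50], "max_depth": [1, 3]}

search = GridSearchCV(GradientBoostingClassifier(random_state=0), small_grid, cv=3)
search.fit(X_syn, y_syn)

# cv_results_ is a dict of arrays, one entry per parameter combination
results_df = (
    pd.DataFrame(search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results_df.to_string(index=False))
```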

### 7. Where to go from here
- We tuned the parameters and found the best performing combination
- Now, create a new gbc model using those parameters
- Train the model using X_train and y_train
- Evaluate the final performance
- Use the model for predicting new passengers, if desired
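
The steps listed above could look roughly like the following sketch. Because the S3 file may not be available, it builds a hypothetical stand-in DataFrame with the same column names and random values, so the printed accuracy is not a Titanic result. In the notebook you would use the existing `X_train`, `X_test`, `y_train`, and `y_test` instead. The parameter values are the best ones reported by the grid search above.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for titanic_df (same column names, random values)
rng = np.random.default_rng(0)
n = 889
embarked = rng.integers(0, 3, n)
demo_df = pd.DataFrame({
    'Gender': rng.integers(0, 2, n),
    'Age': rng.uniform(1, 80, n).round(1),
    'HasCabin': rng.integers(0, 2, n),
    'C': (embarked == 0).astype(int),
    'Q': (embarked == 1).astype(int),
    'S': (embarked == 2).astype(int),
})
# Made-up labelling rule so the labels carry some signal
demo_df['Survived'] = (
    (demo_df['Gender'] + demo_df['HasCabin'] + rng.random(n)) > 1.2).astype(int)

X_demo = demo_df.drop(['Survived'], axis=1)
y_demo = demo_df['Survived']
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0)

# Create and train a new model with the tuned parameter values
final_gbc = GradientBoostingClassifier(
    learning_rate=0.01, max_depth=3, n_estimators=250, random_state=0)
final_gbc.fit(Xd_train, yd_train)
final_acc = final_gbc.score(Xd_test, yd_test)
print('Test accuracy on stand-in data:', final_acc)

# Predict for one new, hypothetical passenger:
# a 30-year-old female with a cabin who boarded at S
new_passenger = pd.DataFrame(
    [{'Gender': 1, 'Age': 30.0, 'HasCabin': 1, 'C': 0, 'Q': 0, 'S': 1}])
new_class = final_gbc.predict(new_passenger)[0]
new_proba = final_gbc.predict_proba(new_passenger)[0, 1]
print('Predicted class:', new_class)
print('Predicted survival probability:', new_proba)
```

The new passenger must be passed with the same columns, in the same order, as the data the model was trained on.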