### Module 12 Lab: Using the XGBOOST algorithm

In this lab, we are going to use a new algorithm, XGBOOST. I would consider it an "industrial strength" algorithm rather than a teaching algorithm.

It is very similar to the Gradient Boosting algorithm we've already used, but not identical. A few details:
- It is not included in the sklearn package
- It is not installed on our instance in Sagemaker
- You'll have to install it every time you restart the instance
- It is just slightly different than other sklearn models we've used.

In this lab, we will again use our abalone data. We have already prepared it for classification.

The goal is to classify each abalone as 'adult' or 'youth'.

Recall, in the target column: adult = 1, youth = 0

In [None]:
# We will need to install XGBOOST every time we restart our instance
%pip install xgboost

In [None]:
# Now we can import it
import xgboost as xgb

In [None]:
# Import the other stuff
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import confusion_matrix, accuracy_score, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import boto3
import pandas as pd
import numpy as np
import pickle
import time

### 1. Load and investigate the data
We prepared this data in an earlier module. It should be all ready to go for classification.

In [None]:
# Setup boto3
sess = boto3.session.Session()
s3 = sess.client('s3') 
# Define the bucket & file you want to load
source_bucket = 'machinelearning-shared'
source_key = 'data/kcolvin/abalone_clean.pkl'  # You must use your data here
# Get the file from S3 
response = s3.get_object(Bucket = source_bucket, Key = source_key)
#
# Read the 'Body' part of the response into a variable. This is where the DataFrame data exists in the response.
body = response['Body'].read()
#
# Create a new pandas DataFrame using the pickle.loads() function
abalone_df = pickle.loads(body)
abalone_df.head(3)

In [None]:
# Verify data types and no missing values
abalone_df.info()

### 2. Isolate the X and y variables

In [None]:
y = abalone_df['target']
X = abalone_df.drop(['target'], axis = 1)

### 3. Split the data into training and test sets

In [None]:
# Split into train/test
# Reserve 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
# Verify the sizes of the split datasets
print('X_train:', X_train.shape)
print('y_train:', y_train.shape)
print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

### 4. Create and train an XGB model

Where to start:
- Look for examples on the web. You might search for: xgboost classification examples
- Glance at the XGBOOST documentation: https://xgboost.readthedocs.io/en/stable/python/python_api.html

To start, just use the default hyperparameters and see if you can get it to work. At the end of this step, you should have a trained model.

Hint:<BR>
xgbc = xgb.XGBClassifier()<BR>
xgbc.fit(X_train, y_train)

In [None]:
# your code here

### 5. Evaluate and show the model performance

Just like we did with other classification algorithms, display the accuracy and the confusion matrix
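
As a reminder of what the two metrics report, here is a small sketch using hand-made labels. The lists `y_true_demo` and `y_pred_demo` are invented for illustration and are not abalone results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration only: adult = 1, youth = 0.
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 0, 1, 1, 0]

# Fraction of predictions that match the true labels.
acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes, in label order [0, 1].
cm = confusion_matrix(y_true_demo, y_pred_demo)

print(acc)  # 0.75
print(cm)   # [[3 1]
            #  [1 3]]
```

In the lab, `y_true_demo` would be replaced by `y_test` and `y_pred_demo` by the output of your model's `predict(X_test)`. In recent scikit-learn versions, `ConfusionMatrixDisplay` can draw the matrix as a plot.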

In [None]:
# Your code here

### 6. Perform cross validation

Recall from the previous module on cross validation the cross_val_score() function.

Perform cross validation with k = 5 on your initial model and check how consistent the scores are across the folds.


In [None]:
# Recall from previous modules:
# Evaluate using the whole data set: X, y
default_xgb_scores = cross_val_score(xgbc, X, y, cv = 5)
default_xgb_scores
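
One possible way to judge consistency is to summarize the fold scores with their mean, standard deviation, and range. The sketch below uses made-up scores in `example_scores` (an assumption for illustration, not real results); in the notebook you would use `default_xgb_scores` instead.

```python
import numpy as np

# Hypothetical fold accuracies, made up for illustration (not real results).
example_scores = np.array([0.80, 0.82, 0.79, 0.81, 0.83])

# A small standard deviation and range suggest the model behaves
# similarly no matter which fold is held out.
fold_mean = example_scores.mean()
fold_std = example_scores.std()
fold_range = example_scores.max() - example_scores.min()

print(f"mean={fold_mean:.3f} std={fold_std:.3f} range={fold_range:.3f}")
```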

### 7. Perform hyperparameter tuning

There are three interesting parameters to tune. They are similar to the Gradient Boosting parameters:
- n_estimators
- max_depth
- learning_rate

At the end of this task, you should have the best value for each of these parameters

In [None]:
# Create a default model
xgbc = xgb.XGBClassifier()
#
# Define the range of parameters to evaluate
parameters = {
    "n_estimators":[5,50,250,500],
    "max_depth":[1,3,5,7,9],
    "learning_rate":[0.01,0.1,1,10,100]
}

In [None]:
# Your code here

In [None]:
# Define a function to show the results from the search.
# `results` is a fitted GridSearchCV object.
def display(results):
    print(f'Best parameters are: {results.best_params_}')
    print("\n")
    mean_score = results.cv_results_['mean_test_score']
    std_score = results.cv_results_['std_test_score']
    params = results.cv_results_['params']
    # Use a separate loop name so the params list is not overwritten while iterating
    for mean, std, param_set in zip(mean_score, std_score, params):
        print(f'{round(mean,3)} +/- {round(std,3)} for {param_set}')

In [None]:
# If you want, show the entire results from the search.
# This assumes your fitted GridSearchCV object is named cv.
display(cv)

### 8. Using the best parameter values above, train a new model and make predictions

Predict whether the following abalone is 'adult' or 'youth':

a = [1.0, 0.435, 0.395, 0.090, 0.534, 0.1245, 0.131, 0.25]
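
A model's `predict()` expects a 2-D input with one row per observation, so a single abalone has to be wrapped as a one-row table first. Here is a minimal sketch of that step. The column names in `feature_columns` are placeholders (an assumption), and `describe_prediction` is a hypothetical helper; in the notebook you would use `X.columns` so the names match the training data.

```python
import pandas as pd

# The abalone measurements from the lab text (8 feature values).
a = [1.0, 0.435, 0.395, 0.090, 0.534, 0.1245, 0.131, 0.25]

# Placeholder column names (assumption); in the notebook use X.columns instead.
feature_columns = [f"feature_{i}" for i in range(len(a))]

# Wrapping the list in another list gives one row with 8 columns.
sample = pd.DataFrame([a], columns=feature_columns)
print(sample.shape)  # (1, 8)

# Map the numeric class back to the lab's labels: adult = 1, youth = 0.
label_names = {1: "adult", 0: "youth"}

def describe_prediction(predicted_class):
    """Translate a 0/1 prediction into the lab's label (hypothetical helper)."""
    return label_names[int(predicted_class)]
```

With a trained model, the label would then come from something like `describe_prediction(model.predict(sample)[0])`.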


In [None]:
# your code here