# Hands-on: Build ML Predictive Model with Python (Classification)

## Overview

In this hands-on activity, you will develop an ML predictive model (a classification model) to predict the likelihood of reservation cancellations.

You will learn how to:
1. Create classification models using an AutoML package (PyCaret) and select the best model.
2. Create a classification model using LightGBM.
3. Test the model on a holdout dataset (unseen data).

## Setup

In [None]:
!pip install pycaret==3.2.0
!pip install scikit-learn==1.2.2
!pip install joblib==1.3.2

In [None]:
# Import libraries
import pandas as pd
import numpy
import time
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter("ignore")
pd.set_option('display.max_columns', None)

from pycaret.classification import *

## Load Data
- Replace this part with your own code: copy the data-loading cell from the 'Hands-on: Data Preparation' notebook and change the CSV file name to 'hotel_bookings_v1.csv'.

In [None]:
# Replace this part with your own code
import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.

cos_client = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<my_api_key>',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.private.us-south.cloud-object-storage.appdomain.cloud')

bucket = 'mlpredictivemodel-donotdelete-pr-se3ulnjuojrkgg'
object_key = 'hotel_bookings_v1.csv'

body = cos_client.get_object(Bucket=bucket,Key=object_key)['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df = pd.read_csv(body)
df.head(10)

## Process Data

In [None]:
# Drop columns 'reservation_status' and 'reservation_status_date'
df = df.drop(columns=['reservation_status', 'reservation_status_date'])

In [None]:
# Show datatypes
df.dtypes

In [None]:
# Count the columns that remain after dropping
num_cols = df.shape[1]
print(num_cols)

## Save a Copy of the Dataset for ML Model Training Using AutoAI

In [None]:
# The project token is an authorization token used to access project resources such as data sources and connections; the platform API also uses it
from project_lib import Project

project = Project(None, '<my_project_id>', '<my_project_token>')
pc = project.project_context

# Show Project, Bucket and Assets
print('Project Name: {0}'.format(project.get_name()))
print('Project Description: {0}'.format(project.get_description()))
print('Project Bucket Name: {0}'.format(project.get_project_bucket_name()))
print('Project Assets (Connections): {0}'.format(project.get_assets(asset_type='connection')))

# Save the DataFrame as a CSV file in your bucket
project.save_data(data=df.to_csv(index=False), file_name='hotel_bookings_v1_training.csv', overwrite=True)

## Create Holdout Dataset (Unseen Data)

In [None]:
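# Sample 90% of the data as the training dataset and keep the remaining 10% as the holdout dataset
data = df.sample(frac=0.9)

# Drop the sampled rows by their original index labels BEFORE resetting any index;
# resetting first would drop the wrong rows and leak training rows into the holdout set
data_unseen = df.drop(data.index)

data.reset_index(inplace=True, drop=True)
data_unseen.reset_index(inplace=True, drop=True)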

# Print the revised shape
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))
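
A quick way to sanity-check a sample-and-drop split like the one above is to confirm that the two parts share no rows. The sketch below does this on a toy DataFrame; the `booking_id` column, the `split_holdout` helper, and the seed are hypothetical and are not part of the hotel bookings dataset. The important detail is that `drop(data.index)` runs while `data` still carries its original index labels.

```python
import pandas as pd


def split_holdout(df, frac=0.9, seed=0):
    """Return (data, data_unseen) so that the two frames share no rows."""
    data = df.sample(frac=frac, random_state=seed)
    # Drop by the ORIGINAL index labels, before any reset_index call
    data_unseen = df.drop(data.index)
    return data.reset_index(drop=True), data_unseen.reset_index(drop=True)


# Toy data: 100 bookings with a made-up identifier and target
toy = pd.DataFrame({
    "booking_id": range(100),
    "is_canceled": [int(i % 3 == 0) for i in range(100)],
})

data, data_unseen = split_holdout(toy)

# Any identifier present in both frames would indicate leakage
overlap = set(data["booking_id"]) & set(data_unseen["booking_id"])
print(len(data), len(data_unseen), len(overlap))  # 90 10 0
```

If the overlap is not empty, the holdout scores reported later would be optimistic, because the model has already seen some of those rows during training.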

## Train Model

In [None]:
# Setup the experiment
exp_clf = setup(data=data, target='is_canceled')

In [None]:
# Compare the candidate models, show their statistics, and keep the best one
best_model = compare_models()

In [None]:
# Create a LightGBM model and measure the training time
start_time = time.time()

lgb = create_model('lightgbm')

end_time = time.time()
elapsed_time = end_time - start_time

print("Elapsed time: ", elapsed_time/60, 'minutes')

## Tune Model

In [None]:
# Tune the hyperparameters of the LightGBM model
tuned_lgb = tune_model(lgb)

In [None]:
print(tuned_lgb)

## Evaluate Model

In [None]:
# ROC curve and AUC
plot_model(tuned_lgb, plot='auc')

In [None]:
# Precision-recall curve
plot_model(tuned_lgb, plot='pr')

In [None]:
# Feature importance
plot_model(tuned_lgb, plot='feature')

In [None]:
# Confusion matrix
plot_model(tuned_lgb, plot='confusion_matrix')
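
The four cells of a binary confusion matrix are enough to recompute the headline metrics by hand, which can help when reading the plot above. The sketch below uses hypothetical counts (they are not results from this notebook) and the standard definitions of accuracy, precision, and recall.

```python
# Hypothetical counts read off a binary confusion matrix (not real results)
tn, fp = 90, 20   # actual 0: predicted 0 / predicted 1
fn, tp = 10, 80   # actual 1: predicted 0 / predicted 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all bookings classified correctly
precision = tp / (tp + fp)     # of bookings flagged as canceled, share that really were
recall = tp / (tp + fn)        # of bookings that were canceled, share that were flagged

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.850 precision=0.800 recall=0.889
```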

## Predict Model

In [None]:
# Score the tuned model on the hold-out split that setup() set aside
predict_model(tuned_lgb);

In [None]:
# Finalize the model: retrain it on the full modeling dataset, including the hold-out split
final_lgb = finalize_model(tuned_lgb)

# Print final model parameters
print(final_lgb)

In [None]:
predict_model(final_lgb);

### Predict on Holdout Dataset

In [None]:
# Class distribution of the target in the holdout dataset
data_unseen['is_canceled'].value_counts()

In [None]:
unseen_predictions = predict_model(final_lgb, data=data_unseen, raw_score=True)
unseen_predictions.head()

In [None]:
unseen_predictions[['is_canceled', 'prediction_label', 'prediction_score_0', 'prediction_score_1']]

In [None]:
# Show the rows where the predicted label differs from the actual label
dfx = unseen_predictions[unseen_predictions['is_canceled'] != unseen_predictions['prediction_label']]
dfx[['is_canceled', 'prediction_label', 'prediction_score_0', 'prediction_score_1']]

## Explanation of Results
- LightGBM is among the models with the highest accuracy and AUC, and it is also one of the fastest to train.
- The model's accuracy and AUC can be considered high enough to use it for predicting the likelihood of cancellation.
- The final output contains two scores, 'prediction_score_0' and 'prediction_score_1', each a probability between 0 and 1.
- For instance, if 'prediction_score_1' is higher than 'prediction_score_0', the predicted label is 1, which means the reservation is predicted to be canceled.
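
The rule in the last bullet can be written as a tiny function. This is only an illustrative sketch: `label_from_scores` is a hypothetical helper and the score pairs are made up, not taken from the model output.

```python
def label_from_scores(score_0, score_1):
    """Return 1 (predicted canceled) when class 1 has the higher probability."""
    return 1 if score_1 > score_0 else 0


# Made-up (score_0, score_1) pairs; each pair sums to 1
rows = [(0.82, 0.18), (0.35, 0.65)]
labels = [label_from_scores(s0, s1) for s0, s1 in rows]
print(labels)  # [0, 1]
```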

## Example of Use Case
1. Objective:
    - Reduce the reservation cancellation rate by approaching customers who are likely to cancel their reservations.

2. Target group:
    - Customers whose reservations are predicted as likely to be canceled.

3. Actions:
    - Take preventive, personalized action, such as offering discounts or coupons, or proactively asking customers whether they have any special requests.
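
One possible way to turn the predictions into a target group is to keep the bookings whose predicted cancellation probability exceeds a cut-off and rank them by risk. The sketch below uses a toy stand-in for the prediction output; the `booking_id` column, the values, and the 0.7 threshold are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the prediction output; identifiers and scores are made up
preds = pd.DataFrame({
    "booking_id": [101, 102, 103, 104, 105],
    "prediction_label": [1, 0, 1, 1, 0],
    "prediction_score_1": [0.91, 0.12, 0.58, 0.77, 0.40],
})

OUTREACH_THRESHOLD = 0.7  # hypothetical cut-off; adjust to the campaign budget

# Keep bookings at or above the cut-off, highest cancellation risk first
outreach = (
    preds[preds["prediction_score_1"] >= OUTREACH_THRESHOLD]
    .sort_values("prediction_score_1", ascending=False)
    .reset_index(drop=True)
)
print(outreach["booking_id"].tolist())  # [101, 104]
```

A higher threshold yields a smaller, higher-risk group; a lower one reaches more customers at the cost of contacting some who would not have canceled.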

## Summary
1. Utilized the AutoML package (PyCaret) to create classification models and selected the best model.
2. Trained and tuned a LightGBM classification model.
3. Assessed the model's performance on a holdout dataset (unseen data).