# **SML 1**

**Please use "Save a copy" in your Google Drive and work with your own copy.**

The public dataset is from IBM Watson Analytics. It is about customer attributes and behaviour.

The source is [here](https://www.kaggle.com/pankajjsh06/ibm-watson-marketing-customer-value-data) from Kaggle.

### AutoML Tools
The 2 AutoML tools that we will use are:
1. Exploratory Phase - YData Profiling (formerly Pandas Profiling)
2. Model Selection Phase - PyCaret

In [None]:
# Install the two packages
# ydata-profiling (formerly pandas-profiling) builds a nice report about the dataset
# pycaret is our AutoML package
!pip3 install -U ydata-profiling "pycaret[full]"


Connect your Google Drive and load the customer_value dataset.

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

Import python packages that we will be using.

In [None]:
# import the required packages
import pandas as pd
from pathlib import Path
from ydata_profiling import ProfileReport
from pycaret.regression import *

In [None]:
# Load dataset from Google Drive - customer_value.csv
data_path = Path('/content/drive/My Drive/pcml_data/5SML/')
filename = 'customer_value.csv'

# Use the read_csv method of pandas and assign the result to a variable called customer_value


In [None]:
# A quick eyeball on the dataset


### Data Science General Workflow
Most data science projects follow the workflow below:

In [None]:
# Step 1: Source and load your data
df = pd.read_csv(...)

In [None]:
# Step 2: Wrangle your data
df = df.apply(lambda x: ..., axis=1)

In [None]:
# Step 3: Inspect the data (exploration)
df.profile_report()

In [None]:
# Step 4: Perform a first fitting of the models to the data

# Setup the experiment on pycaret
experiment = setup(
    data=df,
    target=...,
    #fix_imbalance=True,
    #data_split_stratify=True
    )

# Fit the models
compare_models()

In [None]:
# Step 5: Choose the best model
best_model = create_model('xgboost')

In [None]:
# Step 6a: Analyze the model's performance
evaluate_model(best_model)

In [None]:
# Step 6b: Explain the model's important features
plot_model(best_model, plot='feature')

In [None]:
# Step 7 branch - Is the accuracy good enough?
# Branch #1: Not good enough
# ## Look at the features to see what is wrong.
# ## Use stacking and blending techniques
# ## Figure out what additional data can be useful
# ## Rinse and repeat from Step 4

# Branch #2: Good enough
# ## look at the features and try to hypothesize why.

In [None]:
# Step 8: Tune the model
tuned_model = tune_model(...)

In [None]:
# Step 9: Validate the model on the test set
predict_model(tuned_model)

In [None]:
# Step 10: Retrain the model using all the data
final_model = finalize_model(...)

In [None]:
# Step 11: Save the model
save_model(final_model, 'final_model_2023_03_30_18_30')

In [None]:
# Step 12: Create a dashboard and load the model into the script
loaded_model = load_model('final_model_2023_03_30_18_30')

### Explore the data using YData Profiling

In this section, we will do some data exploration to get a feel of the dataset.

In [None]:
# build the report using ydata profiling


We can also export the profile report to an HTML file that can be presented at the frontend.

In [None]:
# Use the to_file method of ydata profiling to export to an HTML file locally


Make a note of which features we don't want:
1. Customer - too unique, no generalizable pattern to learn.
2. Customer Lifetime Value - this score is computed by the company; lower total claims and a higher premium result in a higher score, so it is partly a consequence of the target (a causation problem).
3. Response - the response to a marketing campaign may not be a useful, generalizable feature.
4. Effective To Date - the number of days or months since the effective date is a better measure than the date itself.
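
The fourth point can be sketched in plain Python. The snippet below converts an Effective To Date string such as `2/24/11` (the month/day/year format seen in this dataset) into the number of days elapsed up to a reference date. The function name `days_since_effective` and the reference date of 31 December 2011 are assumptions made for illustration only; pick whichever reference date suits your analysis.

```python
from datetime import date, datetime

def days_since_effective(effective_to_date: str, reference: date) -> int:
    """Return the number of days between an M/D/YY date string and a reference date."""
    effective = datetime.strptime(effective_to_date, "%m/%d/%y").date()
    return (reference - effective).days

# Hypothetical reference date, chosen only for this example
reference_date = date(2011, 12, 31)
print(days_since_effective("2/24/11", reference_date))
```

The resulting integer can replace the raw date column, giving the model a quantity it can compare across customers.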

### Model Selection and Evaluation
Now, let's see how well we can predict total claims using this set of historical data.

Note two important things in the setup:
1. The name of the feature that we want to predict (the target).
2. The features that we want or don't want to use.

In [None]:
# setup the model
# https://pycaret.readthedocs.io/en/latest/api/regression.html

regression_model = setup(
    data=...,
    target=...,
    ignore_features=...,
)

In [None]:
# We can check the predictive features of the training set
get_config('X_train')

Unnamed: 0,Income,Monthly Premium Auto,Months Since Last Claim,Months Since Policy Inception,State_Arizona,State_California,State_Nevada,State_Oregon,State_Washington,Coverage_Basic,Coverage_Extended,Coverage_Premium,Education_Bachelor,Education_College,Education_Doctor,Education_High School or Below,Education_Master,EmploymentStatus_Disabled,EmploymentStatus_Employed,EmploymentStatus_Medical Leave,EmploymentStatus_Retired,EmploymentStatus_Unemployed,Gender_M,Location Code_Rural,Location Code_Suburban,Location Code_Urban,Marital Status_Divorced,Marital Status_Married,Marital Status_Single,Number of Open Complaints_0,Number of Open Complaints_1,Number of Open Complaints_2,Number of Open Complaints_3,Number of Open Complaints_4,Number of Open Complaints_5,Number of Policies_1,Number of Policies_2,Number of Policies_3,Number of Policies_4,Number of Policies_5,Number of Policies_6,Number of Policies_7,Number of Policies_8,Number of Policies_9,Policy Type_Corporate Auto,Policy Type_Personal Auto,Policy Type_Special Auto,Policy_Corporate L1,Policy_Corporate L2,Policy_Corporate L3,Policy_Personal L1,Policy_Personal L2,Policy_Personal L3,Policy_Special L1,Policy_Special L2,Policy_Special L3,Renew Offer Type_Offer1,Renew Offer Type_Offer2,Renew Offer Type_Offer3,Renew Offer Type_Offer4,Sales Channel_Agent,Sales Channel_Branch,Sales Channel_Call Center,Sales Channel_Web,Vehicle Class_Four-Door Car,Vehicle Class_Luxury Car,Vehicle Class_Luxury SUV,Vehicle Class_SUV,Vehicle Class_Sports Car,Vehicle Class_Two-Door Car,Vehicle Size_Large,Vehicle Size_Medsize,Vehicle Size_Small
5591,34865.0,69.0,15.0,16.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
7436,0.0,61.0,26.0,79.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6495,73049.0,139.0,3.0,42.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
7647,75007.0,101.0,1.0,19.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
8595,0.0,124.0,1.0,43.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3511,26121.0,119.0,12.0,51.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
8586,0.0,68.0,35.0,98.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2918,0.0,63.0,4.0,52.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
5748,0.0,139.0,5.0,56.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0


In [None]:
# What are the available models in pycaret's regression module?
models()

ID,Name,Reference,Turbo
lr,Linear Regression,sklearn.linear_model._base.LinearRegression,True
lasso,Lasso Regression,sklearn.linear_model._coordinate_descent.Lasso,True
ridge,Ridge Regression,sklearn.linear_model._ridge.Ridge,True
en,Elastic Net,sklearn.linear_model._coordinate_descent.Elast...,True
lar,Least Angle Regression,sklearn.linear_model._least_angle.Lars,True
llar,Lasso Least Angle Regression,sklearn.linear_model._least_angle.LassoLars,True
omp,Orthogonal Matching Pursuit,sklearn.linear_model._omp.OrthogonalMatchingPu...,True
br,Bayesian Ridge,sklearn.linear_model._bayes.BayesianRidge,True
ard,Automatic Relevance Determination,sklearn.linear_model._bayes.ARDRegression,False
par,Passive Aggressive Regressor,sklearn.linear_model._passive_aggressive.Passi...,True


In [None]:
# What evaluation metrics can we use?
get_metrics()

In [None]:
# Run all the models under "Turbo" on the dataset
# PyCaret ranks the models by the metric passed to the sort parameter

compare_models(sort='MAPE')

ID,Model,MAE,MSE,RMSE,R2,RMSLE,MAPE,TT (Sec)
dt,Decision Tree Regressor,91.9106,25214.2129,158.5085,0.6984,0.5877,0.5948,0.093
rf,Random Forest Regressor,74.0853,13405.216,115.5868,0.8395,0.4504,0.7182,4.466
et,Extra Trees Regressor,76.0584,14870.6636,121.797,0.8215,0.4581,0.7685,4.724
lightgbm,Light Gradient Boosting Machine,76.9162,13716.8998,116.9228,0.8358,0.4641,0.7996,0.195
gbr,Gradient Boosting Regressor,78.545,14025.1292,118.2283,0.8321,0.4703,0.8293,1.271
lasso,Lasso Regression,94.6219,19298.5307,138.6632,0.7699,0.6636,0.9029,0.036
omp,Orthogonal Matching Pursuit,95.4438,19518.9904,139.4664,0.7672,0.6844,0.9168,0.03
ridge,Ridge Regression,95.0409,19346.9358,138.8464,0.7692,0.6595,0.936,0.032
br,Bayesian Ridge,94.8212,19328.6512,138.7806,0.7694,0.6605,0.9395,0.059
lr,Linear Regression,95.6314,19464.4729,139.2569,0.7679,0.6545,0.9475,0.037


DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=467, splitter='best')
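
The choice of metric changes which model comes out on top. As a small illustration, the sketch below copies the R2 and MAPE values of four rows from the output table above and ranks them in plain Python, without calling PyCaret. The variable names are chosen for this example only.

```python
# A few rows copied from the compare_models output above: (ID, R2, MAPE)
results = [
    ("dt", 0.6984, 0.5948),
    ("rf", 0.8395, 0.7182),
    ("lightgbm", 0.8358, 0.7996),
    ("lr", 0.7679, 0.9475),
]

# MAPE is an error, so lower is better: sort ascending
by_mape = sorted(results, key=lambda row: row[2])

# R2 is a goodness-of-fit score, so higher is better: sort descending
by_r2 = sorted(results, key=lambda row: row[1], reverse=True)

print("Best by MAPE:", by_mape[0][0])
print("Best by R2:  ", by_r2[0][0])
```

The decision tree wins on MAPE while the random forest wins on R2, so it is worth deciding which metric matters for the business problem before picking a model.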

To train a model and inspect its 10-fold cross-validation, we can use the create_model function in pycaret and specify which model we want.

The result is a trained ML model object that we can eventually save as a model file.
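
To make "10-fold cross-validation" concrete, here is a plain-Python sketch of how row indices can be divided into folds, with each fold taking a turn as the validation set. This is not PyCaret's implementation: the helper `k_fold_indices` is a hypothetical name, and real libraries typically shuffle the rows first.

```python
def k_fold_indices(n_rows: int, k: int = 10):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_rows=25, k=10))
for train, validation in folds[:3]:
    print(len(train), "training rows, validation rows:", validation)
```

Every row is used for validation exactly once, and the per-fold scores are averaged to give the cross-validated metric.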

In [None]:
# create a model based on linear regression


In [None]:
# create a model based on decision tree regressor


In [None]:
# create a model based on an ensemble regressor


In [None]:
# Check the accuracy on the test set


In [None]:
# The difference between giving a dataframe to predict_model vs not giving


In [None]:
# Check the predicted values for each observation in the test set


In [None]:
# Check MAPE calculation
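
As a reference for this check, the usual definition of MAPE is the mean of |actual - predicted| / |actual|. The sketch below implements that definition in plain Python and returns a fraction rather than a percentage. The function name and the sample numbers are made up for illustration, and skipping rows with a zero actual value is an assumption of this sketch; PyCaret's own handling of zeros may differ.

```python
def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.10 means 10%)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    # Assumption: skip rows where the actual value is zero to avoid dividing by zero
    ratios = [abs(a - p) / abs(a) for a, p in zip(actual, predicted) if a != 0]
    return sum(ratios) / len(ratios)

# Made-up values for illustration
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(round(mape(actual, predicted), 4))
```

Applying the same formula to the actual and predicted columns of the test set gives a number to compare against the MAPE that PyCaret reports.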


In [None]:
# Create custom metrics
# ## MAPD

### Feature Importances

Investigate the important features of the best model and of the most common model in statistics, linear regression.

Which one makes more sense?

In [None]:
# force the plot to show
%matplotlib inline

In [None]:
# plot the model to take a look at the feature importance
# ## Use the linear regression model


In [None]:
# ## Try the decision tree model


In [None]:
# ## More analytics (evaluate)

In [None]:
# ## Tune the model


In [None]:
# Recall that we split the data into training and test sets
# After we are satisfied with the test accuracy, we are ready to 
# use the model from this point forward

# Since ML accuracy generally would improve with more data, we should
# deploy a model that is trained with the full dataset, and not the one
# that is only fitted on training data (only 70% of the full dataset)

# Retrain the model with the full dataset


In [None]:
import datetime

# Save the trained model as a file
# We can install pycaret in other computers, load this model file, and
# get the trained model capability

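# Always version your file; the easiest way is by datetime. Use an f-string
# with a formatted timestamp (no spaces or colons) to automate the naming process
timestamp = datetime.datetime.now().strftime('%Y_%m_%d_%H_%M')
model_filename = f'total_claims_model_62_1_{timestamp}'
save_model(final_model, model_filename)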

### Deploying the model

Deploying the model requires some understanding of software engineering.

Typically, the data scientist or analyst passes the model file and a schema of the input and output to the software engineer.

What is an input schema? It's simply the set of data that the model ingests as input.
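
One way to picture an input schema is as a mapping from feature name to expected type, which the application can check before calling the model. The sketch below lists only three of the dataset's columns, and both `INPUT_SCHEMA` and `validate_input` are hypothetical names used for illustration; they are not part of PyCaret.

```python
# Hypothetical, partial schema: feature name -> expected Python type(s)
INPUT_SCHEMA = {
    "State": str,
    "Income": (int, float),
    "Monthly Premium Auto": (int, float),
}

def validate_input(record: dict, schema: dict) -> list:
    """Return a list of problems found; an empty list means the record conforms."""
    problems = []
    for name, expected_type in schema.items():
        if name not in record:
            problems.append(f"missing feature: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for feature: {name}")
    return problems

print(validate_input({"State": "Washington", "Income": 60000}, INPUT_SCHEMA))
```

A check like this lets the application reject incomplete requests early instead of passing them to the model.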

In [None]:
# Extracting an observation from the original dataset
# The loaded model file can give a prediction if the following features
# are given.
customer_value.iloc[0]

The above is a pandas representation of an input. Generally, software applications use a data format known as JSON.

Note that JSON looks very similar to a Python dictionary.
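
The resemblance can be checked with Python's built-in `json` module. This small sketch converts a dictionary to a JSON string and back. The field names are taken from the dataset's columns, while the `None` value is made up here only to show one place where the two notations differ.

```python
import json

# Made-up record for illustration
record = {"State": "Washington", "Income": 56274, "Response": None}

# dict -> JSON text (a plain string that any language can parse)
as_json = json.dumps(record)
print(as_json)

# JSON text -> dict
restored = json.loads(as_json)
print(restored == record)
```

Note the `null` in the printed string: JSON is a text format with its own literals (`null`, `true`, `false`), whereas the dictionary is a live Python object.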

In [None]:
# This is how the same data looks in JSON
customer_value.iloc[0].to_json()

'{"Customer":"BU79786","State":"Washington","Customer Lifetime Value":2763.519279,"Response":"No","Coverage":"Basic","Education":"Bachelor","Effective To Date":"2\\/24\\/11","EmploymentStatus":"Employed","Gender":"F","Income":56274,"Location Code":"Suburban","Marital Status":"Married","Monthly Premium Auto":69,"Months Since Last Claim":32,"Months Since Policy Inception":5,"Number of Open Complaints":0,"Number of Policies":1,"Policy Type":"Corporate Auto","Policy":"Corporate L3","Renew Offer Type":"Offer1","Sales Channel":"Agent","Total Claim Amount":384.811147,"Vehicle Class":"Two-Door Car","Vehicle Size":"Medsize"}'

To use the model, the AI application that the software engineer builds must pass data in the above format to the model, and the model will return the predicted result.

In [None]:
simulated_input = {
    "State": "Washington",
    "Coverage": "Basic",
    "Education": "Bachelor",
    "EmploymentStatus": "Employed",
    "Gender": "F",
    "Income": 60000,
    "Location Code": "Suburban",
    "Marital Status": "Married",
    "Monthly Premium Auto": 75,
    "Months Since Last Claim": 12,
    "Months Since Policy Inception": 7,
    "Number of Open Complaints": 1,
    "Number of Policies": 3,
    "Policy Type": "Corporate Auto",
    "Policy": "Corporate L3",
    "Renew Offer Type": "Offer1",
    "Sales Channel": "Agent",
    "Vehicle Class": "Two-Door Car",
    "Vehicle Size": "Medsize"}

To use the model's prediction function, we need to wrap the input as a pandas dataframe.

We can easily convert dictionaries to pandas dataframes through pd.DataFrame([{...}, {...}]). This takes care of the feature names as well, so we don't need to worry about the order of the features.

In [None]:
input_data = pd.DataFrame([simulated_input])

In [None]:
# Load the model


In [None]:
# Use the loaded model to predict the input
