# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
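
The percentile challenge above can be sketched with the standard library alone. This is a minimal illustration, not the assignment's required solution: the function name `percentile_of` and the list of training probabilities are hypothetical, and the percentile is taken here to mean the share of training predictions less than or equal to the new prediction.

```python
from bisect import bisect_right


def percentile_of(prob, training_probs_sorted):
    """Return the percent of training predictions that are <= prob.

    training_probs_sorted must be sorted in ascending order.
    """
    rank = bisect_right(training_probs_sorted, prob)
    return 100.0 * rank / len(training_probs_sorted)


# Hypothetical churn probabilities predicted on the training data
training_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.33, 0.41, 0.55, 0.62, 0.78, 0.91])

# A new prediction of 0.78 lands at the 90th percentile of this toy distribution
pct = percentile_of(0.78, training_probs)
print(pct)
```

In practice the training probabilities would come from the model's predictions on the training set, sorted once and reused for every new prediction.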

In [19]:
from pycaret.classification import setup, compare_models, predict_model, create_model, save_model, load_model

In [16]:
import pandas as pd
import numpy as np

In [25]:
df = pd.read_csv(r'C:\Users\jwkon\Desktop\School\MSDS600 - Introduction to Data Science\Week 5\Data\churn_data_clean.csv', index_col='customerID')
df = df.drop('Unnamed: 0', axis=1)

In [26]:
df.head()

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_ratio
5375,1,0,0,2,29.85,29.85,0,2.0
3962,34,1,1,3,56.95,1889.5,0,67.178227
2564,2,1,0,3,53.85,108.15,1,4.008357
5535,45,0,1,0,42.3,1840.75,0,88.516548
6511,2,1,0,2,70.7,151.65,1,4.144979


In [27]:
#View data types
df.dtypes

tenure              int64
PhoneService        int64
Contract            int64
PaymentMethod       int64
MonthlyCharges    float64
TotalCharges      float64
Churn               int64
charge_ratio      float64
dtype: object

In [28]:
#Initialize training environment and create transformation pipeline
automl = setup(df, target='Churn')

,Description,Value
0,session_id,2297
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(7043, 8)"
5,Missing Values,False
6,Numeric Features,4
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


#### 1. Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
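###### After running compare_models without a preferred metric (it sorts by accuracy by default), the best model was GBC. However, Naive Bayes has a much higher recall (0.8362 vs. 0.4989 for GBC), so I sorted by recall to favor the model that catches the most churners.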
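
The trade-off between accuracy, precision, and recall can be made concrete with confusion-matrix counts. The sketch below uses hypothetical counts for two imagined models (they are not taken from the churn data), and the helper name `classification_metrics` is an assumption for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical counts for 1,000 customers, 250 of whom actually churn.
# A "cautious" model flags few customers; an "eager" model flags many.
cautious = classification_metrics(tp=125, fp=70, fn=125, tn=680)
eager = classification_metrics(tp=210, fp=250, fn=40, tn=500)

for name, metrics in (("cautious", cautious), ("eager", eager)):
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

The eager model is less accurate and less precise, but it misses far fewer churners. Recall is the metric to favor when a missed churner costs more than an unnecessary retention offer.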

In [29]:
#Compare performance of available models
best_model = compare_models(sort='Recall')

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
nb,Naive Bayes,0.6905,0.8048,0.8362,0.4578,0.5915,0.3753,0.4196,0.005
svm,SVM - Linear Kernel,0.7051,0.0,0.5938,0.547,0.5138,0.3289,0.364,0.01
qda,Quadratic Discriminant Analysis,0.5651,0.5715,0.5857,0.3502,0.3909,0.1225,0.1458,0.007
lda,Linear Discriminant Analysis,0.7884,0.8213,0.5186,0.6289,0.5674,0.4292,0.4334,0.006
lightgbm,Light Gradient Boosting Machine,0.7793,0.8247,0.5072,0.606,0.5518,0.407,0.4101,0.024
lr,Logistic Regression,0.7905,0.8297,0.5049,0.6403,0.5629,0.4279,0.4342,1.112
gbc,Gradient Boosting Classifier,0.7915,0.8346,0.4989,0.6456,0.5613,0.4277,0.4347,0.073
dt,Decision Tree Classifier,0.7327,0.6626,0.4981,0.5009,0.4989,0.3168,0.3171,0.006
ada,Ada Boost Classifier,0.7878,0.8326,0.4905,0.6337,0.5521,0.4163,0.4226,0.037
et,Extra Trees Classifier,0.7594,0.7774,0.4905,0.5579,0.5217,0.3619,0.3635,0.1


###### View shape of the second-to-last row in the dataframe

In [30]:
#Shape of the second-to-last row of df (iloc[-2:-1] stops before the final row)
print(df.iloc[-2:-1].shape)
#Display that row
df.iloc[-2:-1]

(1, 8)


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_ratio
5934,4,1,0,3,74.4,306.6,1,8.120968
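
A slice such as `iloc[-2:-1]` is start-inclusive and stop-exclusive, so it selects the second-to-last row, while `iloc[-1:]` selects the last one. Plain Python lists follow the same slicing rule, which the short sketch below uses as an illustration (the row names are placeholders, not values from the churn data).

```python
# Placeholder rows standing in for the rows of a dataframe
rows = ["row_a", "row_b", "row_c", "row_d"]

# Start at the second-to-last item and stop before the last one
second_to_last = rows[-2:-1]

# Start at the last item and run to the end
last = rows[-1:]

print(second_to_last, last)
```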


In [31]:
#Tested other model results
#createModel = create_model('ridge')
#pred_model = predict_model(createModel)
#predict_model(createModel, df.iloc[-2:-1])

In [32]:
#Preview prediction for the second-to-last row
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_ratio,Label,Score
5934,4,1,0,3,74.4,306.6,1,8.120968,1,0.9501


#### 2. Save the model to disk

In [33]:
save_model(best_model, 'nb')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                 ('binn', 'passthrough'), ('rem_outliers', 'passthrough'),
                 ('cluster_all', 'passthrough'),
                 ('dummy', Dummify(target='Churn')),
                 ('fix_perfect', Remove_100(target='Churn')),
                 ('clean_names', Cl

In [34]:
import pickle
#Save model to pickle file
with open('nb.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [35]:
#Reload model from saved pickle file
with open('nb.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [36]:
#Copy the second-to-last row into new_data; this is used to check the reloaded model
new_data = df.iloc[-2:-1].copy()

In [38]:
#Load the pipeline written by save_model (this reads nb.pkl, not the nb.pk pickle above)
loaded_nb = load_model('nb')

Transformation Pipeline and Model Successfully Loaded


In [39]:
#View the reloaded model's prediction for new_data
predict_model(loaded_nb, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_ratio,Label,Score
5934,4,1,0,3,74.4,306.6,1,8.120968,1,0.9501


In [10]:
#View code from PredChurn script
from IPython.display import Code
Code('Week_5_PredChurn.py')

In [40]:
#Test model performance on new dataset
%run Week_5_PredChurn.py

Transformation Pipeline and Model Successfully Loaded
predictions: 
customerID
9305-CKSKC       Churn
1452-KNGVK    No Churn
6723-OKKJM       Churn
7832-POPKP       Churn
6348-TACGU    No Churn
Name: Churn_Prediction, dtype: object
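
The predictions printed above can be checked against the true values given in the assignment, [1, 0, 0, 1, 0]. This is a minimal sketch that assumes the true values are listed in the same row order as new_churn_data.csv and that "Churn" corresponds to 1.

```python
# Labels printed by the script, in output order
predicted_labels = ["Churn", "No Churn", "Churn", "Churn", "No Churn"]
# True values from the assignment text (assumed to be in the same row order)
true_values = [1, 0, 0, 1, 0]

# Map the text labels to 1 (churn) and 0 (no churn)
predicted = [1 if label == "Churn" else 0 for label in predicted_labels]
pairs = list(zip(predicted, true_values))

tp = sum(1 for p, t in pairs if p == 1 and t == 1)  # churners caught
fp = sum(1 for p, t in pairs if p == 1 and t == 0)  # false alarms
fn = sum(1 for p, t in pairs if p == 0 and t == 1)  # churners missed
correct = sum(1 for p, t in pairs if p == t)

accuracy = correct / len(pairs)
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"correct: {correct}/{len(pairs)}, recall: {recall:.2f}, precision: {precision:.2f}")
```

With only five rows this is a sanity check rather than a performance estimate, but the pattern matches the metric chosen earlier: both churners are caught, at the cost of one false alarm.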


# Summary

This assignment started with many issues installing PyCaret, specifically recurring dependency conflicts. I resolved them by creating a new environment and doing a fresh install of the required packages. By accuracy, the best model from PyCaret's compare_models function was the Gradient Boosting Classifier (0.7915). However, I wanted the model with the highest recall, which was Naïve Bayes (0.8362, versus 0.4989 for GBC), at the cost of lower accuracy (0.6905) and precision (0.4578). As a quick check on a single row (the second-to-last row of the data), the model predicted churn with a score of 0.95, which matches the true label; this is a spot check, not a measure of overall performance.
Creating the script was simple and a great learning experience. I copied a transformation step from Week 3 and added it to the load data function. I tested the script in JupyterLab, and it ran successfully on new_churn_data.csv. Compared with the true values [1, 0, 0, 1, 0], four of the five predictions are correct, assuming the rows are in the same order.
