# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
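
The module function described in the list above could be sketched roughly as follows. This is a minimal sketch, not a reference solution: it assumes the model was saved with PyCaret's `save_model` under the name `'gbc'`, and that `predict_model` returns `Label` and `Score` columns where `Score` is the probability of the predicted label (so rows labelled 0 need `1 - Score`). The function names `churn_probability` and `predict_churn` are hypothetical.

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Convert Label/Score columns into the probability of churn per row.

    Assumption: Score is the probability of the predicted Label, so a row
    predicted as 0 (no churn) has churn probability 1 - Score.
    """
    is_churn = predictions["Label"].astype(int) == 1
    # Keep Score where churn was predicted, otherwise flip it.
    prob = predictions["Score"].where(is_churn, 1 - predictions["Score"])
    return prob.rename("churn_probability")


def predict_churn(df: pd.DataFrame, model_name: str = "gbc") -> pd.Series:
    """Load the saved PyCaret pipeline and return P(churn) for each row."""
    # Imported lazily so churn_probability can be used without PyCaret.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    return churn_probability(predict_model(model, data=df))
```

A script could then read `new_churn_data.csv` with `pd.read_csv(..., index_col='customerID')`, pass the dataframe to `predict_churn`, and print the result.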

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
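
For the percentile challenge above, one possible approach is to count the share of training-set probabilities that fall at or below each new probability. The sketch below uses made-up training probabilities purely for illustration, and the helper name `percentile_in_training` is hypothetical; in practice `train_probs` would be the churn probabilities predicted for the training data.

```python
import numpy as np


def percentile_in_training(new_probs, train_probs):
    """Percentage of training probabilities that are <= each new probability."""
    train_sorted = np.sort(np.asarray(train_probs, dtype=float))
    # side="right" counts training values equal to the new value as "at or below".
    ranks = np.searchsorted(train_sorted, np.asarray(new_probs, dtype=float), side="right")
    return 100.0 * ranks / train_sorted.size


# Made-up training probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]

# With these numbers, 0.78 sits at the 80th percentile and 0.02 at the 0th.
print(percentile_in_training([0.78, 0.02], train_probs))
```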

In [1]:
import pandas as pd
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
%matplotlib inline
import phik
import seaborn as sns
import numpy as np

In [2]:
from pycaret.classification import setup, tune_model, compare_models, predict_model, save_model, load_model

In [3]:
df = pd.read_csv('prepped_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,2,29.85,29.85,0
5575-GNVDE,34,1,1,1,56.95,1889.50,0
3668-QPYBK,2,1,0,1,53.85,108.15,1
7795-CFOCW,45,0,1,3,42.30,1840.75,0
9237-HQITU,2,1,0,2,70.70,151.65,1
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0
2234-XADUH,72,1,1,0,103.20,7362.90,0
4801-JZAZL,11,0,0,2,29.60,346.45,0
8361-LTMKD,4,1,0,1,74.40,306.60,1


In [4]:
automl = setup(df, target='Churn', numeric_features=['PhoneService','PaymentMethod', 'Contract'], session_id=42)

,Description,Value
0,session_id,42
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 7)"
5,Missing Values,False
6,Numeric Features,6
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False


In [5]:
automl[6]

10

In [6]:
best_model = compare_models(sort='AUC')

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7944,0.833,0.4985,0.6473,0.5625,0.4312,0.4379,0.179
catboost,CatBoost Classifier,0.7952,0.8322,0.5092,0.6457,0.5691,0.4373,0.4428,1.578
ada,Ada Boost Classifier,0.794,0.8295,0.497,0.646,0.5612,0.4298,0.4364,0.095
lightgbm,Light Gradient Boosting Machine,0.7956,0.8241,0.5245,0.6406,0.5759,0.4433,0.4476,0.095
lr,Logistic Regression,0.795,0.8239,0.5069,0.647,0.5679,0.4362,0.4421,0.559
qda,Quadratic Discriminant Analysis,0.7489,0.8174,0.724,0.52,0.605,0.4282,0.4411,0.01
lda,Linear Discriminant Analysis,0.7865,0.8162,0.4993,0.6239,0.554,0.416,0.4209,0.009
xgboost,Extreme Gradient Boosting,0.78,0.8122,0.5016,0.6048,0.5478,0.4042,0.4076,0.403
rf,Random Forest Classifier,0.7818,0.8031,0.5084,0.6065,0.5526,0.41,0.4131,0.236
nb,Naive Bayes,0.7544,0.7851,0.6537,0.5309,0.5856,0.4138,0.4186,0.009


In [7]:
best_model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=42, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [8]:
tuned_best_model = tune_model(best_model)

,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,0.8032,0.8406,0.5115,0.67,0.5801,0.4546,0.4617
1,0.7972,0.8157,0.458,0.6742,0.5455,0.421,0.434
2,0.7764,0.829,0.458,0.6061,0.5217,0.3795,0.3859
3,0.7967,0.8198,0.458,0.6742,0.5455,0.4206,0.4337
4,0.8008,0.8327,0.5115,0.6634,0.5776,0.4501,0.4566
5,0.813,0.8324,0.4885,0.7191,0.5818,0.467,0.4815
6,0.7805,0.8269,0.3969,0.642,0.4906,0.3604,0.3774
7,0.813,0.8569,0.5191,0.701,0.5965,0.4783,0.4874
8,0.8008,0.8424,0.4846,0.6702,0.5625,0.4378,0.4475
9,0.8049,0.8618,0.4846,0.6848,0.5676,0.4463,0.4574


In [9]:
df.iloc[-2:-1]

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
8361-LTMKD,4,1,0,1,74.4,306.6,1


In [10]:
predict_model(tuned_best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,1,1,0.5475


In [11]:
save_model(tuned_best_model, 'gbc')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=['PhoneService',
                                                           'PaymentMethod',
                                                           'Contract'],
                                       target='Churn', time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None...
                                             loss='deviance', max_depth=1,
                                             max_features=1.0,
                                             max_leaf_nodes=None,
                  

In [12]:
import pickle

# Also pickle the untuned estimator on its own, without PyCaret's preprocessing pipeline.
with open('gbc_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [13]:
with open('gbc_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [14]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1])

In [15]:
loaded_gbc = load_model('gbc')

Transformation Pipeline and Model Successfully Loaded


In [16]:
predict_model(loaded_gbc, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,1,0.5475


In [17]:
from IPython.display import Code
# Pass the path as filename= so the file's contents are shown, not the literal string.
Code(filename='predict_churn.py')

In [18]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
            of Churn  Probability
customerID                       
9305-CKSKC     Churn       0.5985
1452-KNGVK  No Churn       0.5580
6723-OKKJM  No Churn       0.9491
7832-POPKP  No Churn       0.7240
6348-TACGU  No Churn       0.7272

# Summary

I began by loading the prepped data from week 2. I had to change the way I prepped the data because PyCaret otherwise ended up with too many features (n_features). I'm still working out how I can use sklearn's parameters and `predict_proba` to improve my ML algorithm. I didn't get a chance to test the other autoML libraries like H2O and TPOT. I also didn't understand why the contracts in new_data were encoded numerically so that only one-year contracts produced a 1. I would have liked to see the differences in the predictions with month-to-month and two-year contracts quantified individually.

Overall, I chose best_model by sorting the compared models by AUC rather than accuracy. I think the area under the ROC curve is a better guide than accuracy because it measures how well the model ranks churners above non-churners at every threshold, instead of at a single cutoff. I tuned the hyperparameters using PyCaret and the accuracy improved to nearly 80%. This accuracy is much better than the random forest classifier from last week, though setting the hyperparameters is less intuitive.

I look forward to practicing with these libraries more in the future. I will experiment with a GUI and Python packaging, as this will help me run the reports I am automating at work.
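
To make the accuracy versus AUC distinction concrete, the five predictions printed by `predict_churn.py` can be scored against the true values given in the assignment, [1, 0, 0, 1, 0]. This is only a sketch: it assumes the printed probability is the probability of the predicted label, so "No Churn" rows are converted with `1 - score`, and five rows are far too few to say anything about the model in general.

```python
# (predicted label, printed probability) for the five new rows.
preds = [("Churn", 0.5985), ("No Churn", 0.5580), ("No Churn", 0.9491),
         ("No Churn", 0.7240), ("No Churn", 0.7272)]
y_true = [1, 0, 0, 1, 0]  # true values listed in the assignment

# Assumption: the printed value is the probability of the predicted label.
p_churn = [s if label == "Churn" else 1 - s for label, s in preds]
y_pred = [1 if label == "Churn" else 0 for label, _ in preds]

# Accuracy: share of rows where the predicted label matches the truth.
accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)

# AUC: share of (churner, non-churner) pairs where the churner gets the
# higher churn probability; ties count as half.
pos = [p for p, t in zip(p_churn, y_true) if t == 1]
neg = [p for p, t in zip(p_churn, y_true) if t == 0]
auc = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg) / (len(pos) * len(neg))

print(f"accuracy={accuracy:.2f}, auc={auc:.2f}")  # accuracy=0.80, auc=0.83
```

Accuracy depends on where the cutoff sits (one churner is missed at the default threshold), while AUC depends only on how the rows are ranked, here 5 of 6 churner and non-churner pairs in the right order.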