# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
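
For the percentile challenge, one possible approach is to keep the sorted churn probabilities predicted on the training set and look up where a new probability falls among them. The sketch below assumes that approach; `percentile_of` is a hypothetical helper name, and `train_probs` holds made-up illustrative values, not results from this notebook.

```python
from bisect import bisect_right


def percentile_of(sorted_train_probs, new_prob):
    """Return the percentage of training probabilities that are <= new_prob."""
    if not sorted_train_probs:
        raise ValueError("training probabilities must not be empty")
    # bisect_right counts how many sorted values are <= new_prob.
    rank = bisect_right(sorted_train_probs, new_prob)
    return 100.0 * rank / len(sorted_train_probs)


# Hypothetical training-set churn probabilities (illustrative only).
train_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.31, 0.44, 0.52, 0.63, 0.78, 0.91])

for p in (0.08, 0.50, 0.78):
    print(f"P(churn)={p:.2f} -> percentile {percentile_of(train_probs, p):.0f}")
```

With these made-up values, a probability of 0.78 lands at the 90th percentile. In practice the training probabilities would come from running the saved model on the training data.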

# Importing the Data

In [52]:
import pandas as pd

df = pd.read_csv('/Users/dandelion/Library/CloudStorage/OneDrive-RegisUniversity/MSHI/MSDS 600 Intro to Data Science/MSDS 600 NBC/W2- Cleaning and Preparing Data/prepped_churn_data.csv', index_col='CustomerID')
df

CustomerID,Tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,TC_T_ratio
7590-VHVEG,1,0,0,1,29.85,29.85,0,29.850000
5575-GNVDE,34,1,1,0,56.95,1889.50,0,55.573529
3668-QPYBK,2,1,0,0,53.85,108.15,1,54.075000
7795-CFOCW,45,0,1,3,42.30,1840.75,0,40.905556
9237-HQITU,2,1,0,1,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,0,84.80,1990.50,0,82.937500
2234-XADUH,72,1,1,4,103.20,7362.90,0,102.262500
4801-JZAZL,11,0,0,1,29.60,346.45,0,31.495455
8361-LTMKD,4,1,0,0,74.40,306.60,1,76.650000


In [53]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

# Finding the best model

In [54]:
automl = setup(df, target='Churn')

,Description,Value
0,Session id,1324
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7032, 8)"
4,Transformed data shape,"(7032, 8)"
5,Transformed train set shape,"(4922, 8)"
6,Transformed test set shape,"(2110, 8)"
7,Numeric features,7
8,Preprocess,True
9,Imputation type,simple


In [55]:
# Compare the candidate models (ranked by accuracy, PyCaret's default)
best_model = compare_models()
# The Gradient Boosting Classifier has the best accuracy

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7958,0.8408,0.5054,0.6485,0.5676,0.4368,0.4428,0.054
lr,Logistic Regression,0.7948,0.8376,0.5252,0.638,0.5756,0.4421,0.446,0.226
ridge,Ridge Classifier,0.7922,0.8217,0.4548,0.6577,0.537,0.4089,0.4208,0.004
ada,Ada Boost Classifier,0.7883,0.835,0.5053,0.6261,0.5592,0.422,0.4262,0.017
lda,Linear Discriminant Analysis,0.7857,0.8217,0.49,0.6227,0.5478,0.4102,0.4155,0.004
lightgbm,Light Gradient Boosting Machine,0.7842,0.8271,0.5214,0.6104,0.5619,0.4201,0.4226,0.353
rf,Random Forest Classifier,0.7747,0.8018,0.4709,0.5963,0.5261,0.381,0.3856,0.044
svm,SVM - Linear Kernel,0.7653,0.7208,0.3311,0.6148,0.4244,0.2949,0.3195,0.006
et,Extra Trees Classifier,0.7605,0.7824,0.4724,0.5572,0.5111,0.354,0.3562,0.032
knn,K Neighbors Classifier,0.7582,0.7492,0.4449,0.5573,0.4941,0.338,0.3421,0.07
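
The assignment asks for a deliberate choice of metric, and the table above shows why that choice matters: the top model changes depending on the column used for ranking. The sketch below re-ranks a subset of the rows (scores copied from the table) in plain Python; `best_by` is a hypothetical helper, not part of PyCaret. In PyCaret itself, the ranking metric can be set through the `sort` argument of `compare_models`, for example `compare_models(sort='Recall')`.

```python
# Cross-validated scores copied from the compare_models() table
# (a subset of rows and metrics, for brevity).
leaderboard = {
    "gbc": {"Accuracy": 0.7958, "AUC": 0.8408, "Recall": 0.5054, "Prec.": 0.6485, "F1": 0.5676},
    "lr": {"Accuracy": 0.7948, "AUC": 0.8376, "Recall": 0.5252, "Prec.": 0.6380, "F1": 0.5756},
    "ridge": {"Accuracy": 0.7922, "AUC": 0.8217, "Recall": 0.4548, "Prec.": 0.6577, "F1": 0.5370},
    "ada": {"Accuracy": 0.7883, "AUC": 0.8350, "Recall": 0.5053, "Prec.": 0.6261, "F1": 0.5592},
    "lightgbm": {"Accuracy": 0.7842, "AUC": 0.8271, "Recall": 0.5214, "Prec.": 0.6104, "F1": 0.5619},
}


def best_by(metric):
    """Return the model ID with the highest score for the given metric."""
    return max(leaderboard, key=lambda name: leaderboard[name][metric])


for metric in ("Accuracy", "AUC", "Recall", "Prec.", "F1"):
    winner = best_by(metric)
    print(f"{metric:>8}: {winner} ({leaderboard[winner][metric]:.4f})")
```

GBC leads on accuracy and AUC, while logistic regression leads on recall and F1 and the ridge classifier leads on precision. If missing a churner is more costly than a false alarm, recall may be the more relevant column.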


In [56]:
# The GBC model has the highest accuracy on this dataset
best_model

In [57]:
# Sanity-check the model on a single row taken from the data
predict_model(best_model, df.iloc[-2:-1])

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,1.0,0,1.0,1.0,1.0,,0.0


CustomerID,Tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,TC_T_ratio,Churn,prediction_label,prediction_score
8361-LTMKD,4,1,0,0,74.400002,306.600006,76.650002,1,1,0.5578


# Saving and Loading Model

In [58]:
save_model(best_model, 'GBC')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'TC_T_ratio'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('categori...
                                             criterion='f

In [59]:
loaded_gbc = load_model('GBC')

Transformation Pipeline and Model Successfully Loaded


In [60]:
# Confirm the reloaded pipeline reproduces the earlier prediction
new_data = df.iloc[-2:-1]
predict_model(loaded_gbc, new_data)

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,1.0,0,1.0,1.0,1.0,,0.0


CustomerID,Tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,TC_T_ratio,Churn,prediction_label,prediction_score
8361-LTMKD,4,1,0,0,74.400002,306.600006,76.650002,1,1,0.5578


# Testing Python Script

In [78]:
from IPython.display import Code

Code('predict_churn2.py')

In [77]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


   CustomerID  Tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
0  9305-CKSKC      22             1         0              2       97.400002   
1  1452-KNGVK       8             0         1              1       77.300003   
2  6723-OKKJM      28             1         0              0       28.250000   
3  7832-POPKP      62             1         0              2      101.699997   
4  6348-TACGU      10             0         0              1       51.150002   

   TotalCharges  TC_T_ratio  prediction_label  prediction_score  
0    811.700012   36.895454                 1            0.5740  
1   1701.949951  212.743744                 0            0.8566  
2    250.899994    8.960714                 0            0.9188  
3   3106.560059   50.105808                 0            0.7959  
4   3440.969971  344.096985                 0            0.6398  
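
The assignment asks for the probability of churn for each row, whereas the output above reports `prediction_score`. The sketch below assumes PyCaret's default behavior, in which `prediction_score` is the probability of the predicted label, so a row labeled 0 with score 0.8566 has a churn probability of 1 - 0.8566. It also assumes the true values from the assignment text are listed in the same row order as the output. `churn_probability` is a hypothetical helper name.

```python
def churn_probability(label, score):
    """Convert a (prediction_label, prediction_score) pair into P(churn).

    Assumes prediction_score is the probability of the predicted label.
    """
    return score if label == 1 else 1.0 - score


# (CustomerID, prediction_label, prediction_score) copied from the script output.
predictions = [
    ("9305-CKSKC", 1, 0.5740),
    ("1452-KNGVK", 0, 0.8566),
    ("6723-OKKJM", 0, 0.9188),
    ("7832-POPKP", 0, 0.7959),
    ("6348-TACGU", 0, 0.6398),
]
true_values = [1, 0, 0, 1, 0]  # from the assignment text; row order is assumed

churn_probs = [round(churn_probability(label, score), 4) for _, label, score in predictions]
n_correct = sum(label == truth for (_, label, _), truth in zip(predictions, true_values))

for (customer, _, _), prob in zip(predictions, churn_probs):
    print(f"{customer}: P(churn) = {prob:.4f}")
print(f"{n_correct} of {len(true_values)} predicted labels match the true values")
```

Under these assumptions, four of the five labels match, and the one miss is an actual churner (the fourth row) with an estimated churn probability of about 0.20.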


# Summary

The goal was to automate selecting, saving, and using a machine learning model to predict customer churn. After loading the prepped churn data into a pandas dataframe, the PyCaret library was used to compare models; the Gradient Boosting Classifier (GBC) ranked best by accuracy (0.7958) and also had the highest AUC (0.8408). The trained model was saved to disk with the `save_model()` function so that it can be reused without retraining, then loaded back with `load_model()` to confirm that it saved correctly. Next, a Python script, `predict_churn2.py`, was created to make predictions for the new churn data provided. The script printed its predictions: one of the five customers (9305-CKSKC) is predicted to churn, with a prediction score of 0.5740. Compared with the true values given in the assignment ([1, 0, 0, 1, 0]), and assuming the rows are in the same order, four of the five predictions match; the fourth customer, an actual churner, was predicted not to churn.