# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare their performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
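
The percentile challenge above can be sketched with the standard library alone. This is a minimal illustration, not the assignment's required solution: the helper name `percentile_of_score` and the list `train_probs` are hypothetical, and the training probabilities below are made up. In practice they would be the churn probabilities predicted for the training set.

```python
from bisect import bisect_right


def percentile_of_score(train_probs, p):
    """Return the percentage of training probabilities that are <= p."""
    ordered = sorted(train_probs)
    if not ordered:
        raise ValueError("train_probs must not be empty")
    # bisect_right counts how many sorted values are <= p.
    return 100.0 * bisect_right(ordered, p) / len(ordered)


# Made-up training probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.70, 0.90]

# Nine of the ten values are <= 0.78, so it sits at the 90th percentile.
print(percentile_of_score(train_probs, 0.78))  # 90.0
```

Sorting once and reusing the sorted list would be cheaper when many new predictions are scored against the same training distribution.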

In [1]:
import pandas as pd
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model, tune_model, create_model
import pickle
from IPython.display import Code

In [17]:
churn_df = pd.read_csv('churn_data_numeric.csv', index_col = 'customerID')
churn_df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,TotalCharge_MonthlyCharge_ratio,TotalCharge_tenure_ratio,Automatic_payment
7590-VHVEG,1,0,0,0,29.85,29.85,0,1.000000,29.850000,0
5575-GNVDE,34,1,1,1,56.95,1889.50,0,33.178227,55.573529,0
3668-QPYBK,2,1,0,1,53.85,108.15,1,2.008357,54.075000,0
7795-CFOCW,45,0,1,2,42.30,1840.75,0,43.516548,40.905556,1
9237-HQITU,2,1,0,0,70.70,151.65,1,2.144979,75.825000,0
...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0,23.472877,82.937500,0
2234-XADUH,72,1,1,3,103.20,7362.90,0,71.345930,102.262500,1
4801-JZAZL,11,0,0,0,29.60,346.45,0,11.704392,31.495455,0
8361-LTMKD,4,1,0,1,74.40,306.60,1,4.120968,76.650000,0


In [18]:
automl = setup(churn_df, target='Churn')

,Description,Value
0,Session id,5591
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 10)"
4,Transformed data shape,"(7043, 10)"
5,Transformed train set shape,"(4930, 10)"
6,Transformed test set shape,"(2113, 10)"
7,Numeric features,9
8,Preprocess,True
9,Imputation type,simple


In [56]:
best_model = compare_models(sort='AUC')

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7941,0.8376,0.4877,0.6485,0.556,0.4257,0.4334,0.266
lr,Logistic Regression,0.7939,0.8357,0.5122,0.6393,0.5678,0.4349,0.4401,0.139
ada,Ada Boost Classifier,0.7931,0.8317,0.5114,0.6372,0.5672,0.4334,0.4381,0.107
qda,Quadratic Discriminant Analysis,0.7209,0.8242,0.7844,0.4844,0.5987,0.4027,0.4305,0.026
ridge,Ridge Classifier,0.7921,0.8234,0.4457,0.6611,0.5308,0.4042,0.4179,0.027
lda,Linear Discriminant Analysis,0.7901,0.8234,0.497,0.6328,0.5557,0.4212,0.427,0.026
lightgbm,Light Gradient Boosting Machine,0.7844,0.8225,0.5022,0.6148,0.5522,0.4123,0.4163,0.258
nb,Naive Bayes,0.7166,0.8091,0.7622,0.4787,0.5879,0.3888,0.4134,0.024
rf,Random Forest Classifier,0.768,0.7981,0.4618,0.5791,0.5133,0.3637,0.3679,0.236
et,Extra Trees Classifier,0.7604,0.7758,0.4679,0.5586,0.5089,0.3522,0.3548,0.185
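
The leaderboard above is sorted by AUC, and the top model depends on which metric is chosen. As a rough check, the sketch below copies five metric columns from the table into a plain dictionary and looks up the best model for each one. The dictionary `leaderboard` and the helper `best_by` are hypothetical names used only for this illustration; they are not part of pycaret.

```python
# Metric values copied from the compare_models output above.
leaderboard = {
    "gbc": {"Accuracy": 0.7941, "AUC": 0.8376, "Recall": 0.4877, "Prec.": 0.6485, "F1": 0.556},
    "lr": {"Accuracy": 0.7939, "AUC": 0.8357, "Recall": 0.5122, "Prec.": 0.6393, "F1": 0.5678},
    "ada": {"Accuracy": 0.7931, "AUC": 0.8317, "Recall": 0.5114, "Prec.": 0.6372, "F1": 0.5672},
    "qda": {"Accuracy": 0.7209, "AUC": 0.8242, "Recall": 0.7844, "Prec.": 0.4844, "F1": 0.5987},
    "ridge": {"Accuracy": 0.7921, "AUC": 0.8234, "Recall": 0.4457, "Prec.": 0.6611, "F1": 0.5308},
    "lda": {"Accuracy": 0.7901, "AUC": 0.8234, "Recall": 0.497, "Prec.": 0.6328, "F1": 0.5557},
    "lightgbm": {"Accuracy": 0.7844, "AUC": 0.8225, "Recall": 0.5022, "Prec.": 0.6148, "F1": 0.5522},
    "nb": {"Accuracy": 0.7166, "AUC": 0.8091, "Recall": 0.7622, "Prec.": 0.4787, "F1": 0.5879},
    "rf": {"Accuracy": 0.768, "AUC": 0.7981, "Recall": 0.4618, "Prec.": 0.5791, "F1": 0.5133},
    "et": {"Accuracy": 0.7604, "AUC": 0.7758, "Recall": 0.4679, "Prec.": 0.5586, "F1": 0.5089},
}


def best_by(metric):
    """Return the model ID with the highest value for the given metric."""
    return max(leaderboard, key=lambda model: leaderboard[model][metric])


for metric in ("Accuracy", "AUC", "Recall", "Prec.", "F1"):
    print(metric, "->", best_by(metric))
```

With these numbers, gbc leads on accuracy and AUC, qda leads on recall and F1, and ridge leads on precision, so a recall-focused goal (catching as many churners as possible) would point to a different model than AUC does.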


In [57]:
best_model

In [58]:
save_model(best_model, 'GBC')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'TotalCharge_MonthlyCharge_ratio',
                                              'TotalCharge_tenure_ratio',
                                              'Automatic_payment'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_e...
                                             criterion='friedman_mse', init=None,
                           

In [59]:
with open('GBC_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [60]:
with open('GBC_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [62]:
new_data = churn_df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
prediction = loaded_model.predict(new_data)

KeyError: "['Churn'] not found in axis"

In [63]:
loaded_gbc = load_model('GBC')

Transformation Pipeline and Model Successfully Loaded


In [64]:
predict_model(loaded_gbc, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,TotalCharge_MonthlyCharge_ratio,TotalCharge_tenure_ratio,Automatic_payment,prediction_label,prediction_score
7832-POPKP,62,1,0,0,101.699997,3106.560059,30.546312,50.105808,0,0,0.6479


In [41]:
#Code('predict_churn.py')

In [65]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded


predictions:
           Churn_prediction  prediction_score
customerID                                   
9305-CKSKC            Churn            0.5465
1452-KNGVK         No Churn            0.8660
6723-OKKJM         No Churn            0.9365
7832-POPKP         No Churn            0.6479
6348-TACGU         No Churn            0.9450
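
The assignment asks for the probability of churn, while the `prediction_score` column above appears to be the probability of whichever label was predicted (for example, 0.8660 for "No Churn"). Assuming that reading is correct, a minimal sketch of the conversion is shown below. The helper `churn_probability` is a hypothetical name, and the rows are copied from the output above.

```python
def churn_probability(label, score):
    """Convert (predicted label, score of that label) into P(churn).

    Assumes `score` is the probability of the predicted label, so a
    "No Churn" prediction with score s implies P(churn) = 1 - s.
    """
    return score if label == "Churn" else 1.0 - score


# (customerID, Churn_prediction, prediction_score) copied from the output above.
predictions = [
    ("9305-CKSKC", "Churn", 0.5465),
    ("1452-KNGVK", "No Churn", 0.8660),
    ("6723-OKKJM", "No Churn", 0.9365),
    ("7832-POPKP", "No Churn", 0.6479),
    ("6348-TACGU", "No Churn", 0.9450),
]

churn_probs = {
    customer_id: round(churn_probability(label, score), 4)
    for customer_id, label, score in predictions
}

for customer_id, prob in churn_probs.items():
    print(customer_id, prob)
```

Under this assumption, 7832-POPKP has a churn probability of about 0.35, the highest among the customers predicted as "No Churn".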


# Summary

In this assignment, I began by loading the preprocessed data from week 2 into the churn data frame. I then ran pycaret's `setup` function with `Churn` as the target to prepare the AutoML experiment used to train our models. Next, the `compare_models` function was used to compare candidate models and find the best one for our dataset and target. The gradient boosting classifier was determined to be the best model because it had the highest accuracy and AUC while staying close to the other models on the remaining metrics, except recall. The model was then saved to disk so that it can be used from a Python script and with other datasets. Finally, the predict_churn Python script was executed, which gave the predictions [1, 0, 0, 0, 0]. Compared with the true values [1, 0, 0, 1, 0], this matches four of the five new rows and misses one churner, but with only five rows we cannot tell how the model would perform at a larger scale.
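
The comparison between the script's predictions and the true values given in the assignment can be worked out explicitly. The sketch below uses only the two lists from this notebook; the variable and function names are illustrative.

```python
y_true = [1, 0, 0, 1, 0]  # true values from the assignment text
y_pred = [1, 0, 0, 0, 0]  # predictions from predict_churn.py


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Count the confusion-matrix cells for the positive (churn) class.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = safe_div(correct, len(y_true))
precision = safe_div(tp, tp + fp)
recall = safe_div(tp, tp + fn)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.80 precision=1.00 recall=0.50
```

On these five rows the model is right four times out of five, but it catches only one of the two churners, which is in line with the relatively low recall of the gradient boosting classifier in the comparison table. Five rows are far too few to draw firm conclusions.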