# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
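
For the percentile challenge above, one possible approach is to sort the churn probabilities predicted on the training set and count how many fall at or below the new prediction. The sketch below is only an illustration: `percentile_rank` is a hypothetical helper, and `train_probs` is made-up data standing in for the real training-set predictions.

```python
from bisect import bisect_right


def percentile_rank(train_probs, new_prob):
    """Return the percentage of training predictions that are <= new_prob."""
    ordered = sorted(train_probs)
    return 100.0 * bisect_right(ordered, new_prob) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.78, 0.90]

rank = percentile_rank(train_probs, 0.78)
print(f"A churn probability of 0.78 is at the {rank:.0f}th percentile")
```

With real data, `train_probs` would hold the churn probabilities the saved model produces for the training rows, sorted once and reused for every new prediction.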

In [1]:
import pandas as pd
df = pd.read_csv('prepped_churn_data.csv', index_col = 'customerID')
df

customerID,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,0,1,0,0,0,29.85,29.85,0
5575-GNVDE,1,34,1,1,1,56.95,1889.50,0
3668-QPYBK,2,2,1,0,1,53.85,108.15,1
7795-CFOCW,3,45,0,1,2,42.30,1840.75,0
9237-HQITU,4,2,1,0,0,70.70,151.65,1
...,...,...,...,...,...,...,...,...
6840-RESVB,7038,24,1,1,1,84.80,1990.50,0
2234-XADUH,7039,72,1,1,3,103.20,7362.90,0
4801-JZAZL,7040,11,0,0,0,29.60,346.45,0
8361-LTMKD,7041,4,1,0,1,74.40,306.60,1


In [2]:
!conda install -c conda-forge pycaret -y

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



In [3]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [4]:
automl = setup(df, target = 'Churn')

,Description,Value
0,session_id,8260
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 8)"
5,Missing Values,False
6,Numeric Features,4
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [5]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.797,0.8332,0.492,0.6595,0.5623,0.4339,0.4425,0.095
lr,Logistic Regression,0.7897,0.8272,0.4897,0.6354,0.5525,0.4181,0.4245,0.306
ada,Ada Boost Classifier,0.789,0.8313,0.4828,0.6361,0.5475,0.4137,0.4211,0.034
catboost,CatBoost Classifier,0.789,0.8295,0.4973,0.6319,0.5554,0.4199,0.4258,0.341
lda,Linear Discriminant Analysis,0.7876,0.819,0.5065,0.6255,0.5589,0.4211,0.4258,0.005
ridge,Ridge Classifier,0.784,0.0,0.4293,0.6395,0.513,0.3812,0.3941,0.005
lightgbm,Light Gradient Boosting Machine,0.7819,0.8197,0.5057,0.6093,0.5518,0.4095,0.4131,0.738
xgboost,Extreme Gradient Boosting,0.7726,0.8098,0.4828,0.5899,0.5301,0.3822,0.3861,0.413
rf,Random Forest Classifier,0.7653,0.7915,0.4828,0.5684,0.5212,0.3674,0.3701,0.094
knn,K Neighbors Classifier,0.7548,0.7389,0.4217,0.5513,0.4774,0.3209,0.3261,0.009
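
Which model counts as "best" depends on the metric chosen. pycaret's `compare_models` accepts a `sort` argument for this; the plain-Python sketch below shows the same idea on the ten rows printed above, using values copied from that table. The `best_by` helper is a hypothetical name used only for illustration, and it considers only the rows shown here.

```python
# (model id, Accuracy, AUC, Recall, Prec., F1), copied from the table above.
rows = [
    ("gbc", 0.797, 0.8332, 0.492, 0.6595, 0.5623),
    ("lr", 0.7897, 0.8272, 0.4897, 0.6354, 0.5525),
    ("ada", 0.789, 0.8313, 0.4828, 0.6361, 0.5475),
    ("catboost", 0.789, 0.8295, 0.4973, 0.6319, 0.5554),
    ("lda", 0.7876, 0.819, 0.5065, 0.6255, 0.5589),
    ("ridge", 0.784, 0.0, 0.4293, 0.6395, 0.513),
    ("lightgbm", 0.7819, 0.8197, 0.5057, 0.6093, 0.5518),
    ("xgboost", 0.7726, 0.8098, 0.4828, 0.5899, 0.5301),
    ("rf", 0.7653, 0.7915, 0.4828, 0.5684, 0.5212),
    ("knn", 0.7548, 0.7389, 0.4217, 0.5513, 0.4774),
]

# Position of each metric inside a row tuple.
COLUMNS = {"Accuracy": 1, "AUC": 2, "Recall": 3, "Prec.": 4, "F1": 5}


def best_by(metric):
    """Return the id of the listed model with the highest value for metric."""
    idx = COLUMNS[metric]
    return max(rows, key=lambda row: row[idx])[0]


for metric in COLUMNS:
    print(f"{metric}: {best_by(metric)}")
```

Among the rows shown, the ranking changes only for recall, where `lda` edges out the gradient boosting classifier; that matters for churn if missing a churner costs more than a false alarm.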


In [6]:
best_model

GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=3,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=8260, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

In [12]:
save_model(best_model, 'GBC')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=['Unnamed: 0'],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 num...
                                             learning_rate=0.1, loss='deviance',
                                             max_depth=3, max_features=None,
                                             max_leaf_nodes=None,
                                             min_

In [13]:
from IPython.display import Code

In [14]:
Code('churn_predict.py')

In [15]:
%run churn_predict.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC    No churn
1452-KNGVK    No churn
6723-OKKJM    No churn
7832-POPKP    No churn
6348-TACGU    No churn
Name: Churn_prediction, dtype: object


In [23]:
predict_model(best_model, data=df)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,Label,Score
9305-CKSKC,22,1,0,2,97.4,811.7,36.895455,0,0.5463
1452-KNGVK,8,0,1,1,77.3,1701.95,212.74375,0,0.8743
6723-OKKJM,28,1,0,0,28.25,250.9,8.960714,0,0.843
7832-POPKP,62,1,0,2,101.7,3106.56,50.105806,0,0.7441
6348-TACGU,10,0,0,1,51.15,3440.97,344.097,0,0.7559


# Summary

The gradient boosting classifier had the best performance on all metrics except recall and F1, both of which were highest for the Naive Bayes model. However, the five predictions returned by churn_predict.py included two false negatives: every row was predicted as "No churn", while the true values are [1, 0, 0, 1, 0]. Interestingly, a prior run of this same code returned the AdaBoost classifier as the best model and got all five predictions correct in the %run output. All of the models are under 80% accuracy, so some error is to be expected. The two false negatives were also the two rows with the lowest prediction scores (0.5463 and 0.7441, respectively).
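
The comparison against the true values can be checked with a few lines of plain Python. This is a minimal sketch using the `Label` and `Score` values from the prediction table above and the true values from the assignment text. It assumes that `Score` is the probability of the predicted `Label`, so the churn probability for a row labelled 0 is `1 - Score`. The helper names `churn_probability` and `confusion_counts` are hypothetical.

```python
# True labels from the assignment text; Label and Score from the
# predict_model output for the five new rows.
y_true = [1, 0, 0, 1, 0]
labels = [0, 0, 0, 0, 0]
scores = [0.5463, 0.8743, 0.843, 0.7441, 0.7559]


def churn_probability(label, score):
    """Convert (Label, Score) to P(churn), assuming Score is the
    probability of the predicted label."""
    return score if label == 1 else 1.0 - score


def confusion_counts(actual, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 means churn."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


churn_probs = [
    round(churn_probability(label, score), 4)
    for label, score in zip(labels, scores)
]
print("churn probabilities:", churn_probs)
print("tp, fp, fn, tn:", confusion_counts(y_true, labels))
```

Under that assumption, the two customers who actually churned (rows 1 and 4) have the two highest churn probabilities of the five, but both fall below the default 0.5 cutoff, which is why they show up as false negatives.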