# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose the metric you think is best for selecting the model. By default pycaret ranks models by accuracy, but you could use AUC, precision, recall, etc. The week 3 FTE has some information on these metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
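
The prediction function described in the list above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the name `churn_probability` and the `StubModel` class are hypothetical, and the loaded model is assumed to expose a scikit-learn style `predict_proba` method whose second column is the probability of class 1 (churn).

```python
import pandas as pd


def churn_probability(df, model):
    """Return the probability of churn (class 1) for each row of df.

    Assumes `model` follows the scikit-learn convention: predict_proba
    returns one row per sample, with column 1 holding P(class == 1).
    """
    # Drop the target if it is present so only features reach the model.
    features = df.drop(columns=['Churn'], errors='ignore')
    proba = model.predict_proba(features)
    # Keep the customer index so each probability stays tied to its row.
    return pd.Series([row[1] for row in proba], index=df.index,
                     name='churn_probability')


class StubModel:
    """Hypothetical stand-in for a trained classifier, used only for the demo."""

    def predict_proba(self, X):
        # Toy rule: shorter tenure gives a higher churn probability.
        p = [1.0 / (1.0 + t) for t in X['tenure']]
        return [[1.0 - pi, pi] for pi in p]


demo = pd.DataFrame({'tenure': [1, 3], 'MonthlyCharges': [29.85, 56.95]},
                    index=pd.Index(['A', 'B'], name='customerID'))
probs = churn_probability(demo, StubModel())
print(probs)
```

In a real module, the model loaded from disk would take the place of `StubModel`, and the input would come from new_churn_data.csv.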

*Optional* challenges:
- return the probability of churn for each new prediction, along with the percentile at which that prediction falls in the distribution of predicted probabilities from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare their performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
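
For the percentile challenge above, one possible approach is to count how many training-set probabilities are less than or equal to the new prediction. The sketch below uses only the standard library. The helper name `percentile_of` and the list `train_probs` are hypothetical; the values are made up for illustration and are not taken from the churn data.

```python
from bisect import bisect_right


def percentile_of(value, reference):
    """Percent of `reference` values that are less than or equal to `value`."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.75, 0.90]
pct = percentile_of(0.78, train_probs)
print(f"0.78 is at the {pct:.0f}th percentile")
```

In practice `train_probs` would hold the model's predicted churn probabilities for the training rows. It would be computed once and stored alongside the saved model.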

In [1]:
import pandas as pd
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
%matplotlib inline
import phik
import seaborn as sns
import numpy as np

In [2]:
df = pd.read_csv('churn_data.csv', index_col='customerID')

In [3]:
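# Print the missing-value counts; a bare expression is only displayed when it is the last line of a cell.
print(df.isna().sum())
df.dropna(inplace=True)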

In [4]:
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,No,Month-to-month,Electronic check,29.85,29.85,No
5575-GNVDE,34,Yes,One year,Mailed check,56.95,1889.50,No
3668-QPYBK,2,Yes,Month-to-month,Mailed check,53.85,108.15,Yes
7795-CFOCW,45,No,One year,Bank transfer (automatic),42.30,1840.75,No
9237-HQITU,2,Yes,Month-to-month,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...
6840-RESVB,24,Yes,One year,Mailed check,84.80,1990.50,No
2234-XADUH,72,Yes,One year,Credit card (automatic),103.20,7362.90,No
4801-JZAZL,11,No,Month-to-month,Electronic check,29.60,346.45,No
8361-LTMKD,4,Yes,Month-to-month,Mailed check,74.40,306.60,Yes


In [5]:
df['Churn'] = df['Churn'].replace({'No': 0, 'Yes': 1})

In [6]:
df['PhoneService'] = df['PhoneService'].replace({'No': 0, 'Yes': 1})

In [7]:
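# Give each contract length its own code; mapping 'Two year' to 0 would merge it with 'Month-to-month'.
df['Contract'] = df['Contract'].replace({'Month-to-month': 0, 'One year': 1, 'Two year': 2})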

In [8]:
df['PaymentMethod'] = df['PaymentMethod'].replace({'Electronic check': 2, 'Mailed check': 1, 'Credit card (automatic)': 0, 'Bank transfer (automatic)': 3})

In [9]:
df['charge_per_tenure'] = df['TotalCharges'] / df['tenure']
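
The ratio above is undefined for any row whose `tenure` is 0: pandas returns `inf` (or `NaN` when the charges are also 0) instead of raising an error. Whether such rows occur in the prepared data or in the unmodified new data is not shown here, so the following is only a guarded sketch. The helper name `charge_per_tenure` is hypothetical and the third row of the toy frame is invented for illustration.

```python
import numpy as np
import pandas as pd


def charge_per_tenure(df):
    """TotalCharges / tenure, with NaN where tenure is 0."""
    # Replacing 0 with NaN makes the quotient NaN rather than inf.
    safe_tenure = df['tenure'].replace(0, np.nan)
    return df['TotalCharges'] / safe_tenure


toy = pd.DataFrame({'tenure': [1, 34, 0],
                    'TotalCharges': [29.85, 1889.50, 20.00]})
toy['charge_per_tenure'] = charge_per_tenure(toy)
print(toy)
```

Rows that come out as `NaN` can then be dropped or imputed explicitly instead of passing `inf` to the model.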

In [10]:
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
7590-VHVEG,1,0,0,2,29.85,29.85,0,29.850000
5575-GNVDE,34,1,1,1,56.95,1889.50,0,55.573529
3668-QPYBK,2,1,0,1,53.85,108.15,1,54.075000
7795-CFOCW,45,0,1,3,42.30,1840.75,0,40.905556
9237-HQITU,2,1,0,2,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0,82.937500
2234-XADUH,72,1,1,0,103.20,7362.90,0,102.262500
4801-JZAZL,11,0,0,2,29.60,346.45,0,31.495455
8361-LTMKD,4,1,0,1,74.40,306.60,1,76.650000


In [11]:
df.to_csv('prepped_churn_data.csv')

In [12]:
test_df = pd.read_csv('prepped_churn_data.csv', index_col='customerID')
test_df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
7590-VHVEG,1,0,0,2,29.85,29.85,0,29.850000
5575-GNVDE,34,1,1,1,56.95,1889.50,0,55.573529
3668-QPYBK,2,1,0,1,53.85,108.15,1,54.075000
7795-CFOCW,45,0,1,3,42.30,1840.75,0,40.905556
9237-HQITU,2,1,0,2,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0,82.937500
2234-XADUH,72,1,1,0,103.20,7362.90,0,102.262500
4801-JZAZL,11,0,0,2,29.60,346.45,0,31.495455
8361-LTMKD,4,1,0,1,74.40,306.60,1,76.650000


In [13]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [14]:
automl = setup(test_df, target='Churn', numeric_features=['PhoneService', 'Contract', 'PaymentMethod'])

,Description,Value
0,session_id,2208
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 8)"
5,Missing Values,False
6,Numeric Features,7
7,Categorical Features,0
8,Ordinal Features,False
9,High Cardinality Features,False


In [18]:
automl[6]

'clf-default-name'

In [19]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
catboost,CatBoost Classifier,0.792,0.8249,0.495,0.6405,0.5579,0.425,0.4313,1.53
ada,Ada Boost Classifier,0.7905,0.8238,0.4637,0.6482,0.5401,0.4095,0.4194,0.086
gbc,Gradient Boosting Classifier,0.7893,0.8282,0.4836,0.638,0.5497,0.4156,0.4227,0.192
ridge,Ridge Classifier,0.7883,0.0,0.4407,0.6501,0.5244,0.395,0.4077,0.011
lr,Logistic Regression,0.7877,0.8172,0.4774,0.6338,0.5439,0.4093,0.4167,0.508
lightgbm,Light Gradient Boosting Machine,0.7853,0.8128,0.5064,0.6172,0.5558,0.4162,0.42,0.11
lda,Linear Discriminant Analysis,0.7822,0.8094,0.4812,0.6159,0.5396,0.4,0.4056,0.01
xgboost,Extreme Gradient Boosting,0.7808,0.8035,0.5065,0.6044,0.5506,0.4073,0.4103,0.535
rf,Random Forest Classifier,0.772,0.788,0.4889,0.5843,0.5317,0.383,0.3859,0.243
knn,K Neighbors Classifier,0.7696,0.7478,0.453,0.5882,0.5108,0.3636,0.3695,0.023


In [20]:
best_model

<catboost.core.CatBoostClassifier at 0x1c2a77f0250>

In [23]:
df.iloc[-2:-1].shape

(1, 8)

In [24]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,1,76.65,1,0.5054


In [25]:
save_model(best_model, 'cat')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=['PhoneService',
                                                           'Contract',
                                                           'PaymentMethod'],
                                       target='Churn', time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None...
                 ('binn', 'passthrough'), ('rem_outliers', 'passthrough'),
                 ('cluster_all', 'passthrough'),
                 ('dummy', Dummify(target='Churn')),
                 ('fix_perfect', Remove_100(t

In [26]:
import pickle

with open('cat_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [27]:
with open('cat_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [28]:
new_data = test_df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1], dtype=int64)

In [30]:
loaded_cat = load_model('cat')

Transformation Pipeline and Model Successfully Loaded


In [31]:
predict_model(loaded_cat, new_data)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,Label,Score
8361-LTMKD,4,1,0,1,74.4,306.6,76.65,1,0.5054
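
The output above reports a `Label` and a `Score` rather than a churn probability directly. Assuming `Score` is the probability of the predicted `Label` (this has differed between pycaret releases, so it should be checked against the installed version), the probability of churn could be recovered as sketched below. The helper name `churn_probability_from_score` and the second row of the toy frame are hypothetical.

```python
import pandas as pd


def churn_probability_from_score(preds):
    """Convert Label/Score columns into P(churn).

    Assumption: Score is the probability of the predicted Label, so a
    'no churn' prediction (Label 0) with Score s means P(churn) = 1 - s.
    """
    label = preds['Label'].astype(int)
    # Keep Score where the label is 1, otherwise use its complement.
    return preds['Score'].where(label == 1, 1 - preds['Score'])


preds = pd.DataFrame({'Label': [1, 0], 'Score': [0.5054, 0.9000]},
                     index=pd.Index(['8361-LTMKD', 'hypothetical-id'],
                                    name='customerID'))
churn_prob = churn_probability_from_score(preds)
print(churn_prob)
```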


In [33]:
from IPython.display import Code

Code('predict_churn.py')

In [34]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC       Churn
1452-KNGVK    No Churn
6723-OKKJM    No Churn
7832-POPKP    No Churn
6348-TACGU    No Churn
Name: Churn_prediction, dtype: object
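
The assignment gives the true values for the new data as [1, 0, 0, 1, 0]. Reading the printed predictions as 1 for "Churn" and 0 for "No Churn", and assuming the rows appear in the same order as those true values, the metrics mentioned at the top of the notebook can be computed by hand. The function name `binary_metrics` is illustrative only.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision and recall for 0/1 labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        'accuracy': (tp + tn) / len(pairs),
        # Guard against division by zero when nothing is predicted/positive.
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
    }


y_true = [1, 0, 0, 1, 0]   # true values listed in the assignment
y_pred = [1, 0, 0, 0, 0]   # Churn, No Churn, No Churn, No Churn, No Churn
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Under those assumptions, the accuracy is 0.8 and the precision is 1.0, but the recall is only 0.5 because one of the two churners is missed. This shows why the choice of metric matters when picking the best model.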


# Summary

Write a short summary of the process and results here.