# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
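
To make the metric choice above more concrete, here is a minimal sketch that computes accuracy, precision, and recall by hand. The true values are the ones given in the assignment; the predictions are hypothetical and are not output from any model in this notebook.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # churners caught
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-churners kept
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # churners missed
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


y_true = [1, 0, 0, 1, 0]  # true values for new_churn_data.csv (from the assignment)
y_pred = [1, 0, 0, 0, 0]  # HYPOTHETICAL predictions, for illustration only

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these made-up predictions, accuracy looks good while recall shows that half of the churners were missed, which is why accuracy alone can be misleading on imbalanced churn data.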

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
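
The percentile challenge above could be approached as in the sketch below. It assumes the churn probabilities predicted on the training set are already available as a list; the values used here are hypothetical, and `percentile_of` is an illustrative helper name rather than part of pycaret.

```python
from bisect import bisect_right


def percentile_of(prob, train_probs_sorted):
    """Percent of training predictions that are less than or equal to prob."""
    rank = bisect_right(train_probs_sorted, prob)
    return 100.0 * rank / len(train_probs_sorted)


# HYPOTHETICAL training-set churn probabilities, sorted ascending
train_probs = sorted([0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.55, 0.65, 0.70, 0.90])

pct = percentile_of(0.78, train_probs)
print(f"A churn probability of 0.78 is at the {pct:.0f}th percentile")
```

Sorting the training probabilities once and using a binary search keeps each lookup cheap, which matters if many new rows are scored.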

In [1]:
#Load packages
import pandas as pd

#Load data
df = pd.read_csv('prepped_churn_data.csv')
df.head(3)

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,MonthlyC_TotalC_ratio,TotalC_tenure_ratio
0,5365,1,0,0,0,29.85,29.85,0,1.0,29.85
1,3953,34,1,1,1,56.95,1889.5,0,0.03014,55.573529
2,2558,2,1,0,1,53.85,108.15,1,0.49792,54.075


In [2]:
#Data Preparation
#AutoML with pycaret 
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [3]:
?setup

In [4]:
#setup function

#preprocess = False
#ignore_features = ['customerID']

automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,session_id,3796
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 10)"
5,Missing Values,False
6,Numeric Features,6
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


- #11 Transformed Train Set	(n, 12)
- #12 Transformed Test Set	(n, 12)

In [5]:
#setup function

#preprocess = False
#(n, 8)
automl = setup(df, target='Churn', preprocess = False)

Unnamed: 0,Description,Value
0,session_id,4487
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 10)"
5,Missing Values,False
6,Numeric Features,6
7,Categorical Features,3
8,Transformed Train Set,"(4922, 8)"
9,Transformed Test Set,"(2110, 8)"


In [6]:
#automl[6]
#automl[14]

In [7]:
#Use pycaret autoML to compare models and find the best one by accuracy
best_model = compare_models(sort='Accuracy')

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
catboost,CatBoost Classifier,0.7944,0.8307,0.4966,0.6479,0.5608,0.4299,0.4372,1.179
gbc,Gradient Boosting Classifier,0.7922,0.8302,0.4836,0.6462,0.5519,0.4204,0.4287,0.235
ridge,Ridge Classifier,0.7903,0.0,0.4169,0.6688,0.5123,0.3882,0.4064,0.01
lda,Linear Discriminant Analysis,0.7899,0.8289,0.4706,0.6424,0.5423,0.4104,0.4193,0.012
lr,Logistic Regression,0.7891,0.8338,0.4936,0.6327,0.5536,0.4185,0.4245,0.351
ada,Ada Boost Classifier,0.7865,0.8271,0.4791,0.6282,0.542,0.4065,0.4137,0.169
rf,Random Forest Classifier,0.78,0.8054,0.4806,0.6095,0.5363,0.3949,0.4004,0.263
et,Extra Trees Classifier,0.7749,0.7857,0.4928,0.5908,0.5362,0.3895,0.3929,0.196
knn,K Neighbors Classifier,0.7594,0.745,0.4292,0.563,0.4857,0.3327,0.3386,0.022
qda,Quadratic Discriminant Analysis,0.7536,0.8185,0.7035,0.5266,0.602,0.4289,0.4386,0.013
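
The ranking above depends on the metric used for sorting. The sketch below re-ranks the same results by each metric, using values copied from the table; `best_by` is an illustrative helper. In pycaret itself the equivalent would probably be to pass a different `sort` value to `compare_models`, for example `sort='AUC'`.

```python
# (model id, Accuracy, AUC, Recall, Prec., F1), copied from the table above
results = [
    ("catboost", 0.7944, 0.8307, 0.4966, 0.6479, 0.5608),
    ("gbc",      0.7922, 0.8302, 0.4836, 0.6462, 0.5519),
    ("ridge",    0.7903, 0.0,    0.4169, 0.6688, 0.5123),
    ("lda",      0.7899, 0.8289, 0.4706, 0.6424, 0.5423),
    ("lr",       0.7891, 0.8338, 0.4936, 0.6327, 0.5536),
    ("ada",      0.7865, 0.8271, 0.4791, 0.6282, 0.5420),
    ("rf",       0.7800, 0.8054, 0.4806, 0.6095, 0.5363),
    ("et",       0.7749, 0.7857, 0.4928, 0.5908, 0.5362),
    ("knn",      0.7594, 0.7450, 0.4292, 0.5630, 0.4857),
    ("qda",      0.7536, 0.8185, 0.7035, 0.5266, 0.6020),
]
COLUMNS = {"Accuracy": 1, "AUC": 2, "Recall": 3, "Prec.": 4, "F1": 5}


def best_by(metric):
    """Return the id of the model with the highest value for the given metric."""
    idx = COLUMNS[metric]
    return max(results, key=lambda row: row[idx])[0]


for metric in COLUMNS:
    print(f"{metric:>8}: {best_by(metric)}")
```

CatBoost leads only on accuracy: logistic regression has the highest AUC, and QDA has the highest recall and F1. The ridge AUC of 0.0 is most likely a placeholder, since the ridge classifier does not produce probability estimates.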


In [8]:
#viewing best model
best_model

<catboost.core.CatBoostClassifier at 0x138ce4610>

The best model is the CatBoost Classifier (accuracy 0.7944), followed by gbc (0.7922) and ridge (0.7903). Accuracy is close across most models, ranging from 0.7536 to 0.7944 for the ten shown.

In [9]:
df.iloc[-2:-1].shape

(1, 10)

In [10]:
predict_model(best_model, df.iloc[-2:-1])

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,MonthlyC_TotalC_ratio,TotalC_tenure_ratio,Label,Score
7030,5923,4,1,0,1,74.4,306.6,1,0.242661,76.65,1,0.5673


# Saving and loading the best model

In [11]:
#Saving best model
save_model(best_model, 'catboost')

Transformation Pipeline and Model Succesfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=['customerID'],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ['trained_model',
                  <catboost.core.CatBoostClassifier object at 0x138ce4610>]],
          verbose=False),
 'catboost.pkl')

In [12]:
import pickle
with open('catboost_model.pk', 'wb') as f:        #Write binary 
    pickle.dump(best_model, f)

In [13]:
#Read binary 

with open('catboost_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [14]:
df.iloc[-1].shape

(10,)

In [15]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

array([1])

In [17]:
loaded_catboost = load_model('catboost')

Transformation Pipeline and Model Successfully Loaded


In [18]:
predict_model(loaded_catboost, new_data)

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,MonthlyC_TotalC_ratio,TotalC_tenure_ratio,Label,Score
7030,5923,4,1,0,1,74.4,306.6,0.242661,76.65,1,0.5673
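
The assignment asks for the probability of churn, while the output above has a `Label` and a `Score` column. The sketch below shows one way to convert between them, assuming that `Score` is the probability of the predicted `Label` (which is how this version of pycaret appears to report it). `churn_probability` is an illustrative helper, and the second row of the example data is hypothetical.

```python
import pandas as pd


def churn_probability(predictions: pd.DataFrame) -> pd.Series:
    """Convert Label/Score columns into the probability of churn (class 1)."""
    label = predictions["Label"].astype(int)
    score = predictions["Score"].astype(float)
    # ASSUMPTION: Score is the probability of the predicted Label, so for
    # rows predicted as 0 (no churn) the churn probability is 1 - Score.
    prob = score.where(label == 1, 1.0 - score)
    return prob.rename("churn_probability")


preds = pd.DataFrame(
    {
        "Label": [1, 0],            # first row matches the output above
        "Score": [0.5673, 0.9000],  # second row is HYPOTHETICAL
    }
)
probs = churn_probability(preds)
print(probs)
```

If the assumption about `Score` does not hold for the installed pycaret version, the helper would need to be adjusted before its output is used.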


# Making a Python module to make predictions

In [26]:
from IPython.display import Code

Code('predict_churn.py')

In [27]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
0          Churn
1       No Churn
2       No Churn
3       No Churn
4          Churn
          ...   
7027    No Churn
7028    No Churn
7029    No Churn
7030       Churn
7031    No Churn
Name: Churn_prediction, Length: 7032, dtype: object


In [28]:
#Saved to GitHub

# Summary

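This notebook loads the prepared churn data (7032 rows, 10 columns) and uses pycaret's `setup` and `compare_models` to rank classifiers by accuracy. The CatBoost Classifier scored highest (accuracy 0.7944, AUC 0.8307), although most models fall within a few points of each other. The model was saved with `save_model` and with `pickle`, and both reloaded copies predict churn (label 1) for row 7030, with a score of 0.5673 from the pycaret pipeline. Finally, `predict_churn.py` loads the saved pipeline and prints a Churn / No Churn prediction for each row.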