# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, along with the percentile at which that prediction falls in the distribution of predicted probabilities from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
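The percentile challenge above can be sketched in a few lines of standard-library Python. This is only an illustration: `percentile_of` is a hypothetical helper name, and the training probabilities below are made-up placeholder values, not output from the model in this notebook.

```python
from bisect import bisect_right


def percentile_of(prob, sorted_train_probs):
    """Return the percentage of training probabilities that are <= prob."""
    if not sorted_train_probs:
        raise ValueError("need at least one training probability")
    # bisect_right counts how many sorted values are <= prob
    rank = bisect_right(sorted_train_probs, prob)
    return 100.0 * rank / len(sorted_train_probs)


# Placeholder values standing in for the model's predicted churn
# probabilities on the training data (assumption, not real results).
train_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.31, 0.40, 0.55, 0.62, 0.78, 0.91])

print(percentile_of(0.78, train_probs))  # 9 of the 10 values are <= 0.78
```

In practice the sorted list would come from scoring the training dataframe once and caching the result, so each new prediction only needs a binary search.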

## Loading Data

In [21]:
import pandas as pd

# Load the cleaned week 2 churn data; the first CSV column (customerID) becomes the index
df = pd.read_csv('C:/Users/Owner/Desktop/Regis University/Summer 2021/MSDS600 - Intro to Data Science/Week 2/cleaned_churn_data.csv', index_col=0)
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,total_charges_to_tenure_ratio,month_of_total_ratio
7590-VHVEG,1,0,0,0,29.85,29.85,0,29.850000,1.000000
5575-GNVDE,34,1,1,1,56.95,1889.50,0,55.573529,0.030140
3668-QPYBK,2,1,0,1,53.85,108.15,1,54.075000,0.497920
7795-CFOCW,45,0,1,2,42.30,1840.75,0,40.905556,0.022980
9237-HQITU,2,1,0,0,70.70,151.65,1,75.825000,0.466205
...,...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0,82.937500,0.042602
2234-XADUH,72,1,1,3,103.20,7362.90,0,102.262500,0.014016
4801-JZAZL,11,0,0,0,29.60,346.45,0,31.495455,0.085438
8361-LTMKD,4,1,0,1,74.40,306.60,1,76.650000,0.242661


In [22]:
df = df.drop('total_charges_to_tenure_ratio', axis=1)
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,month_of_total_ratio
7590-VHVEG,1,0,0,0,29.85,29.85,0,1.000000
5575-GNVDE,34,1,1,1,56.95,1889.50,0,0.030140
3668-QPYBK,2,1,0,1,53.85,108.15,1,0.497920
7795-CFOCW,45,0,1,2,42.30,1840.75,0,0.022980
9237-HQITU,2,1,0,0,70.70,151.65,1,0.466205
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0,0.042602
2234-XADUH,72,1,1,3,103.20,7362.90,0,0.014016
4801-JZAZL,11,0,0,0,29.60,346.45,0,0.085438
8361-LTMKD,4,1,0,1,74.40,306.60,1,0.242661


In [23]:
df = df.drop('month_of_total_ratio', axis=1)
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,0,29.85,29.85,0
5575-GNVDE,34,1,1,1,56.95,1889.50,0
3668-QPYBK,2,1,0,1,53.85,108.15,1
7795-CFOCW,45,0,1,2,42.30,1840.75,0
9237-HQITU,2,1,0,0,70.70,151.65,1
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,1,84.80,1990.50,0
2234-XADUH,72,1,1,3,103.20,7362.90,0
4801-JZAZL,11,0,0,0,29.60,346.45,0
8361-LTMKD,4,1,0,1,74.40,306.60,1


## AutoML with PyCaret

In [4]:
%pip install pycaret

In [5]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [25]:
automl = setup(df, target='Churn')

,Description,Value
0,session_id,6575
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7043, 7)"
5,Missing Values,False
6,Numeric Features,3
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


In [26]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.7947,0.8305,0.5293,0.6342,0.5761,0.4422,0.446,0.018
gbc,Gradient Boosting Classifier,0.7943,0.8427,0.4977,0.6424,0.5601,0.4288,0.4352,0.295
ridge,Ridge Classifier,0.7923,0.0,0.4576,0.6521,0.5366,0.4082,0.4195,0.012
lr,Logistic Regression,0.7915,0.8391,0.5146,0.6305,0.5646,0.4297,0.4348,0.047
ada,Ada Boost Classifier,0.7915,0.8414,0.503,0.6313,0.5588,0.4248,0.4301,0.143
lightgbm,Light Gradient Boosting Machine,0.7801,0.8267,0.51,0.5973,0.5497,0.4055,0.4081,0.066
knn,K Neighbors Classifier,0.772,0.7488,0.453,0.5884,0.5109,0.3657,0.3716,0.042
rf,Random Forest Classifier,0.7698,0.7973,0.5046,0.5724,0.5359,0.3837,0.3854,0.338
et,Extra Trees Classifier,0.7542,0.7715,0.4884,0.537,0.511,0.3474,0.3485,0.341
svm,SVM - Linear Kernel,0.7341,0.0,0.4501,0.568,0.4562,0.3019,0.3301,0.025


In [27]:
best_model

LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                           solver='svd', store_covariance=False, tol=0.0001)
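
The choice of metric changes which model counts as "best". The sketch below re-ranks the comparison table above by different columns using plain Python; the numbers are copied from that table, and `best_by` is a hypothetical helper name. In PyCaret itself, the same effect should be available by passing a `sort` argument such as `compare_models(sort='AUC')`.

```python
# Scores copied from the compare_models() table above.
# Ridge and linear SVM are reported with AUC 0.0 in that table.
results = {
    "lda":      {"Accuracy": 0.7947, "AUC": 0.8305, "Recall": 0.5293, "Prec.": 0.6342, "F1": 0.5761},
    "gbc":      {"Accuracy": 0.7943, "AUC": 0.8427, "Recall": 0.4977, "Prec.": 0.6424, "F1": 0.5601},
    "ridge":    {"Accuracy": 0.7923, "AUC": 0.0,    "Recall": 0.4576, "Prec.": 0.6521, "F1": 0.5366},
    "lr":       {"Accuracy": 0.7915, "AUC": 0.8391, "Recall": 0.5146, "Prec.": 0.6305, "F1": 0.5646},
    "ada":      {"Accuracy": 0.7915, "AUC": 0.8414, "Recall": 0.5030, "Prec.": 0.6313, "F1": 0.5588},
    "lightgbm": {"Accuracy": 0.7801, "AUC": 0.8267, "Recall": 0.5100, "Prec.": 0.5973, "F1": 0.5497},
    "knn":      {"Accuracy": 0.7720, "AUC": 0.7488, "Recall": 0.4530, "Prec.": 0.5884, "F1": 0.5109},
    "rf":       {"Accuracy": 0.7698, "AUC": 0.7973, "Recall": 0.5046, "Prec.": 0.5724, "F1": 0.5359},
    "et":       {"Accuracy": 0.7542, "AUC": 0.7715, "Recall": 0.4884, "Prec.": 0.5370, "F1": 0.5110},
    "svm":      {"Accuracy": 0.7341, "AUC": 0.0,    "Recall": 0.4501, "Prec.": 0.5680, "F1": 0.4562},
}


def best_by(metric):
    """Return the model ID with the highest value for the given metric."""
    return max(results, key=lambda model_id: results[model_id][metric])


for metric in ("Accuracy", "AUC", "Recall", "Prec.", "F1"):
    print(f"{metric}: {best_by(metric)}")
```

By accuracy, recall, and F1 the top model is LDA, while AUC favors the gradient boosting classifier and precision favors the ridge classifier, although the margins between the top few models are small.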

## Saving and Loading the Model

In [28]:
save_model(best_model, 'LDA')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Churn',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strate...
                 ('dummy', Dummify(target='Churn')),
                 ('fix_perfect', Remove_100(target='Churn')),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), ('fix_multi', 'passthrough'),
                 ('dfs

In [29]:
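import pickle

# Pickle only the fitted estimator; unlike save_model above, this does not
# include PyCaret's preprocessing pipeline.
with open('LDA_model.pk', 'wb') as f:
    pickle.dump(best_model, f)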

In [30]:
# Reload the pickled estimator to confirm the file round-trips
with open('LDA_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

## Making a Python Module to Make Predictions

In [31]:
from IPython.display import Code

# Display the module's source; passing filename= makes the intent explicit
Code(filename='predict_churn.py')
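
The contents of `predict_churn.py` are not reproduced in this notebook, so the following is only a rough sketch of what a module returning churn probabilities could look like, not the actual file. The function names `churn_probability` and `predict_churn` are hypothetical. It assumes the model was saved with `save_model(best_model, 'LDA')`, that PyCaret's score column holds the probability of the *predicted* class, and that the output columns are named `Label`/`Score` (PyCaret 2.x) or `prediction_label`/`prediction_score` (PyCaret 3.x).

```python
def churn_probability(label, score):
    """Convert a (predicted label, score) pair into P(churn).

    Assumption: `score` is the probability of the predicted class, so a row
    predicted as 0 (retained) with score 0.9 has P(churn) of about 0.1.
    """
    return float(score) if int(label) == 1 else 1.0 - float(score)


def predict_churn(df, model_name="LDA"):
    """Return P(churn) for each row of `df` as a pandas Series (hypothetical helper)."""
    # Imported lazily so churn_probability can be used without PyCaret installed.
    import pandas as pd
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)  # loads the pipeline written by save_model
    preds = predict_model(model, data=df)

    # Assumption: column names differ between PyCaret 2.x and 3.x.
    if "Label" in preds.columns:
        label_col, score_col = "Label", "Score"
    else:
        label_col, score_col = "prediction_label", "prediction_score"

    probs = [
        churn_probability(label, score)
        for label, score in zip(preds[label_col], preds[score_col])
    ]
    return pd.Series(probs, index=preds.index, name="churn_probability")
```

A typical call would read `new_churn_data.csv` into a dataframe with `customerID` as the index, apply the same column drops used for training, and print the result of `predict_churn(new_df)`.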

In [32]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC     Churned
1452-KNGVK    Retained
6723-OKKJM    Retained
7832-POPKP    Retained
6348-TACGU    Retained
Name: Churn_prediction, dtype: object
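
As a quick check, the printed predictions can be scored against the true values given in the assignment, `[1, 0, 0, 1, 0]`. The sketch below assumes those true values are listed in the same row order as the predictions above and that "Churned" corresponds to 1; `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means churn."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# True values from the assignment text; predictions from the output above.
y_true = [1, 0, 0, 1, 0]
predicted_labels = ["Churned", "Retained", "Retained", "Retained", "Retained"]
y_pred = [1 if label == "Churned" else 0 for label in predicted_labels]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
# accuracy=0.80 precision=1.00 recall=0.50
```

Under those assumptions, four of the five rows are classified correctly and one of the two churners is missed, which is in line with the roughly 0.5 recall in the cross-validation table. Five rows are far too few to draw conclusions from.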


## Summary

Write a short summary of the process and results here.