# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
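
The function described in the list above could be sketched roughly as follows. This is a minimal sketch rather than a reference solution: it assumes the saved object exposes a scikit-learn style `predict_proba` method and that class 1 means churn. The names `predict_churn_proba` and `StubModel` are hypothetical, the sample rows are made up, and the stub only stands in for a real trained pipeline so that the example runs on its own.

```python
import numpy as np
import pandas as pd


def predict_churn_proba(model, df):
    """Return the probability of churn (class 1) for each row of df.

    Assumes `model` follows the scikit-learn convention: `predict_proba`
    returns one column per class, ordered as in `model.classes_`.
    """
    # Drop the target column if the caller passed labelled data.
    features = df.drop(columns=["Churn"], errors="ignore")
    proba = np.asarray(model.predict_proba(features))
    churn_col = list(model.classes_).index(1)  # column holding P(churn)
    return pd.Series(proba[:, churn_col], index=df.index, name="churn_probability")


class StubModel:
    """Hypothetical stand-in for a trained model; NOT a real classifier."""

    classes_ = [0, 1]

    def predict_proba(self, X):
        # Toy rule for illustration only: shorter tenure -> higher churn probability.
        p_churn = 1.0 / (1.0 + X["tenure"].to_numpy() / 12.0)
        return np.column_stack([1.0 - p_churn, p_churn])


# Made-up rows; a real script would read new_churn_data.csv instead.
new_data = pd.DataFrame(
    {"tenure": [1, 12, 60], "MonthlyCharges": [30.0, 55.0, 100.0]},
    index=pd.Index(["A", "B", "C"], name="customerID"),
)
probabilities = predict_churn_proba(StubModel(), new_data)
print(probabilities)
```

In a real module, `StubModel()` would be replaced by the model loaded from disk, and the input frame would need the same columns the model was trained on.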

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
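
The percentile challenge in the list above can be illustrated with a small standard-library sketch. The helper name `probability_percentile` and the training probabilities are hypothetical, chosen only so the example reproduces the "0.78 at the 90th percentile" case. It defines the percentile as the share of training predictions less than or equal to the new prediction, which is one of several reasonable conventions.

```python
from bisect import bisect_right


def probability_percentile(train_probabilities, new_probability):
    """Percent of training predictions less than or equal to new_probability."""
    ordered = sorted(train_probabilities)
    if not ordered:
        raise ValueError("train_probabilities must not be empty")
    # bisect_right returns how many sorted values are <= new_probability.
    rank = bisect_right(ordered, new_probability)
    return 100.0 * rank / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only.
train_probabilities = [0.05, 0.10, 0.20, 0.30, 0.35, 0.50, 0.60, 0.70, 0.75, 0.90]
percentile = probability_percentile(train_probabilities, 0.78)
print(f"0.78 is at the {percentile:.0f}th percentile")
```

If SciPy is available, `scipy.stats.percentileofscore` with `kind='weak'` computes the same quantity.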

In [99]:
import pandas as pd

df = pd.read_csv('prepped_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,AverageMonthlyCharges
7590-VHVEG,1,1,0,2,29.85,29.85,1,29.850000
5575-GNVDE,34,0,1,3,56.95,1889.50,1,55.573529
3668-QPYBK,2,0,0,3,53.85,108.15,0,54.075000
7795-CFOCW,45,1,1,0,42.30,1840.75,1,40.905556
9237-HQITU,2,0,0,2,70.70,151.65,0,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24,0,1,3,84.80,1990.50,1,82.937500
2234-XADUH,72,0,1,1,103.20,7362.90,1,102.262500
4801-JZAZL,11,1,0,2,29.60,346.45,1,31.495455
8361-LTMKD,4,0,0,3,74.40,306.60,0,76.650000


In [52]:
! conda install -c conda-forge pycaret -y

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

# All requested packages already installed.



In [100]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [101]:
?setup

In [102]:
automl = setup(df, target='Churn', fold_shuffle=True)

,Description,Value
0,session_id,2500
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,"0: 0, 1: 1"
4,Original Data,"(7032, 8)"
5,Missing Values,False
6,Numeric Features,4
7,Categorical Features,3
8,Ordinal Features,False
9,High Cardinality Features,False


AttributeError: 'Simple_Imputer' object has no attribute 'fill_value_categorical'

In [104]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
ada,Ada Boost Classifier,0.7871,0.8348,0.8981,0.8263,0.8606,0.4129,0.4205,0.061
gbc,Gradient Boosting Classifier,0.7869,0.8345,0.9012,0.8242,0.8608,0.4092,0.4178,0.132
lightgbm,Light Gradient Boosting Machine,0.7855,0.8233,0.8864,0.8316,0.858,0.4204,0.4249,0.227
lr,Logistic Regression,0.785,0.8278,0.8851,0.8321,0.8576,0.4204,0.4245,0.785
ridge,Ridge Classifier,0.7836,0.0,0.9084,0.8167,0.86,0.3893,0.4012,0.011
lda,Linear Discriminant Analysis,0.7834,0.8194,0.8853,0.8302,0.8567,0.4144,0.4189,0.009
catboost,CatBoost Classifier,0.7822,0.8297,0.897,0.8219,0.8577,0.397,0.4046,1.313
xgboost,Extreme Gradient Boosting,0.7714,0.8096,0.8751,0.8238,0.8485,0.3843,0.3881,0.172
rf,Random Forest Classifier,0.7603,0.7865,0.8631,0.8192,0.8405,0.3592,0.3616,0.156
knn,K Neighbors Classifier,0.7519,0.7321,0.877,0.8027,0.838,0.3119,0.3182,0.017


In [105]:
best_model

In [106]:
df.iloc[-2:-1].shape

(1, 8)

In [107]:
predict_model(best_model, df.iloc[-2:-1])

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,AverageMonthlyCharges,Label,Score
8361-LTMKD,4,0,0,3,74.4,306.6,0,76.65,0,0.5006


In [108]:
save_model(best_model, 'ada')

AttributeError: 'Simple_Imputer' object has no attribute 'fill_value_categorical'

In [109]:
import pickle

with open('ada_model.pk', 'wb') as f:
    pickle.dump(best_model, f)

In [110]:
with open('ada_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [112]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_model.predict(new_data)

ValueError: X has 7 features, but AdaBoostClassifier is expecting 11 features as input.

In [113]:
loaded_ada = load_model('ada')

Transformation Pipeline and Model Successfully Loaded


FileNotFoundError: [Errno 2] No such file or directory: 'ada.pkl'

In [97]:
from IPython.display import Code

Code('predict_diabetes.py')

In [96]:
%run predict_diabetes.py

FileNotFoundError: [Errno 2] No such file or directory: 'data/new_diabetes_data.csv'

# Summary

I ran into repeated errors in the code this week after identifying the best model, which for the churn data was the AdaBoost classifier (`ada`). I had no issues creating the GitHub repository or using VS Code.