# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
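
The percentile challenge above can be sketched with the standard library alone. This is a minimal illustration, not the assignment's required solution: the helper name `percentile_of` and the training probabilities are made up for the example, and "percentile" is taken here to mean the share of training predictions at or below the new probability.

```python
from bisect import bisect_right


def percentile_of(probability, training_probabilities):
    """Return the percentage of training probabilities <= probability."""
    ordered = sorted(training_probabilities)
    if not ordered:
        raise ValueError("training_probabilities must not be empty")
    # bisect_right counts how many sorted values are <= probability.
    rank = bisect_right(ordered, probability)
    return 100.0 * rank / len(ordered)


# Made-up training-set probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.12, 0.20, 0.25, 0.31, 0.40, 0.55, 0.70, 0.90]

for p in (0.08, 0.78):
    print(f"P(churn)={p:.2f} is at percentile {percentile_of(p, train_probs):.0f}")
```

With real data, `train_probs` would be the churn probabilities predicted for the training set, computed once and stored alongside the saved model.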

In [1]:
# Use pycaret to find an ML algorithm that performs best on the data.
# First, load the data.
import pandas as pd

df = pd.read_csv('clean_churn_data.csv', index_col='customerID')
# Drop the leftover index column and the ratio column, which are not in new_churn_data.csv.
df = df.drop(['Unnamed: 0', 'monthly_total_chg_ratio'], axis=1)
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,12,1,29.85,29.85,0
5575-GNVDE,34,1,1,2,56.95,1889.50,0
3668-QPYBK,2,1,12,2,53.85,108.15,1
7795-CFOCW,45,0,1,3,42.30,1840.75,0
9237-HQITU,2,1,12,1,70.70,151.65,1
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,2,84.80,1990.50,0
2234-XADUH,72,1,1,4,103.20,7362.90,0
4801-JZAZL,11,0,12,1,29.60,346.45,0
8361-LTMKD,4,1,12,2,74.40,306.60,1


In [2]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model
automl = setup(df, target='Churn', numeric_features=['PhoneService', 'Contract', 'PaymentMethod'])

,Description,Value
0,session_id,2815
1,Target,Churn
2,Target Type,Binary
3,Label Encoded,
4,Original Data,"(7043, 7)"
5,Missing Values,0
6,Numeric Features,6
7,Categorical Features,0
8,Ordinal Features,0
9,High Cardinality Features,0


In [5]:
# Index 0 seems to hold the original data, so let's check the documentation.
# After checking, it is still not clear what pycaret.classification.setup returns; the documentation just says 'global variables'. If you have any information on this, please email me! Thank you.
automl[0]

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges
1240-HCBOH,67.0,1.0,2.0,2.0,26.100000,1759.550049
2080-GKCWQ,2.0,1.0,12.0,1.0,74.949997,151.750000
2654-VBVPB,1.0,1.0,12.0,3.0,19.900000,19.900000
4102-OQUPX,1.0,1.0,12.0,1.0,74.400002,74.400002
9074-KGVOX,50.0,0.0,12.0,4.0,39.450001,2021.349976
...,...,...,...,...,...,...
1755-RMCXH,2.0,1.0,12.0,2.0,20.299999,40.250000
5985-TBABQ,32.0,1.0,1.0,2.0,74.750000,2282.949951
0654-PQKDW,62.0,1.0,1.0,3.0,70.750000,4263.450195
1897-RCFUM,39.0,1.0,1.0,2.0,24.200001,914.599976


In [6]:
# Now we compare models.
## Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
### I will choose the default, accuracy, simply because I believe it's a good starting point.
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
ada,Ada Boost Classifier,0.7976,0.8437,0.5266,0.6561,0.5832,0.4519,0.4573,0.09
gbc,Gradient Boosting Classifier,0.7963,0.8426,0.5064,0.66,0.5722,0.4419,0.449,0.181
lr,Logistic Regression,0.7941,0.8367,0.5372,0.6424,0.5844,0.4492,0.4528,4.519
ridge,Ridge Classifier,0.7923,0.0,0.4861,0.6565,0.5579,0.4262,0.4348,0.008
lda,Linear Discriminant Analysis,0.7899,0.83,0.535,0.6321,0.5788,0.4402,0.4434,0.014
lightgbm,Light Gradient Boosting Machine,0.7892,0.8275,0.5244,0.6323,0.5729,0.4347,0.4383,0.048
svm,SVM - Linear Kernel,0.7728,0.0,0.432,0.62,0.4995,0.3617,0.3756,0.025
knn,K Neighbors Classifier,0.7694,0.7568,0.4635,0.5943,0.5195,0.3712,0.3768,0.028
rf,Random Forest Classifier,0.7669,0.7995,0.474,0.5848,0.5229,0.3711,0.3751,0.208
et,Extra Trees Classifier,0.7568,0.7796,0.4763,0.5598,0.5138,0.3532,0.3558,0.17
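
The choice of metric matters for this leaderboard, because different columns put different models first. The sketch below copies a few rows from the table above and picks the top model per metric; the helper name `best_by` is made up for the example.

```python
import csv
import io

# A few rows copied from the compare_models() leaderboard above.
LEADERBOARD = """\
Model,Accuracy,AUC,Recall,Prec.,F1
ada,0.7976,0.8437,0.5266,0.6561,0.5832
gbc,0.7963,0.8426,0.5064,0.66,0.5722
lr,0.7941,0.8367,0.5372,0.6424,0.5844
lda,0.7899,0.83,0.535,0.6321,0.5788
"""


def best_by(metric, table=LEADERBOARD):
    """Return the model ID with the highest value in the given metric column."""
    rows = list(csv.DictReader(io.StringIO(table)))
    return max(rows, key=lambda row: float(row[metric]))["Model"]


for metric in ("Accuracy", "AUC", "Recall", "Prec.", "F1"):
    print(f"{metric:>8}: {best_by(metric)}")
```

Accuracy and AUC favor `ada`, recall and F1 favor `lr`, and precision favors `gbc`. As far as I know, pycaret can rank by another column directly through the `sort` argument, e.g. `compare_models(sort='Recall')`.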


In [7]:
best_model

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=2815)

In [8]:
# Therefore, the best model available is the Ada Boost Classifier model!
# Save the model to disk.
save_model(best_model, 'ABC_best')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=['PhoneService',
                                                           'Contract',
                                                           'PaymentMethod'],
                                       target='Churn', time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None...
                 ('dummy', Dummify(target='Churn')),
                 ('fix_perfect', Remove_100(target='Churn')),
                 ('clean_names', Clean_Colum_Names()),
                 ('feature_select', 'passthrough'), 
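
A saved model like this one can be loaded back and used to return a churn probability per row. The following is only a sketch under stated assumptions, not the contents of `predict_churn.py`: the function names are hypothetical, and the output column names are assumed to be `Label`/`Score` in pycaret 2.x and `prediction_label`/`prediction_score` in 3.x, with the score being the probability of the predicted label.

```python
def churn_probability(label, score):
    """Convert (predicted label, score of that label) into P(churn)."""
    # Assumption: score is the probability of the predicted class,
    # so a "no churn" prediction (label 0) needs to be flipped.
    return score if int(label) == 1 else 1.0 - score


def predict_churn(df, model_name="ABC_best"):
    """Return a pandas Series of churn probabilities, one per row of df."""
    # Imported lazily so the helper above works without these packages.
    import pandas as pd
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)  # reads ABC_best.pkl
    predictions = predict_model(model, data=df)
    # Assumption: output column names depend on the pycaret version.
    if "Label" in predictions.columns:
        label_col, score_col = "Label", "Score"
    else:
        label_col, score_col = "prediction_label", "prediction_score"
    probabilities = [
        churn_probability(label, score)
        for label, score in zip(predictions[label_col], predictions[score_col])
    ]
    return pd.Series(probabilities, index=predictions.index,
                     name="Churn_probability")
```

A call might look like `predict_churn(pd.read_csv('new_churn_data.csv', index_col='customerID'))`, assuming the new file has the same columns as the training features.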

In [9]:
#create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
##your Python file/function should print out the predictions for new data (new_churn_data.csv)
##the true values for the new data are [1, 0, 0, 1, 0] if you're interested
#test your Python module and function with the new data, new_churn_data.csv
%run predict_churn.py

Please enter filename for predicting: new_churn_data.csv
Transformation Pipeline and Model Successfully Loaded
predictions:
customerID
9305-CKSKC    No churn
1452-KNGVK    No churn
6723-OKKJM    No churn
7832-POPKP    No churn
6348-TACGU    No churn
Name: Churn_prediction, dtype: object
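
The printed predictions can be checked against the true values given in the assignment, [1, 0, 0, 1, 0]. The sketch below assumes the true values are listed in the same order as the rows above, and it reports precision as 0.0 when there are no positive predictions, which is a convention rather than a defined value.

```python
true_values = [1, 0, 0, 1, 0]  # given in the assignment text
predicted = [0, 0, 0, 0, 0]    # every row above was labelled "No churn"


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means churn."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(true_values, predicted)
accuracy = (tp + tn) / len(true_values)
recall = tp / (tp + fn) if (tp + fn) else 0.0
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

On these five rows the accuracy is 0.60 while the recall is 0.00, since neither of the two churners was flagged. Five rows are far too few to judge the model, but the gap shows why accuracy alone can hide missed churners.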


# Summary

In this assignment:
- I used the pandas Python package to load my clean_churn_data.csv from week 2.
- Once the data was loaded, I dropped the `Unnamed: 0` column and the `monthly_total_chg_ratio` column because they are not present in new_churn_data.csv, which caused an error.
- Using the pycaret Python package, I compared ML models on the clean churn data, making sure to set all columns as numeric since they had already been cleaned.
- Comparing the models' accuracies, I chose the Ada Boost Classifier because it was the most accurate (0.7976).
- I saved this Ada Boost Classifier model to a pickle file.
- I then created a Python script, based on the FTE example, that prompts the user for a filename and predicts churn vs. no churn for each row in that file.
- These predictions are printed to the user's console.

It was fun to write some Python scripts!

Thank you, Jeremy