# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare their performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
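
The percentile challenge above can be sketched in plain Python. This is only an illustration under stated assumptions: `percentile_of` is a hypothetical helper name, and `train_probs` is made-up data standing in for the churn probabilities predicted on the training set.

```python
from bisect import bisect_right


def percentile_of(value, reference):
    """Return the percentage (0-100) of reference values that are <= value."""
    ordered = sorted(reference)
    if not ordered:
        raise ValueError("reference must not be empty")
    # bisect_right counts how many sorted values are <= value
    return 100.0 * bisect_right(ordered, value) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only
train_probs = [0.05, 0.10, 0.15, 0.22, 0.30, 0.41, 0.55, 0.63, 0.70, 0.85]

new_prob = 0.78
pct = percentile_of(new_prob, train_probs)
print(f"P(churn) = {new_prob:.2f} is at the {pct:.0f}th percentile")
```

With real data, `train_probs` would be replaced by the model's predicted churn probabilities for the training rows.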

In [2]:
import pandas as pd

df = pd.read_excel('prepped_churn_data.xlsx')
df

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,0,1,0,0,0,29.85,29.85,0
1,1,34,1,1,1,56.95,1889.50,0
2,2,2,1,0,1,53.85,108.15,1
3,3,45,0,1,2,42.30,1840.75,0
4,4,2,1,0,0,70.70,151.65,1
...,...,...,...,...,...,...,...,...
7038,7038,24,1,1,1,84.80,1990.50,0
7039,7039,72,1,1,3,103.20,7362.90,0
7040,7040,11,0,0,0,29.60,346.45,0
7041,7041,4,1,0,1,74.40,306.60,1


In [1]:
from pycaret.classification import ClassificationExperiment

In [3]:
automl = ClassificationExperiment()  # experiment object (object-oriented PyCaret API)

In [4]:
automl.setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,1401
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 8)"
4,Transformed data shape,"(7043, 8)"
5,Transformed train set shape,"(4930, 8)"
6,Transformed test set shape,"(2113, 8)"
7,Numeric features,7
8,Preprocess,True
9,Imputation type,simple


<pycaret.classification.oop.ClassificationExperiment at 0x26958b83e50>

## Examine for Best Model

In [5]:
best_model = automl.compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7945,0.8386,0.5023,0.6455,0.5643,0.4327,0.4389,0.071
ridge,Ridge Classifier,0.7943,0.0,0.4633,0.6613,0.5441,0.4167,0.4281,0.006
lr,Logistic Regression,0.7931,0.8371,0.5252,0.635,0.5738,0.439,0.4431,0.401
lda,Linear Discriminant Analysis,0.7917,0.8271,0.5091,0.6351,0.5644,0.4298,0.4348,0.006
ada,Ada Boost Classifier,0.7905,0.836,0.4939,0.6363,0.5553,0.4212,0.4275,0.029
rf,Random Forest Classifier,0.7811,0.8161,0.4877,0.6106,0.5415,0.4003,0.4051,0.074
lightgbm,Light Gradient Boosting Machine,0.7797,0.825,0.4992,0.6033,0.5454,0.402,0.4056,0.055
et,Extra Trees Classifier,0.7677,0.807,0.4831,0.5754,0.5246,0.3726,0.3755,0.058
qda,Quadratic Discriminant Analysis,0.7519,0.8281,0.75,0.5231,0.6159,0.4413,0.4574,0.005
dummy,Dummy Classifier,0.7347,0.5,0.0,0.0,0.0,0.0,0.0,0.006
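
The table illustrates why accuracy alone can mislead on imbalanced data: the Dummy Classifier reaches 0.7347 accuracy with a recall of 0, because roughly 73% of customers did not churn. The sketch below computes the main metrics from confusion-matrix counts. The counts are hypothetical (1,000 customers, 265 of them churners) and `classification_metrics` is an illustrative helper, not a PyCaret function.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical model that always predicts "no churn"
always_no = classification_metrics(tp=0, fp=0, tn=735, fn=265)
print(always_no)
```

If catching churners matters more than overall accuracy, `compare_models` can rank by a different column through its `sort` argument, for example `sort='Recall'` or `sort='AUC'`.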


In [6]:
best_model

In [7]:
automl.evaluate_model(best_model)


In [21]:
automl.predict_model(best_model, df.iloc[-2:-1])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,1.0,0,1.0,1.0,1.0,,0.0


Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
7041,7041,4,1,0,1,74.400002,306.600006,1,1,0.6879


## Saving Model

In [22]:
automl.save_model(best_model, 'pycaret_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['customerID', 'tenure',
                                              'PhoneService', 'Contract',
                                              'PaymentMethod', 'MonthlyCharges',
                                              'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean',
                                                               verbose='depr...
           

In [23]:
new_pycaret = ClassificationExperiment()
loaded_model = new_pycaret.load_model('pycaret_model')

Transformation Pipeline and Model Successfully Loaded


In [14]:
new_data = df.iloc[-2:-1]  # a single row (index 7041) reused as stand-in "new" data

In [24]:
new_pycaret.predict_model(loaded_model, new_data)

Unnamed: 0,customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
7041,7041,4,1,0,1,74.400002,306.600006,1,1,0.6879


## Prediction Section

In [41]:
from IPython.display import Code

Code('predict_churn.py')

In [43]:
%run predict_churn.py

Transformation Pipeline and Model Successfully Loaded
predictions:
0    No
1    No
2    No
3    No
4    No
Name: Churn_prediction, dtype: object
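
The listing of `predict_churn.py` does not appear in this export, so the following is only a minimal sketch of what such a module could look like, not the actual file. It assumes the model was saved as `pycaret_model` (as above) and that `predict_model` returns the `prediction_label` and `prediction_score` columns shown earlier. `churn_probability` and `predict_churn` are hypothetical names. It also assumes that `prediction_score` is the probability of the predicted label, so the score is flipped when the label is 0.

```python
def churn_probability(label, score):
    """Convert a (prediction_label, prediction_score) pair to P(churn).

    Assumption: the score is the probability of the predicted label,
    so for label 0 the churn probability is 1 - score.
    """
    return score if label == 1 else 1.0 - score


def predict_churn(df, model_name="pycaret_model"):
    """Return the churn probability for each row of a pandas DataFrame."""
    # Imported lazily so the helper above works without PyCaret installed
    from pycaret.classification import ClassificationExperiment

    experiment = ClassificationExperiment()
    model = experiment.load_model(model_name)
    predictions = experiment.predict_model(model, data=df)
    return [
        churn_probability(label, score)
        for label, score in zip(predictions["prediction_label"],
                                predictions["prediction_score"])
    ]
```

Calling `predict_churn(pd.read_csv('new_churn_data.csv'))` would then return one probability per row. The script output above prints labels ("No") rather than probabilities, which is what the assignment asks for.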


## Summary

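We started by importing the prepared churn data and setting `Churn` as the target for the model. We then used PyCaret's AutoML comparison to find the model that performed best on our data. Ranked by accuracy, the Gradient Boosting Classifier (GBC) came out on top at 0.7945, only marginally ahead of the Ridge Classifier (0.7943) and Logistic Regression (0.7931).

After saving the model, we downloaded a newer set of churn data (new_churn_data.csv) to test the model on a smaller scale. We found that the model is working okay but not well yet: it predicted "No" for all five rows, which gives two false negatives and three true negatives against the true values [1, 0, 0, 1, 0]. The new data was synthesized from existing data, so it is a little random, but the result shows that the model is not yet catching the customers who actually churn.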
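
Tallying the five predictions against the true values given in the assignment is a quick check on the counts above. In this sketch, `confusion_counts` is a hypothetical helper, and "Yes" is assumed to be the label the script would print for a predicted churner (only "No" appears in the output).

```python
true_values = [1, 0, 0, 1, 0]      # from the assignment text
predicted_labels = ["No"] * 5      # from the predict_churn.py output

# Assumption: "Yes" would denote predicted churn (1), "No" denotes 0
predicted = [1 if label == "Yes" else 0 for label in predicted_labels]


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = churn)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, guess in zip(y_true, y_pred):
        if actual == 1 and guess == 1:
            counts["tp"] += 1
        elif actual == 0 and guess == 1:
            counts["fp"] += 1
        elif actual == 0 and guess == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


counts = confusion_counts(true_values, predicted)
print(counts)  # {'tp': 0, 'fp': 0, 'tn': 3, 'fn': 2}
```

Three of the five rows are correct (60% accuracy), but recall on churners is 0 because both actual churners were missed.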
