# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

### Importing modules and loading the data

In [1]:
#got help from week 5 FTE (TPOT)
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

import timeit
#was getting a warning "optional dependency `torch` is not available. - skipping import of NN models." Fixed it with pip install torch
#https://github.com/microsoft/python-language-server/issues/1301

In [2]:
df = pd.read_csv("churn_data_v2.csv", index_col="customerID")
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,tenure_totalcharges_ratio
7590-VHVEG,1,0,0,1,29.85,29.85,0,0.033501
5575-GNVDE,34,1,1,0,56.95,1889.50,0,0.017994
3668-QPYBK,2,1,0,0,53.85,108.15,1,0.018493
7795-CFOCW,45,0,1,3,42.30,1840.75,0,0.024447
9237-HQITU,2,1,0,1,70.70,151.65,1,0.013188
...,...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,0,84.80,1990.50,0,0.012057
2234-XADUH,72,1,1,2,103.20,7362.90,0,0.009779
4801-JZAZL,11,0,0,1,29.60,346.45,0,0.031751
8361-LTMKD,4,1,0,0,74.40,306.60,1,0.013046


### Creating training and testing sets

As always, we first need to split our data into training and testing sets.

In [3]:
features = df.drop("Churn", axis=1)
targets = df["Churn"]

x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state=28)

### Fitting the data using TPOT

When we fit TPOT to our data, it actually tries many, many models and pipelines and tells us which one got the best score. In this case, we are scoring on accuracy.

In [4]:
%%time
tpot = TPOTClassifier(generations=5, population_size=50, cv=5,random_state=28, scoring='accuracy', verbosity=2, n_jobs=-1)

tpot.fit(x_train, y_train)
print(tpot.score(x_test, y_test))
#https://epistasislab.github.io/tpot/using/

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7968573507066884

Generation 2 - Current best internal CV score: 0.7991286445342738

Generation 3 - Current best internal CV score: 0.7991286445342738

Generation 4 - Current best internal CV score: 0.7991286445342738

Generation 5 - Current best internal CV score: 0.7991288237149164

Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=True, criterion=entropy, max_features=0.8, min_samples_leaf=17, min_samples_split=9, n_estimators=100)
0.7978421351504826
CPU times: total: 29.3 s
Wall time: 6min 6s


The best model got a test accuracy of almost 80% (0.798), which isn't terrible, but it also doesn't really beat last week's random forest model, which got close to 80% as well.
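To judge whether an accuracy near 80% is actually good, it helps to compare it against a model that always predicts the most common class. The sketch below shows that comparison with scikit-learn's `DummyClassifier`. It runs on made-up stand-in data (an assumption, so the numbers it prints say nothing about the churn data); with the real data, `x_train`, `x_test`, `y_train`, and `y_test` from the split above would take the place of the synthetic arrays.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Stand-in data (an assumption): 2000 rows, 4 numeric features, imbalanced 0/1 target.
rng = np.random.default_rng(28)
X = rng.normal(size=(2000, 4))
y = (X[:, 0] + rng.normal(scale=1.5, size=2000) > 1.1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=28)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_accuracy = baseline.score(X_test, y_test)

# A real model to compare against the baseline on the same test rows.
model = ExtraTreesClassifier(n_estimators=100, min_samples_leaf=17, random_state=28)
model.fit(X_train, y_train)
model_accuracy = model.score(X_test, y_test)

print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"ExtraTrees accuracy:              {model_accuracy:.3f}")
```

If the baseline is already close to the model's score, accuracy is probably not the most informative metric, and something like AUC or recall may separate the models better.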

### Saving our TPOT model as a python file

We can save the model TPOT gave us for later use by running a line of code that saves it as a Python file.

In [5]:
tpot.export("tpot_churn_model_raw.py")
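Note that `tpot.export` writes the pipeline's source code, not the fitted model, so the exported script has to retrain every time it runs. A possible alternative is to save the fitted estimator itself with `joblib`. The sketch below uses a small stand-in model on made-up data and a hypothetical file name (`churn_model.joblib`). In this notebook the object to save would presumably be `tpot.fitted_pipeline_`, but treat that attribute name as an assumption to check against the TPOT docs for your version.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Stand-in data and model (assumptions, not the churn data or the TPOT pipeline).
rng = np.random.default_rng(28)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)
fitted_model = ExtraTreesClassifier(n_estimators=20, random_state=28).fit(X, y)

# Save to a temporary folder so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "churn_model.joblib"  # hypothetical file name
    joblib.dump(fitted_model, model_path)
    restored_model = joblib.load(model_path)

# The restored model should make the same predictions as the original.
same_predictions = np.array_equal(fitted_model.predict(X), restored_model.predict(X))
print(same_predictions)
```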

Viewing the raw code that was saved:

In [6]:
#Learned from week 5 FTE
from IPython.display import Code

Code("tpot_churn_model_raw.py")

### Modifying the Python file and testing it with new data

The code above won't work as-is because it has generic placeholder names for some strings. We need to change those strings to match our specific file name/place on path and the name of the target column in the data set. We can also get rid of the "sep=" and "dtype=" arguments because they are not needed (and they actually caused some errors for me).

In [7]:
from IPython.display import Code

Code("tpot_churn_model_edited.py")

We can now run the code and see if it correctly predicts the targets of our new data set!

In [8]:
%run tpot_churn_model_edited.py
#had to make sure to set customerID as the index column so I didn't get an error

[0 0]


I had to create a churn column with the true values given in the assignment header above, because there was no churn column in the data I received. I could not figure out how to print out all 5 predictions.
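A likely explanation for seeing only two predictions, assuming the edited script kept the layout of TPOT's export: the exported code runs `train_test_split` on whatever file it loads and predicts only on the test part. With five rows and the default 25% test size, the test part has two rows. One way around this is to fit on the full training data and then predict on every row of the new data, which also means the new data needs no churn column. The sketch below does that with the `ExtraTreesClassifier` settings TPOT reported. The data is made up (an assumption) so the example runs on its own, and the helper names `make_stand_in_data`, `fit_churn_model`, and `predict_churn_probability` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

FEATURE_COLUMNS = [
    "tenure", "PhoneService", "Contract", "PaymentMethod",
    "MonthlyCharges", "TotalCharges", "tenure_totalcharges_ratio",
]


def make_stand_in_data(n_rows, seed):
    """Build random rows with the notebook's column names (made-up values)."""
    rng = np.random.default_rng(seed)
    tenure = rng.integers(1, 73, size=n_rows)
    monthly = rng.uniform(20, 110, size=n_rows).round(2)
    total = (tenure * monthly).round(2)
    frame = pd.DataFrame({
        "tenure": tenure,
        "PhoneService": rng.integers(0, 2, size=n_rows),
        "Contract": rng.integers(0, 3, size=n_rows),
        "PaymentMethod": rng.integers(0, 4, size=n_rows),
        "MonthlyCharges": monthly,
        "TotalCharges": total,
        "tenure_totalcharges_ratio": tenure / total,
    })
    # Made-up rule: short-tenure customers churn more often.
    frame["Churn"] = (rng.random(n_rows) < 0.6 - 0.007 * tenure).astype(int)
    return frame


def fit_churn_model(train_df, target="Churn"):
    """Fit the ExtraTrees settings TPOT reported, using every training row."""
    model = ExtraTreesClassifier(
        bootstrap=True, criterion="entropy", max_features=0.8,
        min_samples_leaf=17, min_samples_split=9, n_estimators=100,
        random_state=28,
    )
    model.fit(train_df[FEATURE_COLUMNS], train_df[target])
    return model


def predict_churn_probability(model, new_df):
    """Return P(Churn == 1) for each row of new_df, keeping its index."""
    positive_column = list(model.classes_).index(1)
    probabilities = model.predict_proba(new_df[FEATURE_COLUMNS])[:, positive_column]
    return pd.Series(probabilities, index=new_df.index, name="churn_probability")


train_df = make_stand_in_data(500, seed=28)
new_df = make_stand_in_data(5, seed=5).drop(columns="Churn")  # no target column needed
model = fit_churn_model(train_df)
churn_probabilities = predict_churn_probability(model, new_df)
print(churn_probabilities)
print((churn_probabilities >= 0.5).astype(int).tolist())  # 0/1 labels at a 0.5 cutoff
```

With the real files, `train_df` would come from `pd.read_csv("churn_data_v2.csv", index_col="customerID")` and `new_df` from `pd.read_csv("new_churn_data.csv", index_col="customerID")`, and the five 0/1 labels could then be compared with the true values [1, 0, 0, 1, 0].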

# Summary

I chose to go with TPOT for my autoML package because it sounded more up-to-date and easier to work with. Fitting the data was actually easier than I thought, but I was surprised to see that the accuracy of the best model it found was not better than the random forest model we used last week. Maybe we need more data? Cleaner data? I am not sure. However, 80% accuracy might be enough for the purpose this model will be used for.

I struggled at the end with the Python file and could not get it to print out all five predictions, so I could not see how well it did with the new data.