# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

In [1]:
#!conda install numpy scipy scikit-learn pandas joblib pytorch -y

In [2]:
#!pip install deap update_checker tqdm stopit xgboost

In [3]:
#!pip install tpot

In [4]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

import timeit 

After installing TPOT, I first load the prepared data from week 2 where everything had been converted to numbers:

In [5]:
import pandas as pd

df = pd.read_csv('../data/cleaned_churn_data.csv', index_col='customerID')
df.head(10)

Unnamed: 0_level_0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
customerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
7590-VHVEG,1.0,0,0,3,29.85,29.85,0,29.85
5575-GNVDE,34.0,1,1,2,56.95,1889.5,0,55.573529
3668-QPYBK,2.0,1,0,2,53.85,108.15,1,54.075
7795-CFOCW,45.0,0,1,1,42.3,1840.75,0,40.905556
9237-HQITU,2.0,1,0,3,70.7,151.65,1,75.825
9305-CDSKC,8.0,1,0,3,99.65,820.5,1,102.5625
1452-KIOVK,22.0,1,0,0,89.1,1949.4,0,88.609091
6713-OKOMC,10.0,0,0,2,29.75,301.9,0,30.19
7892-POOKP,28.0,1,0,3,104.8,3046.05,1,108.7875
6388-TABGU,62.0,1,1,1,56.15,3487.95,0,56.257258


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   tenure             7043 non-null   float64
 1   PhoneService       7043 non-null   int64  
 2   Contract           7043 non-null   int64  
 3   PaymentMethod      7043 non-null   int64  
 4   MonthlyCharges     7043 non-null   float64
 5   TotalCharges       7043 non-null   float64
 6   Churn              7043 non-null   int64  
 7   charge_per_tenure  7043 non-null   float64
dtypes: float64(4), int64(4)
memory usage: 495.2+ KB


Next, I split the dataset into feature / target sets and then into training and test sets:

In [7]:
features = df.drop('Churn', axis=1)
targets = df['Churn']

x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state=42)

Next, I will use TPOT to establish an instance of the model, fit the model with the data, and evaluate the data:

In [8]:
%%time
tpot = TPOTClassifier(generations=5, population_size=50, cv=5,random_state=42, scoring='accuracy', verbosity=2, n_jobs=-1)

tpot.fit(x_train, y_train)
print(tpot.score(x_test, y_test))

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7976100885869097

Generation 2 - Current best internal CV score: 0.7976100885869097

Generation 3 - Current best internal CV score: 0.7976100885869097

Generation 4 - Current best internal CV score: 0.7976100885869097

Generation 5 - Current best internal CV score: 0.7985568791032367

Best pipeline: MLPClassifier(ZeroCount(RobustScaler(input_matrix)), alpha=0.0001, learning_rate_init=0.01)
0.7989778534923339
CPU times: user 53.4 s, sys: 3.5 s, total: 56.9 s
Wall time: 3min 2s


Here, TPOT did the work of evaluating the performance of the model against the training data. 

Next I will use this TPOT model to make predictions for the test dataset:

In [9]:
predictions = tpot.predict(x_test)
predictions

array([0, 0, 0, ..., 0, 0, 0])

Next, I will compare the TPOT predictions against the actuals for the test dataset:

In [10]:
print('Predictions for test data set')
print(predictions)
print('Actuals for test data set')
print(y_test)

Predictions for test data set
[0 0 0 ... 0 0 0]
Actuals for test data set
customerID
5343-SGUBI    0
5442-BXVND    0
6434-TTGJP    0
1628-BIZYP    0
0298-XACET    0
             ..
3780-DDGSE    0
4154-AQUGT    1
1116-DXXDF    0
3237-AJGEH    1
6704-UTUKK    0
Name: Churn, Length: 1761, dtype: int64


In [11]:
from sklearn.metrics import accuracy_score
print(f'Accuracy of the TPOT predictions: {accuracy_score(y_test,predictions)}')

Accuracy of the TPOT predictions: 0.7989778534923339


When comparing the two, a similar accuracy is evident to the ones prior.

Next, I will save this trained model so I can use it in a Python file later:

In [12]:
tpot.export('tpot_churn_pipeline.py')

Now, I will use this model in a Python file in order to now take in new data and make a prediction. 

Using Sublime Text, I composed a Python file, as shown below:

In [13]:
from IPython.display import Code

Code('tpot_churn_pipeline.py')

I have altered the generic file and filled it in for what was needed. Now, I am able to see how the rest of it works:

In [14]:
Code('tpot_churn_pipeline_filledin.py')

Next, I will test out running the file with Jupyter's "magic" command, %run:

In [15]:
%run tpot_churn_pipeline_filledin.py

[1 0 0 ... 0 1 0]


In [16]:
predictions

array([0, 0, 0, ..., 0, 0, 0])

This prediction matches the one from earlier.

Next, I will save this code to GitHub.

# Summary

The auto ML package I used for this assignment was TPOT. After downloading this package, I used the prepared and cleaned churn data from week 2 and broke the data into features and targets, as well as train and test sets. Next, I used TPOT to evaluate the performance of the model against the training data. TPOT was able to find an ML algorithm that performed best on the data. This was a MLPClassifier, which is a feedforward neural network that performs the task of classifying the churn dataset. Next, I compared TPOT's predictions to the actuals for the test dataset and found that the two accuracy scores were very similar. Following my comparrison, I saved the trained model into a Python so that It could take in new data and make a prediction.Using Sublime Text, I composed a Python file. Next, I tested and ran the file and was accurate with the true values for the new data, which are [1, 0, 0, 1, 0].