# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
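
For the percentile challenge in the list above, one possible approach is to rank a new prediction against the sorted probabilities predicted for the training set. The sketch below uses only the standard library; the function name `probability_percentile` and the example probabilities are made up for illustration and do not come from the churn data.

```python
from bisect import bisect_right


def probability_percentile(train_probabilities, new_probability):
    """Return the percent of training probabilities <= new_probability."""
    ordered = sorted(train_probabilities)
    # bisect_right counts how many sorted values are <= new_probability.
    rank = bisect_right(ordered, new_probability)
    return 100.0 * rank / len(ordered)


# Made-up training-set probabilities, for illustration only.
example_train_probs = [0.05, 0.10, 0.20, 0.25, 0.30, 0.40, 0.55, 0.60, 0.80, 0.90]

# 8 of the 10 example values are <= 0.78.
print(probability_percentile(example_train_probs, 0.78))  # 80.0
```

With a real model, `train_probabilities` would be the churn-class probabilities predicted for the training rows.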

Install Python packages.

In [27]:
!pip install deap update_checker tqdm stopit xgboost



In [28]:
!pip install tpot



Import dependencies.

In [29]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

import timeit 

Import Churn data.

In [30]:
df = pd.read_csv(r"C:\Users\peter\MSDS600\Week_5\churn_data_prepped.csv", index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,totalcharges_tenure_ratio,charges_ratio
1,1,0,0,0,29.85,29.85,0,29.85,1.00
2,34,1,1,1,56.95,1889.50,0,55.57,33.18
3,2,1,0,1,53.85,108.15,1,54.08,2.01
4,45,0,1,2,42.30,1840.75,0,40.91,43.52
5,2,1,0,0,70.70,151.65,1,75.83,2.14
...,...,...,...,...,...,...,...,...,...
7028,24,1,1,1,84.80,1990.50,0,82.94,23.47
7029,72,1,1,3,103.20,7362.90,0,102.26,71.35
7030,11,0,0,0,29.60,346.45,0,31.50,11.70
7031,4,1,0,1,74.40,306.60,1,76.65,4.12


In [31]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7032 entries, 1 to 7032
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   tenure                     7032 non-null   int64  
 1   PhoneService               7032 non-null   int64  
 2   Contract                   7032 non-null   int64  
 3   PaymentMethod              7032 non-null   int64  
 4   MonthlyCharges             7032 non-null   float64
 5   TotalCharges               7032 non-null   float64
 6   Churn                      7032 non-null   int64  
 7   totalcharges_tenure_ratio  7032 non-null   float64
 8   charges_ratio              7032 non-null   float64
dtypes: float64(4), int64(5)
memory usage: 549.4 KB


Split Churn dataset into feature/target sets and then into training and test sets.

In [32]:
features = df.drop('Churn', axis=1)
targets = df['Churn']

x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state=42)

Run TPOT model.

In [33]:
%%time
tpot = TPOTClassifier(generations=5, population_size=50, cv=5, random_state=42, scoring='accuracy', verbosity=2, n_jobs=-1)

tpot.fit(x_train, y_train)
print(tpot.score(x_test, y_test))

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7971168286914215

Generation 2 - Current best internal CV score: 0.7978756621131865

Generation 3 - Current best internal CV score: 0.7978756621131865

Generation 4 - Current best internal CV score: 0.7980652355729021

Generation 5 - Current best internal CV score: 0.7980652355729021

Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=False, criterion=gini, max_features=0.25, min_samples_leaf=7, min_samples_split=6, n_estimators=100)
0.7935153583617748
CPU times: total: 28.6 s
Wall time: 4min 42s


Use the TPOT model to make predictions for the test dataset.

In [34]:
predictions = tpot.predict(x_test)
predictions

array([1, 0, 0, ..., 0, 0, 1], dtype=int64)

Compare the TPOT model's predictions against the actuals for the test dataset.

In [35]:
print('Predictions for test data set')
print(predictions)
print('Actuals for test data set')
print(y_test)

Predictions for test data set
[1 0 0 ... 0 0 1]
Actuals for test data set
customerID
943     1
504     0
4849    1
4144    0
5876    1
       ..
3679    0
6246    0
5497    0
3838    1
5720    1
Name: Churn, Length: 1758, dtype: int64


Score the accuracy of the model.

In [36]:
from sklearn.metrics import accuracy_score
print(f'Accuracy of the TPOT predictions: {accuracy_score(y_test,predictions)}')

Accuracy of the TPOT predictions: 0.7935153583617748
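
Accuracy alone does not show how many actual churners the model catches, which matters when the classes are imbalanced. The following is a minimal sketch of deriving precision and recall from confusion counts. It uses only the standard library and made-up labels rather than this notebook's `y_test` and `predictions`, and the helper name `confusion_counts` is illustrative.

```python
def confusion_counts(actuals, predicted):
    """Return (tp, fp, fn, tn) for binary labels where 1 means churn."""
    tp = fp = fn = tn = 0
    for actual, pred in zip(actuals, predicted):
        if actual == 1 and pred == 1:
            tp += 1  # churner correctly flagged
        elif actual == 0 and pred == 1:
            fp += 1  # non-churner wrongly flagged
        elif actual == 1 and pred == 0:
            fn += 1  # churner missed
        else:
            tn += 1  # non-churner correctly passed
    return tp, fp, fn, tn


# Made-up labels for illustration; not results from this notebook.
example_actuals = [1, 0, 0, 1, 0, 1, 0, 0]
example_predicted = [1, 0, 0, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(example_actuals, example_predicted)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp) if (tp + fp) else 0.0  # share of flagged rows that churned
recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of churners that were flagged

print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}')
```

On the real test split, `precision_score`, `recall_score`, and `roc_auc_score` from `sklearn.metrics` compute these metrics directly.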


Create a Python file for the trained model.

In [37]:
tpot.export('tpot_churn_pipeline.py')

Review the exported code and edit it to fit the local environment.

In [45]:
from IPython.display import Code

Code('tpot_churn_pipeline.py')

Test run the created Python file.

In [47]:
%run tpot_churn_pipeline.py

[0 0 1 ... 0 1 0]


In [48]:
predictions

array([1, 0, 0, ..., 0, 0, 1], dtype=int64)

Update the code to run on the new churn data.

In [51]:
Code('tpot_churn_pipeline_v2.py')

Test run the updated file.

In [None]:
%run tpot_churn_pipeline_v2.py

In [53]:
predictions

array([1, 0, 0, ..., 0, 0, 1], dtype=int64)

# Summary

In summary, I completed the following:
1. Used the TPOT autoML Python package to find an optimized ML model for our churn dataset.
2. Created a Python script to ingest new data and make predictions on it.
3. Created a GitHub repository and uploaded the code there.

I was able to create a Python script with a function that takes a pandas dataframe as input and returns the probability of churn for each row in the dataframe. However, I was not able to test my Python module and function with the new data (new_churn_data.csv) to print out the predictions for that dataset.
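
As a rough illustration of the kind of function described above, here is a minimal sketch that wraps any fitted classifier exposing `predict_proba` and returns one churn probability per row. It is not the contents of `tpot_churn_pipeline_v2.py`. The class `StandInModel`, the function name `predict_churn_probability`, and the example rows are hypothetical placeholders, so the block runs without the trained TPOT pipeline.

```python
import numpy as np
import pandas as pd


class StandInModel:
    """Hypothetical placeholder for a fitted classifier with predict_proba."""

    def predict_proba(self, X):
        # Fake rule for illustration only: shorter tenure gives a higher score.
        p_churn = 1.0 / (1.0 + X['tenure'].to_numpy() / 12.0)
        # Column 0 is P(no churn), column 1 is P(churn), mirroring classes [0, 1].
        return np.column_stack([1.0 - p_churn, p_churn])


def predict_churn_probability(df, model, target_col='Churn'):
    """Return a Series of churn probabilities, one per row of df."""
    # Drop the target column if it is present so only features reach the model.
    features = df.drop(columns=[target_col], errors='ignore')
    # Assumes the model's classes are ordered [0, 1], so column 1 is churn.
    proba = model.predict_proba(features)[:, 1]
    return pd.Series(proba, index=df.index, name='churn_probability')


# Made-up rows for illustration; not taken from new_churn_data.csv.
example_df = pd.DataFrame(
    {'tenure': [0, 12, 36], 'MonthlyCharges': [70.0, 50.0, 90.0]},
    index=pd.Index([101, 102, 103], name='customerID'),
)

probabilities = predict_churn_probability(example_df, StandInModel())
print(probabilities)
```

To try this on the real data, the stand-in would be replaced by the fitted model, and the input would be loaded with `pd.read_csv`, assuming `new_churn_data.csv` has the same columns as the training features.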