# DS Automation Assignment

Using our prepared churn data from week 2:
- use TPOT to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
    - REMEMBER: TPOT only finds the optimized processing pipeline and model. It doesn't create the model. 
        - You can use `tpot.export('my_model_name.py')` (assuming you called your TPOT object tpot) and it will save a Python template with an example of the optimized pipeline. 
        - Use the template code saved from the `export()` function in your program.
- create a Python script/file/module using code from the exported template above that
    - create a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as pycaret, H2O, MLBox, etc., and compare performance and features with TPOT
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
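
The percentile challenge in the list above can be sketched as follows. This is a minimal illustration rather than part of the submitted module: `percentile_rank` is a hypothetical helper name, and the training probabilities are made-up numbers chosen so that 0.78 falls at the 90th percentile, as in the example in the brief.

```python
from bisect import bisect_right


def percentile_rank(train_probs, prob):
    """Return the percentage of training probabilities that are <= prob."""
    ordered = sorted(train_probs)
    if not ordered:
        raise ValueError("train_probs must not be empty")
    # bisect_right counts how many sorted values are less than or equal to prob
    return 100.0 * bisect_right(ordered, prob) / len(ordered)


# Hypothetical training-set churn probabilities, for illustration only
example_train_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.75, 0.90]
example_rank = percentile_rank(example_train_probs, 0.78)
print(example_rank)
```

In practice the training probabilities would come from the fitted pipeline's `predict_proba` output on the training features.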

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.model_selection import train_test_split

from tpot import TPOTClassifier
import timeit 



In [2]:
# we can give an index number or name for our index column, or leave it blank
df = pd.read_csv('/Users/Kristian11rush/My Python Stuff/MSDS600/Week 2/Assignment/prepared_churn_data_TC0.csv', index_col='customerID')
df.head()

| customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | Churn | total_charge_tenure_ratio | monthly_charge_tenure_ratio_equivalence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7590-VHVEG | 1 | 0 | 0 | 0 | 29.85 | 29.85 | 0 | 29.85 | 1.0 |
| 5575-GNVDE | 34 | 1 | 1 | 1 | 56.95 | 1889.5 | 0 | 55.573529 | 0.97583 |
| 3668-QPYBK | 2 | 1 | 0 | 1 | 53.85 | 108.15 | 1 | 54.075 | 1.004178 |
| 7795-CFOCW | 45 | 0 | 1 | 2 | 42.3 | 1840.75 | 0 | 40.905556 | 0.967034 |
| 9237-HQITU | 2 | 1 | 0 | 0 | 70.7 | 151.65 | 1 | 75.825 | 1.072489 |


In [3]:
features = df.drop('Churn', axis=1)
targets = df['Churn']

In [4]:
# split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(features, targets, train_size=0.8, test_size=0.2, random_state=42)


In [5]:
%%time
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

Generation 1 - Current best internal CV score: 0.794816083820199

Generation 2 - Current best internal CV score: 0.7951697475654097

Generation 3 - Current best internal CV score: 0.7974778605549873

Generation 4 - Current best internal CV score: 0.7974778605549873

Generation 5 - Current best internal CV score: 0.7980105626311069

Best pipeline: ExtraTreesClassifier(StandardScaler(MultinomialNB(input_matrix, alpha=100.0, fit_prior=False)), bootstrap=True, criterion=entropy, max_features=0.8, min_samples_leaf=19, min_samples_split=5, n_estimators=100)
0.808374733853797
CPU times: user 1min 8s, sys: 3.09 s, total: 1min 11s
Wall time: 8min 16s


In [6]:
tpot.fitted_pipeline_

Pipeline(steps=[('stackingestimator',
                 StackingEstimator(estimator=MultinomialNB(alpha=100.0,
                                                           fit_prior=False))),
                ('standardscaler', StandardScaler()),
                ('extratreesclassifier',
                 ExtraTreesClassifier(bootstrap=True, criterion='entropy',
                                      max_features=0.8, min_samples_leaf=19,
                                      min_samples_split=5, random_state=42))])

In [7]:
tpot.export('predict_churn_tpot_model.py')

### Clean the new churn data so it matches the training data format

In [8]:
new_df = pd.read_csv('/Users/Kristian11rush/My Python Stuff/MSDS600/Week 5/TPOT/Assignment/new_churn_data_unmodified.csv', index_col='customerID')
new_df.head()

| customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges |
| --- | --- | --- | --- | --- | --- | --- |
| 9305-CKSKC | 22 | Yes | Month-to-month | Electronic check | 97.4 | 811.7 |
| 1452-KNGVK | 8 | No | One year | Mailed check | 77.3 | 1701.95 |
| 6723-OKKJM | 28 | Yes | Month-to-month | Credit card (automatic) | 28.25 | 250.9 |
| 7832-POPKP | 62 | Yes | Month-to-month | Electronic check | 101.7 | 3106.56 |
| 6348-TACGU | 10 | No | Two year | Credit card (automatic) | 51.15 | 3440.97 |


In [9]:
# check for missing values; none are found, so no imputation is needed
new_df.isna().sum()

tenure            0
PhoneService      0
Contract          0
PaymentMethod     0
MonthlyCharges    0
TotalCharges      0
dtype: int64

In [10]:
#convert categorical data to numeric data
new_df['PaymentMethod'] = new_df['PaymentMethod'].replace({'Electronic check': 0, 'Mailed check': 1,'Bank transfer (automatic)':2,'Credit card (automatic)':3})
new_df['Contract'] = new_df['Contract'].replace({'Month-to-month': 0, 'One year': 1,'Two year':2})
new_df['PhoneService'] = new_df['PhoneService'].replace({'No': 0, 'Yes': 1})
new_df.head()

| customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges |
| --- | --- | --- | --- | --- | --- | --- |
| 9305-CKSKC | 22 | 1 | 0 | 0 | 97.4 | 811.7 |
| 1452-KNGVK | 8 | 0 | 1 | 1 | 77.3 | 1701.95 |
| 6723-OKKJM | 28 | 1 | 0 | 3 | 28.25 | 250.9 |
| 7832-POPKP | 62 | 1 | 0 | 0 | 101.7 | 3106.56 |
| 6348-TACGU | 10 | 0 | 2 | 3 | 51.15 | 3440.97 |


In [11]:
#Add same features found in training data set to new data set
new_df['total_charge_tenure_ratio'] = new_df['TotalCharges'] / new_df['tenure']
new_df['monthly_charge_tenure_ratio_equivalence'] = new_df['total_charge_tenure_ratio'] / new_df['MonthlyCharges']
new_df.head()

| customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | total_charge_tenure_ratio | monthly_charge_tenure_ratio_equivalence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9305-CKSKC | 22 | 1 | 0 | 0 | 97.4 | 811.7 | 36.895455 | 0.378803 |
| 1452-KNGVK | 8 | 0 | 1 | 1 | 77.3 | 1701.95 | 212.74375 | 2.752183 |
| 6723-OKKJM | 28 | 1 | 0 | 3 | 28.25 | 250.9 | 8.960714 | 0.317193 |
| 7832-POPKP | 62 | 1 | 0 | 0 | 101.7 | 3106.56 | 50.105806 | 0.492682 |
| 6348-TACGU | 10 | 0 | 2 | 3 | 51.15 | 3440.97 | 344.097 | 6.727214 |


In [12]:
#pathname is '/Users/Kristian11rush/My Python Stuff/MSDS600/Week 5/TPOT/Assignment/new_prepared_churn_data.csv'
new_df.to_csv('new_prepared_churn_data.csv')
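
The cleaning cells above could be collected into one reusable function, which is roughly what the optional "unmodified data" challenge asks for. The sketch below is one possible arrangement, not the code that was submitted: `prepare_churn_data` and the three `*_CODES` dictionaries are names introduced here, while the category codes and the two ratio features are copied from the cells above.

```python
import pandas as pd

# Category codes copied from the notebook cells above
PAYMENT_CODES = {
    'Electronic check': 0,
    'Mailed check': 1,
    'Bank transfer (automatic)': 2,
    'Credit card (automatic)': 3,
}
CONTRACT_CODES = {'Month-to-month': 0, 'One year': 1, 'Two year': 2}
PHONE_CODES = {'No': 0, 'Yes': 1}


def prepare_churn_data(raw_df):
    """Return a copy of raw_df encoded and extended like the training data."""
    df = raw_df.copy()
    df['PaymentMethod'] = df['PaymentMethod'].map(PAYMENT_CODES)
    df['Contract'] = df['Contract'].map(CONTRACT_CODES)
    df['PhoneService'] = df['PhoneService'].map(PHONE_CODES)
    # Same engineered features as in the training data
    df['total_charge_tenure_ratio'] = df['TotalCharges'] / df['tenure']
    df['monthly_charge_tenure_ratio_equivalence'] = (
        df['total_charge_tenure_ratio'] / df['MonthlyCharges']
    )
    return df


# One row taken from the new data shown above
example_raw = pd.DataFrame(
    {
        'tenure': [22],
        'PhoneService': ['Yes'],
        'Contract': ['Month-to-month'],
        'PaymentMethod': ['Electronic check'],
        'MonthlyCharges': [97.4],
        'TotalCharges': [811.7],
    },
    index=pd.Index(['9305-CKSKC'], name='customerID'),
)
example_prepared = prepare_churn_data(example_raw)
print(example_prepared)
```

Note that `map` turns any category missing from the dictionaries into `NaN`, whereas `replace` would leave it unchanged. The sketch also does not handle a tenure of 0, which would make the first ratio infinite or undefined.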

### Use model to predict churn in new data

In [13]:
#Original py script
from IPython.display import Code

Code(filename='predict_churn_tpot_model.py')

I modified the exported template so that it reads the new data and prints a churn prediction for each customer:

In [14]:
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import StandardScaler
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive

# NOTE: the outcome column in this data file is labeled 'Churn' (the TPOT template assumes 'target')
tpot_data = pd.read_csv('/Users/Kristian11rush/My Python Stuff/MSDS600/Week 2/Assignment/prepared_churn_data_TC0.csv', index_col='customerID')
features = tpot_data.drop('Churn', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['Churn'], random_state=42)

# Average CV score on the training set was: 0.7980105626311069
exported_pipeline = make_pipeline(
    StackingEstimator(estimator=MultinomialNB(alpha=100.0, fit_prior=False)),
    StandardScaler(),
    ExtraTreesClassifier(bootstrap=True, criterion="entropy", max_features=0.8, min_samples_leaf=19, min_samples_split=5, n_estimators=100)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(training_features, training_target)

if __name__ == "__main__":
    new_data = pd.read_csv('/Users/Kristian11rush/My Python Stuff/MSDS600/Week 5/TPOT/Assignment/new_prepared_churn_data.csv', index_col='customerID')
    print('predictions:')
    print(exported_pipeline.predict(new_data))

predictions:
[0 0 0 0 0]
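
The cell above prints hard class labels from `predict`, while the assignment asks for a function that returns the probability of churn for each row. A minimal sketch of such a function follows. `predict_churn_probability` is a hypothetical name, and the model here is a stand-in fitted on synthetic data so that the example runs on its own; in the notebook, the fitted `exported_pipeline` and the prepared new data would be passed in instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier


def predict_churn_probability(fitted_model, df):
    """Return a Series with the predicted probability of churn (class 1) per row."""
    # predict_proba columns follow fitted_model.classes_, so look up class 1
    churn_column = list(fitted_model.classes_).index(1)
    probabilities = fitted_model.predict_proba(df)[:, churn_column]
    return pd.Series(probabilities, index=df.index, name='churn_probability')


# Stand-in model fitted on synthetic data, for illustration only
rng = np.random.default_rng(42)
toy_features = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
toy_target = (toy_features['a'] > 0).astype(int)
toy_model = ExtraTreesClassifier(n_estimators=50, random_state=42)
toy_model.fit(toy_features, toy_target)

toy_probabilities = predict_churn_probability(toy_model, toy_features.head())
print(toy_probabilities)
```

Returning a Series indexed like the input keeps each probability attached to its `customerID` when the input DataFrame uses that column as its index.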


The model predicted no churn for all five customers, so two of the five predictions are incorrect; the true values are [1, 0, 0, 1, 0].
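
Comparing the printed predictions with the true values by hand gives the arithmetic below. It is written in plain Python so each count is visible. `classification_summary` is a helper name introduced here, and reporting precision as 0.0 when no positives are predicted is a convention assumed for this sketch (the quantity is strictly undefined in that case).

```python
def classification_summary(y_true, y_pred):
    """Compute accuracy, recall and precision for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        'accuracy': (tp + tn) / len(pairs),
        # Fall back to 0.0 when the denominator is zero (assumed convention)
        'recall': tp / (tp + fn) if (tp + fn) else 0.0,
        'precision': tp / (tp + fp) if (tp + fp) else 0.0,
    }


true_values = [1, 0, 0, 1, 0]   # from the assignment brief
predictions = [0, 0, 0, 0, 0]   # printed by the script above
summary = classification_summary(true_values, predictions)
print(summary)
```

Accuracy is 3 of 5, but recall is 0 because neither churner was identified. That gap is one reason the brief suggests considering metrics other than accuracy when searching for a model.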

In [15]:
new_df.head()

| customerID | tenure | PhoneService | Contract | PaymentMethod | MonthlyCharges | TotalCharges | total_charge_tenure_ratio | monthly_charge_tenure_ratio_equivalence |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 9305-CKSKC | 22 | 1 | 0 | 0 | 97.4 | 811.7 | 36.895455 | 0.378803 |
| 1452-KNGVK | 8 | 0 | 1 | 1 | 77.3 | 1701.95 | 212.74375 | 2.752183 |
| 6723-OKKJM | 28 | 1 | 0 | 3 | 28.25 | 250.9 | 8.960714 | 0.317193 |
| 7832-POPKP | 62 | 1 | 0 | 0 | 101.7 | 3106.56 | 50.105806 | 0.492682 |
| 6348-TACGU | 10 | 0 | 2 | 3 | 51.15 | 3440.97 | 344.097 | 6.727214 |


Earlier analysis showed that customers with higher tenure usually do not churn, so I'm surprised that the two customers who actually churned have tenures of 22 and 62. Both do have high monthly charges, though (97.4 and 101.7).

# Summary

TPOT was used to search for the machine learning pipeline and hyperparameters that maximize accuracy for customer churn prediction. The previously prepared churn data was loaded into the notebook and split into training and test sets (80/20), and the training set was passed to TPOT. After five generations (about eight minutes of wall time), the best pipeline found was a stacked MultinomialNB estimator and a StandardScaler feeding an Extra Trees classifier, with an internal cross-validation score of 0.798. This pipeline was exported as a Python script.

Before the script was used on new data, the new dataset was prepared by checking for missing values (none were found), converting the categorical columns to numeric codes, and engineering the same two ratio features so that its format matches the training data. The exported script was then modified to read the new churn dataset and print a predicted churn value for each customer; both the original and the modified script are shown above. The pipeline scored 80.84% accuracy on the held-out test set, but on the new data it predicted no churn for every customer, so it was correct for only 3 of 5 customers (60%) and missed both customers who actually churned.