# DS Automation Assignment

Using our prepared churn data from week 2:
- use TPOT to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
    - REMEMBER: TPOT only finds the optimized processing pipeline and model. It doesn't create the model. 
        - You can use `tpot.export('my_model_name.py')` (assuming you called your TPOT object tpot) and it will save a Python template with an example of the optimized pipeline. 
        - Use the template code saved from the `export()` function in your program.
- create a Python script/file/module using code from the exported template above that
    - create a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as pycaret, H2O, MLBox, etc, and compare performance and features with TPOT
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
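
The percentile challenge above can be sketched roughly as follows. This is a minimal illustration, assuming the churn probabilities predicted for the training set are already available as an array. The function name `probability_percentiles` and the sample numbers are hypothetical and do not come from the assignment or the churn data.

```python
import numpy as np

def probability_percentiles(train_probs, new_probs):
    """For each new probability, return the percentage of training
    probabilities that are less than or equal to it."""
    train_sorted = np.sort(np.asarray(train_probs, dtype=float))
    new_probs = np.asarray(new_probs, dtype=float)
    # With side="right", searchsorted counts training values <= each new value
    counts = np.searchsorted(train_sorted, new_probs, side="right")
    return 100.0 * counts / train_sorted.size

# Hypothetical probabilities, for illustration only
train_probs = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95]
new_probs = [0.78, 0.10]
percentiles = probability_percentiles(train_probs, new_probs)
print(percentiles)
```

With these made-up numbers, a predicted probability of 0.78 is greater than or equal to 8 of the 10 training probabilities, so it sits at the 80th percentile.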

## Use TPOT to find an ML algorithm that performs best on the Churn data

In [1]:
# import packages
import timeit
import warnings

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from IPython.display import Code
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tpot import TPOTClassifier

%matplotlib inline
warnings.filterwarnings("ignore")


In [2]:
# load churn data
churn_df = pd.read_csv('../week_3/churn_data.csv',index_col='customerID')
churn_df.head()

customerID,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,month_total_ratio,total_tenure_ratio,log_total_tenure_ratio
7590-VHVEG,0,1,0,0,0,29.85,29.85,0,1.0,29.85,3.396185
5575-GNVDE,1,34,1,1,1,56.95,1889.5,0,0.03014,55.573529,4.017707
3668-QPYBK,2,2,1,0,1,53.85,108.15,1,0.49792,54.075,3.990372
7795-CFOCW,3,45,0,1,2,42.3,1840.75,0,0.02298,40.905556,3.711266
9237-HQITU,4,2,1,0,0,70.7,151.65,1,0.466205,75.825,4.328428


In [3]:
#define features and targets
features = churn_df.drop('Churn', axis=1)
targets = churn_df['Churn']

In [4]:
# split into training and testing datasets
X_train, X_test, y_train, y_test = train_test_split(features, targets, train_size=0.8, test_size=0.2, random_state=42)

In [5]:
#run TPOTClassifier - test accuracy
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))

                                                                                
Generation 1 - Current best internal CV score: 0.7969923270708181
Generation 2 - Current best internal CV score: 0.7986029855718182
Generation 3 - Current best internal CV score: 0.7986029855718182
Generation 4 - Current best internal CV score: 0.7986029855718182
Generation 5 - Current best internal CV score: 0.7993185490780794
                                                                                
Best pipeline: XGBClassifier(BernoulliNB(input_matrix, alpha=0.001, fit_prior=True), learning_rate=0.1, max_depth=3, min_child_weight=2, n_estimators=100, n_jobs=1, subsample=0.8500000

With the default accuracy metric, TPOT selected a pipeline that stacks a BernoulliNB estimator into an XGBClassifier as the best performer. I also want to evaluate other metrics (AUC, precision, and recall) before settling on a model. TPOT's `scoring` parameter replaces the default accuracy metric, so I can rerun the search with each metric to check whether the XGBClassifier is still the most appropriate ML model for these data.

In [None]:
# run TPOTClassifier - test AUC score
tpot_auc = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1, random_state=42, scoring='roc_auc')
tpot_auc.fit(X_train, y_train)
print(tpot_auc.score(X_test, y_test))

In [None]:
#run TPOTClassifier - test precision
tpot_precision = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1, random_state=42, scoring='precision')
tpot_precision.fit(X_train, y_train)
print(tpot_precision.score(X_test, y_test))

In [None]:
#run TPOTClassifier - test recall
tpot_recall = TPOTClassifier(generations=5, population_size=50, verbosity=2, n_jobs=-1, random_state=42, scoring='recall')
tpot_recall.fit(X_train, y_train)
print(tpot_recall.score(X_test, y_test))

Changing the `scoring` parameter from the default accuracy to other metrics shows that the XGBClassifier may not be the most appropriate ML model for the churn dataset under those metrics. It looks like the ExtraTreesClassifier model produced the best AUC score, the GradientBoostingClassifier model produced the best precision, and the GaussianNB model produced the best recall score.
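
One way to make this kind of comparison concrete is to score a single fitted model on the same held-out data with all four metrics. The sketch below is only illustrative: it uses synthetic data from `make_classification` and scikit-learn's `GradientBoostingClassifier` as stand-ins. Both are assumptions, not the churn data or the pipelines TPOT found above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data standing in for the churn dataset
X, y = make_classification(n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Stand-in model, fitted once and then scored with several metrics
clf = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
pred = clf.predict(X_te)              # hard class labels
prob = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

scores = {
    "accuracy": accuracy_score(y_te, pred),
    "roc_auc": roc_auc_score(y_te, prob),  # AUC needs probabilities, not labels
    "precision": precision_score(y_te, pred, zero_division=0),
    "recall": recall_score(y_te, pred, zero_division=0),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

On imbalanced data, accuracy can look good even when recall for the positive class is low, which is why the choice of metric can change which model looks best.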

Moving forward, I will use the XGBClassifier pipeline because it produced the highest accuracy score.

## Create a Python script/file/module using code from the exported template

In [6]:
# export the best pipeline found by TPOT as a Python template
tpot.export('XGBClassifier.py')

In [7]:
#display template
Code('XGBClassifier.py')

In [16]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline
from tpot.builtins import StackingEstimator
from tpot.export_utils import set_param_recursive
from xgboost import XGBClassifier

In [330]:
new_churn_df = pd.read_csv('../week_5/new_churn_data.csv', index_col='customerID')

In [331]:
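# define features and targets
features = new_churn_df
# NOTE: these are the labels of the first five training rows, taken by position;
# they are not the true labels of the new data ([1, 0, 0, 1, 0])
targets = churn_df['Churn'][:5]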

In [332]:
targets

customerID
7590-VHVEG    0
5575-GNVDE    0
3668-QPYBK    1
7795-CFOCW    0
9237-HQITU    1
Name: Churn, dtype: int64

In [333]:
X_train, X_test, y_train, y_test = train_test_split(features, targets, train_size=0.8, test_size=0.2, random_state=42)

In [334]:
# rebuild the exported pipeline: BernoulliNB predictions are stacked
# onto the features before they are passed to XGBClassifier
model = make_pipeline(
    StackingEstimator(estimator=BernoulliNB(alpha=0.001, fit_prior=True)),
    XGBClassifier(learning_rate=0.1, max_depth=3, min_child_weight=2, n_estimators=100, n_jobs=1, subsample=0.8500000000000001, verbosity=0)
)

# fix random_state on every pipeline step for reproducibility
set_param_recursive(model.steps, 'random_state', 42)
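
To illustrate what the stacking step in this pipeline does, here is a rough sketch using plain scikit-learn and NumPy. As I understand TPOT's `StackingEstimator`, it fits the wrapped estimator and then prepends its predicted class and class probabilities to the original features as synthetic columns. The code below mimics that behavior by hand on synthetic data (an assumption; it is not the churn data and does not call TPOT itself).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB

# Synthetic binary data, for illustration only
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

# First-stage estimator, with the same settings as the exported pipeline
nb = BernoulliNB(alpha=0.001, fit_prior=True).fit(X, y)

# Stack: predicted class, class probabilities, then the original features
stacked = np.hstack([
    nb.predict(X).reshape(-1, 1),
    nb.predict_proba(X),
    X,
])
print(X.shape, "->", stacked.shape)
```

For a binary target this adds three columns (one predicted label and two class probabilities), which the second-stage model can then use alongside the original features.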


In [335]:
# fit the model
model.fit(X_train, y_train)

In [336]:
# predict on the held-out split only (20% of 5 rows = 1 row)
y_hats = model.predict(X_test)
len(y_hats)

1

In [337]:
y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(new_churn_df, y_hats_df, how = 'left', left_index = True, right_index = True)
df_out

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,y_hats
9305-CKSKC,22,1,0,2,97.4,811.7,36.895455,
1452-KNGVK,8,0,1,1,77.3,1701.95,212.74375,0.0
6723-OKKJM,28,1,0,0,28.25,250.9,8.960714,
7832-POPKP,62,1,0,2,101.7,3106.56,50.105806,
6348-TACGU,10,0,0,1,51.15,3440.97,344.097,


- create a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
- your Python file/function should print out the predictions for new data (new_churn_data.csv)
- the true values for the new data are [1, 0, 0, 1, 0] if you're interested

In [348]:
# define features and targets
features = new_churn_df
targets = churn_df['Churn'][:5]
X_train, X_test, y_train, y_test = train_test_split(features, targets, train_size=0.8, test_size=0.2, random_state=42)
# NOTE: this creates a new, unfitted XGBClassifier (without the BernoulliNB stacking step)
model = XGBClassifier(learning_rate=0.1, max_depth=3, min_child_weight=2, n_estimators=100, n_jobs=1, subsample=0.8500000000000001, verbosity=0)

In [356]:
def load_data(filepath):
    """Load a churn CSV file, using customerID as the index."""
    df = pd.read_csv(filepath, index_col='customerID')
    return df

def make_predictions(df):
    features = df
    # NOTE: placeholder labels (first five training rows), not the true labels of the new data
    targets = churn_df['Churn'][:5]
    X_train, X_test, y_train, y_test = train_test_split(features, targets, train_size=0.8, test_size=0.2, random_state=42)
    model = XGBClassifier(learning_rate=0.1, max_depth=3, min_child_weight=2, n_estimators=100, n_jobs=1, subsample=0.8500000000000001, verbosity=0)

    # BUG: the model is never fitted, so predict() raises NotFittedError;
    # model.fit(X_train, y_train) is needed before this call
    predictions = model.predict(X_test)
    y_hats_df = pd.DataFrame(data=predictions, columns=['y_hats'], index=X_test.index)
    df_out = pd.merge(new_churn_df, y_hats_df, how='left', left_index=True, right_index=True)

    return df_out['y_hats']

if __name__ == "__main__":
    df = load_data('../week_5/new_churn_data.csv')
    

In [357]:
y_hats_df

customerID,y_hats
1452-KNGVK,0


In [358]:
df_out

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,y_hats
9305-CKSKC,22,1,0,2,97.4,811.7,36.895455,
1452-KNGVK,8,0,1,1,77.3,1701.95,212.74375,0.0
6723-OKKJM,28,1,0,0,28.25,250.9,8.960714,
7832-POPKP,62,1,0,2,101.7,3106.56,50.105806,
6348-TACGU,10,0,0,1,51.15,3440.97,344.097,


In [359]:
new_churn_df['predictions']=new_churn_df.apply(make_predictions)

NotFittedError: need to call fit or load_model beforehand
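
The `NotFittedError` arises because `make_predictions` creates a new `XGBClassifier` and calls `predict()` on it without ever calling `fit()`. In addition, `DataFrame.apply` passes one column at a time to the function rather than the whole dataframe. A minimal sketch of one way to structure this is shown below, under stated assumptions: the model is fitted once on training data, and the prediction function only calls `predict_proba` on the new rows. The names `fit_model` and `predict_churn_probability`, the toy data, and the use of scikit-learn's `GradientBoostingClassifier` in place of the exported XGBoost pipeline are all illustrative assumptions.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

def fit_model(train_df, target_col="Churn"):
    """Fit a stand-in classifier on all columns except the target."""
    features = train_df.drop(columns=[target_col])
    model = GradientBoostingClassifier(random_state=42)
    model.fit(features, train_df[target_col])
    return model, list(features.columns)

def predict_churn_probability(model, feature_cols, df):
    """Return the predicted probability of churn (class 1) for each row."""
    # Select the training columns, in the training order, before predicting
    probs = model.predict_proba(df[feature_cols])[:, 1]
    return pd.Series(probs, index=df.index, name="churn_probability")

# Toy data, for illustration only (not the real churn data)
train_df = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 2, 8, 22, 10, 28, 62],
    "MonthlyCharges": [29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.75, 104.80, 56.15],
    "Churn": [0, 0, 1, 0, 1, 1, 0, 0, 1, 0],
})
new_df = pd.DataFrame(
    {"tenure": [22, 8], "MonthlyCharges": [97.40, 77.30]},
    index=pd.Index(["9305-CKSKC", "1452-KNGVK"], name="customerID"),
)

model, feature_cols = fit_model(train_df)
# Call the function once on the whole dataframe, not through DataFrame.apply
churn_probs = predict_churn_probability(model, feature_cols, new_df)
print(churn_probs)
```

Note that new_churn_data.csv, as displayed above, has a `charge_per_tenure` column while the training data has different engineered columns, so the two sets of feature columns would need to match before a model trained on one can predict on the other.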

## Summary

I used TPOTClassifier to determine which machine learning model was best suited for the churn dataset. With the default metric (`scoring='accuracy'`), TPOT selected the XGBClassifier model as the most appropriate model. I tested three other metrics (AUC, precision, and recall) to see whether the XGBClassifier model would still be the most appropriate model. My tests demonstrated that the XGBClassifier model would not be the best model if I were concerned with the AUC, precision, or recall scores.

After deciding to use the XGBClassifier model for further evaluation, I 
