# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
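
A minimal sketch of the prediction function and the optional percentile challenge described above, assuming the saved model follows scikit-learn's `predict_proba` convention. The names `churn_probabilities`, `percentile_in_training`, and the stand-in `ConstantModel` are hypothetical and are not part of pycaret or TPOT; the training probabilities are made up for the demo.

```python
import numpy as np
import pandas as pd


def churn_probabilities(model, df: pd.DataFrame) -> pd.Series:
    """Return P(churn = 1) for each row, keeping the dataframe's index."""
    # predict_proba returns one column per class; column 1 is the positive class
    proba = model.predict_proba(df)[:, 1]
    return pd.Series(proba, index=df.index, name="churn_probability")


def percentile_in_training(train_proba, new_proba) -> np.ndarray:
    """Percent of training probabilities that are <= each new probability."""
    sorted_train = np.sort(np.asarray(train_proba, dtype=float))
    # side="right" counts training values less than or equal to the new value
    ranks = np.searchsorted(
        sorted_train, np.asarray(new_proba, dtype=float), side="right"
    )
    return 100.0 * ranks / len(sorted_train)


class ConstantModel:
    """Hypothetical stand-in for a fitted classifier, used only for this demo."""

    def predict_proba(self, X):
        p = np.clip(np.asarray(X["MonthlyCharges"], dtype=float) / 100.0, 0.0, 1.0)
        return np.column_stack([1.0 - p, p])


new_data = pd.DataFrame({"MonthlyCharges": [20.0, 78.0]}, index=["a", "b"])
new_proba = churn_probabilities(ConstantModel(), new_data)

# Made-up training probabilities, standing in for predictions on the training set
train_proba = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]

print(new_proba)
print(percentile_in_training(train_proba, new_proba))
```

In a real module, `ConstantModel()` would be replaced by the model loaded from disk, and `train_proba` by that model's predicted probabilities on the training data.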

# Install packages for tpot

In [1]:
#pip install tpot

In [2]:
#pip install torch

In [3]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

import timeit 

## Import Churn dataset

In [4]:
#import in prepared churn data 
df = pd.read_csv('/Users/johnxie301/Desktop/Data_Science_600/Assignment_5/churn_data_cleaned.csv',index_col='customerID')
df.head()

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,0,29.85,29.85,0
5575-GNVDE,34,1,1,1,56.95,1889.5,0
3668-QPYBK,2,1,0,1,53.85,108.15,1
7795-CFOCW,45,0,1,2,42.3,1840.75,0
9237-HQITU,2,1,0,0,70.7,151.65,1


In [5]:
#check for data types 
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7043 entries, 7590-VHVEG to 2775-SEFEE
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   tenure          7043 non-null   int64  
 1   PhoneService    7043 non-null   int64  
 2   Contract        7043 non-null   int64  
 3   PaymentMethod   7043 non-null   int64  
 4   MonthlyCharges  7043 non-null   float64
 5   TotalCharges    7043 non-null   float64
 6   Churn           7043 non-null   int64  
dtypes: float64(2), int64(5)
memory usage: 440.2+ KB


In [6]:
# Split the data into features and targets
features = df.drop('Churn',axis = 1)
targets = df[['Churn']]

In [7]:
# create a training set and a testing set for model use; set random_state to 42 for consistency
x_train, x_test, y_train, y_test = train_test_split(features, targets, stratify=targets, random_state=42)
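
A small sketch of what `stratify` does in the split above, using made-up toy data rather than the churn dataset. With `stratify`, the class proportions in the training and testing sets match those of the full target; the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 20% positive class (made-up values, not the churn data)
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([1] * 20 + [0] * 80)

# Same call pattern as above; test_size defaults to 0.25
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, stratify=toy_y, random_state=42
)

# Both subsets keep the 20% positive rate of the full target
print(len(toy_y_train), len(toy_y_test))
print(toy_y_train.mean(), toy_y_test.mean())
```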

# Use tpot 

In [8]:
# use the %time magic command to capture the time used by the run
# note: a bare %time line times an empty statement (hence the microsecond reading in the output);
# putting %%time on the first line of the cell would time the whole cell instead
%time
# generations=5 means the optimization process runs for 5 iterations
# population_size is the number of pipelines in each generation
# cv=5 means 5-fold cross-validation: the training data is split into 5 pieces and each piece serves as the validation set exactly once
# scoring is set to accuracy for a classifier model
# verbosity=2 means it shows progress information and a progress bar while running
# n_jobs=-1 means use all available CPU cores on the laptop
tpot = TPOTClassifier(generations=5, population_size=50, cv=5, random_state=42, scoring='accuracy', verbosity=2, n_jobs=-1)
# input the training features and targets
tpot.fit(x_train, y_train.values.ravel())
print(tpot.score(x_test, y_test.values.ravel()))
# there was a conversion warning about my y data: I had not converted it to a flat list, so each value was its own one-element list. I looked up the function the warning recommended, .ravel(), and found it really useful
# the following link is where I learned about the function 'ravel()': https://www.javatpoint.com/numpy-ravel#:~:text=ravel%2C%20which%20is%20used%20to,source%20array%20or%20input%20array.

CPU times: user 0 ns, sys: 1 µs, total: 1 µs
Wall time: 2.15 µs


Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.7998853243886357

Generation 2 - Current best internal CV score: 0.7998853243886357

Generation 3 - Current best internal CV score: 0.7998853243886357

Generation 4 - Current best internal CV score: 0.7998853243886357

Generation 5 - Current best internal CV score: 0.7998853243886357

Best pipeline: XGBClassifier(input_matrix, learning_rate=0.1, max_depth=2, min_child_weight=8, n_estimators=100, n_jobs=1, subsample=0.45, verbosity=0)
0.7904599659284497


### Comment: I am not certain whether the score should be the same for all 5 generations; since TPOT reports the best score found so far, it may simply mean that no later generation improved on the first. It does come close to my previous scores. The best pipeline has an accuracy of 79%, which is lower than what I got using a random forest classifier. I looked up the XGB classifier to learn its advantages: it is described as a good model for large datasets and for data with missing values. I would not agree with this choice here, since the data does not have any missing values and is relatively small from a business standpoint.

### XGB resources: https://apmonitor.com/pds/index.php/Main/XGBoostRegressor#:~:text=One%20of%20the%20key%20advantages,the%20trees%2C%20and%20regularization%20parameters.
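
On the repeated generation scores and the `0/300` progress total in the output above, here is a rough sketch of the arithmetic. It assumes TPOT's offspring size is left at its default (equal to `population_size`), and the per-generation scores are hypothetical values chosen only to illustrate a running maximum.

```python
from itertools import accumulate

generations, population_size = 5, 50

# Initial population plus one batch of offspring per generation
# (assumes offspring_size is left at its default, equal to population_size)
total_pipelines = population_size + generations * population_size
print(total_pipelines)  # 300, matching the progress bar total

# Hypothetical best CV score among the new pipelines of each generation
best_in_generation = [0.7999, 0.7951, 0.7987, 0.7999, 0.7962]

# "Current best" is a running maximum, so it can repeat but never decrease
current_best = list(accumulate(best_in_generation, max))
print(current_best)
```

If no later generation beats the first one, the reported score stays flat, which would be consistent with the five identical lines in the log.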

In [9]:
predictions = tpot.predict(x_test)
predictions[:20]

array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

In [10]:
# compare the predictions with the actual results, using the accuracy_score function to compute the accuracy the traditional way
from sklearn.metrics import accuracy_score
print(f'Accuracy of the TPOT predictions: {accuracy_score(y_test,predictions)}')
# code source: Week_5_FTE-TPOT.ipynb

Accuracy of the TPOT predictions: 0.7904599659284497
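
Since the assignment mentions that accuracy is only one possible metric, here is a small sketch of how accuracy, precision, and recall are computed from the same predictions. The label arrays are made up for illustration and are not the churn results; the `toy_` names are hypothetical.

```python
import numpy as np

# Made-up labels for illustration; 1 = churn, 0 = no churn
toy_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
toy_pred = np.array([1, 0, 0, 0, 0, 1, 1, 0])

# Counts of each outcome type
tp = int(np.sum((toy_true == 1) & (toy_pred == 1)))  # churners caught
fp = int(np.sum((toy_true == 0) & (toy_pred == 1)))  # false alarms
fn = int(np.sum((toy_true == 1) & (toy_pred == 0)))  # churners missed
tn = int(np.sum((toy_true == 0) & (toy_pred == 0)))  # correct non-churn

accuracy = (tp + tn) / len(toy_true)  # share of all predictions that are right
precision = tp / (tp + fp)            # share of predicted churners who churned
recall = tp / (tp + fn)               # share of actual churners who were found

print(accuracy, precision, recall)
```

On an imbalanced target such as churn, a model can score well on accuracy while missing many churners, which is why recall or precision may be the better choice depending on the business goal.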


In [11]:
# export the best pipeline and store it as a Python file, so the code can be rerun later with only the necessary imports
tpot.export('/Users/johnxie301/Desktop/Data_Science_600/Assignment_5/tpot_Churn_pipeline.py')

In [12]:
from IPython.display import Code

Code('/Users/johnxie301/Desktop/Data_Science_600/Assignment_5/tpot_Churn_pipeline_updated.py')

In [14]:
%run /Users/johnxie301/Desktop/Data_Science_600/Assignment_5/tpot_Churn_pipeline_updated.py
results[:20]

array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0])

In [15]:
predictions[:20]

array([0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# Summary

## Technical Issues 
TPOT is a really useful and time-saving tool, but this week I faced more challenges than in the past few weeks. First, I received a warning about missing parts of the PyTorch API, which caused some import errors; it took a while to find out that the PyTorch package is called torch. Then I received a y_train data conversion warning while the pipeline was running. This warning does not affect the results. It also happened last week, but I did not pay enough attention to it. Thanks to the notes given by Professor Pearson, I was able to learn more about the formats and how to turn a set of one-element lists into a single flat list using values.ravel(). The last challenge was read_csv in the Python file. I was receiving parsing errors about the C engine and the Python engine because the separator was set to 'COLUMN_SEPARATOR'. After double-checking the example against my CSV file, I found that I do not need to specify a separator at all, because my CSV file is separated with commas and the comma is the default. I also removed the dtype argument because I do not need all my data to be floats.
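
Two of the fixes described above can be sketched on a tiny stand-in for the cleaned file. The rows are copied from the `df.head()` output earlier in the notebook with some columns left out; the variable names are hypothetical.

```python
import io

import pandas as pd

# Small stand-in for the cleaned CSV file (three rows, a subset of the columns)
csv_text = (
    "customerID,tenure,MonthlyCharges,Churn\n"
    "7590-VHVEG,1,29.85,0\n"
    "5575-GNVDE,34,56.95,0\n"
    "3668-QPYBK,2,53.85,1\n"
)

# The comma is read_csv's default separator, so no sep argument is needed;
# leaving out dtype lets pandas infer integer and float columns separately
toy = pd.read_csv(io.StringIO(csv_text), index_col="customerID")

column_2d = toy[["Churn"]]        # double brackets: one-column DataFrame
flat = column_2d.values.ravel()   # flattened to a 1-D array
series = toy["Churn"]             # single brackets: already 1-D

print(column_2d.shape, flat.shape, series.shape)
print(toy.dtypes)
```

Selecting the target with single brackets (`df['Churn']`) is another way to avoid the conversion warning, since the result is already one-dimensional.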
## TPOT
Although I agree TPOT is a really useful tool, it did not impress me with the result it gave for this dataset. Since it appears to weigh other factors as well, such as run time, it sometimes does not return the most accurate model but rather the best-performing model overall. For example, this time it chose the XGB classifier and used only one CPU to run it (n_jobs=1). I believe this would definitely save resources and money on larger datasets. However, I wanted to see the best accuracy in this case. I was also impressed by how it can go from step 0 to a result on its own. Overall, I will certainly use it as a good reference, and based on the result it gives, I will decide whether to pick certain models and test out the best option myself.
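
As a rough sketch of testing candidate models by hand, the comparison could look like the following. It uses synthetic data from `make_classification` rather than the churn data, and `GradientBoostingClassifier` stands in for `XGBClassifier` so that only scikit-learn is needed; both choices are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the churn features and target (not the real data)
demo_X, demo_y = make_classification(n_samples=400, n_features=6, random_state=42)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    # GradientBoostingClassifier stands in for XGBClassifier in this sketch
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Mean 5-fold cross-validated accuracy for each candidate
cv_accuracy = {
    name: cross_val_score(model, demo_X, demo_y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Print candidates from highest to lowest mean accuracy
for name, score in sorted(cv_accuracy.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

Using the same `cv` and `scoring` settings as the TPOT run keeps the hand-picked models comparable with the pipeline TPOT selected.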