# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
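
A minimal sketch of the prediction function described in the task list above, assuming the saved model is any fitted scikit-learn-style classifier that exposes `predict_proba`. The function name `predict_churn_proba`, the stand-in `LogisticRegression` model, and the made-up rows are all hypothetical and are not part of the assignment files.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression


def predict_churn_proba(model, df):
    """Return the probability of churn (class 1) for each row of df."""
    # model.classes_ gives the column order of predict_proba's output.
    churn_col = list(model.classes_).index(1)
    probs = model.predict_proba(df)[:, churn_col]
    return pd.Series(probs, index=df.index, name="churn_probability")


# Stand-in training data and model (hypothetical values, illustration only).
train = pd.DataFrame(
    {
        "tenure": [1, 34, 2, 45, 60, 3],
        "MonthlyCharges": [29.85, 56.95, 53.85, 42.3, 20.0, 95.0],
    }
)
labels = [1, 0, 1, 0, 0, 1]
model = LogisticRegression(max_iter=1000).fit(train, labels)

# Two hypothetical new customers, indexed by a made-up ID.
new_data = pd.DataFrame(
    {"tenure": [5, 50], "MonthlyCharges": [80.0, 25.0]}, index=["A", "B"]
)
churn_probs = predict_churn_proba(model, new_data)
print(churn_probs)
```

Returning a Series that keeps the input index makes it easy to match each probability back to its customer.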

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
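
The percentile challenge in the list above could be sketched as follows. This assumes the churn probabilities predicted for the training set are already available as an array; the helper name `probability_percentile` and the ten example probabilities are hypothetical.

```python
import numpy as np


def probability_percentile(train_probs, new_prob):
    """Percent of training-set probabilities that are <= new_prob."""
    train_probs = np.asarray(train_probs, dtype=float)
    return float((train_probs <= new_prob).mean() * 100)


# Hypothetical training-set churn probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
pct = probability_percentile(train_probs, 0.78)
print(f"A churn probability of 0.78 sits at percentile {pct:.0f}")
```

`scipy.stats.percentileofscore` offers the same calculation with more options for handling ties, if SciPy is available.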

In [65]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from tpot import TPOTClassifier
from sklearn.model_selection import train_test_split

import timeit 

In [66]:
import pandas as pd

df = pd.read_csv('Week33_Churn.csv', index_col='customerID')
df.head(10)

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,1,29.85,29.85,1
5575-GNVDE,34,1,1,0,56.95,1889.5,1
3668-QPYBK,2,1,0,0,53.85,108.15,0
7795-CFOCW,45,0,1,3,42.3,1840.75,1
9237-HQITU,2,1,0,1,70.7,151.65,0
9305-CDSKC,8,1,0,1,99.65,820.5,0
1452-KIOVK,22,1,0,2,89.1,1949.4,1
6713-OKOMC,10,0,0,0,29.75,301.9,1
7892-POOKP,28,1,0,1,104.8,3046.05,0
6388-TABGU,62,1,1,3,56.15,3487.95,1


In [52]:
features = df.drop('Churn', axis=1)
targets = df['Churn']

In [53]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    features, targets, stratify=targets, random_state=25, test_size=.2
)

In [54]:
len(x_train)

5634

In [55]:
len(x_test)

1409

feedback: The data is first broken into features and targets, then split into train and test sets. This gives 5634 training rows and 1409 test rows, an 80/20 split set by test_size=.2. Because stratify=targets is passed, both sets keep roughly the same proportion of churned customers.
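
A minimal sketch of what the stratified 80/20 split does, using a hypothetical 100-row stand-in dataset with a 30% churn rate rather than the real churn file. scikit-learn rounds the test-set size up, which is why 7043 rows split into 5634 and 1409 in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 30 of them churned.
rng = np.random.default_rng(25)
features = pd.DataFrame({"tenure": rng.integers(1, 72, size=100)})
targets = pd.Series([1] * 30 + [0] * 70, name="Churn")

x_train, x_test, y_train, y_test = train_test_split(
    features, targets, stratify=targets, random_state=25, test_size=0.2
)

# Row counts of each set, then the churn rate in each set.
print(len(x_train), len(x_test))
print(y_train.mean(), y_test.mean())
```

Comparing the two churn rates is a quick check that stratification kept the class balance the same in both sets.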

In [45]:
%%time
tpot = TPOTClassifier(generations=5, population_size=50, cv=5, random_state=25, scoring='accuracy', verbosity=2, n_jobs=-1)

tpot.fit(x_train, y_train)
print(tpot.score(x_test, y_test))

Optimization Progress:   0%|          | 0/300 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.8004966107224417

Generation 2 - Current best internal CV score: 0.8012063022753313

Generation 3 - Current best internal CV score: 0.8012063022753313

Generation 4 - Current best internal CV score: 0.8020937713258135

Generation 5 - Current best internal CV score: 0.8020937713258135

Best pipeline: ExtraTreesClassifier(PCA(SGDClassifier(SelectFwe(input_matrix, alpha=0.045), alpha=0.01, eta0=0.1, fit_intercept=False, l1_ratio=0.0, learning_rate=invscaling, loss=squared_hinge, penalty=elasticnet, power_t=50.0), iterated_power=9, svd_solver=randomized), bootstrap=True, criterion=entropy, max_features=0.9000000000000001, min_samples_leaf=12, min_samples_split=17, n_estimators=100)
0.7892122072391767
Wall time: 9min 42s


In [56]:
predictions = tpot.predict(x_test)
predictions

array([1, 1, 1, ..., 1, 1, 1], dtype=int64)

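feedback: Predicting churn for each row in the test set. Note that predict returns class labels (0 or 1), not probabilities; predict_proba is needed to get the probability of churn.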
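
To illustrate the difference between class labels and churn probabilities, here is a minimal sketch using scikit-learn's `ExtraTreesClassifier` directly on synthetic data. The data and the variable names are hypothetical stand-ins; with a fitted TPOT object, the analogous call would be `predict_proba`, assuming the final estimator supports it.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (not the churn dataset).
rng = np.random.default_rng(25)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = ExtraTreesClassifier(
    n_estimators=100, min_samples_leaf=12, bootstrap=True, random_state=25
).fit(X, y)

# predict returns hard class labels (0 or 1).
labels = clf.predict(X[:5])
# predict_proba returns one column per class; column 1 is P(class 1).
probs = clf.predict_proba(X[:5])[:, 1]

print(labels)
print(probs)
```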

In [19]:
tpot.export('tpot_churn_nirmal.py')

In [57]:
from IPython.display import Code
Code('tpot_churn_nirmal.py')

In [63]:
from IPython.display import Code
Code('tpot_churn_nirmal_filled.py')

In [64]:
%run tpot_churn_nirmal_filled.py

ValueError: could not convert string to float: '7832-POPKP'

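feedback: I tried again with new_churn_data and got the same error. The value in the message looks like a customerID, which suggests the ID column is being read as a feature instead of being used as the index.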
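
One plausible cause of the error above, not confirmed by the notebook: scripts exported by TPOT typically load the data with their own `pd.read_csv` call, and if `customerID` is not set as the index, its string values end up among the numeric features. A minimal sketch with a hypothetical two-row stand-in for new_churn_data.csv:

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for new_churn_data.csv.
csv_text = """customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges
0001-AAAAA,12,1,0,1,70.5,846.0
0002-BBBBB,40,1,1,3,55.2,2208.0
"""

# Without index_col, customerID is an ordinary string column among the features.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# With index_col, the ID becomes the row index and only numeric columns remain.
new_data = pd.read_csv(io.StringIO(csv_text), index_col="customerID")
print(new_data.dtypes)
```

If this is the cause, reading the file with `index_col='customerID'` in the exported script, as the notebook already does for the training data, should let the model receive only numeric columns.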

# Summary

For this assignment, I broke the data into features and targets, with Churn as the target. I split the data 80/20 into train and test sets, which gave 5634 training rows and 1409 test rows. Then I instantiated the TPOT classifier, fitted it to the training set, and evaluated the result on the test set. The %%time cell magic was used to measure the execution time. During instantiation, generations is set to 5, so the evolutionary search runs for five generations. The population size is 50, so each generation holds 50 candidate pipelines. Cross-validation is set to 5 folds (cv=5), so every candidate pipeline is scored on five different train/validation splits of the training data.
Here we don’t need to score against the training set ourselves, because TPOT scores every candidate pipeline with internal cross-validation on the training data. This helps guard against overfitting, and TPOT then recommends the best-performing pipeline.
For this run, ExtraTreesClassifier was the final estimator in the best-performing pipeline. The pipeline first selects features with SelectFwe, passes them through a stacked SGDClassifier, applies principal component analysis (PCA), and then runs ExtraTreesClassifier. The tuned parameters for ExtraTreesClassifier are bootstrap, criterion, max_features, min_samples_leaf, min_samples_split, and n_estimators. The best internal cross-validation accuracy was about 0.802, and the accuracy on the test set was about 0.789 (79%). The run shown above reports a wall time of 9 minutes and 42 seconds.
After the search finished, I exported the best pipeline to “tpot_churn_nirmal.py”. The Python file is meant to take a pandas DataFrame as input and return churn predictions. The run of tpot_churn_nirmal_filled.py shown above stopped with a ValueError, so that step still needs to be fixed.
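
A rough sketch of the best pipeline reported by TPOT, rebuilt with plain scikit-learn on synthetic data. This is an approximation under stated assumptions: the stacked SGDClassifier step is left out because it relies on a TPOT helper, the `random_state` values are added for repeatability, and the data comes from `make_classification` rather than the churn file. The remaining parameter values are copied from the "Best pipeline" output above.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFwe, f_classif
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (not the churn dataset).
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=4, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=25,
)

pipeline = make_pipeline(
    # Keep features that pass a family-wise error rate test.
    SelectFwe(score_func=f_classif, alpha=0.045),
    # The stacked SGDClassifier step from the TPOT output is omitted here.
    PCA(iterated_power=9, svd_solver="randomized", random_state=25),
    ExtraTreesClassifier(
        bootstrap=True, criterion="entropy", max_features=0.9,
        min_samples_leaf=12, min_samples_split=17, n_estimators=100,
        random_state=25,
    ),
)
pipeline.fit(X, y)

# Accuracy on the synthetic training data, shown only to confirm the pipeline runs.
train_accuracy = pipeline.score(X, y)
print(train_accuracy)
```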