# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
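
For the first optional challenge, the percentile of a new prediction can be read off the sorted training probabilities. The sketch below is only an illustration: `percentile_rank` is a hypothetical helper and the training probabilities are made up, chosen so that 0.78 lands at the 90th percentile as in the example above.

```python
from bisect import bisect_right


def percentile_rank(train_probs, prob):
    """Return the percentage of training probabilities that are <= prob."""
    ordered = sorted(train_probs)
    return 100.0 * bisect_right(ordered, prob) / len(ordered)


# Made-up training probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.75, 0.90]
rank = percentile_rank(train_probs, 0.78)
print(rank)
```

In practice `train_probs` would be the churn probabilities predicted for the training set.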

#### Install Packages :

In [None]:
%pip install pycaret

#### Load Packages :

In [88]:
import pandas as pd
from IPython.display import Code

#### Load Data : 

In [89]:
df = pd.read_csv('data/prepared_churn_data.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,charge_per_tenure
7590-VHVEG,1.0,0,0,3,29.85,29.85,0,29.850000
5575-GNVDE,34.0,1,1,2,56.95,1889.50,0,55.573529
3668-QPYBK,2.0,1,0,2,53.85,108.15,1,54.075000
7795-CFOCW,45.0,0,1,1,42.30,1840.75,0,40.905556
9237-HQITU,2.0,1,0,3,70.70,151.65,1,75.825000
...,...,...,...,...,...,...,...,...
6840-RESVB,24.0,1,1,2,84.80,1990.50,0,82.937500
2234-XADUH,72.0,1,1,0,103.20,7362.90,0,102.262500
4801-JZAZL,11.0,0,0,3,29.60,346.45,0,31.495455
8361-LTMKD,4.0,1,0,2,74.40,306.60,1,76.650000


#### pycaret to find an ML algorithm that performs best on the data :

Import Auto ML algorithm generator and comparison modules.

In [90]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

PyCaret's `setup()` automatically imputes missing values, converts categorical values to numeric ones, and splits the data into train and test sets. Setting `target='Churn'` tells it that the models should predict whether a customer churns.

In [91]:
automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,1398
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 8)"
4,Transformed data shape,"(7043, 8)"
5,Transformed train set shape,"(4930, 8)"
6,Transformed test set shape,"(2113, 8)"
7,Numeric features,7
8,Preprocess,True
9,Imputation type,simple
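
The train and test shapes in the table above are consistent with a 70/30 split. The check below assumes a training fraction of 0.7 (which we believe is PyCaret's default `train_size`, but it is not shown in the setup output) and that the training count is rounded down.

```python
import math

n_rows = 7043      # "Original data shape" from the setup table
train_size = 0.7   # assumed default; not stated in the setup output

# Training rows are rounded down; the test set gets the remainder.
n_train = math.floor(train_size * n_rows)
n_test = n_rows - n_train
print(n_train, n_test)
```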


PyCaret's `compare_models()` automatically trains and evaluates multiple ML algorithms.
It returns the best-performing algorithm according to the default metric (Accuracy).

In [92]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7953,0.8376,0.5321,0.6379,0.5796,0.4459,0.4495,0.778
ridge,Ridge Classifier,0.7915,0.8253,0.4488,0.6574,0.5327,0.4047,0.4173,0.013
ada,Ada Boost Classifier,0.7911,0.8405,0.5115,0.6331,0.564,0.4291,0.4343,0.121
gbc,Gradient Boosting Classifier,0.7907,0.8391,0.4863,0.6393,0.5518,0.4186,0.4256,0.286
lda,Linear Discriminant Analysis,0.7892,0.8253,0.5015,0.6299,0.5576,0.4218,0.427,0.014
lightgbm,Light Gradient Boosting Machine,0.7862,0.8299,0.5115,0.618,0.5593,0.4199,0.4235,0.177
rf,Random Forest Classifier,0.7832,0.8115,0.5001,0.6114,0.5496,0.4089,0.4128,0.241
knn,K Neighbors Classifier,0.7746,0.7605,0.4588,0.5978,0.5182,0.3748,0.3808,0.04
et,Extra Trees Classifier,0.77,0.789,0.4901,0.5785,0.5298,0.3792,0.3819,0.174
svm,SVM - Linear Kernel,0.7572,0.7525,0.4663,0.5894,0.4907,0.3424,0.3637,0.021


In [93]:
best_model

We can see that LogisticRegression is the best-performing ML algorithm under the default evaluation metric, accuracy.

Now we can replace this default metric with a different one through the `sort` argument, such as AUC or, as here, Kappa:

In [94]:
best_model = compare_models(sort='Kappa')

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lr,Logistic Regression,0.7953,0.8376,0.5321,0.6379,0.5796,0.4459,0.4495,0.068
ada,Ada Boost Classifier,0.7911,0.8405,0.5115,0.6331,0.564,0.4291,0.4343,0.107
qda,Quadratic Discriminant Analysis,0.7467,0.8192,0.7362,0.5169,0.6069,0.4286,0.4435,0.013
lda,Linear Discriminant Analysis,0.7892,0.8253,0.5015,0.6299,0.5576,0.4218,0.427,0.013
lightgbm,Light Gradient Boosting Machine,0.7862,0.8299,0.5115,0.618,0.5593,0.4199,0.4235,0.166
gbc,Gradient Boosting Classifier,0.7907,0.8391,0.4863,0.6393,0.5518,0.4186,0.4256,0.291
nb,Naive Bayes,0.742,0.8171,0.7209,0.5109,0.5977,0.4161,0.4296,0.014
rf,Random Forest Classifier,0.7832,0.8115,0.5001,0.6114,0.5496,0.4089,0.4128,0.259
ridge,Ridge Classifier,0.7915,0.8253,0.4488,0.6574,0.5327,0.4047,0.4173,0.012
et,Extra Trees Classifier,0.77,0.789,0.4901,0.5785,0.5298,0.3792,0.3819,0.174


In [95]:
best_model
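
To make the Kappa column easier to interpret, here is a minimal pure-Python sketch of how accuracy, precision, recall, and Cohen's kappa follow from the confusion-matrix counts. `binary_metrics` is a hypothetical helper, not part of PyCaret. The inputs are the five true labels for the new data and the Logistic Regression predictions reported later in this notebook.

```python
def binary_metrics(y_true, y_pred):
    """Compute basic binary classification metrics from two 0/1 label lists."""
    pairs = list(zip(y_true, y_pred))
    n = len(pairs)
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    # Agreement expected by chance, from the marginal class frequencies.
    p_e = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "kappa": kappa}


metrics = binary_metrics([1, 0, 0, 1, 0], [1, 0, 0, 0, 0])
print(metrics)
```

Kappa discounts the agreement that a model would reach by chance, which matters here because most customers do not churn.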

#### Saving Model to disk :

PyCaret's `save_model()` saves the model, together with its transformation pipeline, as a pickle file.

In [96]:
save_model(best_model, 'best_churn_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges',
                                              'charge_per_tenure'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('c...
                                                         

Now we can load the model with PyCaret's `load_model()`:

In [97]:
loaded_model = load_model('best_churn_model')

Transformation Pipeline and Model Successfully Loaded


Now we will use the loaded model to predict the second-to-last row of the dataframe.

In [98]:
result = predict_model(loaded_model, df.iloc[-2:-1])
result

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Logistic Regression,1.0,0,1.0,1.0,1.0,,0.0


customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,Churn,prediction_label,prediction_score
8361-LTMKD,4.0,1,0,2,74.400002,306.600006,76.650002,1,1,0.5704
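
The assignment asks for the probability of churn, while `prediction_score` appears to be the probability of whichever label was predicted. Under that assumption, a small hypothetical helper can convert the two output columns into a churn probability:

```python
def churn_probability(prediction_label, prediction_score):
    """Convert a (label, score) pair into the probability of churn.

    Assumes the score is the probability of the predicted label.
    """
    if prediction_label == 1:
        return prediction_score
    return 1.0 - prediction_score


# Row 8361-LTMKD from the output above: predicted churn with score 0.5704.
print(churn_probability(1, 0.5704))
# Hypothetical row predicted as "not churn" with score 0.9.
print(round(churn_probability(0, 0.9), 4))
```

`predict_model` also has a `raw_score` argument which, as far as we know, returns one score column per class and would make this conversion unnecessary.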


#### Python script as a function for Probability of Churn :

In [100]:
!code predict_churn.py

In [101]:
Code('predict_churn.py')

In [102]:
!python predict_churn.py

Transformation Pipeline and Model Successfully Loaded
customerID
9305-CKSKC        Churn
1452-KNGVK    Not Churn
6723-OKKJM    Not Churn
7832-POPKP    Not Churn
6348-TACGU    Not Churn
Name: Churn_prediction, dtype: object
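
The contents of `predict_churn.py` are not reproduced in this export. The following is only a sketch of what such a module could look like, not the original file. The function names, the `LABELS` mapping, and the CSV path are assumptions; `load_model`, `predict_model`, and the `prediction_label` column are the ones used earlier in this notebook.

```python
# Hypothetical predict_churn.py; the original file is not reproduced here.
LABELS = {1: "Churn", 0: "Not Churn"}


def predict_churn(df, model_name="best_churn_model"):
    """Return a 'Churn_prediction' Series with a readable label per row of df."""
    # Imported inside the function so the module can be imported
    # even where PyCaret is not installed.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    predictions = predict_model(model, data=df)
    return predictions["prediction_label"].map(LABELS).rename("Churn_prediction")


def load_new_data(path="data/new_churn_data.csv"):
    """Read the new churn data; the path is an assumption."""
    import pandas as pd

    return pd.read_csv(path, index_col="customerID")
```

With the saved model and the CSV in place, `print(predict_churn(load_new_data()))` would print a Series like the one above.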


Given that the true values for the new data are [1, 0, 0, 1, 0], comparing the model selected under each metric shows that Accuracy, Kappa, and MCC give the best match:

- Metric: Accuracy, ML algorithm: LogisticRegression, script result: 1 0 0 0 0 = 4 matches
- Metric: AUC, ML algorithm: Gradient Boosting Classifier, script result: 0 0 0 0 0 = 3 matches
- Metric: Recall, ML algorithm: QuadraticDiscriminantAnalysis, script result: 1 1 1 1 1 = 2 matches
- Metric: Prec., ML algorithm: Ridge Classifier, script result: 1 1 0 0 1 = 2 matches
- Metric: F1, ML algorithm: QuadraticDiscriminantAnalysis, script result: 1 1 1 1 1 = 2 matches
- Metric: Kappa, ML algorithm: LogisticRegression, script result: 1 0 0 0 0 = 4 matches
- Metric: MCC, ML algorithm: LogisticRegression, script result: 1 0 0 0 0 = 4 matches
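
The match counts in this list can be reproduced with a few lines of plain Python. The prediction vectors are copied from the list; the variable names are our own.

```python
y_true = [1, 0, 0, 1, 0]

# Script results for the model selected under each metric (from the list above).
results = {
    "Accuracy": [1, 0, 0, 0, 0],
    "AUC": [0, 0, 0, 0, 0],
    "Recall": [1, 1, 1, 1, 1],
    "Prec.": [1, 1, 0, 0, 1],
    "F1": [1, 1, 1, 1, 1],
    "Kappa": [1, 0, 0, 0, 0],
    "MCC": [1, 0, 0, 0, 0],
}

# Count the positions where the prediction equals the true value.
matches = {
    metric: sum(t == p for t, p in zip(y_true, preds))
    for metric, preds in results.items()
}
for metric, count in matches.items():
    print(f"{metric}: {count} of {len(y_true)} match")
```

With only five rows, a single prediction changes the match rate by 20 percentage points, so this comparison is a rough sanity check rather than a reliable ranking of the metrics.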

# Summary


- Loaded the necessary packages and the prepared churn dataset from week 2.
- Used `setup`, `compare_models`, `predict_model`, `save_model`, and `load_model` from PyCaret to find the ML algorithm that performs best on the data.
- `setup()` automatically handles missing values, converts categorical values to numeric ones, and splits the data into train and test sets. Setting the target to `Churn` makes the models predict whether a customer churns.
- First, we ran `compare_models()` with the default metric, accuracy. Then we sorted by other metrics to see which ML model each one selects.
- We saved the best-performing model as a pickle file, so it can be loaded whenever needed.
- We created a Python script that reads a CSV file into a dataframe and passes it to a prediction function, which uses the loaded best-performing model to make predictions for that dataframe.
- We made the result easier to read by naming the output column 'Churn_prediction' and replacing the values 1 and 0 with 'Churn' and 'Not Churn'.
- By comparing the results under each metric with the actual values, we identified which metrics select the best-performing model.