# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

In [5]:
import os
import pandas as pd


In [6]:
df = pd.read_csv('prepped_churn_data.csv', index_col='customerID')
df

customerID,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,MonthlyCharges_Tenure_ratio,tenure_ratio
7590-VHVEG,0,1,0,0.0,Electronic check,29.85,29.85,No,29.850000,0.033501
5575-GNVDE,1,34,1,1.0,Mailed check,56.95,1889.50,No,1.675000,0.017994
3668-QPYBK,2,2,1,0.0,Mailed check,53.85,108.15,Yes,26.925000,0.018493
7795-CFOCW,3,45,0,1.0,Bank transfer (automatic),42.30,1840.75,No,0.940000,0.024447
9237-HQITU,4,2,1,0.0,Electronic check,70.70,151.65,Yes,35.350000,0.013188
...,...,...,...,...,...,...,...,...,...,...
6840-RESVB,7038,24,1,1.0,Mailed check,84.80,1990.50,No,3.533333,0.012057
2234-XADUH,7039,72,1,1.0,Credit card (automatic),103.20,7362.90,No,1.433333,0.009779
4801-JZAZL,7040,11,0,0.0,Electronic check,29.60,346.45,No,2.690909,0.031751
8361-LTMKD,7041,4,1,0.0,Mailed check,74.40,306.60,Yes,18.600000,0.013046


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7032 entries, 7590-VHVEG to 3186-AJIEK
Data columns (total 10 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   7032 non-null   int64  
 1   tenure                       7032 non-null   int64  
 2   PhoneService                 7032 non-null   int64  
 3   Contract                     5347 non-null   float64
 4   PaymentMethod                7032 non-null   object 
 5   MonthlyCharges               7032 non-null   float64
 6   TotalCharges                 7032 non-null   float64
 7   Churn                        7032 non-null   object 
 8   MonthlyCharges_Tenure_ratio  7032 non-null   float64
 9   tenure_ratio                 7032 non-null   float64
dtypes: float64(5), int64(3), object(2)
memory usage: 604.3+ KB


In [8]:
from pycaret.classification import setup, compare_models, predict_model, save_model, load_model

In [9]:
automl = setup(df, target='Churn')

Unnamed: 0,Description,Value
0,Session id,3627
1,Target,Churn
2,Target type,Binary
3,Target mapping,"No: 0, Yes: 1"
4,Original data shape,"(7032, 10)"
5,Transformed data shape,"(7032, 13)"
6,Transformed train set shape,"(4922, 13)"
7,Transformed test set shape,"(2110, 13)"
8,Numeric features,8
9,Categorical features,1


In [10]:
best_model = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lda,Linear Discriminant Analysis,0.8005,0.8329,0.8005,0.7891,0.7883,0.4362,0.4477,0.019
gbc,Gradient Boosting Classifier,0.7997,0.8409,0.7997,0.7891,0.7905,0.4459,0.4527,0.161
lr,Logistic Regression,0.7972,0.8332,0.7972,0.7855,0.786,0.4311,0.4405,0.539
ridge,Ridge Classifier,0.7966,0.8314,0.7966,0.7838,0.7797,0.4087,0.4269,0.019
ada,Ada Boost Classifier,0.7899,0.8351,0.7899,0.7786,0.7806,0.4199,0.4263,0.066
rf,Random Forest Classifier,0.7836,0.8173,0.7836,0.7719,0.7744,0.4038,0.4096,0.123
lightgbm,Light Gradient Boosting Machine,0.7818,0.825,0.7818,0.7711,0.7741,0.4055,0.4094,0.118
nb,Naive Bayes,0.7779,0.8006,0.7779,0.7799,0.7786,0.435,0.4355,0.023
et,Extra Trees Classifier,0.77,0.8074,0.77,0.759,0.7623,0.3754,0.3789,0.104
dummy,Dummy Classifier,0.7343,0.5,0.7343,0.5391,0.6217,0.0,0.0,0.024


In [11]:
best_model

In [12]:
df.iloc[-2:-1].shape

(1, 10)

In [13]:
predict_model(best_model, df.iloc[-2:-1])

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Linear Discriminant Analysis,1.0,0,1.0,1.0,1.0,,0.0


customerID,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,MonthlyCharges_Tenure_ratio,tenure_ratio,Churn,prediction_label,prediction_score
8361-LTMKD,7041,4,1,0.0,Mailed check,74.400002,306.600006,18.6,0.013046,Yes,Yes,0.5263


In [14]:
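# Note: best_model is the Linear Discriminant Analysis model selected above; 'GBC' is only the file name
save_model(best_model, 'GBC')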

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['Unnamed: 0', 'tenure',
                                              'PhoneService', 'Contract',
                                              'MonthlyCharges', 'TotalCharges',
                                              'MonthlyCharges_Tenure_ratio',
                                              'tenure_ratio'],
                                     transformer=Sim...
                                                               return_df=True,
                                                               use_cat_names=True,
                                                               verbose=0))),
                 ('cl

In [15]:
import pickle

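# Also pickle the bare model object; save_model() above additionally stores the transformation pipeline
with open('GBC_model.pk', 'wb') as f:
    pickle.dump(best_model, f)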

In [16]:
with open('GBC_model.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [17]:
loaded_gbc = load_model('GBC')

Transformation Pipeline and Model Successfully Loaded


In [18]:
new_data = df.iloc[-2:-1].copy()
new_data.drop('Churn', axis=1, inplace=True)
loaded_gbc.predict(new_data)

0    Yes
Name: Churn, dtype: object

In [19]:
predict_model(loaded_gbc, new_data)

customerID,Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,MonthlyCharges_Tenure_ratio,tenure_ratio,prediction_label,prediction_score
8361-LTMKD,7041,4,1,0.0,Mailed check,74.400002,306.600006,18.6,0.013046,Yes,0.5263


In [20]:
from IPython.display import Code

Code('predict_Churn.py')

# Summary

In this week 5 assignment, we developed an ML model to predict customer churn using the prepared dataset from week 2. We used PyCaret to identify the best-performing model, saved it to disk, and wrapped it in a Python module for making predictions.

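## Data preparation:
We loaded the churn dataset from week 2 with customerID as the index and checked the data types and null counts with `df.info()`. Every column is fully populated except Contract, which has 5,347 non-null values out of 7,032 rows; PyCaret's numerical imputer fills those gaps during setup.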
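
As a quick check on missing values, the non-null counts reported by `df.info()` can be turned into per-column missing counts. The following is a minimal sketch that uses only the numbers printed earlier in this notebook; the variable names are illustrative.

```python
# Non-null counts copied from the df.info() output earlier in this notebook.
total_rows = 7032
non_null_counts = {
    "Unnamed: 0": 7032,
    "tenure": 7032,
    "PhoneService": 7032,
    "Contract": 5347,
    "PaymentMethod": 7032,
    "MonthlyCharges": 7032,
    "TotalCharges": 7032,
    "Churn": 7032,
    "MonthlyCharges_Tenure_ratio": 7032,
    "tenure_ratio": 7032,
}

# Keep only the columns that have at least one missing value.
missing_counts = {
    column: total_rows - count
    for column, count in non_null_counts.items()
    if count < total_rows
}
print(missing_counts)
```

On the live DataFrame, `df.isna().sum()` should report the same per-column counts.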

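## Model selection:
We used PyCaret's `compare_models()` to cross-validate multiple ML models on the dataset. With the default settings the leaderboard is sorted by accuracy, so the model returned as best was Linear Discriminant Analysis (accuracy 0.8005, AUC 0.8329). The Gradient Boosting Classifier had the highest AUC (0.8409), so sorting by AUC would have selected it instead.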
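
Which model counts as "best" depends on the metric used to sort the leaderboard. The sketch below re-ranks the accuracy and AUC values from the `compare_models()` output in this notebook; `leaderboard` and `best_by` are illustrative names, not part of PyCaret.

```python
# Rows copied from the compare_models() leaderboard: (model id, accuracy, AUC).
leaderboard = [
    ("lda", 0.8005, 0.8329),
    ("gbc", 0.7997, 0.8409),
    ("lr", 0.7972, 0.8332),
    ("ridge", 0.7966, 0.8314),
    ("ada", 0.7899, 0.8351),
    ("rf", 0.7836, 0.8173),
    ("lightgbm", 0.7818, 0.8250),
    ("nb", 0.7779, 0.8006),
    ("et", 0.7700, 0.8074),
    ("dummy", 0.7343, 0.5000),
]


def best_by(rows, metric_index):
    """Return the model id with the highest value in the given column."""
    return max(rows, key=lambda row: row[metric_index])[0]


best_accuracy = best_by(leaderboard, 1)
best_auc = best_by(leaderboard, 2)
print("Best by accuracy:", best_accuracy)
print("Best by AUC:", best_auc)
```

In PyCaret itself, the sorting metric is chosen with the `sort` argument, for example `compare_models(sort='AUC')`.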

## Training and saving:
PyCaret returns the best-performing model already trained on the training split. We saved it to disk with `save_model()`, together with its transformation pipeline, for later use in deployment.

## Creating a prediction module:
We developed a separate Python module, predict_Churn.py, that loads the saved model and makes predictions on new data.
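
A prediction module along these lines could look like the sketch below. It is not the contents of predict_Churn.py. It assumes the model was saved under the name `GBC`, as in this notebook, and that PyCaret's `prediction_score` is the probability of the predicted label, so it is converted to a churn probability. The function names `churn_probability` and `predict_churn` are hypothetical.

```python
def churn_probability(label, score):
    """Convert a (prediction_label, prediction_score) pair into P(churn).

    Assumption: the score is the probability of the *predicted* label,
    so a 'No' prediction with score 0.80 means P(churn) = 0.20.
    """
    return score if label == "Yes" else 1.0 - score


def predict_churn(df, model_name="GBC"):
    """Return the churn probability for each row of a pandas DataFrame."""
    # Imported lazily so churn_probability() can be used without PyCaret installed.
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)
    predictions = predict_model(model, data=df)
    # Row-wise conversion; the result keeps the index of the input DataFrame.
    return predictions.apply(
        lambda row: churn_probability(
            row["prediction_label"], row["prediction_score"]
        ),
        axis=1,
    )
```

A call such as `print(predict_churn(pd.read_csv('new_churn_data.csv', index_col='customerID')))` would then print one probability per customer, assuming the new file has the same columns as the training data.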

## Results:
The selected model reached a cross-validated AUC of about 0.83 and an accuracy of about 0.80, compared with an accuracy of 0.73 for the dummy classifier. Its predictions on the new dataset were close to the actual values, and the churn probability scores can be used to identify high-risk customers.
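
One way to flag high-risk customers is to place each new churn probability within the distribution of training-set probabilities, as the optional challenge suggests. The sketch below uses only the standard library. The scores in `example_training_scores` are made-up placeholders, not results from this notebook, and `percentile_of_score` is a hypothetical helper name.

```python
from bisect import bisect_right


def percentile_of_score(training_scores, new_score):
    """Percentage of training scores that are less than or equal to new_score."""
    ordered = sorted(training_scores)
    rank = bisect_right(ordered, new_score)
    return 100.0 * rank / len(ordered)


# Hypothetical churn probabilities standing in for the training-set predictions.
example_training_scores = [0.05, 0.10, 0.12, 0.20, 0.31, 0.40, 0.55, 0.62, 0.70, 0.85]

# A new prediction of 0.78 is at or above 9 of the 10 placeholder scores.
print(percentile_of_score(example_training_scores, 0.78))
```

In practice, the training scores would be the churn probabilities predicted for the rows of the training dataset.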

