# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
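
For the percentile challenge above, one possible approach is to sort the training-set probabilities once and look up where each new probability falls. The sketch below is only an illustration: the function name `percentile_of` and the training probabilities are made up, and in practice the list would come from the model's predictions on the training data.

```python
from bisect import bisect_right


def percentile_of(prob, train_probs_sorted):
    """Percent of training probabilities that are <= prob."""
    if not train_probs_sorted:
        raise ValueError("need at least one training probability")
    rank = bisect_right(train_probs_sorted, prob)
    return 100.0 * rank / len(train_probs_sorted)


# Made-up training probabilities, for illustration only.
train_probs = sorted([0.05, 0.10, 0.12, 0.20, 0.25, 0.33, 0.41, 0.55, 0.70, 0.90])

for p in (0.78, 0.10):
    print(f"p={p:.2f} -> percentile {percentile_of(p, train_probs):.0f}")
```

With these made-up values, a probability of 0.78 lands at the 90th percentile because nine of the ten training probabilities are at or below it.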

In [1]:
import pandas as pd
df = pd.read_csv('C:/Users/jaypa/OneDrive/Desktop/Python_MSDS600_Projects/updated_churn_dataset.csv', index_col='customerID')
df

customerID,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
7590-VHVEG,1,0,0,2,29.85,29.85,0
5575-GNVDE,34,1,1,3,56.95,1889.50,0
3668-QPYBK,2,1,0,3,53.85,108.15,1
7795-CFOCW,45,0,1,0,42.30,1840.75,0
9237-HQITU,2,1,0,2,70.70,151.65,1
...,...,...,...,...,...,...,...
6840-RESVB,24,1,1,3,84.80,1990.50,0
2234-XADUH,72,1,1,1,103.20,7362.90,0
4801-JZAZL,11,0,0,2,29.60,346.45,0
8361-LTMKD,4,1,0,3,74.40,306.60,1


In [2]:
from pycaret.classification import *

In [4]:
automl = setup(df, target='Churn')

,Description,Value
0,Session id,1778
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 7)"
4,Transformed data shape,"(7043, 7)"
5,Transformed train set shape,"(4930, 7)"
6,Transformed test set shape,"(2113, 7)"
7,Numeric features,6
8,Preprocess,True
9,Imputation type,simple


In [5]:
best_model = compare_models()

,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7945,0.8366,0.4916,0.6492,0.5587,0.4283,0.4358,0.095
ada,Ada Boost Classifier,0.7905,0.8341,0.5015,0.6342,0.5593,0.4244,0.43,0.045
lr,Logistic Regression,0.7892,0.8293,0.5221,0.6237,0.5679,0.43,0.4334,0.862
ridge,Ridge Classifier,0.789,0.0,0.4388,0.6543,0.5246,0.3958,0.4092,0.012
catboost,CatBoost Classifier,0.7876,0.8315,0.484,0.6303,0.5465,0.4112,0.4179,0.915
lda,Linear Discriminant Analysis,0.7846,0.8163,0.4854,0.6218,0.5444,0.4063,0.4121,0.013
lightgbm,Light Gradient Boosting Machine,0.7769,0.8208,0.4863,0.5993,0.5359,0.3914,0.3956,0.128
rf,Random Forest Classifier,0.7651,0.7919,0.4687,0.5708,0.5135,0.3609,0.3647,0.105
knn,K Neighbors Classifier,0.7588,0.7407,0.4251,0.5607,0.4828,0.3296,0.3353,0.437
et,Extra Trees Classifier,0.7542,0.7684,0.4657,0.5439,0.5002,0.339,0.3416,0.079


In [6]:
best_model

In [7]:
save_model(best_model, 'GradientBoostingClassifier')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean',
                                                               verbose='deprecated'))),
                 ('...
                                        

In [8]:
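import pickle
# Also pickle the fitted estimator directly; save_model above already stored it together with the preprocessing pipeline.
with open('GradientBoostingClassifier.pk', 'wb') as f:
    pickle.dump(best_model, f)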

In [9]:
with open('GradientBoostingClassifier.pk', 'rb') as f:
    loaded_model = pickle.load(f)

In [10]:
loaded_gbc = load_model('GradientBoostingClassifier')

Transformation Pipeline and Model Successfully Loaded


In [20]:
from IPython.display import Code

Code("C:/Users/jaypa/Downloads/predict_churn.py")
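
The contents of predict_churn.py are not reproduced in this export. As a rough sketch of the kind of function the assignment asks for (one that takes a pandas dataframe and returns the probability of churn for each row), something like the following could work. It assumes the loaded model exposes scikit-learn's `predict_proba` method with churn as class 1. The function name `predict_churn_proba`, the stand-in model, and the customer IDs are hypothetical and only there so the sketch runs on its own.

```python
import pandas as pd


def predict_churn_proba(model, df):
    """Return the churn probability (class 1) for each row of df."""
    proba = model.predict_proba(df)
    # Column 1 is assumed to hold the probability of the positive (churn) class.
    churn = [float(row[1]) for row in proba]
    return pd.Series(churn, index=df.index, name="churn_probability")


class _StandInModel:
    """Hypothetical stand-in so the sketch runs without a saved model."""

    def predict_proba(self, X):
        # Toy rule for illustration: shorter tenure -> higher churn probability.
        p = [1.0 / (1.0 + t) for t in X["tenure"]]
        return [[1.0 - pi, pi] for pi in p]


# Made-up rows; real input would come from new_churn_data.csv.
new_data = pd.DataFrame(
    {"tenure": [1, 3]},
    index=pd.Index(["A-1", "B-2"], name="customerID"),
)
probs = predict_churn_proba(_StandInModel(), new_data)
print(probs)
```

In real use, the stand-in would be replaced by the model returned from `load_model('GradientBoostingClassifier')` and the dataframe by `pd.read_csv('new_churn_data.csv', index_col='customerID')`. PyCaret's own `predict_model` function is another route to the same result.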

In [21]:
%run C:/Users/jaypa/Downloads/predict_churn.py

Transformation Pipeline and Model Successfully Loaded


predictions:
           Churn_prediction
customerID                 
9305-CKSKC         No churn
1452-KNGVK            Churn
6723-OKKJM            Churn
7832-POPKP         No churn
6348-TACGU            Churn


# Summary


Data preparation: I loaded the prepared churn data from week 2 with pandas, using `customerID` as the index. The data had already been cleaned in week 2, so no further preprocessing was done in this notebook.

With the data loaded, I used PyCaret, an AutoML library, to automate model selection. Calling `setup` with `Churn` as the target created a binary classification experiment with 4,930 training rows and 2,113 test rows.
`compare_models` then evaluated a range of classifiers on accuracy, AUC (Area Under the Curve), recall, precision, F1 score, kappa, and MCC (Matthews Correlation Coefficient).

The Gradient Boosting Classifier (GBC) ranked first in the comparison, with the highest accuracy (0.7945), AUC (0.8366), and MCC (0.4358), so I chose it as the final model. It did not lead on every metric: logistic regression had the best recall (0.5221) and F1 (0.5679), and the ridge classifier had the best precision (0.6543). I used accuracy, PyCaret's default, as the primary metric, with AUC, recall, and precision as supporting checks. Because churn data are usually imbalanced, accuracy alone can be misleading, so these supporting metrics help weigh the cost of missed churners (false negatives) against false alarms (false positives), both of which matter for customer retention.
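
To see how the choice of metric changes which model comes out on top, the leaderboard can be ranked one metric at a time. The sketch below uses four rows copied from the `compare_models` output above (the remaining rows do not lead on any of these five metrics). The helper name `best_by` is only illustrative.

```python
# Rows copied from the compare_models() output shown earlier in this notebook.
leaderboard = {
    "gbc":   {"Accuracy": 0.7945, "AUC": 0.8366, "Recall": 0.4916, "Prec.": 0.6492, "F1": 0.5587},
    "ada":   {"Accuracy": 0.7905, "AUC": 0.8341, "Recall": 0.5015, "Prec.": 0.6342, "F1": 0.5593},
    "lr":    {"Accuracy": 0.7892, "AUC": 0.8293, "Recall": 0.5221, "Prec.": 0.6237, "F1": 0.5679},
    "ridge": {"Accuracy": 0.7890, "AUC": 0.0,    "Recall": 0.4388, "Prec.": 0.6543, "F1": 0.5246},
}


def best_by(metric, board):
    """Return the model ID with the highest value for the given metric."""
    return max(board, key=lambda name: board[name][metric])


for metric in ("Accuracy", "AUC", "Recall", "Prec.", "F1"):
    print(f"{metric:8s} -> {best_by(metric, leaderboard)}")
```

Within PyCaret itself, `compare_models` accepts a `sort` argument (for example `sort='AUC'`), which ranks the leaderboard by that metric instead of accuracy.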

The best model was then saved to disk with PyCaret's `save_model` function, which stores the preprocessing pipeline together with the model, and reloaded with `load_model` to confirm it could be used for prediction.

Next, I wrote a Python script, predict_churn.py, that loads the saved model and makes predictions on new data. The script contains functions to load the data and make predictions, and it uses a threshold to label each customer as "Churn" or "No churn".
Running the script on new_churn_data.csv printed a label for each of the five new customers.
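
The assignment lists the true values for the new data as [1, 0, 0, 1, 0], so the printed labels can be checked against them. The sketch below is a minimal version of that check. It assumes the true values are in the same row order as the script's output and that 1 means churn. The helper name `confusion_counts` is only illustrative.

```python
y_true = [1, 0, 0, 1, 0]  # true values given in the assignment
labels = ["No churn", "Churn", "Churn", "No churn", "Churn"]  # printed by the script
y_pred = [1 if label == "Churn" else 0 for label in labels]


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means churn."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(f"tp={tp} fp={fp} fn={fn} tn={tn} accuracy={accuracy:.2f}")
```

Under those assumptions, none of the five labels matches its true value, so it may be worth checking whether the row order differs or whether the label mapping in the script is inverted.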

Overall, the process showed how AutoML tools such as PyCaret can speed up model selection, and why choosing the right evaluation metric matters in a business context.
Finally, I uploaded both the notebook and the Python script to a new GitHub repository for others to access and review.
