# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox
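
The metric choice mentioned in the list above can be made concrete with a small sketch. It computes accuracy, precision, and recall by hand against the five true labels given in the assignment. The `predicted` list and the helper name `classification_metrics` are made up for illustration and are not output from any model in this notebook.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    accuracy = correct / len(pairs)
    # Guard against division by zero when no positives are predicted/present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}

true_values = [1, 0, 0, 1, 0]  # from the assignment text
predicted = [1, 0, 0, 0, 0]    # hypothetical predictions, illustration only

metrics = classification_metrics(true_values, predicted)
print(metrics)
```

With these hypothetical predictions, one missed churner leaves accuracy at 0.8 but drops recall to 0.5, which is one reason recall or AUC may be preferred over accuracy when missing a churner is costly.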

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.
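
For the percentile challenge in the list above, one possible approach is to rank a new probability against the probabilities predicted on the training set. The sketch below is an assumption about how that could be done, not a prescribed solution: `percentile_of` is a hypothetical helper and `train_probs` is made-up data.

```python
from bisect import bisect_right

def percentile_of(value, reference):
    """Percent of reference values that are <= value (0 to 100)."""
    ordered = sorted(reference)
    return 100.0 * bisect_right(ordered, value) / len(ordered)

# Hypothetical training-set churn probabilities, for illustration only.
train_probs = [0.05, 0.10, 0.12, 0.20, 0.33, 0.41, 0.55, 0.60, 0.78, 0.91]

pct = percentile_of(0.78, train_probs)
print(f"0.78 is at the {pct:.0f}th percentile")
```

In practice `train_probs` would come from running the saved model on the training data, so the ranking reflects the model's own score distribution.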

In [10]:
#pip install pycaret

In [2]:
from pycaret.classification import *
import pandas as pd

churn_data = pd.read_csv('D:/churn_data.csv')

clf_setup = setup(data=churn_data, target='Churn', session_id=42)

| | Description | Value |
|---|---|---|
| 0 | Session id | 42 |
| 1 | Target | Churn |
| 2 | Target type | Binary |
| 3 | Target mapping | No: 0, Yes: 1 |
| 4 | Original data shape | (7043, 8) |
| 5 | Transformed data shape | (7043, 13) |
| 6 | Transformed train set shape | (4930, 13) |
| 7 | Transformed test set shape | (2113, 13) |
| 8 | Ordinal features | 1 |
| 9 | Numeric features | 3 |


In [3]:
best_model = compare_models(fold=5, sort='AUC')


| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| lr | Logistic Regression | 0.7361 | 0.8322 | 0.7361 | 0.6316 | 0.6276 | 0.0118 | 0.0332 | 1.86 |
| nb | Naive Bayes | 0.7132 | 0.8107 | 0.7132 | 0.7866 | 0.7292 | 0.3916 | 0.4209 | 0.142 |
| knn | K Neighbors Classifier | 0.7566 | 0.7281 | 0.7566 | 0.7398 | 0.7442 | 0.3205 | 0.3265 | 0.252 |
| et | Extra Trees Classifier | 0.7347 | 0.6689 | 0.7347 | 0.5398 | 0.6223 | 0.0 | 0.0 | 0.372 |
| rf | Random Forest Classifier | 0.7347 | 0.6656 | 0.7347 | 0.5398 | 0.6223 | 0.0 | 0.0 | 0.492 |
| lightgbm | Light Gradient Boosting Machine | 0.7347 | 0.5377 | 0.7347 | 0.5398 | 0.6223 | 0.0 | 0.0 | 0.308 |
| dt | Decision Tree Classifier | 0.7347 | 0.5 | 0.7347 | 0.5398 | 0.6223 | 0.0 | 0.0 | 0.15 |
| qda | Quadratic Discriminant Analysis | 0.7347 | 0.5 | 0.7347 | 0.5398 | 0.6223 | 0.0 | 0.0 | 0.63 |
| ada | Ada Boost Classifier | 0.7347 | 0.5 | 0.7347 | 0.5398 | 0.6223 | 0.0 | 0.0 | 0.206 |
| gbc | Gradient Boosting Classifier | 0.7347 | 0.5 | 0.7347 | 0.5398 | 0.6223 | 0.0 | 0.0 | 0.528 |


In [4]:
tuned_model = tune_model(best_model)


| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.7972 | 0.8324 | 0.7972 | 0.7909 | 0.7933 | 0.4589 | 0.4603 |
| 1 | 0.8012 | 0.8396 | 0.8012 | 0.7952 | 0.7974 | 0.4698 | 0.4711 |
| 2 | 0.8134 | 0.8467 | 0.8134 | 0.8057 | 0.8076 | 0.4947 | 0.4984 |
| 3 | 0.8012 | 0.8634 | 0.8012 | 0.7911 | 0.793 | 0.4533 | 0.4589 |
| 4 | 0.8053 | 0.8311 | 0.8053 | 0.797 | 0.7992 | 0.4727 | 0.4762 |
| 5 | 0.7931 | 0.8523 | 0.7931 | 0.7921 | 0.7926 | 0.4672 | 0.4672 |
| 6 | 0.7769 | 0.8122 | 0.7769 | 0.7718 | 0.774 | 0.4139 | 0.4145 |
| 7 | 0.7769 | 0.8222 | 0.7769 | 0.7699 | 0.7727 | 0.408 | 0.4092 |
| 8 | 0.7688 | 0.8304 | 0.7688 | 0.7634 | 0.7657 | 0.3926 | 0.3932 |
| 9 | 0.7586 | 0.8073 | 0.7586 | 0.7557 | 0.7571 | 0.3738 | 0.3739 |


Fitting 10 folds for each of 10 candidates, totalling 100 fits
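
The per-fold table above has no summary row, so a quick way to condense it is to average the fold AUCs. The sketch below copies the ten AUC values from that output and uses Python's `statistics` module. The sample standard deviation used here is a choice made for this sketch and may differ from how PyCaret summarises folds.

```python
from statistics import mean, stdev

# AUC per fold, copied from the tune_model output above.
fold_auc = [0.8324, 0.8396, 0.8467, 0.8634, 0.8311,
            0.8523, 0.8122, 0.8222, 0.8304, 0.8073]

auc_mean = mean(fold_auc)
auc_std = stdev(fold_auc)  # sample standard deviation (n - 1)

print(f"mean AUC = {auc_mean:.4f}, std = {auc_std:.4f}")
```

The mean works out to roughly 0.8338, close to the 0.8322 that logistic regression scored in the 5-fold comparison, so tuning appears to have changed AUC very little.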


In [5]:
evaluate_model(tuned_model)


In [6]:
save_model(best_model, 'saved_churn_model')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'MonthlyCharges',
                                              'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,...
                                                               handle_unknown='value',
                                                               hierarchy=None,
                                              

In [7]:
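import pandas as pd
from pycaret.classification import load_model, predict_model

def predict_churn_probabilities(new_data):
    # load_model appends '.pkl' itself, so pass the name without the extension
    model = load_model('saved_churn_model')

    predictions = predict_model(model, data=new_data)

    # PyCaret 3.x calls the predicted-class column 'prediction_label' ('Label' in 2.x)
    print(predictions['prediction_label'])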


In [8]:
model = load_model('saved_churn_model')

Transformation Pipeline and Model Successfully Loaded


In [9]:
import pandas as pd
from pycaret.classification import load_model, predict_model

def predict_churn_probabilities(new_data):
    model = load_model('saved_churn_model')

    predictions = predict_model(model, data=new_data)

    print(predictions.columns)

    # Print the first matching column, checked in this order
    if 'Label' in predictions.columns:
        print(predictions['Label'])
    elif 'Score' in predictions.columns:
        print(predictions['Score'])
    elif 'charge_per_tenure' in predictions.columns:
        # charge_per_tenure is an input feature, not a churn probability
        print(predictions['charge_per_tenure'])
    else:
        print("No probability column found in predictions.")

new_churn_data = pd.read_csv('D:/new_churn_data.csv')
true_values = [1, 0, 0, 1, 0]

predict_churn_probabilities(new_churn_data)


Transformation Pipeline and Model Successfully Loaded


Index(['customerID', 'tenure', 'PhoneService', 'Contract', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges', 'charge_per_tenure',
       'prediction_label', 'prediction_score'],
      dtype='object')
0     36.895454
1    212.743744
2      8.960714
3     50.105808
4    344.096985
Name: charge_per_tenure, dtype: float32
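
The column index printed above lists `prediction_label` and `prediction_score`, which are the columns that carry the model's output. A minimal sketch of turning those two values into a churn probability follows. It assumes `prediction_score` is the probability of the predicted class (my understanding of PyCaret's behaviour when `raw_score` is not set) and that churn is encoded as 1 or "Yes". The helper `churn_probability` and the sample rows are hypothetical.

```python
def churn_probability(label, score, churn_labels=(1, "Yes")):
    """Convert (predicted label, score of predicted class) to P(churn)."""
    # If the model predicted churn, the score already is P(churn);
    # otherwise the score is P(no churn), so take the complement.
    return score if label in churn_labels else 1.0 - score

# Made-up (prediction_label, prediction_score) pairs, for illustration only.
rows = [(1, 0.75), (0, 0.75), ("Yes", 0.5), ("No", 1.0)]
probs = [churn_probability(label, score) for label, score in rows]
print(probs)

# With a real PyCaret result this could be applied column-wise, e.g.
# [churn_probability(l, s) for l, s in
#  zip(predictions["prediction_label"], predictions["prediction_score"])]
```

Returning the list instead of only printing it would also make the function usable from another module, as the assignment asks.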


# Summary

Write a short summary of the process and results here.

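PyCaret's `setup` prepared the week 2 churn data (7,043 rows, 8 columns) with a roughly 70/30 train/test split, and `compare_models` ranked the candidate models by AUC over 5 folds. Logistic regression came out on top with an AUC of 0.8322. Seven of the ten listed models share an accuracy of 0.7347 with a Kappa of 0, which suggests they predicted only the majority class. Tuning the logistic regression with `tune_model` over 10 folds gave fold AUCs between 0.8073 and 0.8634. The untuned `best_model` was saved to disk as `saved_churn_model` and reloaded inside `predict_churn_probabilities`, which ran `predict_model` on new_churn_data.csv. The returned dataframe contains `prediction_label` and `prediction_score`, but the function checks for `Label` and `Score` first and so printed the `charge_per_tenure` feature instead. As a result, the notebook does not yet show churn probabilities for the five new rows, and the predictions have not been compared with the true values [1, 0, 0, 1, 0]. Printing `prediction_score` would address this.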