# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a Github repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc, and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

In [20]:
pip install pycaret

Collecting numpy<1.24,>=1.21
  Downloading numpy-1.23.5-cp39-cp39-macosx_10_9_x86_64.whl (18.1 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m18.1/18.1 MB[0m [31m26.6 MB/s[0m eta [36m0:00:00[0m00:01[0m00:01[0m
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.24.3
    Uninstalling numpy-1.24.3:
      Successfully uninstalled numpy-1.24.3
[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
daal4py 2021.6.0 requires daal==2021.4.0, which is not installed.[0m[31m
[0mSuccessfully installed numpy-1.23.5
Note: you may need to restart the kernel to use updated packages.


In [71]:
from pycaret.classification import *
import pandas as pd

In [72]:
# Load the data
data = pd.read_csv('cleaned_churn_data.csv')
data = data.drop(['tenure_charge_ratio', 'PhoneService_1', 'charge_to_tenure_ratio', 'customerID'], axis=1)
data

Unnamed: 0,tenure,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1,Month-to-month,Electronic check,29.85,29.85,No
1,34,One year,Mailed check,56.95,1889.50,No
2,2,Month-to-month,Mailed check,53.85,108.15,Yes
3,45,One year,Bank transfer (automatic),42.30,1840.75,No
4,2,Month-to-month,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...
6949,24,One year,Mailed check,84.80,1990.50,No
6950,72,One year,Credit card (automatic),103.20,7362.90,No
6951,11,Month-to-month,Electronic check,29.60,346.45,No
6952,4,Month-to-month,Mailed check,74.40,306.60,Yes


In [74]:
# Initializing the setup
clasification = setup(data, target='Churn')

Unnamed: 0,Description,Value
0,Session id,1379
1,Target,Churn
2,Target type,Binary
3,Target mapping,"No: 0, Yes: 1"
4,Original data shape,"(6954, 6)"
5,Transformed data shape,"(6954, 11)"
6,Transformed train set shape,"(4867, 11)"
7,Transformed test set shape,"(2087, 11)"
8,Numeric features,3
9,Categorical features,2


In [75]:
# Compare models and select the best one
best_model_data = compare_models()

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
ada,Ada Boost Classifier,0.7931,0.8434,0.5092,0.6475,0.569,0.4356,0.4417,0.047
gbc,Gradient Boosting Classifier,0.7919,0.8418,0.5099,0.6427,0.5675,0.4331,0.4388,0.101
lr,Logistic Regression,0.789,0.8337,0.5176,0.6321,0.5685,0.4307,0.4349,0.033
lightgbm,Light Gradient Boosting Machine,0.7888,0.8334,0.5382,0.6249,0.5776,0.4379,0.4406,0.032
ridge,Ridge Classifier,0.7871,0.0,0.4701,0.6416,0.5416,0.4075,0.4164,0.016
lda,Linear Discriminant Analysis,0.7863,0.8253,0.5299,0.6199,0.5706,0.4297,0.4325,0.02
rf,Random Forest Classifier,0.7728,0.8059,0.5045,0.5913,0.5436,0.3937,0.3965,0.096
knn,K Neighbors Classifier,0.7627,0.7507,0.4555,0.5734,0.5065,0.3533,0.358,0.022
et,Extra Trees Classifier,0.7598,0.7851,0.4961,0.5596,0.5252,0.3654,0.367,0.088
svm,SVM - Linear Kernel,0.7539,0.0,0.4699,0.58,0.4945,0.342,0.3597,0.02


Processing:   0%|          | 0/61 [00:00<?, ?it/s]

In [76]:
print(best_model_data)
save_model(best_model_data, 'best_churn_model')

AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=50, random_state=1379)
Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'MonthlyCharges',
                                              'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               missing_values=nan,
                                                               strateg...
                                     include=['Contract', 'PaymentMethod'],
                                     transformer=OneHotEncoder(cols=['Co

In [77]:
import pandas as pd
from pycaret.classification import load_model, predict_model

In [78]:
def predict_churn_probabilities(input_df):
    model = load_model('best_churn_model') # loading the model data

    predictions = predict_model(model, data=input_df) # making predections

    return predictions #return model probabilities

In [79]:
new_data = pd.read_csv('new_churn_data.csv')
print(new_data)
new_data = new_data.drop(['customerID'], axis=1)
churn_probabilities = predict_churn_probabilities(new_data)
print(churn_probabilities)

   customerID  tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
0  9305-CKSKC      22             1         0              2           97.40   
1  1452-KNGVK       8             0         1              1           77.30   
2  6723-OKKJM      28             1         0              0           28.25   
3  7832-POPKP      62             1         0              2          101.70   
4  6348-TACGU      10             0         0              1           51.15   

   TotalCharges  charge_per_tenure  
0        811.70          36.895455  
1       1701.95         212.743750  
2        250.90           8.960714  
3       3106.56          50.105806  
4       3440.97         344.097000  
Transformation Pipeline and Model Successfully Loaded


   tenure  PhoneService  Contract  PaymentMethod  MonthlyCharges  \
0      22             1         0              2       97.400002   
1       8             0         1              1       77.300003   
2      28             1         0              0       28.250000   
3      62             1         0              2      101.699997   
4      10             0         0              1       51.150002   

   TotalCharges  charge_per_tenure prediction_label  prediction_score  
0    811.700012          36.895454               No            0.5108  
1   1701.949951         212.743744               No            0.5138  
2    250.899994           8.960714               No            0.5200  
3   3106.560059          50.105808               No            0.5128  
4   3440.969971         344.096985               No            0.5210  


# Summary

Write a short summary of the process and results here.

I feel I was a success in achieving our objectives. PyCaret is a valuable tool in automating the model selection process, saving the best model, and creating a reusable prediction module. and then able to compare the predictions made on new data to the true values and evaluate the model's performance based on various metrics that fit business requirements.

Using PyCaret made our machine learning workflow much more streamlined, allowing us to experiment with different algorithms and select the best-performing one with ease. Next deploy the prediction module in a production environment, making real-time churn predictions for new customer data.

Overall, this project demonstrates how to build an end-to-end workflow for churn prediction, from data preprocessing and model selection to model deployment and testing with new data. With the help of PyCaret and a well-structured Python module, we can simplify the process of building and deploying machine learning models for business applications.