# DS Automation Assignment

Using our prepared churn data from week 2:
- use pycaret to find an ML algorithm that performs best on the data
    - Choose a metric you think is best to use for finding the best model; by default, it is accuracy but it could be AUC, precision, recall, etc. The week 3 FTE has some information on these different metrics.
- save the model to disk
- create a Python script/file/module with a function that takes a pandas dataframe as an input and returns the probability of churn for each row in the dataframe
    - your Python file/function should print out the predictions for new data (new_churn_data.csv)
    - the true values for the new data are [1, 0, 0, 1, 0] if you're interested
- test your Python module and function with the new data, new_churn_data.csv
- write a short summary of the process and results at the end of this notebook
- upload this Jupyter Notebook and Python file to a GitHub repository, and turn in a link to the repository in the week 5 assignment dropbox

*Optional* challenges:
- return the probability of churn for each new prediction, and the percentile where that prediction is in the distribution of probability predictions from the training dataset (e.g. a high probability of churn like 0.78 might be at the 90th percentile)
- use other autoML packages, such as TPOT, H2O, MLBox, etc., and compare performance and features with pycaret
- create a class in your Python module to hold the functions that you created
- accept user input to specify a file using a tool such as Python's `input()` function, the `click` package for command-line arguments, or a GUI
- Use the unmodified churn data (new_unmodified_churn_data.csv) in your Python script. This will require adding the same preprocessing steps from week 2 since this data is like the original unmodified dataset from week 1.

In [1]:
!pip install pycaret
import pandas as pd
from pycaret.classification import *


In [2]:
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/MyDrive/churn_advanced1.csv')
df.tail()

Mounted at /content/drive


Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,total_charges_tenure_ratio,monthly_charges_tenure
7038,24,1,1,3,84.8,1990.5,0,82.9375,2035.2
7039,72,1,1,1,103.2,7362.9,0,102.2625,7430.4
7040,11,0,0,2,29.6,346.45,0,31.495455,325.6
7041,4,1,0,3,74.4,306.6,1,76.65,297.6
7042,66,1,2,0,105.65,6844.5,0,103.704545,6972.9


In [3]:
df = df.drop(columns=['total_charges_tenure_ratio', 'monthly_charges_tenure'])

In [None]:
df.head()

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,monthly_charges_tenure
0,1,0,0,2,29.85,29.85,0,29.85
1,34,1,1,3,56.95,1889.5,0,1936.3
2,2,1,0,3,53.85,108.15,1,107.7
3,45,0,1,0,42.3,1840.75,0,1903.5
4,2,1,0,2,70.7,151.65,1,141.4


# Using the PyCaret AutoML library

In [4]:
# A fixed session_id keeps the train/test split the same across runs
automl = setup(df, target='Churn', session_id=123)

Unnamed: 0,Description,Value
0,Session id,123
1,Target,Churn
2,Target type,Binary
3,Original data shape,"(7043, 7)"
4,Transformed data shape,"(7043, 7)"
5,Transformed train set shape,"(4930, 7)"
6,Transformed test set shape,"(2113, 7)"
7,Numeric features,6
8,Preprocess,True
9,Imputation type,simple


# Sorting models by AUC


In [5]:
best_model = compare_models(sort="AUC")

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
gbc,Gradient Boosting Classifier,0.7939,0.8388,0.5008,0.6434,0.5615,0.4301,0.4367,0.844
ada,Ada Boost Classifier,0.7897,0.8383,0.4909,0.6349,0.5525,0.4182,0.4248,0.225
lr,Logistic Regression,0.7923,0.8317,0.5047,0.6398,0.5626,0.4292,0.4353,0.659
lightgbm,Light Gradient Boosting Machine,0.7866,0.8281,0.513,0.6173,0.5598,0.4208,0.4242,1.212
qda,Quadratic Discriminant Analysis,0.7452,0.8227,0.7347,0.5141,0.6048,0.4254,0.4403,0.035
xgboost,Extreme Gradient Boosting,0.7785,0.8186,0.5024,0.5981,0.5454,0.4007,0.4037,0.121
ridge,Ridge Classifier,0.7911,0.8185,0.4526,0.6564,0.5343,0.4057,0.4181,0.061
lda,Linear Discriminant Analysis,0.7832,0.8185,0.4886,0.6173,0.5441,0.4046,0.4102,0.061
nb,Naive Bayes,0.7103,0.8069,0.7623,0.472,0.5826,0.3793,0.4053,0.035
rf,Random Forest Classifier,0.7708,0.7957,0.4733,0.5851,0.522,0.3738,0.3781,0.831
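
AUC can be read as the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, which is why it is a reasonable way to rank models when the classes are imbalanced. The sketch below illustrates that definition in plain Python. It is not how PyCaret computes the metric internally, and the function name `auc_from_scores` and the labels and scores are made up for illustration.

```python
def auc_from_scores(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up labels and churn scores, for illustration only.
y_true = [1, 0, 0, 1, 0]
scores = [0.80, 0.30, 0.60, 0.55, 0.10]
print(auc_from_scores(y_true, scores))
```

In this toy example, five of the six churner/non-churner pairs are ordered correctly, so the value printed is 5/6.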



# Best model with parameters

In [6]:
best_model

In [7]:
final_model=finalize_model(best_model)

In [8]:
final_model

In [10]:
predict_model(best_model, df.iloc[-8:-1])


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Gradient Boosting Classifier,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn,prediction_label,prediction_score
7035,19,1,0,0,78.699997,1495.099976,0,0,0.6576
7036,12,0,1,2,60.650002,743.299988,0,0,0.82
7037,72,1,2,0,21.15,1419.400024,0,0,0.993
7038,24,1,1,3,84.800003,1990.5,0,0,0.9182
7039,72,1,1,1,103.199997,7362.899902,0,0,0.9214
7040,11,0,0,2,29.6,346.450012,0,0,0.6134
7041,4,1,0,3,74.400002,306.600006,1,1,0.5259


# Saving the model to disk

In [11]:
save_model(best_model, 'best_gradient_boost_classifier')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['tenure', 'PhoneService',
                                              'Contract', 'PaymentMethod',
                                              'MonthlyCharges', 'TotalCharges'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=False,
                                                               missing_values=nan,
                                                               strategy='mean'))),
                 ('categorical_imputer',...
                                             criterion='friedman_mse', init=None,
                      

In [12]:
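import pickle
# save_model above already wrote best_gradient_boost_classifier.pkl,
# so pickle this model under a separate name to avoid overwriting that file.
with open('best_gradient_boost_classifier_estimator.pkl', 'wb') as f:
    pickle.dump(best_model, f)

with open('best_gradient_boost_classifier_estimator.pkl', 'rb') as f:
    loaded_model = pickle.load(f)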

In [16]:
!pip install import_ipynb
import import_ipynb
import os
path="/content/drive/My Drive/Colab Notebooks"
os.chdir(path)



In [17]:
!pwd

/content/drive/MyDrive/Colab Notebooks


In [18]:
import churn_predict

In [20]:
%run churn_predict.ipynb

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Transformation Pipeline and Model Successfully Loaded


predictions:
            Churn_prediction
customerID                  
9305-CKSKC                 1
1452-KNGVK                 0
6723-OKKJM                 0
7832-POPKP                 1
6348-TACGU                 0

True Values: [1, 0, 0, 1, 0]


# Summary

Write a short summary of the process and results here.

Here I used PyCaret, an AutoML library that saves time by training and comparing many models with very little code and then selecting the best one. I loaded the prepared churn dataset and dropped the columns that were not required. In `setup`, I set `session_id` so that the train/test split stays the same each time the notebook is run. To choose the best model, I sorted the results by AUC, which is a useful metric when the dataset is imbalanced because it measures how well the model separates the positive and negative classes. The Gradient Boosting Classifier came out as the best model, with an accuracy of 0.7939 and an AUC of 0.8388. I then saved the Gradient Boosting Classifier with `save_model`, which writes it to a pickle file that can be loaded whenever required.

I created a new file, churn_predict.ipynb, containing a function that loads the model and the dataset. It returns the prediction scores, and applying a threshold to those scores gives the predicted churn values. In PyCaret's output, `prediction_score` is the probability of the predicted label, and every new row was labelled 0, so a high score means the model is confident that the customer will not churn. I therefore set the threshold to 0.7 and replaced True (a score of at least 0.7) with 0 and False with 1. With this threshold, the model's predictions match the true values, as shown above.

In [27]:
new_data = pd.read_csv('/content/new_churn_data.csv',index_col="customerID")

In [28]:
new_data.head()

Unnamed: 0_level_0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure
customerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
9305-CKSKC,22,1,0,2,97.4,811.7,36.895455
1452-KNGVK,8,0,1,1,77.3,1701.95,212.74375
6723-OKKJM,28,1,0,0,28.25,250.9,8.960714
7832-POPKP,62,1,0,2,101.7,3106.56,50.105806
6348-TACGU,10,0,0,1,51.15,3440.97,344.097


In [31]:
predic=predict_model(loaded_model,data=new_data)

In [36]:
predic

Unnamed: 0_level_0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score
customerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
9305-CKSKC,22,1,0,2,97.400002,811.700012,36.895454,0,0.6275
1452-KNGVK,8,0,1,1,77.300003,1701.949951,212.743744,0,0.8787
6723-OKKJM,28,1,0,0,28.25,250.899994,8.960714,0,0.8885
7832-POPKP,62,1,0,2,101.699997,3106.560059,50.105808,0,0.6334
6348-TACGU,10,0,0,1,51.150002,3440.969971,344.096985,0,0.7553


In [40]:
# Percentile rank of each score among the new rows only (not the training distribution)
predic['new_percentile'] = predic['prediction_score'].rank(pct=True) * 100

In [41]:
predic

Unnamed: 0_level_0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score,new_percentile
customerID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
9305-CKSKC,22,1,0,2,97.400002,811.700012,36.895454,0,0.6275,20.0
1452-KNGVK,8,0,1,1,77.300003,1701.949951,212.743744,0,0.8787,80.0
6723-OKKJM,28,1,0,0,28.25,250.899994,8.960714,0,0.8885,100.0
7832-POPKP,62,1,0,2,101.699997,3106.560059,50.105808,0,0.6334,40.0
6348-TACGU,10,0,0,1,51.150002,3440.969971,344.096985,0,0.7553,60.0
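
The percentile above ranks each score against the other four new rows. The optional challenge asks instead for the position of each prediction within the distribution of predictions on the training data. A minimal sketch of that calculation follows, assuming the training-set scores are available as a list or array. The function name `percentile_in_training` and the training values are made up for illustration.

```python
import numpy as np


def percentile_in_training(train_scores, new_scores):
    """Percent of training scores that are <= each new score."""
    sorted_train = np.sort(np.asarray(train_scores, dtype=float))
    new_scores = np.asarray(new_scores, dtype=float)
    # side="right" counts training scores less than or equal to the new score.
    counts = np.searchsorted(sorted_train, new_scores, side="right")
    return counts * 100.0 / len(sorted_train)


# Made-up training probabilities, for illustration only.
train_scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.78, 0.90, 0.95]
print(percentile_in_training(train_scores, [0.78, 0.01]))
```

With these made-up values, a score of 0.78 is at or above eight of the ten training scores, so it lands at the 80th percentile, while 0.01 is below all of them.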


In [44]:
predic.columns

Index(['tenure', 'PhoneService', 'Contract', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'charge_per_tenure', 'prediction_label',
       'prediction_score', 'new_percentile'],
      dtype='object')

In [46]:
predic1=predic.copy()

In [59]:
predic1=predic1.reset_index(drop=True)

In [60]:
predic1

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score,new_percentile,Churn_prediction
0,22,1,0,2,97.400002,811.700012,36.895454,0,0.6275,20.0,1
1,8,0,1,1,77.300003,1701.949951,212.743744,0,0.8787,80.0,0
2,28,1,0,0,28.25,250.899994,8.960714,0,0.8885,100.0,0
3,62,1,0,2,101.699997,3106.560059,50.105808,0,0.6334,40.0,1
4,10,0,0,1,51.150002,3440.969971,344.096985,0,0.7553,60.0,0


In [61]:
# Scores of at least 0.7 map to 0 (no churn); lower scores map to 1 (churn)
predic1['Churn_prediction'] = (predic1['prediction_score'] < 0.7).astype(int)

In [62]:
predic1.head()

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score,new_percentile,Churn_prediction
0,22,1,0,2,97.400002,811.700012,36.895454,0,0.6275,20.0,1
1,8,0,1,1,77.300003,1701.949951,212.743744,0,0.8787,80.0,0
2,28,1,0,0,28.25,250.899994,8.960714,0,0.8885,100.0,0
3,62,1,0,2,101.699997,3106.560059,50.105808,0,0.6334,40.0,1
4,10,0,0,1,51.150002,3440.969971,344.096985,0,0.7553,60.0,0
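
Because `prediction_score` in PyCaret's output is the probability of the predicted label rather than of churn itself, the thresholding above can also be expressed as a cutoff on an explicit churn probability. The sketch below shows one way to do that, assuming a binary 0/1 label. The function name `add_churn_probability` and the `churn_probability` column are hypothetical, and the labels and scores are copied from the predictions table above.

```python
import pandas as pd


def add_churn_probability(preds, threshold=0.3):
    """Add a churn probability and a thresholded churn flag.

    Assumes `prediction_score` is the probability of `prediction_label`,
    so the score is flipped for rows labelled 0.
    """
    out = preds.copy()
    is_churn_label = out["prediction_label"] == 1
    # Keep the score where the label is 1, otherwise use its complement.
    out["churn_probability"] = out["prediction_score"].where(
        is_churn_label, 1 - out["prediction_score"]
    )
    out["Churn_prediction"] = (out["churn_probability"] > threshold).astype(int)
    return out


# Labels and scores copied from the predictions table above.
preds = pd.DataFrame(
    {
        "prediction_label": [0, 0, 0, 0, 0],
        "prediction_score": [0.6275, 0.8787, 0.8885, 0.6334, 0.7553],
    },
    index=["9305-CKSKC", "1452-KNGVK", "6723-OKKJM", "7832-POPKP", "6348-TACGU"],
)
result = add_churn_probability(preds)
print(result["Churn_prediction"].tolist())
```

For rows labelled 0, a churn-probability cutoff of 0.3 corresponds to the 0.7 cutoff on `prediction_score` used above, so the two customers with scores below 0.7 are the ones flagged as churn.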


In [42]:
!pip install h2o
from h2o.automl import H2OAutoML
import h2o



In [63]:
target=predic1['Churn_prediction']
features=predic1.drop(['Churn_prediction'],axis=1)

In [64]:
features

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,charge_per_tenure,prediction_label,prediction_score,new_percentile
0,22,1,0,2,97.400002,811.700012,36.895454,0,0.6275,20.0
1,8,0,1,1,77.300003,1701.949951,212.743744,0,0.8787,80.0
2,28,1,0,0,28.25,250.899994,8.960714,0,0.8885,100.0
3,62,1,0,2,101.699997,3106.560059,50.105808,0,0.6334,40.0
4,10,0,0,1,51.150002,3440.969971,344.096985,0,0.7553,60.0


In [71]:
# features.columns

Index(['tenure', 'PhoneService', 'Contract', 'PaymentMethod', 'MonthlyCharges',
       'TotalCharges', 'charge_per_tenure', 'prediction_label',
       'prediction_score', 'new_percentile'],
      dtype='object')

In [65]:
# target

Unnamed: 0,Churn_prediction
0,1
1,0
2,0
3,1
4,0


In [53]:
# hf=h2o.H2OFrame(predic1)

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [82]:
df = pd.read_csv('/content/drive/MyDrive/churn_advanced1.csv')
df = df.drop(columns=['total_charges_tenure_ratio', 'monthly_charges_tenure'])

In [83]:
hf=h2o.H2OFrame(df)


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [84]:
df.head()

Unnamed: 0,tenure,PhoneService,Contract,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1,0,0,2,29.85,29.85,0
1,34,1,1,3,56.95,1889.5,0
2,2,1,0,3,53.85,108.15,1
3,45,0,1,0,42.3,1840.75,0
4,2,1,0,2,70.7,151.65,1


In [85]:
train, valid = hf.split_frame(ratios=[.8], seed=1234)

In [86]:
hf.columns

['tenure',
 'PhoneService',
 'Contract',
 'PaymentMethod',
 'MonthlyCharges',
 'TotalCharges',
 'Churn']

In [87]:
hf.columns[0:-1]

['tenure',
 'PhoneService',
 'Contract',
 'PaymentMethod',
 'MonthlyCharges',
 'TotalCharges']


In [89]:
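h2o.init()  # in a fresh session, run this before creating any H2OFrame
# Convert the 0/1 target to a factor so AutoML trains classifiers.
# Without this conversion H2O treats Churn as a regression target and
# emits the warning recorded in the output below.
train['Churn'] = train['Churn'].asfactor()
aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=hf.columns[0:-1], y=hf.columns[-1], training_frame=train)
lb = aml.leaderboard
print(lb.head(rows=lb.nrows))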

Checking whether there is an H2O instance running at http://localhost:54321. connected.


0,1
H2O_cluster_uptime:,44 mins 46 secs
H2O_cluster_timezone:,Etc/UTC
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.46.0.5
H2O_cluster_version_age:,1 month and 3 days
H2O_cluster_name:,H2O_from_python_unknownUser_pi61de
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.170 Gb
H2O_cluster_total_cores:,2
H2O_cluster_allowed_cores:,2


AutoML progress: |
17:37:06.826: _response param, We have detected that your response column has only 2 unique values (0/1). If you wish to train a binary model instead of a regression model, convert your target column to categorical before training.
