## AutoML for Modelling

- The Katonic Python SDK covers the complete ML model life cycle.
- The AutoML component of the SDK trains a machine learning model in one or two lines of code.
- The SDK catalogues all classification metrics for each trained model.

## Imports

In [3]:
import os
os.system("pip install katonic[ml]")
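Inside a notebook, a bare `pip` may belong to a different Python than the one running the kernel. A minimal sketch, assuming that is a concern in your environment, builds the command from `sys.executable` so the package is installed for the kernel's own interpreter. The helper names `pip_install_command` and `pip_install` are illustrative and not part of the Katonic SDK.

```python
import subprocess
import sys


def pip_install_command(requirement):
    """Build a pip command bound to the interpreter running this kernel."""
    return [sys.executable, "-m", "pip", "install", requirement]


def pip_install(requirement):
    """Run pip and raise CalledProcessError if the install fails."""
    subprocess.check_call(pip_install_command(requirement))


# Usage (not executed here): pip_install("katonic[ml]")
```

Unlike `os.system`, `subprocess.check_call` raises an exception on a non-zero exit code, so a failed install does not pass silently.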

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [46]:
import pandas as pd
from sklearn.model_selection import train_test_split

from katonic.ml.classification import Classifier

from katonic.log.client import load_model

pd.set_option('display.max_columns', 100)

In [5]:
# define experiment name
exp_name = "teleco_customer_churn3"

## Loading the Pre-processed File

In [6]:
X_df = pd.read_csv("preprocessed_customer_churn.csv")

In [7]:
# Keep the target next to the customer ID, then remove it from the feature frame
Y_df = X_df[['customerID','Churn']]
X_df.pop('Churn')
Y_df.head()

|  | customerID | Churn |
|---|---|---|
| 0 | 5375 | 0 |
| 1 | 3962 | 0 |
| 2 | 2564 | 1 |
| 3 | 5535 | 0 |
| 4 | 6511 | 1 |


In [8]:
X = X_df
y = Y_df

In [9]:
X.columns

Index(['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents',
       'tenure', 'PhoneService', 'MultipleLines', 'InternetService',
       'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
       'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling',
       'PaymentMethod', 'MonthlyCharges', 'TotalCharges'],
      dtype='object')

## Data Splitting

In [10]:
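# Drop the identifier column from the features and keep only the target series
X_train = X.drop(['customerID'], axis=1)
y_train = y['Churn']

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X_train, y_train, test_size=0.20, random_state=42)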
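The churn classes are imbalanced (the LightGBM log later in this notebook reports 1496 positives against 4138 negatives in the training set). The split above does not stratify. As an optional variation, here is a minimal sketch on synthetic labels, not the churn data, of how the `stratify` argument keeps the class ratio the same in both partitions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data: 1000 rows, 25% positive labels.
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"tenure": rng.integers(0, 72, size=1000)})
y_demo = pd.Series([1] * 250 + [0] * 750, name="Churn")

# stratify=y_demo makes both partitions keep the 25% positive rate.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```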

In [11]:
X_train.columns

Index(['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure',
       'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod',
       'MonthlyCharges', 'TotalCharges'],
      dtype='object')

## Feature Selection

In [12]:
top = ['tenure',
 'InternetService',
 'OnlineSecurity',
 'OnlineBackup',
 'TechSupport',
 'Contract',
 'PaperlessBilling',
 'PaymentMethod',
 'MonthlyCharges',
 'TotalCharges']

In [13]:
X_train[top]

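|  | tenure | InternetService | OnlineSecurity | OnlineBackup | TechSupport | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges |
|---|---|---|---|---|---|---|---|---|---|---|
| 2142 | 21 | 0 | 2 | 0 | 0 | 1 | 0 | 3 | 64.85 | 610 |
| 1623 | 54 | 1 | 0 | 2 | 0 | 2 | 1 | 0 | 97.20 | 4319 |
| 6074 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 23.45 | 1940 |
| 1362 | 4 | 1 | 0 | 0 | 0 | 0 | 1 | 2 | 70.20 | 2012 |
| 6754 | 0 | 0 | 2 | 2 | 2 | 2 | 1 | 0 | 61.90 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3772 | 1 | 1 | 2 | 0 | 0 | 0 | 1 | 2 | 95.00 | 6440 |
| 5191 | 23 | 0 | 2 | 2 | 2 | 2 | 1 | 1 | 91.10 | 1819 |
| 5226 | 12 | 2 | 1 | 1 | 1 | 0 | 1 | 2 | 21.15 | 2659 |
| 5390 | 12 | 1 | 0 | 0 | 0 | 0 | 1 | 2 | 99.45 | 370 |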


In [14]:
X_train, X_test = X_train[top], X_test[top]
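The notebook does not show how the `top` list was chosen. One possible way to arrive at such a list, offered as an assumption and not as the author's method, is univariate selection with scikit-learn's `SelectKBest`. The sketch below uses synthetic data and hypothetical column names `f0` to `f7`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data with hypothetical column names f0..f7.
X_arr, y_arr = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(8)])

# Score each feature against the target with an ANOVA F-test and keep the best 3.
selector = SelectKBest(score_func=f_classif, k=3).fit(X_demo, y_arr)
top_demo = list(X_demo.columns[selector.get_support()])

print(top_demo)
X_demo_selected = X_demo[top_demo]
```

On real data, fitting the selector on the training split only keeps information from the test rows out of the feature choice.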


## Modelling

In [21]:
features = list(X_train.columns)
clf = Classifier(X_train, X_test, y_train, y_test, exp_name, source_name='2. modelling.ipynb', features=features)

In [22]:
exp_id = clf.id

print("experiment name : ", clf.name)
print("experiment location : ", clf.location)
print("experiment id : ", clf.id)
print("experiment status : ", clf.stage)

experiment name :  teleco_customer_churn3
experiment location :  s3://models/12
experiment id :  12
experiment status :  active


### Logistic Regression

In [23]:
clf.LogisticRegression()

### GradientBoostingClassifier

In [24]:
clf.GradientBoostingClassifier()

### RandomForestClassifier

In [25]:
clf.RandomForestClassifier(random_state = 42)

### AdaBoostClassifier

In [26]:
clf.AdaBoostClassifier(random_state = 42)

### LightGBMClassifier

In [27]:
clf.LGBMClassifier()

[LightGBM] [Info] Number of positive: 1496, number of negative: 4138
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000505 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 604
[LightGBM] [Info] Number of data points in the train set: 5634, number of used features: 10
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.265531 -> initscore=-1.017418
[LightGBM] [Info] Start training from score -1.017418
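The numbers in this log can be reproduced by hand. `pavg` is the share of positive labels in the training set, and `initscore` is the log-odds of that share, which the booster uses as its starting prediction. A short check using only the counts printed above:

```python
import math

positives, negatives = 1496, 4138  # counts from the LightGBM log

pavg = positives / (positives + negatives)   # fraction of positive labels
initscore = math.log(pavg / (1 - pavg))      # log-odds of the positive class

print(f"pavg={pavg:.6f} initscore={initscore:.6f}")
```

The two counts also add up to 5634, the number of training data points reported in the same log.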


### XGBClassifier

In [28]:
clf.XGBClassifier()

### DecisionTreeClassifier

In [29]:
clf.DecisionTreeClassifier()

### SupportVectorClassifier

In [30]:
clf.SupportVectorClassifier()

### RidgeClassifier

In [31]:
clf.RidgeClassifier()

### KNeighborsClassifier

In [32]:
clf.KNeighborsClassifier()

### GaussianNB

In [33]:
clf.GaussianNB()

### CatBoostClassifier

In [34]:
clf.CatBoostClassifier(random_state=42)

Learning rate set to 0.021554
0:	learn: 0.6786690	total: 48.7ms	remaining: 48.7s
1:	learn: 0.6652349	total: 50.9ms	remaining: 25.4s
2:	learn: 0.6523005	total: 53ms	remaining: 17.6s
3:	learn: 0.6400920	total: 55.1ms	remaining: 13.7s
4:	learn: 0.6282613	total: 57.1ms	remaining: 11.4s
5:	learn: 0.6168214	total: 59.2ms	remaining: 9.8s
6:	learn: 0.6063354	total: 61.2ms	remaining: 8.69s
7:	learn: 0.5967113	total: 63.2ms	remaining: 7.84s
8:	learn: 0.5876191	total: 65.2ms	remaining: 7.18s
9:	learn: 0.5787825	total: 67.2ms	remaining: 6.65s
10:	learn: 0.5708681	total: 69.2ms	remaining: 6.22s
11:	learn: 0.5629734	total: 71.2ms	remaining: 5.86s
12:	learn: 0.5555358	total: 73.4ms	remaining: 5.57s
13:	learn: 0.5484440	total: 75.3ms	remaining: 5.31s
14:	learn: 0.5417762	total: 77.4ms	remaining: 5.08s
15:	learn: 0.5354177	total: 79.4ms	remaining: 4.88s
16:	learn: 0.5299474	total: 81.4ms	remaining: 4.7s
17:	learn: 0.5244813	total: 83.5ms	remaining: 4.55s
18:	learn: 0.5189369	total: 85.5ms	remaining: 4.
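Each `clf.<Model>()` call above trains one estimator and logs its metrics as a run. The SDK's internals are not shown in this notebook, so the following is only a rough plain scikit-learn analogue on synthetic data: fit a few estimators, score them on the held-out split, and collect the results in a DataFrame. The `evaluate` helper and the dictionary keys are illustrative choices, not SDK names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic binary classification data standing in for the churn features.
X_arr, y_arr = make_classification(n_samples=600, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_arr, y_arr, test_size=0.20, random_state=42)


def evaluate(name, estimator):
    """Fit one estimator and return its test metrics as a dict."""
    y_hat = estimator.fit(X_tr, y_tr).predict(X_te)
    return {
        "run_name": name,
        "accuracy_score": accuracy_score(y_te, y_hat),
        "f1_score": f1_score(y_te, y_hat),
        "precision_score": precision_score(y_te, y_hat),
        "recall": recall_score(y_te, y_hat),
    }


candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest_classifier": RandomForestClassifier(random_state=42),
    "gaussian_NB_classifier": GaussianNB(),
}
results = pd.DataFrame([evaluate(name, est) for name, est in candidates.items()])
print(results)
```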

## Get Runs

In [35]:
# Fetch all runs logged under this experiment
df_runs = clf.search_runs(exp_id)
print("Number of runs done : ", len(df_runs))
df_runs.head()

Number of runs done :  13


Unnamed: 0,artifact_uri,end_time,experiment_id,metrics.accuracy_score,metrics.f1_score,metrics.log_loss,metrics.precision_score,metrics.recall,metrics.roc_auc_score,run_id,run_name,start_time,status,tags.data_path,tags.experiment_id,tags.experiment_name,tags.features,tags.mlflow.log-model.history,tags.run_id,tags.version.mlflow
0,s3://models/12/19682834fdca404eb78b4519e9083c9...,2023-10-23 14:12:26.110000+00:00,12,0.806246,0.586989,6.692094,0.673611,0.520107,0.714687,19682834fdca404eb78b4519e9083c99,teleco_customer_churn3_12_cat_boost_classifier,2023-10-23 14:12:22.451000+00:00,FINISHED,-,12,teleco_customer_churn3,"['tenure', 'InternetService', 'OnlineSecurity'...","[{""run_id"": ""19682834fdca404eb78b4519e9083c99""...",19682834fdca404eb78b4519e9083c99,2.0.1
1,s3://models/12/ce114e8eb8ae4cc1af26333dbd61d81...,2023-10-23 14:12:22.420000+00:00,12,0.757984,0.626506,8.359067,0.52963,0.766756,0.760791,ce114e8eb8ae4cc1af26333dbd61d812,teleco_customer_churn3_12_gaussian_NB_classifier,2023-10-23 14:12:20.750000+00:00,FINISHED,-,12,teleco_customer_churn3,"['tenure', 'InternetService', 'OnlineSecurity'...","[{""run_id"": ""ce114e8eb8ae4cc1af26333dbd61d812""...",ce114e8eb8ae4cc1af26333dbd61d812,2.0.1
2,s3://models/12/8a74f2486fbd4c3c999434f1dd7548b...,2023-10-23 14:12:10.608000+00:00,12,0.756565,0.474732,8.40802,0.553571,0.41555,0.647447,8a74f2486fbd4c3c999434f1dd7548ba,teleco_customer_churn3_12_k_neighbors_classifier,2023-10-23 14:12:08.850000+00:00,FINISHED,-,12,teleco_customer_churn3,"['tenure', 'InternetService', 'OnlineSecurity'...","[{""run_id"": ""8a74f2486fbd4c3c999434f1dd7548ba""...",8a74f2486fbd4c3c999434f1dd7548ba,2.0.1
3,s3://models/12/ecf25f5a4f004c2fa28e12ab7291381...,2023-10-23 14:12:08.826000+00:00,12,0.810504,0.583463,6.545009,0.697761,0.50134,0.711578,ecf25f5a4f004c2fa28e12ab72913817,teleco_customer_churn3_12_ridge_classifier,2023-10-23 14:12:06.981000+00:00,FINISHED,-,12,teleco_customer_churn3,"['tenure', 'InternetService', 'OnlineSecurity'...","[{""run_id"": ""ecf25f5a4f004c2fa28e12ab72913817""...",ecf25f5a4f004c2fa28e12ab72913817,2.0.1
4,s3://models/12/0b5746cf1d4b46d5884b6e192168f73...,2023-10-23 14:12:05.927000+00:00,12,0.735273,0.0,9.143338,0.0,0.0,0.5,0b5746cf1d4b46d5884b6e192168f737,teleco_customer_churn3_12_svm_classifier,2023-10-23 14:12:03.496000+00:00,FINISHED,-,12,teleco_customer_churn3,"['tenure', 'InternetService', 'OnlineSecurity'...","[{""run_id"": ""0b5746cf1d4b46d5884b6e192168f737""...",0b5746cf1d4b46d5884b6e192168f737,2.0.1
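In the `svm_classifier` row, precision and recall are 0.0 and `metrics.roc_auc_score` is exactly 0.5. That is what `roc_auc_score` returns when it receives hard class labels from a model that predicts only the negative class, which suggests (the source does not confirm it) that the logged ROC AUC is computed from `predict` output and not from probabilities. With hard labels the score reduces to the mean of recall and specificity, that is, balanced accuracy. A small sketch on made-up labels:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Made-up ground truth, scores, and the hard labels they give at a 0.5 threshold.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.2, 0.3, 0.4, 0.6, 0.9, 0.8, 0.45, 0.35])
y_label = (y_score >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_label)   # a single operating point
auc_from_scores = roc_auc_score(y_true, y_score)   # uses the full ranking
all_negative = roc_auc_score(y_true, np.zeros_like(y_true))

print(auc_from_labels, balanced_accuracy_score(y_true, y_label))
print(auc_from_scores)
print(all_negative)
```

If the logged values are indeed label-based, ranking runs by this column compares models at their default decision threshold only.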


In [36]:
df_runs.shape

(13, 20)

## Evaluating Models

In [37]:
top_runs = df_runs.sort_values(['metrics.roc_auc_score'], ascending=False)
top_runs.head()

Unnamed: 0,artifact_uri,end_time,experiment_id,metrics.accuracy_score,metrics.f1_score,metrics.log_loss,metrics.precision_score,metrics.recall,metrics.roc_auc_score,run_id,run_name,start_time,status,tags.data_path,tags.experiment_id,tags.experiment_name,tags.features,tags.mlflow.log-model.history,tags.run_id,tags.version.mlflow
1,s3://models/12/ce114e8eb8ae4cc1af26333dbd61d81...,2023-10-23 14:12:22.420000+00:00,12,0.757984,0.626506,8.359067,0.52963,0.766756,0.760791,ce114e8eb8ae4cc1af26333dbd61d812,teleco_customer_churn3_12_gaussian_NB_classifier,2023-10-23 14:12:20.750000+00:00,FINISHED,-,12,teleco_customer_churn3,"['tenure', 'InternetService', 'OnlineSecurity'...","[{""run_id"": ""ce114e8eb8ae4cc1af26333dbd61d812""...",ce114e8eb8ae4cc1af26333dbd61d812,2.0.1
11,s3://models/12/5fc1a212366d46d2a803313c3a35d2a...,2023-10-23 14:11:45.705000+00:00,12,0.806955,0.62117,6.667597,0.646377,0.597855,0.740047,5fc1a212366d46d2a803313c3a35d2a3,teleco_customer_churn3_12_logistic_regression,2023-10-23 14:11:43.796000+00:00,FINISHED,-,12,teleco_customer_churn3,"['tenure', 'InternetService', 'OnlineSecurity'...","[{""run_id"": ""5fc1a212366d46d2a803313c3a35d2a3""...",5fc1a212366d46d2a803313c3a35d2a3,2.0.1
12,s3://models/12/29860a406c9d46eb950f52379a4f50b...,2023-10-23 14:10:17.599000+00:00,12,0.806955,0.62117,6.667597,0.646377,0.597855,0.740047,29860a406c9d46eb950f52379a4f50bd,teleco_customer_churn3_12_logistic_regression,2023-10-23 14:10:14.999000+00:00,FINISHED,-,12,teleco_customer_churn3,"['tenure', 'InternetService', 'OnlineSecurity'...","[{""run_id"": ""29860a406c9d46eb950f52379a4f50bd""...",29860a406c9d46eb950f52379a4f50bd,2.0.1
8,s3://models/12/31f02bc179d240be8159e1d15fe37c0...,2023-10-23 14:11:54.953000+00:00,12,0.808375,0.610951,6.618564,0.660436,0.568365,0.731576,31f02bc179d240be8159e1d15fe37c04,teleco_customer_churn3_12_ada_boost_classifier,2023-10-23 14:11:52.964000+00:00,FINISHED,-,12,teleco_customer_churn3,"['tenure', 'InternetService', 'OnlineSecurity'...","[{""run_id"": ""31f02bc179d240be8159e1d15fe37c04""...",31f02bc179d240be8159e1d15fe37c04,2.0.1
7,s3://models/12/b6d9c05894734fe59c524a9ed90358d...,2023-10-23 14:11:57.209000+00:00,12,0.811923,0.602699,6.49599,0.683673,0.538874,0.724553,b6d9c05894734fe59c524a9ed90358dd,teleco_customer_churn3_12_lgbm_classifier,2023-10-23 14:11:55.258000+00:00,FINISHED,-,12,teleco_customer_churn3,"['tenure', 'InternetService', 'OnlineSecurity'...","[{""run_id"": ""b6d9c05894734fe59c524a9ed90358dd""...",b6d9c05894734fe59c524a9ed90358dd,2.0.1
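The ranking above lists `logistic_regression` twice with identical metrics, because that model was run twice in this experiment. A minimal pandas sketch, using a hand-built frame with four of the rows shown above, keeps only the most recent run per `run_name` before ranking and narrows the view to a few columns. The variable names `runs`, `latest`, and `ranked` are illustrative.

```python
import pandas as pd

# A few rows copied from the runs table above (only the columns used here).
runs = pd.DataFrame(
    {
        "run_name": [
            "teleco_customer_churn3_12_gaussian_NB_classifier",
            "teleco_customer_churn3_12_logistic_regression",
            "teleco_customer_churn3_12_logistic_regression",
            "teleco_customer_churn3_12_ada_boost_classifier",
        ],
        "start_time": pd.to_datetime(
            [
                "2023-10-23 14:12:20.750",
                "2023-10-23 14:11:43.796",
                "2023-10-23 14:10:14.999",
                "2023-10-23 14:11:52.964",
            ],
            utc=True,
        ),
        "metrics.roc_auc_score": [0.760791, 0.740047, 0.740047, 0.731576],
        "metrics.accuracy_score": [0.757984, 0.806955, 0.806955, 0.808375],
    }
)

# Keep the latest run of each model, then rank by ROC AUC.
latest = runs.sort_values("start_time").drop_duplicates("run_name", keep="last")
ranked = latest.sort_values("metrics.roc_auc_score", ascending=False)
print(ranked[["run_name", "metrics.roc_auc_score", "metrics.accuracy_score"]])
```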


## Selecting Best Model

In [38]:
artifacts = top_runs.iloc[0]["artifact_uri"]
run_id = top_runs.iloc[0]["run_id"]
model_name = top_runs.iloc[0]["run_name"] 


print('Best model_artifacts :', artifacts)
print("=" * 100)
print('Best model run_id :', run_id)
print("=" * 100)
print('Best model :', model_name)
print("=" * 100)
print("Best model experiment id :", exp_id)

Best model_artifacts : s3://models/12/ce114e8eb8ae4cc1af26333dbd61d812/artifacts
Best model run_id : ce114e8eb8ae4cc1af26333dbd61d812
Best model : teleco_customer_churn3_12_gaussian_NB_classifier
Best model experiment id : 12


## Registering Best Model

In [41]:
result = clf.register_model(
    run_id=run_id,
    model_name=model_name
)

2023/10/23 14:20:58 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation.                     Model name: teleco_customer_churn3_12_gaussian_NB_classifier, version 1


In [42]:
print('Registered model information :')
print('=='*50)
result

Registered model information :


name: "teleco_customer_churn3_12_gaussian_NB_classifier"
version: "1"
creation_timestamp: 1698070858517
last_updated_timestamp: 1698070858517
user_id: ""
current_stage: "None"
description: ""
source: "s3://models/12/ce114e8eb8ae4cc1af26333dbd61d812/artifacts/teleco_customer_churn3_12_gaussian_NB_classifier"
run_id: "ce114e8eb8ae4cc1af26333dbd61d812"
status: READY
run_link: ""

In [43]:
clf.change_stage(
    model_name=model_name,
    ver_list = [1],
    stage='Production'
)

## Fetching the Model

In [44]:
location = f"{artifacts}/{model_name}"

In [47]:
model = load_model(location)

## Predict

In [48]:
y_pred = model.predict(X_test)

In [49]:
# Copy the test features so the original X_test is left unchanged
df = X_test.copy()

# Add the model's predictions as a new column
df["y_pred"] = y_pred

In [50]:
df

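|  | tenure | InternetService | OnlineSecurity | OnlineBackup | TechSupport | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | y_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 185 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 24.80 | 2044 | 1 |
| 2715 | 41 | 2 | 1 | 1 | 1 | 0 | 1 | 0 | 25.25 | 6522 | 0 |
| 3825 | 52 | 2 | 1 | 1 | 1 | 2 | 0 | 3 | 19.35 | 67 | 0 |
| 1807 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 2 | 76.35 | 5822 | 1 |
| 132 | 67 | 0 | 0 | 0 | 2 | 2 | 0 | 0 | 50.55 | 2837 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6366 | 64 | 0 | 0 | 2 | 2 | 2 | 1 | 3 | 68.30 | 3716 | 0 |
| 315 | 51 | 1 | 2 | 2 | 2 | 1 | 0 | 1 | 110.05 | 4697 | 0 |
| 2439 | 17 | 2 | 1 | 1 | 1 | 1 | 0 | 0 | 19.90 | 2856 | 0 |
| 5002 | 69 | 0 | 2 | 0 | 0 | 2 | 1 | 1 | 43.95 | 2556 | 0 |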
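Because `y_test` is still available from the split, the predictions can be compared with the true labels. A minimal sketch with scikit-learn, shown on made-up labels so that it runs on its own; in the notebook the same calls would take `y_test` and `y_pred`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels standing in for y_test and y_pred.
y_true_demo = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred_demo = [0, 1, 0, 1, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)
print(classification_report(y_true_demo, y_pred_demo, digits=3))
```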
