# Customer Churn Prediction 
Machine Learning Project - Decision Tree Classification

Authors:
* José Marcos Leal Barbosa Filho
* Lucas Ismael Campos Medeiros

Institution: Universidade Federal do Rio Grande do Norte - Brazil.

## Dataset Description

This is a **customer churn** modeling dataset containing 10,000 rows (each representing a unique customer) and 14 columns: 13 general features and one target feature (**Exited**). The data comprises both numerical and categorical features:

**Numeric Features:**

    RowNumber: The sequence number of the rows.
    CustomerId: A unique ID of the customer.
    CreditScore: The credit score of the customer.
    Age: The age of the customer.
    Tenure: The number of months the client has been with the firm.
    Balance: Balance remaining in the customer account.
    NumOfProducts: The number of products purchased by the customer.
    EstimatedSalary: The estimated salary of the customer.

**Categorical Features:**

    Surname: The surname of the customer.
    Geography: The country of the customer.
    Gender: The gender of the customer (Male/Female).
    HasCrCard: Whether the customer has a credit card or not.
    IsActiveMember: Whether the customer is active or not.

**The target column:** 

    Exited: Whether the customer churned or not.

The dataset can be seen and downloaded [here](https://www.kaggle.com/datasets/aakash50897/churn-modellingcsv?resource=download).

## Load Libraries

In [1]:
%%capture
!pip install wandb
#!pip install wandb==0.10.17
!pip install pytest pytest-sugar
!pip install pandas-profiling==3.1.0

In [2]:
import wandb
import logging
import tempfile
import os
import joblib
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.metrics import fbeta_score, precision_score, recall_score, accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

import tensorflow as tf
from tensorflow import keras
from wandb.keras import WandbCallback
from tensorflow.keras.callbacks import EarlyStopping
import h5py
import time
import datetime
import pytz
import IPython

-----------------------------------

## Login to wandb

In [3]:
!wandb login --relogin

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## 1 - Extract, Transform and Load (ETL)

### 1.1 Fetch Data

In [37]:
# columns used
columns = ['RowNumber', 'CustomerId', 'Surname', 'CreditScore',
           'Geography', 'Gender', 'Age', 'Tenure', 
           'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember',
           'EstimatedSalary', 'Exited']
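# importing the dataset
# note: with header=None the CSV's own header line is read as data row 0
# (visible in the preview below); that row is dropped in the next cell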
churndf = pd.read_csv("https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/customer-churn-detection/Churn_Modelling.csv",
                      header=None,
                      names=columns)
churndf.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
| 1 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0 | 1 | 1 | 1 | 101348.88 | 1 |
| 2 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 3 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 4 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0 | 2 | 0 | 0 | 93826.63 | 0 |
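
Row 0 of the preview repeats the column names because `header=None` tells pandas that the file has no header line, so the CSV's own header is read as data. The sketch below contrasts that with `header=0`, which consumes the first line as the header and replaces it with `names`. It uses a small in-memory CSV as a stand-in for the real file: the two sample rows come from the preview, and the reduced column list is an assumption made for brevity.

```python
import io
import pandas as pd

# Tiny stand-in for Churn_Modelling.csv (assumption: header line comes first).
csv_text = (
    "RowNumber,CustomerId,Surname,CreditScore,Exited\n"
    "1,15634602,Hargrave,619,1\n"
    "2,15647311,Hill,608,0\n"
)
columns = ["RowNumber", "CustomerId", "Surname", "CreditScore", "Exited"]

# header=None: the header line becomes data row 0 and numeric columns are read as text.
with_header_row = pd.read_csv(io.StringIO(csv_text), header=None, names=columns)

# header=0: the first line is treated as the header and replaced by `names`.
clean = pd.read_csv(io.StringIO(csv_text), header=0, names=columns)

print(with_header_row.shape, clean.shape)
print(clean.dtypes["CreditScore"])
```

With the `header=0` variant, the later `churndf.drop([0,])` step would no longer be needed, because no spurious header row is loaded.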


* The following columns were removed:

  * **RowNumber:** Only holds the sequence number of each row;
  * **CustomerId:** High-cardinality column with 10,000 unique IDs;
  * **Surname:** High-cardinality column holding each customer's last name.
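
One way to spot such high-cardinality columns is to compare the number of distinct values with the number of rows. The following is a minimal sketch on a toy frame: the values are illustrative, and the 0.9 threshold is an assumption, not something used in this notebook.

```python
import pandas as pd

# Toy frame standing in for the churn data (illustrative values only).
df = pd.DataFrame({
    "CustomerId": [15634602, 15647311, 15619304, 15701354],
    "Surname": ["Hargrave", "Hill", "Onio", "Boni"],
    "Geography": ["France", "Spain", "France", "France"],
})

# Fraction of distinct values per column: 1.0 means every row is unique.
cardinality_ratio = df.nunique() / len(df)

# Hypothetical cut-off: flag columns where more than 90% of the values are distinct.
high_cardinality = cardinality_ratio[cardinality_ratio > 0.9].index.tolist()
print(high_cardinality)
```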

In [38]:
# removing unnecessary columns and resetting the index
# row 0 holds the CSV header line that was read as data, so it is dropped first
churndf = churndf.drop([0,])
churndf.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)
churndf.reset_index(drop=True,inplace=True)
churndf.head()

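| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |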


In [39]:
# saving to csv
churndf.to_csv("raw_data.csv", index=False)

In [40]:
# Saving artifact to wandb
!wandb artifact put \
       --name churn_prediction_project_nn/raw_data.csv \
       --type raw_data \
       --description "Customer Churn NN" raw_data.csv

[34m[1mwandb[0m: Uploading file raw_data.csv to: "eec1509/churn_prediction_project_nn/raw_data.csv:latest" (raw_data)
[34m[1mwandb[0m: Currently logged in as: [33mmacleal[0m ([33meec1509[0m). Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Tracking run with wandb version 0.12.21
[34m[1mwandb[0m: Run data is saved locally in [35m[1m/content/wandb/run-20220714_121421-9syst0ri[0m
[34m[1mwandb[0m: Run [1m`wandb offline`[0m to turn off syncing.
[34m[1mwandb[0m: Syncing run [33mcelestial-sun-8[0m
[34m[1mwandb[0m: ⭐️ View project at [34m[4mhttps://wandb.ai/eec1509/churn_prediction_project_nn[0m
[34m[1mwandb[0m: 🚀 View run at [34m[4mhttps://wandb.ai/eec1509/churn_prediction_project_nn/runs/9syst0ri[0m
Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("eec1509/churn_prediction_project_nn/raw_data.csv:v0")

[34m[1mwandb[0m: Waiting for W&B process to finish... [32m(success).[0m
[34m[1mwa

----------------

### 1.2 Exploratory Data Analysis (EDA)

In [41]:
# save_code=True tracks all changes to the notebook and syncs them with W&B
run = wandb.init(project="churn_prediction_project_nn", save_code=True)

In [42]:
# download the latest version of artifact raw_data.csv
artifact = run.use_artifact("churn_prediction_project_nn/raw_data.csv:latest")

# create a dataframe from the artifact
df = pd.read_csv(artifact.file())

#### 1.2.1 Pandas Profiling

In [43]:
ProfileReport(df, title= "Pandas Profiling Report", explorative=True)

Output hidden; open in https://colab.research.google.com to view.

In [44]:
run.finish()


--------------------------

### 1.3 Preprocessing

In [45]:
input_artifact="churn_prediction_project_nn/raw_data.csv:latest"
artifact_name="preprocessed_data.csv"
artifact_type="clean_data"
artifact_description="Data after preprocessing"

In [46]:
# create a new job_type
run = wandb.init(project="churn_prediction_project_nn", job_type="process_data")

In [47]:
# download the latest version of artifact raw_data.csv
artifact=run.use_artifact(input_artifact)

# create a dataframe from the artifact
df = pd.read_csv(artifact.file())

In [48]:
# delete duplicated rows
df.drop_duplicates(inplace=True)

# generate a "clean data file"
df.to_csv(artifact_name, index=False)

In [49]:
# Create a new artifact and configure with the necessary arguments
artifact = wandb.Artifact(name=artifact_name,
                         type=artifact_type,
                         description=artifact_description)
artifact.add_file(artifact_name)

<ManifestEntry digest: 2VNPzyBON65Yp9cxPORlnA==>

In [50]:
run.log_artifact(artifact)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f83f87086d0>

In [51]:
run.finish()


----------------------------

## 2 - Data Check

In [52]:
%%file test_data.py
import pytest
import wandb
import pandas as pd

# This is global so all tests are collected under the same run
run = wandb.init(project="churn_prediction_project_nn", job_type="data_checks")

@pytest.fixture(scope="session")
def data():

    local_path = run.use_artifact("churn_prediction_project_nn/preprocessed_data.csv:latest").file()
    df = pd.read_csv(local_path)

    return df

def test_data_length(data):
    """
    We test that we have enough data to continue
    """
    assert len(data) > 1000


def test_number_of_columns(data):
    """
    We test that the dataset has the expected number of columns
    """
    assert data.shape[1] == 11

def test_column_presence_and_type(data):

    required_columns = {
        #"CustomerId": pd.api.types.is_int64_dtype,
        #"Surname": pd.api.types.is_object_dtype,
        "CreditScore": pd.api.types.is_int64_dtype,
        "Geography": pd.api.types.is_object_dtype,
        "Gender": pd.api.types.is_object_dtype,
        "Age": pd.api.types.is_int64_dtype,
        "Tenure": pd.api.types.is_int64_dtype,
        "Balance": pd.api.types.is_float_dtype,
        "NumOfProducts": pd.api.types.is_int64_dtype,
        "HasCrCard": pd.api.types.is_int64_dtype,
        "IsActiveMember": pd.api.types.is_int64_dtype,
        "EstimatedSalary": pd.api.types.is_float_dtype,  
        "Exited": pd.api.types.is_int64_dtype
    }

    # Check column presence
    assert set(data.columns.values).issuperset(set(required_columns.keys()))

    for col_name, format_verification_funct in required_columns.items():

        assert format_verification_funct(data[col_name]), f"Column {col_name} failed test {format_verification_funct}"


def test_class_names(data):

    # Check that only the known classes are present
    known_classes = [
        0,
        1
    ]

    assert data["Exited"].isin(known_classes).all()


def test_column_ranges(data):

    ranges = {
        "CreditScore": (0, 1000),
        "Age": (0,100),
        "Tenure": (0,10),
        "Balance": (0, 1.484705e+06),
        "NumOfProducts": (1,4),
        "HasCrCard": (0,1),
        "IsActiveMember": (0,1),
        "EstimatedSalary": (0, 1.484705e+06),
        "Exited": (0, 1)
    }

    for col_name, (minimum, maximum) in ranges.items():

        assert data[col_name].dropna().between(minimum, maximum).all(), (
            f"Column {col_name} failed the test. Should be between {minimum} and {maximum}, "
            f"instead min={data[col_name].min()} and max={data[col_name].max()}"
        )

Overwriting test_data.py
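
The range check in `test_column_ranges` relies on `Series.between`, which is inclusive on both ends by default. The sketch below isolates that logic so it can be tried outside pytest. `check_ranges` is a hypothetical helper, not part of `test_data.py`, and the sample values are made up.

```python
import pandas as pd

def check_ranges(data, ranges):
    """Return the names of columns holding values outside [minimum, maximum]."""
    failed = []
    for col_name, (minimum, maximum) in ranges.items():
        # dropna() mirrors the test: missing values are not range violations.
        if not data[col_name].dropna().between(minimum, maximum).all():
            failed.append(col_name)
    return failed

# Made-up sample with one out-of-range age.
data = pd.DataFrame({"Age": [42, 41, 130], "Tenure": [2, 1, 10]})
failed_columns = check_ranges(data, {"Age": (0, 100), "Tenure": (0, 10)})
print(failed_columns)
```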


In [53]:
!pytest . -vv

Test session starts (platform: linux, Python 3.7.13, pytest 3.6.4, pytest-sugar 0.9.5)
cachedir: .pytest_cache
rootdir: /content, inifile:
plugins: typeguard-2.7.1, sugar-0.9.5

 test_data.py::test_data_length ✓                                 20%
 test_data.py::test_number_of_columns ✓                           40%
 test_data.py::test_column_presence_and_type ✓                    60%
 test_data.py::test_class_names ✓                                 80%
 test_data.py::test_column_ranges ✓

In [54]:
run.finish()

----------------------------------------

## 3 - Data Segregation

In [55]:
# global variables

# ratio - 70% train / 30% test
test_size = 0.30

# seed used for reproducibility
seed = 42

# reference (column) to stratify the data
stratify = "Exited"

# name of the input artifact
artifact_input_name = "churn_prediction_project_nn/preprocessed_data.csv:latest"

# type of the artifact
artifact_type = "segregated_data"

In [56]:
# configure logging 
logging.basicConfig(level=logging.INFO,
                   format="%(asctime)s %(message)s",
                   datefmt='%d-%m-%Y %H:%M:%S')

# reference for a logging object
logger = logging.getLogger()

# init wandb project
run = wandb.init(project="churn_prediction_project_nn", job_type="split_data")

logger.info("Downloading and reading artifact")
artifact=run.use_artifact(artifact_input_name)
artifact_path=artifact.file()
df = pd.read_csv(artifact_path)

# Split in train/test
logger.info("Splitting data into train and test")
splits = {}

splits["train"], splits["test"] = train_test_split(df,
                                                  test_size=test_size,
                                                  random_state=seed,
                                                  stratify=df[stratify])



14-07-2022 12:15:49 Downloading and reading artifact
14-07-2022 12:15:50 Splitting data into train and test
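
Passing the target column to `stratify` keeps the churn rate the same in both splits. The sketch below shows this on synthetic data. The 20% positive rate and the `feature` column are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with a 20% positive class (illustrative data only).
df = pd.DataFrame({
    "feature": np.arange(1000),
    "Exited": [1] * 200 + [0] * 800,
})

train, test = train_test_split(df,
                               test_size=0.30,
                               random_state=42,
                               stratify=df["Exited"])

# Both splits keep the churn rate of the full frame.
print(len(train), len(test))
print(train["Exited"].mean(), test["Exited"].mean())
```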


In [57]:
# Save artifacts
with tempfile.TemporaryDirectory() as tmp_dir:
    
    for split, df in splits.items():
        
        # Make the artifact name from the name of the split plus the provided root
        artifact_name = f"{split}.csv"
        
        # Get the path on disk within the temp directory
        temp_path = os.path.join(tmp_dir,artifact_name)
        
        logger.info(f"Uploading the {split} dataset to {artifact_name}")
        
        # Save then upload to W&B
        df.to_csv(temp_path,index=False)
        
        artifact = wandb.Artifact(name=artifact_name,
                                 type=artifact_type,
                                 description=f"{split} split of dataset {artifact_input_name}")
        artifact.add_file(temp_path)
        
        logger.info("Logging artifact")
        run.log_artifact(artifact)
        
        artifact.wait()

14-07-2022 12:15:50 Uploading the train dataset to train.csv
14-07-2022 12:15:50 Logging artifact
14-07-2022 12:15:53 Uploading the test dataset to test.csv
14-07-2022 12:15:53 Logging artifact


In [58]:
run.finish()


------------------------

## 4 - Training

### 4.1 Holdout Configuration

In [59]:
# global variables

# ratio used to split train and validation data
val_size = 0.30

# seed used for reproducibility
seed = 42

# reference (column) to stratify the data
stratify = "Exited"

# name of the input artifact
artifact_input_name = "churn_prediction_project_nn/train.csv:latest"

#entity
entity_name = "eec1509"

# project name
project_name = "churn_prediction_project_nn"

# type of the artifact
artifact_type = "Train"

In [60]:
# configure logging
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(message)s",
                    datefmt='%d-%m-%Y %H:%M:%S')

# reference for a logging obj
logger = logging.getLogger()

# initiate the wandb project
run = wandb.init(project="churn_prediction_project_nn", job_type="train")

logger.info("Downloading and reading train artifact")
local_path = run.use_artifact(artifact_input_name).file()
df_train = pd.read_csv(local_path)

# Splitting train.csv into train and validation datasets
logger.info("Splitting data into train/val")
# hold out part of the training data as a validation set
x_train, x_val, y_train, y_val = train_test_split(df_train.drop(labels=stratify,axis=1),
                                                  df_train[stratify],
                                                  test_size=val_size,
                                                  random_state=seed,
                                                  shuffle=True,
                                                  stratify=df_train[stratify])

14-07-2022 12:16:27 Downloading and reading train artifact
14-07-2022 12:16:28 Splitting data into train/val


In [61]:
logger.info("x train: {}".format(x_train.shape))
logger.info("y train: {}".format(y_train.shape))
logger.info("x val: {}".format(x_val.shape))
logger.info("y val: {}".format(y_val.shape))

14-07-2022 12:16:28 x train: (4900, 10)
14-07-2022 12:16:28 y train: (4900,)
14-07-2022 12:16:28 x val: (2100, 10)
14-07-2022 12:16:28 y val: (2100,)


### 4.2 Data Preparation


#### 4.2.1 Outlier Removal

In [62]:
logger.info("Outlier Removal")
# temporary variable
x = x_train.select_dtypes("float64").copy()

# identify outlier in the dataset
lof = LocalOutlierFactor()
outlier = lof.fit_predict(x)
mask = outlier != -1

14-07-2022 12:16:38 Outlier Removal


In [63]:
logger.info("x_train shape [original]: {}".format(x_train.shape))
logger.info("x_train shape [outlier removal]: {}".format(x_train.loc[mask,:].shape))

14-07-2022 12:16:38 x_train shape [original]: (4900, 10)
14-07-2022 12:16:38 x_train shape [outlier removal]: (4882, 10)


In [64]:
logger.info("y_train shape [original]: {}".format(y_train.shape))
logger.info("y_train shape [outlier removal]: {}".format(y_train.loc[mask].shape))

14-07-2022 12:16:38 y_train shape [original]: (4900,)
14-07-2022 12:16:38 y_train shape [outlier removal]: (4882,)


In [65]:
# Outliers are removed only from the training set, and only here, after the split:
# doing it during the preprocessing stage would cause data leakage.
# Note that the validation set is left untouched.
x_train = x_train.loc[mask,:].copy()
y_train = y_train[mask].copy()
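
The masking step above can be tried in isolation. The sketch below is a minimal illustration on synthetic data (200 Gaussian points plus one deliberately distant point, all assumptions): `fit_predict` labels inliers as 1 and outliers as -1, and the boolean mask keeps only the inliers.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# 200 inliers around the origin plus one far-away point (synthetic data).
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
x = np.vstack([inliers, [[50.0, 50.0]]])

# fit_predict returns 1 for inliers and -1 for outliers.
labels = LocalOutlierFactor().fit_predict(x)
mask = labels != -1

print(x.shape, x[mask].shape)
```

The same mask has to be applied to both the features and the target so that the rows stay aligned, which is what the cell above does with `x_train` and `y_train`.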

#### 4.2.2 Target Variable Encoding

In this case the target variable is already encoded as 0/1, so the encoder below is only used to map those numeric codes back to readable class names.

In [66]:
logger.info("Encoding a Target Variable")

# define a categorical encoding for target variable
le = LabelEncoder()
le.fit(["Continued", "Exited"])
teste = le.inverse_transform(y_train)

14-07-2022 12:16:38 Encoding a Target Variable
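
`LabelEncoder` stores its classes in sorted order, so the position of each name determines which integer it corresponds to. The sketch below shows the resulting mapping on a short, made-up target vector.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["Continued", "Exited"])

# classes_ is sorted, so code 0 maps to "Continued" and code 1 to "Exited".
print(le.classes_)

# Made-up target values standing in for y_train.
y = np.array([0, 1, 1, 0])
labels = le.inverse_transform(y)
print(labels)
```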


### 4.3 Creating a Data Transform Pipeline

#### 4.3.1 Transformers

In [67]:
class FeatureSelector(BaseEstimator, TransformerMixin):
    # Class Constructor
    def __init__(self, feature_names):
        self.feature_names = feature_names
        
    # Nothing to learn here, so just return self
    def fit(self, X, y=None):
        return self
    
    # Select the configured columns from X
    def transform(self, X, y=None):
        return X[self.feature_names]
    
# Handling categorical features
class CategoricalTransformer(BaseEstimator, TransformerMixin):
    # Class constructor method that takes one boolean as argument
    def __init__(self, new_features=True, colnames=None):
        self.new_features = new_features
        self.colnames = colnames
        
    # Return self nothing else to do here
    def fit(self, X, y=None):
        return self
    
    def get_feature_names_out(self):
        # colnames may be a plain list or a pandas Index
        return list(self.colnames)
    
    # Transformer method we wrote for this transformer
    def transform(self, X, y=None):
        df = pd.DataFrame(X, columns=self.colnames)
        
        # Remove white space in categorical features
        df = df.apply(lambda col: col.str.strip())
        
        # customize feature?
        # How can I identify what needs to be modified? EDA!!!!!!
        if self.new_features:
            
            # replace ? with unknown
            edit_cols = ['Geography', 'Gender']
            for col in edit_cols:
                df.loc[df[col].str.contains(r"\?"), col] = 'unknown'
        
        # update column names (df is returned as cleaned above, not rebuilt from X)
        self.colnames = df.columns
        
        return df

# transform numerical features
class NumericalTransformer(BaseEstimator, TransformerMixin):
    # Class constructor method that takes a model parameter as its argument
    # model 0: minmax
    # model 1: standard
    # model 2: without scaler
    def __init__(self, model=0, colnames=None):
        self.model = model
        self.colnames = colnames
        self.scaler = None

    # fit only learns the statistics needed by the chosen scaler
    def fit(self, X, y=None):
        df = pd.DataFrame(X, columns=self.colnames)
        
        # minmax
        if self.model == 0:
            self.scaler = MinMaxScaler()
            self.scaler.fit(df)
        # standard scaler
        elif self.model == 1:
            self.scaler = StandardScaler()
            self.scaler.fit(df)
        return self

    # return columns names after transformation
    def get_feature_names_out(self):
        return self.colnames

    # Transformer method we wrote for this transformer
    # Use fitted scalers
    def transform(self, X, y=None):
        df = pd.DataFrame(X, columns=self.colnames)
        
        # update columns name
        self.colnames = df.columns.tolist()

        # minmax or standard: apply the scaler learned in fit()
        if self.model in (0, 1):
            df = self.scaler.transform(df)
        else:
            # no scaling: return the raw values
            df = df.values

        return df

#### 4.3.2 Holdout Pipeline

In [68]:
# model = 0 (min-max), 1 (z-score), 2 (without normalization)
numerical_model = 1

# Categorical features to pass down the categorical pipeline
categorical_features = x_train.select_dtypes("object").columns.to_list()

# Numerical features to pass down the numerical pipeline
numerical_features = x_train.select_dtypes(["int64","float"]).columns.to_list()

# Defining the steps for the categorical pipeline
categorical_pipeline = Pipeline(steps=[('cat_selector', FeatureSelector(categorical_features)),
                                       ('imputer_cat', SimpleImputer(strategy="most_frequent")),
                                       ('cat_transformer', CategoricalTransformer(colnames=categorical_features)),
                                       ('cat_encoder', OneHotEncoder(sparse=False, drop="first"))])

# Defining the steps in the numerical pipeline
numerical_pipeline = Pipeline(steps=[('num_selector', FeatureSelector(numerical_features)),
                                     ('imputer_num', SimpleImputer(strategy="median")),
                                     ('num_transformer', NumericalTransformer(numerical_model, 
                                                                              colnames=numerical_features))])

# Combine the numerical and categorical pipelines horizontally into one full pipeline
pipe = FeatureUnion(transformer_list=[('cat_pipeline', categorical_pipeline),
                                      ('num_pipeline', numerical_pipeline)])
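
For comparison, roughly the same preprocessing can be expressed with scikit-learn's built-in `ColumnTransformer` instead of the custom selector classes. This is an alternative sketch, not the approach used in this notebook. The toy frames are illustrative, and `sparse_threshold=0` is set so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy train/validation frames (illustrative values only).
x_train = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "Age": [42, 41, 39, 43],
    "Balance": [0.0, 83807.86, 159660.8, 125510.82],
})
x_val = pd.DataFrame({
    "Geography": ["Spain"], "Gender": ["Female"],
    "Age": [40], "Balance": [1000.0],
})

categorical = ["Geography", "Gender"]
numerical = ["Age", "Balance"]

preprocess = ColumnTransformer(
    transformers=[
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(drop="first"))]), categorical),
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numerical),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Statistics are learned on the training data only, then reused for validation.
x_train_t = preprocess.fit_transform(x_train)
x_val_t = preprocess.transform(x_val)
print(x_train_t.shape, x_val_t.shape)
```

Three countries and two genders with `drop="first"` give three one-hot columns, which together with the two scaled numeric columns yields five features in both splits.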

#### 4.3.3 Transforming

In [69]:
x_train_orig = x_train
x_val_orig = x_val

In [70]:
# Transforming
logger.info("Transforming")
x_train = pipe.fit_transform(x_train_orig, y_train)
x_val = pipe.transform(x_val_orig)

14-07-2022 12:17:18 Transforming


### 4.4 Base Model

In [71]:
class MyCustomCallback(tf.keras.callbacks.Callback):

  def on_train_begin(self, logs=None):
    self.begins = time.time()
    print('Training: begins at {}'.format(datetime.datetime.now(pytz.timezone('America/Fortaleza')).strftime("%a, %d %b %Y %H:%M:%S")))

  def on_train_end(self, logs=None):
    print('Training: ends at {}'.format(datetime.datetime.now(pytz.timezone('America/Fortaleza')).strftime("%a, %d %b %Y %H:%M:%S")))
    print('Duration: {:.2f} seconds'.format(time.time() - self.begins)) 

In [72]:
N, D = x_train.shape

In [73]:
# Instantiate a simple classification model
model = tf.keras.Sequential([
  tf.keras.layers.Dense(13, kernel_initializer = 'uniform', activation='relu', input_dim = D),
  tf.keras.layers.Dense(12, kernel_initializer = 'uniform', activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

# Instantiate a logistic loss function that expects integer targets.
loss = tf.keras.losses.BinaryCrossentropy()

# Instantiate an accuracy metric.
accuracy = tf.keras.metrics.BinaryAccuracy()

# Instantiate an optimizer.
#optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)

# configure the optimizer, loss, and metrics to monitor.
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=[accuracy])

# training 
history = model.fit(x=x_train,
                    y=y_train,
                    batch_size=64,
                    epochs=5,
                    validation_data=(x_val,y_val),
                    callbacks=[MyCustomCallback()],
                    verbose=1)

Training: begins at Thu, 14 Jul 2022 09:18:01
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Training: ends at Thu, 14 Jul 2022 09:18:05
Duration: 4.02 seconds


In [74]:
loss, acc = model.evaluate(x=x_train,y=y_train, batch_size=32)
print('Train loss: %.4f - acc: %.4f' % (loss, acc))

# x_val/y_val is the validation split, not the held-out test set
loss_, acc_ = model.evaluate(x=x_val,y=y_val, batch_size=32)
print('Validation loss: %.4f - acc: %.4f' % (loss_, acc_))

Train loss: 0.4152 - acc: 0.8327
Validation loss: 0.4109 - acc: 0.8314


In [75]:
predict = model.predict(x_val)

In [76]:
predict

array([[0.05999011],
       [0.07585558],
       [0.33386686],
       ...,
       [0.10198307],
       [0.21416071],
       [0.07554609]], dtype=float32)
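
Because the last layer uses a sigmoid, each prediction is a probability of the positive class (churn). Turning these into class labels requires a decision threshold. The sketch below uses 0.5, which is an assumed cut-off, on made-up probabilities and labels rather than the notebook's data.

```python
import numpy as np

# Illustrative probabilities and labels (not taken from the notebook's data).
probs = np.array([[0.06], [0.08], [0.33], [0.72], [0.91]])
y_true = np.array([0, 0, 1, 1, 1])

# Flatten the (n, 1) output and apply the assumed 0.5 threshold.
y_pred = (probs.ravel() >= 0.5).astype(int)

# Fraction of predictions that match the true labels.
accuracy = (y_pred == y_true).mean()
print(y_pred, accuracy)
```

Lowering the threshold trades precision for recall, which can matter when churners are the minority class.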

--------------------------

### 4.5 Hyperparameter Tuning

#### 4.5.1 Monitoring a neural network

In [77]:
# Default values for hyperparameters
defaults = dict(layer_1 = 13,
                layer_2 = 12,
                learn_rate = 0.02218,
                batch_size = 64,
                epoch = 10)

wandb.init(project=project_name, config= defaults, name="run_01")
config = wandb.config


In [78]:
# Instantiate a simple classification model
model = tf.keras.Sequential([
  tf.keras.layers.Dense(config.layer_1, activation=tf.nn.relu),
  tf.keras.layers.Dense(config.layer_2, activation=tf.nn.relu),
  tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

# Instantiate a logistic loss function that expects integer targets.
loss = tf.keras.losses.BinaryCrossentropy()

# Instantiate an accuracy metric.
accuracy = tf.keras.metrics.BinaryAccuracy()

# Instantiate an optimizer.
optimizer = tf.keras.optimizers.SGD(learning_rate=config.learn_rate)

# configure the optimizer, loss, and metrics to monitor.
model.compile(optimizer=optimizer, loss=loss, metrics=[accuracy])

In [79]:
%%wandb
# Add WandbCallback() to the fit function
model.fit(x=x_train,
          y=y_train,
          batch_size=config.batch_size,
          epochs=config.epoch,
          validation_data=(x_val,y_val),
          callbacks=[WandbCallback(log_weights=True)],
          verbose=0)



<keras.callbacks.History at 0x7f8402674dd0>

#### 4.5.2 Sweeps

In [80]:
# The sweep calls this function with each set of hyperparameters
def train():
    # Default values for hyper-parameters we're going to sweep over
    defaults = dict(layer_1 = 13,
                layer_2 = 12,
                learn_rate = 0.02,
                batch_size = 64,
                epoch = 600)
    
    # Initialize a new wandb run
    wandb.init(project=project_name, config= defaults)

    # Config is a variable that holds and saves hyperparameters and inputs
    config = wandb.config
    
    # Instantiate a simple classification model
    model = tf.keras.Sequential([
                                 tf.keras.layers.Dense(config.layer_1, activation=tf.nn.relu, dtype='float64'),
                                 tf.keras.layers.Dense(config.layer_2, activation=tf.nn.relu, dtype='float64'),
                                 tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
                                 ])

    # Instantiate a logistic loss function that expects integer targets.
    loss = tf.keras.losses.BinaryCrossentropy()

    # Instantiate an accuracy metric.
    accuracy = tf.keras.metrics.BinaryAccuracy()

    # Instantiate an optimizer.
    optimizer = tf.keras.optimizers.SGD(learning_rate=config.learn_rate)

    # configure the optimizer, loss, and metrics to monitor.
    model.compile(optimizer=optimizer, loss=loss, metrics=[accuracy])  

    model.fit(x_train, y_train, batch_size=config.batch_size,
              epochs=config.epoch,
              validation_data=(x_val, y_val),
              callbacks=[WandbCallback(),
                          EarlyStopping(patience=100)]
              )   
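
`EarlyStopping(patience=100)` monitors `val_loss` by default and halts training once that value has gone `patience` consecutive epochs without improving. The sketch below is a simplified pure-Python model of that rule, not Keras's implementation. `early_stopping_epoch` is a hypothetical helper and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a patience-based stopping rule.

    Epochs are 0-indexed. stop_epoch is None if training runs to the end.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None

# Made-up validation losses: the best value is reached at epoch 2.
losses = [0.50, 0.40, 0.30, 0.35, 0.36, 0.37, 0.38]
result = early_stopping_epoch(losses, patience=3)
print(result)
```

Under this rule a run stops `patience` epochs after its best epoch, which is consistent with the two sweep summaries further down (best_epoch 20 with final epoch 120, and best_epoch 178 with final epoch 278).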

In [81]:
# Configure the sweep: specify the parameters to search through, the search strategy, the optimization metric, and so on.
sweep_config = {
    'method': 'random', #grid, random
    'metric': {
      'name': 'binary_accuracy',
      'goal': 'maximize'   
    },
    'parameters': {
        'layer_1': {
            'max': 32,
            'min': 8,
            'distribution': 'int_uniform',
        },
        'layer_2': {
            'max': 32,
            'min': 8,
            'distribution': 'int_uniform',
        },
        'learn_rate': {
            'min': -4,
            'max': -2,
            'distribution': 'log_uniform',  
        },
        'epoch': {
            'values': [300,400,600]
        },
        'batch_size': {
            'values': [32,64]
        }
    }
}
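
The negative `min` and `max` for `learn_rate` are easier to read once converted to actual learning rates. Assuming that `log_uniform` treats them as natural-log exponents (which is consistent with the learning rates 0.1216 and 0.0218 drawn in the runs below), the sampled value is `exp(u)` with `u` uniform in `[-4, -2]`. The sketch below reproduces that sampling with the standard library only. `sample_log_uniform` is a hypothetical helper, not a W&B function.

```python
import math
import random

random.seed(0)

def sample_log_uniform(min_exp, max_exp):
    """Draw exp(u) with u uniform in [min_exp, max_exp]."""
    return math.exp(random.uniform(min_exp, max_exp))

# Bounds of the learning rate implied by the exponents -4 and -2.
low, high = math.exp(-4), math.exp(-2)
samples = [sample_log_uniform(-4, -2) for _ in range(1000)]

print(f"range: [{low:.4f}, {high:.4f}]")
print(f"observed: min={min(samples):.4f}, max={max(samples):.4f}")
```

Under that assumption the sweep explores learning rates between roughly 0.018 and 0.135, with equal probability mass per unit of log scale.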

In [82]:
# Initialize a new sweep
# Arguments:
#     – sweep_config: the sweep config dictionary defined above
#     – entity: Set the username for the sweep
#     – project: Set the project name for the sweep
sweep_id = wandb.sweep(sweep_config, entity=entity_name, project=project_name)



Create sweep with ID: 9mnf75ns
Sweep URL: https://wandb.ai/eec1509/churn_prediction_project_nn/sweeps/9mnf75ns


In [83]:
# Run the sweep agent
# Arguments:
#     – sweep_id: the sweep_id to run - this was returned above by wandb.sweep()
#     – function: function that defines your model architecture and trains it
wandb.agent(sweep_id = sweep_id, function=train,count=2)

[34m[1mwandb[0m: Agent Starting Run: q3dvclh5 with config:
[34m[1mwandb[0m: 	batch_size: 32
[34m[1mwandb[0m: 	epoch: 300
[34m[1mwandb[0m: 	layer_1: 14
[34m[1mwandb[0m: 	layer_2: 31
[34m[1mwandb[0m: 	learn_rate: 0.12156142714535824




Epoch 1/300


14-07-2022 12:24:55 Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0014s vs `on_train_batch_end` time: 0.0022s). Check your callbacks.


Epoch 2/300
Epoch 3/300
Epoch 4/300
Epoch 5/300
Epoch 6/300
Epoch 7/300
Epoch 8/300
Epoch 9/300
Epoch 10/300
Epoch 11/300
Epoch 12/300
Epoch 13/300
Epoch 14/300
Epoch 15/300
Epoch 16/300
Epoch 17/300
Epoch 18/300
Epoch 19/300
Epoch 20/300
Epoch 21/300
Epoch 22/300
Epoch 23/300
Epoch 24/300
Epoch 25/300
Epoch 26/300
Epoch 27/300
Epoch 28/300
Epoch 29/300
Epoch 30/300
Epoch 31/300
Epoch 32/300
Epoch 33/300
Epoch 34/300
Epoch 35/300
Epoch 36/300
Epoch 37/300
Epoch 38/300
Epoch 39/300
Epoch 40/300
Epoch 41/300
Epoch 42/300
Epoch 43/300
Epoch 44/300
Epoch 45/300
Epoch 46/300
Epoch 47/300
Epoch 48/300
Epoch 49/300
Epoch 50/300
Epoch 51/300
Epoch 52/300
Epoch 53/300
Epoch 54/300
Epoch 55/300
Epoch 56/300
Epoch 57/300
Epoch 58/300
Epoch 59/300
Epoch 60/300
Epoch 61/300
Epoch 62/300
Epoch 63/300
Epoch 64/300
Epoch 65/300
Epoch 66/300
Epoch 67/300
Epoch 68/300
Epoch 69/300
Epoch 70/300
Epoch 71/300
Epoch 72/300
Epoch 73/300
Epoch 74/300
Epoch 75/300
Epoch 76/300
Epoch 77/300
Epoch 78/300
Epoch 7


0,1
binary_accuracy,▁▅▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇█▇▇▇▇████▇████████████
epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
loss,█▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
val_binary_accuracy,▁▆▇▇█▇▇▆▇███▇██▇▇▇█▇▇▇▇▇▇▇▇█▇▆▆▇▆▇▆▇██▅▇
val_loss,█▃▂▁▁▁▂▂▂▁▁▁▁▁▁▁▁▂▁▂▂▂▂▂▂▂▂▂▂▃▂▂▃▂▃▃▃▃▅▃

0,1
best_epoch,20.0
best_val_loss,0.34055
binary_accuracy,0.8687
epoch,120.0
loss,0.30595
val_binary_accuracy,0.85381
val_loss,0.37119


wandb: Sweep Agent: Waiting for job.
wandb: Job received.
wandb: Agent Starting Run: 1zcuptg0 with config:
wandb: 	batch_size: 64
wandb: 	epoch: 300
wandb: 	layer_1: 30
wandb: 	layer_2: 9
wandb: 	learn_rate: 0.02177180316952747




Epoch 1/300
...

0,1
binary_accuracy,▁▃▅▆▆▆▆▆▆▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇█▇▇██████████
epoch,▁▁▁▂▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
loss,█▆▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁
val_binary_accuracy,▁▄▇▇██▇█████▇███▇█████████████████▇▇▇██▇
val_loss,█▅▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▁

0,1
best_epoch,178.0
best_val_loss,0.34109
binary_accuracy,0.87423
epoch,278.0
loss,0.30481
val_binary_accuracy,0.85095
val_loss,0.35291


In [84]:
run.finish()

### 4.6 Export the best model

In [119]:
run = wandb.init(project=project_name,job_type="best_model")



#### 4.6.1 Import the best wandb sweep

In [86]:
# restore the raw model file (Insert the wandb best sweep path)
best_model_path = "eec1509/churn_prediction_project_nn/q3dvclh5"
best_model = wandb.restore('model-best.h5', run_path=best_model_path)

# restore the model for tf.keras
model = tf.keras.models.load_model(best_model.name)
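
The run ID used in `best_model_path` matches the sweep run with the lowest `best_val_loss` in the summaries printed above. A minimal sketch of that selection follows, assuming the summary values are copied by hand into a dictionary; the name `sweep_summaries` is hypothetical and not part of the original notebook.

```python
# Summary values copied by hand from the two sweep runs printed above.
# `sweep_summaries` is a hypothetical helper structure for illustration only.
sweep_summaries = {
    "q3dvclh5": {"best_val_loss": 0.34055, "best_epoch": 20},
    "1zcuptg0": {"best_val_loss": 0.34109, "best_epoch": 178},
}

# Pick the run whose best validation loss is the smallest
best_run_id = min(sweep_summaries, key=lambda run_id: sweep_summaries[run_id]["best_val_loss"])

# Build the run path in the "entity/project/run_id" form used by wandb.restore
best_model_path = f"eec1509/churn_prediction_project_nn/{best_run_id}"
print(best_model_path)
```

With only two runs the choice is easy to make by eye, but the same comparison scales to larger sweeps.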

In [87]:
# evaluate loss and accuracy on the validation split (x_val, y_val), not the held-out test set
loss_, acc_ = model.evaluate(x=x_val,y=y_val, batch_size=64)
print('Test loss: %.3f - acc: %.3f' % (loss_, acc_))

Test loss: 0.341 - acc: 0.861


In [88]:
# source: https://github.com/wandb/awesome-dl-projects/blob/master/ml-tutorial/EMNIST_Dense_Classification.ipynb
import seaborn as sns
from sklearn.metrics import confusion_matrix

# threshold the predicted probabilities at 0.5 to get class labels (0 or 1)
predictions = (model.predict(x_val) >= 0.5).astype(int).ravel()
cm = confusion_matrix(y_true=y_val, y_pred=predictions)

# annotate the cells with integer counts and save the figure for logging to wandb
plt.figure(figsize=(6, 6))
sns.heatmap(cm, annot=True, fmt="d")
plt.savefig('confusion_matrix.png', bbox_inches='tight')
plt.show()

In [89]:
wandb.log({"image_confusion_matrix": [wandb.Image('confusion_matrix.png')]})

#### 4.6.2 Export Encoders and Best Model

In [92]:
# types and names of the artifacts
artifact_type = "inference_artifact"
artifact_transform = "data_transform"
artifact_encoder = "target_encoder"
artifact_model = "model_export"

In [93]:
logger.info("Dumping the artifacts to disk")

14-07-2022 12:40:40 Dumping the artifacts to disk


In [120]:
# Export Model artifact
artifact = wandb.Artifact(artifact_model,
                          type=artifact_type,
                          description="Neural Network Model for Classification Purpose"
                          )

logger.info("Logging model artifact")
model.save("path")
artifact.add_dir("path")
run.log_artifact(artifact)

14-07-2022 12:59:56 Logging model artifact


INFO:tensorflow:Assets written to: path/assets


14-07-2022 12:59:56 Assets written to: path/assets
wandb: Adding directory to artifact (./path)... Done. 0.1s


<wandb.sdk.wandb_artifacts.Artifact at 0x7f840485afd0>

In [95]:
# Export the pipe data transform using joblib
joblib.dump(pipe, artifact_transform)

# Pipe Data Transform
artifact = wandb.Artifact(artifact_transform,
                          type=artifact_type,
                          description="Pipeline for Data Transform"
                          )

logger.info("Logging Pipeline for Data Transform")
artifact.add_file(artifact_transform)
run.log_artifact(artifact)

14-07-2022 12:41:11 Logging Pipeline for Data Transform


<wandb.sdk.wandb_artifacts.Artifact at 0x7f84056dec90>

In [96]:
# Export the target encoder using joblib
joblib.dump(le, artifact_encoder)

# Target encoder artifact
artifact = wandb.Artifact(artifact_encoder,
                          type=artifact_type,
                          description="The encoder used to encode the target variable"
                          )

logger.info("Logging target encoder artifact")
artifact.add_file(artifact_encoder)
run.log_artifact(artifact)

14-07-2022 12:41:42 Logging target encoder artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f84056ef190>

In [121]:
run.finish()


## 5 - Testing

In [122]:
# initiate the wandb project
run = wandb.init(project="churn_prediction_project_nn",job_type="test")



In [115]:
# global variables
# path of the artifact related to test dataset
artifact_test_path = "churn_prediction_project_nn/test.csv:latest"
# path of the pipeline for data transform
artifact_transform_path = "churn_prediction_project_nn/data_transform:latest"
# path of the model artifact
artifact_model_path = "model_export:latest"
# path of the target encoder artifact
artifact_encoder_path = "churn_prediction_project_nn/target_encoder:latest"

In [101]:
# configure logging
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(message)s",
                    datefmt='%d-%m-%Y %H:%M:%S')

# reference for a logging obj
logger = logging.getLogger()

### 5.1 Download and Transform Test Data

In [134]:
logger.info("Downloading and reading test artifact")
test_data_path = run.use_artifact(artifact_test_path).file()
df_test = pd.read_csv(test_data_path)

# Extract the target from the features
logger.info("Extracting target from dataframe")
x_test_orig = df_test.copy()
y_test = x_test_orig.pop("Exited")

14-07-2022 13:22:09 Downloading and reading test artifact
14-07-2022 13:22:10 Extracting target from dataframe


In [105]:
x_test_orig.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,790,Spain,Male,37,6,0.0,2,1,1,119484.01
1,521,France,Male,35,6,96423.84,1,1,0,10488.44
2,712,France,Female,37,1,106881.5,2,0,0,169386.81
3,729,Spain,Female,38,10,0.0,2,1,0,189727.12
4,695,Germany,Male,52,8,103023.26,1,1,1,22485.64


In [109]:
# Class for the data transform pipeline
class FeatureSelector(BaseEstimator, TransformerMixin):
    # Class Constructor
    def __init__(self, feature_names):
        self.feature_names = feature_names
        
    # Nothing to learn here, so just return self
    def fit(self, X, y=None):
        return self
    
    # Select only the configured columns from the input DataFrame
    def transform(self, X, y=None):
        return X[self.feature_names]
    
# Handling categorical features
class CategoricalTransformer(BaseEstimator, TransformerMixin):
    # Class constructor: new_features toggles the cleaning step,
    # colnames lists the input column names
    def __init__(self, new_features=True, colnames=None):
        self.new_features = new_features
        self.colnames = colnames
        
    # Nothing to learn here, so just return self
    def fit(self, X, y=None):
        return self
    
    def get_feature_names_out(self):
        return self.colnames.tolist()
    
    # Transformer method we wrote for this transformer
    def transform(self, X, y=None):
        df = pd.DataFrame(X, columns=self.colnames)
        
        # Remove white space in categorical features (apply works column by column)
        df = df.apply(lambda column: column.str.strip())
        
        # customize feature?
        # How can I identify what needs to be modified? EDA!!!!!!
        if self.new_features:
            
            # replace ? with unknown
            edit_cols = ['Geography', 'Gender']
            for col in edit_cols:
                # raw string so that \? reaches the regex engine as an escaped question mark
                df.loc[df[col].str.contains(r"\?"), col] = 'unknown'
        
        # update column names
        self.colnames = df.columns
        
        # return the cleaned frame (rebuilding it from X here would discard the edits above)
        return df

# transform numerical features
class NumericalTransformer(BaseEstimator, TransformerMixin):
    # Class constructor method that takes a model parameter as its argument
    # model 0: minmax
    # model 1: standard
    # model 2: without scaler
    def __init__(self, model=0, colnames=None):
        self.model = model
        self.colnames = colnames
        self.scaler = None

    # Fit is used only to learn the statistics needed by the scalers
    def fit(self, X, y=None):
        df = pd.DataFrame(X, columns=self.colnames)
        
        # minmax
        if self.model == 0:
            self.scaler = MinMaxScaler()
            self.scaler.fit(df)
        # standard scaler
        elif self.model == 1:
            self.scaler = StandardScaler()
            self.scaler.fit(df)
        return self

    # return columns names after transformation
    def get_feature_names_out(self):
        return self.colnames

    # Transformer method we wrote for this transformer
    # Use fitted scalers
    def transform(self, X, y=None):
        df = pd.DataFrame(X, columns=self.colnames)
        
        # update columns name
        self.colnames = df.columns.tolist()

        # minmax
        if self.model == 0:
            # transform data
            df = self.scaler.transform(df)
        elif self.model == 1:
            # transform data
            df = self.scaler.transform(df)
        else:
            df = df.values

        return df
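
The cleaning step inside `CategoricalTransformer.transform` can be tried in isolation. The sketch below assumes a small hand-made DataFrame (the values are illustrative, not taken from the dataset) and applies the same two operations: stripping white space and replacing `?` entries with `unknown`.

```python
import pandas as pd

# Illustrative values only; the real data comes from the test.csv artifact
df_demo = pd.DataFrame({
    "Geography": [" Spain", "?", "France "],
    "Gender": ["Male", "Female", "?"],
})

# Strip leading and trailing white space, one column at a time
df_demo = df_demo.apply(lambda column: column.str.strip())

# Replace any value containing a literal question mark with 'unknown'
for col in ["Geography", "Gender"]:
    df_demo.loc[df_demo[col].str.contains(r"\?"), col] = "unknown"

print(df_demo)
```

The raw string `r"\?"` keeps the backslash intact for the regular expression, which avoids the invalid escape sequence warning that a plain `"\?"` can trigger in recent Python versions.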

In [106]:
# Download the pipeline for data transform
logger.info("Downloading the data transform pipeline")
data_transform_path = run.use_artifact(artifact_transform_path).file()
pipe = joblib.load(data_transform_path)

# Transform test data
x_test = pipe.transform(x_test_orig)

14-07-2022 12:46:54 Downloading the data transform pipeline


In [107]:
x_test.shape

(3000, 11)

In [108]:
x_test

array([[ 0.        ,  1.        ,  1.        , ...,  0.64362459,
         0.95944516,  0.34273475],
       [ 0.        ,  0.        ,  1.        , ...,  0.64362459,
        -1.04226906, -1.54920337],
       [ 0.        ,  0.        ,  0.        , ..., -1.55370075,
        -1.04226906,  1.20894435],
       ...,
       [ 1.        ,  0.        ,  1.        , ..., -1.55370075,
        -1.04226906,  0.46514135],
       [ 1.        ,  0.        ,  0.        , ...,  0.64362459,
        -1.04226906, -1.40281106],
       [ 0.        ,  0.        ,  1.        , ...,  0.64362459,
        -1.04226906,  0.99520047]])

In [111]:
y_test

0       0
1       0
2       0
3       0
4       0
       ..
2995    0
2996    0
2997    0
2998    1
2999    0
Name: Exited, Length: 3000, dtype: int64

### 5.2 Encoding Target Variable

In [135]:
# Download the target variable encoder
logger.info("Extracting the encoding of the target variable")
encoder_path = run.use_artifact(artifact_encoder_path).file()
le = joblib.load(encoder_path)
# Decode the integer targets into their string labels
# (despite its name, y_test_encoded holds the decoded labels)
y_test_encoded = le.inverse_transform(y_test)

14-07-2022 13:22:35 Extracting the encoding of the target variable


In [133]:
y_test_encoded

array(['Contiuned', 'Contiuned', 'Contiuned', ..., 'Contiuned', 'Exited',
       'Contiuned'], dtype='<U9')
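
The round trip through `joblib` and the direction of `inverse_transform` can be checked on a toy encoder. This is a minimal sketch, assuming the encoder was fitted on the two string labels that appear in the output above (spelling kept as printed); how the original encoder was fitted is not shown in this section, and the name `le_demo` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.preprocessing import LabelEncoder

# Assumption: two classes with the spelling seen in the notebook output
le_demo = LabelEncoder().fit(["Contiuned", "Exited"])

# Dump the encoder to a temporary file and load it back, as done with the artifact
with tempfile.TemporaryDirectory() as tmp_dir:
    encoder_file = os.path.join(tmp_dir, "target_encoder")
    joblib.dump(le_demo, encoder_file)
    le_loaded = joblib.load(encoder_file)

# inverse_transform maps integers to labels; transform maps labels to integers
decoded = le_loaded.inverse_transform([0, 0, 1])
encoded = le_loaded.transform(["Exited", "Contiuned"])
print(decoded, encoded)
```

`LabelEncoder` sorts its classes, so under this assumption `0` maps to `Contiuned` and `1` to `Exited`, which agrees with the decoded array shown above.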

### 5.3 Download the Best Model and Test

In [123]:
# Download inference artifact
logger.info("Downloading and loading the exported model")
# use the latest version of the model
model_at = run.use_artifact(artifact_model_path)
# download the directory in which the model is saved
model_dir = model_at.download()
print("model: ", model_dir)
# load through tf.keras, as in section 4.6.1
model = tf.keras.models.load_model(model_dir)

14-07-2022 13:00:56 Downloading and loading the exported model


model:  ./artifacts/model_export:v1


In [125]:
# predict
logger.info("Inferring")
predict = model.predict(x_test)

14-07-2022 13:01:25 Inferring


In [138]:
predict

array([[0.0202989 ],
       [0.06348157],
       [0.14023054],
       ...,
       [0.33095452],
       [0.11221516],
       [0.01008829]], dtype=float32)

In [142]:
predict_rounded = np.rint(predict)

In [140]:
predict_rounded

array([[0.],
       [0.],
       [0.],
       ...,
       [0.],
       [0.],
       [0.]], dtype=float32)
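
`np.rint` turns the sigmoid outputs into class labels by rounding to the nearest integer. One detail worth knowing is that it rounds halves to the nearest even value, so a probability of exactly 0.5 becomes 0, whereas the `>= 0.5` threshold used in section 4.6.1 would give 1. The sketch below uses the first three probabilities from the output above plus two illustrative values (0.5 and 0.73) that are assumptions added for the comparison.

```python
import numpy as np

# First three values copied from `predict`; 0.5 and 0.73 are illustrative additions
probs = np.array([[0.0202989], [0.06348157], [0.14023054], [0.5], [0.73]],
                 dtype=np.float32)

# Rounding to the nearest integer (ties go to the even value, so 0.5 rounds to 0.0)
rounded = np.rint(probs)

# Explicit threshold at 0.5, flattened to a 1-D integer vector
thresholded = (probs >= 0.5).astype(int).ravel()

print(rounded.ravel())
print(thresholded)
```

The two approaches differ only for predictions of exactly 0.5, which are rare in practice, but the explicit threshold also makes it easy to try a different cut-off.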

In [136]:
# execute the loss and accuracy using the test dataset
loss_, acc_ = model.evaluate(x=x_test,y=y_test, batch_size=64)
print('Test loss: %.3f - acc: %.3f' % (loss_, acc_))

Test loss: 0.327 - acc: 0.867


In [143]:
# Evaluation Metrics
logger.info("Test Evaluation metrics")
fbeta = fbeta_score(y_test, predict_rounded, beta=1, zero_division=1)
precision = precision_score(y_test, predict_rounded, zero_division=1)
recall = recall_score(y_test, predict_rounded, zero_division=1)
acc = accuracy_score(y_test, predict_rounded)

logger.info("Test Accuracy: {}".format(acc))
logger.info("Test Precision: {}".format(precision))
logger.info("Test Recall: {}".format(recall))
logger.info("Test F1: {}".format(fbeta))

run.summary["Acc"] = acc
run.summary["Precision"] = precision
run.summary["Recall"] = recall
run.summary["F1"] = fbeta

14-07-2022 13:27:02 Test Evaluation metrics
14-07-2022 13:27:02 Test Accuracy: 0.8666666666666667
14-07-2022 13:27:02 Test Precision: 0.790633608815427
14-07-2022 13:27:02 Test Recall: 0.469721767594108
14-07-2022 13:27:02 Test F1: 0.5893223819301847


In [145]:
# Compare the accuracy, precision, recall with previous ones
print(classification_report(y_test,predict_rounded))

              precision    recall  f1-score   support

           0       0.88      0.97      0.92      2389
           1       0.79      0.47      0.59       611

    accuracy                           0.87      3000
   macro avg       0.83      0.72      0.75      3000
weighted avg       0.86      0.87      0.85      3000
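
The metrics above can be reproduced by hand from the four confusion-matrix counts. The counts used below are not printed in the notebook; they are inferred from the reported support (2389 and 611), precision and recall, so treat them as an assumption that happens to match the logged values.

```python
# Counts for the positive class (1 = churned), inferred from the report above
tp = 287   # churners predicted as churners
fp = 76    # non-churners predicted as churners
fn = 324   # churners predicted as non-churners
tn = 2313  # non-churners predicted as non-churners

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Under these counts the model misses 324 of the 611 churners, which is why recall sits below 0.5 even though accuracy is about 0.87.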



In [146]:
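fig_confusion_matrix, ax = plt.subplots(1,1,figsize=(7,4))
# predictions are passed first, so rows are predicted labels and columns are true labels;
# the axis labels are overridden below to match this transposed layout
ConfusionMatrixDisplay(confusion_matrix(predict_rounded,y_test,labels=[1,0]),
                       display_labels=["1","0"]).plot(values_format=".0f",ax=ax)

ax.set_xlabel("True Label")
ax.set_ylabel("Predicted Label")
plt.show()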

In [147]:
run.finish()


0,1
Acc,0.86667
F1,0.58932
Precision,0.79063
Recall,0.46972
