# Customer Churn Prediction 
Machine Learning Project - Decision Tree Classification 

Authors:
* José Marcos Leal Barbosa Filho
* Lucas Ismael Campos Medeiros

Institution: Universidade Federal do Rio Grande do Norte - Brazil.

## Dataset Description

This is a **customer churn** modeling dataset containing 10,000 rows (each representing a unique customer) and 14 columns: 13 general features and one target feature (**Exited**). The data is composed of both numerical and categorical features:

**Numeric Features:**

    RowNumber: The sequence number of the row.
    CustomerId: A unique ID of the customer.
    CreditScore: The credit score of the customer.
    Age: The age of the customer.
    Tenure: The number of months the client has been with the firm.
    Balance: The balance remaining in the customer's account.
    NumOfProducts: The number of products purchased by the customer.
    EstimatedSalary: The estimated salary of the customer.

**Categorical Features:**

    Surname: The surname of the customer.
    Geography: The country of the customer.
    Gender: The gender of the customer (Male/Female).
    HasCrCard: Whether the customer has a credit card or not.
    IsActiveMember: Whether the customer is active or not.

**The target column:** 

    Exited — Whether the customer churned or not.

The dataset can be seen and downloaded [here](https://www.kaggle.com/datasets/aakash50897/churn-modellingcsv?resource=download).

## Load Libraries

In [2]:
%%capture
!pip install wandb
#!pip install wandb==0.10.17
!pip install pytest pytest-sugar
!pip install pandas-profiling==3.1.0

In [3]:
import wandb
import logging
import tempfile
import os
import joblib
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.metrics import fbeta_score, precision_score, recall_score, accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

import tensorflow as tf
from tensorflow import keras
from keras.callbacks import EarlyStopping
import h5py
import time
import datetime
import pytz
import IPython

-----------------------------------

## Login to wandb

In [102]:
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


## 1.Extract, Transform and Load (ETL)

### 1.1.Fetch Data

In [4]:
# columns used
columns = ['RowNumber', 'CustomerId', 'Surname', 'CreditScore',
           'Geography', 'Gender', 'Age', 'Tenure', 
           'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember',
           'EstimatedSalary', 'Exited']
# importing the dataset
churndf = pd.read_csv("https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/customer-churn-detection/Churn_Modelling.csv",
                      header=None,
                      names=columns)
churndf.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
1,1,15634602,Hargrave,619,France,Female,42,2,0,1,1,1,101348.88,1
2,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
3,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
4,4,15701354,Boni,699,France,Female,39,1,0,2,0,0,93826.63,0


* The following columns were removed:

  * **RowNumber:** Only indicates the sequence number of the rows;
  * **CustomerId:** High-cardinality column with 10,000 unique IDs;
  * **Surname:** High-cardinality column containing the last name of each customer.

In [5]:
# removing unnecessary columns and resetting the index
# (row 0 holds the CSV header, which was read as data because header=None)
churndf = churndf.drop([0,])
churndf.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)
churndf.reset_index(drop=True,inplace=True)
churndf.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2,125510.82,1,1,1,79084.1,0
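The first preview shows the CSV header as data row 0, which is why that row is dropped afterwards. A possible alternative, sketched here on a hypothetical two-row excerpt rather than the real file, is to let `pd.read_csv` consume the header itself with `header=0`. The names `sample_csv` and `sample_df` are illustrative only.

```python
import io
import pandas as pd

# Hypothetical two-row excerpt standing in for Churn_Modelling.csv
sample_csv = io.StringIO(
    "RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,"
    "Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited\n"
    "1,15634602,Hargrave,619,France,Female,42,2,0,1,1,1,101348.88,1\n"
    "2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0\n"
)

# header=0 uses the first line as column names, so no data row has to be dropped
sample_df = pd.read_csv(sample_csv, header=0)
sample_df = sample_df.drop(columns=["RowNumber", "CustomerId", "Surname"])
print(sample_df.dtypes)
```

With this variant the numeric columns are parsed as numbers right away instead of as strings.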


In [6]:
# saving to csv
churndf.to_csv("raw_data.csv", index=False)

In [7]:
# Saving artifact to wandb
!wandb artifact put \
       --name churn_prediction_project_nn/raw_data.csv \
       --type raw_data \
       --description "Customer Churn NN" raw_data.csv

wandb: Uploading file raw_data.csv to: "eec1509/churn_prediction_project_nn/raw_data.csv:latest" (raw_data)
wandb: Currently logged in as: eec1509 (use `wandb login --relogin` to force relogin)
wandb: wandb version 0.12.21 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.10.17
wandb: Syncing run stilted-oath-1
wandb: ⭐️ View project at https://wandb.ai/eec1509/churn_prediction_project_nn
wandb: 🚀 View run at https://wandb.ai/eec1509/churn_prediction_project_nn/runs/1u5nsujn
wandb: Run data is saved locally in /content/wandb/run-20220712_153322-1u5nsujn
wandb: Run `wandb offline` to turn off syncing.

Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("eec1509/churn_prediction_project_nn/raw_data.csv:latest")


----------------

### 1.2.Exploratory Data Analysis (EDA)

In [8]:
# save_code tracks all changes to the notebook and syncs them with W&B
run = wandb.init(project="churn_prediction_project_nn", save_code=True)


In [9]:
# download the latest version of artifact raw_data.csv
artifact = run.use_artifact("churn_prediction_project_nn/raw_data.csv:latest")

# create a dataframe from the artifact
df = pd.read_csv(artifact.file())

#### 1.2.1 Pandas Profiling

In [10]:
ProfileReport(df, title= "Pandas Profiling Report", explorative=True)

Output hidden; open in https://colab.research.google.com to view.

In [11]:
run.finish()


--------------------------

### 1.3.Preprocessing

In [12]:
input_artifact="churn_prediction_project_nn/raw_data.csv:latest"
artifact_name="preprocessed_data.csv"
artifact_type="clean_data"
artifact_description="Data after preprocessing"

In [13]:
# create a new job_type
run = wandb.init(project="churn_prediction_project_nn", job_type="process_data")


In [14]:
# download the latest version of artifact raw_data.csv
artifact=run.use_artifact(input_artifact)

# create a dataframe from the artifact
df = pd.read_csv(artifact.file())

In [15]:
# delete duplicated rows
df.drop_duplicates(inplace=True)

# generate a "clean data file"
df.to_csv(artifact_name, index=False)

In [16]:
# Create a new artifact and configure with the necessary arguments
artifact = wandb.Artifact(name=artifact_name,
                         type=artifact_type,
                         description=artifact_description)
artifact.add_file(artifact_name)

<ManifestEntry digest: 2VNPzyBON65Yp9cxPORlnA==>

In [17]:
run.log_artifact(artifact)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f57309a7390>

In [18]:
run.finish()


----------------------------

## 2.Data Check

In [19]:
%%file test_data.py
import pytest
import wandb
import pandas as pd

# This is global so all tests are collected under the same run
run = wandb.init(project="churn_prediction_project_nn", job_type="data_checks")

@pytest.fixture(scope="session")
def data():

    local_path = run.use_artifact("churn_prediction_project_nn/preprocessed_data.csv:latest").file()
    df = pd.read_csv(local_path)

    return df

def test_data_length(data):
    """
    We test that we have enough data to continue
    """
    assert len(data) > 1000


def test_number_of_columns(data):
    """
    We test that the dataset has the expected number of columns
    """
    assert data.shape[1] == 11

def test_column_presence_and_type(data):

    required_columns = {
        #"CustomerId": pd.api.types.is_int64_dtype,
        #"Surname": pd.api.types.is_object_dtype,
        "CreditScore": pd.api.types.is_int64_dtype,
        "Geography": pd.api.types.is_object_dtype,
        "Gender": pd.api.types.is_object_dtype,
        "Age": pd.api.types.is_int64_dtype,
        "Tenure": pd.api.types.is_int64_dtype,
        "Balance": pd.api.types.is_float_dtype,
        "NumOfProducts": pd.api.types.is_int64_dtype,
        "HasCrCard": pd.api.types.is_int64_dtype,
        "IsActiveMember": pd.api.types.is_int64_dtype,
        "EstimatedSalary": pd.api.types.is_float_dtype,  
        "Exited": pd.api.types.is_int64_dtype
    }

    # Check column presence
    assert set(data.columns.values).issuperset(set(required_columns.keys()))

    for col_name, format_verification_funct in required_columns.items():

        assert format_verification_funct(data[col_name]), f"Column {col_name} failed test {format_verification_funct}"


def test_class_names(data):

    # Check that only the known classes are present
    known_classes = [
        0,
        1
    ]

    assert data["Exited"].isin(known_classes).all()


def test_column_ranges(data):

    ranges = {
        "CreditScore": (0, 1000),
        "Age": (0,100),
        "Tenure": (0,10),
        "Balance": (0, 1.484705e+06),
        "NumOfProducts": (1,4),
        "HasCrCard": (0,1),
        "IsActiveMember": (0,1),
        "EstimatedSalary": (0, 1.484705e+06),
        "Exited": (0, 1)
    }

    for col_name, (minimum, maximum) in ranges.items():

        assert data[col_name].dropna().between(minimum, maximum).all(), (
            f"Column {col_name} failed the test. Should be between {minimum} and {maximum}, "
            f"instead min={data[col_name].min()} and max={data[col_name].max()}"
        )

Overwriting test_data.py
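To illustrate what the range test checks, here is a minimal sketch on a hypothetical toy frame (`toy`), where one `Age` value is deliberately out of range. It relies on `Series.between`, which is inclusive on both ends by default.

```python
import pandas as pd

# Hypothetical toy frame; the Age value 130 is deliberately out of range
toy = pd.DataFrame({"Age": [25, 40, 130], "Tenure": [0, 5, 10]})
ranges = {"Age": (0, 100), "Tenure": (0, 10)}

# Collect every column that has at least one value outside its allowed range
failed = [
    col for col, (low, high) in ranges.items()
    if not toy[col].dropna().between(low, high).all()
]
print(failed)
```

Only `Age` is reported, since the boundary values 0 and 10 in `Tenure` count as valid.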


In [20]:
!pytest . -vv

Test session starts (platform: linux, Python 3.7.13, pytest 3.6.4, pytest-sugar 0.9.5)
cachedir: .pytest_cache
rootdir: /content, inifile:
plugins: typeguard-2.7.1, sugar-0.9.5

 test_data.py::test_data_length ✓                                 20%
 test_data.py::test_number_of_columns ✓                           40%
 test_data.py::test_column_presence_and_type ✓                    60%
 test_data.py::test_class_names ✓                                 80%
 test_data.py::test_column_ranges ✓

In [24]:
run.finish()

AttributeError: ignored

----------------------------------------

## 3.Data Segregation

In [50]:
# global variables

# ratio - 70% train / 30% test
test_size = 0.30

# seed used for reproducibility
seed = 42

# reference (column) to stratify the data
stratify = "Exited"

# name of the input artifact
artifact_input_name = "churn_prediction_project_nn/preprocessed_data.csv:latest"

# type of the artifact
artifact_type = "segregated_data"

In [51]:
# configure logging 
logging.basicConfig(level=logging.INFO,
                   format="%(asctime)s %(message)s",
                   datefmt='%d-%m-%Y %H:%M:%S')

# reference for a logging object
logger = logging.getLogger()

# init wandb project
run = wandb.init(project="churn_prediction_project_nn", job_type="split_data")

logger.info("Downloading and reading artifact")
artifact=run.use_artifact(artifact_input_name)
artifact_path=artifact.file()
df = pd.read_csv(artifact_path)





12-07-2022 15:45:05 Downloading and reading artifact


In [52]:
# Split in train/test
train, test = train_test_split(df,test_size=test_size,
                                  random_state=seed,
                                  stratify=df[stratify])
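The effect of `stratify` can be checked on a small example. The sketch below uses a hypothetical toy frame (`toy`) with a 20% positive rate and the same `test_size` and seed as above; it is an illustration, not part of the project pipeline.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows with a 20% positive rate in "Exited"
toy = pd.DataFrame({"feature": range(100), "Exited": [1] * 20 + [0] * 80})

toy_train, toy_test = train_test_split(
    toy, test_size=0.30, random_state=42, stratify=toy["Exited"]
)

# Both splits should keep a churn rate close to the original 20%
print(toy_train["Exited"].mean(), toy_test["Exited"].mean())
```

Without `stratify`, the class ratio of each split would vary from seed to seed, which matters for an imbalanced target such as churn.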

In [53]:
# Data Encoding
# separate data into independent variables and dependent variable
x_train = train.iloc[:,0:-1]
y_train = train.iloc[:,-1:]
x_test = test.iloc[:,0:-1]
y_test = test.iloc[:,-1:]

In [54]:
# Numeric data transform
sc = StandardScaler()
for name in x_train.select_dtypes(exclude = "object").columns.to_list():
    # fit using x_train
    sc.fit(x_train[name].values.reshape(-1,1))

    # transform train and test
    x_train[name] = sc.transform(x_train[name].values.reshape(-1,1))
    x_test[name] = sc.transform(x_test[name].values.reshape(-1,1))

In [55]:
# Categorical data transform 
for name in x_train.select_dtypes("object").columns.to_list():
    onehot = OneHotEncoder(sparse=False,drop="first")
    # fit using x_train
    onehot.fit(x_train[name].values.reshape(-1,1))

    # transform train and test
    x_train[onehot.get_feature_names_out()] = onehot.transform(x_train[name].values.reshape(-1,1))
    x_test[onehot.get_feature_names_out()] = onehot.transform(x_test[name].values.reshape(-1,1))


In [56]:
cols = x_train.select_dtypes("object").columns.to_list()

In [57]:
x_train.drop(labels=cols,axis=1,inplace=True)
x_test.drop(labels=cols,axis=1,inplace=True)

In [58]:
train_encoded = pd.concat([x_train, y_train], axis = 1)
test_encoded = pd.concat([x_test, y_test], axis = 1)
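The scaling and one-hot encoding steps above can also be expressed with a single `ColumnTransformer`. This is a sketch of an alternative, not the approach used in this notebook; the toy frames and the names `preprocess`, `numeric_cols` and `categorical_cols` are assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy splits with the same kinds of columns as the churn data
toy_train = pd.DataFrame({
    "Age": [42, 41, 39, 43],
    "Balance": [0.0, 83807.86, 159660.8, 125510.82],
    "Geography": ["France", "Spain", "Germany", "Spain"],
    "Gender": ["Female", "Male", "Female", "Male"],
})
toy_test = pd.DataFrame({
    "Age": [50], "Balance": [1000.0],
    "Geography": ["France"], "Gender": ["Male"],
})

numeric_cols = ["Age", "Balance"]
categorical_cols = ["Geography", "Gender"]

# sparse_threshold=0 forces a dense array regardless of the encoder's output type
preprocess = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(drop="first"), categorical_cols),
    ],
    sparse_threshold=0,
)

# Fit on the training split only, then reuse the fitted transformer on the test split
train_matrix = preprocess.fit_transform(toy_train)
test_matrix = preprocess.transform(toy_test)
print(train_matrix.shape, test_matrix.shape)
```

As in the cells above, all statistics and categories are learned from the training split, so nothing from the test split leaks into the transformation.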

In [59]:
splits={}
splits["train"], splits["test"] = [train_encoded, test_encoded]

In [60]:
# Save artifacts
with tempfile.TemporaryDirectory() as tmp_dir:
    
    for split, df in splits.items():
        
        # Make the artifact name from the name of the split plus the provided root
        artifact_name = f"{split}.csv"
        
        # Get the path on disk within the temp directory
        temp_path = os.path.join(tmp_dir,artifact_name)
        
        logger.info(f"Uploading the {split} dataset to {artifact_name}")
        
        # Save then upload to W&B
        df.to_csv(temp_path,index=False)
        
        artifact = wandb.Artifact(name=artifact_name,
                                 type=artifact_type,
                                 description=f"{split} split of dataset {artifact_input_name}")
        artifact.add_file(temp_path)
        
        logger.info("Logging artifact")
        run.log_artifact(artifact)
        
        #artifact.wait()

12-07-2022 15:45:06 Uploading the train dataset to train.csv
12-07-2022 15:45:06 Logging artifact
12-07-2022 15:45:06 Uploading the test dataset to test.csv
12-07-2022 15:45:06 Logging artifact


In [61]:
run.finish()

VBox(children=(Label(value=' 1.13MB of 1.62MB uploaded (0.00MB deduped)\r'), FloatProgress(value=0.70013473549…

------------------------

## 4.Training

### 4.1.Holdout Configuration

In [10]:
# global variables

# ratio used to split train and validation data
val_size = 0.30

# seed used for reproducibility
seed = 42

# reference (column) to stratify the data
stratify = "Exited"

# name of the input artifact
artifact_input_name = "churn_prediction_project_nn/train.csv:latest"

#entity
entity_name = "eec1509"

# project name
project_name = "churn_prediction_project_nn"

# type of the artifact
artifact_type = "Train"

In [11]:
# configure logging
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(message)s",
                    datefmt='%d-%m-%Y %H:%M:%S')

# reference for a logging obj
logger = logging.getLogger()

# initiate the wandb project
run = wandb.init(project="churn_prediction_project_nn", job_type="train")

logger.info("Downloading and reading train artifact")
local_path = run.use_artifact(artifact_input_name).file()
df_train = pd.read_csv(local_path)

# Splitting train.csv into train and validation datasets
logger.info("Splitting data into train/val")
# split the training data into train and validation sets
x_train, x_val, y_train, y_val = train_test_split(df_train.drop(labels=stratify,axis=1),
                                                  df_train[stratify],
                                                  test_size=val_size,
                                                  random_state=seed,
                                                  shuffle=True,
                                                  stratify=df_train[stratify])

12-07-2022 17:47:49 Downloading and reading train artifact
12-07-2022 17:47:50 Splitting data into train/val


In [12]:
logger.info("x train: {}".format(x_train.shape))
logger.info("y train: {}".format(y_train.shape))
logger.info("x val: {}".format(x_val.shape))
logger.info("y val: {}".format(y_val.shape))

12-07-2022 17:47:50 x train: (4900, 11)
12-07-2022 17:47:50 y train: (4900,)
12-07-2022 17:47:50 x val: (2100, 11)
12-07-2022 17:47:50 y val: (2100,)
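The logged shapes follow from applying the two 30% holdouts in sequence. The arithmetic can be sketched as below, assuming the full 10,000 rows survived duplicate removal; the variable names are illustrative.

```python
import math

n_rows = 10_000   # dataset size from the description (assumes no duplicates were dropped)
test_size = 0.30
val_size = 0.30

# train_test_split rounds the held-out part up and gives the remainder to training
n_test = math.ceil(test_size * n_rows)
n_train_full = n_rows - n_test
n_val = math.ceil(val_size * n_train_full)
n_train = n_train_full - n_val
print(n_train, n_val, n_test)
```

The resulting training and validation sizes agree with the 4900 and 2100 rows reported in the log.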


### 4.2.Data Preparation


#### 4.2.1.Outlier Removal

In [13]:
logger.info("Outlier Removal")
# temporary variable
x = x_train.select_dtypes("float64").copy()

# identify outlier in the dataset
lof = LocalOutlierFactor()
outlier = lof.fit_predict(x)
mask = outlier != -1

12-07-2022 17:47:51 Outlier Removal


In [14]:
logger.info("x_train shape [original]: {}".format(x_train.shape))
logger.info("x_train shape [outlier removal]: {}".format(x_train.loc[mask,:].shape))

12-07-2022 17:47:52 x_train shape [original]: (4900, 11)
12-07-2022 17:47:52 x_train shape [outlier removal]: (4898, 11)


In [15]:
logger.info("y_train shape [original]: {}".format(y_train.shape))
logger.info("y_train shape [outlier removal]: {}".format(y_train.loc[mask].shape))

12-07-2022 17:47:52 y_train shape [original]: (4900,)
12-07-2022 17:47:52 y_train shape [outlier removal]: (4898,)


In [16]:
# To avoid data leakage, outlier removal is applied here on the training split only,
# not in the preprocessing stage; the validation set is left untouched
x_train = x_train.loc[mask,:].copy()
y_train = y_train[mask].copy()
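How the mask from `LocalOutlierFactor` works can be seen on a small synthetic example. The data below is hypothetical (50 points near the origin plus one distant point) and only meant to show the mechanics.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)

# Hypothetical toy data: 50 points near the origin plus one point far away
toy_x = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)), [[100.0, 100.0]]])
toy_y = np.arange(len(toy_x))

# fit_predict returns 1 for inliers and -1 for outliers
labels = LocalOutlierFactor().fit_predict(toy_x)
mask = labels != -1

# Apply the same boolean mask to features and target so they stay aligned
toy_x_clean, toy_y_clean = toy_x[mask], toy_y[mask]
print(toy_x.shape, toy_x_clean.shape)
```

The distant point is flagged with -1 and removed, and the target array is filtered with the same mask so that rows and labels still match.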

#### 4.2.2.Target Variable Encoding

The target variable is already encoded as 0/1, so the encoder below is created only to map the numeric classes back to readable category labels.

In [17]:
logger.info("Encoding a Target Variable")

# define a categorical encoding for target variable
le = LabelEncoder()
le.fit(["Continued", "Exited"])
teste = le.inverse_transform(y_train)
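As a small illustration of the mapping, the sketch below fits a separate encoder (`toy_encoder`, a name chosen for this example) on the intended labels "Continued" and "Exited". `LabelEncoder` sorts its classes, so 0 maps to the first label alphabetically.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label names; LabelEncoder sorts classes, so "Continued" -> 0 and "Exited" -> 1
toy_encoder = LabelEncoder()
toy_encoder.fit(["Continued", "Exited"])

decoded = toy_encoder.inverse_transform([0, 1, 1, 0])
print(list(decoded))
```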

12-07-2022 17:47:52 Encoding a Target Variable


### 4.3 Base Model

In [111]:
class MyCustomCallback(tf.keras.callbacks.Callback):

  def on_train_begin(self, logs=None):
    self.begins = time.time()
    print('Training: begins at {}'.format(datetime.datetime.now(pytz.timezone('America/Fortaleza')).strftime("%a, %d %b %Y %H:%M:%S")))

  def on_train_end(self, logs=None):
    print('Training: ends at {}'.format(datetime.datetime.now(pytz.timezone('America/Fortaleza')).strftime("%a, %d %b %Y %H:%M:%S")))
    print('Duration: {:.2f} seconds'.format(time.time() - self.begins)) 

In [109]:
N, D = x_train.shape

In [142]:
# Instantiate a simple classification model
model = tf.keras.Sequential([
  tf.keras.layers.Dense(13, kernel_initializer = 'uniform', activation='relu', input_dim = D),
  tf.keras.layers.Dense(12, kernel_initializer = 'uniform', activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

# Instantiate a logistic loss function that expects integer targets.
loss = tf.keras.losses.BinaryCrossentropy()

# Instantiate an accuracy metric.
accuracy = tf.keras.metrics.BinaryAccuracy()

# Instantiate an optimizer.
#optimizer = tf.keras.optimizers.SGD(learning_rate=0.001)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.02)

# configure the optimizer, loss, and metrics to monitor.
model.compile(optimizer=optimizer, loss='binary_crossentropy', metrics=[accuracy])

# training 
history = model.fit(x=x_train,
                    y=y_train,
                    batch_size=64,
                    epochs=5,
                    validation_data=(x_val,y_val),
                    callbacks=[MyCustomCallback()],
                    verbose=1)

Training: begins at Tue, 12 Jul 2022 16:05:21
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5
Training: ends at Tue, 12 Jul 2022 16:05:25
Duration: 3.13 seconds


In [143]:
loss, acc = model.evaluate(x=x_train,y=y_train, batch_size=32)
print('Train loss: %.4f - acc: %.4f' % (loss, acc))

loss_, acc_ = model.evaluate(x=x_val,y=y_val, batch_size=32)
print('Validation loss: %.4f - acc: %.4f' % (loss_, acc_))

Train loss: 0.4015 - acc: 0.8328
Validation loss: 0.3949 - acc: 0.8371


In [144]:
predict = model.predict(x_val)

In [145]:
predict

array([[0.06163195],
       [0.07241371],
       [0.33485508],
       ...,
       [0.15238258],
       [0.3548177 ],
       [0.08758289]], dtype=float32)
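The sigmoid output is a churn probability per customer, so a threshold is needed to obtain class labels. A minimal sketch, assuming a threshold of 0.5 and using only the six values visible in the output above:

```python
import numpy as np

# The six probabilities visible in the output above
probs = np.array([[0.06163195], [0.07241371], [0.33485508],
                  [0.15238258], [0.3548177], [0.08758289]], dtype=np.float32)

# Assumed decision threshold of 0.5: probabilities at or above it count as churn (1)
classes = (probs >= 0.5).astype(int).ravel()
print(classes)
```

All six customers fall below the threshold and would be classified as not churning. A different threshold could be chosen if recall on churners matters more than precision.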

--------------------------

### 4.4.Hyperparameter Tuning

#### 4.4.1 Monitoring a neural network

In [107]:
from tensorflow import keras
from wandb.keras import WandbCallback
from keras.callbacks import EarlyStopping

In [None]:
# Default values for hyperparameters
defaults = dict(layer_1 = 13,
                layer_2 = 12,
                learn_rate = 0.02218,
                batch_size = 64,
                epoch = 10)

wandb.init(project=project_name, config= defaults, name="run_01")
config = wandb.config

In [103]:
# Instantiate a simple classification model
model = tf.keras.Sequential([
  tf.keras.layers.Dense(config.layer_1, activation=tf.nn.relu),
  tf.keras.layers.Dense(config.layer_2, activation=tf.nn.relu),
  tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
])

# Instantiate a logistic loss function that expects integer targets.
loss = tf.keras.losses.BinaryCrossentropy()

# Instantiate an accuracy metric.
accuracy = tf.keras.metrics.BinaryAccuracy()

# Instantiate an optimizer.
optimizer = tf.keras.optimizers.SGD(learning_rate=config.learn_rate)

# configure the optimizer, loss, and metrics to monitor.
model.compile(optimizer=optimizer, loss=loss, metrics=[accuracy])

In [None]:
%%wandb
# Add WandbCallback() to the fit function
model.fit(x=x_train,
          y=y_train,
          batch_size=config.batch_size,
          epochs=config.epoch,
          validation_data=(x_val,y_val),
          callbacks=[WandbCallback(log_weights=True)],
          verbose=0)

#### 4.4.2 Sweeps

In [68]:
# The sweep calls this function with each set of hyperparameters
def train():
    # Default values for hyper-parameters we're going to sweep over
    defaults = dict(layer_1 = 13,
                layer_2 = 12,
                learn_rate = 0.02,
                batch_size = 64,
                epoch = 600)
    
    # Initialize a new wandb run
    wandb.init(project=project_name, config= defaults)

    # Config is a variable that holds and saves hyperparameters and inputs
    config = wandb.config
    
    # Instantiate a simple classification model
    model = tf.keras.Sequential([
                                 tf.keras.layers.Dense(config.layer_1, activation=tf.nn.relu, dtype='float64'),
                                 tf.keras.layers.Dense(config.layer_2, activation=tf.nn.relu, dtype='float64'),
                                 tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)
                                 ])

    # Instantiate a logistic loss function that expects integer targets.
    loss = tf.keras.losses.BinaryCrossentropy()

    # Instantiate an accuracy metric.
    accuracy = tf.keras.metrics.BinaryAccuracy()

    # Instantiate an optimizer.
    optimizer = tf.keras.optimizers.SGD(learning_rate=config.learn_rate)

    # configure the optimizer, loss, and metrics to monitor.
    model.compile(optimizer=optimizer, loss=loss, metrics=[accuracy])  

    model.fit(x_train, y_train, batch_size=config.batch_size,
              epochs=config.epoch,
              validation_data=(x_val, y_val),
              callbacks=[WandbCallback(),
                          EarlyStopping(patience=100)]
              )   

In [69]:
# Configure the sweep: specify the parameters to search through, the search strategy, the optimization metric, etc.
sweep_config = {
    'method': 'random', #grid, random
    'metric': {
      'name': 'binary_accuracy',
      'goal': 'maximize'   
    },
    'parameters': {
        'layer_1': {
            'max': 32,
            'min': 8,
            'distribution': 'int_uniform',
        },
        'layer_2': {
            'max': 32,
            'min': 8,
            'distribution': 'int_uniform',
        },
        'learn_rate': {
            'min': -4,
            'max': -2,
            'distribution': 'log_uniform',  
        },
        'epoch': {
            'values': [300,400,600]
        },
        'batch_size': {
            'values': [32,64]
        }
    }
}
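For the `learn_rate` entry, the sketch below assumes that `log_uniform` treats `min` and `max` as exponents of e, so that the sampled value lies between exp(-4) and exp(-2). This reading is an assumption to be checked against the W&B sweep documentation; the NumPy code only mimics such a draw and does not call wandb.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: 'min' and 'max' are exponents of e, so the draw is uniform in log space
log_min, log_max = -4, -2
samples = np.exp(rng.uniform(log_min, log_max, size=5))

print(np.exp(log_min), np.exp(log_max))  # bounds of the learning rate
print(samples)
```

Under this assumption the learning rate ranges from roughly 0.018 to 0.135, and the values reported by the agent runs (about 0.088 and 0.049) fall inside that interval.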

In [70]:
# Initialize a new sweep
# Arguments:
#     – sweep_config: the sweep config dictionary defined above
#     – entity: Set the username for the sweep
#     – project: Set the project name for the sweep
sweep_id = wandb.sweep(sweep_config, entity=entity_name, project=project_name)



Create sweep with ID: 9flfl8wj
Sweep URL: https://wandb.ai/eec1509/churn_prediction_project_nn/sweeps/9flfl8wj


In [72]:
# Run the sweep agent
# Arguments:
#     – sweep_id: the sweep_id to run - this was returned above by wandb.sweep()
#     – function: function that defines your model architecture and trains it
wandb.agent(sweep_id = sweep_id, function=train,count=20)

wandb: Agent Starting Run: 1u43kk80 with config:
wandb: 	batch_size: 32
wandb: 	epoch: 300
wandb: 	layer_1: 29
wandb: 	layer_2: 18
wandb: 	learn_rate: 0.08794759082337072


Epoch 1/300
Epoch 2/300
...
Epoch 77/300
Epoch 78


0,1
epoch,126.0
loss,0.28073
binary_accuracy,0.88158
val_loss,0.41259
val_binary_accuracy,0.84762
_step,126.0
_runtime,65.0
_timestamp,1657650667.0
best_val_loss,0.33955
best_epoch,26.0


0,1
epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
loss,█▄▄▃▃▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁
binary_accuracy,▁▄▆▆▆▆▆▆▆▆▆▆▆▇▇▆▇▇▇▇▇▇▇▇▇▇██▇███████████
val_loss,▇▂▂▂▁▁▁▅▂▆▁▁▁▂▅█▁▄▂▂▂▂▅▃▂▂▂█▂▄▃▄▃▇▄▃▄▅▆▅
val_binary_accuracy,▃▇▇████▄▇▅███▇▄▁█▅███▇▇▇█▇▇▃▇▆▇▆▇▅▇▇▆▅▅▇
_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
_runtime,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
_timestamp,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███


wandb: Agent Starting Run: wvx0irnc with config:
wandb: 	batch_size: 32
wandb: 	epoch: 600
wandb: 	layer_1: 19
wandb: 	layer_2: 27
wandb: 	learn_rate: 0.0494194796654794


Epoch 1/600
Epoch 2/600
...
Epoch 77/600
Epoch 78


0,1
epoch,128.0
loss,0.30091
binary_accuracy,0.87423
val_loss,0.37461
val_binary_accuracy,0.8519
_step,128.0
_runtime,65.0
_timestamp,1657650737.0
best_val_loss,0.34188
best_epoch,28.0


0,1
epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
loss,█▆▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
binary_accuracy,▁▃▆▆▆▇▆▇▇▇▇▇▇▇▇▇▇▇▇▇██▇▇▇█▇▇█▇██▇███████
val_loss,▆▄▂▁▂▁▁▃▂▂▂▁▁▁▃▂▁▂▂▂▂▂▃▂█▂▂▂▂▂▃▂▃▂▂▂▂▂▂▂
val_binary_accuracy,▄▆██▇█▇▆▆▇████▅████▇██▆█▁██▇█▇█▇▆█████▇▇
_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
_runtime,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███
_timestamp,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▃▄▄▄▄▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇▇███


In [73]:
run.finish()

### 4.5 Export the best model

In [74]:
run = wandb.init(project=project_name,job_type="best_model")


In [75]:
# restore the raw model file "XXXXX" from a specific run "YYYYYY"
best_model = wandb.restore('model-best.h5', run_path="eec1509/churn_prediction_project_nn/1u43kk80")

In [76]:
# restore the model for tf.keras
model = tf.keras.models.load_model(best_model.name)

In [79]:
# compute the loss and accuracy on the validation dataset
loss_, acc_ = model.evaluate(x=x_val,y=y_val, batch_size=64)
print('Validation loss: %.3f - acc: %.3f' % (loss_, acc_))

Validation loss: 0.340 - acc: 0.862


In [80]:
# source: https://github.com/wandb/awesome-dl-projects/blob/master/ml-tutorial/EMNIST_Dense_Classification.ipynb
import seaborn as sns
from sklearn.metrics import confusion_matrix

predictions = np.greater_equal(model.predict(x_val),0.5).astype(int)
cm = confusion_matrix(y_true = y_val, y_pred = predictions)

plt.figure(figsize=(6,6))
# fmt="d" prints the cell counts as integers instead of scientific notation
sns.heatmap(cm, annot=True, fmt="d")
plt.savefig('confusion_matrix.png', bbox_inches='tight')
plt.show()

In [81]:
wandb.log({"image_confusion_matrix": [wandb.Image('confusion_matrix.png')]})

In [82]:
# types and names of the artifacts
artifact_type = "inference_artifact"
artifact_encoder = "target_encoder"
artifact_model = "model_export"

In [83]:
logger.info("Dumping the artifacts to disk")
# Save the model using joblib
joblib.dump(model, artifact_model)

# Save the target encoder using joblib
joblib.dump(le, artifact_encoder)

12-07-2022 18:35:51 Dumping the artifacts to disk


INFO:tensorflow:Assets written to: ram://9b44ed5d-7791-4543-8437-4293607d4a15/assets


12-07-2022 18:35:51 Assets written to: ram://9b44ed5d-7791-4543-8437-4293607d4a15/assets


['target_encoder']

In [84]:
# Model artifact
artifact = wandb.Artifact(artifact_model,
                          type=artifact_type,
                          description="Neural Network Model for Classification Purpose"
                          )

logger.info("Logging model artifact")
artifact.add_file(artifact_model)
run.log_artifact(artifact)

12-07-2022 18:35:53 Logging model artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f038ae67e90>

In [85]:
# Target encoder artifact
artifact = wandb.Artifact(artifact_encoder,
                          type=artifact_type,
                          description="The encoder used to encode the target variable"
                          )

logger.info("Logging target encoder artifact")
artifact.add_file(artifact_encoder)
run.log_artifact(artifact)

12-07-2022 18:35:56 Logging target encoder artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f0384aee0d0>

In [86]:
run.finish()

0,1
_step,0
_runtime,166
_timestamp,1657650946


0,1
_step,▁
_runtime,▁
_timestamp,▁


## 5. Testing

In [87]:
# initiate the wandb project
run = wandb.init(project="churn_prediction_project_nn",job_type="test")

[34m[1mwandb[0m: wandb version 0.12.21 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


In [88]:
# global variables

# name of the artifact related to test dataset
artifact_test_name = "churn_prediction_project_nn/test.csv:latest"

# name of the model artifact
artifact_model_name = "churn_prediction_project_nn/model_export:latest"

# name of the target encoder artifact
artifact_encoder_name = "churn_prediction_project_nn/target_encoder:latest"
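
The three names above share the same `project/name:alias` pattern. As a small sketch, they could be built from one project constant so that a project rename touches a single line; the helper `artifact_ref` and the constant `PROJECT` are hypothetical names introduced here for illustration.

```python
def artifact_ref(project, name, alias="latest"):
    """Build a 'project/name:alias' artifact reference string."""
    return f"{project}/{name}:{alias}"


# Project name as used in the wandb.init call above.
PROJECT = "churn_prediction_project_nn"

artifact_test_name = artifact_ref(PROJECT, "test.csv")
artifact_model_name = artifact_ref(PROJECT, "model_export")
artifact_encoder_name = artifact_ref(PROJECT, "target_encoder")

print(artifact_test_name)
print(artifact_model_name)
print(artifact_encoder_name)
```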

In [89]:
# configure logging
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(message)s",
                    datefmt='%d-%m-%Y %H:%M:%S')

# reference to the root logger object
logger = logging.getLogger()

In [90]:
logger.info("Downloading and reading test artifact")
test_data_path = run.use_artifact(artifact_test_name).file()
df_test = pd.read_csv(test_data_path)

# Extract the target from the features
logger.info("Extracting target from dataframe")
x_test = df_test.copy()
y_test = x_test.pop("Exited")

12-07-2022 18:36:19 Downloading and reading test artifact
12-07-2022 18:36:20 Extracting target from dataframe


In [91]:
x_test.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,x0_Germany,x0_Spain,x0_Male
0,1.444462,-0.188991,0.342162,-1.223574,0.799493,0.643094,0.966559,0.348453,0.0,1.0,1.0
1,-1.342163,-0.378131,0.342162,0.32242,-0.912483,0.643094,-1.034598,-1.548861,0.0,0.0,1.0
2,0.636444,-0.188991,-1.386929,0.490091,0.799493,-1.554982,-1.034598,1.217124,0.0,0.0,0.0
3,0.81255,-0.094421,1.725435,-1.223574,0.799493,0.643094,-1.034598,1.571193,0.0,1.0,0.0
4,0.460338,1.229557,1.033799,0.428231,-0.912483,0.643094,0.966559,-1.340022,1.0,0.0,1.0


In [92]:
y_test.head()

0    0
1    0
2    0
3    0
4    0
Name: Exited, dtype: int64

In [93]:
# Extract the encoding of the target variable
logger.info("Extracting the encoding of the target variable")
encoder_export_path = run.use_artifact(artifact_encoder_name).file()
le = joblib.load(encoder_export_path)

12-07-2022 18:36:26 Extracting the encoding of the target variable


In [94]:
y_test

0       0
1       0
2       0
3       0
4       0
       ..
2995    0
2996    0
2997    0
2998    1
2999    0
Name: Exited, Length: 3000, dtype: int64

In [95]:
# Download inference artifact
logger.info("Downloading and loading the exported model")
model_export_path = run.use_artifact(artifact_model_name).file()
model = joblib.load(model_export_path)

12-07-2022 18:36:34 Downloading and loading the exported model


In [96]:
# predict: the network returns one sigmoid probability per row
logger.info("Inferring")
predict = model.predict(x_test)

# convert probabilities to class labels (same 0.5 threshold as the
# validation confusion matrix); the sklearn metrics reject raw probabilities
y_pred = np.greater_equal(predict, 0.5).astype(int).ravel()

# Evaluation Metrics
logger.info("Test Evaluation metrics")
fbeta = fbeta_score(y_test, y_pred, beta=1, zero_division=1)
precision = precision_score(y_test, y_pred, zero_division=1)
recall = recall_score(y_test, y_pred, zero_division=1)
acc = accuracy_score(y_test, y_pred)

logger.info("Test Accuracy: {}".format(acc))
logger.info("Test Precision: {}".format(precision))
logger.info("Test Recall: {}".format(recall))
logger.info("Test F1: {}".format(fbeta))

run.summary["Acc"] = acc
run.summary["Precision"] = precision
run.summary["Recall"] = recall
run.summary["F1"] = fbeta

12-07-2022 18:36:39 Inferring
12-07-2022 18:36:39 Test Evaluation metrics

In [97]:
predict

array([[0.01399809],
       [0.14686862],
       [0.02276677],
       ...,
       [0.51512253],
       [0.28225142],
       [0.00341931]], dtype=float32)
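
The array above holds one sigmoid probability per customer, whereas scikit-learn's classification metrics expect discrete class labels. A minimal pure-Python sketch of the conversion and of the resulting metrics follows. The 0.5 threshold mirrors the one used for the validation confusion matrix in section 4.5; the helper names `to_labels` and `binary_metrics` and the toy values are hypothetical.

```python
def to_labels(probabilities, threshold=0.5):
    """Flatten an (n, 1) nested list of probabilities into n binary labels."""
    return [int(row[0] >= threshold) for row in probabilities]


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    # Assumption: return 0.0 when a denominator is zero
    # (the notebook cell uses zero_division=1 instead).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy data (hypothetical), shaped like the model output above.
toy_probs = [[0.01], [0.15], [0.52], [0.28], [0.90], [0.40]]
toy_true = [0, 0, 1, 1, 1, 0]

toy_pred = to_labels(toy_probs)
metrics = binary_metrics(toy_true, toy_pred)
print(toy_pred)
print(metrics)
```

Lowering the threshold trades precision for recall, which may matter for churn data where the positive class is the minority.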

In [55]:
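# Compare the accuracy, precision, recall with previous ones
# (threshold the sigmoid outputs first: classification_report needs class labels)
print(classification_report(y_test, np.greater_equal(predict, 0.5).astype(int).ravel()))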

In [56]:
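# threshold the sigmoid outputs so the confusion matrix receives class labels
y_pred = np.greater_equal(predict, 0.5).astype(int).ravel()

fig_confusion_matrix, ax = plt.subplots(1,1,figsize=(7,4))
# rows are predictions and columns are true labels, matching the axis labels below
ConfusionMatrixDisplay(confusion_matrix(y_pred,y_test,labels=[1,0]),
                       display_labels=["1","0"]).plot(values_format=".0f",ax=ax)

ax.set_xlabel("True Label")
ax.set_ylabel("Predicted Label")
plt.show()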

In [None]:
run.finish()

0,1
Acc,0.79467
F1,0.51496
Precision,0.49621
Recall,0.53519
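
As a quick consistency check on the summary above, F1 is the harmonic mean of precision and recall. The snippet below recomputes it from the rounded values in the table; the variable names are illustrative.

```python
# Values copied from the run summary above (already rounded to 5 decimals).
precision = 0.49621
recall = 0.53519

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 5))
```

The result agrees with the reported F1 of 0.51496 to five decimals.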
