# Churn Prediction - Classification
Authors:
* Lucas Ismael Campos Medeiros
* José Marcos Leal Barbosa Filho

## Dataset Description
This is a **customer churn** modeling dataset containing 10,000 rows (each representing a unique customer) and 14 columns: 13 features plus one target (**Exited**).

The data is composed of both numerical and categorical features.

## Load Libraries

In [4]:
!pip install wandb
!pip install pytest pytest-sugar
!pip install pandas-profiling==3.1.0



In [5]:
import wandb
import logging
import tempfile
import os
import joblib
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from pandas_profiling import ProfileReport
from sklearn.model_selection import train_test_split
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.tree import DecisionTreeClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import fbeta_score, precision_score, recall_score, accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

-----------------------------------

## 1.Extract, Transform and Load (ETL)

### 1.1.Fetch Data

In [3]:
# columns used
columns = ['RowNumber', 'CustomerId', 'Surname', 'CreditScore',
           'Geography', 'Gender', 'Age', 'Tenure', 
           'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember',
           'EstimatedSalary', 'Exited']
# importing the dataset; header=None keeps the file's own header line as row 0 (dropped in the next cell)
churndf = pd.read_csv("https://github.com/x4nth055/pythoncode-tutorials/raw/master/machine-learning/customer-churn-detection/Churn_Modelling.csv",
                      header=None,
                      names=columns)
churndf.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
| 1 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0 | 1 | 1 | 1 | 101348.88 | 1 |
| 2 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 3 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 4 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0 | 2 | 0 | 0 | 93826.63 | 0 |


* We chose to remove the following columns:
  * **RowNumber:** only indicates the row number;
  * **CustomerId:** high-cardinality column, with 10,000 unique IDs;
  * **Surname:** high-cardinality column holding each customer's surname.
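
One quick way to confirm that a column behaves like an identifier rather than a feature is to compare its number of distinct values with the number of rows. The sketch below uses a small made-up frame; the values and the `cardinality_ratio` helper are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy data for illustration only; the real dataset has 10,000 rows.
toy = pd.DataFrame({
    "CustomerId": [15634602, 15647311, 15619304, 15701354],
    "Geography": ["France", "Spain", "France", "France"],
})

def cardinality_ratio(frame: pd.DataFrame) -> pd.Series:
    """Fraction of distinct values per column (1.0 means every row is unique)."""
    return frame.nunique() / len(frame)

ratios = cardinality_ratio(toy)
print(ratios)
```

A ratio close to 1.0 suggests the column carries no pattern a model could generalise from.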

In [4]:
# removing unnecessary columns and resetting indexes
# row 0 holds the file's header line (a side effect of header=None), so drop it first
churndf = churndf.drop(index=0)
churndf.drop(columns=['RowNumber', 'CustomerId', 'Surname'], inplace=True)
churndf.reset_index(drop=True,inplace=True)
churndf.head()

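| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |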
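
Row 0 has to be dropped above because `header=None` makes pandas treat the file's header line as data. A possible alternative, sketched here on an in-memory CSV with made-up sample text, is to pass `header=0` together with `names=`, which replaces the header and lets pandas infer numeric dtypes directly.

```python
import io
import pandas as pd

# Two-row stand-in for the real file (illustrative values only).
csv_text = "RowNumber,CreditScore,Geography\n1,619,France\n2,608,Spain\n"
names = ["RowNumber", "CreditScore", "Geography"]

# header=None: the header line becomes row 0, so numeric columns are read as text.
with_header_row = pd.read_csv(io.StringIO(csv_text), header=None, names=names)

# header=0: the header line is consumed and replaced by `names`.
clean = pd.read_csv(io.StringIO(csv_text), header=0, names=names)

print(len(with_header_row), len(clean))
print(clean.dtypes)
```

The notebook's approach still works because the data is written to `raw_data.csv` and read back later, at which point the dtypes are inferred again.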


In [5]:
# saving to csv
churndf.to_csv("raw_data.csv", index=False)

In [6]:
# Login to wandb
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [7]:
# Saving artifact to wandb
!wandb artifact put \
       --name churn_prediction_project/raw_data.csv \
       --type raw_data \
       --description "Customer Churn" raw_data.csv

wandb: Uploading file raw_data.csv to: "lucasismael/churn_prediction_project/raw_data.csv:latest" (raw_data)
wandb: Currently logged in as: lucasismael. Use `wandb login --relogin` to force relogin
wandb: Tracking run with wandb version 0.12.16
wandb: Run data is saved locally in /content/wandb/run-20220524_185843-3mk82fb7
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run brisk-universe-1
wandb: ⭐️ View project at https://wandb.ai/lucasismael/churn_prediction_project
wandb: 🚀 View run at https://wandb.ai/lucasismael/churn_prediction_project/runs/3mk82fb7
Artifact uploaded, use this artifact in a run by adding:

    artifact = run.use_artifact("lucasismael/churn_prediction_project/raw_data.csv:latest")

wandb: Waiting for W&B process to finish... (success).

----------------

### 1.2.Exploratory Data Analysis (EDA)

In [9]:
# Login to Weights & Biases
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: ERROR API key must be 40 characters long, yours was 146
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [10]:
# save_code=True tracks changes to the notebook and syncs them with W&B
run = wandb.init(project="churn_prediction_project", save_code=True)

wandb: Currently logged in as: lucasismael. Use `wandb login --relogin` to force relogin


In [11]:
# download the latest version of artifact raw_data.csv
artifact = run.use_artifact("churn_prediction_project/raw_data.csv:latest")

# create a dataframe from the artifact
df = pd.read_csv(artifact.file())

#### 1.2.1 Pandas Profiling

In [12]:
ProfileReport(df, title= "Pandas Profiling Report", explorative=True)


In [13]:
run.finish()


--------------------------

### 1.3.Preprocessing

In [14]:
# Login to Weights & Biases
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [15]:
input_artifact="churn_prediction_project/raw_data.csv:latest"
artifact_name="preprocessed_data.csv"
artifact_type="clean_data"
artifact_description="Data after preprocessing"

In [16]:
# create a new job_type
run = wandb.init(project="churn_prediction_project", job_type="process_data")

In [17]:
# download the latest version of artifact raw_data.csv
artifact=run.use_artifact(input_artifact)

# create a dataframe from the artifact
df = pd.read_csv(artifact.file())

In [18]:
# delete duplicated rows
df.drop_duplicates(inplace=True)

# generate a "clean data file"
df.to_csv(artifact_name, index=False)

In [19]:
# Create a new artifact and configure with the necessary arguments
artifact = wandb.Artifact(name=artifact_name,
                         type=artifact_type,
                         description=artifact_description)
artifact.add_file(artifact_name)

<ManifestEntry digest: 2VNPzyBON65Yp9cxPORlnA==>

In [20]:
run.log_artifact(artifact)

<wandb.sdk.wandb_artifacts.Artifact at 0x7f8dd905f2d0>

In [21]:
run.finish()


----------------------------

## 2.Data Check

In [None]:
# Login to wandb
!wandb login --relogin

In [26]:
%%file test_data.py
import pytest
import wandb
import pandas as pd

# This is global so all tests are collected under the same run
run = wandb.init(project="churn_prediction_project", job_type="data_checks")

@pytest.fixture(scope="session")
def data():

    local_path = run.use_artifact("churn_prediction_project/preprocessed_data.csv:latest").file()
    df = pd.read_csv(local_path)

    return df

def test_data_length(data):
    """
    We test that we have enough data to continue
    """
    assert len(data) > 1000


def test_number_of_columns(data):
    """
    We test that the data has the expected number of columns
    """
    assert data.shape[1] == 11

def test_column_presence_and_type(data):

    required_columns = {
        #"CustomerId": pd.api.types.is_int64_dtype,
        #"Surname": pd.api.types.is_object_dtype,
        "CreditScore": pd.api.types.is_int64_dtype,
        "Geography": pd.api.types.is_object_dtype,
        "Gender": pd.api.types.is_object_dtype,
        "Age": pd.api.types.is_int64_dtype,
        "Tenure": pd.api.types.is_int64_dtype,
        "Balance": pd.api.types.is_float_dtype,
        "NumOfProducts": pd.api.types.is_int64_dtype,
        "HasCrCard": pd.api.types.is_int64_dtype,
        "IsActiveMember": pd.api.types.is_int64_dtype,
        "EstimatedSalary": pd.api.types.is_float_dtype,  
        "Exited": pd.api.types.is_int64_dtype
    }

    # Check column presence
    assert set(data.columns.values).issuperset(set(required_columns.keys()))

    for col_name, format_verification_funct in required_columns.items():

        assert format_verification_funct(data[col_name]), f"Column {col_name} failed test {format_verification_funct}"


def test_class_names(data):

    # Check that only the known classes are present
    known_classes = [
        0,
        1
    ]

    assert data["Exited"].isin(known_classes).all()


def test_column_ranges(data):

    ranges = {
        "CreditScore": (0, 1000),
        "Age": (0,100),
        "Tenure": (0,10),
        "Balance": (0, 1.484705e+06),
        "NumOfProducts": (1,4),
        "HasCrCard": (0,1),
        "IsActiveMember": (0,1),
        "EstimatedSalary": (0, 1.484705e+06),
        "Exited": (0, 1)
    }

    for col_name, (minimum, maximum) in ranges.items():

        assert data[col_name].dropna().between(minimum, maximum).all(), (
            f"Column {col_name} failed the test. Should be between {minimum} and {maximum}, "
            f"instead min={data[col_name].min()} and max={data[col_name].max()}"
        )

Overwriting test_data.py
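
The range check in `test_column_ranges` can be tried without W&B on a local frame. The following is a minimal sketch, assuming a hypothetical helper `out_of_range_columns` and made-up data; it relies on `Series.between`, which includes both endpoints by default.

```python
import pandas as pd

# Illustrative subset of the ranges used in test_column_ranges.
ranges = {"Age": (0, 100), "Tenure": (0, 10)}

def out_of_range_columns(frame: pd.DataFrame, limits: dict) -> list:
    """Return the columns with at least one non-null value outside [min, max]."""
    return [
        col
        for col, (low, high) in limits.items()
        if not frame[col].dropna().between(low, high).all()
    ]

good = pd.DataFrame({"Age": [18, 100], "Tenure": [0, 10]})
bad = pd.DataFrame({"Age": [18, 130], "Tenure": [0, 10]})

print(out_of_range_columns(good, ranges))
print(out_of_range_columns(bad, ranges))
```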


In [27]:
!pytest . -vv

Test session starts (platform: linux, Python 3.7.13, pytest 3.6.4, pytest-sugar 0.9.4)
cachedir: .pytest_cache
rootdir: /content, inifile:
plugins: typeguard-2.7.1, sugar-0.9.4

 test_data.py::test_data_length ✓                                 20%
 test_data.py::test_number_of_columns ✓                           40%
 test_data.py::test_column_presence_and_type ✓                    60%
 test_data.py::test_class_names ✓                                 80%
 test_data.py::test_column_ranges ✓

In [28]:
run.finish()

----------------------------------------

## 3.Data Segregation

In [29]:
# Login to wandb
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: ERROR API key must be 40 characters long, yours was 336
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [30]:
# global variables

# ratio - 70% train / 30% test
test_size = 0.30

# seed used for reproducibility
seed = 42

# reference (column) to stratify the data
stratify = "Exited"

# name of the input artifact
artifact_input_name = "churn_prediction_project/preprocessed_data.csv:latest"

# type of the artifact
artifact_type = "segregated_data"

In [31]:
# configure logging 
logging.basicConfig(level=logging.INFO,
                   format="%(asctime)s %(message)s",
                   datefmt='%d-%m-%Y %H:%M:%S')

# reference for a logging object
logger = logging.getLogger()

# init wandb project
run = wandb.init(project="churn_prediction_project", job_type="split_data")

logger.info("Downloading and reading artifact")
artifact=run.use_artifact(artifact_input_name)
artifact_path=artifact.file()
df = pd.read_csv(artifact_path)

# Split in train/test
logger.info("Splitting data into train and test")
splits = {}

splits["train"], splits["test"] = train_test_split(df,
                                                  test_size=test_size,
                                                  random_state=seed,
                                                  stratify=df[stratify])

24-05-2022 19:26:53 Downloading and reading artifact
24-05-2022 19:26:53 Splitting data into train and test
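
Passing `stratify=` keeps the class proportions the same in both parts of the split. This can be checked on a toy frame (made-up data with a 20% positive class), as in the sketch below.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 20% positive class (made-up data).
toy = pd.DataFrame({"x": range(100), "Exited": [1] * 20 + [0] * 80})

train, test = train_test_split(
    toy, test_size=0.30, random_state=42, stratify=toy["Exited"]
)

print(len(train), len(test))
print(train["Exited"].mean(), test["Exited"].mean())
```

Without stratification, the churn rate in a small test set could drift away from the rate in the full data purely by chance.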


In [34]:
# Save artifacts
with tempfile.TemporaryDirectory() as tmp_dir:
    
    for split, df in splits.items():
        
        # Build the artifact name from the name of the split
        artifact_name = f"{split}.csv"
        
        # Get the path on disk within the temp directory
        temp_path = os.path.join(tmp_dir,artifact_name)
        
        logger.info(f"Uploading the {split} dataset to {artifact_name}")
        
        # Save then upload to W&B
        df.to_csv(temp_path,index=False)
        
        artifact = wandb.Artifact(name=artifact_name,
                                 type=artifact_type,
                                 description=f"{split} split of dataset {artifact_input_name}")
        artifact.add_file(temp_path)
        
        logger.info("Logging artifact")
        run.log_artifact(artifact)
        
        artifact.wait()

24-05-2022 19:27:36 Uploading the train dataset to train.csv
24-05-2022 19:27:36 Logging artifact
24-05-2022 19:27:38 Uploading the test dataset to test.csv
24-05-2022 19:27:38 Logging artifact


In [35]:
run.finish()


------------------------

## 4.Training

In [6]:
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


### 4.1.Holdout Configuration

In [7]:
# global variables

# ratio used to split train and validation data
val_size = 0.30

# seed used for reproducibility
seed = 42

# reference (column) to stratify the data
stratify = "Exited"

# name of the input artifact
artifact_input_name = "churn_prediction_project/train.csv:latest"

# type of the artifact
artifact_type = "Train"

In [8]:
# configure logging
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(message)s",
                    datefmt='%d-%m-%Y %H:%M:%S')

# reference for a logging obj
logger = logging.getLogger()

# initiate the wandb project
run = wandb.init(project="churn_prediction_project",job_type="train")

logger.info("Downloading and reading train artifact")
local_path = run.use_artifact(artifact_input_name).file()
df_train = pd.read_csv(local_path)

# Splitting train.csv into train and validation datasets
logger.info("Spliting data into train/val")
# hold out part of the training data for validation
x_train, x_val, y_train, y_val = train_test_split(df_train.drop(labels=stratify,axis=1),
                                                  df_train[stratify],
                                                  test_size=val_size,
                                                  random_state=seed,
                                                  shuffle=True,
                                                  stratify=df_train[stratify])

wandb: Currently logged in as: lucasismael. Use `wandb login --relogin` to force relogin


24-05-2022 19:51:59 Downloading and reading train artifact
24-05-2022 19:51:59 Spliting data into train/val


In [9]:
logger.info("x train: {}".format(x_train.shape))
logger.info("y train: {}".format(y_train.shape))
logger.info("x val: {}".format(x_val.shape))
logger.info("y val: {}".format(y_val.shape))

24-05-2022 19:52:28 x train: (4900, 10)
24-05-2022 19:52:28 y train: (4900,)
24-05-2022 19:52:28 x val: (2100, 10)
24-05-2022 19:52:28 y val: (2100,)
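
The logged shapes follow from applying a 30% holdout twice. The arithmetic can be sketched as below, assuming all 10,000 rows survive duplicate removal; `holdout_sizes` is a hypothetical helper that rounds the held-out part up, which is how scikit-learn sizes a fractional `test_size`.

```python
import math

def holdout_sizes(n_rows: int, test_pct: int) -> tuple:
    """Return (n_train, n_test), rounding the held-out part up."""
    n_test = math.ceil(n_rows * test_pct / 100)
    return n_rows - n_test, n_test

# Assumption: all 10,000 rows survive duplicate removal.
n_train_full, n_test = holdout_sizes(10_000, 30)
n_train, n_val = holdout_sizes(n_train_full, 30)

print(n_train_full, n_test)
print(n_train, n_val)
```

Under that assumption, only 49% of the original rows are used to fit the model.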


### 4.2.Data Preparation


#### 4.2.1.Outlier Removal

In [10]:
logger.info("Outlier Removal")
# temporary variable
x = x_train.select_dtypes("int64").copy()

# identify outlier in the dataset
lof = LocalOutlierFactor()
outlier = lof.fit_predict(x)
mask = outlier != -1

24-05-2022 19:52:48 Outlier Removal


In [11]:
logger.info("x_train shape [original]: {}".format(x_train.shape))
logger.info("x_train shape [outlier removal]: {}".format(x_train.loc[mask,:].shape))

24-05-2022 19:52:50 x_train shape [original]: (4900, 10)
24-05-2022 19:52:50 x_train shape [outlier removal]: (4877, 10)


In [12]:
logger.info("y_train shape [original]: {}".format(y_train.shape))
logger.info("y_train shape [outlier removal]: {}".format(y_train.loc[mask].shape))

24-05-2022 19:52:53 y_train shape [original]: (4900,)
24-05-2022 19:52:53 y_train shape [outlier removal]: (4877,)


In [13]:
# To AVOID data leakage, outlier removal is done here, after the split, and not in the preprocessing stage
# Note that we do not apply this procedure to the validation set
x_train = x_train.loc[mask,:].copy()
y_train = y_train[mask].copy()
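
The mask logic above depends on `fit_predict` returning `1` for inliers and `-1` for outliers. A minimal sketch on synthetic points (not the churn data) shows the convention:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 50 points around the origin plus one far-away point (synthetic data).
points = np.vstack([rng.normal(size=(50, 2)), [[10.0, 10.0]]])

# fit_predict returns 1 for inliers and -1 for outliers.
labels = LocalOutlierFactor().fit_predict(points)
mask = labels != -1

print(labels[-1])
print(mask.sum(), "of", len(points), "rows kept")
```

The same boolean mask must be applied to both the features and the target so that the rows stay aligned.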

#### 4.2.2.Target Variable Encoding

* In this case the target variable is already encoded, but we will create an encoder to map the numeric values back to categorical labels.

In [14]:
logger.info("Encoding a Target Variable")

# define a categorical encoding for target variable
le = LabelEncoder()
le.fit(["Continued", "Exited"])
teste = le.inverse_transform(y_train)

24-05-2022 19:52:58 Encoding a Target Variable
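
`LabelEncoder` stores its classes in sorted order, so the position of each label string decides which integer it maps to. The short sketch below (label strings are illustrative) shows the mapping that `inverse_transform` applies:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Classes are stored in sorted order, so "Continued" -> 0 and "Exited" -> 1.
le.fit(["Continued", "Exited"])

labels = le.inverse_transform([0, 1, 1, 0])
print(list(le.classes_))
print(list(labels))
```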


-----------------------

### 4.3.Creating a Full-Pipeline

#### 4.3.1.Transformers

In [15]:
class FeatureSelector(BaseEstimator, TransformerMixin):
    # Class Constructor
    def __init__(self, feature_names):
        self.feature_names = feature_names
        
    # Return self; nothing to do here
    def fit(self, X, y=None):
        return self
    
    # Method that describes what this custom transformer needs to do
    def transform(self, X, y=None):
        return X[self.feature_names]
    
# Handling categorical features
class CategoricalTransformer(BaseEstimator, TransformerMixin):
    # Class constructor method that takes one boolean as argument
    def __init__(self, new_features=True, colnames=None):
        self.new_features = new_features
        self.colnames = colnames
        
    # Return self; nothing else to do here
    def fit(self, X, y=None):
        return self
    
    def get_feature_names_out(self):
        return self.colnames.tolist()
    
    # Clean the categorical columns and return them as a DataFrame
    def transform(self, X, y=None):
        df = pd.DataFrame(X, columns=self.colnames)
        
        # Remove white space in categorical features (applied column by column)
        df = df.apply(lambda col: col.str.strip())
        
        # customize feature?
        # How can I identify what needs to be modified? EDA!
        if self.new_features:
            
            # replace ? with unknown
            edit_cols = ['Geography', 'Gender']
            for col in edit_cols:
                df.loc[df[col].str.contains(r"\?"), col] = 'unknown'
        
        # update column names
        self.colnames = df.columns
        
        # return the cleaned frame (rebuilding it from X here would discard the edits above)
        return df

# transform numerical features
class NumericalTransformer(BaseEstimator, TransformerMixin):
    # Class constructor method that takes a model parameter as its argument
    # model 0: minmax
    # model 1: standard
    # model 2: without scaler
    def __init__(self, model=0, colnames=None):
        self.model = model
        self.colnames = colnames
        self.scaler = None

    # Fit only learns the statistics the scaler needs
    def fit(self, X, y=None):
        df = pd.DataFrame(X, columns=self.colnames)
        
        # minmax
        if self.model == 0:
            self.scaler = MinMaxScaler()
            self.scaler.fit(df)
        # standard scaler
        elif self.model == 1:
            self.scaler = StandardScaler()
            self.scaler.fit(df)
        return self

    # return columns names after transformation
    def get_feature_names_out(self):
        return self.colnames

    # Transformer method we wrote for this transformer
    # Use fitted scalers
    def transform(self, X, y=None):
        df = pd.DataFrame(X, columns=self.colnames)
        
        # update columns name
        self.colnames = df.columns.tolist()

        # minmax
        if self.model == 0:
            # transform data
            df = self.scaler.transform(df)
        elif self.model == 1:
            # transform data
            df = self.scaler.transform(df)
        else:
            df = df.values

        return df
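
The first two `model` options correspond to the usual min-max and z-score formulas; `model=2` leaves the values untouched. A plain-Python sketch on a made-up column follows; the helper names are illustrative, and the population standard deviation is used because that is what `StandardScaler` computes.

```python
from statistics import mean, pstdev

values = [20.0, 30.0, 40.0, 60.0]  # made-up sample of one numeric column

def min_max(xs):
    """Rescale to [0, 1], as in model 0."""
    low, high = min(xs), max(xs)
    return [(x - low) / (high - low) for x in xs]

def z_score(xs):
    """Centre on the mean and divide by the population standard deviation (model 1)."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

print(min_max(values))
print(z_score(values))
```

Because both are monotonic rescalings of each feature, a decision tree is expected to be largely insensitive to the choice.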

#### 4.3.2.Holdout Training Pipeline

In [16]:
# model = 0 (min-max), 1 (z-score), 2 (without normalization)
numerical_model = 2

# Categorical features to pass down the categorical pipeline
categorical_features = x_train.select_dtypes("object").columns.to_list()

# Numerical features to pass down the numerical pipeline
numerical_features = x_train.select_dtypes(["int64","float"]).columns.to_list()

# Defining the steps for the categorical pipeline
categorical_pipeline = Pipeline(steps=[('cat_selector', FeatureSelector(categorical_features)),
                                       ('imputer_cat', SimpleImputer(strategy="most_frequent")),
                                       ('cat_transformer', CategoricalTransformer(colnames=categorical_features)),
                                       # ('cat_encoder','passthrough'
                                       ('cat_encoder', OneHotEncoder(sparse=False, drop="first"))])

# Defining the steps in the numerical pipeline
numerical_pipeline = Pipeline(steps=[('num_selector', FeatureSelector(numerical_features)),
                                     ('imputer_num', SimpleImputer(strategy="median")),
                                     ('num_transformer', NumericalTransformer(numerical_model, 
                                                                              colnames=numerical_features))])

# Combine the numerical and categorical pipelines horizontally into one full pipeline
full_pipeline_preprocessing = FeatureUnion(transformer_list=[('cat_pipeline', categorical_pipeline),
                                                             ('num_pipeline', numerical_pipeline)])

#### 4.3.3.Training

In [48]:
# The full pipeline 
pipe = Pipeline(steps = [('full_pipeline', full_pipeline_preprocessing),
                         ("classifier",DecisionTreeClassifier())])

# training
logger.info("Training")
pipe.fit(x_train, y_train)

# predict
logger.info("Infering")
predict = pipe.predict(x_val)

# Evaluation Metrics
logger.info("Evaluation metrics")
fbeta = fbeta_score(y_val, predict, beta=1, zero_division=1)
precision = precision_score(y_val, predict, zero_division=1)
recall = recall_score(y_val, predict, zero_division=1)
acc = accuracy_score(y_val, predict)

logger.info("Accuracy: {}".format(acc))
logger.info("Precision: {}".format(precision))
logger.info("Recall: {}".format(recall))
logger.info("F1: {}".format(fbeta))

24-05-2022 19:36:55 Training
24-05-2022 19:36:55 Infering
24-05-2022 19:36:55 Evaluation metrics
24-05-2022 19:36:55 Accuracy: 0.7914285714285715
24-05-2022 19:36:55 Precision: 0.48917748917748916
24-05-2022 19:36:55 Recall: 0.5280373831775701
24-05-2022 19:36:55 F1: 0.5078651685393257
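
With `beta=1`, `fbeta_score` reduces to F1, the harmonic mean of precision and recall. As a sanity check, the logged F1 can be reproduced from the logged precision and recall; `f_beta` below is a hypothetical helper written for this check.

```python
import math

# Values copied from the log lines above.
precision = 0.48917748917748916
recall = 0.5280373831775701
logged_f1 = 0.5078651685393257

def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall."""
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)

print(f_beta(precision, recall))
print(math.isclose(f_beta(precision, recall), logged_f1, rel_tol=1e-9))
```

Accuracy is much higher than F1 here, which is typical when the positive class (churned customers) is the minority.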


In [49]:
run.finish()


--------------------------

### 4.4.Hyperparameter Tuning

In [50]:
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [51]:
seed = 42

In [52]:
sweep_config = {
    # try grid or random
    "method": "grid",
    "metric": {
        "name": "Accuracy",
        "goal": "maximize"
        },
    "parameters": {
        "criterion": {
            "values": ["gini","entropy"]
            },
        "splitter": {
            "values": ["random","best"]
        },
        "model": {
            "values": [0,1,2]
        },
        "random_state": {
            "values": [seed]
        }
    }
}

sweep_id = wandb.sweep(sweep_config, project="churn_prediction_project")

Create sweep with ID: zzzgbxfk
Sweep URL: https://wandb.ai/lucasismael/churn_prediction_project/sweeps/zzzgbxfk
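
A grid sweep tries every combination of the listed values. Enumerating them locally, as in this sketch, shows that the grid above has 12 points; the order in which W&B visits them may differ from the order printed here.

```python
from itertools import product

# Same value lists as sweep_config["parameters"].
grid = {
    "criterion": ["gini", "entropy"],
    "splitter": ["random", "best"],
    "model": [0, 1, 2],
    "random_state": [42],
}

combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(combinations))
print(combinations[0])
```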


In [53]:
def train():
    with wandb.init() as run:

        # The full pipeline 
        pipe = Pipeline(steps = [('full_pipeline', full_pipeline_preprocessing),
                                    ("classifier",DecisionTreeClassifier())
                                    ]
                        )

        # update the pipeline parameters that we would like to tune
        pipe.set_params(**{"full_pipeline__num_pipeline__num_transformer__model": run.config.model})
        pipe.set_params(**{"classifier__criterion": run.config.criterion})
        pipe.set_params(**{"classifier__splitter": run.config.splitter})
        pipe.set_params(**{"classifier__random_state": run.config.random_state})

        # training
        logger.info("Training")
        pipe.fit(x_train, y_train)

        # predict
        logger.info("Infering")
        predict = pipe.predict(x_val)

        # Evaluation Metrics
        logger.info("Evaluation metrics")
        fbeta = fbeta_score(y_val, predict, beta=1, zero_division=1)
        precision = precision_score(y_val, predict, zero_division=1)
        recall = recall_score(y_val, predict, zero_division=1)
        acc = accuracy_score(y_val, predict)

        logger.info("Accuracy: {}".format(acc))
        logger.info("Precision: {}".format(precision))
        logger.info("Recall: {}".format(recall))
        logger.info("F1: {}".format(fbeta))

        run.summary["Accuracy"] = acc
        run.summary["Precision"] = precision
        run.summary["Recall"] = recall
        run.summary["F1"] = fbeta
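
The long keys passed to `set_params` follow scikit-learn's nested-parameter convention: each `__` descends one level, from step name to sub-step to parameter. A minimal sketch with a stand-in pipeline follows; it mirrors the step names used above but swaps in `StandardScaler` so the example is self-contained.

```python
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in pipeline with the same nesting as the notebook's (illustrative only).
demo = Pipeline(steps=[
    ("full_pipeline", FeatureUnion([
        ("num_pipeline", Pipeline([("num_transformer", StandardScaler())])),
    ])),
    ("classifier", DecisionTreeClassifier()),
])

# Each "__" descends one level: step name, then sub-step, then parameter.
demo.set_params(
    full_pipeline__num_pipeline__num_transformer__with_mean=False,
    classifier__criterion="entropy",
)

params = demo.get_params()
print(params["classifier__criterion"])
```

Calling `get_params()` on a pipeline is a convenient way to list every key that `set_params` accepts.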

In [54]:
wandb.agent(sweep_id, train, count=12)

wandb: Agent Starting Run: ealu1vm7 with config:
wandb:   criterion: gini
wandb:   model: 0
wandb:   random_state: 42
wandb:   splitter: random


24-05-2022 19:39:36 Training
24-05-2022 19:39:36 Infering
24-05-2022 19:39:36 Evaluation metrics
24-05-2022 19:39:36 Accuracy: 0.7785714285714286
24-05-2022 19:39:36 Precision: 0.45951859956236324
24-05-2022 19:39:36 Recall: 0.49065420560747663
24-05-2022 19:39:36 F1: 0.47457627118644063

| Metric | Value |
|---|---|
| Accuracy | 0.77857 |
| F1 | 0.47458 |
| Precision | 0.45952 |
| Recall | 0.49065 |


wandb: Agent Starting Run: ia4zxdlc with config:
wandb:   criterion: gini
wandb:   model: 0
wandb:   random_state: 42
wandb:   splitter: best


24-05-2022 19:39:56 Training
24-05-2022 19:39:56 Infering
24-05-2022 19:39:56 Evaluation metrics
24-05-2022 19:39:56 Accuracy: 0.7947619047619048
24-05-2022 19:39:56 Precision: 0.49682875264270615
24-05-2022 19:39:56 Recall: 0.5490654205607477
24-05-2022 19:39:56 F1: 0.5216426193118757

| Metric | Value |
|---|---|
| Accuracy | 0.79476 |
| F1 | 0.52164 |
| Precision | 0.49683 |
| Recall | 0.54907 |


wandb: Agent Starting Run: x9v8t42p with config:
wandb:   criterion: gini
wandb:   model: 1
wandb:   random_state: 42
wandb:   splitter: random


24-05-2022 19:40:16 Training
24-05-2022 19:40:16 Infering
24-05-2022 19:40:16 Evaluation metrics
24-05-2022 19:40:16 Accuracy: 0.7785714285714286
24-05-2022 19:40:16 Precision: 0.45951859956236324
24-05-2022 19:40:16 Recall: 0.49065420560747663
24-05-2022 19:40:16 F1: 0.47457627118644063

| Metric | Value |
|---|---|
| Accuracy | 0.77857 |
| F1 | 0.47458 |
| Precision | 0.45952 |
| Recall | 0.49065 |


wandb: Agent Starting Run: 2s9mfh65 with config:
wandb:   criterion: gini
wandb:   model: 1
wandb:   random_state: 42
wandb:   splitter: best


24-05-2022 19:40:37 Training
24-05-2022 19:40:37 Infering
24-05-2022 19:40:37 Evaluation metrics
24-05-2022 19:40:37 Accuracy: 0.7947619047619048
24-05-2022 19:40:37 Precision: 0.4968553459119497
24-05-2022 19:40:37 Recall: 0.5537383177570093
24-05-2022 19:40:37 F1: 0.523756906077348

| Metric | Value |
|---|---|
| Accuracy | 0.79476 |
| F1 | 0.52376 |
| Precision | 0.49686 |
| Recall | 0.55374 |


wandb: Agent Starting Run: s4yclbev with config:
wandb:   criterion: gini
wandb:   model: 2
wandb:   random_state: 42
wandb:   splitter: random


24-05-2022 19:40:57 Training
24-05-2022 19:40:57 Infering
24-05-2022 19:40:57 Evaluation metrics
24-05-2022 19:40:57 Accuracy: 0.7785714285714286
24-05-2022 19:40:57 Precision: 0.45951859956236324
24-05-2022 19:40:57 Recall: 0.49065420560747663
24-05-2022 19:40:57 F1: 0.47457627118644063

| Metric | Value |
|---|---|
| Accuracy | 0.77857 |
| F1 | 0.47458 |
| Precision | 0.45952 |
| Recall | 0.49065 |


wandb: Agent Starting Run: kubwwlcn with config:
wandb:   criterion: gini
wandb:   model: 2
wandb:   random_state: 42
wandb:   splitter: best


24-05-2022 19:41:18 Training
24-05-2022 19:41:18 Infering
24-05-2022 19:41:18 Evaluation metrics
24-05-2022 19:41:18 Accuracy: 0.7952380952380952
24-05-2022 19:41:18 Precision: 0.4978902953586498
24-05-2022 19:41:18 Recall: 0.5514018691588785
24-05-2022 19:41:18 F1: 0.5232815964523281

0,1
Accuracy,0.79524
F1,0.52328
Precision,0.49789
Recall,0.5514


wandb: Agent Starting Run: 98it60c0 with config:
wandb: 	criterion: entropy
wandb: 	model: 0
wandb: 	random_state: 42
wandb: 	splitter: random


24-05-2022 19:41:38 Training
24-05-2022 19:41:39 Infering
24-05-2022 19:41:39 Evaluation metrics
24-05-2022 19:41:39 Accuracy: 0.7871428571428571
24-05-2022 19:41:39 Precision: 0.4789356984478936
24-05-2022 19:41:39 Recall: 0.5046728971962616
24-05-2022 19:41:39 F1: 0.49146757679180886



0,1
Accuracy,0.78714
F1,0.49147
Precision,0.47894
Recall,0.50467


wandb: Agent Starting Run: 1nqlm061 with config:
wandb: 	criterion: entropy
wandb: 	model: 0
wandb: 	random_state: 42
wandb: 	splitter: best


24-05-2022 19:41:59 Training
24-05-2022 19:41:59 Infering
24-05-2022 19:41:59 Evaluation metrics
24-05-2022 19:41:59 Accuracy: 0.799047619047619
24-05-2022 19:41:59 Precision: 0.5066079295154186
24-05-2022 19:41:59 Recall: 0.5373831775700935
24-05-2022 19:41:59 F1: 0.5215419501133787



0,1
Accuracy,0.79905
F1,0.52154
Precision,0.50661
Recall,0.53738


wandb: Agent Starting Run: qtaqc3dj with config:
wandb: 	criterion: entropy
wandb: 	model: 1
wandb: 	random_state: 42
wandb: 	splitter: random


24-05-2022 19:42:20 Training
24-05-2022 19:42:20 Infering
24-05-2022 19:42:20 Evaluation metrics
24-05-2022 19:42:20 Accuracy: 0.7871428571428571
24-05-2022 19:42:20 Precision: 0.4789356984478936
24-05-2022 19:42:20 Recall: 0.5046728971962616
24-05-2022 19:42:20 F1: 0.49146757679180886



0,1
Accuracy,0.78714
F1,0.49147
Precision,0.47894
Recall,0.50467


wandb: Agent Starting Run: wtojb7lp with config:
wandb: 	criterion: entropy
wandb: 	model: 1
wandb: 	random_state: 42
wandb: 	splitter: best


24-05-2022 19:42:40 Training
24-05-2022 19:42:40 Infering
24-05-2022 19:42:40 Evaluation metrics
24-05-2022 19:42:40 Accuracy: 0.8004761904761905
24-05-2022 19:42:40 Precision: 0.5099778270509978
24-05-2022 19:42:40 Recall: 0.5373831775700935
24-05-2022 19:42:40 F1: 0.5233219567690558



0,1
Accuracy,0.80048
F1,0.52332
Precision,0.50998
Recall,0.53738


wandb: Agent Starting Run: 6racsbqq with config:
wandb: 	criterion: entropy
wandb: 	model: 2
wandb: 	random_state: 42
wandb: 	splitter: random


24-05-2022 19:43:00 Training
24-05-2022 19:43:00 Infering
24-05-2022 19:43:01 Evaluation metrics
24-05-2022 19:43:01 Accuracy: 0.7871428571428571
24-05-2022 19:43:01 Precision: 0.4789356984478936
24-05-2022 19:43:01 Recall: 0.5046728971962616
24-05-2022 19:43:01 F1: 0.49146757679180886



0,1
Accuracy,0.78714
F1,0.49147
Precision,0.47894
Recall,0.50467


wandb: Agent Starting Run: br7a8arz with config:
wandb: 	criterion: entropy
wandb: 	model: 2
wandb: 	random_state: 42
wandb: 	splitter: best


24-05-2022 19:43:21 Training
24-05-2022 19:43:21 Infering
24-05-2022 19:43:21 Evaluation metrics
24-05-2022 19:43:21 Accuracy: 0.8
24-05-2022 19:43:21 Precision: 0.5088495575221239
24-05-2022 19:43:21 Recall: 0.5373831775700935
24-05-2022 19:43:21 F1: 0.5227272727272727



0,1
Accuracy,0.8
F1,0.52273
Precision,0.50885
Recall,0.53738
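
To make the choice of configuration easier to follow, the sweep results can be tabulated and ranked. The sketch below hard-codes accuracy and F1 for the eight runs whose complete configuration is visible in the log above (values copied from that log). The `runs` list and the `best_by` helper are illustrative names and are not part of the notebook.

```python
# Metrics copied from the sweep log above. Only runs whose full
# configuration (criterion, model, splitter) is visible are included.
runs = [
    {"run": "s4yclbev", "criterion": "gini", "model": 2, "splitter": "random",
     "accuracy": 0.7785714285714286, "f1": 0.47457627118644063},
    {"run": "kubwwlcn", "criterion": "gini", "model": 2, "splitter": "best",
     "accuracy": 0.7952380952380952, "f1": 0.5232815964523281},
    {"run": "98it60c0", "criterion": "entropy", "model": 0, "splitter": "random",
     "accuracy": 0.7871428571428571, "f1": 0.49146757679180886},
    {"run": "1nqlm061", "criterion": "entropy", "model": 0, "splitter": "best",
     "accuracy": 0.799047619047619, "f1": 0.5215419501133787},
    {"run": "qtaqc3dj", "criterion": "entropy", "model": 1, "splitter": "random",
     "accuracy": 0.7871428571428571, "f1": 0.49146757679180886},
    {"run": "wtojb7lp", "criterion": "entropy", "model": 1, "splitter": "best",
     "accuracy": 0.8004761904761905, "f1": 0.5233219567690558},
    {"run": "6racsbqq", "criterion": "entropy", "model": 2, "splitter": "random",
     "accuracy": 0.7871428571428571, "f1": 0.49146757679180886},
    {"run": "br7a8arz", "criterion": "entropy", "model": 2, "splitter": "best",
     "accuracy": 0.8, "f1": 0.5227272727272727},
]


def best_by(metric):
    """Return the run with the highest value of `metric` (hypothetical helper)."""
    return max(runs, key=lambda r: r[metric])


for metric in ("accuracy", "f1"):
    top = best_by(metric)
    print(metric, top["run"], top["criterion"], top["model"], top["splitter"],
          round(top[metric], 5))
```

Among these eight runs, `wtojb7lp` (entropy criterion, `best` splitter, scaler option 1) ranks first on both accuracy and F1, which is the configuration trained in section 4.5.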


--------------------------

### 4.5.Configure and Train the best model

In [17]:
# initiate the wandb project
run = wandb.init(project="churn_prediction_project",job_type="train")


In [18]:
# The full pipeline 
pipe = Pipeline(steps = [('full_pipeline', full_pipeline_preprocessing),
                         ("classifier",DecisionTreeClassifier())
                         ]
                )

# update the pipeline parameters that we would like to tune
pipe.set_params(**{"full_pipeline__num_pipeline__num_transformer__model": 1})
pipe.set_params(**{"classifier__criterion": 'entropy'})
pipe.set_params(**{"classifier__splitter": 'best'})
pipe.set_params(**{"classifier__random_state": 42})



# training
logger.info("Training")
pipe.fit(x_train, y_train)

# predict
logger.info("Inferring")
predict = pipe.predict(x_val)

# Evaluation Metrics
logger.info("Evaluation metrics")
fbeta = fbeta_score(y_val, predict, beta=1, zero_division=1)
precision = precision_score(y_val, predict, zero_division=1)
recall = recall_score(y_val, predict, zero_division=1)
acc = accuracy_score(y_val, predict)

logger.info("Accuracy: {}".format(acc))
logger.info("Precision: {}".format(precision))
logger.info("Recall: {}".format(recall))
logger.info("F1: {}".format(fbeta))

run.summary["Acc"] = acc
run.summary["Precision"] = precision
run.summary["Recall"] = recall
run.summary["F1"] = fbeta

24-05-2022 19:57:43 Training
24-05-2022 19:57:43 Inferring
24-05-2022 19:57:43 Evaluation metrics
24-05-2022 19:57:43 Accuracy: 0.8004761904761905
24-05-2022 19:57:43 Precision: 0.5099778270509978
24-05-2022 19:57:43 Recall: 0.5373831775700935
24-05-2022 19:57:43 F1: 0.5233219567690558
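
With `beta=1`, `fbeta_score` is the F1 score, that is, the harmonic mean of precision and recall. As a quick sanity check, the logged F1 can be recomputed from the logged precision and recall (the values below are copied from the output above):

```python
# Precision and recall as logged for the validation set
precision = 0.5099778270509978
recall = 0.5373831775700935

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f1)
```

The result agrees with the logged F1 value up to floating-point rounding.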


In [19]:
# Get categorical column names
cat_names = pipe.named_steps['full_pipeline'].get_params()["cat_pipeline"][3].get_feature_names_out().tolist()
cat_names

['Geography_Germany', 'Geography_Spain', 'Gender_Male']

In [20]:
# Get numerical column names
num_names = pipe.named_steps['full_pipeline'].get_params()["num_pipeline"][2].get_feature_names_out()
num_names

['CreditScore',
 'Age',
 'Tenure',
 'Balance',
 'NumOfProducts',
 'HasCrCard',
 'IsActiveMember',
 'EstimatedSalary']

In [21]:
# merge all column names together
all_names = cat_names + num_names
all_names

['Geography_Germany',
 'Geography_Spain',
 'Gender_Male',
 'CreditScore',
 'Age',
 'Tenure',
 'Balance',
 'NumOfProducts',
 'HasCrCard',
 'IsActiveMember',
 'EstimatedSalary']

In [22]:
# Log all classifier plots to wandb
# For complete documentation, see: https://docs.wandb.ai/guides/integrations/scikit
wandb.sklearn.plot_classifier(pipe.get_params()["classifier"],
                              full_pipeline_preprocessing.transform(x_train),
                              full_pipeline_preprocessing.transform(x_val),
                              y_train,
                              y_val,
                              predict,
                              pipe.predict_proba(x_val),
                              [0,1],
                              model_name='BestModel', feature_names=all_names)

wandb: Plotting BestModel.
wandb: Logged feature importances.
wandb: Logged confusion matrix.
wandb: Logged summary metrics.
wandb: Logged class proportions.
wandb: Logged calibration curve.
wandb: Logged roc curve.
wandb: Logged precision-recall curve.


-----------------------------

### 4.6.Export the best model

In [23]:
# types and names of the artifacts
artifact_type = "inference_artifact"
artifact_encoder = "target_encoder"
artifact_model = "model_export"

In [24]:
logger.info("Dumping the artifacts to disk")
# Save the model using joblib
joblib.dump(pipe, artifact_model)

# Save the target encoder using joblib
joblib.dump(le, artifact_encoder)

24-05-2022 19:59:29 Dumping the artifacts to disk


['target_encoder']

In [25]:
# Model artifact
artifact = wandb.Artifact(artifact_model,
                          type=artifact_type,
                          description="A full pipeline composed of a Preprocessing Stage and a Decision Tree model"
                          )

logger.info("Logging model artifact")
artifact.add_file(artifact_model)
run.log_artifact(artifact)

24-05-2022 19:59:39 Logging model artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f2687ec8e50>

In [26]:
# Target encoder artifact
artifact = wandb.Artifact(artifact_encoder,
                          type=artifact_type,
                          description="The encoder used to encode the target variable"
                          )

logger.info("Logging target encoder artifact")
artifact.add_file(artifact_encoder)
run.log_artifact(artifact)

24-05-2022 19:59:48 Logging target encoder artifact


<wandb.sdk.wandb_artifacts.Artifact at 0x7f268ae98150>

In [27]:
run.finish()


0,1
Acc,0.80048
F1,0.52332
Precision,0.50998
Recall,0.53738


----------------------------

## 5.Testing

In [28]:
!wandb login --relogin

wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


In [29]:
run = wandb.init()

In [30]:
class FeatureSelector(BaseEstimator, TransformerMixin):
    # Class Constructor
    def __init__(self, feature_names):
        self.feature_names = feature_names
        
    # Nothing to learn here; return self
    def fit(self, X, y=None):
        return self
    
    # Select only the configured feature columns
    def transform(self, X, y=None):
        return X[self.feature_names]
    
# Handling categorical features
class CategoricalTransformer(BaseEstimator, TransformerMixin):
    # Class constructor: new_features toggles the custom cleaning, colnames holds the column names
    def __init__(self, new_features=True, colnames=None):
        self.new_features = new_features
        self.colnames = colnames
        
    # Nothing to learn here; return self
    def fit(self, X, y=None):
        return self
    
    def get_feature_names_out(self):
        return self.colnames.tolist()
    
    # Clean the categorical columns
    def transform(self, X, y=None):
        df = pd.DataFrame(X, columns=self.colnames)
        
        # Remove leading and trailing whitespace in each categorical column
        df = df.apply(lambda col: col.str.strip())
        
        # Optional custom cleaning; what needs to be modified is identified during EDA
        if self.new_features:
            
            # replace values containing "?" with 'unknown'
            edit_cols = ['Geography', 'Gender']
            for col in edit_cols:
                df.loc[df[col].str.contains(r"\?"), col] = 'unknown'
        
        # update column names and return the cleaned frame
        # (rebuilding it from X here would discard the cleaning above)
        self.colnames = df.columns
        
        return df

# transform numerical features
class NumericalTransformer(BaseEstimator, TransformerMixin):
    # Class constructor method that takes a model parameter as its argument
    # model 0: minmax
    # model 1: standard
    # model 2: without scaler
    def __init__(self, model=0, colnames=None):
        self.model = model
        self.colnames = colnames
        self.scaler = None

    # Fit only learns the statistics needed by the selected scaler
    def fit(self, X, y=None):
        df = pd.DataFrame(X, columns=self.colnames)
        
        # minmax
        if self.model == 0:
            self.scaler = MinMaxScaler()
            self.scaler.fit(df)
        # standard scaler
        elif self.model == 1:
            self.scaler = StandardScaler()
            self.scaler.fit(df)
        return self

    # return columns names after transformation
    def get_feature_names_out(self):
        return self.colnames

    # Apply the scaler learned in fit (models 0 and 1); model 2 passes the values through
    def transform(self, X, y=None):
        df = pd.DataFrame(X, columns=self.colnames)
        
        # update column names
        self.colnames = df.columns.tolist()

        if self.model in (0, 1):
            # scale with the fitted MinMaxScaler or StandardScaler
            df = self.scaler.transform(df)
        else:
            # no scaling: return the raw values
            df = df.values

        return df
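
The `model` parameter of `NumericalTransformer` selects between min-max scaling (0), standardization (1) and no scaling (2). As a rough illustration of what the first two options compute, the sketch below applies both formulas to a handful of `CreditScore` values using only the standard library. The helper names `min_max` and `standardize` are hypothetical; the notebook itself relies on scikit-learn's `MinMaxScaler` and `StandardScaler`, which by default map to the range [0, 1] and use the population standard deviation, respectively.

```python
from statistics import mean, pstdev

# A small illustrative sample of CreditScore values
credit_scores = [790, 521, 712, 729, 695]


def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Center on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


print([round(v, 3) for v in min_max(credit_scores)])
print([round(v, 3) for v in standardize(credit_scores)])
```

Decision trees split on thresholds, so monotonic rescaling of a feature generally has little effect on the learned tree, which is consistent with the small differences between scaler options in the sweep.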

In [31]:
# global variables

# name of the artifact related to test dataset
artifact_test_name = "churn_prediction_project/test.csv:latest"

# name of the model artifact
artifact_model_name = "churn_prediction_project/model_export:latest"

# name of the target encoder artifact
artifact_encoder_name = "churn_prediction_project/target_encoder:latest"

In [32]:
# configure logging
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(message)s",
                    datefmt='%d-%m-%Y %H:%M:%S')

# reference for a logging obj
logger = logging.getLogger()

In [33]:
# initiate the wandb project
run = wandb.init(project="churn_prediction_project",job_type="test")


In [34]:
logger.info("Downloading and reading test artifact")
test_data_path = run.use_artifact(artifact_test_name).file()
df_test = pd.read_csv(test_data_path)

# Extract the target from the features
logger.info("Extracting target from dataframe")
x_test = df_test.copy()
y_test = x_test.pop("Exited")

24-05-2022 20:03:02 Downloading and reading test artifact
24-05-2022 20:03:02 Extracting target from dataframe


In [35]:
x_test.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,790,Spain,Male,37,6,0.0,2,1,1,119484.01
1,521,France,Male,35,6,96423.84,1,1,0,10488.44
2,712,France,Female,37,1,106881.5,2,0,0,169386.81
3,729,Spain,Female,38,10,0.0,2,1,0,189727.12
4,695,Germany,Male,52,8,103023.26,1,1,1,22485.64


In [36]:
y_test.head()

0    0
1    0
2    0
3    0
4    0
Name: Exited, dtype: int64

In [37]:
# Extract the encoding of the target variable
logger.info("Extracting the encoding of the target variable")
encoder_export_path = run.use_artifact(artifact_encoder_name).file()
le = joblib.load(encoder_export_path)

24-05-2022 20:03:33 Extracting the encoding of the target variable


In [38]:
y_test

0       0
1       0
2       0
3       0
4       0
       ..
2995    0
2996    0
2997    0
2998    1
2999    0
Name: Exited, Length: 3000, dtype: int64

In [39]:
# Download inference artifact
logger.info("Downloading and loading the exported model")
model_export_path = run.use_artifact(artifact_model_name).file()
pipe = joblib.load(model_export_path)

24-05-2022 20:04:15 Downloading and loading the exported model


In [40]:
# predict
logger.info("Inferring")
predict = pipe.predict(x_test)

# Evaluation Metrics
logger.info("Test Evaluation metrics")
fbeta = fbeta_score(y_test, predict, beta=1, zero_division=1)
precision = precision_score(y_test, predict, zero_division=1)
recall = recall_score(y_test, predict, zero_division=1)
acc = accuracy_score(y_test, predict)

logger.info("Test Accuracy: {}".format(acc))
logger.info("Test Precision: {}".format(precision))
logger.info("Test Recall: {}".format(recall))
logger.info("Test F1: {}".format(fbeta))

run.summary["Acc"] = acc
run.summary["Precision"] = precision
run.summary["Recall"] = recall
run.summary["F1"] = fbeta

24-05-2022 20:04:26 Inferring
24-05-2022 20:04:26 Test Evaluation metrics
24-05-2022 20:04:26 Test Accuracy: 0.7946666666666666
24-05-2022 20:04:26 Test Precision: 0.496206373292868
24-05-2022 20:04:26 Test Recall: 0.5351882160392799
24-05-2022 20:04:26 Test F1: 0.5149606299212598
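
To see how well the validation results carry over to unseen data, the test metrics can be set side by side with the validation metrics logged in section 4.5. The sketch below hard-codes both sets of values from the notebook output; the dictionary names are illustrative.

```python
# Validation metrics logged when training the best model (section 4.5)
validation = {
    "Accuracy": 0.8004761904761905,
    "Precision": 0.5099778270509978,
    "Recall": 0.5373831775700935,
    "F1": 0.5233219567690558,
}

# Test metrics logged above
test = {
    "Accuracy": 0.7946666666666666,
    "Precision": 0.496206373292868,
    "Recall": 0.5351882160392799,
    "F1": 0.5149606299212598,
}

# Difference test - validation for each metric
gaps = {name: test[name] - validation[name] for name in validation}

for name, gap in gaps.items():
    print(f"{name:<9} val={validation[name]:.4f} test={test[name]:.4f} diff={gap:+.4f}")
```

All four metrics are slightly lower on the test set, each by less than 0.02.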


In [41]:
# Per-class precision, recall and F1 on the test set, to compare with the validation results
print(classification_report(y_test,predict))

              precision    recall  f1-score   support

           0       0.88      0.86      0.87      2389
           1       0.50      0.54      0.51       611

    accuracy                           0.79      3000
   macro avg       0.69      0.70      0.69      3000
weighted avg       0.80      0.79      0.80      3000



In [43]:
fig_confusion_matrix, ax = plt.subplots(1,1,figsize=(7,4))
# predictions are passed first, so rows are predicted labels and columns are true labels
ConfusionMatrixDisplay(confusion_matrix(predict,y_test,labels=[1,0]),
                       display_labels=["1","0"]).plot(values_format=".0f",ax=ax)

ax.set_xlabel("True Label")
ax.set_ylabel("Predicted Label")
plt.show()
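
The plot itself is not reproduced in this text export, but its four cells can be recovered from the logged test metrics and the class supports in the classification report. The sketch below is a derivation under that assumption, not output from the notebook. It uses the same layout as the plot: rows are predicted labels and columns are true labels, both in the order `[1, 0]`.

```python
# Values copied from the test output above
n_total = 3000          # size of the test set
n_positive = 611        # support of class 1 in the classification report
precision = 0.496206373292868
recall = 0.5351882160392799

# recall = TP / (TP + FN), where TP + FN is the number of true positives in the data
tp = round(recall * n_positive)
# precision = TP / (TP + FP), where TP + FP is the number of predicted positives
predicted_positive = round(tp / precision)

fp = predicted_positive - tp
fn = n_positive - tp
tn = n_total - n_positive - fp

# Rows: predicted 1, predicted 0. Columns: true 1, true 0.
matrix = [[tp, fp],
          [fn, tn]]
accuracy = (tp + tn) / n_total

print(matrix)
print(accuracy)
```

The accuracy recomputed from these counts agrees with the logged test accuracy, which supports the reconstruction.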

In [44]:
run.finish()


0,1
Acc,0.79467
F1,0.51496
Precision,0.49621
Recall,0.53519
