# Model Tracking exercise solution

This notebook contains an exercise solution that trains a Random Forest on a UCI dataset and uses the MLflow tracking server to log the parameters, metrics and models.

The model is trained using the [banknote authentication Data Set](https://archive.ics.uci.edu/ml/datasets/banknote+authentication) from the UCI repository.

The first step is to import all the required libraries

In [2]:
import os
import warnings
import sys

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn
import boto3
from sklearn.model_selection import GridSearchCV
from hpsklearn import HyperoptEstimator, random_forest

from sklearn.metrics import accuracy_score, recall_score, f1_score, average_precision_score

import logging

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)

warnings.filterwarnings("ignore")
np.random.seed(40)

Set the tracking URI so the MLflow client knows where the tracking server is running. In this case, the server is running locally. If the tracking server runs on a remote host, use that host's IP address instead of `localhost`.

In [3]:
mlflow.set_tracking_uri("http://localhost:80")
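
When the server runs on another machine, one option is to read its address from the environment instead of hard-coding it. The sketch below is only an illustration: the `resolve_tracking_uri` helper is hypothetical, and the fallback value is the local address used in this notebook. `MLFLOW_TRACKING_URI` is the environment variable that the MLflow client itself consults.

```python
import os
from urllib.parse import urlparse

DEFAULT_URI = "http://localhost:80"  # local server used in this notebook


def resolve_tracking_uri(default: str = DEFAULT_URI) -> str:
    """Hypothetical helper: take the URI from the environment, else use the default."""
    uri = os.environ.get("MLFLOW_TRACKING_URI", default)
    parsed = urlparse(uri)
    # Reject values without an http(s) scheme or a host, e.g. "localhost:80"
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"Not a valid HTTP tracking URI: {uri!r}")
    return uri


tracking_uri = resolve_tracking_uri()
print(tracking_uri)
# In the notebook this value would then be handed to the client:
# mlflow.set_tracking_uri(tracking_uri)
```

Validating the value early gives a clearer error than a failed request later, when the first run is logged.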

Verify that the MLflow client is pointing to the correct endpoints

In [4]:
print(mlflow.get_tracking_uri())

http://localhost:80


Define a function to evaluate the model

In [12]:
def eval_metrics(actual, pred):
    accuracy = accuracy_score(actual, pred)
    recall = recall_score(actual, pred)
    f1 = f1_score(actual, pred)
    return accuracy, recall, f1
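
To make the three metrics concrete, here is a minimal plain-Python sketch that computes them from raw counts on a toy pair of label lists. It assumes binary labels with `1` as the positive class, which is how `recall_score` and `f1_score` behave with their default arguments. The `binary_metrics` function and the toy labels are illustrative only.

```python
def binary_metrics(actual, pred, positive=1):
    """Accuracy, recall and F1 for binary labels, computed from raw counts."""
    pairs = list(zip(actual, pred))
    tp = sum(a == positive and p == positive for a, p in pairs)  # true positives
    fp = sum(a != positive and p == positive for a, p in pairs)  # false positives
    fn = sum(a == positive and p != positive for a, p in pairs)  # false negatives
    accuracy = sum(a == p for a, p in pairs) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, recall, f1


actual = [0, 0, 1, 1, 1, 0]
pred = [0, 1, 1, 1, 0, 0]
accuracy, recall, f1 = binary_metrics(actual, pred)
print(accuracy, recall, f1)
```

Here four of six predictions are correct, and two of the three positive samples are recovered, so accuracy, recall and F1 all come out at 2/3.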

Download the banknote authentication dataset from the UCI repository.

Hint: this dataset does not contain headers, so set the column names manually (check the repository site for the names of the variables).

In [6]:
# Read the data_banknote_authentication.txt file from the URL
colnames = ["variance", "skewness", "curtosis", "entropy", "class"]
data_url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00267/data_banknote_authentication.txt"
)

try:
    data = pd.read_csv(data_url, sep=",", header=None, names=colnames)
except Exception as e:
    logger.exception(
        "Unable to download training & test CSV, check your internet connection. Error: %s", e
    )
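
The effect of `header=None` and `names` can be checked offline. The following sketch parses three rows copied from the dataset preview out of an in-memory string, once with pandas' default header handling and once with explicit column names. The variable names are illustrative.

```python
import io

import pandas as pd

colnames = ["variance", "skewness", "curtosis", "entropy", "class"]
# Three rows in the same header-less layout as the UCI file
text = (
    "3.62160,8.66610,-2.8073,-0.44699,0\n"
    "4.54590,8.16740,-2.4586,-1.46210,0\n"
    "3.86600,-2.63830,1.9242,0.10645,0\n"
)

# Default behaviour: the first data row is consumed as the header
wrong = pd.read_csv(io.StringIO(text))
# Explicit names: every row is kept as data
right = pd.read_csv(io.StringIO(text), header=None, names=colnames)

print(wrong.shape, right.shape)
print(list(right.columns))
```

Without `header=None`, the first sample silently becomes the column labels and the frame is one row short.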

Briefly explore the dataset

In [7]:
data

Unnamed: 0,variance,skewness,curtosis,entropy,class
0,3.62160,8.66610,-2.8073,-0.44699,0
1,4.54590,8.16740,-2.4586,-1.46210,0
2,3.86600,-2.63830,1.9242,0.10645,0
3,3.45660,9.52280,-4.0112,-3.59440,0
4,0.32924,-4.45520,4.5718,-0.98880,0
...,...,...,...,...,...
1367,0.40614,1.34920,-1.4501,-0.55949,1
1368,-1.38870,-4.87730,6.4774,0.34179,1
1369,-3.75030,-13.45860,17.5932,-2.77710,1
1370,-3.56370,-8.38270,12.3930,-1.28230,1


Split the dataset into train and test datasets. Note that the target variable is the column "class" (the class to predict), and the remaining columns are the features of the model.

In [8]:
# Split the data into training and test sets. (0.75, 0.25) split.
train, test = train_test_split(data)

train_x = train.drop(["class"], axis=1)
test_x = test.drop(["class"], axis=1)
train_y = train[["class"]]
test_y = test[["class"]]

print('Train shape:', train_x.shape)
print('Test shape:', test_x.shape)

Train shape: (1029, 4)
Test shape: (343, 4)
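
The call above relies on the defaults of `train_test_split`: 25% of the rows go to the test set, and shuffling draws from the global NumPy random state seeded earlier. As a hedged alternative, the sketch below spells these choices out on toy data and adds `stratify` so both classes appear in each part. The toy arrays are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, balanced binary target
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=40, stratify=y
)

print(X_train.shape, X_test.shape)
print(np.bincount(y_train), np.bincount(y_test))  # samples per class in each part
```

Passing `random_state` makes the split independent of whatever else consumed the global random state before the call.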


Define a function to train the model. In this example, the Random Forest model is configured through two hyperparameters, which can be modified to change the model's behavior:
- n_estimators
- max_depth

To track the hyperparameters introduced by the user, MLflow provides a ``log_param(name, value)`` function, which takes the name of the parameter and its corresponding value.

To evaluate the model, 3 metrics are defined:
- accuracy
- recall
- f1

To track the metrics after evaluating the model, MLflow provides a ``log_metric(name, value)`` function, which takes the name of the metric and its corresponding value.

Define a function to track the model manually

In [17]:
def train_gridsearch(parameters: dict):
    mlflow.set_experiment('randomForest')
        
    for n_estimators in parameters['n_estimators']:
        for max_depth in parameters['max_depth']:
            with mlflow.start_run():

                rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
                rf.fit(train_x, train_y)

                predicted_y = rf.predict(test_x)

                (accuracy, recall, f1) = eval_metrics(test_y, predicted_y)

                print("RandomForestClassifier model (n_estimators=%d, max_depth=%d):" % (n_estimators, max_depth))
                print("  accuracy: %s" % accuracy)
                print("  recall: %s" % recall)
                print("  f1: %s" % f1)

                mlflow.log_param("n_estimators", n_estimators)
                mlflow.log_param("max_depth", max_depth)
                mlflow.log_metric("accuracy", accuracy)
                mlflow.log_metric("recall", recall)
                mlflow.log_metric("f1", f1)

                mlflow.sklearn.log_model(rf, "model")

Create a parameters dictionary with hyperparameter values

In [18]:
parameters = {"n_estimators": (100, 200, 300),
              "max_depth": (2, 6, 10)}

Execute the training function with manual tracking

In [19]:
train_gridsearch(parameters)

INFO: 'randomForest' does not exist. Creating a new experiment
RandomForestClassifier model (n_estimators=100, max_depth=2):
  accuracy: 0.9387755102040817
  recall: 0.9387755102040817
  f1: 0.938739021444444
RandomForestClassifier model (n_estimators=100, max_depth=6):
  accuracy: 0.9737609329446064
  recall: 0.9737609329446064
  f1: 0.9737587007633837
RandomForestClassifier model (n_estimators=100, max_depth=10):
  accuracy: 0.9737609329446064
  recall: 0.9737609329446064
  f1: 0.9737587007633837
RandomForestClassifier model (n_estimators=200, max_depth=2):
  accuracy: 0.9358600583090378
  recall: 0.9358600583090378
  f1: 0.9358108980457853
RandomForestClassifier model (n_estimators=200, max_depth=6):
  accuracy: 0.9737609329446064
  recall: 0.9737609329446064
  f1: 0.9737587007633837
RandomForestClassifier model (n_estimators=200, max_depth=10):
  accuracy: 0.9766763848396501
  recall: 0.9766763848396501
  f1: 0.9766763848396501
RandomForestClassi
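
The nested loops in `train_gridsearch` start one run per hyperparameter combination, so this grid yields 3 × 3 = 9 runs. A compact way to enumerate the same combinations, sketched here with the standard library only, is `itertools.product`; the `names` and `combinations` variables are illustrative.

```python
from itertools import product

parameters = {"n_estimators": (100, 200, 300), "max_depth": (2, 6, 10)}

# Cartesian product of the value tuples, in the same order as the nested loops
names = list(parameters)
combinations = [dict(zip(names, values)) for values in product(*parameters.values())]

print(len(combinations))
print(combinations[0], combinations[-1])
```

This form keeps working unchanged if a third hyperparameter is added to the dictionary, whereas the nested loops would need another level.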

This works, but it makes more sense to use the `GridSearchCV` class provided by scikit-learn.

In [20]:
def train_gridsearch_autolog(parameters: dict):
    mlflow.set_experiment('randomForest_gridsearch')
    mlflow.sklearn.autolog()
    
    rf = RandomForestClassifier()
    rf_gridsearch = GridSearchCV(rf, parameters)
        
    with mlflow.start_run() as run:
        rf_gridsearch.fit(train_x, train_y)

In [21]:
train_gridsearch_autolog(parameters)

INFO: 'randomForest_gridsearch' does not exist. Creating a new experiment
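
Besides what ends up in the tracking server, the fitted `GridSearchCV` object itself exposes the search results. The following sketch, which uses synthetic data and a deliberately small grid as stand-ins, shows how to read the best combination and the number of evaluated candidates. Unlike the manual loop, `GridSearchCV` scores each candidate by cross-validation on the training data rather than on the held-out test set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the banknote data: 4 features, binary target
X, y = make_classification(n_samples=200, n_features=4, random_state=42)

param_grid = {"n_estimators": [10, 20], "max_depth": [2, 4]}  # small grid to keep it fast
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring="f1")
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)                 # mean cross-validated F1 of the best candidate
print(len(search.cv_results_["params"]))  # one entry per grid combination
```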


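The hyperparameter search can also be delegated to hyperopt-sklearn: `HyperoptEstimator` samples Random Forest configurations and evaluates up to `max_evals` of them.

In [10]: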
def train_hyperop_autolog():
    mlflow.set_experiment('randomForest_hyperop')
    mlflow.sklearn.autolog()
    
    estim = HyperoptEstimator(
        classifier=random_forest('rf'),
        max_evals=10,
    )
        
    with mlflow.start_run() as run:
        estim.fit(train_x, train_y)

In [11]:
train_hyperop_autolog()

INFO: 'randomForest_hyperop' does not exist. Creating a new experiment
  0%|          | 0/1 [00:00<?, ?trial/s, best loss=?]
100%|██████████| 1/1 [00:01<00:00,  1.32s/trial, best loss: 0.009708737864077666]
 50%|█████     | 1/2 [00:00<?, ?trial/s, best loss=?]
100%|██████████| 2/2 [00:00<00:00,  1.69trial/s, best loss: 0.009708737864077666]
 67%|██████▋   | 2/3 [00:00<?, ?trial/s, best loss=?]
100%|██████████| 3/3 [00:02<00:00,  2.79s/trial, best loss: 0.009708737864077666]
 75%|███████▌  | 3/4 [00:00<?, ?trial/s, best loss=?]
100%|██████████| 4/4 [00:05<00:00,  5.19s/trial, best loss: 0.009708737864077666]
 80%|████████  | 4/5 [00:00<?, ?trial/s, best loss=?]
100%|██████████| 5/5 [00:05<00:00,  5.14s/trial, best loss: 0.009708737864077666]
 83%|████████▎ | 5/6 [00:00<?, ?trial/s, best loss=?]
100%|██████████| 6/6 [00:00<00:00,  1.49trial/s, best loss: 0.009708737864077666]
 86%|████████▌ | 6/7 [00:00<?, ?trial/s, best loss=?]
100%|██████████| 7/7 [00:00<00:00,  1.66trial/s, best loss: 0.009708737864077666]
 88%|████████▊ | 7/8 [00:00<?, ?trial/s, best loss=?]
100%|██████████| 8/8 [00:00<00:00,  1.54trial/s, best loss: 0.009708737864077666]
 89%|████████▉ | 8/9 [00:00<?, ?trial/s, best loss=?]
100%|██████████| 9/9 [00:00<00:00,  1.63trial/s, best loss: 0.009708737864077666]
 90%|█████████ | 9/10 [00:00<?, ?trial/s, best loss=?]
100%|██████████| 10/10 [00:01<00:00,  1.77s/trial, best loss: 0.009708737864077666]

Go to the MLflow UI (http://localhost:80) to see the results of all the runs within each experiment and their corresponding models.