# Model Tracking UCI exercise

This notebook contains a solution to the exercise: train a Support Vector Machine (SVM) on a UCI dataset, using the MLflow tracking server to log all the parameters, metrics and models.

The model is trained on the popular [Iris](https://archive.ics.uci.edu/ml/machine-learning-databases/iris/) dataset, which can be downloaded from the UCI repository.

The first step is to import all the required libraries.

In [34]:
# The data set used in this example is the Iris data set from the UCI repository:
# https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data

import os
import warnings
import sys

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn
import boto3
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import accuracy_score, recall_score, f1_score, average_precision_score

import logging

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)

warnings.filterwarnings("ignore")
np.random.seed(40)

Set the tracking URI so that the MLflow client knows where the tracking server is running. In this case, the server is running locally. If the tracking server is running on a remote host, use that host's IP address in the URI instead of `localhost`.

In [22]:
mlflow.set_tracking_uri("http://localhost:80")

Verify that the MLflow client is pointing to the correct endpoint.

In [23]:
print(mlflow.get_tracking_uri())

http://localhost:80
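
For the remote-host case mentioned above, the URI can be assembled and checked before it is handed to the client. This is only a sketch: the helper `build_tracking_uri` and the address `192.0.2.10` are hypothetical placeholders, not part of the exercise. The `MLFLOW_TRACKING_URI` environment variable is read by the MLflow client when no URI has been set in code.

```python
import os
from urllib.parse import urlparse


def build_tracking_uri(host: str, port: int = 80, scheme: str = "http") -> str:
    """Assemble a tracking URI and fail early if it does not parse cleanly."""
    uri = f"{scheme}://{host}:{port}"
    parsed = urlparse(uri)
    if not parsed.hostname or parsed.port is None:
        raise ValueError(f"Malformed tracking URI: {uri}")
    return uri


# Hypothetical remote address; replace it with the IP of the real server.
tracking_uri = build_tracking_uri("192.0.2.10", 80)

# Option 1: export it so that MLflow clients started from this process pick it up.
os.environ["MLFLOW_TRACKING_URI"] = tracking_uri

# Option 2 (as in the cell above): pass it explicitly.
# mlflow.set_tracking_uri(tracking_uri)
print(tracking_uri)
```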


Define a function to evaluate the model

In [48]:
def eval_metrics(actual, pred):
    accuracy = accuracy_score(actual, pred)
    recall = recall_score(actual, pred, average='weighted')
    f1 = f1_score(actual, pred, average='weighted')
    return accuracy, recall, f1
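
A side note on `average='weighted'`: each class's recall is weighted by its share of the true labels, so the weighted recall works out to the same number as accuracy. That is why the two values coincide in the run output later in this notebook. The sketch below checks this by hand in plain Python, on made-up labels rather than on the Iris predictions.

```python
from collections import Counter


def accuracy(actual, pred):
    """Fraction of predictions that match the true label."""
    return sum(a == p for a, p in zip(actual, pred)) / len(actual)


def weighted_recall(actual, pred):
    """Per-class recall, weighted by each class's share of the true labels."""
    support = Counter(actual)                                   # true samples per class
    hits = Counter(a for a, p in zip(actual, pred) if a == p)   # correct predictions per class
    total = len(actual)
    return sum((support[c] / total) * (hits[c] / support[c]) for c in support)


# Toy labels, invented for illustration only.
actual = [1, 1, 1, 2, 2, 3, 3, 3, 3, 3]
pred = [1, 1, 2, 2, 2, 3, 3, 3, 1, 3]

acc = accuracy(actual, pred)
rec = weighted_recall(actual, pred)
print(acc, rec)
```

The weighted F1 score does not collapse in the same way, so it is the more informative of the two extra metrics here.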

Download the Iris dataset from the UCI repository.

Hint: this dataset does not contain headers, so set the column names manually.

In [15]:
# Read the iris .data file from the URL
colnames=["sepal_length_in_cm", "sepal_width_in_cm","petal_length_in_cm","petal_width_in_cm", "class"]
data_url = (
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
)

try:
    data = pd.read_csv(data_url, sep=",", header=None, names=colnames)
except Exception as e:
    logger.exception(
        "Unable to download the Iris data file, check your internet connection. Error: %s", e
    )
    raise  # without the data, none of the following cells can run

Briefly explore the dataset.

In [16]:
data

Unnamed: 0,sepal_length_in_cm,sepal_width_in_cm,petal_length_in_cm,petal_width_in_cm,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,Iris-virginica
146,6.3,2.5,5.0,1.9,Iris-virginica
147,6.5,3.0,5.2,2.0,Iris-virginica
148,6.2,3.4,5.4,2.3,Iris-virginica


Note that the class column contains string labels, and we want a numerical class.

First, check how many unique values the class column has.

In [17]:
data['class'].unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

Replace the string classes with integers and explore the data again.

Hint: check the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) for further details.

In [18]:
data = data.replace({"class":{"Iris-setosa":1,"Iris-versicolor":2,"Iris-virginica":3}})

data

Unnamed: 0,sepal_length_in_cm,sepal_width_in_cm,petal_length_in_cm,petal_width_in_cm,class
0,5.1,3.5,1.4,0.2,1
1,4.9,3.0,1.4,0.2,1
2,4.7,3.2,1.3,0.2,1
3,4.6,3.1,1.5,0.2,1
4,5.0,3.6,1.4,0.2,1
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,3
146,6.3,2.5,5.0,1.9,3
147,6.5,3.0,5.2,2.0,3
148,6.2,3.4,5.4,2.3,3
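
The class encoding above uses `DataFrame.replace` with a nested dictionary. A possible alternative, sketched here on a three-row toy frame rather than on the real data, is `Series.map`. The practical difference is that `map` turns any label missing from the dictionary into `NaN`, whereas `replace` leaves it untouched, so an explicit check is worthwhile. The variable names are illustrative.

```python
import pandas as pd

class_to_int = {"Iris-setosa": 1, "Iris-versicolor": 2, "Iris-virginica": 3}

# Toy frame for illustration; the notebook applies the mapping to `data`.
toy = pd.DataFrame(
    {
        "petal_width_in_cm": [0.2, 1.3, 2.3],
        "class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
    }
)

toy["class"] = toy["class"].map(class_to_int)

# Labels that are not in the dictionary would show up as NaN here.
assert toy["class"].notna().all(), "found a class label without an integer code"

encoded = toy["class"].tolist()
print(encoded)
```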


Split the dataset into train and test datasets. Note that the target variable is the column "class" (the class to predict), and the remaining columns are the features of the model.

In [20]:
# Split the data into training and test sets. (0.75, 0.25) split.
train, test = train_test_split(data)

train_x = train.drop(["class"], axis=1)
test_x = test.drop(["class"], axis=1)
train_y = train[["class"]]
test_y = test[["class"]]

print('Train shape:', train_x.shape)
print('Test shape:', test_x.shape)

Train shape: (112, 4)
Test shape: (38, 4)
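
The split above relies on the defaults of `train_test_split` and on the global NumPy seed. A more explicit variant, shown here as a sketch on placeholder arrays with the same shape as Iris (150 rows, 4 features, 50 samples per class), fixes `random_state` and adds `stratify` so that the three classes stay balanced in the test set. Both options are an addition for illustration, not what the cell above does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the shape of Iris; the values are not real measurements.
X = np.arange(150 * 4).reshape(150, 4)
y = np.repeat([1, 2, 3], 50)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,    # the default used implicitly in the notebook
    stratify=y,        # keep class proportions similar in both parts
    random_state=40,   # reproducible without relying on np.random.seed
)

test_counts = np.bincount(y_test)[1:]  # samples per class (1, 2, 3) in the test set
print(X_train.shape, X_test.shape)
print(test_counts)
```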


Define a function to train the model. In this example, two of the SVM's hyperparameters are varied to change the model's behavior:
- kernel
- C

To track the hyperparameters set by the user, MLflow provides a ``log_param(name, value)`` function, which takes the name of the parameter and its value.

To evaluate the model, 3 metrics are defined:
- accuracy
- recall
- f1

To track the metrics after evaluating the model, MLflow provides a ``log_metric(name, value)`` function, which takes the name of the metric and its value.

Define a function to track the model manually

In [52]:
def train_gridsearch(parameters: dict):
    mlflow.set_experiment('iris_gridsearch')
        
    for kernel in parameters['kernel']:
        for C in parameters['C']:
            with mlflow.start_run():

                svm = SVC(C=C, kernel=kernel, random_state=42)
                svm.fit(train_x, train_y)

                predicted_y = svm.predict(test_x)

                (accuracy, recall, f1) = eval_metrics(test_y, predicted_y)

                print("SVM model (Kernel=%s, C=%f):" % (kernel, C))
                print("  accuracy: %s" % accuracy)
                print("  recall: %s" % recall)
                print("  f1: %s" % f1)

                mlflow.log_param("kernel", kernel)
                mlflow.log_param("C", C)
                mlflow.log_metric("accuracy", accuracy)
                mlflow.log_metric("recall", recall)
                mlflow.log_metric("f1", f1)

                mlflow.sklearn.log_model(svm, "model")

Create a parameters dictionary with hyperparameter values

In [27]:
parameters = {"kernel": ('linear', 'poly', 'rbf'),
              "C": (0.2, 0.4, 0.6, 0.8, 1.0)}

Execute the training function with manual tracking

In [53]:
train_gridsearch(parameters)

SVM model (Kernel=linear, C=0.200000):
  accuracy: 0.9736842105263158
  recall: 0.9736842105263158
  f1: 0.9738894018672412
SVM model (Kernel=linear, C=0.400000):
  accuracy: 1.0
  recall: 1.0
  f1: 1.0
SVM model (Kernel=linear, C=0.600000):
  accuracy: 0.9736842105263158
  recall: 0.9736842105263158
  f1: 0.9738894018672412
SVM model (Kernel=linear, C=0.800000):
  accuracy: 0.9736842105263158
  recall: 0.9736842105263158
  f1: 0.9738894018672412
SVM model (Kernel=linear, C=1.000000):
  accuracy: 0.9736842105263158
  recall: 0.9736842105263158
  f1: 0.9733639372264333
SVM model (Kernel=poly, C=0.200000):
  accuracy: 0.9473684210526315
  recall: 0.9473684210526315
  f1: 0.9479757085020243
SVM model (Kernel=poly, C=0.400000):
  accuracy: 1.0
  recall: 1.0
  f1: 1.0
SVM model (Kernel=poly, C=0.600000):
  accuracy: 0.9736842105263158
  recall: 0.9736842105263158
  f1: 0.9733639372264333
SVM model (Kernel=poly, C=0.800000):
  accuracy: 0.9736842105263158
  recall: 0.9736842105263158
  f1: 0

This works, but it makes more sense to use the power of `GridSearchCV`, provided by scikit-learn.

In [54]:
def train_gridsearch_autolog(parameters: dict):
    mlflow.set_experiment('iris_gridsearch_autolog')
    mlflow.sklearn.autolog()
    
    svm = SVC()
    svm_gridsearch = GridSearchCV(svm, parameters)
        
    with mlflow.start_run() as run:
        svm_gridsearch.fit(train_x, train_y)

In [55]:
train_gridsearch_autolog(parameters)

INFO: 'iris_gridsearch_autolog' does not exist. Creating a new experiment
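
Unlike the manual loop, `GridSearchCV` scores every parameter combination with cross-validation and then refits the best one on all the data it was given. The sketch below leaves MLflow out and only shows how the winning combination can be read from the fitted object. It assumes the copy of Iris bundled with scikit-learn (`load_iris`, whose labels are 0, 1, 2 rather than 1, 2, 3) and an explicit `cv=5`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Assumption: the bundled Iris copy replaces the downloaded file.
X, y = load_iris(return_X_y=True)

parameters = {
    "kernel": ["linear", "poly", "rbf"],
    "C": [0.2, 0.4, 0.6, 0.8, 1.0],
}

search = GridSearchCV(SVC(), parameters, cv=5)
search.fit(X, y)

n_candidates = len(search.cv_results_["params"])  # 3 kernels x 5 values of C
print(n_candidates)
print(search.best_params_)  # combination with the best mean cross-validation score
print(search.best_score_)   # that mean score (accuracy by default for classifiers)
```

`search.best_estimator_` is the refitted model, and it can be evaluated on a held-out test set in the same way as the models from the manual loop.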




Go to the MLflow UI (http://localhost:80) to see the results of all the runs within each experiment and their corresponding models.