# Model Tracking example with MLflow

This notebook contains a toy example showing how MLflow tracking works. It trains a linear regression model, specifically the ElasticNet algorithm from the scikit-learn library.

The model is trained using the popular [winequality-red dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/) from the UCI repository.

The first step is to import all the required libraries

In [None]:
# The data set used in this example is from http://archive.ics.uci.edu/ml/datasets/Wine+Quality
# P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
# Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

import os
import warnings
import sys

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn
import boto3

import logging

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)

warnings.filterwarnings("ignore")
np.random.seed(40)

Set the tracking URI so the MLflow client knows where the tracking server is running. In this case, the server is running locally. If the tracking server is running on a remote host, use that host's address (IP or hostname) instead.

In [None]:
mlflow.set_tracking_uri("http://localhost:80")
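
For a remote server, one option is to keep the address out of the notebook and read it from the `MLFLOW_TRACKING_URI` environment variable, which the MLflow client also honors. The following is a minimal sketch under that assumption; the helper name `resolve_tracking_uri` and its fallback value are hypothetical and only mirror the local address used above.

```python
import os
from urllib.parse import urlparse


def resolve_tracking_uri(default="http://localhost:80"):
    """Return the tracking URI from the environment, or a local default.

    `resolve_tracking_uri` is a hypothetical helper, not part of MLflow.
    """
    uri = os.environ.get("MLFLOW_TRACKING_URI", default)
    parsed = urlparse(uri)
    # An HTTP(S) tracking server must be addressed with a host part
    if parsed.scheme in ("http", "https") and not parsed.netloc:
        raise ValueError(f"Tracking URI has no host: {uri!r}")
    return uri


tracking_uri = resolve_tracking_uri()
print(tracking_uri)
# The result can then be passed to mlflow.set_tracking_uri(tracking_uri)
```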

Verify that the MLflow client is pointing to the correct endpoints

In [None]:
print(mlflow.get_tracking_uri())

Define a function to evaluate the model

In [None]:
def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2
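
To make the three metrics concrete, here is a small pure-Python sketch that computes the same quantities by hand. The function name `eval_metrics_plain` and the sample scores are made up for illustration; the notebook itself relies on the scikit-learn functions above.

```python
import math


def eval_metrics_plain(actual, pred):
    """Pure-Python RMSE, MAE and R2 for two equally long sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, pred)]
    # Sum of squared residuals, shared by RMSE and R2
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    # Total sum of squares around the mean of the true values
    # (this sketch assumes the true values are not all identical)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


# Made-up quality scores, for illustration only
rmse, mae, r2 = eval_metrics_plain([3, 5, 6, 7], [4, 5, 5, 7])
print(rmse, mae, r2)
```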

Download the dataset from the UCI repository

In [None]:
# Read the wine-quality csv file from the URL
csv_url = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
)
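try:
    data = pd.read_csv(csv_url, sep=";")
except Exception:
    # logger.exception already records the traceback of the active error
    logger.exception(
        "Unable to download training & test CSV, check your internet connection."
    )
    # Re-raise so that later cells do not fail with a NameError on `data`
    raise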

Briefly explore the dataset

In [None]:
data

Split the dataset into train and test sets. Note that the target variable is the column "quality" (the value to predict), and the remaining columns are the features of the model.

In [None]:
# Split the data into training and test sets. (0.75, 0.25) split.
train, test = train_test_split(data)

# The predicted column is "quality" which is a scalar from [3, 9]
train_x = train.drop(["quality"], axis=1)
test_x = test.drop(["quality"], axis=1)
train_y = train[["quality"]]
test_y = test[["quality"]]

print('Train shape:', train_x.shape)
print('Test shape:', test_x.shape)
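
The split above passes no `test_size`, so scikit-learn falls back to its default of 0.25, rounding the test row count up and giving the remaining rows to the training set. The sketch below reproduces only that arithmetic; `default_split_sizes` is a hypothetical helper, and the row count of 1599 is an assumption about `winequality-red.csv` that the printed shapes should confirm.

```python
import math


def default_split_sizes(n_samples, test_size=0.25):
    """Row counts for a train/test split: test rounded up, train gets the rest."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 1599 is assumed here to be the number of rows in winequality-red.csv
n_train, n_test = default_split_sizes(1599)
print(n_train, n_test)
```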

Define a function to train the model. In this example, the ElasticNet model takes two hyperparameters as input, which can be modified to change the model's behavior:
- alpha
- l1_ratio
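
For reference, the objective that scikit-learn documents for ElasticNet can be written as follows, where $w$ is the coefficient vector, $n$ is the number of training samples, $\alpha$ stands for `alpha` and $\rho$ stands for `l1_ratio` (the symbols are chosen here for illustration):

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

So `alpha` scales the overall strength of the penalty, while `l1_ratio` mixes the L1 and L2 terms: `l1_ratio=1` gives a pure L1 (Lasso) penalty and `l1_ratio=0` a pure L2 penalty.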

To track the hyperparameters chosen by the user, MLflow provides a ``log_param(key, value)`` function, which takes the name of the parameter and its value.

To evaluate the model, 3 metrics are defined:
- rmse
- mae
- r2

To track the metrics after evaluating the model, MLflow provides a ``log_metric(key, value)`` function, which takes the name of the metric and its value.

In [None]:
def train(alpha=0.5, l1_ratio=0.5):
    mlflow.set_experiment('winequality_elasticnet')
    with mlflow.start_run():
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)

        predicted_qualities = lr.predict(test_x)

        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme
        print(mlflow.get_artifact_uri())
        
        # Model registry does not work with file store
        if tracking_url_type_store != "file":

            # Register the model
            # There are other ways to use the Model Registry, which depends on the use case,
            # please refer to the doc for more information:
            # https://mlflow.org/docs/latest/model-registry.html#api-workflow
            mlflow.sklearn.log_model(lr, "model", registered_model_name="ElasticnetWineModel")
        else:
            mlflow.sklearn.log_model(lr, "model")
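
The registry check in `train` depends only on the scheme of the tracking URI. The sketch below shows what `urlparse` returns for a few URIs; `uses_file_store` is a hypothetical helper, and apart from `http://localhost:80` the URIs are examples, not values used in this notebook.

```python
from urllib.parse import urlparse


def uses_file_store(tracking_uri):
    """Return True when the tracking URI points at a local file store."""
    return urlparse(tracking_uri).scheme == "file"


# Only the first URI appears in this notebook; the others are illustrative
for uri in ("http://localhost:80", "file:///tmp/mlruns", "sqlite:///mlflow.db"):
    action = "log only" if uses_file_store(uri) else "log and register"
    print(f"{uri}: {action}")
```

With the tracking server configured earlier, the scheme is `http`, so the model is both logged and registered under the name `ElasticnetWineModel`.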

Train the model with different hyperparameter values

In [None]:
train(alpha=0.2, l1_ratio=0.2)

In [None]:
train(alpha=0.1, l1_ratio=0.7)

In [None]:
train(alpha=0.33, l1_ratio=0.77)

In [None]:
train(alpha=0.5, l1_ratio=0.5)

In [None]:
train(alpha=0.6, l1_ratio=0.8)

In [None]:
train(alpha=0.5, l1_ratio=0.6)

Go to the MLflow UI (http://localhost:80) to see the results of all the runs within the experiment and their corresponding models.

Instead of manually tracking every parameter and metric, we can enable autologging and let MLflow do the job for us.

In [None]:
def train_autolog(alpha=0.5, l1_ratio=0.5):
    mlflow.sklearn.autolog()
    mlflow.set_experiment('winequality_elasticnet_autolog')
    with mlflow.start_run():
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)

        predicted_qualities = lr.predict(test_x)

        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
        
        print(mlflow.get_artifact_uri())

        print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

In [None]:
train_autolog(alpha=0.2, l1_ratio=0.2)

In [None]:
train_autolog(alpha=0.1, l1_ratio=0.7)

In [None]:
train_autolog(alpha=0.33, l1_ratio=0.77)

In [None]:
train_autolog(alpha=0.5, l1_ratio=0.5)

In [None]:
train_autolog(alpha=0.6, l1_ratio=0.8)

In [None]:
train_autolog(alpha=0.5, l1_ratio=0.6)