# Model tracking example with MLflow using Auto-logging

This notebook is a toy example showing how MLflow tracking works. It trains a regularized linear regression model, specifically the ElasticNet estimator from the scikit-learn library.

The model is trained using the popular [winequality-red dataset](http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/) from the UCI repository.

The first step is to import all the required libraries.

In [1]:
# The data set used in this example is from http://archive.ics.uci.edu/ml/datasets/Wine+Quality
# P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
# Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

import os
import warnings
import sys

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn
import boto3

import logging

logging.basicConfig(level=logging.WARN)
logger = logging.getLogger(__name__)

warnings.filterwarnings("ignore")
np.random.seed(40)

Set the tracking URI so the MLflow client knows where the tracking server is running. In this case, the server is running locally. If the tracking server is running on a remote host, use that host's address (IP or hostname) and port instead.

In [2]:
mlflow.set_tracking_uri("http://localhost:80")

Verify that the MLflow client is pointing to the correct endpoint.

In [3]:
print(mlflow.get_tracking_uri())

http://localhost:80
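
Hard-coding the URI is fine for a local setup. A common alternative is to read it from the `MLFLOW_TRACKING_URI` environment variable, which MLflow itself honors. The sketch below illustrates that idea under a few assumptions: `resolve_tracking_uri` is a hypothetical helper (not part of MLflow), it falls back to the local address used in this notebook, and it only accepts HTTP(S) URLs because that is what this setup uses.

```python
import os
from urllib.parse import urlparse

DEFAULT_URI = "http://localhost:80"  # local tracking server used in this notebook


def resolve_tracking_uri(default=DEFAULT_URI):
    """Hypothetical helper: take the URI from the environment, else use the default."""
    uri = os.environ.get("MLFLOW_TRACKING_URI", default)
    parsed = urlparse(uri)
    # Assumption for this notebook: the server is reached over HTTP(S).
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        raise ValueError(f"Not a valid HTTP(S) tracking URI: {uri!r}")
    return uri


# Inspect the pieces of the default URI.
parsed_default = urlparse(DEFAULT_URI)
print(parsed_default.scheme, parsed_default.hostname, parsed_default.port)
# prints: http localhost 80
```

The result could then be passed to `mlflow.set_tracking_uri(resolve_tracking_uri())` in place of the literal string.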


Define a function to evaluate the model

In [4]:
def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2
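
To make the three metrics concrete, here is a small hand computation in plain Python. It does not use scikit-learn; the four `actual` and `pred` values are made up for illustration and are not taken from the wine dataset.

```python
import math

# Made-up quality scores, for illustration only.
actual = [5, 6, 5, 7]
pred = [5, 5, 6, 7]
n = len(actual)

residuals = [a - p for a, p in zip(actual, pred)]

# RMSE: square root of the mean squared residual.
rmse = math.sqrt(sum(r ** 2 for r in residuals) / n)

# MAE: mean of the absolute residuals.
mae = sum(abs(r) for r in residuals) / n

# R2: 1 - (residual sum of squares / total sum of squares around the mean).
mean_actual = sum(actual) / n
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(round(rmse, 4), round(mae, 4), round(r2, 4))
# prints: 0.7071 0.5 0.2727
```

Lower RMSE and MAE are better, while R2 is better the closer it gets to 1.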

Download the dataset from the UCI repository

In [5]:
# Read the wine-quality csv file from the URL
csv_url = (
    "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
)
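try:
    data = pd.read_csv(csv_url, sep=";")
except Exception:
    # logger.exception records the traceback itself, so the error is not passed again.
    logger.exception(
        "Unable to download training & test CSV, check your internet connection."
    )
    raise  # stop here: the following cells need `data` to exist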
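
The `sep=";"` argument matters because the file is semicolon-delimited rather than comma-delimited. The sketch below shows the effect with the standard-library `csv` module on an abbreviated, three-column sample; the sample string is an assumption built from the first two rows shown in this notebook, not the full file.

```python
import csv
import io

# Abbreviated sample: only 3 of the 12 columns, values from the first two rows.
sample = (
    "fixed acidity;volatile acidity;quality\n"
    "7.4;0.7;5\n"
    "7.8;0.88;5\n"
)

# Correct delimiter: each line splits into three fields.
rows_semicolon = list(csv.reader(io.StringIO(sample), delimiter=";"))

# Default delimiter (","): each line stays one unsplit field.
rows_comma = list(csv.reader(io.StringIO(sample)))

print(rows_semicolon[1])  # prints: ['7.4', '0.7', '5']
print(rows_comma[1])      # prints: ['7.4;0.7;5']
```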

Briefly explore the dataset.

In [6]:
data

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
1,7.8,0.880,0.00,2.6,0.098,25.0,67.0,0.99680,3.20,0.68,9.8,5
2,7.8,0.760,0.04,2.3,0.092,15.0,54.0,0.99700,3.26,0.65,9.8,5
3,11.2,0.280,0.56,1.9,0.075,17.0,60.0,0.99800,3.16,0.58,9.8,6
4,7.4,0.700,0.00,1.9,0.076,11.0,34.0,0.99780,3.51,0.56,9.4,5
...,...,...,...,...,...,...,...,...,...,...,...,...
1594,6.2,0.600,0.08,2.0,0.090,32.0,44.0,0.99490,3.45,0.58,10.5,5
1595,5.9,0.550,0.10,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.510,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5


Split the dataset into training and test sets. Note that the target variable is the "quality" column (the value to predict), and the remaining columns are the features of the model.

In [7]:
# Split the data into training and test sets. (0.75, 0.25) split.
# Named train_df/test_df so they do not clash with the train() function defined later.
train_df, test_df = train_test_split(data)

# The predicted column is "quality" which is a scalar from [3, 9]
train_x = train_df.drop(["quality"], axis=1)
test_x = test_df.drop(["quality"], axis=1)
train_y = train_df[["quality"]]
test_y = test_df[["quality"]]

print('Train shape:', train_x.shape)
print('Test shape:', test_x.shape)

Train shape: (1199, 11)
Test shape: (400, 11)
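
The two row counts printed above (1199 and 400) follow from scikit-learn's default `test_size` of 0.25 when no size is given. This short sketch reproduces the arithmetic without calling scikit-learn; the total of 1599 rows is inferred from the two shapes.

```python
import math

n_samples = 1199 + 400   # total rows, inferred from the shapes printed above
test_size = 0.25         # default used by train_test_split when no size is given

# The test count is rounded up; the training set gets the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)
# prints: 1199 400
```

Which rows land in each set depends on the random state. This notebook relies on the global `np.random.seed(40)` call; passing `random_state` to `train_test_split` would pin the split independently of the global seed.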


In this example, autologging is enabled, which means that MLflow automatically logs the hyperparameters and metrics, as well as the trained model.

In [8]:
def train(alpha=0.5, l1_ratio=0.5):
    mlflow.sklearn.autolog()
    mlflow.set_experiment('winequality_elasticnet_autolog')
    with mlflow.start_run():
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)

        predicted_qualities = lr.predict(test_x)

        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)
        
        print(mlflow.get_artifact_uri())

        print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)
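
For reference, the two hyperparameters passed to `train` control the penalty in the objective that scikit-learn's `ElasticNet` minimizes. Writing $n$ for the number of training samples, $w$ for the coefficient vector, $\alpha$ for `alpha` and $\rho$ for `l1_ratio`:

$$
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 \;+\; \alpha \rho \lVert w \rVert_1 \;+\; \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
$$

A larger $\alpha$ means stronger regularization overall. With $\rho = 1$ the penalty is purely L1 (lasso-like), and with $\rho = 0$ it is purely L2 (ridge-like).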

Train the model with different hyperparameter values.

In [9]:
train(alpha=0.2, l1_ratio=0.2)

INFO: 'winequality_elasticnet_autolog' does not exist. Creating a new experiment
s3://mlflow-bucket/mlflow/1/d4c7e5ac674643e8bd37d69353109ace/artifacts
Elasticnet model (alpha=0.200000, l1_ratio=0.200000):
  RMSE: 0.7336400911821402
  MAE: 0.5643841279275428
  R2: 0.23739466063584158


In [10]:
train(alpha=0.1, l1_ratio=0.7)

s3://mlflow-bucket/mlflow/1/e458c99ffadb4cb9a07b9274c2be7422/artifacts
Elasticnet model (alpha=0.100000, l1_ratio=0.700000):
  RMSE: 0.7327938109945942
  MAE: 0.5640101718105491
  R2: 0.23915303116151632


In [11]:
train(alpha=0.33, l1_ratio=0.77)

s3://mlflow-bucket/mlflow/1/dc299a55a6c74c9085fcdd9ca122b9d1/artifacts
Elasticnet model (alpha=0.330000, l1_ratio=0.770000):
  RMSE: 0.7893929560420991
  MAE: 0.6235776000236435
  R2: 0.11708230094735095


In [12]:
train(alpha=0.5, l1_ratio=0.5)

s3://mlflow-bucket/mlflow/1/1f5ab48a5cea4118ba8538268c5eb283/artifacts
Elasticnet model (alpha=0.500000, l1_ratio=0.500000):
  RMSE: 0.793164022927685
  MAE: 0.6271946374319586
  R2: 0.10862644997792636


In [13]:
train(alpha=0.6, l1_ratio=0.8)

s3://mlflow-bucket/mlflow/1/c823e563d0c84de78b140f5fafc676cf/artifacts
Elasticnet model (alpha=0.600000, l1_ratio=0.800000):
  RMSE: 0.8326325509502465
  MAE: 0.6676500690618903
  R2: 0.01770824285088779


In [14]:
train(alpha=0.5, l1_ratio=0.6)

s3://mlflow-bucket/mlflow/1/49e52e988cd1436eb48d99104ae0d407/artifacts
Elasticnet model (alpha=0.500000, l1_ratio=0.600000):
  RMSE: 0.8097394716266471
  MAE: 0.6442565454817458
  R2: 0.07098152823463377


Go to the MLflow UI (http://localhost:80) to see the results of all the runs in the experiment and their corresponding models.
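
The UI lets you sort runs by metric. As a quick cross-check, the same comparison can be sketched in plain Python using the RMSE values printed by the six runs above. The `runs` list is copied by hand from those outputs and is not fetched from the tracking server.

```python
# RMSE values copied from the outputs of the six train(...) calls above.
runs = [
    {"alpha": 0.2, "l1_ratio": 0.2, "rmse": 0.7336400911821402},
    {"alpha": 0.1, "l1_ratio": 0.7, "rmse": 0.7327938109945942},
    {"alpha": 0.33, "l1_ratio": 0.77, "rmse": 0.7893929560420991},
    {"alpha": 0.5, "l1_ratio": 0.5, "rmse": 0.793164022927685},
    {"alpha": 0.6, "l1_ratio": 0.8, "rmse": 0.8326325509502465},
    {"alpha": 0.5, "l1_ratio": 0.6, "rmse": 0.8097394716266471},
]

# The best run is the one with the lowest test RMSE.
best = min(runs, key=lambda run: run["rmse"])
print(best["alpha"], best["l1_ratio"])
# prints: 0.1 0.7
```

With a reachable tracking server, `mlflow.search_runs` offers a programmatic way to retrieve the logged runs instead of copying values by hand.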