# Example test notebook

This notebook walks through a typical workflow a data scientist would follow to train a model with hyperparameter tuning. Instead of comparing metrics by hand and storing artifacts one by one, which quickly becomes chaotic, we use our MLflow service to keep an ordered list of experiments and to compare the resulting models visually through their logged metrics. The trained models are also easy to download from the UI.

### Import the required libraries, already installed in the Jupyter container

In [19]:
import os
import pandas as pd
import numpy as np
import mlflow
from mlflow.exceptions import MlflowException

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet

### Configure the environment so that MLflow can communicate with our database and MinIO

In [20]:
os.environ["MLFLOW_TRACKING_URI"] = "postgresql+psycopg2://postgres:postgres@postgres-db/mlflow"
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "user_minio"
os.environ["AWS_SECRET_ACCESS_KEY"] = "p4ssW0rD"
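The cell above hard-codes the connection settings. As a minimal sketch of an alternative, assuming you would rather let values injected from outside the notebook take precedence, `os.environ.setdefault` only fills in variables that are not already set. The `defaults` dictionary and the `missing` check are illustrative names, not part of the original notebook; the values are copied from the cell above.

```python
import os

# Values copied from the cell above; the dictionary name is hypothetical.
defaults = {
    "MLFLOW_S3_ENDPOINT_URL": "http://minio:9000",
    "AWS_ACCESS_KEY_ID": "user_minio",
    "AWS_SECRET_ACCESS_KEY": "p4ssW0rD",
}

# setdefault leaves a variable untouched when it is already defined,
# so settings provided by the container or shell are not overwritten.
for name, value in defaults.items():
    os.environ.setdefault(name, value)

# Report any setting that is still absent or empty before talking to MLflow.
missing = [name for name in defaults if not os.environ.get(name)]
print("missing settings:", missing)
```

With this pattern the same notebook can run against a different MinIO instance without editing the cell, as long as the variables are exported before Jupyter starts.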

In [21]:
experiment_name = "demo_experiment"
try:
    mlflow.create_experiment(experiment_name, artifact_location="s3://mlflow")
except MlflowException as e:
    print(e)
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='s3://mlflow', creation_time=1688057726705, experiment_id='3', last_update_time=1688057726705, lifecycle_stage='active', name='demo_experiment', tags={}>

### Train a model: a linear regression that predicts wine quality from a public dataset

In [22]:
def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


def train(in_alpha, in_l1_ratio):
    np.random.seed(40)

    # Read the wine-quality csv file from the URL
    csv_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    data = pd.read_csv(csv_url, sep=";")

    # Split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)

    # The predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]

    # Set default values if no alpha is provided.
    # The check must run before float(): float(None) raises a TypeError
    # and float() never returns None.
    if in_alpha is None:
        alpha = 0.5
    else:
        alpha = float(in_alpha)

    # Set default values if no l1_ratio is provided
    if in_l1_ratio is None:
        l1_ratio = 0.5
    else:
        l1_ratio = float(in_l1_ratio)

    # Useful for multiple runs
    with mlflow.start_run():
        # Execute ElasticNet
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)

        # Evaluate Metrics
        predicted_qualities = lr.predict(test_x)
        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        # Print out metrics
        print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

        # Log parameter, metrics, and model to MLflow
        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)
        mlflow.sklearn.log_model(lr, "model")
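To make the three logged metrics concrete, here is a small pure-Python sketch of the formulas behind `eval_metrics`. The helper `manual_metrics` and the sample values are hypothetical and only for illustration; the notebook itself relies on the scikit-learn functions.

```python
import math


def manual_metrics(actual, pred):
    """Hypothetical pure-Python counterpart of eval_metrics (illustration only)."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, pred)]

    # Root mean squared error: square root of the average squared error.
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Mean absolute error: average magnitude of the errors.
    mae = sum(abs(e) for e in errors) / n

    # R2: 1 minus the ratio of residual variation to total variation.
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot

    return rmse, mae, r2


# Made-up quality scores, not taken from the wine dataset.
rmse, mae, r2 = manual_metrics([3, 5, 7, 9], [4, 5, 6, 9])
print(rmse, mae, r2)
```

Lower RMSE and MAE and a higher R2 indicate a better fit, which is the ordering to keep in mind when comparing runs.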

### Set the tracking URI, needed for storing the model artifacts

In [23]:
mlflow.set_tracking_uri("http://mlflow-server:5001")

### Run the hyperparameter tuning: train multiple models and log their parameters and metrics to MLflow

In [24]:
alphas = [0.25, 0.5, 0.75]
l1_ratios = [0.25, 0.5, 0.75]
for alpha in alphas:
    for l1_ratio in l1_ratios:
        train(alpha, l1_ratio)

Elasticnet model (alpha=0.250000, l1_ratio=0.250000):
  RMSE: 0.7380489682487518
  MAE: 0.5690312554727689
  R2: 0.22820122626467798

Elasticnet model (alpha=0.250000, l1_ratio=0.500000):
  RMSE: 0.7489307838571879
  MAE: 0.5806946169417597
  R2: 0.20527460024945365

Elasticnet model (alpha=0.250000, l1_ratio=0.750000):
  RMSE: 0.7662476663327953
  MAE: 0.5985976516559469
  R2: 0.16809820954205723

Elasticnet model (alpha=0.500000, l1_ratio=0.250000):
  RMSE: 0.7596554775612442
  MAE: 0.5913132541174235
  R2: 0.18235068599935988

Elasticnet model (alpha=0.500000, l1_ratio=0.500000):
  RMSE: 0.7931640229276851
  MAE: 0.6271946374319586
  R2: 0.10862644997792614

Elasticnet model (alpha=0.500000, l1_ratio=0.750000):
  RMSE: 0.8318658159940802
  MAE: 0.6651040854928952
  R2: 0.019516509058132292

Elasticnet model (alpha=0.750000, l1_ratio=0.250000):
  RMSE: 0.7837307525653582
  MAE: 0.6165474987409886
  R2: 0.1297029612600864

Elasticnet model (alpha=0.750000, l1_ratio=0.500000):
  RMSE: 0.8318702776765884
  MAE: 0.6651291355677875
  R2: 0.019505991453757976

Elasticnet model (alpha=0.750000, l1_ratio=0.750000):
  RMSE: 0.8331799787336065
  MAE: 0.669234506901795
  R2: 0.016416170929073992

#### Now open the MLflow UI at localhost:5001 to compare metrics
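The same comparison can also be done on the numbers printed above. As a minimal sketch, the nine results are copied by hand into a list (the `results` and `ranked` names are illustrative) and sorted by RMSE, which mirrors sorting the runs by a metric column in the UI.

```python
# (alpha, l1_ratio, rmse, mae, r2), copied from the output of the tuning loop above.
results = [
    (0.25, 0.25, 0.7380489682487518, 0.5690312554727689, 0.22820122626467798),
    (0.25, 0.50, 0.7489307838571879, 0.5806946169417597, 0.20527460024945365),
    (0.25, 0.75, 0.7662476663327953, 0.5985976516559469, 0.16809820954205723),
    (0.50, 0.25, 0.7596554775612442, 0.5913132541174235, 0.18235068599935988),
    (0.50, 0.50, 0.7931640229276851, 0.6271946374319586, 0.10862644997792614),
    (0.50, 0.75, 0.8318658159940802, 0.6651040854928952, 0.019516509058132292),
    (0.75, 0.25, 0.7837307525653582, 0.6165474987409886, 0.1297029612600864),
    (0.75, 0.50, 0.8318702776765884, 0.6651291355677875, 0.019505991453757976),
    (0.75, 0.75, 0.8331799787336065, 0.669234506901795, 0.016416170929073992),
]

# Sort ascending by RMSE so the best-fitting run comes first.
ranked = sorted(results, key=lambda row: row[2])
best = ranked[0]

for alpha, l1_ratio, rmse, mae, r2 in ranked:
    print(f"alpha={alpha:.2f} l1_ratio={l1_ratio:.2f} "
          f"RMSE={rmse:.4f} MAE={mae:.4f} R2={r2:.4f}")
```

In this grid the weakest regularization (alpha=0.25, l1_ratio=0.25) gives the lowest RMSE and the highest R2, and the fit degrades as either parameter grows.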

### Train the model and register it in the model registry

In [25]:
def train_with_model_registry(in_alpha, in_l1_ratio):
    np.random.seed(40)

    # Read the wine-quality csv file from the URL
    csv_url = "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    data = pd.read_csv(csv_url, sep=";")

    # Split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)

    # The predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis=1)
    test_x = test.drop(["quality"], axis=1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]

    # Set default values if no alpha is provided.
    # The check must run before float(): float(None) raises a TypeError
    # and float() never returns None.
    if in_alpha is None:
        alpha = 0.5
    else:
        alpha = float(in_alpha)

    # Set default values if no l1_ratio is provided
    if in_l1_ratio is None:
        l1_ratio = 0.5
    else:
        l1_ratio = float(in_l1_ratio)

    # Useful for multiple runs
    with mlflow.start_run():
        # Execute ElasticNet
        lr = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, random_state=42)
        lr.fit(train_x, train_y)

        # Evaluate Metrics
        predicted_qualities = lr.predict(test_x)
        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        # Print out metrics
        print("Elasticnet model (alpha=%f, l1_ratio=%f):" % (alpha, l1_ratio))
        print("  RMSE: %s" % rmse)
        print("  MAE: %s" % mae)
        print("  R2: %s" % r2)

        # Log parameters and metrics to MLflow
        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        # Log the model once and register it under a name in the model registry.
        # A separate unregistered log_model call is not needed: it would only
        # upload the same artifact a second time.
        mlflow.sklearn.log_model(
            sk_model=lr,
            artifact_path="model",
            registered_model_name="ElasticnetWineModel",
        )

In [26]:
train_with_model_registry(0.75, 0.75)

Elasticnet model (alpha=0.750000, l1_ratio=0.750000):
  RMSE: 0.8331799787336065
  MAE: 0.669234506901795
  R2: 0.016416170929073992


Registered model 'ElasticnetWineModel' already exists. Creating a new version of this model...
2023/06/29 16:56:18 INFO mlflow.tracking._model_registry.client: Waiting up to 300 seconds for model version to finish creation. Model name: ElasticnetWineModel, version 3
Created version '3' of model 'ElasticnetWineModel'.
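Once a version exists in the registry, it can be addressed with a `models:/<name>/<version>` URI. The sketch below is illustrative: the helper names `registry_uri` and `load_registered_model` are hypothetical, and the loading step assumes the tracking server and MinIO configured earlier are reachable. The name and version come from the output above.

```python
def registry_uri(name, version):
    """Build a model registry URI of the form models:/<name>/<version>."""
    return f"models:/{name}/{version}"


def load_registered_model(name, version):
    """Load a registered model as a generic pyfunc model (needs a reachable server)."""
    # Imported lazily so that building the URI does not require MLflow.
    import mlflow.pyfunc

    return mlflow.pyfunc.load_model(registry_uri(name, version))


uri = registry_uri("ElasticnetWineModel", 3)
print(uri)

# With the services running, the model could then be used for predictions:
# model = load_registered_model("ElasticnetWineModel", 3)
# predictions = model.predict(test_x)  # test_x: a DataFrame with the feature columns
```

Referring to a model by registry name and version keeps downstream code independent of the run ID and artifact path that produced it.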
