# ML lifecycle automation tools

## MLflow

MLflow is an open-source platform developed by Databricks that lets ML practitioners, such as data scientists and ML engineers, automate the end-to-end machine learning lifecycle through a simple code interface. It includes tools for tracking experiments, packaging code into reproducible runs, and sharing and deploying models. Its open interface design works with any language or platform: it offers clients in Python and Java and is also accessible through a REST API.

MLflow provides APIs for logging parameters, code versions, metrics, and artifacts when running your machine learning code and for later visualizing the results. It also provides a library of reusable components for machine learning tasks, such as pre-processing data and training models, that can be packaged into a reusable, reproducible run.

In addition, MLflow integrates with a variety of machine learning libraries, including TensorFlow, PyTorch, and scikit-learn, so you can keep using the tools you already know while taking advantage of MLflow's tracking and management capabilities.

For this post, you will need the following prerequisites:

- The latest version of Docker installed on your machine. If you don't have it, follow the instructions at https://docs.docker.com/get-docker/.
- Access to a bash terminal (Linux or Windows).
- Access to a browser.
- Python 3.5+ installed.
- pip installed.

## Why MLflow?

MLflow provides a single platform for everyday practitioners to handle the entire machine learning lifecycle, from iterating on model development to deploying the model in a scalable and reliable environment that meets modern software system requirements. It also facilitates collaboration across roles and standardizes, at a high level, the set of tools they use.

### Getting started with MLflow

The following code example introduces the Tracking API. It will help us understand how the MLflow API works and how to use it to track an experiment's metrics, parameters, and artifacts during experimentation.


In [1]:
import os
import mlflow
from mlflow import log_metric, log_param, log_artifact
from mlflow.tracking import MlflowClient
from mlflow.entities import ViewType

In [6]:
# Point the client at the tracking server (assumes one is listening on port 4000)
mlflow.set_tracking_uri("http://0.0.0.0:4000")

# Log a parameter (key-value pair)
log_param("param1", 5)
log_param("param3", 2023)
log_param("param4", 2024)

# Log a metric; metrics can be updated throughout the run
log_metric("foo", 1)
log_metric("foo", 2)
log_metric("mse", 0.5)

# Log an artifact (output file)
with open("output.txt", "w") as f:
    f.write("Hello world from mlflow!")
log_artifact("output.txt")

We can observe that a new folder named `mlruns` was created along with some files.

In [3]:
!cat /etc/*-release

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy


In [4]:
!tree mlruns -L 4

mlruns
├── 0
│   ├── 5359bd93287c4650a405b4e1e2e1d927
│   │   ├── artifacts
│   │   ├── meta.yaml
│   │   ├── metrics
│   │   │   ├── foo
│   │   │   └── mse
│   │   ├── params
│   │   │   ├── param1
│   │   │   └── param3
│   │   └── tags
│   │       ├── mlflow.runName
│   │       ├── mlflow.source.name
│   │       ├── mlflow.source.type
│   │       └── mlflow.user
│   └── meta.yaml
└── models

7 directories, 10 files


The `mlruns` folder is the default root directory where the MLflow Tracking component stores experiment runs and artifacts. It contains one subdirectory per experiment, named after the experiment ID (`0` is the default experiment), and each experiment directory contains one subdirectory per run.

Each run subdirectory is named after a unique run ID and contains the following files and directories:

- **meta.yaml:** A YAML file that contains metadata about the run, such as its status, start time, end time, and artifact location.
- **metrics:** A directory that contains one file per metric logged during the run. Each file is named after the metric and records its logged values.
- **params:** A directory that contains one plain-text file per parameter used in the run. Each file is named after the parameter and contains its value.
- **tags:** A directory that contains one plain-text file per tag applied to the run. Each file is named after the tag and contains its value.
- **artifacts:** A directory that contains files or directories generated as artifacts during the run. Its contents are determined by the user and may include models, data files, plots, or other outputs.

This structure allows for easy navigation of experiment runs, as well as efficient storage and retrieval of the metadata, parameters, tags, and artifacts associated with each run.
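To make the layout concrete, here is a minimal sketch that reads one run directory back with nothing but the standard library. It builds a small stand-in directory rather than touching a real `mlruns` folder; the helper name `read_run_dir`, the run ID `example_run_id`, and the assumed metric line format (`<timestamp> <value> <step>`) are illustrative assumptions, not part of the MLflow API.

```python
import tempfile
from pathlib import Path


def read_run_dir(run_dir: Path) -> dict:
    """Collect params, tags and the last logged value of each metric for one run."""
    params = {p.name: p.read_text().strip() for p in (run_dir / "params").iterdir()}
    tags = {t.name: t.read_text().strip() for t in (run_dir / "tags").iterdir()}
    metrics = {}
    for metric_file in (run_dir / "metrics").iterdir():
        # Assumed line format: "<timestamp_ms> <value> <step>"
        last_line = metric_file.read_text().strip().splitlines()[-1]
        _timestamp, value, _step = last_line.split()
        metrics[metric_file.name] = float(value)
    return {"params": params, "tags": tags, "metrics": metrics}


# Build a tiny stand-in for mlruns/<experiment_id>/<run_id>/ so the sketch runs anywhere.
root = Path(tempfile.mkdtemp()) / "mlruns"
run_dir = root / "0" / "example_run_id"
for sub in ("params", "tags", "metrics", "artifacts"):
    (run_dir / sub).mkdir(parents=True)
(run_dir / "params" / "param1").write_text("5")
(run_dir / "tags" / "mlflow.user").write_text("jovyan")
(run_dir / "metrics" / "foo").write_text("1676747061700 1.0 0\n1676747061800 2.0 0\n")

summary = read_run_dir(run_dir)
print(summary)
```

In practice you would rarely parse these files yourself, since the client interface returns the same information; the sketch is only meant to show that the on-disk format is simple enough to inspect by hand.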

## Viewing the Tracking UI

By default, wherever you run your program, the Tracking API writes data to files in a local `mlruns` directory. You can then run MLflow's Tracking UI to visualize your experiments' metadata and results.

In [15]:
#!mlflow server -h 0.0.0.0 -p 4000

The experiments' metadata, run information, parameters, and metrics can also be accessed programmatically through the MLflow client interface.

In [None]:
client = MlflowClient()

# list of experiments
experiments_list = client.search_experiments() # returns a list of mlflow.entities.Experiment
for experiment in experiments_list:
    print(experiment.name, experiment)

In [6]:
# list of runs in the default experiment (ID "0")
runs_list = client.search_runs(["0"]) # returns a list of mlflow.entities.Run
for run in runs_list:
    print(run)

<Run: data=<RunData: metrics={'foo': 2.0, 'mse': 0.5}, params={'param1': '5', 'param3': '2023'}, tags={'mlflow.runName': 'rumbling-goose-465',
 'mlflow.source.name': '/opt/conda/lib/python3.10/site-packages/ipykernel_launcher.py',
 'mlflow.source.type': 'LOCAL',
 'mlflow.user': 'jovyan'}>, info=<RunInfo: artifact_uri='mlflow-artifacts:/0/5359bd93287c4650a405b4e1e2e1d927/artifacts', end_time=None, experiment_id='0', lifecycle_stage='active', run_id='5359bd93287c4650a405b4e1e2e1d927', run_name='rumbling-goose-465', run_uuid='5359bd93287c4650a405b4e1e2e1d927', start_time=1676747061660, status='RUNNING', user_id='jovyan'>>
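Since MLflow is also accessible through a REST API, the same query can be expressed as a plain HTTP request. The sketch below only builds the request and does not send it, so it runs without a server. The helper name `build_search_runs_request` is hypothetical, and the endpoint path and payload fields reflect our understanding of the MLflow REST API; check them against the version you are running.

```python
import json
import urllib.request


def build_search_runs_request(base_url: str, experiment_ids: list, max_results: int = 10):
    """Build (but do not send) a POST request for the runs/search endpoint."""
    payload = {"experiment_ids": experiment_ids, "max_results": max_results}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/2.0/mlflow/runs/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_runs_request("http://0.0.0.0:4000", ["0"])
print(request.get_method(), request.full_url)

# To send it against a running tracking server, uncomment:
# with urllib.request.urlopen(request) as response:
#     runs = json.load(response).get("runs", [])
```

This can be handy from languages or environments where no MLflow client is available.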


Let's review another example that explores the MLflow API further.

In [7]:
# look up the experiment by name
experiment_name = "Social NLP Experiments"
experiment = mlflow.get_experiment_by_name(experiment_name)

# reuse the experiment if it already exists; otherwise create it
if experiment:
    experiment_id = experiment.experiment_id
else:
    experiment_id = mlflow.create_experiment(experiment_name)

# end the currently active run, if any (takes an optional run status)
mlflow.end_run()

In [None]:

# start a run
# the with block defines the scope of the run, which ends when the block exits
with mlflow.start_run(experiment_id=experiment_id, run_name="run1") as run1:
    mlflow.log_metric("m", 1.55)
    mlflow.set_tag("s.release", "1.1.0-RC")
    mlflow.log_artifact("output.txt")


# start another run
with mlflow.start_run(experiment_id=experiment_id, run_name="run2") as run2:
    mlflow.log_metric("m", 2.50)
    mlflow.set_tag("s.release", "1.2.0-GA")


In [None]:
# Search all runs under experiment id and order them by
# descending value of the metric 'm'
client = MlflowClient()
runs = client.search_runs(experiment_id, order_by=["metrics.m DESC"])
for r in runs:
    print(f"run_id: {r.info.run_id}")
    print(f"lifecycle_stage: {r.info.lifecycle_stage}")
    print(f"metrics: {r.data.metrics}")
    # Exclude mlflow system tags
    tags = {k: v for k, v in r.data.tags.items() if not k.startswith("mlflow.")}
    print(f"tags: {tags}")


# Delete the first run
client.delete_run(run_id=run1.info.run_id)

# Search only deleted runs under the experiment id and use a case-insensitive
# pattern in the filter_string for the tag.
filter_string = "tags.s.release ILIKE '%rc%'"
runs = client.search_runs(experiment_id, run_view_type=ViewType.DELETED_ONLY,
                          filter_string=filter_string)

run_id: bd3ed7421ee64721b2ede3377bf508e7
lifecycle_stage: active
metrics: {'m': 2.5}
tags: {'s.release': '1.2.0-GA'}
run_id: 6ad128918c884ce6b05ec005574375b4
lifecycle_stage: active
metrics: {'m': 1.55}
tags: {'s.release': '1.1.0-RC'}
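To see why the pattern `'%rc%'` picks up the `1.1.0-RC` release tag, it helps to emulate the comparison in plain Python. The sketch below assumes `ILIKE` follows the usual SQL convention (`%` matches any run of characters, `_` matches exactly one, and case is ignored); the function name `ilike` is our own and is not an MLflow function.

```python
import re


def ilike(value: str, pattern: str) -> bool:
    """Case-insensitive SQL-style LIKE: % matches any run of characters, _ matches one."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch) for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.IGNORECASE | re.DOTALL) is not None


# Release tags set on the two runs in the example above
release_tags = {"run1": "1.1.0-RC", "run2": "1.2.0-GA"}
matching = [name for name, tag in release_tags.items() if ilike(tag, "%rc%")]
print(matching)  # ['run1']
```

Only `run1` matches, which is also the run that was deleted, so it is the one we would expect the `DELETED_ONLY` search to return.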

In [8]:
def get_best_run(experiment_id, metric):
    # Connect to MLflow using the currently configured tracking URI
    client = MlflowClient()

    # Get all the active runs for the experiment
    runs = client.search_runs(experiment_id)

    # Find the run with the highest value of the given metric
    best_run = None
    best_metric_value = float("-inf")
    for run in runs:
        # Skip runs that never logged this metric
        metric_value = run.data.metrics.get(metric)
        if metric_value is not None and metric_value > best_metric_value:
            best_metric_value = metric_value
            best_run = run
    # Return the best run (None if no run logged the metric)
    return best_run

In [43]:
get_best_run(experiment_id, metric="m")

<Run: data=<RunData: metrics={'m': 2.5}, params={}, tags={'mlflow.runName': 'run2',
 'mlflow.source.name': '/opt/conda/lib/python3.10/site-packages/ipykernel_launcher.py',
 'mlflow.source.type': 'LOCAL',
 'mlflow.user': 'jovyan',
 's.release': '1.2.0-GA'}>, info=<RunInfo: artifact_uri='file:///home/jovyan/mlruns/850128866986301997/0e90a2c847b940b6b48388a365dd96fb/artifacts', end_time=1676745787687, experiment_id='850128866986301997', lifecycle_stage='active', run_id='0e90a2c847b940b6b48388a365dd96fb', run_name='run2', run_uuid='0e90a2c847b940b6b48388a365dd96fb', start_time=1676745787652, status='FINISHED', user_id='jovyan'>>
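Note that "best" does not always mean "highest": for an error metric such as the `mse` we logged earlier, lower is better. Here is a minimal sketch of the same selection logic with a configurable direction. It works on plain dictionaries so that it runs without a tracking server; the function `pick_best` and the sample records are hypothetical stand-ins for the data exposed by `run.data.metrics`.

```python
def pick_best(runs, metric, higher_is_better=True):
    """Return the record with the best value for `metric`, or None if no record has it."""
    candidates = [r for r in runs if metric in r["metrics"]]
    if not candidates:
        return None
    chooser = max if higher_is_better else min
    return chooser(candidates, key=lambda r: r["metrics"][metric])


# Hypothetical run records standing in for MLflow runs
runs = [
    {"run_name": "run1", "metrics": {"m": 1.55, "mse": 0.2}},
    {"run_name": "run2", "metrics": {"m": 2.5, "mse": 0.5}},
    {"run_name": "run3", "metrics": {}},
]

print(pick_best(runs, "m")["run_name"])                            # run2
print(pick_best(runs, "mse", higher_is_better=False)["run_name"])  # run1
```

When the runs live in MLflow, sorting on the server with `order_by`, as in the earlier `search_runs` call, is likely the simpler option.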

In [None]:
# Log the same metric at several steps to record how it evolves across epochs
with mlflow.start_run():
    for epoch in range(0, 3):
        mlflow.log_metric(key="quality", value=2*epoch, step=epoch)

## The Machine Learning Workflow

Machine learning requires experimenting with a wide range of datasets, data preparation steps, and algorithms to build a model that maximizes some target metric. Once you have built a model, you also need to deploy it to a production system, monitor its performance, and continuously retrain it on new data and compare with alternative models.

Being productive with machine learning can therefore be challenging for several reasons:

- **It’s difficult to keep track of experiments.** When you are just working with files on your laptop, or with an interactive notebook, how do you tell which data, code and parameters went into getting a particular result?
- **It’s difficult to reproduce code.** Even if you have meticulously tracked the code versions and parameters, you need to capture the whole environment (for example, library dependencies) to get the same result again. This is especially challenging if you want another data scientist to use your code, or if you want to run the same code at scale on another platform (for example, in the cloud).
- **There’s no standard way to package and deploy models.** Every data science team comes up with its own approach for each ML library that it uses, and the link between a model and the code and parameters that produced it is often lost.

Moreover, although individual ML libraries provide solutions to some of these problems (for example, model serving), to get the best result you usually want to try multiple ML libraries. MLflow lets you train, reuse, and deploy models with any library and package them into reproducible steps that other data scientists can use as a “black box,” without even having to know which library you are using.
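As a small illustration of the reproducibility point above, the sketch below records the interpreter, platform, and selected package versions in a JSON file. This is a hand-rolled example using only the standard library, not how MLflow itself captures environments; the function `snapshot_environment` and the package list are assumptions made for the example.

```python
import json
import platform
import tempfile
from importlib import metadata
from pathlib import Path


def snapshot_environment(packages):
    """Record interpreter, OS and package versions so a run can be reproduced later."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
    }


env = snapshot_environment(["pip", "a-package-that-is-not-installed"])

# Write the snapshot to a temporary directory so the sketch has no side effects.
out_path = Path(tempfile.mkdtemp()) / "environment.json"
out_path.write_text(json.dumps(env, indent=2))
print(out_path.read_text())
```

A file like this could then be stored next to a run with `log_artifact`, in the same way as `output.txt` earlier in the post.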

