# ML lifecycle automation tools

# MLflow

MLflow is an open-source platform that helps users automate the complete machine learning lifecycle. It includes tools for tracking experiments, packaging code into reproducible runs, and sharing and deploying models. It is built around open interfaces, so it can work with any language or platform: it offers clients in Python and Java and is accessible through a REST API.

MLflow provides APIs for logging parameters, code versions, metrics, and artifacts when running your machine learning code and for later visualizing the results. It also provides a library of reusable components for machine learning tasks, such as pre-processing data and training models, that can be packaged into a reusable, reproducible run.

In addition, MLflow integrates with a variety of machine learning libraries, including TensorFlow, PyTorch, and scikit-learn, so you can keep using the tools you already know while taking advantage of MLflow's tracking and management capabilities.

For this chapter, you will need the following prerequisites:

- The latest version of Docker installed on your machine. If you don't have it, follow the instructions at https://docs.docker.com/get-docker/.
- Access to a bash terminal (Linux or Windows).
- Access to a browser.
- Python 3.5+ installed.
- pip installed.

# Why MLflow?

MLflow provides a single platform for everyday practitioners to handle the entire machine learning lifecycle, from iterating on model development to deploying the model in a scalable and reliable environment that meets modern software system requirements. It facilitates collaboration across roles and standardizes the set of tools used at a high level.


# Getting started with MLflow

# Low Level API

## Using the Tracking API

The MLflow Tracking API lets you log metrics and artifacts (files) from your data science code and see a history of your runs. You can try it out by writing a simple Python script as follows (this example is also included in example/quickstart/test.py):

In [1]:
import os
from mlflow import log_metric, log_param, log_artifact

# Log a parameter (key-value pair)
log_param("param1", 5)

# Log a metric; metrics can be updated throughout the run
log_metric("foo", 1)
log_metric("foo", 2)
log_metric("foo", 3)

# Log an artifact (output file)
with open("output.txt", "w") as f:
    f.write("Hello world!")
log_artifact("output.txt")

The examples in this chapter were run on Ubuntu 22.04, as the release information shows:

In [2]:
!cat /etc/*-release

DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
PRETTY_NAME="Ubuntu 22.04.1 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.1 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

After running the tracking script, we can observe that a new folder named `mlruns` has been created:

In [3]:
!tree mlruns -L 4

mlruns
└── 0
    ├── 2ea55bd70a9d406991c47c77a2c4b44f
    │   ├── artifacts
    │   │   └── output.txt
    │   ├── meta.yaml
    │   ├── metrics
    │   │   └── foo
    │   ├── params
    │   │   └── param1
    │   └── tags
    │       ├── mlflow.runName
    │       ├── mlflow.source.name
    │       ├── mlflow.source.type
    │       └── mlflow.user
    └── meta.yaml

6 directories, 9 files


The `mlruns` folder is the default root directory where the MLflow tracking component stores experiment runs and artifacts. It contains one subdirectory per experiment, named after the experiment ID (`0` is the default experiment). Each experiment directory holds its own `meta.yaml` plus one subdirectory per run.

Each run subdirectory is named after a unique run ID and contains the following files and directories:

- **meta.yaml:** A YAML file that contains metadata about the run, such as its status, start time, end time, and artifact location.
- **metrics:** A directory that contains one plain-text file per metric. Each file is named after the metric and stores one line per logged value, so the history of the metric is preserved.
- **params:** A directory that contains one plain-text file per parameter used in the run. Each file is named after the parameter and contains its value.
- **tags:** A directory that contains one plain-text file per tag applied to the run. Each file is named after the tag and contains its value.
- **artifacts:** A directory that contains files or directories generated as artifacts during the run. Its contents are determined by the user and may include models, data files, plots, or other outputs.

This structure allows for easy navigation of experiment runs, as well as efficient storage and retrieval of the metadata, parameters, tags, and artifacts associated with each run.
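Because the local store is just files on disk, the layout above can be read back without MLflow at all. The following is a minimal sketch using only the standard library. It builds a throwaway directory that mimics the tree shown earlier (the run ID and values are made up) and assumes that each metric file holds one `<timestamp> <value> <step>` line per logged value; treat that line format as an assumption to verify against your MLflow version, and prefer the client API for real work.

```python
import tempfile
from pathlib import Path


def read_run(run_dir):
    """Read params, tags, and metric histories from one run directory."""
    run_dir = Path(run_dir)

    def read_values(subdir):
        # One plain-text file per key; the file content is the value.
        folder = run_dir / subdir
        if not folder.is_dir():
            return {}
        return {p.name: p.read_text().strip()
                for p in sorted(folder.iterdir()) if p.is_file()}

    metrics = {}
    metrics_dir = run_dir / "metrics"
    if metrics_dir.is_dir():
        for path in sorted(metrics_dir.iterdir()):
            history = []
            for line in path.read_text().splitlines():
                # Assumed line format: "<timestamp_ms> <value> <step>"
                timestamp, value, step = line.split()
                history.append({"timestamp": int(timestamp),
                                "value": float(value),
                                "step": int(step)})
            metrics[path.name] = history

    return {"params": read_values("params"),
            "tags": read_values("tags"),
            "metrics": metrics}


# Build a stand-in for mlruns/<experiment_id>/<run_id> (hypothetical run ID).
root = Path(tempfile.mkdtemp())
run_dir = root / "mlruns" / "0" / "example_run_id"
for sub in ("params", "tags", "metrics", "artifacts"):
    (run_dir / sub).mkdir(parents=True)
(run_dir / "params" / "param1").write_text("5")
(run_dir / "tags" / "mlflow.user").write_text("jovyan")
(run_dir / "metrics" / "foo").write_text(
    "1675919900130 1.0 0\n1675919900140 2.0 0\n1675919900150 3.0 0\n"
)

run = read_run(run_dir)
print(run["params"])
print([point["value"] for point in run["metrics"]["foo"]])
```

Reading the files directly is useful for understanding the format, but it bypasses everything the tracking server adds, so it is best kept for inspection and debugging.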

## Viewing the Tracking UI

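By default, wherever you run your program, the Tracking API writes its data to files in a local `mlruns` directory. You can then start MLflow’s Tracking UI from that same directory. With the options below, the server listens on all network interfaces on port 4000, so the UI can be opened in a browser at http://localhost:4000: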

In [None]:
!mlflow server -h 0.0.0.0 -p 4000

## Using the Client API

In [5]:
from mlflow.tracking import MlflowClient

# We can access the experiments that have been created
client = MlflowClient()
experiments_list = client.search_experiments()  # returns a list of mlflow.entities.Experiment
for experiment in experiments_list:
    print(experiment.name, experiment)

Default <Experiment: artifact_location='file:///home/jovyan/mlruns/0', creation_time=1675919900130, experiment_id='0', last_update_time=1675919900130, lifecycle_stage='active', name='Default', tags={}>


In [6]:
# search_runs expects experiment IDs, not names; the Default experiment has ID "0"
runs_list = client.search_runs(experiment_ids=["0"])  # returns a list of mlflow.entities.Run
for run in runs_list:
    print(run)

## Running MLflow Projects

MLflow allows you to package code and its dependencies as a project that can be run in a reproducible fashion on other data. Each project includes its code and an MLproject file that defines its dependencies (for example, the Python environment) as well as which commands can be run within the project and which arguments they take.
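To make the project layout concrete, here is a minimal sketch that writes a small project to a temporary directory: an MLproject file, a conda environment file, and a training script. Everything in it is illustrative rather than taken from a real project: the project name `example_project`, the `alpha` parameter and its default, and the `train.py` script are all assumptions.

```python
import tempfile
from pathlib import Path

# All names below (project name, script, parameter default) are illustrative.
MLPROJECT = """\
name: example_project

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
    command: "python train.py --alpha {alpha}"
"""

CONDA_ENV = """\
name: example_env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - pip
"""

TRAIN_SCRIPT = """\
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--alpha", type=float, default=0.5)
args = parser.parse_args()
print(f"training with alpha={args.alpha}")
"""


def write_project(project_dir):
    """Write the three files that make up the example project."""
    project_dir = Path(project_dir)
    project_dir.mkdir(parents=True, exist_ok=True)
    files = {"MLproject": MLPROJECT, "conda.yaml": CONDA_ENV, "train.py": TRAIN_SCRIPT}
    for name, content in files.items():
        (project_dir / name).write_text(content)
    return project_dir


project_dir = write_project(Path(tempfile.mkdtemp()) / "example_project")
print(sorted(p.name for p in project_dir.iterdir()))
```

The `{alpha}` placeholder in the entry point command is filled in from the `-P alpha=...` option, so running `mlflow run . -P alpha=5.0` from inside that directory would be expected to call `train.py` with `--alpha 5.0`.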

You can easily run existing projects with the mlflow run command, which runs a project from either a local directory or a GitHub URI:

In [18]:
!mlflow run https://github.com/mlflow/mlflow-example.git -P alpha=5.0

2023/02/09 00:32:07 INFO mlflow.projects.utils: === Fetching project from https://github.com/mlflow/mlflow-example.git into /tmp/tmp4z4_ap6e ===
2023/02/09 00:32:09 INFO mlflow.projects.utils: Fetched 'master' branch
2023/02/09 00:32:10 INFO mlflow.utils.conda: === Creating conda environment mlflow-2b6b69a3ea30872171ff71aee564367746fae613 ===
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... ^C


## Advanced example

In [None]:
import mlflow
from mlflow import MlflowClient
from mlflow.entities import ViewType

experiment_name = "Social NLP Experiments"

# get_experiment_by_name returns None (it does not raise) when the experiment is missing
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is None:
    experiment_id = mlflow.create_experiment(experiment_name)
else:
    experiment_id = experiment.experiment_id

# End any run that is still active from a previous cell
mlflow.end_run()

# start a run
with mlflow.start_run(experiment_id=experiment_id, run_name="run1") as run1:
    mlflow.log_metric("m", 1.55)
    mlflow.set_tag("s.release", "1.1.0-RC")
    
    
# start run 2
with mlflow.start_run(experiment_id=experiment_id, run_name="run2") as run2:
    mlflow.log_metric("m", 2.50)
    mlflow.set_tag("s.release", "1.2.0-GA")
    
# Search all runs under experiment id and order them by
# descending value of the metric 'm'
client = MlflowClient()
runs = client.search_runs(experiment_id, order_by=["metrics.m DESC"])
for r in runs:
    print("run_id: {}".format(r.info.run_id))
    print("lifecycle_stage: {}".format(r.info.lifecycle_stage))
    print("metrics: {}".format(r.data.metrics))

    # Exclude mlflow system tags
    tags = {k: v for k, v in r.data.tags.items() if not k.startswith("mlflow.")}
    print("tags: {}".format(tags))


# Delete the first run
client.delete_run(run_id=run1.info.run_id)

# Search only deleted runs under the experiment id and use a case insensitive pattern
# in the filter_string for the tag.
filter_string = "tags.s.release ILIKE '%rc%'"
runs = client.search_runs(experiment_id, run_view_type=ViewType.DELETED_ONLY,
                            filter_string=filter_string)


In [52]:
def get_best_run(experiment_id, metric):
    # Connect to MLflow using the default tracking URI
    client = MlflowClient()

    # Get the runs for the experiment
    runs = client.search_runs(experiment_id)

    # Find the run with the highest value of the given metric
    best_run = None
    best_metric_value = float("-inf")  # also handles zero or negative metrics
    for run in runs:
        metric_value = run.data.metrics.get(metric)
        if metric_value is None:
            continue  # skip runs that never logged this metric
        if metric_value > best_metric_value:
            best_metric_value = metric_value
            best_run = run
    # Return the best run (None if no run logged the metric)
    return best_run

In [53]:
get_best_run(experiment_id, metric="m")

<Run: data=<RunData: metrics={'m': 2.5}, params={}, tags={'mlflow.runName': 'run2',
 'mlflow.source.name': '/opt/conda/lib/python3.10/site-packages/ipykernel_launcher.py',
 'mlflow.source.type': 'LOCAL',
 'mlflow.user': 'jovyan',
 's.release': '1.2.0-GA'}>, info=<RunInfo: artifact_uri='file:///home/jovyan/mlruns/851752001971950812/862ae94ace5b41e1a760079150fd2179/artifacts', end_time=1675906147567, experiment_id='851752001971950812', lifecycle_stage='active', run_id='862ae94ace5b41e1a760079150fd2179', run_name='run2', run_uuid='862ae94ace5b41e1a760079150fd2179', start_time=1675906147524, status='FINISHED', user_id='jovyan'>>
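
As an alternative to looping over every run in Python, the ordering can be delegated to the tracking store, using the same `order_by` syntax as the search example earlier. The sketch below is one possible variant, not the only way to do it. The function name `get_best_run_sorted` and the `maximize` flag are made up for illustration, and it assumes the metric name is a simple identifier that needs no quoting. The client is passed in as an argument, so the function itself has no imports.

```python
def get_best_run_sorted(client, experiment_id, metric, maximize=True):
    """Return the top run for `metric`, letting the tracking store do the sorting."""
    direction = "DESC" if maximize else "ASC"
    runs = client.search_runs(
        experiment_ids=[experiment_id],
        order_by=[f"metrics.{metric} {direction}"],
        max_results=1,  # only the first run in the requested order is needed
    )
    if not runs:
        return None
    best_run = runs[0]
    # Guard against a top result that never logged the metric.
    if metric not in best_run.data.metrics:
        return None
    return best_run


# Usage with a real client (requires MLflow and the experiment created above):
# from mlflow import MlflowClient
# best = get_best_run_sorted(MlflowClient(), experiment_id, metric="m")
```

With the two runs logged above, this should select `run2`, the same run that `get_best_run` returns, while transferring a single run instead of the whole list.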