# Introduction to MLflow tracking 

MLflow is an open-source platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment. In this notebook, we will walk through the process of using MLflow to track machine learning experiments. We will use the Iris dataset and perform dimensionality reduction followed by clustering, logging all relevant information to MLflow.

MLflow's Tracking component lets us log and query experiments, compare results, and reproduce them later. It records key information such as:
- **Parameters:** Input values or configurations used in the experiments, like hyperparameters.
- **Metrics:** Performance measurements, such as accuracy or loss, that help evaluate the model.
- **Artifacts:** Output files or models generated during the experiment, like plots, model files, or logs.
- **Tags:** Labels to help organize and filter experiments, making it easier to search and manage runs.

```python
import mlflow
import mlflow.sklearn
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap.umap_ as umap
from sklearn.cluster import DBSCAN, OPTICS, Birch
from sklearn.metrics import silhouette_score, davies_bouldin_score
import matplotlib.pyplot as plt
```

### Setting up the experiment

First, we need to specify the experiment we want to log to. An experiment in MLflow is a collection of related runs, and `mlflow.set_experiment()` selects it by name.

```python
# Setting up the experiment
mlflow_experiment = mlflow.set_experiment("Iris Dataset Clustering")
```

2024/08/25 11:57:17 INFO mlflow.tracking.fluent: Experiment with name 'Iris Dataset Clustering' does not exist. Creating a new experiment.


**Explanation**:
- **`mlflow.set_experiment(experiment_name)`**: This sets the experiment where all subsequent runs will be logged. If the experiment doesn't exist, it will be created. The call returns an `Experiment` object, stored here in `mlflow_experiment`, that holds details about the experiment such as its name and ID.

### Defining and running the experiment
Now let's define the function `run_experiment` that will perform dimensionality reduction and clustering, and log all relevant information to MLflow.

```python
def run_experiment(dr_method, cluster_method, dataset_name, data, labels, n_components=2):
    # `labels` (the true classes) is accepted for convenience but is not used below
    run_name = f"{dr_method}-{cluster_method}"

    # Start the run explicitly
    mlflow.start_run(run_name=run_name)

    try:
        # Dimensionality reduction
        if dr_method == 'PCA':
            dr_model = PCA(n_components=n_components)
        elif dr_method == 't-SNE':
            dr_model = TSNE(n_components=n_components)
        elif dr_method == 'UMAP':
            dr_model = umap.UMAP(n_components=n_components)
        else:
            raise ValueError("Unsupported dimensionality reduction method")

        reduced_data = dr_model.fit_transform(data)

        # Clustering
        if cluster_method == 'DBSCAN':
            cluster_model = DBSCAN()
        elif cluster_method == 'OPTICS':
            cluster_model = OPTICS()
        elif cluster_method == 'BIRCH':
            cluster_model = Birch()
        else:
            raise ValueError("Unsupported clustering method")

        clusters = cluster_model.fit_predict(reduced_data)

        # Evaluate clustering
        silhouette_avg = silhouette_score(reduced_data, clusters)
        davies_bouldin_avg = davies_bouldin_score(reduced_data, clusters)

        # Log parameters
        params = {
            "dimensionality_reduction": dr_method,
            "clustering_method": cluster_method,
            "n_components": n_components,
            "dataset_name": dataset_name
        }
        mlflow.log_params(params)

        # Log metrics
        metrics = {
            "silhouette_score": silhouette_avg,
            "davies_bouldin_score": davies_bouldin_avg
        }
        mlflow.log_metrics(metrics)

        # Log the models
        mlflow.sklearn.log_model(dr_model, "dimensionality_reduction_model")
        mlflow.sklearn.log_model(cluster_model, "clustering_model")

        # Plot the clustering result
        plt.figure(figsize=(10, 6))
        plt.scatter(reduced_data[:, 0], reduced_data[:, 1], c=clusters, cmap='viridis', marker='o', edgecolor='k')
        plt.title(f'{dr_method} + {cluster_method} Clustering')
        plt.xlabel('Component 1')
        plt.ylabel('Component 2')

        # Alternative: save the plot to disk first and log the file with
        # mlflow.log_artifact (the plots/ directory must already exist):
        # plot_path = f'plots/{dr_method}_{cluster_method}.png'
        # plt.savefig(plot_path)
        # mlflow.log_artifact(plot_path)

        # Log the open figure directly; no intermediate file is needed
        mlflow.log_figure(plt.gcf(), f"{dr_method}_{cluster_method}_plot.png")
        plt.close()

        # Set tags
        tags = {
            "project": "Iris Clustering",
            "team": "Data Science",
            "developer": "Israel",
            "dataset": dataset_name,
            "dim_reduction": dr_method,
            "clustering": cluster_method
        }
        mlflow.set_tags(tags)
    except Exception:
        # Close the run as failed so that it does not stay active after an error
        mlflow.end_run(status="FAILED")
        raise

    # End the run explicitly
    mlflow.end_run()
```

**Explanation**:
- **`mlflow.start_run(run_name=run_name)`**: Starts an MLflow run with a specified name. This is where all tracking information (parameters, metrics, artifacts) will be logged, and it will be associated with the experiment set earlier. Parameters:
  - **run_id (str, optional):** If specified, resumes the run with the given ID instead of creating a new run.
  - **experiment_id (str, optional):** Specifies the experiment under which to create the run. If not specified, the active experiment or the default experiment is used.
  - **run_name (str, optional):** A descriptive name for the run. It can make it easier to identify runs in the UI.
  - **nested (bool, optional):** If `True`, starts a nested run under the current active run.

- **`mlflow.log_params(params)`**: Logs a dictionary of parameters for the current run to MLflow. Parameters are key-value pairs that represent the hyperparameters or other configurations of the experiment.
- **`mlflow.log_metrics(metrics)`**: Logs a dictionary of metrics for the current run to MLflow. Metrics are key-value pairs that represent the performance or outcome of the experiment.
- **`mlflow.set_tags(tags)`**: Sets multiple tags for the current run. Tags are useful for filtering and organizing runs in MLflow.
- **`mlflow.log_artifact()`**: Logs a local file or directory as an artifact for the current run. Artifacts are typically output files such as model files, plots, or other files generated during the run. Parameters:
  - **local_path (str):** The path to the file or directory to log as an artifact.
  - **artifact_path (str, optional):** If provided, the path within the run’s artifact directory to log the artifact to. Defaults to logging at the root level.

- **`mlflow.log_figure(plt.gcf(), "plot_name.png")`**: Logs the current Matplotlib figure directly to MLflow as an artifact. The figure does not need to be saved to a file beforehand.
- **`mlflow.sklearn.log_model(sk_model, artifact_path)`**: Logs a Scikit-learn model as an artifact for the current run. Parameters:
  - **sk_model (object):** The Scikit-learn model to log.
  - **artifact_path (str):** The directory under which to log the model.
  - **serialization_format (str, optional):** The format to use for serializing the model (default is `'cloudpickle'`).
  - **registered_model_name (str, optional):** If provided, this will register the model under the given name in the model registry.

- **`mlflow.end_run()`**: Ends the current active run, ensuring that all the logged data is saved. Parameters:
  - **status (str, optional):** The run status, such as `'FINISHED'`, `'FAILED'`, or `'KILLED'`. Defaults to `'FINISHED'`.

### Running the experiments
With the function defined, we can now run experiments with different combinations of dimensionality reduction and clustering methods.

```python
# Load the Iris dataset
iris = load_iris()
data = iris.data
labels = iris.target
dataset_name = "iris"

# Define your methods
dr_methods = ['PCA', 't-SNE', 'UMAP']
cluster_methods = ['DBSCAN', 'OPTICS', 'BIRCH']

# Run experiments
for dr_method in dr_methods:  # Loop through each dimensionality reduction method
    for cluster_method in cluster_methods:  # Loop through each clustering method
        run_experiment(dr_method, cluster_method, dataset_name, data, labels)
```
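
One thing to watch for when looping over many combinations: `silhouette_score` and `davies_bouldin_score` raise a `ValueError` when the clustering yields fewer than two distinct labels, which can happen, for example, if DBSCAN marks every point as noise. The following is a hedged sketch of a guard. `safe_cluster_scores` is a hypothetical helper introduced here for illustration (it is not part of scikit-learn or MLflow), and the toy points are made up.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score


def safe_cluster_scores(points, cluster_labels):
    """Return both scores, or an empty dict if they are undefined for these labels."""
    # The noise label -1 counts as a label here, as it does in scikit-learn's scorers.
    n_labels = len(np.unique(cluster_labels))
    # Both scorers require 2 <= n_labels <= n_samples - 1.
    if n_labels < 2 or n_labels > len(points) - 1:
        return {}
    return {
        "silhouette_score": float(silhouette_score(points, cluster_labels)),
        "davies_bouldin_score": float(davies_bouldin_score(points, cluster_labels)),
    }


# Toy data: two well-separated pairs of points.
toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])

two_clusters = safe_cluster_scores(toy_points, np.array([0, 0, 1, 1]))
all_noise = safe_cluster_scores(toy_points, np.array([-1, -1, -1, -1]))

print(two_clusters)
print(all_noise)
```

The returned dictionary can be passed straight to `mlflow.log_metrics()`; if it is empty, we could skip the metrics and set a tag on the run instead so that the missing scores are easy to spot in the UI.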



### Access the MLflow UI
After running the code and logging experiments with MLflow, we might want to explore the logged data, parameters, metrics, and artifacts through the MLflow UI. We can start the MLflow UI by running the following command in the terminal:
```bash
mlflow ui
```

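If the UI does not show the experiment, the command was probably started from a different directory than the one that contains the notebook's `mlruns` folder. In that case, we can start the MLflow UI with an explicit path to the correct `mlruns` directory: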

```bash
mlflow ui --backend-store-uri "file:///absolute/path/to/mlruns"
```

Replace `/absolute/path/to/mlruns` with the correct path to the `mlruns` directory.

Once the MLflow server is running, we can access the UI by opening a web browser and navigating to:
```
http://localhost:5000
```

When we are done using the MLflow UI, we can stop the server by pressing `Ctrl+C` in the terminal.

#### Explore the experiments
In the MLflow UI, we can do the following:
- **View experiments**: On the main page, we will see a list of all experiments. Click on an experiment to view the associated runs.
- **Inspect runs**: For each run, we can inspect logged parameters, metrics, artifacts, and more.
- **Compare runs**: Select multiple runs to compare them side-by-side, which is useful for evaluating different models and hyperparameters.