# MLflow Tutorial
Following the [MLflow tracking quickstart](https://mlflow.org/docs/latest/getting-started/intro-quickstart/notebooks/tracking_quickstart.html).

# Imports

In [1]:
# mlflow
import mlflow
from mlflow.models import infer_signature

# other libs
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# MLflow Tracking URI

We need to tell the MLflow client where the tracking server (which also serves the dashboard) is running. `GOTCHA`: the server has to be started first!

Start the server from the command line (activate the venv first if needed):

- ```source .venv/bin/activate```

- `mlflow server --host 127.0.0.1 --port 8080`

The server stores its files in the folder from which you ran the command, so it's easy to keep the results of different experiments separate.
The environment needs to have mlflow installed, though (so either keep the venv active or install mlflow globally).
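
Before pointing the client at the server, it can help to confirm that something is actually listening on that port. The sketch below is one way to do that with only the standard library. `server_is_up` is a hypothetical helper name, and the `/health` path is an assumption about the server (recent MLflow versions expose it, but check your version).

```python
import urllib.error
import urllib.request


def server_is_up(base_uri: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on <base_uri>/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_uri}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # connection refused, timeout, HTTP error status, ...
        return False


# same host and port as in the `mlflow server` command above
print(server_is_up("http://127.0.0.1:8080"))
```

If this prints `False`, the most likely cause is that the server was not started (or was started on another port).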

In [2]:
mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

In [3]:
mlflow.is_tracking_uri_set()

True

# Data, Model & Metric
Not our focus here, so we'll just do everything via sklearn.

In [4]:
# load iris
X, y = datasets.load_iris(return_X_y=True)

# split data without stratification
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
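
The split above is not stratified, so the class proportions in the test set can differ from those in the full data. If that matters, `train_test_split` accepts a `stratify` argument. The sketch below is an alternative, not what the rest of this notebook uses; the `_s` suffixed names are only there to avoid clashing with the variables above.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# stratify=y keeps the class proportions of y in both splits
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# number of test samples per class
test_class_counts = np.bincount(y_test_s)
print(test_class_counts)
```

Iris has 50 samples per class, so a stratified 20% test split holds 10 samples of each class.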

In [5]:
# model hyperparams
params = dict(
    solver="lbfgs",
    max_iter=1000,
    multi_class="auto",
    random_state=8888
)

In [6]:
# train logistic regression
lr = LogisticRegression(**params)
lr.fit(X_train, y_train)

In [7]:
# predict
y_pred = lr.predict(X_test)

In [8]:
# metrics
acc = accuracy_score(y_test, y_pred)
print(acc)

1.0
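
Precision, recall and F1 are imported at the top but not used. A sketch of how they could be computed for this three-class problem follows. `average="macro"` (the unweighted mean over classes) is my choice for the example, not something this notebook prescribes, and the block repeats the data and model setup so that it runs on its own.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# multi_class is left at its default here, since newer scikit-learn releases deprecate it
lr = LogisticRegression(solver="lbfgs", max_iter=1000, random_state=8888)
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

# collect all metrics in one dict
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, average="macro"),
    "recall": recall_score(y_test, y_pred, average="macro"),
    "f1": f1_score(y_test, y_pred, average="macro"),
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

A dict like this can be passed to `mlflow.log_metrics` inside a run to log all values at once.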


# Define MLFlow Experiment
An experiment is a group of runs in which we test one idea with multiple sets of params.

In [9]:
mlflow.set_experiment("Iris Logistic Regression")

<Experiment: artifact_location='mlflow-artifacts:/336971655772693158', creation_time=1703231076073, experiment_id='336971655772693158', last_update_time=1703231076073, lifecycle_stage='active', name='Iris Logistic Regression', tags={}>

# Track Model, Hyperparams, Metrics
We want to log all of these, for each run of the experiment, on our server.

Runs can be given names, or MLflow will generate one at random. The name doesn't matter much, since each run gets its own ID even if two runs share a name.

In [13]:
# start the context of an mlflow run
with mlflow.start_run(run_name="zealous-snipe-173"):
    # log hyperparams
    mlflow.log_params(params)

    # log the metric
    mlflow.log_metric("accuracy", acc)

    # tag the run with basic notes
    mlflow.set_tag("Training info", "Basic Logistic Regression for Iris")

    # infer the model signature (the schema of the model's inputs and outputs)
    signature = infer_signature(X_train, lr.predict(X_train))

    # log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=lr,
        artifact_path="iris_model",
        signature=signature,
        input_example=X_train,
        registered_model_name="first-test-model"
    )

Registered model 'first-test-model' already exists. Creating a new version of this model...
2023/12/22 09:10:45 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: first-test-model, version 3
Created version '3' of model 'first-test-model'.


Logging the run created `./mlruns` and `./mlartifacts` folders in this project's directory.

- `./mlruns` contains the .yaml files with information about each run and where the resulting model is stored, as well as per-run metrics and hyperparams files (plain text), in folders named after the auto-generated run IDs.
- `./mlartifacts` is where the actual model is stored: in a subfolder named after the experiment ID, and within it a folder named after the individual run ID. Inside that folder is the pickled model, along with an input example to run it on and some info about the environment needed to run it.

# Load the saved model as a Python Function
Because we trained the model via scikit-learn, we could use `mlflow.sklearn.load_model()` to get it.

However, for more generality we'll want our models as generic Python functions (`pyfunc`), which is what we'd want for online serving.

## Reload as sklearn native

In [20]:
# as native sklearn
loaded_model_native = mlflow.sklearn.load_model(model_info.model_uri)

In [21]:
# predict (have to use sklearn's .predict())
loaded_model_native.predict(X_train)

array([0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 2, 2, 1, 2, 1, 2,
       1, 0, 2, 1, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 1, 2, 0, 1, 2, 0, 2, 2,
       1, 1, 2, 1, 0, 1, 2, 0, 0, 1, 2, 0, 2, 0, 0, 2, 1, 2, 2, 2, 2, 1,
       0, 0, 2, 2, 0, 0, 0, 1, 2, 0, 2, 2, 0, 1, 1, 2, 1, 2, 0, 2, 1, 2,
       1, 1, 1, 0, 1, 1, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 1, 2, 2, 1, 2,
       1, 1, 2, 2, 0, 1, 2, 0, 1, 2])

## Reload as `pyfunc`

In [23]:
loaded_model_func = mlflow.pyfunc.load_model(model_info.model_uri)
loaded_model_func.predict(X_train)  # we still have to use the .predict() ...

array([0, 0, 1, 0, 0, 2, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 2, 2, 1, 2, 1, 2,
       1, 0, 2, 1, 0, 0, 0, 1, 2, 0, 0, 0, 1, 0, 1, 2, 0, 1, 2, 0, 2, 2,
       1, 1, 2, 1, 0, 1, 2, 0, 0, 1, 2, 0, 2, 0, 0, 2, 1, 2, 2, 2, 2, 1,
       0, 0, 2, 2, 0, 0, 0, 1, 2, 0, 2, 2, 0, 1, 1, 2, 1, 2, 0, 2, 1, 2,
       1, 1, 1, 0, 1, 1, 0, 1, 2, 2, 0, 1, 2, 2, 0, 2, 0, 1, 2, 2, 1, 2,
       1, 1, 2, 2, 0, 1, 2, 0, 1, 2])