# Fit a Gaussian Process and store it in MLflow

In [1]:
# MLflow setup

import mlflow

# Start the server once from $HOME with `mlflow ui --port 8080`, then point every notebook at its URI.
# Using one fixed tracking URI avoids creating a separate local MLflow store in each working directory.
mlflow.set_tracking_uri("http://127.0.0.1:8080")
client = mlflow.MlflowClient()

# Look up the experiment to work in; the built-in 'Default' experiment serves as a sandbox while we play around
experiment_list = client.search_experiments(filter_string="name = 'Default'")
experiment_id = experiment_list[0].experiment_id
mlflow.set_experiment(experiment_id=experiment_id)

<Experiment: artifact_location='mlflow-artifacts:/0', creation_time=1708732941708, experiment_id='0', last_update_time=1708732941708, lifecycle_stage='active', name='Default', tags={}>
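
Indexing `experiment_list[0]` raises an `IndexError` when no experiment matches the filter. A minimal get-or-create sketch, assuming a client that exposes `get_experiment_by_name` and `create_experiment` the way `mlflow.MlflowClient` does. The helper name `get_or_create_experiment` and the experiment name `"sandbox"` are hypothetical:

```python
def get_or_create_experiment(client, name):
    """Return the ID of the experiment called `name`, creating it if needed.

    `client` is assumed to behave like mlflow.MlflowClient:
    get_experiment_by_name returns None when nothing matches, and
    create_experiment returns the new experiment's ID.
    """
    experiment = client.get_experiment_by_name(name)
    if experiment is not None:
        return experiment.experiment_id
    return client.create_experiment(name)


# Hypothetical usage against the server configured above:
#   experiment_id = get_or_create_experiment(mlflow.MlflowClient(), "sandbox")
#   mlflow.set_experiment(experiment_id=experiment_id)
```

Passing the client in as an argument keeps the helper independent of any particular tracking server.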

## Now go ahead and mess around

In [2]:
import numpy as np
from sklearn.datasets import make_friedman2
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel

mlflow.autolog()

2024/02/23 16:08:16 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


In [3]:
X, y = make_friedman2(n_samples=500, noise=0, random_state=0)

In [11]:
kernel = DotProduct() + WhiteKernel()
gpr = GaussianProcessRegressor(
    kernel=kernel, random_state=0, n_restarts_optimizer=10
).fit(X, y)



And with that, autologging has recorded the fit as an MLflow run, with very few actual steps required of us, which is great.

In [12]:
mlflow.sklearn.log_model(
    sk_model=gpr,
    artifact_path="gp-model",
    signature=mlflow.models.infer_signature(X, y),
    registered_model_name="friedman2-gp",
)

Registered model 'friedman2-gp' already exists. Creating a new version of this model...
2024/02/23 16:12:59 INFO mlflow.store.model_registry.abstract_store: Waiting up to 300 seconds for model version to finish creation. Model name: friedman2-gp, version 2
Created version '2' of model 'friedman2-gp'.


<mlflow.models.model.ModelInfo at 0x2803acb90>

In [13]:
model = client.get_registered_model("friedman2-gp")

In [14]:
model.to_proto()

name: "friedman2-gp"
creation_timestamp: 1708733330652
last_updated_timestamp: 1708733579591
latest_versions {
  name: "friedman2-gp"
  version: "2"
  creation_timestamp: 1708733579591
  last_updated_timestamp: 1708733579591
  user_id: ""
  current_stage: "None"
  description: ""
  source: "mlflow-artifacts:/0/11a1f33e8ad549b98ec5edc025b51f6a/artifacts/gp-model"
  run_id: "11a1f33e8ad549b98ec5edc025b51f6a"
  status: READY
  run_link: ""
}

In [8]:
model_uri = f"models:/{model.name}/{model.latest_versions[0].version}"
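
`latest_versions` holds the latest version for each registry stage, so its first element is not guaranteed to be the newest version overall once stages are in use. A small sketch that picks the highest version number explicitly. The helper `latest_model_uri` is hypothetical and only assumes objects with `name`, `latest_versions`, and `version` attributes like the ones printed above:

```python
def latest_model_uri(registered_model):
    """Build a models:/ URI for the highest-numbered version of a registered model."""
    versions = [int(mv.version) for mv in registered_model.latest_versions]
    if not versions:
        raise ValueError(f"{registered_model.name!r} has no versions yet")
    return f"models:/{registered_model.name}/{max(versions)}"


# Hypothetical usage with the `model` object fetched earlier:
#   model_uri = latest_model_uri(model)
```

Versions arrive as strings, so they are converted to integers before comparison; otherwise "10" would sort below "9".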

In [9]:
gp_model = mlflow.sklearn.load_model(model_uri)


In [10]:
gp_model.score(X, y)

0.3680293861017293
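
For a regressor, `score` returns the coefficient of determination R². A plain-Python sketch of that quantity, to make the number above easier to interpret. The function name `r_squared` is illustrative and not part of scikit-learn:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean) ** 2 for t in y_true)               # total sum of squares
    return 1.0 - ss_res / ss_tot


print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0: perfect predictions
print(r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0: no better than predicting the mean
```

Read this way, and assuming the loaded model was fit on this same `X, y`, a score of about 0.37 means the model explains roughly 37% of the variance in `y` on its own training data.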

Okay, now we can make the whole round trip: log a model, register it, and load it back from MLflow whenever we want. This is great! Rather than saving models ad hoc and hoping the log messages are sufficient, we can keep all the prototype models in one place and track the information logged with each of them.

This is also clearly relevant to my old physics work, where all the GP fitting could have been tracked and logged more systematically.

The piece I haven't practiced here is using a database as the MLflow backend, but setting one up feels like overkill for such simple use cases. For the analyses I do from now on, I should try tracking them with MLflow, just to keep a nice record of things.
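
A database backend need not mean running a database server: MLflow accepts SQLAlchemy-style URIs, so a single SQLite file can act as the backend store. A sketch of that setup, assuming a file named `mlflow.db` (a hypothetical name) in the directory the server is launched from. The helper `sqlite_store_uri` is illustrative, and the MLflow commands are left as comments so the snippet runs without a server:

```python
from pathlib import Path


def sqlite_store_uri(db_file):
    """Return a SQLAlchemy-style SQLite URI for a relative database file path."""
    return f"sqlite:///{Path(db_file).as_posix()}"


uri = sqlite_store_uri("mlflow.db")
print(uri)  # sqlite:///mlflow.db

# Assumed usage, not run here:
#   shell:    mlflow ui --backend-store-uri sqlite:///mlflow.db --port 8080
#   notebook: mlflow.set_tracking_uri("http://127.0.0.1:8080")  # unchanged from the setup cell
```

Because only the server's launch command changes, the notebook cells above would stay as they are.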