FastAI reference for training vision models on a simple dataset, MNIST.

From MLflow 3 onwards, autologging for fastai is no longer available, so metrics and models are logged manually here.

In [None]:
pip install fastai

In [None]:
pip install mlflow --quiet

In [None]:
from fastai.vision.all import *
import os
import mlflow

In [None]:
# default value
LOCAL_REGISTRY = "sqlite:///mlruns.db"

**MLflow parameters**

In [None]:
LOCAL_REGISTRY = "sqlite:///tutorial_mlflow.db"
MODEL_NAME = "TutorialMlFlowModel"

EXPERIMENT_NAME = "FastAi MLFlow Tutorial"

Init the client and local default registry

In [None]:
mlfclient = mlflow.tracking.MlflowClient(tracking_uri=LOCAL_REGISTRY)

In [None]:
mlflow.set_tracking_uri(LOCAL_REGISTRY)

MLflow terminology:
Experiment vs. run

A run is a single execution of model code.
During an MLflow run, you can log model parameters and results.
An experiment is a collection of related runs.




Nested runs also exist:

Nested runs are typically used to log components or sub-processes within a single, larger run. For example, you might have a top-level run for a complete model training pipeline, and then use nested runs to log steps like data preprocessing, feature engineering, or evaluating on different datasets.
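
A minimal sketch of the parent/child pattern described above, using `mlflow.start_run(nested=True)`. The run names and the `tracker` parameter are hypothetical: `tracker` defaults to the real `mlflow` module and exists only so the function can be tried with a stand-in object.

```python
def run_pipeline(steps, tracker=None):
    """Log every pipeline step as a nested run under one parent run.

    `steps` is a list of (name, callable) pairs; each callable returns a
    dict mapping metric names to float values.
    """
    if tracker is None:
        import mlflow as tracker  # imported lazily; assumption: mlflow is installed

    results = {}
    with tracker.start_run(run_name="training_pipeline"):
        for name, step_fn in steps:
            # nested=True attaches the child run to the currently active parent run
            with tracker.start_run(run_name=name, nested=True):
                metrics = step_fn()
                tracker.log_metrics(metrics)
                results[name] = metrics
    return results
```

Calling `run_pipeline([("preprocessing", prep_fn), ("evaluation", eval_fn)])` with two hypothetical step functions would produce one parent run with two child runs in the active experiment.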


The Default Experiment is always present from the start

While MLflow does provide a default experiment, it primarily serves as a 'catch-all' safety net for runs initiated without a specified active experiment.


In [None]:
print(mlfclient.search_experiments())


[<Experiment: artifact_location='/content/mlruns/0', creation_time=1751404869785, experiment_id='0', last_update_time=1751404869785, lifecycle_stage='active', name='Default', tags={}>]


When should you define an experiment? According to the documentation:

The guiding principle for creating an experiment is the consistency of the input data. If multiple runs use the same input dataset (even if they use different portions of it), they logically belong to the same experiment. For other hierarchical categorizations, using tags is advisable.
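
As a sketch of the tag-based categorization mentioned above, the helper below builds an MLflow search filter from a dictionary of tags. `build_tag_filter` and the `stage` tag are hypothetical names introduced for illustration; the `tags.` prefix, the backtick quoting of keys, and `AND` are part of MLflow's search syntax.

```python
def build_tag_filter(tags):
    """Build an MLflow search filter that matches all given tag key/value pairs."""
    clauses = []
    for key, value in tags.items():
        # Keep the sketch simple: refuse characters that would need escaping
        if "'" in str(value) or "`" in key:
            raise ValueError(f"unsupported character in tag {key!r}")
        # Backticks around the key allow names containing dots, dashes, or spaces
        clauses.append(f"tags.`{key}` = '{value}'")
    return " AND ".join(clauses)


print(build_tag_filter({"project_name": "tutorials-to-learn-ml", "stage": "frozen"}))
# tags.`project_name` = 'tutorials-to-learn-ml' AND tags.`stage` = 'frozen'
```

The resulting string can be passed as `filter_string` to `mlfclient.search_experiments` or `mlfclient.search_runs`.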



The mlruns directory and the .db file (named by LOCAL_REGISTRY) are created at this point.

With a SQLite tracking URI, parameters, metrics, and tags are stored in the .db file, while the mlruns directory holds the artifacts of your MLflow runs. Each subdirectory within mlruns corresponds to an experiment, and within each experiment directory you'll find one subdirectory per run containing the artifacts (model files, plots, etc.) logged for that run.
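
To inspect that directory from Python rather than with `ls`, a small walker such as the following can help. It is a sketch that assumes the `mlruns/<experiment_id>/<run_id>/...` layout described above; `list_run_files` is a hypothetical helper, not an MLflow API.

```python
from pathlib import Path


def list_run_files(root="mlruns"):
    """Map 'experiment_id/run_id' to the files stored beneath that run directory."""
    layout = {}
    root = Path(root)
    if not root.is_dir():
        return layout

    def visible_dirs(parent):
        # Skip hidden housekeeping folders such as .trash
        return sorted(p for p in parent.iterdir() if p.is_dir() and not p.name.startswith("."))

    for exp_dir in visible_dirs(root):
        for run_dir in visible_dirs(exp_dir):
            files = sorted(
                f.relative_to(run_dir).as_posix()
                for f in run_dir.rglob("*") if f.is_file()
            )
            layout[f"{exp_dir.name}/{run_dir.name}"] = files
    return layout


for run_key, files in list_run_files().items():
    print(run_key, files)
```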



In [None]:
!ls

mlruns	sample_data  tutorial_mlflow.db


In [None]:
experiment_tags = {
    "project_name": "tutorials-to-learn-ml",
}

active_experiment = mlfclient.get_experiment_by_name(EXPERIMENT_NAME)
if active_experiment is None:
    mlfclient.create_experiment(name=EXPERIMENT_NAME, tags=experiment_tags)


active_experiment = mlflow.set_experiment(EXPERIMENT_NAME)
active_experiment_id = active_experiment.experiment_id

A new experiment has now been created with a new ID.

In [None]:
active_experiment_id

'1'

In [None]:
mlfclient.search_experiments(filter_string="tags.`project_name` = 'tutorials-to-learn-ml'")

[<Experiment: artifact_location='/content/mlruns/1', creation_time=1751404878902, experiment_id='1', last_update_time=1751404878902, lifecycle_stage='active', name='FastAi MLFlow Tutorial', tags={'project_name': 'tutorials-to-learn-ml'}>]

In [None]:
print(len(mlfclient.search_experiments()), mlfclient.search_experiments())

2 [<Experiment: artifact_location='/content/mlruns/1', creation_time=1751404878902, experiment_id='1', last_update_time=1751404878902, lifecycle_stage='active', name='FastAi MLFlow Tutorial', tags={'project_name': 'tutorials-to-learn-ml'}>, <Experiment: artifact_location='/content/mlruns/0', creation_time=1751404869785, experiment_id='0', last_update_time=1751404869785, lifecycle_stage='active', name='Default', tags={}>]


The experiment was created in the current directory.

Callback to save the metrics after each training epoch

https://docs.fast.ai/callback.core.html#attributes-available-to-callbacks

In [None]:
from mlflow import MlflowClient
from typing import List

class MLFlowTracking(Callback):
    "A `Callback` that logs the loss and other metrics to MLflow after each epoch"
    # Run after fastai's Recorder (order 50) so this epoch's values are already recorded
    order = 60

    def __init__(self,
                 metric_names: List[str],
                 client: MlflowClient,
                 run_id: str):
        self.client = client
        self.run_id = run_id
        self.metric_names = metric_names

    def after_epoch(self):
        "Log the latest value of each tracked metric"
        if len(self.recorder.values) == 0:
            return
        for metric_name in self.metric_names:
            # recorder.metric_names starts with 'epoch'; the rows in recorder.values do not
            m_idx = list(self.recorder.metric_names[1:]).index(metric_name)
            val = self.recorder.values[-1][m_idx]
            self.client.log_metric(self.run_id, metric_name, float(val), step=self.learn.epoch)
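
The index lookup in `after_epoch` is easier to follow with plain lists. The sketch below assumes the column names match the training table printed by fastai (`epoch, train_loss, valid_loss, error_rate, time`) and that each recorded row holds only the metric values, without the `epoch` and `time` columns; `latest_metric` is a hypothetical helper.

```python
# Column names as reported by the recorder, and one row of values per finished epoch
metric_names = ["epoch", "train_loss", "valid_loss", "error_rate", "time"]
values = [
    [0.190921, 0.092403, 0.030667],  # epoch 0
    [0.183168, 0.134663, 0.040917],  # epoch 1
]


def latest_metric(name, metric_names, values):
    """Return the named metric from the most recent epoch row."""
    # Dropping the leading "epoch" column aligns the names with the value rows;
    # "time" comes last, so it never shifts an index.
    idx = metric_names[1:].index(name)
    return values[-1][idx]


print(latest_metric("valid_loss", metric_names, values))  # 0.134663
```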

**Callbacks in fastai can be defined in two different places**

- on the learner: core callbacks that are always active during training
- on the fit method call: callbacks that apply only to that training phase (for example, early stopping during fine-tuning), in addition to the learner's core callbacks



In [None]:
SPLIT_SEED = 42
VALID_PCT = 0.2
BATCH_SIZE = 64

In [None]:
params_training_all_runs = {'data_split': 'random',
    'split_seed': SPLIT_SEED,
    'split_valid_pct': VALID_PCT,
    'item_tfms': 'Resize(460)',
    'batch_tfms': 'aug_transforms(size=224, min_scale=0.75)',
    'batch_size': BATCH_SIZE,
    'model_name': 'resnet18_pretrained',
}

In [None]:
params_training_all_runs

{'data_split': 'random',
 'split_seed': 42,
 'split_valid_pct': 0.2,
 'item_tfms': 'Resize(460)',
 'batch_tfms': 'aug_transforms(size=224, min_scale=0.75)',
 'batch_size': 64,
 'model_name': 'resnet18_pretrained'}

aug_transforms returns a list of transforms, so it has to be unpacked with the * operator when it is combined with other transforms in one list:
*aug_transforms(size=224, min_scale=0.75)
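
The effect of the `*` operator can be seen with ordinary Python lists. The strings here are stand-ins for the real transform objects, chosen only for illustration.

```python
normalize = "Normalize"
augmentations = ["Flip", "Brightness", "Warp"]  # stand-in for the list aug_transforms(...) returns

nested = [normalize, augmentations]   # a list nested inside a list
flat = [normalize, *augmentations]    # unpacked: one flat list of transforms

print(nested)  # ['Normalize', ['Flip', 'Brightness', 'Warp']]
print(flat)    # ['Normalize', 'Flip', 'Brightness', 'Warp']
```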

In [None]:
path = untar_data(URLs.MNIST)
train_path = path / 'training'

In [None]:
mnist = DataBlock(blocks = (ImageBlock, CategoryBlock),
                 get_items = get_image_files,
                 get_y = parent_label,
                 batch_tfms=[Normalize.from_stats(*imagenet_stats), *aug_transforms(size=224, min_scale=0.75)],
                 item_tfms = Resize(460),
                 splitter = RandomSplitter(valid_pct = VALID_PCT, seed = SPLIT_SEED))

dls = mnist.dataloaders(train_path, bs=BATCH_SIZE)

In [None]:
learn = vision_learner(dls, resnet18, metrics=error_rate)

Example of two consecutive runs on the same learner

In [None]:
def log_mlflow_params(mlfclient, run, params):
    for k, v in params.items():
        mlfclient.log_param(run_id=run.info.run_id, key=k, value=v)


To use the MLflow Model Registry, you need to add your MLflow models to it. This is done by registering a model in one of the following ways:

- mlflow.<model_flavor>.log_model(..., registered_model_name='model_name'):
    registers the model while logging it to the tracking server.
- mlflow.register_model('model_uri', 'model_name'): registers the model after it has been logged to the tracking server.
  Note that you have to log the model before running this command in order to get a model URI.




You can log a fastai model in MLflow without using the deprecated mlflow.fastai module by exporting the model and logging the exported file as an artifact.

Logging a model as an artifact using mlflow.log_artifact will save the model file within the run's artifacts, but it will not automatically register it under the "Models" section in the MLflow UI.



In [None]:
def save_fastai_model_as_artifact(mlfclient, run_id,  learner, exported_model_filename, artifact_path = 'fastai_model'):

    learner.export(exported_model_filename)
    mlfclient.log_artifact(run_id,
        local_path=exported_model_filename,
        artifact_path=artifact_path,
    )

    print("artifact_uri saved as model")
    print(f"runs:/{run_id}/{artifact_path}/{exported_model_filename}")

    # clear the exported fastai model
    os.remove(exported_model_filename)

In [None]:
def fastai_model_from_artifact(artifact_uri):
    local_download_path = mlflow.artifacts.download_artifacts(artifact_uri=artifact_uri)
    return load_learner(local_download_path)
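
The save and load helpers are connected only by the `runs:/` URI string, so building it in one place avoids typos when loading later. This is a sketch; `build_run_artifact_uri` is a hypothetical helper and the run ID in the example is a placeholder, not a real run.

```python
def build_run_artifact_uri(run_id, filename, artifact_path="fastai_model"):
    """Return the runs:/ URI of a file logged under `artifact_path` in a run."""
    # Drop empty parts and stray slashes so the pieces join cleanly
    parts = [p.strip("/") for p in (artifact_path, filename) if p]
    return f"runs:/{run_id}/" + "/".join(parts)


uri = build_run_artifact_uri("abc123", "fastai_resnet18_01.pkl")  # "abc123" is a placeholder run ID
print(uri)  # runs:/abc123/fastai_model/fastai_resnet18_01.pkl
```

The returned string has the same shape as the URI printed by `save_fastai_model_as_artifact` and can be passed to `fastai_model_from_artifact`.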

In [None]:
with mlflow.start_run(experiment_id=active_experiment_id, run_name='resnet18_pretrained_01_final_layers') as run:
    log_mlflow_params(mlfclient, run, params_training_all_runs)

    run_params = {"learning_rate": 0.01, "pct_start": 0.99, "num_epochs": 4}
    log_mlflow_params(mlfclient, run, run_params)

    cb_mlflow = MLFlowTracking(metric_names=['valid_loss', 'train_loss', 'error_rate'], client=mlfclient, run_id=run.info.run_id)

    learn.freeze()
    learn.fit_one_cycle(run_params['num_epochs'], run_params['learning_rate'], pct_start=run_params['pct_start'], cbs=[cb_mlflow])

    save_fastai_model_as_artifact(mlfclient, run.info.run_id, learn, 'fastai_resnet18_01.pkl')

epoch,train_loss,valid_loss,error_rate,time
0,0.190921,0.092403,0.030667,04:47
1,0.183168,0.134663,0.040917,04:36
2,0.16626,0.095721,0.0275,04:37
3,0.118813,0.054045,0.016667,04:36


artifact_uri saved as model
runs:/e850b2bfbf7e4fe1b2039d5028942226/fastai_model/fastai_resnet18_01.pkl


In [None]:
with mlflow.start_run(experiment_id=active_experiment_id, run_name='resnet18_pretrained_02_unfrozen_layers') as run:
    log_mlflow_params(mlfclient, run, params_training_all_runs)

    run_params = {"learning_rate_min": 0.0001, "learning_rate_max": 0.001, "pct_start": 0.3, "num_epochs": 8, "div": 5.0}
    log_mlflow_params(mlfclient, run, run_params)

    cb_mlflow = MLFlowTracking(metric_names=['valid_loss', 'train_loss', 'error_rate'], client=mlfclient, run_id=run.info.run_id)

    learn.unfreeze()
    learn.fit_one_cycle(run_params['num_epochs'], slice(run_params['learning_rate_min'], run_params['learning_rate_max']),
                        pct_start=run_params['pct_start'], div=run_params['div'],
                        cbs=[cb_mlflow, EarlyStoppingCallback(min_delta=0.001, patience=2)])

    save_fastai_model_as_artifact(mlfclient, run.info.run_id, learn, 'fastai_resnet18_02.pkl')

epoch,train_loss,valid_loss,error_rate,time
0,0.094031,0.061326,0.019667,04:43
1,0.094099,0.061668,0.019167,04:45
2,0.073661,0.049875,0.015333,04:41
3,0.072585,0.043312,0.013167,04:40
4,0.041572,0.028902,0.008917,04:39
5,0.032185,0.027033,0.007833,04:41
6,0.025064,0.022558,0.007167,04:41
7,0.028147,0.023244,0.007583,04:42


artifact_uri saved as model
runs:/b2d08a1fb88547f88b877a8bf7f45f87/fastai_model/fastai_resnet18_02.pkl


In [None]:
# Fetch the most recent run of the experiment (returns the Run object, not just its ID),
# e.g. when rerunning the notebook or logging more artifacts to the last run
def get_last_run_id(mlfclient, active_experiment_id):
    runs = mlfclient.search_runs(
        experiment_ids=[active_experiment_id],
        order_by=["start_time DESC"],
        max_results=1,
    )
    return runs[0] if runs else None

In [None]:
run = get_last_run_id(mlfclient, active_experiment_id)
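
The same `search_runs` call can pick the best run instead of the latest one by changing the ordering. The sketch below assumes the metric was logged under the name `error_rate`, as in the callback above; `get_best_run` is a hypothetical helper, and the status filter is an added assumption to skip failed or still-running runs.

```python
def get_best_run(client, experiment_id, metric="error_rate"):
    """Return the finished run with the lowest last-logged value of `metric`, or None."""
    runs = client.search_runs(
        experiment_ids=[experiment_id],
        filter_string="attributes.status = 'FINISHED'",
        order_by=[f"metrics.{metric} ASC"],
        max_results=1,
    )
    return runs[0] if runs else None
```

With the notebook's objects this would be called as `get_best_run(mlfclient, active_experiment_id)`, and the metric value would then be available in `best_run.data.metrics`.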

How to save an artifact from the learner

In [None]:
loss_plot_path = "loss_plot.png"
learn.recorder.plot_loss(show_epochs=True).figure.savefig(loss_plot_path)

# Log the plot as an artifact to the  MLflow run
mlfclient.log_artifact(run.info.run_id, local_path=loss_plot_path,
        artifact_path='figures')

In [None]:
learner1 = fastai_model_from_artifact('runs:/454be886fd9a4e3dad8ab5ee3c9c1029/fastai_model/fastai_resnet18_01.pkl')

Downloading artifacts:   0%|          | 0/1 [00:00<?, ?it/s]

If you only need to load model weights and optimizer state, use the safe `Learner.load` instead.
  warn("load_learner` uses Python's insecure pickle module, which can execute malicious arbitrary code when loading. Only load files you trust.\nIf you only need to load model weights and optimizer state, use the safe `Learner.load` instead.")


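When a fastai model is exported, the file contains the model architecture, the weights, and the data transforms, but not the data items themselves; according to the fastai documentation, the optimizer state is left out as well.
You need to recreate the DataLoaders yourself and attach them to the learner.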

In [None]:
learner1.dls = dls

The model summary shows that the first layers of the loaded learner are still frozen.

Total non-trainable params: 11,166,912

In [None]:
learner1.summary()

Sequential (Input shape: 64 x 3 x 224 x 224)
Layer (type)         Output Shape         Param #    Trainable 
                     64 x 64 x 112 x 112 
Conv2d                                    9408       False     
BatchNorm2d                               128        True      
ReLU                                                           
____________________________________________________________________________
                     64 x 64 x 56 x 56   
MaxPool2d                                                      
Conv2d                                    36864      False     
BatchNorm2d                               128        True      
ReLU                                                           
Conv2d                                    36864      False     
BatchNorm2d                               128        True      
Conv2d                                    36864      False     
BatchNorm2d                               128        True      
ReLU                      

**The server where experiments can be viewed runs in the background**

In [None]:
MLFLOW_PORT = 5000

# run tracking UI in the background
get_ipython().system_raw(f'mlflow ui --backend-store-uri {LOCAL_REGISTRY} --port {MLFLOW_PORT} &')

Check for an already active process; if one is present, kill it.

In [None]:
!ps | egrep 'mlflow'

   7276 ?        00:00:01 mlflow


Check that the port is not already in use

In [None]:
!netstat -lntu

Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 172.28.0.12:9000        0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:5000          0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.11:34935        0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:35663         0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:43329         0.0.0.0:*               LISTEN     
tcp        0      0 172.28.0.12:6000        0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:3453          0.0.0.0:*               LISTEN     
tcp        0      0 127.0.0.1:50763         0.0.0.0:*               LISTEN     
tcp6       0      0 :::8080                 :::*                    LISTEN     
udp        0      0 127.0.0.11:46033        0.0.0.0:*                          
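
The same check can be done from Python without reading the `netstat` table by eye. This is a minimal sketch using only the standard library; `is_port_in_use` is a hypothetical helper and it only tests TCP on the given host.

```python
import socket


def is_port_in_use(port, host="127.0.0.1"):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising an exception
        return sock.connect_ex((host, port)) == 0


print(is_port_in_use(5000))
```

Running it before starting `mlflow ui` shows whether the port is free; running it afterwards confirms that the server is listening.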


How to debug a process: inspect the actual command that was used to start it

In [None]:
!cat /proc/21962/cmdline

cat: /proc/21962/cmdline: No such file or directory
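
The "No such file or directory" message means that no process with that PID exists any more; the PID has to be taken from the current `ps` output. A sketch of the same lookup in Python, assuming a Linux system with `/proc` mounted (`read_cmdline` is a hypothetical helper):

```python
import os
from pathlib import Path


def read_cmdline(pid):
    """Return the command line of a process as a list of arguments, or None if it is gone."""
    try:
        raw = Path(f"/proc/{pid}/cmdline").read_bytes()
    except OSError:
        return None
    # Arguments are separated (and terminated) by NUL bytes
    return [arg.decode(errors="replace") for arg in raw.split(b"\0") if arg]


print(read_cmdline(os.getpid()))
```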


Solution for viewing the server on Google Colab

In [None]:
# create remote tunnel using ngrok.com to allow local port access
# borrowed from https://colab.research.google.com/github/alfozan/MLflow-GBRT-demo/blob/master/MLflow-GBRT-demo.ipynb#scrollTo=4h3bKHMYUIG6
!pip install pyngrok --quiet
from pyngrok import ngrok
from google.colab import userdata

If you get an error saying that a session is already running and you are limited to one session, stop the session here:

https://dashboard.ngrok.com/agents

In [None]:
# Terminate open tunnels if exist
ngrok.kill()

# Setting the authtoken (optional)
# Get your authtoken from https://dashboard.ngrok.com/get-started/your-authtoken
ngrok.set_auth_token(userdata.get('NGROK_AUTH_TOKEN'))

public_url = ngrok.connect(MLFLOW_PORT).public_url
print("MLflow Tracking UI:", public_url)

MLflow Tracking UI: https://1211-34-126-71-140.ngrok-free.app


In [None]:
!ps | egrep 'ngrok'

  10131 ?        00:00:01 ngrok


An alternative for Colab is to use an external tracking server instead of your own:

https://mlflow.org/docs/latest/ml/getting-started/tracking-server-overview/#method-2-use-free-hosted-tracking-server-databricks-free-trial