# MLflow Datasets & Serving

Exploring the MLflow Datasets and Serving APIs

| Date       | User   | Change Type | Remarks                                         |
| ---------- | ------ | ----------- | ----------------------------------------------- |
| 13/01/2026 | Martin | Created     | Notebook to explore MLflow datasets and serving |
| 14/01/2026 | Martin | Updated     | Explored datasets and serving APIs              |

# Content

* [Datasets](#datasets)
* [Serving](#serving)

In [4]:
import os
import mlflow
from dotenv import dotenv_values
config = dotenv_values("../.env")

os.environ["AWS_ACCESS_KEY_ID"] = config["MLFLOW_USER"]
os.environ["AWS_SECRET_ACCESS_KEY"] = config["MLFLOW_PASSWORD"]
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://127.0.0.1:9000"
os.environ["MLFLOW_S3_IGNORE_TLS"] = "true"

mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Datasets

_Adapted from: https://mlflow.org/docs/latest/ml/dataset/_

Features:

- __Data Lineage__: Track the complete journey from raw data sources to model inputs
- __Reproducibility__: Ensure experiments can be reproduced with identical datasets
- __Version Control__: Manage different versions of datasets as they evolve
- __Collaboration__: Share datasets and their metadata across teams
- __Evaluation Integration__: Seamlessly integrate with MLflow's evaluation capabilities
- __Production Monitoring__: Track datasets used in production inference and evaluation

Advanced workflows are covered in the _"Advanced Dataset Management"_ section of the documentation linked above.

In [5]:
import pandas as pd
import polars as pl
import mlflow
import mlflow.data
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

Simple training example

In [6]:
mlflow.set_experiment("mlflow-datasets-api")

<Experiment: artifact_location='s3://mlflow/4', creation_time=1768273092652, experiment_id='4', last_update_time=1768273092652, lifecycle_stage='active', name='mlflow-datasets-api', tags={'mlflow.experimentKind': 'custom_model_development'}>

In [45]:
# Load data
dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")

# Create a Dataset object
dataset = mlflow.data.from_pandas(
  raw_data,
  source=dataset_source_url,
  name="wine-quality-white",
  targets="quality"
)

# Log the dataset to an MLflow run
with mlflow.start_run():
  mlflow.log_input(dataset, context="training", tags={'tag1': 'nice', 'tag2': 'try'})

🏃 View run sneaky-gnat-637 at: http://127.0.0.1:5000/#/experiments/4/runs/475ead24e5f149c293000aa1dc4dfaca
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/4



Using Polars, saving with predictions from a machine learning model

In [34]:
raw_data_pl = pl.from_pandas(raw_data)

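# Shuffle once, then split 80/20 so the train and test sets do not overlap
shuffled = raw_data_pl.sample(fraction=1.0, shuffle=True, seed=42)
n_train = int(0.8 * shuffled.height)
train = shuffled.head(n_train)
test = shuffled.tail(shuffled.height - n_train)

# Use a 1-D target; a single-column DataFrame triggers a scikit-learn shape warning
y_train = train["quality"].to_numpy()
X_train = train.drop("quality")

y_test = test["quality"]
X_test = test.drop("quality")

rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)

preds = rfc.predict(X_test)
# "quality" is multiclass, so record the probability of the predicted class
pred_proba = rfc.predict_proba(X_test).max(axis=1)

eval_data = X_test.with_columns(
  quality=y_test,
  prediction=pl.Series(preds),
  prediction_proba=pl.Series(pred_proba)
)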

eval_dataset = mlflow.data.from_polars(
  eval_data,
  source=dataset_source_url,
  name="wine-quality-evaluation",
  targets="quality", # These are column names from the dataset
  predictions="prediction"
)

with mlflow.start_run():
  mlflow.log_input(eval_dataset, context='evaluation')

  mlflow.sklearn.log_model(
    sk_model=rfc,
    name="wine-quality-classifier",
    input_example=X_test[0].to_numpy()
  )



🏃 View run whimsical-pig-554 at: http://127.0.0.1:5000/#/experiments/4/runs/845d98b9e982460d8f1615530b2b7551
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/4


Metadata is stored in the dataset object

In [35]:
print(f"Dataset name: {dataset.name}")  # Defaults to "dataset" if not specified
print(f"Dataset digest: {dataset.digest}")  # Unique hash identifier (computed automatically)
print(f"Dataset source: {dataset.source}")  # DatasetSource object
print(f"Dataset profile: {dataset.profile}")  # Optional: implementation-specific statistics
print(f"Dataset schema: {dataset.schema}")

Dataset name: wine-quality-white
Dataset digest: 2a1e42c4
Dataset source: <mlflow.data.http_dataset_source.HTTPDatasetSource object at 0x000002044E33B290>
Dataset profile: {'num_rows': 4898, 'num_elements': 58776}
Dataset schema: ['fixed acidity': double (required), 'volatile acidity': double (required), 'citric acid': double (required), 'residual sugar': double (required), 'chlorides': double (required), 'free sulfur dioxide': double (required), 'total sulfur dioxide': double (required), 'density': double (required), 'pH': double (required), 'sulphates': double (required), 'alcohol': double (required), 'quality': long (required)]
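The digest printed above acts as a short content fingerprint for the dataset. The sketch below is not MLflow's algorithm; it only illustrates the idea with the standard library, under the assumption that a digest should stay the same for identical data and change when the data changes. `content_digest` is a hypothetical helper, not part of the MLflow API.

```python
import hashlib


def content_digest(columns, rows, length=8):
    """Return a short hex fingerprint of tabular data (illustrative only)."""
    hasher = hashlib.sha256()
    # Include the column names so that renaming a column changes the digest.
    hasher.update("|".join(columns).encode("utf-8"))
    for row in rows:
        # repr() keeps 1 and 1.0 distinct, so a dtype change also changes the digest.
        hasher.update(repr(tuple(row)).encode("utf-8"))
    return hasher.hexdigest()[:length]


columns = ["fixed acidity", "quality"]
rows_v1 = [(7.0, 6), (6.3, 6)]
rows_v2 = [(7.0, 6), (6.3, 5)]  # one label changed

same = content_digest(columns, rows_v1) == content_digest(columns, rows_v1)
changed = content_digest(columns, rows_v1) != content_digest(columns, rows_v2)
print(same, changed)
```

Comparing digests between runs is a quick way to check whether two runs saw the same data without reloading it.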


Retrieving data from previous run

In [36]:
run_id = "abed61c56e824cfc841dbc1c94137895"

logged_run = mlflow.get_run(run_id)
logged_dataset = logged_run.inputs.dataset_inputs[0].dataset

# Get the data source and reload data
dataset_source = mlflow.data.get_source(logged_dataset)
local_path = dataset_source.load()  # Downloads to local temp file

# Reload the data
reloaded_data = pd.read_csv(local_path, delimiter=";")
print(f"Reloaded {len(reloaded_data)} rows from {local_path}")

Reloaded 4898 rows from C:\Users\User\AppData\Local\Temp\tmpzfezieya\winequality-white.csv
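The cell above takes the first dataset input by position. When a run logs several datasets, it can be safer to select by name or context. The following is a minimal sketch, assuming each entry of `logged_run.inputs.dataset_inputs` exposes `.dataset.name` and a `.tags` list of key/value objects, and that the context is stored under the tag key `mlflow.data.context`. `find_dataset_inputs` is a hypothetical helper, and the stand-in objects only mimic that shape so the example runs without a tracking server.

```python
from types import SimpleNamespace

# Assumption: tag key under which MLflow records the `context` of a logged input.
CONTEXT_TAG_KEY = "mlflow.data.context"


def find_dataset_inputs(dataset_inputs, context=None, name=None):
    """Filter run dataset inputs by context tag and/or dataset name."""
    matches = []
    for dataset_input in dataset_inputs:
        tags = {tag.key: tag.value for tag in dataset_input.tags}
        if context is not None and tags.get(CONTEXT_TAG_KEY) != context:
            continue
        if name is not None and dataset_input.dataset.name != name:
            continue
        matches.append(dataset_input)
    return matches


def fake_input(name, context):
    """Build a stand-in shaped like one entry of `run.inputs.dataset_inputs`."""
    return SimpleNamespace(
        dataset=SimpleNamespace(name=name),
        tags=[SimpleNamespace(key=CONTEXT_TAG_KEY, value=context)],
    )


fake_inputs = [
    fake_input("wine-quality-white", "training"),
    fake_input("wine-quality-evaluation", "evaluation"),
]

evaluation_inputs = find_dataset_inputs(fake_inputs, context="evaluation")
print([item.dataset.name for item in evaluation_inputs])
```

With a real run, `find_dataset_inputs(logged_run.inputs.dataset_inputs, context="training")` would replace the positional `[0]` lookup.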


## Dataset versioning

Track datasets as they evolve. Each version is logged in its own run, so this is mainly a way to manage dataset metadata.

In [None]:
def create_versioned_dataset(data, version, base_name="customer-data"):
  """Create a versioned dataset with metadata."""

  dataset = mlflow.data.from_pandas(
    data,
    source=f"data_pipeline_v{version}",
    name=f"{base_name}-v{version}",
    targets="target",
  )

  with mlflow.start_run(run_name=f"Dataset_Version_{version}"):
    mlflow.log_input(dataset, context="versioning")

    # Log version metadata
    mlflow.log_params(
      {
        "dataset_version": version,
        "data_size": len(data),
        "features_count": len(data.columns) - 1,
        "target_distribution": data["target"].value_counts().to_dict(),
      }
    )

    # Log data quality metrics
    mlflow.log_metrics(
      {
        "missing_values_pct": (data.isnull().sum().sum() / data.size) * 100,
        "duplicate_rows": data.duplicated().sum(),
        "target_balance": data["target"].std(),
      }
    )

  return dataset


# Create multiple versions
# (data_v1, data_v2, data_v3 are placeholder DataFrames with a "target" column)
v1_dataset = create_versioned_dataset(data_v1, "1.0")
v2_dataset = create_versioned_dataset(data_v2, "2.0")
v3_dataset = create_versioned_dataset(data_v3, "3.0")
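`create_versioned_dataset` passes `target_distribution` as a dict, which ends up stored as one stringified parameter. An alternative is to flatten the distribution into one parameter per class so that values can be compared across versions in the UI. This is a minimal sketch using only the standard library; `flatten_distribution` and the `target_count` prefix are hypothetical names chosen for illustration.

```python
from collections import Counter


def flatten_distribution(labels, prefix="target_count"):
    """Turn a sequence of labels into one flat {param_name: count} entry per class."""
    counts = Counter(labels)
    # Sort by the string form of the label so mixed label types do not raise.
    ordered = sorted(counts.items(), key=lambda item: str(item[0]))
    return {f"{prefix}_{label}": count for label, count in ordered}


example_params = flatten_distribution([0, 1, 1, 0, 1])
print(example_params)
```

The resulting dict could be merged into the dict passed to `mlflow.log_params` in place of the nested `target_distribution` entry.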

## Batch prediction monitoring

Record the results of a model's batch prediction on production data

In [None]:
def monitor_batch_predictions(batch_data, model_version, date):
  """Monitor production batch prediction datasets."""

  # Create dataset for batch predictions
  batch_dataset = mlflow.data.from_pandas(
    batch_data,
    source=f"production_batch_{date}",
    name=f"batch_predictions_{date}",
    targets="true_label" if "true_label" in batch_data.columns else None,
    predictions="prediction" if "prediction" in batch_data.columns else None,
  )

  with mlflow.start_run(run_name=f"Batch_Monitor_{date}"):
    mlflow.log_input(batch_dataset, context="production_batch")

    # Log production metadata
    mlflow.log_params(
      {
        "batch_date": date,
        "model_version": model_version,
        "batch_size": len(batch_data),
        "has_ground_truth": "true_label" in batch_data.columns,
      }
    )

    # Monitor prediction distribution
    if "prediction" in batch_data.columns:
      pred_metrics = {
        "prediction_mean": batch_data["prediction"].mean(),
        "prediction_std": batch_data["prediction"].std(),
        "unique_predictions": batch_data["prediction"].nunique(),
      }
      mlflow.log_metrics(pred_metrics)

    # Evaluate if ground truth is available
    if all(col in batch_data.columns for col in ["prediction", "true_label"]):
      result = mlflow.models.evaluate(data=batch_dataset, model_type="classifier")
      print(f"Batch accuracy: {result.metrics.get('accuracy_score', 'N/A')}")

  return batch_dataset


# Usage (daily_batch_data is a placeholder DataFrame of production predictions)
batch_dataset = monitor_batch_predictions(daily_batch_data, "v2.1", "2024-01-15")
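The prediction-distribution metrics in `monitor_batch_predictions` can be checked without pandas. The sketch below recomputes them with the standard library, assuming numeric predictions. `prediction_metrics` is a hypothetical helper. It uses the sample standard deviation, which matches the pandas default for `.std()`.

```python
import statistics


def prediction_metrics(predictions):
    """Compute mean, sample standard deviation and distinct count of predictions."""
    values = list(predictions)
    if not values:
        return {}
    metrics = {
        "prediction_mean": statistics.fmean(values),
        "unique_predictions": len(set(values)),
    }
    # The sample standard deviation is undefined for a single value.
    if len(values) > 1:
        metrics["prediction_std"] = statistics.stdev(values)
    return metrics


batch_metrics = prediction_metrics([6, 6, 5, 7])
print(batch_metrics)
```

A sudden shift in the mean or a drop in the number of distinct predictions between batches is a cheap early signal of drift, even before ground truth arrives.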

## Best practices

1. __Ensure data quality__ - Validate data quality before logging
2. __Consistent naming convention__ - Consistent, descriptive names including version information
3. __Source documentation__ - Always specify meaningful source URLs or identifiers that allow you to trace back to the original data
4. __Context specification__ - Use a clear `context` value for each logged input (e.g. "training", "evaluation", "production_batch")
5. __Metadata logging__ - Include relevant metadata (e.g. data collection, preprocessing steps, data characteristics)
6. __Version control__ - Track versions explicitly

---

# Serving

Toolkit for deploying models to various targets: local environments, cloud services, and Kubernetes.

- __MLflow Model__ - Standard format that packages an ML model with its metadata. Created when a model is logged
- __Docker Containers__ - Package models together with their dependencies
- __Deployment Target__ - Destination environment for the model

<u>How it works</u>

1. MLflow packages the model and its dependencies into a virtual environment or Docker container
2. It launches an inference server with REST endpoints (e.g. FastAPI)
3. The server exposes an API based on the specifications

Generally, there are 2 modules:

- `mlflow models`: for local deployment
- `mlflow deployments`: for custom targets (e.g. SageMaker, Azure, Databricks, Kubernetes)

## Inference details

4 main endpoints:

1. `/invocations` - Inference endpoint that accepts POST requests with input data and returns predictions
2. `/ping` - Used for health checks
3. `/health` - Same as /ping
4. `/version` - Returns the MLflow version
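The health-check endpoint is useful for waiting until a freshly started server is ready before sending requests. The following is a minimal sketch using only the standard library; `http_status` and `wait_until_healthy` are hypothetical helpers, and the base URL in the usage note assumes the local server used elsewhere in this notebook.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout_s=2.0):
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # The server answered, but with an error status.
    except OSError:
        return None  # Connection refused, DNS failure, timeout, etc.


def wait_until_healthy(base_url, timeout_s=30.0, interval_s=0.5, fetch_status=http_status):
    """Poll `<base_url>/ping` until it returns 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch_status(f"{base_url}/ping") == 200:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For example, `wait_until_healthy("http://127.0.0.1:5001")` returns `True` once the server responds, or `False` after 30 seconds. The `fetch_status` argument exists only so the polling logic can be exercised without a running server.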

`/invocations` accepts both CSV and JSON inputs by specifying the Content-Type header as `application/csv` or `application/json`.

In [24]:
raw_data.iloc[0:2].drop('quality', axis=1).to_dict(orient='split')

{'index': [0, 1],
 'columns': ['fixed acidity',
  'volatile acidity',
  'citric acid',
  'residual sugar',
  'chlorides',
  'free sulfur dioxide',
  'total sulfur dioxide',
  'density',
  'pH',
  'sulphates',
  'alcohol'],
 'data': [[7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8],
  [6.3, 0.3, 0.34, 1.6, 0.049, 14.0, 132.0, 0.994, 3.3, 0.49, 9.5]]}

In [None]:
import json
import requests

payload = json.dumps({
  "dataframe_split": raw_data.iloc[0:2].drop('quality', axis=1).to_dict(orient='split')
})

response = requests.post(
  url="http://localhost:5001/invocations",
  data=payload,
  headers={"Content-Type": "application/json"}
)

print(response.json())

{'predictions': [6, 6]}
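The `dataframe_split` payload does not have to come from pandas. The sketch below builds the same structure, plus the row-oriented `dataframe_records` variant, from plain Python lists. The two helper functions are hypothetical names; the column names and values are the two rows shown earlier.

```python
import json


def to_dataframe_split(columns, rows):
    """Build a column-header-plus-rows payload (pandas 'split' orientation)."""
    return {"dataframe_split": {"columns": list(columns), "data": [list(row) for row in rows]}}


def to_dataframe_records(columns, rows):
    """Build a payload with one {column: value} dict per row."""
    return {"dataframe_records": [dict(zip(columns, row)) for row in rows]}


columns = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]
rows = [
    [7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.001, 3.0, 0.45, 8.8],
    [6.3, 0.3, 0.34, 1.6, 0.049, 14.0, 132.0, 0.994, 3.3, 0.49, 9.5],
]

split_payload = json.dumps(to_dataframe_split(columns, rows))
records_payload = json.dumps(to_dataframe_records(columns, rows))
print(split_payload)
```

Either string can be sent as the request body with `Content-Type: application/json`. The split form is more compact because column names are not repeated for every row.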


Example POST request with OpenAI-style chat messages and `params`. Parameters are only applied if they are declared in the model signature.

In [None]:
payload = json.dumps(
  {
    "inputs": {"messages": [{"role": "user", "content": "Tell a joke!"}]},
    "params": {
      "temperature": 0.5,
      "max_tokens": 20,
    },
  }
)
response = requests.post(
  url="http://localhost:5678/invocations",
  data=payload,
  headers={"Content-Type": "application/json"},
)
print(response.json())
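The request cells above print the response without checking for errors. As a rough sketch, the helpers below build the same `inputs` plus optional `params` body and surface the server's error message when the status is not successful. They use only the standard library instead of `requests`, and `build_request` and `invoke` are hypothetical names.

```python
import json
import urllib.error
import urllib.request


def build_request(url, inputs, params=None):
    """Create a JSON POST request carrying `inputs` and optional `params`."""
    body = {"inputs": inputs}
    if params:
        body["params"] = params
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke(request, timeout_s=10.0):
    """Send the request and return the decoded JSON response."""
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as err:
        # Include the response body, which usually explains what was rejected.
        detail = err.read().decode("utf-8", errors="replace")
        raise RuntimeError(f"Scoring server returned HTTP {err.code}: {detail}") from err


chat_request = build_request(
    "http://localhost:5678/invocations",
    inputs={"messages": [{"role": "user", "content": "Tell a joke!"}]},
    params={"temperature": 0.5, "max_tokens": 20},
)
```

Calling `invoke(chat_request)` against a running server returns the parsed predictions, or raises with the server's explanation, for instance when a parameter is not part of the model signature.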

## Local inference server

In [2]:
run_id = "845d98b9e982460d8f1615530b2b7551"
print(f"mlflow models serve -m runs:/{run_id}/wine-quality-classifier -p 5001 -h 0.0.0.0 --env-manager local")

mlflow models serve -m runs:/845d98b9e982460d8f1615530b2b7551/wine-quality-classifier -p 5001 -h 0.0.0.0 --env-manager local


POST request for inference

`curl http://127.0.0.1:5001/invocations -H "Content-Type: application/json" --data '{"inputs": [[7.0, 0.27, 0.36, 20.7, 0.045, 45.0, 170.0, 1.0010, 3.00, 0.45, 8.8]]}'`
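Instead of printing the serve command and pasting it into a terminal, the argument list can be assembled in Python and handed to `subprocess`. This is a minimal sketch: `build_serve_command` is a hypothetical helper, and `--env-manager local` is used on the assumption that the model should run in the current Python environment.

```python
import shlex


def build_serve_command(run_id, artifact_name, port=5001, host="0.0.0.0"):
    """Return the `mlflow models serve` command as an argument list."""
    return [
        "mlflow", "models", "serve",
        "-m", f"runs:/{run_id}/{artifact_name}",
        "-p", str(port),
        "-h", host,
        "--env-manager", "local",  # reuse the current environment
    ]


command = build_serve_command(
    "845d98b9e982460d8f1615530b2b7551", "wine-quality-classifier"
)
print(shlex.join(command))
```

Passing `command` to `subprocess.Popen` would start the server in the background. Keeping the arguments as a list avoids shell-quoting problems if a model name ever contains spaces.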

## Build a docker image

<u>Requirements to run from host docker client</u>

- `boto3` must be installed in the Python environment
- The following environment variables must be set on the host machine
  * MLFLOW_TRACKING_URI
  * MLFLOW_S3_ENDPOINT_URL
  * AWS_ACCESS_KEY_ID
  * AWS_SECRET_ACCESS_KEY
- The `/etc/hosts` file must contain a `minio` entry pointing to localhost (optional if configured as such in the Docker container)

NOTE: 8080 is the default port that the inference server listens on inside the container
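A quick pre-flight check can catch a missing variable before the image build fails partway through. The sketch below uses only the standard library; `missing_env_vars` is a hypothetical helper, and the variable names are the ones listed above.

```python
import os

REQUIRED_ENV_VARS = (
    "MLFLOW_TRACKING_URI",
    "MLFLOW_S3_ENDPOINT_URL",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_env_vars(required=REQUIRED_ENV_VARS, environ=None):
    """Return the required variable names that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {
    "MLFLOW_TRACKING_URI": "http://127.0.0.1:5000",
    "MLFLOW_S3_ENDPOINT_URL": "http://127.0.0.1:9000",
}
print(missing_env_vars(environ=example_env))
```

Calling `missing_env_vars()` with no arguments checks the real environment; an empty list means the build command can be run.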

In [None]:
# Run this on host machine
print(f'mlflow models build-docker --model-uri "runs:/{run_id}/wine-quality-classifier" --name "wine-classifier-image"')

mlflow models build-docker --model-uri "runs:/845d98b9e982460d8f1615530b2b7551/wine-quality-classifier" --name "wine-classifier-image"


To run it with `MLServer` as the serving framework instead of FastAPI, add the `--enable-mlserver` flag

In [15]:
# Then to run the container
print("docker run -p 5002:8080 wine-classifier-image")

docker run -p 5002:8080 wine-classifier-image


Other deployment options are covered in the MLflow documentation:

- [Deploying to Kubernetes](https://mlflow.org/docs/latest/ml/deployment/deploy-model-to-kubernetes/tutorial/)
- [Deploying to SageMaker](https://mlflow.org/docs/latest/ml/deployment/deploy-model-to-sagemaker/)

In [8]:
%load_ext watermark
%watermark

Last updated: 2026-01-14T07:41:11.399676+08:00

Python implementation: CPython
Python version       : 3.11.9
IPython version      : 9.8.0

Compiler    : MSC v.1938 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 183 Stepping 1, GenuineIntel
CPU cores   : 20
Architecture: 64bit

