# Dataset Tracking

The `mlflow.data` module lets you record and retrieve dataset information during model training and evaluation, using MLflow’s tracking capabilities.

## Dataset

The `Dataset` abstraction is a metadata tracking object that holds information about a logged dataset.

A `Dataset` object stores features, targets, and predictions, along with metadata such as the dataset’s name, digest (hash), schema, and profile. You can log this metadata with the `mlflow.log_input()` API. The `mlflow.data` module provides functions to construct `mlflow.data.dataset.Dataset` objects from various data types.

Concrete implementations of this abstract class include:

* mlflow.data.spark_dataset.SparkDataset

* mlflow.data.pandas_dataset.PandasDataset

* mlflow.data.numpy_dataset.NumpyDataset

* mlflow.data.huggingface_dataset.HuggingFaceDataset


In [1]:
from mlflow_for_ml_dev.experiments.exp_utils import get_or_create_experiment
import mlflow 

In [10]:
# Create experiment 

experiment_name = "tracking_datasets"
tags = {"mlflow.note.content": "This experiment is used to track datasets used in the project"}
experiment_id = get_or_create_experiment(
    experiment_name=experiment_name,
    tags=tags
)

Experiment with name tracking_datasets and ID 528950709966603598 created.


## Pandas Dataset

In [None]:
from mlflow.data.pandas_dataset import PandasDataset
import pandas as pd

dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")

# Create an instance of a PandasDataset
dataset = mlflow.data.from_pandas(
    raw_data, source=dataset_source_url, name="wine quality - white", targets="quality"
)



In [9]:
dataset.to_dict()

{'name': 'wine quality - white',
 'digest': '2a1e42c4',
 'source': '{"url": "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"}',
 'source_type': 'http',
 'schema': '{"mlflow_colspec": [{"type": "double", "name": "fixed acidity", "required": true}, {"type": "double", "name": "volatile acidity", "required": true}, {"type": "double", "name": "citric acid", "required": true}, {"type": "double", "name": "residual sugar", "required": true}, {"type": "double", "name": "chlorides", "required": true}, {"type": "double", "name": "free sulfur dioxide", "required": true}, {"type": "double", "name": "total sulfur dioxide", "required": true}, {"type": "double", "name": "density", "required": true}, {"type": "double", "name": "pH", "required": true}, {"type": "double", "name": "sulphates", "required": true}, {"type": "double", "name": "alcohol", "required": true}, {"type": "long", "name": "quality", "required": true}]}',
 'profile': '{"num_rows": 4898, "num_elements": 58776}'}

In [11]:
with mlflow.start_run(run_name="pandas-dataset") as run:
    mlflow.log_input(dataset=dataset, context="raw_data")

In [12]:
run = mlflow.get_run(run.info.run_id)

In [22]:
dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {dataset_info.name}")
print(f"Dataset digest: {dataset_info.digest}")
print(f"Dataset profile: {dataset_info.profile}")
print(f"Dataset schema: {dataset_info.schema}")

Dataset name: wine quality - white
Dataset digest: 2a1e42c4
Dataset profile: {"num_rows": 4898, "num_elements": 58776}
Dataset schema: {"mlflow_colspec": [{"type": "double", "name": "fixed acidity", "required": true}, {"type": "double", "name": "volatile acidity", "required": true}, {"type": "double", "name": "citric acid", "required": true}, {"type": "double", "name": "residual sugar", "required": true}, {"type": "double", "name": "chlorides", "required": true}, {"type": "double", "name": "free sulfur dioxide", "required": true}, {"type": "double", "name": "total sulfur dioxide", "required": true}, {"type": "double", "name": "density", "required": true}, {"type": "double", "name": "pH", "required": true}, {"type": "double", "name": "sulphates", "required": true}, {"type": "double", "name": "alcohol", "required": true}, {"type": "long", "name": "quality", "required": true}]}
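The profile and schema come back from the run as JSON strings, so they need to be parsed before use. As a small sketch using only the standard library, the code below decodes the two strings printed above (copied verbatim), builds a column-to-type mapping, and checks that `num_elements` equals the row count times the column count for this dataset. The variable names are illustrative.

```python
import json

# Strings copied from the output above.
profile_json = '{"num_rows": 4898, "num_elements": 58776}'
schema_json = (
    '{"mlflow_colspec": ['
    '{"type": "double", "name": "fixed acidity", "required": true}, '
    '{"type": "double", "name": "volatile acidity", "required": true}, '
    '{"type": "double", "name": "citric acid", "required": true}, '
    '{"type": "double", "name": "residual sugar", "required": true}, '
    '{"type": "double", "name": "chlorides", "required": true}, '
    '{"type": "double", "name": "free sulfur dioxide", "required": true}, '
    '{"type": "double", "name": "total sulfur dioxide", "required": true}, '
    '{"type": "double", "name": "density", "required": true}, '
    '{"type": "double", "name": "pH", "required": true}, '
    '{"type": "double", "name": "sulphates", "required": true}, '
    '{"type": "double", "name": "alcohol", "required": true}, '
    '{"type": "long", "name": "quality", "required": true}'
    ']}'
)

profile = json.loads(profile_json)
columns = json.loads(schema_json)["mlflow_colspec"]

# Map each column name to its logged type.
column_types = {col["name"]: col["type"] for col in columns}

print(len(columns))                # 12
print(column_types["quality"])     # long
print(profile["num_elements"] == profile["num_rows"] * len(columns))  # True
```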


In [26]:
# Load the dataset's source: this downloads the content from the source URL to the
# local filesystem and returns the path of the downloaded file
dataset_source = mlflow.data.get_source(dataset_info)
df_url = dataset_source.load()

In [29]:
retrieved_df = pd.read_csv(df_url, delimiter=";")

In [30]:
retrieved_df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


## Numpy Dataset

### Basic Example

In [None]:
import numpy as np

x = np.random.uniform(size=[2, 5, 4])
y = np.random.randint(2, size=[2])
np_dataset = mlflow.data.from_numpy(x, targets=y)

In [36]:
np_dataset.features

array([[[0.18453265, 0.63159549, 0.04967797, 0.21466798],
        [0.98902142, 0.97320733, 0.7824624 , 0.91210244],
        [0.73143932, 0.55768062, 0.19484916, 0.36995325],
        [0.43176435, 0.85734025, 0.7916474 , 0.52764688],
        [0.25803976, 0.11513334, 0.44953914, 0.56404576]],

       [[0.77407644, 0.55709479, 0.84733835, 0.41060854],
        [0.13880843, 0.30767282, 0.06552621, 0.21235701],
        [0.45623034, 0.48439548, 0.02537879, 0.30928973],
        [0.75213861, 0.6372964 , 0.94310921, 0.87853547],
        [0.27740635, 0.39767504, 0.93397464, 0.59336445]]])

### Basic Example with a Dictionary of Features

In [42]:
x = {
    "feature_1": np.random.uniform(size=[2, 5, 4]),
    "feature_2": np.random.uniform(size=[2, 5, 4]),
}
y = np.random.randint(2, size=[2])
np_dataset = mlflow.data.from_numpy(x, targets=y)

In [39]:
np_dataset.features

{'feature_1': array([[[4.55826568e-02, 4.78446923e-01, 3.15908997e-01, 3.79682041e-01],
         [4.17614629e-01, 6.07581693e-01, 2.85869982e-01, 5.51701404e-02],
         [1.96942372e-01, 1.30319372e-01, 6.34504643e-01, 8.64324696e-01],
         [6.57352854e-01, 9.13275328e-01, 4.24117694e-01, 1.94300361e-01],
         [5.38021309e-01, 9.12497340e-02, 9.62126486e-01, 5.32253657e-01]],
 
        [[1.71470016e-01, 8.43150765e-01, 8.60694693e-01, 6.38001332e-01],
         [6.14422882e-02, 1.06908744e-04, 2.61745933e-01, 8.20126831e-01],
         [5.59659619e-01, 4.24912754e-01, 2.02055064e-01, 9.44831362e-01],
         [4.91865078e-01, 7.31700156e-01, 3.30916939e-01, 2.79093727e-01],
         [7.75922987e-01, 6.18393515e-01, 1.64727812e-01, 1.02433782e-02]]]),
 'feature_2': array([[[0.07337427, 0.80992585, 0.35398261, 0.96412994],
         [0.83556886, 0.01345509, 0.9248515 , 0.23248917],
         [0.45451284, 0.1213721 , 0.6359282 , 0.06176179],
         [0.38621753, 0.65032151, 0.74505

In [43]:
with mlflow.start_run(run_name="numpy-dataset") as run:
    mlflow.log_input(dataset=np_dataset, context="raw_data")

In [44]:
# Retrieve the logged NumPy dataset metadata from the run
run = mlflow.get_run(run.info.run_id)
np_dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {np_dataset_info.name}")
print(f"Dataset digest: {np_dataset_info.digest}")
print(f"Dataset profile: {np_dataset_info.profile}")
print(f"Dataset schema: {np_dataset_info.schema}")


Dataset name: dataset
Dataset digest: 42c20947
Dataset profile: {"features_shape": {"feature_1": [2, 5, 4], "feature_2": [2, 5, 4]}, "features_size": {"feature_1": 40, "feature_2": 40}, "features_nbytes": {"feature_1": 320, "feature_2": 320}, "targets_shape": [2], "targets_size": 2, "targets_nbytes": 8}
Dataset schema: {"mlflow_tensorspec": {"features": "[{\"name\": \"feature_1\", \"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"float64\", \"shape\": [-1, 5, 4]}}, {\"name\": \"feature_2\", \"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"float64\", \"shape\": [-1, 5, 4]}}]", "targets": "[{\"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"int32\", \"shape\": [-1]}}]"}}
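Note that the tensor schema printed above is JSON nested inside JSON: the `features` and `targets` entries are themselves JSON-encoded strings, so they need a second `json.loads` call. A minimal standard-library sketch, using the two strings copied verbatim from the output above (variable names are illustrative):

```python
import json

# Strings copied from the output above; raw literals keep the escaped quotes intact.
profile_json = (
    '{"features_shape": {"feature_1": [2, 5, 4], "feature_2": [2, 5, 4]}, '
    '"features_size": {"feature_1": 40, "feature_2": 40}, '
    '"features_nbytes": {"feature_1": 320, "feature_2": 320}, '
    '"targets_shape": [2], "targets_size": 2, "targets_nbytes": 8}'
)
schema_json = (
    r'{"mlflow_tensorspec": {"features": "[{\"name\": \"feature_1\", \"type\": \"tensor\", '
    r'\"tensor-spec\": {\"dtype\": \"float64\", \"shape\": [-1, 5, 4]}}, '
    r'{\"name\": \"feature_2\", \"type\": \"tensor\", '
    r'\"tensor-spec\": {\"dtype\": \"float64\", \"shape\": [-1, 5, 4]}}]", '
    r'"targets": "[{\"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"int32\", \"shape\": [-1]}}]"}}'
)

profile = json.loads(profile_json)
tensorspec = json.loads(schema_json)["mlflow_tensorspec"]

# Second decoding pass: each entry is a JSON string holding a list of specs.
feature_specs = json.loads(tensorspec["features"])
target_specs = json.loads(tensorspec["targets"])

total_bytes = sum(profile["features_nbytes"].values()) + profile["targets_nbytes"]

print([spec["name"] for spec in feature_specs])  # ['feature_1', 'feature_2']
print(feature_specs[0]["tensor-spec"]["shape"])  # [-1, 5, 4]
print(target_specs[0]["tensor-spec"]["dtype"])   # int32
print(total_bytes)                               # 648
```

The leading `-1` in each shape marks a variable-sized first dimension, which is why the logged schema shows `[-1, 5, 4]` while the profile records the concrete shape `[2, 5, 4]`.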
