# Dataset Tracking

The `mlflow.data` module lets you record and retrieve dataset information during model training and evaluation, using MLflow’s tracking capabilities.

## Dataset

The `Dataset` abstraction is a metadata tracking object that holds information about a logged dataset.

The information stored in a `Dataset` object includes features, targets, and predictions, along with metadata such as the dataset’s name, digest (hash), schema, and profile. You can log this metadata with the `mlflow.log_input()` API. The module also provides functions to construct `mlflow.data.dataset.Dataset` objects from various data types.

There are several concrete implementations of this abstract class, including:

* `mlflow.data.spark_dataset.SparkDataset`
* `mlflow.data.pandas_dataset.PandasDataset`
* `mlflow.data.numpy_dataset.NumpyDataset`
* `mlflow.data.huggingface_dataset.HuggingFaceDataset`
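
The digest listed among the metadata is a short hash that identifies a dataset's content, so the same data should yield the same digest each time it is logged. The sketch below illustrates that general idea using only the standard library. It is not MLflow's actual digest algorithm: the `short_digest` helper, the choice of SHA-256, the 8-character length, and the sample byte strings are all assumptions made for illustration.

```python
import hashlib


def short_digest(data: bytes, length: int = 8) -> str:
    """Return the first `length` hex characters of the SHA-256 hash of `data`.

    Hypothetical helper for illustration; MLflow computes its digests differently.
    """
    return hashlib.sha256(data).hexdigest()[:length]


# Two tiny, made-up CSV payloads that differ in a single value.
rows_v1 = b"fixed acidity;quality\n7.0;6\n6.3;6\n"
rows_v2 = b"fixed acidity;quality\n7.0;6\n6.3;5\n"

digest_v1 = short_digest(rows_v1)
digest_v2 = short_digest(rows_v2)

# Hashing the same bytes again reproduces the same digest.
same_again = short_digest(rows_v1) == digest_v1

print(digest_v1, digest_v2, same_again)
```

A short, deterministic identifier like this makes it cheap to check whether two runs were given identical input data without storing the data itself.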

In [None]:
import mlflow
from mlflow_for_ml_dev.src.utils.folder_operations import get_project_root

# Point MLflow at the local "mlruns" directory in the project root
mlflow.set_tracking_uri(uri=(get_project_root() / 'mlruns').as_uri())

## Pandas Dataset

In [2]:
import pandas as pd

dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")

# Create an instance of a PandasDataset
dataset = mlflow.data.from_pandas(
    raw_data, source=dataset_source_url, name="wine quality - white", targets="quality"
)

In [3]:
dataset.to_dict()



{'name': 'wine quality - white',
 'digest': '2a1e42c4',
 'source': '{"url": "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"}',
 'source_type': 'http',
 'schema': '{"mlflow_colspec": [{"type": "double", "name": "fixed acidity", "required": true}, {"type": "double", "name": "volatile acidity", "required": true}, {"type": "double", "name": "citric acid", "required": true}, {"type": "double", "name": "residual sugar", "required": true}, {"type": "double", "name": "chlorides", "required": true}, {"type": "double", "name": "free sulfur dioxide", "required": true}, {"type": "double", "name": "total sulfur dioxide", "required": true}, {"type": "double", "name": "density", "required": true}, {"type": "double", "name": "pH", "required": true}, {"type": "double", "name": "sulphates", "required": true}, {"type": "double", "name": "alcohol", "required": true}, {"type": "long", "name": "quality", "required": true}]}',
 'profile': '{"num_rows": 4898, "num

In [4]:
with mlflow.start_run(run_name="pandas-dataset") as run:
    mlflow.log_input(dataset=dataset, context="raw_data")

In [5]:
run = mlflow.get_run(run.info.run_id)

In [6]:
dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {dataset_info.name}")
print(f"Dataset digest: {dataset_info.digest}")
print(f"Dataset profile: {dataset_info.profile}")
print(f"Dataset schema: {dataset_info.schema}")

Dataset name: wine quality - white
Dataset digest: 2a1e42c4
Dataset profile: {"num_rows": 4898, "num_elements": 58776}
Dataset schema: {"mlflow_colspec": [{"type": "double", "name": "fixed acidity", "required": true}, {"type": "double", "name": "volatile acidity", "required": true}, {"type": "double", "name": "citric acid", "required": true}, {"type": "double", "name": "residual sugar", "required": true}, {"type": "double", "name": "chlorides", "required": true}, {"type": "double", "name": "free sulfur dioxide", "required": true}, {"type": "double", "name": "total sulfur dioxide", "required": true}, {"type": "double", "name": "density", "required": true}, {"type": "double", "name": "pH", "required": true}, {"type": "double", "name": "sulphates", "required": true}, {"type": "double", "name": "alcohol", "required": true}, {"type": "long", "name": "quality", "required": true}]}
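
The `profile` and `schema` values printed above are JSON strings, so they need to be decoded before individual fields can be used. The following is a minimal sketch using the standard `json` module. The two string literals are copied from the output above, with the schema shortened to three of its twelve columns to keep the example compact. In the notebook you would pass `dataset_info.profile` and `dataset_info.schema` instead.

```python
import json

# Copied from the printed output above; the schema is shortened to three columns.
profile_json = '{"num_rows": 4898, "num_elements": 58776}'
schema_json = (
    '{"mlflow_colspec": ['
    '{"type": "double", "name": "fixed acidity", "required": true}, '
    '{"type": "double", "name": "alcohol", "required": true}, '
    '{"type": "long", "name": "quality", "required": true}]}'
)

profile = json.loads(profile_json)
columns = json.loads(schema_json)["mlflow_colspec"]

# Map each column name to its logged type.
column_types = {col["name"]: col["type"] for col in columns}

# For a rectangular table, elements divided by rows gives the column count.
num_columns = profile["num_elements"] // profile["num_rows"]

print(column_types)
print(num_columns)
```

The column count derived from the profile (12) matches the number of entries in the full logged schema, which is a quick consistency check on the recorded metadata.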


In [7]:
# Resolve the dataset's source, then download its content from the source URL to the
# local filesystem; load() returns the path of the downloaded file
dataset_source = mlflow.data.get_source(dataset_info)
df_url = dataset_source.load()

In [8]:
df_url

'C:\\Users\\manue\\AppData\\Local\\Temp\\tmp1zp6phnt\\winequality-white.csv'

In [9]:
retrieved_df = pd.read_csv(df_url, delimiter=";")

In [10]:
retrieved_df.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


## Numpy Dataset

### Basic Example

In [11]:
import numpy as np

x = np.random.uniform(size=[2, 5, 4])
y = np.random.randint(2, size=[2])
np_dataset = mlflow.data.from_numpy(x, targets=y)

In [12]:
np_dataset.features

array([[[0.74909247, 0.25861044, 0.00820082, 0.37053635],
        [0.50643759, 0.95636559, 0.72032466, 0.74361884],
        [0.65365618, 0.2325535 , 0.19440739, 0.87896178],
        [0.78294626, 0.88914739, 0.64461299, 0.81639379],
        [0.34935933, 0.80011475, 0.27178135, 0.40919264]],

       [[0.59434332, 0.62487433, 0.49780964, 0.89125117],
        [0.38849287, 0.99917886, 0.85572353, 0.3784318 ],
        [0.04683443, 0.86641894, 0.20997142, 0.89997806],
        [0.02298351, 0.48882069, 0.16025171, 0.68334543],
        [0.89960598, 0.30288663, 0.79732161, 0.80080448]]])

### Basic Example with a Dictionary of Features

In [13]:
x = {
    "feature_1": np.random.uniform(size=[2, 5, 4]),
    "feature_2": np.random.uniform(size=[2, 5, 4]),
}
y = np.random.randint(2, size=[2])
np_dataset = mlflow.data.from_numpy(x, targets=y)

In [14]:
np_dataset.features

{'feature_1': array([[[0.36290722, 0.38741   , 0.02211541, 0.46864922],
         [0.9807029 , 0.40795016, 0.41779174, 0.41555896],
         [0.84063979, 0.34150833, 0.58429674, 0.43479671],
         [0.11596398, 0.25141929, 0.28390554, 0.28194285],
         [0.84146728, 0.55096242, 0.37103449, 0.63768243]],
 
        [[0.93745114, 0.92876871, 0.65934828, 0.48539714],
         [0.70965727, 0.85347486, 0.57402007, 0.191737  ],
         [0.33812805, 0.63079853, 0.63539026, 0.09653044],
         [0.14938437, 0.24287166, 0.23203401, 0.29244357],
         [0.71289638, 0.30510724, 0.4318705 , 0.93104434]]]),
 'feature_2': array([[[0.42833608, 0.27700281, 0.38490294, 0.31226781],
         [0.87006845, 0.40175085, 0.04193233, 0.26001864],
         [0.44368361, 0.40548522, 0.82130109, 0.51583947],
         [0.63997436, 0.1657265 , 0.4245044 , 0.89516391],
         [0.61064067, 0.51279222, 0.92094368, 0.19553562]],
 
        [[0.83106127, 0.05746171, 0.37423863, 0.45036939],
         [0.78553223,

In [15]:
# Log the dataset as a run input
with mlflow.start_run(run_name="numpy-dataset") as run:
    mlflow.log_input(dataset=np_dataset, context="raw_data")

In [16]:
# Retrieve the logged NumPy dataset metadata from the run
run = mlflow.get_run(run.info.run_id)
np_dataset_info = run.inputs.dataset_inputs[0].dataset
print(f"Dataset name: {np_dataset_info.name}")
print(f"Dataset digest: {np_dataset_info.digest}")
print(f"Dataset profile: {np_dataset_info.profile}")
print(f"Dataset schema: {np_dataset_info.schema}")


Dataset name: dataset
Dataset digest: d3a84b69
Dataset profile: {"features_shape": {"feature_1": [2, 5, 4], "feature_2": [2, 5, 4]}, "features_size": {"feature_1": 40, "feature_2": 40}, "features_nbytes": {"feature_1": 320, "feature_2": 320}, "targets_shape": [2], "targets_size": 2, "targets_nbytes": 8}
Dataset schema: {"mlflow_tensorspec": {"features": "[{\"name\": \"feature_1\", \"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"float64\", \"shape\": [-1, 5, 4]}}, {\"name\": \"feature_2\", \"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"float64\", \"shape\": [-1, 5, 4]}}]", "targets": "[{\"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"int32\", \"shape\": [-1]}}]"}}
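
In the NumPy schema printed above, the `features` and `targets` entries are themselves JSON strings nested inside the outer JSON document, so reading them takes a second decoding step. The sketch below shows that with the standard library, using the profile and schema strings copied from the output above. It also cross-checks the profile: each `features_size` should equal the product of the corresponding `features_shape`. The leading `-1` in the schema shapes appears to stand for a variable-length first dimension; treat that reading as an assumption.

```python
import json
from math import prod

# Copied from the printed output above (raw string keeps the \" escapes intact).
profile_json = '{"features_shape": {"feature_1": [2, 5, 4], "feature_2": [2, 5, 4]}, "features_size": {"feature_1": 40, "feature_2": 40}, "features_nbytes": {"feature_1": 320, "feature_2": 320}, "targets_shape": [2], "targets_size": 2, "targets_nbytes": 8}'
schema_json = r'{"mlflow_tensorspec": {"features": "[{\"name\": \"feature_1\", \"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"float64\", \"shape\": [-1, 5, 4]}}, {\"name\": \"feature_2\", \"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"float64\", \"shape\": [-1, 5, 4]}}]", "targets": "[{\"type\": \"tensor\", \"tensor-spec\": {\"dtype\": \"int32\", \"shape\": [-1]}}]"}}'

# First decode: the outer document. Second decode: the nested feature/target specs.
tensorspec = json.loads(schema_json)["mlflow_tensorspec"]
features = json.loads(tensorspec["features"])
targets = json.loads(tensorspec["targets"])

feature_shapes = {f["name"]: f["tensor-spec"]["shape"] for f in features}
target_dtype = targets[0]["tensor-spec"]["dtype"]

# Cross-check the profile: size is the product of the shape,
# and nbytes / size gives the bytes used per element.
profile = json.loads(profile_json)
sizes_match = all(
    prod(shape) == profile["features_size"][name]
    for name, shape in profile["features_shape"].items()
)
feature_bytes_per_element = {
    name: profile["features_nbytes"][name] // profile["features_size"][name]
    for name in profile["features_size"]
}
target_bytes_per_element = profile["targets_nbytes"] // profile["targets_size"]

print(feature_shapes, target_dtype)
print(sizes_match, feature_bytes_per_element, target_bytes_per_element)
```

The per-element byte counts (8 for the features, 4 for the targets) agree with the `float64` and `int32` dtypes recorded in the schema.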
