# MLflow Dataset Tracking Tutorial

The `mlflow.data` module lets you record and retrieve dataset information during model training and evaluation, building on MLflow's tracking capabilities.



## Dataset

The `Dataset` abstraction is a metadata tracking object that holds information about a logged dataset.

A `Dataset` object stores the features, targets, and predictions, along with metadata such as the dataset's name, digest (hash), schema, and profile. You can log this metadata with the `mlflow.log_input()` API. The module also provides functions that construct `mlflow.data.dataset.Dataset` objects from various data types.

There are a number of concrete implementations of this abstract class, including:

* mlflow.data.spark_dataset.SparkDataset

* mlflow.data.pandas_dataset.PandasDataset

* mlflow.data.numpy_dataset.NumpyDataset

* mlflow.data.huggingface_dataset.HuggingFaceDataset

* mlflow.data.tensorflow_dataset.TensorFlowDataset
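
To make the idea of a digest concrete, here is a minimal sketch of a content fingerprint built with Python's standard `hashlib`. It is not MLflow's actual digest algorithm; the `content_digest` helper and the toy rows are hypothetical and only illustrate why a digest changes when the data changes.

```python
import hashlib


def content_digest(rows, length=8):
    """Return a short hex fingerprint of an iterable of rows (hypothetical helper)."""
    hasher = hashlib.sha256()
    for row in rows:
        # repr() gives a stable text form for simple values such as numbers and strings
        hasher.update(repr(tuple(row)).encode("utf-8"))
    return hasher.hexdigest()[:length]


# Toy rows invented for illustration, not taken from a real dataset
rows_v1 = [(7.0, 0.27, 6), (6.3, 0.30, 6)]
rows_v2 = [(7.0, 0.27, 6), (6.3, 0.30, 5)]  # one value changed

digest_v1 = content_digest(rows_v1)
digest_v2 = content_digest(rows_v2)
print(digest_v1, digest_v2, digest_v1 == digest_v2)
```

Identical content always yields the same fingerprint, so a digest can tell two versions of a dataset apart even when they share a name.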

In [None]:
import mlflow
mlflow.login()

In [None]:
name = "/Shared/Experiments/tracking-datasets"
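try:
    # create_experiment returns the ID of the newly created experiment
    experiment_id = mlflow.create_experiment(
        name=name,
        tags={
            "mlflow.note.content": "Tracking dataset tutorial",
            "project_name": "UNKNOWN",
            "topic": "run_management",
        },
    )
except mlflow.exceptions.MlflowException as exc:
    # Typically raised when an experiment with this name already exists
    print(f"Could not create experiment (it may already exist): {exc}")

# Make the experiment active; set_experiment returns the Experiment object
experiment = mlflow.set_experiment(name)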

In [None]:
import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset


dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")

# Create an instance of a PandasDataset
dataset = mlflow.data.from_pandas(
    raw_data, source=dataset_source_url, name="wine quality - white", targets="quality"
)

In [None]:
dataset.df

In [None]:
# get the schema
dataset.schema

In [None]:
# get the targets
dataset.targets

In [None]:
# get the source
dataset.source
# dataset.source.url

In [None]:
# get the predictions column used for model evaluation
# (None here, because no predictions were passed to from_pandas)
dataset.predictions

In [None]:
# log dataset
with mlflow.start_run(run_name="logging_dataset") as run:
    mlflow.log_input(dataset=dataset, context="Testing", tags={"dataset_version": "v1"})

# get the run
logged_run = mlflow.get_run(run.info.run_id)

In [None]:
logged_dataset = logged_run.inputs.dataset_inputs[0].dataset

In [None]:
# View some of the recorded Dataset information
print(f"Dataset name: {logged_dataset.name}")
print(f"Dataset digest: {logged_dataset.digest}")
print(f"Dataset profile: {logged_dataset.profile}")
print(f"Dataset schema: {logged_dataset.schema}")
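
The schema and profile of a logged dataset come back as JSON strings, so it can help to decode them before use. The sketch below shows one way to do that with the standard `json` module. The two example strings are hand-written stand-ins rather than output captured from a real run, the key names (`mlflow_colspec`, `num_rows`, `num_elements`) are an assumption about how a pandas-backed dataset is serialized, and `parse_json_field` is a hypothetical helper.

```python
import json


def parse_json_field(value):
    """Decode a JSON string field, passing None through unchanged (hypothetical helper)."""
    return None if value is None else json.loads(value)


# Hand-written stand-ins for logged_dataset.schema and logged_dataset.profile
example_schema = (
    '{"mlflow_colspec": ['
    '{"type": "double", "name": "alcohol"}, '
    '{"type": "long", "name": "quality"}]}'
)
example_profile = '{"num_rows": 2, "num_elements": 4}'

schema = parse_json_field(example_schema)
profile = parse_json_field(example_profile)

# Pull the column names out of the decoded schema
column_names = [col["name"] for col in schema["mlflow_colspec"]]
print(column_names, profile["num_rows"])
```

In the notebook you would pass `logged_dataset.schema` and `logged_dataset.profile` to the helper instead of the stand-in strings, after checking which keys your own run actually contains.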

## Retrieving data

In [None]:
# Loading the dataset's source
dataset_source = mlflow.data.get_source(logged_dataset)

local_dataset = dataset_source.load()

print(f"The local file where the data has been downloaded to: {local_dataset}")

# Load the data again
loaded_data = pd.read_csv(local_dataset, delimiter=";")

In [None]:
loaded_data
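
The `delimiter=";"` argument in the `pd.read_csv` call matters because the file separates fields with semicolons rather than commas. As a small illustration, the sketch below parses a two-line, hand-written sample (not the real file contents) with Python's standard `csv` module, once with the right delimiter and once with the default comma. The `parse` helper is hypothetical.

```python
import csv
import io

# Hand-written sample in a semicolon-separated style (not the real file contents)
sample = "alcohol;quality\n8.8;6\n"


def parse(text, delimiter):
    """Parse CSV text into a list of rows using the given delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


rows_semicolon = parse(sample, ";")  # fields are split correctly
rows_comma = parse(sample, ",")      # each line stays a single field
print(rows_semicolon)
print(rows_comma)
```

With the wrong delimiter every line collapses into one column, which is why the reload step repeats the same `delimiter` that was used when the data was first read.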