# MLflow Dataset Tracking Tutorial

The `mlflow.data` module lets you record and retrieve dataset information during model training and evaluation, using MLflow’s tracking capabilities.



## Dataset

The `Dataset` abstraction is a metadata-tracking object that holds information about a logged dataset.

A `Dataset` object stores features, targets, and predictions, along with metadata such as the dataset’s name, digest (hash), schema, and profile. You can log this metadata with the `mlflow.log_input()` API. The module provides functions, such as `mlflow.data.from_pandas()`, that construct `mlflow.data.dataset.Dataset` objects from various data types.

Concrete implementations of this abstract class include:

* `mlflow.data.spark_dataset.SparkDataset`

* `mlflow.data.pandas_dataset.PandasDataset`

* `mlflow.data.numpy_dataset.NumpyDataset`

* `mlflow.data.huggingface_dataset.HuggingFaceDataset`

* `mlflow.data.tensorflow_dataset.TensorFlowDataset`

In [1]:
import mlflow

# Prompt for credentials and connect to the hosted (Databricks) tracking server
mlflow.login()

2024/06/15 22:27:31 INFO mlflow.utils.credentials: Successfully connected to MLflow hosted tracking server! Host: https://adb-3088650010345545.5.azuredatabricks.net.


In [2]:
name = "/Shared/Experiments/tracking-datasets"

# Create the experiment on first use; later runs of this cell reuse it
if mlflow.get_experiment_by_name(name) is None:
    mlflow.create_experiment(
        name=name,
        tags={
            "mlflow.note.content": "Tracking dataset tutorial",
            "project_name": "UNKNOWN",
            "topic": "run_management",
        },
    )
else:
    print("Experiment already exists")

# Make it the active experiment; set_experiment returns the Experiment object
experiment = mlflow.set_experiment(name)

In [3]:
import mlflow.data
import pandas as pd
from mlflow.data.pandas_dataset import PandasDataset

# The white-wine quality CSV uses ";" as its field separator
dataset_source_url = "https://raw.githubusercontent.com/mlflow/mlflow/master/tests/datasets/winequality-white.csv"
raw_data = pd.read_csv(dataset_source_url, delimiter=";")

# Create an instance of a PandasDataset, recording the source URL and the target column
dataset: PandasDataset = mlflow.data.from_pandas(
    raw_data, source=dataset_source_url, name="wine quality - white", targets="quality"
)



In [5]:
dataset.source

<mlflow.data.http_dataset_source.HTTPDatasetSource at 0x2413d630ed0>