Make it possible to log a dataset without loading anything #11172

chenmoneygithub · 2024-02-16T22:14:08Z

🛠 DevTools 🛠

Install mlflow from this PR

pip install git+https://github.com/mlflow/mlflow.git@refs/pull/11172/merge

Checkout with GitHub CLI

gh pr checkout 11172

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Make it possible to log a dataset without loading anything. In more detail, this is what happens under the hood:

Allow constructing a dataset from only dataset source.
Log all the metadata when calling log_input.
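
The two steps above can be sketched with the standard library alone. `UrlSource`, `SourceOnlyDataset`, and this `log_input` function are hypothetical stand-ins, not MLflow's actual classes or API; the sketch only illustrates that the logged metadata can be derived from the source description without reading any data.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class UrlSource:
    """Hypothetical stand-in for a dataset source: it only describes where data lives."""

    url: str

    def to_json(self) -> str:
        return json.dumps({"url": self.url})


class SourceOnlyDataset:
    """Hypothetical dataset built from a source alone; no data is ever loaded."""

    def __init__(self, source: UrlSource, name: str = "dataset"):
        self.source = source
        self.name = name
        # The digest is derived from metadata only, never from the data itself.
        payload = json.dumps({"name": name, "source": source.to_json()})
        self.digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]

    def to_dict(self) -> dict:
        return {
            "name": self.name,
            "digest": self.digest,
            "source": self.source.to_json(),
            "source_type": "url",
        }


def log_input(run_inputs: list, dataset: SourceOnlyDataset, context: str) -> None:
    """Hypothetical logger: records the dataset's metadata for a run."""
    run_inputs.append({"context": context, **dataset.to_dict()})


run_inputs = []
dataset = SourceOnlyDataset(UrlSource("https://example.com/data.csv"), name="train")
log_input(run_inputs, dataset, context="training")
```

After the call, `run_inputs` holds one record with the name, digest, source, and source type, even though the URL was never fetched.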

How is this PR tested?

Existing unit/integration tests
New unit/integration tests
Manual tests

Does this PR require documentation update?

Release Notes

Is this a user-facing change?

No. You can skip the rest of this section.
Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

Interface

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

Language

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

Integrations

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes

github-actions · 2024-02-16T22:14:32Z

Documentation preview for 121756f will be available when this CircleCI job
completes successfully.

More info

Ignore this comment if this PR does not change the documentation.
It takes a few minutes for the preview to be available.
The preview is updated when a new commit is pushed to this PR.
This comment was created by https://github.com/mlflow/mlflow/actions/runs/8056947777.

chenmoneygithub · 2024-02-23T19:30:39Z

mlflow/data/dataset.py

-            base_dict: A string dictionary of base information about the
-                dataset, including: name, digest, source, and source
-                type.
+    def to_dict(self) -> Dict[str, str]:


Reason for this signature change:

1. Methods that are overridden in subclasses should be public.

2. It feels odd to have the to_dict method take a base_dict as input. We should just put the default logic in the body and have subclasses call the superclass's method.

(1) makes sense to me. I don't feel strongly about (2), but it seems like a reasonable change.
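
A minimal sketch of the pattern discussed in this thread, assuming a simplified class hierarchy. `Dataset` and `TabularDataset` here are illustrative toy classes, not the real MLflow ones: the base class owns the default fields in a public `to_dict`, and a subclass extends the result through `super()` instead of receiving a `base_dict` argument.

```python
from typing import Dict


class Dataset:
    """Toy base class holding the fields common to every dataset."""

    def __init__(self, name: str, digest: str, source: str, source_type: str):
        self.name = name
        self.digest = digest
        self.source = source
        self.source_type = source_type

    def to_dict(self) -> Dict[str, str]:
        # Default logic lives in the base class body.
        return {
            "name": self.name,
            "digest": self.digest,
            "source": self.source,
            "source_type": self.source_type,
        }


class TabularDataset(Dataset):
    """Toy subclass that adds one extra field on top of the base fields."""

    def __init__(self, name: str, digest: str, source: str, source_type: str, schema: str):
        super().__init__(name, digest, source, source_type)
        self.schema = schema

    def to_dict(self) -> Dict[str, str]:
        # Reuse the base fields, then add the subclass-specific ones.
        config = super().to_dict()
        config["schema"] = self.schema
        return config


tabular = TabularDataset("train", "abcd1234", '{"url": "x"}', "http", schema='{"cols": []}')
tabular_dict = tabular.to_dict()
```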

prithvikannan

Overall looks good. Left a few small comments about organization and testing. Thanks @chenmoneygithub

prithvikannan · 2024-02-23T21:30:33Z

mlflow/data/__init__.py

+with suppress(ImportError):
+    # Suppressing ImportError to pass mlflow-skinny testing.
+    from mlflow.data import meta_dataset  # noqa: F401


what part of meta_dataset is breaking the test?

import numpy as np - which also exists in numpy_dataset.py

i see. it looks like we handle these in the dataset registry mlflow/data/dataset_registry.py rather than in the module __init__.py. can we use that approach here for consistency?

Actually those imports inside mlflow/data/dataset_registry.py should also be written in __init__.py, otherwise it's quite unclear why mlflow.data.numpy_dataset is a valid module without having from mlflow.data import numpy_dataset in the __init__.py. I will open a followup PR to clean them up.
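
For readers unfamiliar with the pattern in the diff above, here is a small standalone sketch of how `contextlib.suppress(ImportError)` behaves. The module name `_hypothetical_heavy_dependency_xyz` is made up so that the import fails; it stands in for a module that needs an optional dependency such as numpy.

```python
from contextlib import suppress

# Pre-bind the name so it is defined even when the import below fails.
optional_module = None

with suppress(ImportError):
    # Hypothetical module that is not installed: the ImportError is swallowed
    # and execution continues after the with block.
    import _hypothetical_heavy_dependency_xyz as optional_module

json_module = None

with suppress(ImportError):
    # A module that is available imports normally inside the same construct.
    import json as json_module

has_optional = optional_module is not None
```

One consequence of this design is that a failed import is silent, so code that later relies on the module has to check for it (as `has_optional` does here) or fail with a less direct error.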

prithvikannan · 2024-02-23T21:35:53Z

tests/data/test_meta_dataset.py

+    json_str = dataset.to_json()
+    parsed_json = json.loads(json_str)
+
+    assert parsed_json["digest"] is not None


can we add some tests on the digest content itself? it's important that different dataset sources map to different digests

makes sense to me, adding.

prithvikannan · 2024-02-23T21:40:41Z

mlflow/data/meta_dataset.py

+        super().__init__(source=source, name=name, digest=digest)
+
+    def _compute_digest(self) -> str:
+        """Computes a digest for the dataset."""


can we update this docstring with some information about how this hash works and differs from other dataset hashes?

good call, done!

prithvikannan · 2024-02-23T21:40:58Z

mlflow/data/meta_dataset.py

+        config = {
+            "name": self.name,
+            "source": self.source.to_json(),
+            "source_type": self.source._get_source_type(),
+            "schema": self.schema.to_dict() if self.schema else "",
+        }
+        return hashlib.sha256(json.dumps(config).encode("utf-8")).hexdigest()[:8]


for consistency, can we pull this out to a helper fn in mlflow/data/digest_utils.py?

hmm, I would prefer inlining the code, since it is pretty short and does not share any logic with the other functions in mlflow/data/digest_utils.py.

we have other hashing functions in mlflow/data/digest_utils.py, such as compute_tensorflow_dataset_digest, that are only used for one dataset type. i think we should pull this out even if it's short.

we should do the reverse: util functions that are specific to one module should live in that module, not in the util file. I will clean it up in a followup PR.
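
To make the digest logic under discussion concrete, here is a standalone sketch that mirrors the diff above using plain values in place of the dataset attributes. The function name `compute_metadata_digest` and the example source payloads are assumptions for illustration, not part of the PR.

```python
import hashlib
import json
from typing import Optional


def compute_metadata_digest(
    name: str, source_json: str, source_type: str, schema: Optional[dict] = None
) -> str:
    """Hash dataset metadata only; no data is read (hypothetical helper)."""
    config = {
        "name": name,
        "source": source_json,
        "source_type": source_type,
        # An empty string stands in for a missing schema, as in the diff.
        "schema": schema if schema else "",
    }
    # Serialize the metadata, hash it, and keep the first 8 hex characters.
    return hashlib.sha256(json.dumps(config).encode("utf-8")).hexdigest()[:8]


# Illustrative source payloads; the field names are made up for this sketch.
http_digest = compute_metadata_digest(
    "dataset", json.dumps({"url": "https://example.com/data.csv"}), "http"
)
delta_digest = compute_metadata_digest(
    "dataset", json.dumps({"path": "fake/path/to/delta"}), "delta_table"
)
repeat_digest = compute_metadata_digest(
    "dataset", json.dumps({"url": "https://example.com/data.csv"}), "http"
)
```

The same metadata always yields the same digest, and different sources yield different digests in practice. Because only 8 hex characters (32 bits) are kept, collisions are possible in principle.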

prithvikannan

LGTM once the small reorganizations are addressed. Thanks @chenmoneygithub !

prithvikannan · 2024-02-26T23:05:09Z

tests/data/test_meta_dataset.py

+
+    assert dataset1.digest != dataset2.digest
+
+    source = DeltaDatasetSource("fake/path/to/delta")


nit: can we call this delta_source rather than overwriting source? it's hard to tell that dataset1 and dataset3 are meant to be different.

good call, done!

chenmoneygithub marked this pull request as draft February 17, 2024 00:42

chenmoneygithub added 2 commits February 20, 2024 18:40

make it possible to log a dataset without loading anything

f0c05cd

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

fix broken

377fa77

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

chenmoneygithub force-pushed the improve-dataset-logging branch from 2a8d839 to ca3eaa4 Compare February 21, 2024 23:06

incremental

da6f5c2

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

chenmoneygithub force-pushed the improve-dataset-logging branch from ca3eaa4 to da6f5c2 Compare February 21, 2024 23:10

chenmoneygithub marked this pull request as ready for review February 21, 2024 23:24

chenmoneygithub force-pushed the improve-dataset-logging branch from aaf2f51 to ad1686a Compare February 23, 2024 03:33

github-actions bot added the rn/feature Mention under Features in Changelogs. label Feb 23, 2024

chenmoneygithub commented Feb 23, 2024

chenmoneygithub force-pushed the improve-dataset-logging branch 2 times, most recently from eaa5b0a to c0c8ffb Compare February 23, 2024 20:32

chenmoneygithub requested review from prithvikannan and jessechancy and removed request for prithvikannan February 23, 2024 20:53

prithvikannan reviewed Feb 23, 2024

chenmoneygithub removed the request for review from jessechancy February 23, 2024 22:04

chenmoneygithub force-pushed the improve-dataset-logging branch from c0c8ffb to fd31fbc Compare February 23, 2024 23:09

fix tests

d32d7d8

Signed-off-by: chenmoneygithub <chen.qian@databricks.com> Add cccccbggllnvhtijuvjdiuhfbljjcftuhjhurnubrbve

chenmoneygithub force-pushed the improve-dataset-logging branch from fd31fbc to d32d7d8 Compare February 23, 2024 23:30

chenmoneygithub requested a review from prithvikannan February 26, 2024 18:36

prithvikannan approved these changes Feb 26, 2024

fix comments

121756f

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

chenmoneygithub merged commit b1e1b47 into mlflow:master Feb 27, 2024
37 checks passed

serena-ruan pushed a commit to serena-ruan/mlflow that referenced this pull request Feb 28, 2024

Make it possible to log a dataset without loading anything (mlflow#11172)

939dd25

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

artjen pushed a commit to artjen/mlflow that referenced this pull request Mar 26, 2024

Make it possible to log a dataset without loading anything (mlflow#11172)

29820ee

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>
Signed-off-by: Arthur Jenoudet <arthur.jenoudet@databricks.com>
