Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it possible to log a dataset without loading anything #11172

Merged

Conversation

chenmoneygithub
Copy link
Collaborator

@chenmoneygithub chenmoneygithub commented Feb 16, 2024

馃洜 DevTools 馃洜

Open in GitHub Codespaces

Install mlflow from this PR

pip install git+https://github.com/mlflow/mlflow.git@refs/pull/11172/merge

Checkout with GitHub CLI

gh pr checkout 11172

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Make it possible to log a dataset without loading anything. In more details, this is what happens under the hood:

  • Allow constructing a dataset from only dataset source.
  • Log all the metadata when calling log_input.

How is this PR tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests

Does this PR require documentation update?

  • No. You can skip the rest of this section.
  • Yes. I've updated:
    • Examples
    • API references
    • Instructions

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/deployments: MLflow Deployments client APIs, server, and third-party Deployments integrations
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Copy link

github-actions bot commented Feb 16, 2024

Documentation preview for 121756f will be available when this CircleCI job
completes successfully.

More info

@chenmoneygithub chenmoneygithub marked this pull request as draft February 17, 2024 00:42
Signed-off-by: chenmoneygithub <chen.qian@databricks.com>
Signed-off-by: chenmoneygithub <chen.qian@databricks.com>
Signed-off-by: chenmoneygithub <chen.qian@databricks.com>
@chenmoneygithub chenmoneygithub marked this pull request as ready for review February 21, 2024 23:24
@github-actions github-actions bot added the rn/feature Mention under Features in Changelogs. label Feb 23, 2024
base_dict: A string dictionary of base information about the
dataset, including: name, digest, source, and source
type.
def to_dict(self) -> Dict[str, str]:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reason for this signature change:

  • classes that are overridden in subclasses should be public.
  • It feels odd to have to_dict method taking a base_dict as input. We should just put the default logic in the body, and have subclasses call the super class' method.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) makes sense to me. i dont feel strongly about (2), but it seems like a resonable change

@chenmoneygithub chenmoneygithub force-pushed the improve-dataset-logging branch 2 times, most recently from eaa5b0a to c0c8ffb Compare February 23, 2024 20:32
@chenmoneygithub chenmoneygithub requested review from prithvikannan and jessechancy and removed request for prithvikannan February 23, 2024 20:53
Copy link
Collaborator

@prithvikannan prithvikannan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good. Left a few small comments about organization and testing. Thanks @chenmoneygithub

Comment on lines +16 to +18
with suppress(ImportError):
# Suppressing ImportError to pass mlflow-skinny testing.
from mlflow.data import meta_dataset # noqa: F401
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what part of meta_dataset is breaking the test?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import numpy as np - which also exists in numpy_dataset.py

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see. it looks like we handle these in the dataset registry mlflow/data/dataset_registry.py rather than in the module __init__.py. can we use that approach here for consistency?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually those imports inside mlflow/data/dataset_registry.py should also be written in __init__.py, otherwise it's quite unclear why mlflow.data.numpy_dataset is a valid module without having from mlflow.data import numpy_dataset in the __init__.py. I will open a followup PR to clean them up.

json_str = dataset.to_json()
parsed_json = json.loads(json_str)

assert parsed_json["digest"] is not None
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add some tests on the digest content itself? its important that different dataset sources will map to different digests

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to me, adding.

base_dict: A string dictionary of base information about the
dataset, including: name, digest, source, and source
type.
def to_dict(self) -> Dict[str, str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) makes sense to me. i dont feel strongly about (2), but it seems like a resonable change

mlflow/data/huggingface_dataset.py Outdated Show resolved Hide resolved
super().__init__(source=source, name=name, digest=digest)

def _compute_digest(self) -> str:
"""Computes a digest for the dataset."""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we update this docstring with some information about how this hash works and differs from other dataset hashes?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, done!

Comment on lines +78 to +89
config = {
"name": self.name,
"source": self.source.to_json(),
"source_type": self.source._get_source_type(),
"schema": self.schema.to_dict() if self.schema else "",
}
return hashlib.sha256(json.dumps(config).encode("utf-8")).hexdigest()[:8]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for consistency, can we pull this out to a helper fn in mlflow/data/digest_utils.py

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I would prefer inlining the code, since this is not sharing any logic with other functions in mlflow/data/digest_utils.py and pretty short.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have other hashing functions in mlflow/data/digest_utils.py such as compute_tensorflow_dataset_digest that are only used for one dataset type. i think we should pull this out even if its short.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should do the reverse - for the util functions that is specific to a module, they should go to its own module not the util file, I will clean it up in a followup PR.

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>

Add cccccbggllnvhtijuvjdiuhfbljjcftuhjhurnubrbve
Copy link
Collaborator

@prithvikannan prithvikannan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once the small reorganizations are addressed. Thanks @chenmoneygithub !

Comment on lines +16 to +18
with suppress(ImportError):
# Suppressing ImportError to pass mlflow-skinny testing.
from mlflow.data import meta_dataset # noqa: F401
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see. it looks like we handle these in the dataset registry mlflow/data/dataset_registry.py rather than in the module __init__.py. can we use that approach here for consistency?

Comment on lines +78 to +89
config = {
"name": self.name,
"source": self.source.to_json(),
"source_type": self.source._get_source_type(),
"schema": self.schema.to_dict() if self.schema else "",
}
return hashlib.sha256(json.dumps(config).encode("utf-8")).hexdigest()[:8]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have other hashing functions in mlflow/data/digest_utils.py such as compute_tensorflow_dataset_digest that are only used for one dataset type. i think we should pull this out even if its short.


assert dataset1.digest != dataset2.digest

source = DeltaDatasetSource("fake/path/to/delta")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we call this delta_source rather than overwriting source? its hard to tell that dataset1 and dataset3 are meant to be different.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, done!

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>
@chenmoneygithub chenmoneygithub merged commit b1e1b47 into mlflow:master Feb 27, 2024
37 checks passed
serena-ruan pushed a commit to serena-ruan/mlflow that referenced this pull request Feb 28, 2024
)

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>
artjen pushed a commit to artjen/mlflow that referenced this pull request Mar 26, 2024
)

Signed-off-by: chenmoneygithub <chen.qian@databricks.com>
Signed-off-by: Arthur Jenoudet <arthur.jenoudet@databricks.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rn/feature Mention under Features in Changelogs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants