MLflow Dataset Tracking #8186

Merged
merged 56 commits into from
Jun 1, 2023

Conversation

@prithvikannan (Collaborator) commented Apr 6, 2023

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

  • Create new client API log_inputs and fluent API log_input
  • Create MLflow data module, with various Datasets and DatasetSources
  • Add dataset auto logging integrations
  • Support MLflow Datasets in evaluate()
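
To make the idea behind the list above concrete, here is a toy, library-free sketch of what "logging a dataset as a run input" amounts to: the run stores a small metadata record (name, source, content digest) and not the data itself. This is not MLflow's implementation or API. `ToyDataset`, `ToyRun`, `compute_digest`, the `context` label, and the truncated SHA-256 scheme are all hypothetical names and assumptions chosen for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in for a tracked dataset: metadata only, no rows."""

    name: str
    source: str
    digest: str


def compute_digest(rows):
    """Deterministic content hash (assumed scheme): same rows, same order, same digest."""
    hasher = hashlib.sha256()
    for row in rows:
        hasher.update(repr(row).encode("utf-8"))
    return hasher.hexdigest()[:8]


@dataclass
class ToyRun:
    """Hypothetical run record that keeps (dataset, context) input pairs."""

    inputs: list = field(default_factory=list)

    def log_input(self, dataset, context):
        # Record which dataset the run consumed and in what role.
        self.inputs.append((dataset, context))


rows = [(5.1, 3.5, "setosa"), (6.2, 2.9, "versicolor")]
train = ToyDataset(name="iris-train", source="code", digest=compute_digest(rows))

run = ToyRun()
run.log_input(train, context="training")
print(len(run.inputs), run.inputs[0][0].name, run.inputs[0][1])
# prints: 1 iris-train training
```

Because the digest is derived only from the content, two runs that consume identical data end up with matching records, which is what makes input datasets comparable across runs in this toy model.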

How is this patch tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (describe details, including test results, below)

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly in the documentation preview.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Introduce Dataset Tracking to MLflow! Now users can log datasets as inputs to MLflow runs.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

dbczumar and others added 11 commits March 17, 2023 19:51
* Source reg

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Rename

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Registry

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Data

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Rename data.py

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Dataset sources

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Partial

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Sources

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Source

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* done

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* dbfs data source

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Dummy

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Tweaks

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Some docs

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Some docstrings

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Progress

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Working pandas

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Pandas works :D

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Colspec in schema

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Test structure

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Move

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Some tests

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Suite

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* fixes

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fixes

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Many test

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Add files

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Blacken

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* CI

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove todo

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Simplify

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Resource init

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Removals

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Tweak

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove unused

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix tests, rename

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove dataset stuff

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More docstrings

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Better docs

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Blank init

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Restore file

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Datasets files

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Lint

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Better docstrings

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Docstring

Signed-off-by: dbczumar <corey.zumar@databricks.com>

---------

Signed-off-by: dbczumar <corey.zumar@databricks.com>
…ies (#8051)

* Source reg

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Rename

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Registry

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Data

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Rename data.py

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Dataset sources

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Partial

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Sources

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Source

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* done

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* dbfs data source

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Dummy

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Tweaks

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Some docs

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Some docstrings

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Progress

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Working pandas

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Pandas works :D

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Colspec in schema

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Test structure

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Move

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Some tests

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Suite

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* fixes

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fixes

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Many test

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Add files

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Blacken

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* CI

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove todo

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Simplify

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Resource init

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Removals

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Tweak

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove unused

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix tests, rename

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove dataset stuff

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More docstrings

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Better docs

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Blank init

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Restore file

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Datasets files

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Lint

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Better docstrings

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Docstring

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Register artifact sources

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Get it working

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* artifact DS

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Some test coverage

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Test, docstring

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix windows

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Assert on content

Signed-off-by: dbczumar <corey.zumar@databricks.com>

---------

Signed-off-by: dbczumar <corey.zumar@databricks.com>
…or downloads (#8069)

* fix

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove separator

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Lint

Signed-off-by: dbczumar <corey.zumar@databricks.com>

---------

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Signed-off-by: Ben Wilson <benjamin.wilson@databricks.com>
* Source reg

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Rename

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Registry

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Data

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Rename data.py

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Dataset sources

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Partial

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Sources

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Source

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* done

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* dbfs data source

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Dummy

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Tweaks

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Some docs

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Some docstrings

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Progress

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Working pandas

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Pandas works :D

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Colspec in schema

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Test structure

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Move

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Some tests

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Suite

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* fixes

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fixes

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Many test

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Add files

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Blacken

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* CI

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove todo

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Simplify

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Resource init

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Removals

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Tweak

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove unused

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix tests, rename

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove dataset stuff

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More docstrings

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Better docs

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Blank init

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Restore file

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Datasets files

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Lint

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Better docstrings

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Docstring

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* HF source base

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More args

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* More args

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Progress

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* HF - needs tests

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Updates

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Loosen dict requirements

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix windows

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* to_pyfunc, targets, test

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Docstrings and couple tests

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Mixin

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* hyphen source type

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Digest fixes

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Add consistent digest big data test

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Format

Signed-off-by: dbczumar <corey.zumar@databricks.com>

---------

Signed-off-by: dbczumar <corey.zumar@databricks.com>
* add numpy dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test for numpy dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* add pandas dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* update

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test for deterministic hash

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* add property

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* targets in numpy

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* PyFuncConvertibleDatasetMixin

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* create delta and spark dataset sources

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fix delta and spark source

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fixes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test from_pandas and from_numpy

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* lint

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* delta information

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* from_pandas with delta and spark sources

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* lint

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* tablse

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* lint

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fix delta tests

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* tests

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* lint

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* _get_table_info_if_uc

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* databricks-uc host creds

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* _is_uc_table

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* split out spark and delta dataset source tests

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* Autoformat: https://github.com/mlflow/mlflow/actions/runs/4602300008

Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>

* cleanup

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* addressing comments

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* bump delta core to 2.2.0

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fixed all references of spark session

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* small fixes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* remove import

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>
Co-authored-by: mlflow-automation <mlflow-automation@users.noreply.github.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
* Check out data model files except sql

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Working filestore

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Simplify

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Test cases

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix test

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* SQL, REST notimplemennt

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Address comment

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Address comments

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Coverage

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Add internal

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove exp

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Pass

Signed-off-by: dbczumar <corey.zumar@databricks.com>

---------

Signed-off-by: dbczumar <corey.zumar@databricks.com>
* Add log inputs to rest store

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test_rest_store

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* Autoformat: https://github.com/mlflow/mlflow/actions/runs/4604613188

Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>

* log_inputs api

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* bulk writes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* implement read inputs to run via _get_run_inputs

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* unused import

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test_log_inputs_fails_with_missing_inputs and test_log_inputs_fails_with_too_large_inputs

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* search_runs and test case

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* more tests [wip]

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fixed write side

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fixing some tests

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fix overwrite issue

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* cleanup

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* teardown

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>
Co-authored-by: mlflow-automation <mlflow-automation@users.noreply.github.com>
@mlflow-automation (Collaborator) commented Apr 6, 2023

Documentation preview for 565f299 will be available here when this CircleCI job completes successfully.


@github-actions (bot) commented Apr 6, 2023

@prithvikannan Thank you for the contribution! Could you fix the following issue(s)?

⚠ DCO check

The DCO check failed. Please sign off your commit(s) by following the instructions here. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.md#sign-your-work for more details.

prithvikannan and others added 17 commits April 6, 2023 22:30
* Add log inputs to rest store

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test_rest_store

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* Autoformat: https://github.com/mlflow/mlflow/actions/runs/4604613188

Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>

* log_inputs api

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* bulk writes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* draft for log_inputs fluent api

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fluent log_input api

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test case

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* implement read inputs to run via _get_run_inputs

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* unused import

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* small fixes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test_log_inputs_fails_with_missing_inputs and test_log_inputs_fails_with_too_large_inputs

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* search_runs and test case

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* more tests [wip]

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* pylint

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fixing up test_log_input

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fixed write side

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fixing some tests

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fix overwrite issue

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* cleanup

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* Autoformat: https://github.com/mlflow/mlflow/actions/runs/4627612637

Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>

* teardown

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* small fixes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fixes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>
Co-authored-by: mlflow-automation <mlflow-automation@users.noreply.github.com>
* Add log inputs to rest store

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test_rest_store

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* Autoformat: https://github.com/mlflow/mlflow/actions/runs/4604613188

Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>

* log_inputs api

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* bulk writes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* draft for log_inputs fluent api

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fluent log_input api

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test case

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* implement read inputs to run via _get_run_inputs

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* unused import

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* small fixes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test_log_inputs_fails_with_missing_inputs and test_log_inputs_fails_with_too_large_inputs

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* search_runs and test case

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* more tests [wip]

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* pylint

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fixing up test_log_input

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fixed write side

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fixing some tests

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fix overwrite issue

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* cleanup

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* Autoformat: https://github.com/mlflow/mlflow/actions/runs/4627612637

Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>

* teardown

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* python server log inputs

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* small fixes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* check keys

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* add tests

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fixes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* remove run_uuid

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Signed-off-by: mlflow-automation <mlflow-automation@users.noreply.github.com>
Co-authored-by: mlflow-automation <mlflow-automation@users.noreply.github.com>
* create a code dataset source

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* tests

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* mlflow_source_type and mlflow_source_name

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
* add numpy dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test for numpy dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* add pandas dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* update

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* Partial spark ds, hash is broken

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* test for deterministic hash

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* add property

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* targets in numpy

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* PyFuncConvertibleDatasetMixin

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* create delta and spark dataset sources

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fix delta and spark source

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* spark

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* spark

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* fixes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* Progress

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Add

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* test from_pandas and from_numpy

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* lint

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* delta information

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* progress

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Progress

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix kwagrs

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Register

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Dedupe

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* remove nl

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Address comments

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* test case for spark dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* approx count

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* skeleton for various from_spark tests

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* tests for properties

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fix _is_delta_table

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test cleanup

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* move pyspark import

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* trying again

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* create spark_delta_utils

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* pyspark import into util fn

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* import utils inside loaders

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* check for pyspark in sys modules

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* move pyspark import to load in spark and delta source

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* lint

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
Co-authored-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
* tensorflow dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* more progress

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fix schema and profile

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test tensor

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* lint

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* move tf imports

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* small fix

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test_tensorflow_dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* remove reference to dataframe

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
…nd profile (#8305)

* Infer schem dict

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix profile and schema for np

Signed-off-by: dbczumar <corey.zumar@databricks.com>

---------

Signed-off-by: dbczumar <corey.zumar@databricks.com>
…8315)

* Use dataset as default

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Dataset

Signed-off-by: dbczumar <corey.zumar@databricks.com>

---------

Signed-off-by: dbczumar <corey.zumar@databricks.com>
* Patches

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Make dataset sources importable

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Test

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove protocol

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Cherry pick

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix entrypoint loading error and revert dummy dataset

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Comment

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix typo

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* DS

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix attempt

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix silliness

Signed-off-by: dbczumar <corey.zumar@databricks.com>

---------

Signed-off-by: dbczumar <corey.zumar@databricks.com>
* load from source

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Load source

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Experimental

Signed-off-by: dbczumar <corey.zumar@databricks.com>

---------

Signed-off-by: dbczumar <corey.zumar@databricks.com>
* tensorflow dataset targets

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* test coverage

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* cosmetic changes

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
* starting out

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* draft of branching with mlflow dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* add to_evaluation_dataset to pyfunc mixin

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* to_evaluation_dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* log input

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* add to_evaluation_dataset to all dataset types

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* remove tensorflow impl

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* use metric prefix as name

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* add test without metric_prefix

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* update to_evaluation_dataset and add test cases

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* update docstrings and use client api

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fix case with no context

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* make targets optional

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* disable=unused-variable

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fix tensorflow targets and expand tests for to_evaluation_dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* only support eval dataset if Tensor

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
…a, improve test coverage (#8304)

* SQL

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix approx count performance

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fixes

Signed-off-by: dbczumar <corey.zumar@databricks.com>

---------

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Signed-off-by: Corey Zumar <39497902+dbczumar@users.noreply.github.com>
* Patches

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Make dataset sources importable

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Test

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Remove protocol

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Cherry pick

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix entrypoint loading error and revert dummy dataset

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Make experimental

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Mark DS and sources experimental

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Experimental entities

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Experimental input tags

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* get run docstrings

Signed-off-by: dbczumar <corey.zumar@databricks.com>

---------

Signed-off-by: dbczumar <corey.zumar@databricks.com>
prithvikannan and others added 9 commits May 24, 2023 21:21
Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
* Update schema in sklearn and xgboost tests

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* empty

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
* Update optional targets logic with datasets

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* small fix

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* targets

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
* Update optional targets logic with datasets

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* small fix

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* targets

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* Add CodeDatasetSource to dataset_source_registry

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Signed-off-by: dbczumar <corey.zumar@databricks.com>
* docs progress

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Initial API docs

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Renaming for tf

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Partial

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* fix

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* data rst

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Install deps

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Fix

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Dataset

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* Progress

Signed-off-by: dbczumar <corey.zumar@databricks.com>

* fix some doc references

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* alias

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* small fix

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* use numpy with a local array

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* pyspark import

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* remove annotations for spark df

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* remove mlflow.data.DatasetSource and mlflow.data.Dataset

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* add dataset sources

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* double backtick

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* tracking doc

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* small fix

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fix

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* fix py class

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

* pyspark

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>

---------

Signed-off-by: dbczumar <corey.zumar@databricks.com>
Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Co-authored-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
Comment on lines +8 to +10
tensorflow
pyspark
datasets
Member

Do we need these for building docs?

Collaborator

Yeah, we need these for the docs build; otherwise, mlflow.data.from_tensorflow and the other loader methods that depend on these packages are not defined.
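
To illustrate why a missing package can leave a loader undefined, here is a minimal sketch of one common pattern, assuming (hypothetically) that each loader is only registered when its dependency can be found. The helper `build_namespace` and the generated `from_<dep>` names are made up for illustration and are not MLflow's actual code.

```python
import importlib.util
import types


def build_namespace(optional_deps):
    """Expose a from_<dep> loader only for dependencies that are importable.

    Hypothetical helper for illustration; not part of MLflow.
    """
    ns = types.SimpleNamespace()
    for dep in optional_deps:
        if importlib.util.find_spec(dep) is None:
            # Dependency is missing, so the loader is never defined.
            continue
        # Bind dep as a default argument so each lambda keeps its own value.
        setattr(ns, f"from_{dep}", lambda data, _dep=dep: (_dep, data))
    return ns


# "json" ships with Python; the second name is assumed not to be installed.
ns = build_namespace(["json", "surely_not_installed_pkg_xyz"])
print(hasattr(ns, "from_json"))
print(hasattr(ns, "from_surely_not_installed_pkg_xyz"))
```

Under this pattern, a docs tool that introspects the module would only see the loaders whose dependencies are installed in the build environment, which is consistent with the reason given above for listing the packages in the docs requirements.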

@harupy added the only-latest label (If applied, only test the latest version of each group in cross-version tests.) on May 31, 2023
Collaborator

@dbczumar left a comment

LGTM once @harupy's comments are addressed (or feel free to file a follow-up). Thanks, @prithvikannan!

Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
@dbczumar merged commit 3a58f74 into master on Jun 1, 2023
45 of 47 checks passed
@prithvikannan mentioned this pull request on Jun 1, 2023
33 tasks
Labels
area/docs: Documentation issues
area/examples: Example code
area/sqlalchemy: Use of SQL alchemy in tracking service or model registry
area/tracking: Tracking service, tracking client APIs, autologging
only-latest: If applied, only test the latest version of each group in cross-version tests.
rn/feature: Mention under Features in Changelogs.
6 participants