Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding DatabricksArtifactRepository to the MLflow client. #2911

Merged
merged 34 commits into from Jun 16, 2020

Conversation

arjundc-db
Copy link
Contributor

What changes are proposed in this pull request?

This PR adds a new artifact repository 'DatabricksArtifactRepository' to the MLflow client along with the required test cases.
(Please fill in changes proposed in this fix)

How is this patch tested?

Unit Tests (added) and manual testing.

Release Notes

Is this a user-facing change?

  • [x ] No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • [ x] area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for
    Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: Local serving, model deployment tools, spark UDFs
  • [ x] area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, JavaScript, plotting
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • [x ] rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@arjundc-db arjundc-db requested a review from dbczumar June 8, 2020 20:13
@arjundc-db arjundc-db added area/artifacts Artifact stores and artifact logging area/tracking Tracking service, tracking client APIs, autologging labels Jun 8, 2020
@dbczumar
Copy link
Collaborator

dbczumar commented Jun 8, 2020

Note: This PR was filed against the main MLflow fork after undergoing extensive review on @dbczumar 's fork here: dbczumar#5

@@ -0,0 +1,324 @@
# -*- coding: utf-8 -*-
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@arjundc-db Can we add unit tests for log_artifact()/log_artifacts(), download_artifacts(), and list_artifacts()that construct aDatabricksArtifactReporooted at a subdirectory of the run artifact root (e.g.,dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/artifacts/my/path) and ensure that these APIs operate on paths relative to this subdirectory? (E.g., list_artifacts('foo')should list artifacts atdbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/artifacts/my/path/foorather thandbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/artifacts/foo`)

To test this, we should be able to mock out _upload_to_cloud, _call_endpoint, and _download_from_cloud and ensure that they're called with the expected paths / request bodies. Let me know if this makes sense!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

os.path.join(artifact_path, 'subdir'))]
log_artifact_mock.assert_has_calls(calls)

def test_list_artifacts(self, databricks_artifact_repo):
Copy link
Collaborator

@dbczumar dbczumar Jun 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we test that outputs are computed relative to run_relative_artifact_repo_root_path by invoking list on a DatabricksArtifactRepo that is rooted at a subdirectory of the run artifact root?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Collaborator

@dbczumar dbczumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Great work @arjundc-db !

@codecov-commenter
Copy link

Codecov Report

Merging #2911 into master will decrease coverage by 0.19%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2911      +/-   ##
==========================================
- Coverage   85.23%   85.04%   -0.20%     
==========================================
  Files          20       20              
  Lines        1050     1050              
==========================================
- Hits          895      893       -2     
- Misses        155      157       +2     
Impacted Files Coverage Δ
R/tracking-server.R 96.22% <0.00%> (-3.78%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cf0330b...dfc7f60. Read the comment docs.

@arjundc-db arjundc-db merged commit 74be64c into mlflow:master Jun 16, 2020
@smurching smurching added rn/none List under Small Changes in Changelogs. area/artifacts Artifact stores and artifact logging and removed area/artifacts Artifact stores and artifact logging labels Jun 18, 2020
avflor pushed a commit to avflor/mlflow that referenced this pull request Aug 22, 2020
* Add protos and compilation

* Adding databricks_artifact_repo to store/artifact

* Addressing comments

* Addressing comments and fixing _download_file

* Code Clean-up and lint

* Adding multi-part upload logic and unit tests

* Addressing comments

* Addressing comments, making azure download more memory efficent and other fixes.

* Small fix

* Fixing list_artifacts

* Addressing final comments.

* Making extract_run_id static

* Adding AWS support

* Addressing comments

* Fix - needs docs and tests

* Comment and simplification

* Special case for empty file upload to AWS

* Clean up and added tests for relative path

* Page

* Fix

* Added relative path test cases

* Added test for list_artifacts pagination

* Fixing travis failures

* Fixes

* More fixes

* More fixes

* Clean-up

* Clean-up

Co-authored-by: Corey Zumar <corey.zumar@databricks.com>
Co-authored-by: dbczumar <39497902+dbczumar@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/artifacts Artifact stores and artifact logging area/tracking Tracking service, tracking client APIs, autologging rn/none List under Small Changes in Changelogs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants