Fix batch scoring pipeline artifact writing on Databricks #6766

jerrylian-db · 2022-09-12T18:23:28Z

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

This change fixes the artifact writing and reading logic for the batch scoring predict step.

How is this patch tested?

I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

I cloned the example MLP regression template repo, updated its requirements.txt file to point to my fix branch, and ran the entire Databricks notebook in the example. I then added notebook cells to run the ingest_scoring and predict steps, both of which succeeded. Finally, I loaded the scored dataset artifact using the get_artifact API, which also succeeded.

See screenshots:

Does this PR change the documentation?

No. You can skip the rest of this section.
Yes. Make sure the changed pages / sections render correctly by following the steps below.

Click the Details link on the Preview docs check.
Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

No. You can skip the rest of this section.
Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

Interface

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

Language

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

Integrations

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Jerry Liang <jerry.liang@databricks.com>

jerrylian-db · 2022-09-12T18:26:27Z

mlflow/pipelines/steps/predict.py

+                ".mlflow",
+                _get_execution_directory_basename(self.pipeline_root),
+                _SCORED_OUTPUT_FOLDER_NAME,
+            )


On Databricks, Spark appears to write Parquet output as a folder rather than a single file. This breaks the get_artifact logic: when the artifact path has a .parquet file extension, it tries to read the folder as if it were a file. I've therefore updated the scored_data artifact writing and reading logic to use an artifact path that does not have a .parquet file extension.
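To make the mismatch concrete, here is a minimal standard-library sketch. It fakes a Spark-style output folder (the file names are illustrative, and `list_parquet_parts` is a hypothetical helper, not part of MLflow) and shows that a path ending in .parquet may be a directory, so reading logic that keys off the extension cannot assume a single file.

```python
import os
import tempfile
from typing import List


def list_parquet_parts(artifact_path: str) -> List[str]:
    """Return the Parquet data files behind artifact_path.

    Hypothetical helper: handles both a single file and a
    Spark-style output directory containing part files.
    """
    if os.path.isfile(artifact_path):
        return [artifact_path]
    if os.path.isdir(artifact_path):
        return sorted(
            os.path.join(artifact_path, name)
            for name in os.listdir(artifact_path)
            if name.endswith(".parquet")
        )
    raise FileNotFoundError(artifact_path)


with tempfile.TemporaryDirectory() as tmp:
    # Mimic Spark output: a directory whose name ends in ".parquet",
    # holding a marker file and one data part (names are illustrative).
    spark_style = os.path.join(tmp, "scored.parquet")
    os.makedirs(spark_style)
    for name in ("_SUCCESS", "part-00000.snappy.parquet"):
        open(os.path.join(spark_style, name), "wb").close()

    # The ".parquet" path is a directory, not a file.
    looks_like_file = os.path.isfile(spark_style)
    parts = [os.path.basename(p) for p in list_parquet_parts(spark_style)]
```

Dropping the extension from the artifact path, as this PR does, sidesteps the ambiguity instead of teaching every reader to handle both cases.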

jinzhang21 · 2022-09-13T00:43:49Z

mlflow/pipelines/steps/predict.py

+        if databricks_utils.is_in_databricks_runtime():
+            dbfs_path = os.path.join(
+                ".mlflow",
+                _get_execution_directory_basename(self.pipeline_root),
+                _SCORED_OUTPUT_FOLDER_NAME,
+            )
+            shutil.rmtree("/dbfs/" + dbfs_path, ignore_errors=True)
+            scored_sdf.coalesce(1).write.format("parquet").save(dbfs_path)
+            _logger.info("Moving artifact from DBFS to driver disk")
+            shutil.copytree(
+                "/dbfs/" + dbfs_path, os.path.join(output_directory, _SCORED_OUTPUT_FOLDER_NAME)
+            )
+            shutil.rmtree("/dbfs/" + dbfs_path)
+        else:
+            scored_sdf.coalesce(1).write.format("parquet").save(
+                os.path.join(output_directory, _SCORED_OUTPUT_FILE_NAME)
+            )


Can we abstract this logic into a util function, e.g. in https://github.com/mlflow/mlflow/blob/master/mlflow/utils/file_utils.py? I can easily see it being used in other places in the future.
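One possible shape for such a util, as a rough sketch only: the name `write_via_staging` and its parameters are hypothetical and not an existing MLflow API. The writer is passed in as a callable so the staging, copy, and cleanup pattern from the diff above can be exercised without Spark. Unlike the diff, this sketch also clears the destination before copying, which is an added assumption.

```python
import os
import shutil
import tempfile
from typing import Callable


def write_via_staging(
    write_fn: Callable[[str], None],
    staging_write_path: str,
    staging_local_path: str,
    final_path: str,
) -> str:
    """Write to a staging location, then copy the result to final_path.

    write_fn receives staging_write_path (the path as the writer sees it).
    staging_local_path is the same location as seen from the local
    filesystem (for example, a "/dbfs/..." mount path on Databricks).
    """
    # Clear leftovers from a previous run so the writer starts clean.
    shutil.rmtree(staging_local_path, ignore_errors=True)
    write_fn(staging_write_path)
    # Assumption beyond the diff: remove any stale destination, because
    # copytree by default requires that the destination not exist.
    shutil.rmtree(final_path, ignore_errors=True)
    shutil.copytree(staging_local_path, final_path)
    shutil.rmtree(staging_local_path)
    return final_path


def _fake_writer(path: str) -> None:
    # Stand-in for a Spark write such as
    # scored_sdf.coalesce(1).write.format("parquet").save(path)
    os.makedirs(path)
    with open(os.path.join(path, "part-00000.parquet"), "wb") as f:
        f.write(b"placeholder")


with tempfile.TemporaryDirectory() as tmp:
    staging = os.path.join(tmp, "staging", "scored")
    final = os.path.join(tmp, "output", "scored")
    write_via_staging(_fake_writer, staging, staging, final)
    copied_files = sorted(os.listdir(final))
    staging_removed = not os.path.exists(staging)
```

In the predict step, the Databricks branch could then pass a lambda wrapping the Spark write, with `dbfs_path` as the write path and `"/dbfs/" + dbfs_path` as the local path.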

sunishsheth2009

Looks good to me :)

jinzhang21

Thanks for the quick fix, @jerrylian-db !


jerrylian-db added 5 commits September 9, 2022 15:20, each signed off by Jerry Liang <jerry.liang@databricks.com>:

wip · 414156c
wip · 0513fe4
wip · efb0ad8
wip · 206cb1a
wip · 496215e

jerrylian-db self-assigned this Sep 12, 2022

jerrylian-db requested a review from jinzhang21 September 12, 2022 18:24

jerrylian-db commented Sep 12, 2022

github-actions bot added the rn/none label (List under Small Changes in Changelogs.) Sep 12, 2022

wip · 5e509e5 (signed off by Jerry Liang <jerry.liang@databricks.com>)

jinzhang21 reviewed Sep 13, 2022

jerrylian-db added 3 commits September 13, 2022 14:22, each signed off by Jerry Liang <jerry.liang@databricks.com>:

wip · 69d19f8
wip · 83645e0
wip · 14a2fc9

sunishsheth2009 approved these changes Sep 13, 2022

jinzhang21 approved these changes Sep 14, 2022

jerrylian-db added 3 commits September 14, 2022 09:15, each signed off by Jerry Liang <jerry.liang@databricks.com>:

wip · 587dead
wip · 29d75eb
wip · 4b1834f

jerrylian-db merged commit 2b1d6be into master Sep 14, 2022

jerrylian-db deleted the fix_batch branch September 22, 2022 16:24
