Fix log_artifact for large model in HDFS #5812
Conversation
@hitchhicker Thanks for the contribution! The DCO check failed. Please sign off your commits by following the instructions here: https://github.com/mlflow/mlflow/runs/6266347573. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#sign-your-work for more details.
@@ -33,8 +33,7 @@ def log_artifact(self, local_file, artifact_path=None):
         with hdfs_system(scheme=self.scheme, host=self.host, port=self.port) as hdfs:
             _, file_name = os.path.split(local_file)
             destination = posixpath.join(hdfs_base_path, file_name)
-            with hdfs.open(destination, "wb") as output:
-                output.write(open(local_file, "rb").read())
+            hdfs.upload(destination, open(local_file, "rb"))
Nit: Can we use open() as a context manager to make sure that the file is closed after the read? Can we also make that change to the other line below?
Suggested change:
-            hdfs.upload(destination, open(local_file, "rb"))
+            with open(local_file, "rb") as f:
+                hdfs.upload(destination, f)
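For context, a minimal sketch of how the patched log_artifact block reads once this suggestion is applied (the hdfs_system context manager, attributes, and variables come from the diff above; anything beyond that is an assumption, not the merged code verbatim):

    with hdfs_system(scheme=self.scheme, host=self.host, port=self.port) as hdfs:
        _, file_name = os.path.split(local_file)
        destination = posixpath.join(hdfs_base_path, file_name)
        # Stream the local file to HDFS instead of reading it fully into
        # memory, which is what made models larger than 2GB fail.
        with open(local_file, "rb") as source:
            hdfs.upload(destination, source)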
Thanks for your review. Very good advice! I just added it.
I think the last failure might be due to the fact that the file is not closed correctly. I don't see that failure when I run the unit tests on Linux, though.
Awesome! Looks like that addressed it :)
@hitchhicker Thanks so much for filing this PR! It looks great! Just left a very tiny suggestion & am happy to merge once it's addressed. I've confirmed that the HDFS upload API is available in older pyarrow versions (e.g. 1.0 - https://arrow.apache.org/docs/1.0/python/generated/pyarrow.HadoopFileSystem.upload.html), so it should be safe to make this change.
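For readers unfamiliar with that API, here is a minimal standalone sketch of the legacy pyarrow HDFS client the linked docs describe. The host, port, and paths are placeholders, and pyarrow.hdfs.connect is the (now deprecated) entry point that returns a HadoopFileSystem in those releases; treat the exact environment setup as an assumption:

    import pyarrow

    # Legacy HDFS client; deprecated in newer pyarrow releases but available in 1.x.
    fs = pyarrow.hdfs.connect(host="namenode", port=8020)  # placeholder host/port

    # upload() streams from the file-like object in chunks, so the whole file
    # never has to fit in memory, unlike open(...).read() followed by write().
    with open("/tmp/large_model.bin", "rb") as source:  # placeholder local path
        fs.upload("/models/large_model.bin", source)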
…reater than 2GB (mlflow#4025) Signed-off-by: Bokai YU <b.yu@criteo.com>
What changes are proposed in this pull request?
Using the HadoopFileSystem upload API addresses the problem of log_artifact when the model is larger than 2GB.
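As a hedged usage sketch (the experiment name, HDFS URI, and file path below are illustrative and not part of this PR), the change matters for calls like:

    import mlflow

    # Illustrative HDFS artifact location; assumes a working HDFS client setup.
    mlflow.create_experiment("large-models", artifact_location="hdfs://namenode:8020/mlflow-artifacts")
    mlflow.set_experiment("large-models")

    with mlflow.start_run():
        # With this fix the artifact is streamed via hdfs.upload(), so files
        # larger than 2GB no longer fail to upload.
        mlflow.log_artifact("/tmp/large_model.bin")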
How is this patch tested?
(Details)
Does this PR change the documentation?
If it does, verify that the changed pages render correctly:
- Check the ci/circleci: build_doc check. If it's successful, proceed to the next step, otherwise fix it.
- Click Details on the right to open the job page of CircleCI.
- Open the Artifacts tab.
- Open docs/build/html/index.html.

Release Notes
Is this a user-facing change?
(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)
What component(s), interfaces, languages, and integrations does this PR affect?

Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging

Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support

Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages

Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes