Use subprocess for multi-part download execution #8352
Conversation
@@ -445,7 +450,7 @@ def _parallelized_download_from_cloud(
     if failed_downloads:
         new_cloud_creds = self._get_read_credential_infos(
-            self.run_id, dst_run_relative_artifact_path
+            self.run_id, [dst_run_relative_artifact_path]
This was broken before: `_get_read_credential_infos` expects a list of file paths. Passing a single file path as a string caused the path to be broken into individual characters, so the download retry logic attempted to fetch nonexistent files.
We need test coverage here.
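The failure mode is easy to reproduce: iterating over a string yields its characters, so a bare path fans out into one bogus lookup per character. A minimal sketch (the helper name and signature here are illustrative stand-ins, not the real MLflow implementation):

```python
def get_read_credential_infos(run_id, paths):
    # Stand-in for a helper that expects `paths` to be a list of file paths.
    return [(run_id, p) for p in paths]

# Buggy call: the string is iterated character by character,
# producing one entry per character of the path.
broken = get_read_credential_infos("run1", "model/data.bin")

# Fixed call: wrap the single path in a list.
fixed = get_read_credential_infos("run1", ["model/data.bin"])
```

This is exactly why the diff wraps `dst_run_relative_artifact_path` in a list.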
@@ -425,17 +425,22 @@ def _upload_to_cloud(
     def _parallelized_download_from_cloud(
         self, cloud_credential_info, file_size, dst_local_file_path, dst_run_relative_artifact_path
     ):
+        from mlflow.projects.utils import get_databricks_env_vars
Importing from `mlflow.projects` feels odd here. Perhaps move this util elsewhere?
from functools import lru_cache

import requests
import urllib3
from packaging.version import Version
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


@lru_cache(maxsize=64)
def _cached_get_request_session(
    max_retries,
    backoff_factor,
    retry_codes,
    # To create a new Session object for each process, we use the process id as the cache key.
    # This avoids sharing the same Session object across processes, which can lead to issues
    # such as https://stackoverflow.com/q/3724900.
    _pid,
):
    """
    This function should not be called directly. Instead, use `_get_request_session` below.
    """
    assert 0 <= max_retries < 10
    assert 0 <= backoff_factor < 120

    retry_kwargs = {
        "total": max_retries,
        "connect": max_retries,
        "read": max_retries,
        "redirect": max_retries,
        "status": max_retries,
        "status_forcelist": retry_codes,
        "backoff_factor": backoff_factor,
    }
    if Version(urllib3.__version__) >= Version("1.26.0"):
        retry_kwargs["allowed_methods"] = None
    else:
        retry_kwargs["method_whitelist"] = None

    retry = Retry(**retry_kwargs)
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
This method, along with several others, is already defined in rest_utils.py. However, importing these utils from MLflow forces the fresh subprocess to import all of MLflow, which takes ~1.5 seconds. For prototyping purposes, I redefined these utils so that this module doesn't rely on any imports from MLflow, but we probably want to make this cleaner before merging.
In particular, this creates the potential for the Databricks multi-part download logic to diverge from the rest of MLflow.
At minimum, before merging, we need a test verifying that this script doesn't import MLflow and runs quickly.
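Such a test could execute the worker script in a fresh interpreter and then check `sys.modules`. A rough sketch, assuming a hypothetical `script_path` standing in for the real download script (the time bound is an arbitrary placeholder, not a measured threshold):

```python
import subprocess
import sys
import time


def assert_standalone(script_path):
    # Run the worker script in a fresh interpreter, then verify that MLflow
    # was never imported and that the run stayed fast. Because runpy executes
    # the script under a __name__ other than "__main__", only its import-time
    # code runs, which is exactly what we want to audit.
    probe = (
        "import runpy, sys\n"
        f"runpy.run_path({script_path!r})\n"
        "assert 'mlflow' not in sys.modules, 'worker imported mlflow'\n"
    )
    start = time.monotonic()
    subprocess.run([sys.executable, "-c", probe], check=True)
    assert time.monotonic() - start < 5.0, "worker startup too slow"
```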
print(  # pylint: disable=print-function
    json.dumps(
        {
            "error_status_code": e.response.status_code,
            "error_text": str(e),
        }
    ),
    file=sys.stdout,
)
This feels like it could be brittle. What if other dependencies print to stdout? There must be a better way.
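One less brittle option is to pass the worker a dedicated result-file path instead of parsing its stdout, so stray prints from other libraries can't corrupt the payload. A sketch under that assumption (the helper and the worker-invocation shape are illustrative, not the PR's actual interface):

```python
import json
import os
import subprocess
import sys
import tempfile


def run_worker_with_result_file(worker_code):
    # Give the worker a path that only it writes to; stdout stays free for
    # logging noise from the worker's dependencies.
    fd, result_path = tempfile.mkstemp(suffix=".json")
    os.close(fd)
    try:
        # The worker receives the result path as sys.argv[1].
        subprocess.run(
            [sys.executable, "-c", worker_code, result_path], check=True
        )
        with open(result_path) as f:
            return json.load(f)
    finally:
        os.remove(result_path)
```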
Signed-off-by: dbczumar <corey.zumar@databricks.com>
@@ -62,7 +62,7 @@
 )

 _logger = logging.getLogger(__name__)
-_DOWNLOAD_CHUNK_SIZE = 10_000_000
+_DOWNLOAD_CHUNK_SIZE = 100_000_000  # 100 MB
Use 100 MB chunks, which is faster because larger chunks amortize the overhead of launching new processes.
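For intuition, the chunk size directly controls how many worker processes get launched per file. A small sketch of the byte-range math (the function name is illustrative):

```python
_DOWNLOAD_CHUNK_SIZE = 100_000_000  # 100 MB


def chunk_ranges(file_size, chunk_size=_DOWNLOAD_CHUNK_SIZE):
    # Return (start, end) byte offsets (end exclusive) for each chunk;
    # a 1 GB file yields 10 chunks at 100 MB vs. 100 chunks at 10 MB,
    # i.e. one-tenth the process-launch overhead.
    return [
        (start, min(start + chunk_size, file_size))
        for start in range(0, file_size, chunk_size)
    ]
```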
…ls.databricks_utils
…els_artifact_repo
This reverts commit a63313a.
Looks good to me!
@serena-ruan Can you update the PR description to add a table describing how large the performance gain is?
What changes are proposed in this pull request?
Use subprocess for multi-part download execution.
Benchmarking test, on DBR 12.2 LTS ML (32 GB, 4 cores):
- 1 GB file download time decreases from 18.x seconds to 3.x seconds.
- 2 GB file download time decreases from 32.x seconds to 6.x seconds.
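The overall flow can be sketched as: split the file into byte ranges, launch one subprocess per range, then concatenate the parts. The sketch below substitutes a local byte-range copy for the real HTTP Range request so it stays self-contained; all names and the worker protocol are illustrative, not the PR's actual implementation:

```python
import os
import subprocess
import sys

# Stand-in worker: copies one byte range of a local file, mimicking the
# "Range: bytes=start-end" request a real download worker would issue.
_WORKER = """
import sys
src, dst, start, end = sys.argv[1], sys.argv[2], int(sys.argv[3]), int(sys.argv[4])
with open(src, "rb") as f:
    f.seek(start)
    data = f.read(end - start)
with open(dst, "wb") as f:
    f.write(data)
"""


def parallel_download(src, dst, file_size, chunk_size):
    procs, parts = [], []
    for i, start in enumerate(range(0, file_size, chunk_size)):
        end = min(start + chunk_size, file_size)
        part = f"{dst}.part{i}"
        parts.append(part)
        # Each range is fetched by an independent subprocess.
        procs.append(
            subprocess.Popen(
                [sys.executable, "-c", _WORKER, src, part, str(start), str(end)]
            )
        )
    for p in procs:
        assert p.wait() == 0
    # Reassemble the parts in order.
    with open(dst, "wb") as out:
        for part in parts:
            with open(part, "rb") as f:
                out.write(f.read())
            os.remove(part)
```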
How is this patch tested?
Does this PR change the documentation?
Release Notes
Is this a user-facing change?
(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging

Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support

Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages

Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes