
fetch: initial fetch of a cloud upload-url takes *forever* #9968

Open
d1sounds opened this issue Sep 22, 2023 · 7 comments
Labels
p2-medium: Medium priority, should be done, but less important
performance: improvement over resource / time consuming tasks

Comments

@d1sounds

Bug Report


Description

I'm importing several large s3 buckets (dvc import-url --version-aware s3://bucket/path), about 100k files totalling around 50GB.
When I pull in a fresh clone of the repo, I expected the initial dvc fetch to be slow, but it's extremely slow (about 15 hours over gigabit fiber!), far slower than the initial import-url. I started debugging, and found that almost all of the time goes into the initial remote index building (md5() in dvc_data/index/save.py).

In a nutshell, what I found is:

  1. The indexing fetches the full bucket from s3 sequentially to compute the md5s, which is much slower than the actual fetch, since the fetch runs with some parallelism.
  2. It's unfortunate that the full bucket is fetched to compute the md5s, thrown away, then refetched by fetch(). The structure of the code makes sense, but a single fetch (the fast, parallel one!) would obviously be preferable.
  3. When the remote index is computed (fetch() in dvc_data/index/fetch.py), the index isn't persisted (save()) until the entire thing is computed. When the fetch fails or is cancelled (which easily happens over 15 hours), none of the intermediate progress is saved! (See the sketch after this list.)
  4. There's also no progress indication during all this time.
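To illustrate point 3, here's a minimal sketch of what checkpointing the index could look like. This is hypothetical code, not DVC's actual index API: `compute_md5`, the JSON file, and the `save_every` interval are all assumptions for illustration.

```python
import json

def build_index(entries, compute_md5, index_path, save_every=1000):
    """Hypothetical sketch: checkpoint the partially built index every
    `save_every` entries so an interrupted multi-hour run doesn't lose
    all progress. `entries` yields file keys; `compute_md5` hashes one."""
    index = {}
    for i, key in enumerate(entries, 1):
        index[key] = compute_md5(key)
        if i % save_every == 0:
            with open(index_path, "w") as f:
                json.dump(index, f)  # persist intermediate progress
    with open(index_path, "w") as f:
        json.dump(index, f)  # final save
    return index
```

A resumable version would also load any existing index file on startup and skip the keys already present.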

Reproduce

  1. dvc import-url --version-aware s3://bucket/path test
  2. git commit test.dvc
  3. in a new copy of the git repo: git pull
  4. dvc fetch test

Expected

I expect the fetch to take about the same time as the original import-url.

Environment information

I've tried this on 3.19.0 and 3.22.0.

Output of dvc doctor:

DVC version: 3.19.0 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.2.0-32-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 2.16.1
	dvc_objects = 1.0.1
	dvc_render = 0.5.3
	dvc_task = 0.3.0
	scmrepo = 1.3.1
Supports:
	http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2023.9.0, boto3 = 1.28.17)
Config:
	Global: /home/david/.config/dvc
	System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/ac3ad7a1fbc9fd2c2a49af1dfba113c3

You are using dvc version 3.19.0; however, version 3.22.0 is available.
To upgrade, run 'pip install --upgrade dvc'.

Additional Information (if any):

@efiop
Contributor

efiop commented Sep 23, 2023

Hey @d1sounds, you pretty much got to the bottom of it. The main difference between dvc import-url and dvc fetch/pull currently could be summarized as import-url actually being 2 commands: download to your workspace, then dvc add the result. dvc fetch/pull, on the other hand, doesn't currently have a workspace to download temporarily to, so it has to virtually download files into content-based storage without knowing their md5s. We've introduced that temporary space for some edge cases already, but not yet for the scenario you've described, so stay tuned. I'll see if I can send something actionable soon; if not, I'll link an issue here to track the progress.
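To make the distinction concrete, here's a hedged sketch of downloading straight into content-addressed storage in a single pass, via the kind of temporary space described above. Illustrative only: `fs` is assumed to be an fsspec-style filesystem (e.g. s3fs), and the two-level cache layout is an assumption, not DVC's actual implementation.

```python
import hashlib
import os
import tempfile

def ingest_to_cache(fs, remote_path, cache_dir, chunk=1024 * 1024):
    """Hypothetical sketch: stream the object into a temp file (the
    'temporary space'), hashing the same bytes as they arrive, then
    move the file to its hash-derived path once the md5 is known.
    Each object is fetched exactly once."""
    md5 = hashlib.md5()
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with fs.open(remote_path, "rb") as src, os.fdopen(fd, "wb") as dst:
        for block in iter(lambda: src.read(chunk), b""):
            md5.update(block)
            dst.write(block)
    digest = md5.hexdigest()
    # Assumed cache layout: <cache_dir>/<first 2 hex chars>/<rest>
    final = os.path.join(cache_dir, digest[:2], digest[2:])
    os.makedirs(os.path.dirname(final), exist_ok=True)
    os.replace(tmp_path, final)  # atomic move into the content store
    return digest
```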

@d1sounds
Author

Would it work to fetch the md5 sums in parallel, rather than linearly?

@efiop
Contributor

efiop commented Sep 24, 2023

@d1sounds Sure, but addressing the redundant downloads will likely make a bigger difference. The two things are not mutually exclusive, though; we don't parallelize hashing locally either (it's in our plans).
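For reference, the parallel variant d1sounds asked about might look roughly like this. A sketch under the same assumptions as above (an fsspec-style `fs`), not DVC's code; S3 reads are I/O-bound, so a thread pool overlaps the network latency instead of waiting on each file serially.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def md5_of(fs, path, chunk=1024 * 1024):
    """Stream one remote file through an md5 digest."""
    h = hashlib.md5()
    with fs.open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def hash_many(fs, paths, jobs=16):
    """Hash many remote files concurrently instead of one at a time.
    ThreadPoolExecutor.map preserves input order, so the zip is safe."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(zip(paths, pool.map(lambda p: md5_of(fs, p), paths)))
```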

@dberenbaum
Contributor

2. It's unfortunate that the full bucket is being fetched

What about this part @efiop?

@efiop
Contributor

efiop commented Sep 25, 2023

@dberenbaum I assumed @d1sounds just means that we stream all the files (not literally everything in the bucket, unless the bucket only contains this project, stored at the root). Maybe @d1sounds could clarify.

@d1sounds
Author

Yes, just the files. But in my case I'm using all the files in the bucket.

@mbergal

mbergal commented Nov 6, 2023

Not having a progress indicator for md5 indexing (even with -vv) is really bad UX, though, and it doesn't depend on the specifics of the fetching algorithm.
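A progress bar over the hashing loop would indeed be a small, independent fix. A minimal sketch, assuming tqdm (which DVC already uses for its other progress bars) and a hypothetical `compute_md5` helper:

```python
from tqdm import tqdm

def hash_with_progress(paths, compute_md5):
    """Hypothetical sketch: wrap the md5 indexing loop in a progress
    bar so a multi-hour pass is at least visible to the user."""
    return {
        path: compute_md5(path)
        for path in tqdm(paths, desc="Computing md5s", unit="file")
    }
```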

@dberenbaum added the p2-medium and performance labels on May 2, 2024