
fetch: initial fetch of a cloud upload-url takes *forever* #9968

Open
d1sounds opened this issue Sep 22, 2023 · 7 comments
Labels
p2-medium: Medium priority, should be done, but less important
performance: improvement over resource / time consuming tasks

Comments

@d1sounds

Bug Report


Description

I'm importing several large s3 buckets (dvc import-url --version-aware s3://bucket/path), about 100k files totalling around 50GB.
When I pull in a fresh clone of the repo, I expected the initial dvc fetch to be slow, but it's extremely slow (about 15 hours over gigabit fiber!), far slower than the initial import-url. I started debugging, and found that almost all of the time goes into the initial remote index building (md5() in dvc_data/index/save.py).

In a nutshell, what I found is:

  1. The indexing fetches the full bucket from s3 sequentially to compute the md5s, which is much slower than the actual fetch, since the fetch runs with some parallelism.
  2. It's unfortunate that the full bucket is fetched to compute the md5s, thrown away, then refetched by fetch(). The structure of the code makes sense, but a single fetch (the fast, parallel one!) would obviously be preferable.
  3. When the remote index is computed (fetch() in dvc_data/index/fetch.py), the index isn't persisted (save()) until the entire thing is computed. When the fetch fails or is cancelled (which easily happens over 15 hours), none of the intermediate progress is saved! (See the sketch after this list.)
  4. There's also no progress indication during all this time.
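To illustrate point 3, here's a minimal sketch of what checkpointing the index could look like. This is hypothetical code, not DVC's actual index API: `compute_md5`, the JSON file, and the `save_every` interval are all assumptions for illustration.

```python
import json

def build_index(entries, compute_md5, index_path, save_every=1000):
    """Hypothetical sketch: checkpoint the partially built index every
    `save_every` entries so an interrupted multi-hour run doesn't lose
    all progress. `entries` yields file keys; `compute_md5` hashes one."""
    index = {}
    for i, key in enumerate(entries, 1):
        index[key] = compute_md5(key)
        if i % save_every == 0:
            with open(index_path, "w") as f:
                json.dump(index, f)  # persist intermediate progress
    with open(index_path, "w") as f:
        json.dump(index, f)  # final save
    return index
```

A resumable version would also load any existing index file on startup and skip the keys already present.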

Reproduce

  1. dvc import-url --version-aware s3://bucket/path test
  2. git commit test.dvc
  3. in a new copy of the git repo: git pull
  4. dvc fetch test

Expected

I expect the fetch to take about the same time as the original import-url.

Environment information

I've tried this on 3.19.0 and 3.22.0.

Output of dvc doctor:

DVC version: 3.19.0 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.2.0-32-generic-x86_64-with-glibc2.35
Subprojects:
	dvc_data = 2.16.1
	dvc_objects = 1.0.1
	dvc_render = 0.5.3
	dvc_task = 0.3.0
	scmrepo = 1.3.1
Supports:
	http (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.5, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2023.9.0, boto3 = 1.28.17)
Config:
	Global: /home/david/.config/dvc
	System: /etc/xdg/xdg-ubuntu/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p2
Caches: local
Remotes: s3
Workspace directory: ext4 on /dev/nvme0n1p2
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/ac3ad7a1fbc9fd2c2a49af1dfba113c3

You are using dvc version 3.19.0; however, version 3.22.0 is available.
To upgrade, run 'pip install --upgrade dvc'.

Additional Information (if any):

@efiop
Contributor

efiop commented Sep 23, 2023

Hey @d1sounds, you pretty much got to the bottom of it. The main difference between dvc import-url and dvc fetch/pull currently could be summarized as import-url actually being 2 commands: download to your workspace, then dvc add the result. dvc fetch/pull, on the other hand, doesn't currently have a workspace to download temporarily to, so it has to virtually download files into content-based storage without knowing their md5s. We've introduced that temporary space for some edge cases already, but not yet for the scenario you've described, so stay tuned. I'll see if I can send something actionable soon; if not, I'll link an issue here to track the progress.
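To make the distinction concrete, here's a hedged sketch of downloading straight into content-addressed storage in a single pass, via the kind of temporary space described above. Illustrative only: `fs` is assumed to be an fsspec-style filesystem (e.g. s3fs), and the two-level cache layout is an assumption, not DVC's actual implementation.

```python
import hashlib
import os
import tempfile

def ingest_to_cache(fs, remote_path, cache_dir, chunk=1024 * 1024):
    """Hypothetical sketch: stream the object into a temp file (the
    'temporary space'), hashing the same bytes as they arrive, then
    move the file to its hash-derived path once the md5 is known.
    Each object is fetched exactly once."""
    md5 = hashlib.md5()
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with fs.open(remote_path, "rb") as src, os.fdopen(fd, "wb") as dst:
        for block in iter(lambda: src.read(chunk), b""):
            md5.update(block)
            dst.write(block)
    digest = md5.hexdigest()
    # Assumed cache layout: <cache_dir>/<first 2 hex chars>/<rest>
    final = os.path.join(cache_dir, digest[:2], digest[2:])
    os.makedirs(os.path.dirname(final), exist_ok=True)
    os.replace(tmp_path, final)  # atomic move into the content store
    return digest
```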

@d1sounds
Author

Would it work to fetch the md5 sums in parallel, rather than linearly?

@efiop
Contributor

efiop commented Sep 24, 2023

@d1sounds Sure, but addressing the redundant downloads will likely make a bigger difference. The two things are not mutually exclusive, though; we don't parallelize hashing locally either (it's in our plans).
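For reference, the parallel variant d1sounds asked about might look roughly like this. A sketch under the same assumptions as above (an fsspec-style `fs`), not DVC's code; S3 reads are I/O-bound, so a thread pool overlaps the network latency instead of waiting on each file serially.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def md5_of(fs, path, chunk=1024 * 1024):
    """Stream one remote file through an md5 digest."""
    h = hashlib.md5()
    with fs.open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def hash_many(fs, paths, jobs=16):
    """Hash many remote files concurrently instead of one at a time.
    ThreadPoolExecutor.map preserves input order, so the zip is safe."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return dict(zip(paths, pool.map(lambda p: md5_of(fs, p), paths)))
```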

@dberenbaum
Contributor

2. It's unfortunate that the full bucket is being fetched

What about this part @efiop?

@efiop
Contributor

efiop commented Sep 25, 2023

@dberenbaum I assumed @d1sounds just means that we stream all the files (not literally everything in the bucket, unless the bucket only contains this project, stored at the root). Maybe @d1sounds could clarify.

@d1sounds
Author

Yes, just the files. But in my case I'm using all the files in the bucket.

@mbergal

mbergal commented Nov 6, 2023

Not having a progress indicator for md5 indexing (even with -vv) is really bad UX, though, and it doesn't depend on the specifics of the fetching algorithm.
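A progress bar over the hashing loop would indeed be a small, independent fix. A minimal sketch, assuming tqdm (which DVC already uses for its other progress bars) and a hypothetical `compute_md5` helper:

```python
from tqdm import tqdm

def hash_with_progress(paths, compute_md5):
    """Hypothetical sketch: wrap the md5 indexing loop in a progress
    bar so a multi-hour pass is at least visible to the user."""
    return {
        path: compute_md5(path)
        for path in tqdm(paths, desc="Computing md5s", unit="file")
    }
```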

@dberenbaum added the p2-medium and performance labels on May 2, 2024