fetch: initial fetch of a cloud upload-url takes *forever* #9968
Comments
Hey @d1sounds, you pretty much got to the bottom of it. The main difference between …
Would it work to fetch the md5 sums in parallel, rather than linearly?
@d1sounds Sure, but addressing redundant downloads will likely make a bigger difference. Those things are not mutually exclusive though; we don't parallelize hashing locally either (it's in our plans though).
What about this part @efiop?
@dberenbaum I assumed @d1sounds just means that we stream all files (not literally everything in the bucket, unless the bucket only contains this project, stored in the root). Maybe @d1sounds could clarify.
Yes, just the files. But in my case I'm using all the files in the bucket.
Not having a progress indicator for md5 indexing (even with …
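The two ideas raised in this thread, hashing in parallel and reporting progress while doing so, can be illustrated with a small standalone sketch. This is not DVC code: `md5_of_stream`, `hash_all`, `open_fn` and `on_progress` are hypothetical names, and an in-memory dict stands in for the bucket.

```python
import hashlib
import io
from concurrent.futures import ThreadPoolExecutor, as_completed


def md5_of_stream(stream, chunk_size=1024 * 1024):
    """Hash a file-like object chunk by chunk, without loading it whole."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def hash_all(paths, open_fn, max_workers=8, on_progress=None):
    """Hash many objects concurrently.

    open_fn(path) is a hypothetical opener returning a file-like object;
    for a real remote it would wrap whatever client is in use.
    """
    def work(path):
        with open_fn(path) as stream:
            return md5_of_stream(stream)

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(work, p): p for p in paths}
        # as_completed yields each future as it finishes, so progress can
        # be reported from the calling thread after every object.
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            if on_progress is not None:
                on_progress(done, len(futures))
    return results


# Stand-in for a bucket: in-memory blobs instead of real S3 objects.
fake_bucket = {"a.bin": b"hello", "b.bin": b"world", "c.bin": b""}
seen = []
hashes = hash_all(
    fake_bucket,
    lambda p: io.BytesIO(fake_bucket[p]),
    on_progress=lambda done, total: seen.append((done, total)),
)
```

Threads suit this workload because the time is dominated by network reads rather than by the hashing itself. As noted above, this would not remove the redundant download, only overlap it.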
Bug Report
fetch: initial fetch of a cloud upload-url takes *forever*
Description
I'm importing several large s3 buckets (`dvc import-url --version-aware s3://bucket/path`) which are about 100k files totalling around 50GB. When I pull a new repo, I expected the initial `dvc fetch` to be slow, but it's really slow (like 15 hours over gigabit fiber!). Way slower than the initial `import-url`. I started debugging, and what I found is that almost all of the time is going into the initial remote index building (`md5()` in `dvc_data/index/save.py`).

In a nutshell what I found is:

- … `md5`, thrown away, then refetched for the `fetch()`. The structure of the code makes sense, but obviously a single fetch would be preferable (the fast one!).
- … `fetch()` in `dvc_data/index/fetch.py`), the index isn't updated (`save()`) until the entire thing is computed. When the `fetch` fails or cancels (which easily happens in 15 hours), none of the intermediate progress is saved!

Reproduce
dvc import-url --version-aware s3://bucket/path test
git commit test.dvc
git pull
dvc fetch test
Expected
I expect the `fetch` to take about the same time as the original `import-url`.
Environment information
I've tried this on `3.19.0` and `3.22.0`.
Output of `dvc doctor`:
Additional Information (if any):
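The two findings in the Description (files are hashed, discarded, then downloaded again; the index is only saved once everything has been computed) point at two possible changes: hash each file in the same pass that writes it to disk, and checkpoint the index periodically so an interrupted run can resume. The following is a minimal sketch of that idea only, not DVC's implementation: `fetch_and_hash`, `save_index`, `fetch_all`, `open_fn` and the JSON index file are all hypothetical stand-ins.

```python
import hashlib
import io
import json
import os
import tempfile


def fetch_and_hash(stream, dest_path, chunk_size=1024 * 1024):
    """Write a stream to dest_path and compute its md5 in the same pass."""
    digest = hashlib.md5()
    with open(dest_path, "wb") as out:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
            out.write(chunk)
    return digest.hexdigest()


def save_index(index, index_path):
    """Write to a temp file, then rename, so a crash never leaves a torn index."""
    tmp_path = index_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(index, f)
    os.replace(tmp_path, index_path)


def fetch_all(objects, open_fn, dest_dir, index_path, checkpoint_every=100):
    """Fetch each object once, skipping names already recorded in the index."""
    index = {}
    if os.path.exists(index_path):
        with open(index_path) as f:
            index = json.load(f)
    pending = [name for name in objects if name not in index]
    for count, name in enumerate(pending, start=1):
        with open_fn(name) as stream:
            index[name] = fetch_and_hash(stream, os.path.join(dest_dir, name))
        # Checkpoint so a failure loses at most checkpoint_every entries.
        if count % checkpoint_every == 0:
            save_index(index, index_path)
    save_index(index, index_path)
    return index


# Stand-in for a bucket: in-memory blobs instead of real S3 objects.
fake_bucket = {"a.bin": b"hello", "b.bin": b"world"}
workdir = tempfile.mkdtemp()
index_file = os.path.join(workdir, "index.json")
first = fetch_all(
    fake_bucket,
    lambda name: io.BytesIO(fake_bucket[name]),
    workdir,
    index_file,
    checkpoint_every=1,
)
```

Calling `fetch_all` a second time with the same `index_file` opens nothing, because every name is already in the saved index. The sketch trusts the index and does not verify that the local files still exist, which a real implementation would need to handle.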