import-url: Performance decrease with growing number of data files #8373
Labels
A: data-sync (related to dvc get/fetch/import/pull/push)
performance (improvement over resource- or time-consuming tasks)
Bug Report
Description
In our current scenario we want to use dvc together with a remote storage that is not a typical S3 bucket, so we want to use dvc import-url to track and import data from that remote storage. We have to track 4000 files per class, and the files themselves are of type *.wav (the data is taken from TensorFlow's micro speech example -> Link).
We observed that the performance of dvc import-url drops as we import more data points from the remote.
The first execution of dvc import-url took about 1.5 seconds.
Pulling file 2000 already took more than 20 seconds.
Pulling file 3750 then took almost 50 seconds.
We took a look into the source code of this command and found that an index is created and the graph is checked (dvc/repo/imp_url.py, lines 72 to 74 at commit c737641).
Could this be the reason for the performance drop as the number of files grows? If so, is there another way to implement it? And is there a way to skip creating that index and checking the graph?
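To illustrate the suspicion, here is a toy cost model (a sketch only, not DVC's actual implementation; `validate_graph` and `import_url` below are made-up stand-ins): if every import re-validates the full stage graph, the per-call cost grows linearly with the number of already-tracked files, which matches the shape of the slowdown we observe.

```python
# Toy model (assumption, not DVC code): each import re-validates the whole
# stage graph, so call N costs O(N) and importing N files costs O(N^2) total.

def validate_graph(stages):
    """Stand-in for the graph check: visits every tracked stage once."""
    visited = 0
    for _stage in stages:
        visited += 1  # e.g. per-stage cycle / duplicate-output checks
    return visited

def import_url(stages, new_stage):
    """One import: add the new stage, then validate the whole graph."""
    stages.append(new_stage)
    return validate_graph(stages)  # cost grows with len(stages)

stages = []
costs = [import_url(stages, f"file_{i}.wav") for i in range(4000)]
print(costs[0], costs[1999], costs[3749])  # → 1 2000 3750
```

In this model the 2000th and 3750th imports are orders of magnitude more expensive than the first, which is the pattern we see (absolute timings will differ, of course).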
We don't intend to use dvc push; we just care about making import-url fast in our current setup.
Reproduce
As input we have a list of artifact IDs linking to files that are stored in the remote storage.
We then loop over this list and for each element we run:
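A sketch of that loop (the artifact IDs, URL layout, and target paths here are placeholders, not our real values; `echo` keeps this a dry run — drop it to actually execute the imports):

```shell
# Hypothetical reproduce loop: one dvc import-url call per artifact ID.
printf '%s\n' id_0001 id_0002 id_0003 > artifact_ids.txt

while read -r artifact_id; do
  echo dvc import-url "https://remote.example.com/${artifact_id}.wav" \
    "data/${artifact_id}.wav"
done < artifact_ids.txt
```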
Expected
Performance does not drop as the number of imported data files grows.
Environment information
Output of `dvc doctor`: