import-url: Performance decrease with growing number of data files #8373

Open
patrickbrus opened this issue Sep 28, 2022 · 3 comments
Labels: A: data-sync (Related to dvc get/fetch/import/pull/push), performance (improvement over resource / time consuming tasks)

Comments

@patrickbrus

Bug Report

Description

In our current scenario we want to use dvc with a remote storage that is not a typical S3 bucket, so we want to use dvc import-url to track and import data from that remote storage. We have to track 4000 files per class, and the files themselves are *.wav files (the data is taken from TensorFlow's microspeech example).

We observed that the performance of dvc import-url degrades as we import more files from the remote.

The first execution of dvc import-url took 1.5 seconds:

[screenshot of the timing output]

Importing file 2000 already took more than 20 seconds:

[screenshot of the timing output]

And importing file 3750 then took almost 50 seconds:

[screenshot of the timing output]
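
To put these three measurements in perspective, here is a rough back-of-the-envelope sketch of the cumulative time they imply. It assumes the per-import time grows linearly between the reported points (an assumption, not something we measured), and the function name is made up for illustration.

```python
# Timings reported above: (number of the import, seconds for that import).
# The values are approximate ("more than 20", "almost 50"), so the result
# is only a rough figure.
observations = [(1, 1.5), (2000, 20.0), (3750, 50.0)]


def estimate_total_seconds(points):
    """Trapezoidal estimate of the cumulative time over the observed range.

    Assumes the per-import time changes linearly between neighbouring points.
    """
    total = 0.0
    for (n0, t0), (n1, t1) in zip(points, points[1:]):
        total += (t0 + t1) / 2 * (n1 - n0)
    return total


total = estimate_total_seconds(observations)
print(f"estimated time for the first 3750 imports: {total / 3600:.1f} hours")
```

Under that assumption the first 3750 imports alone add up to roughly a day of wall-clock time, which is why the slowdown matters for our setup.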

We looked at the source code of this command and found that, around line 73, a new index is created and its graph is checked:

dvc/dvc/repo/imp_url.py

Lines 72 to 74 in c737641

try:
    new_index = self.index.add(stage)
    new_index.check_graph()

Could this be the reason for the performance drop as the number of files grows? If so, is there another way to implement it? And is it possible to skip creating that index and checking the graph altogether?

We do not intend to use dvc push. We only care about making import-url fast in our current setup.

Reproduce

As input we have a list of artifact IDs linking to files that are stored in the remote storage.
We then loop over this list and for each element we run:

# added time command to measure execution time of this command
time dvc import-url remote://storage/artifactID destfolder/targetfilename
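
A minimal sketch of that loop in Python, in case it helps with reproducing the numbers. The command is the one shown above; the helper names, the `run` parameter (only there so the loop can be exercised without dvc installed), and the example artifact IDs are assumptions for illustration.

```python
import subprocess
import time


def build_command(artifact_id, target):
    # Same command as above; the remote name "storage" is taken from it.
    return ["dvc", "import-url", f"remote://storage/{artifact_id}", target]


def time_imports(artifacts, run=subprocess.run):
    """Run one import per (artifact_id, target) pair and record its duration.

    Returns a list of (import_number, seconds) tuples, numbered from 1.
    """
    timings = []
    for number, (artifact_id, target) in enumerate(artifacts, start=1):
        start = time.perf_counter()
        run(build_command(artifact_id, target), check=True)
        timings.append((number, time.perf_counter() - start))
    return timings


# Example usage (hypothetical artifact IDs and target names):
# for number, seconds in time_imports([("id-0001", "destfolder/yes_0001.wav")]):
#     print(number, f"{seconds:.1f}s")
```

Logging the import number next to the duration makes it easy to see whether the time per import grows steadily with the number of files already imported.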

Expected

The time per import should not increase as the number of imported data files grows.

Environment information

Output of dvc doctor:

DVC version: 2.8.1 (pip)
---------------------------------
Platform: Python 3.9.7 on Linux-4.18.0-305.57.1.el8_4.x86_64-x86_64-with-glibc2.28
Supports:
        webhdfs (fsspec = 2021.10.0),
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2021.10.0, boto3 = 1.17.106)
@dtrifiro
Contributor

Hi, it seems that you're using a fairly old dvc version. Is upgrading to a more recent version an option?

@efiop added the awaiting response label on Sep 30, 2022
@patrickbrus
Author

We also tested this with dvc==2.27.2 and observed the same behavior. The source code linked in our report above is from the latest version.

@dtrifiro added the performance and A: data-sync labels and removed the awaiting response label on Sep 30, 2022
@pmrowla
Contributor

pmrowla commented Oct 2, 2022

The graph check you noted is required since DVC needs to make sure your repo does not contain any overlapping outputs (i.e. multiple .dvc files that point to the same output path). If you are generating thousands of .dvc files it will eventually start to degrade performance.
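
To illustrate why this grows with the number of .dvc files, here is a toy model of an overlapping-output check. It is not DVC's actual implementation; the function names and the naive pairwise comparison are assumptions made purely to show the scaling. Each new import is compared against every output already tracked, so the work per import grows linearly and the total grows quadratically.

```python
def overlaps(a, b):
    """True if two output paths are equal or one lies inside the other."""
    a, b = a.rstrip("/"), b.rstrip("/")
    return a == b or a.startswith(b + "/") or b.startswith(a + "/")


def check_new_output(existing_outputs, new_output):
    """Compare the new output with every existing one.

    Returns the number of comparisons made; raises ValueError on an overlap.
    """
    comparisons = 0
    for out in existing_outputs:
        comparisons += 1
        if overlaps(out, new_output):
            raise ValueError(f"{new_output!r} overlaps with {out!r}")
    return comparisons


outputs = []
per_import = []
for i in range(1000):
    path = f"destfolder/file_{i}.wav"
    per_import.append(check_new_output(outputs, path))
    outputs.append(path)

total = sum(per_import)
print(per_import[0], per_import[-1], total)
```

In this toy model the first import needs no comparisons, the 1000th needs 999, and the total is about half a million, which matches the general pattern of each individual import getting slower as more files are tracked.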

Do you need to import each file individually, or are you actually importing everything that is in the defined remote storage? If you are importing the entire remote contents, it would be faster on the DVC side to do something like dvc import-url remote://rddl/ -o data/ (which would generate a single data.dvc for the entire directory).
