Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DVCFileSystem.get() doesn't parallelize download #8896

Open
grahameth opened this issue Jan 27, 2023 · 1 comment
Open

DVCFileSystem.get() doesn't parallelize download #8896

grahameth opened this issue Jan 27, 2023 · 1 comment
Labels
A: api Related to the dvc.api performance improvement over resource / time consuming tasks

Comments

@grahameth
Copy link

Bug Report

Calling get on DVCFileSystem with multiple files doesn't parallelize the download.

Description

Reproduce

This example downloads multiple files with a single get call:

fs = DVCFileSystem(repo_url, rev=rev)
remote_files = [...]
local_files  = [...]
fs.get(remote_files, local_files, callback=cb)

It works, but it downloads 1 file at a time.

Expected

This example does the same, but accesses internal datastructures to make it faster:

fs = DVCFileSystem(repo_url, rev=rev)
remote_files = [...]
local_files  = [...]
fs._datafss[()].get(remote_files, local_files, callback=cb)

This internal get function parallelizes the download, and is thus about twice as fast on my machine.

Given that this functionality is already implemented, I find it a bit sad that it isn't used.

DVC version: 2.38.1 (conda)
---------------------------------
Platform: Python 3.9.15 on Windows-10-10.0.19045-SP0
Subprojects:
        dvc_data = 0.28.4
        dvc_objects = 0.14.0
        dvc_render = 0.0.15
        dvc_task = 0.1.8
        dvclive = 1.3.0
        scmrepo = 0.1.4
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        ssh (sshfs = 0.0.0)
@skshetry
Copy link
Member

I'll suggest something like follows as a temporary workaround:

with Repo.open(repo_url, rev=rev) as repo:
	repo.dvcfs.get(...)

There are some inconsistencies between repo.dvcfs and DvcFileSystem, but for the purpose of fs.get(), they should work similarly (and be faster). They are still a bit inefficient, ideally we want to take the underlying filesystem of the remote (say s3fs) directly for better performance.

@dtrifiro dtrifiro added the performance improvement over resource / time consuming tasks label Jan 28, 2023
@daavoo daavoo added the A: api Related to the dvc.api label Sep 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: api Related to the dvc.api performance improvement over resource / time consuming tasks
Projects
None yet
Development

No branches or pull requests

4 participants