Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repofs: use underlying fs.download to download files #6401

Merged
merged 1 commit into from Aug 10, 2021

Conversation

efiop
Copy link
Member

@efiop efiop commented Aug 9, 2021

This makes dvc get work just as fast as import/pull. The reason is that open() is harder to make fast (e.g. we need to know what block size to use, etc, for which we don't yet have a very good internal infrastructure), and it is much easier to just use native fs.download method that has been already tuned for maximum performance.

Related to #5546
Fixes #6019

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@efiop efiop requested a review from a team as a code owner August 9, 2021 19:09
@efiop efiop requested a review from isidentical August 9, 2021 19:09
@@ -253,3 +255,9 @@ def info(self, path_info):
ret[obj.hash_info.name] = obj.hash_info.value

return ret

def _download(self, from_info, to_file, **kwargs):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record: this is a bit ugly, but it will turn into a cleaner and more proper fs.get_file when migrating dvcfs to fsspec.

def open( # type: ignore
self, path: PathInfo, mode="r", encoding=None, remote=None, **kwargs
): # pylint: disable=arguments-differ
def _get_fs_path(self, path: PathInfo, remote=None):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This didn't handle chained imports before and neither does it handle them now. We'll need to generalize resolving logic we use in dvc/dependency/repo.py to use it here. Will prob look into it in a followup.

@efiop efiop added optimize Optimizes DVC performance improvement over resource / time consuming tasks labels Aug 9, 2021

def _download(self, from_info, to_file, **kwargs):
fs, path = self._get_fs_path(from_info)
fs._download( # pylint: disable=protected-access
Copy link
Member Author

@efiop efiop Aug 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This again is temporary, it will be replaced by just get_file in the future.

@efiop efiop changed the title [WIP] repofs: use underlying fs.download to download files repofs: use underlying fs.download to download files Aug 9, 2021
Copy link
Contributor

@isidentical isidentical left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@efiop efiop merged commit e99e6cd into iterative:master Aug 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimize Optimizes DVC performance improvement over resource / time consuming tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

get: load performance is significantly (many times) lower than for pull (reproduced on dvc-bench dataset)
2 participants