import: pull performance #10059

Open
dberenbaum opened this issue Oct 30, 2023 · 6 comments · Fixed by iterative/dvc-data#459
Labels
A: data-sync (Related to dvc get/fetch/import/pull/push), performance (improvement over resource / time consuming tasks)

Comments

@dberenbaum
Contributor

Copied from Slack

I’m able to reproduce it using the AWS sandbox:

$ git clone git@github.com:dberenbaum/download-dvc-dir.git
$ cd download-dvc-dir
$ dvc pull test2014

This pulls data imported from git@github.com:dberenbaum/coco-sample.git. When pulling directly from the source repo, it starts to pull fast, but pulling from download-dvc-dir gets stuck here for a long time:

$ dvc pull -vv test2014.dvc
2023-10-26 08:06:39,751 DEBUG: v3.27.1.dev6+g4a0d56a79.d20231020, CPython 3.11.5 on macOS-14.0-arm64-arm-64bit
2023-10-26 08:06:39,751 DEBUG: command: /Users/dave/micromamba/envs/dvc/bin/dvc pull -vv test2014.dvc
2023-10-26 08:06:39,751 TRACE: Namespace(quiet=0, verbose=2, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='pull', jobs=None, targets=['test2014.dvc'], remote=None, all_branches=False, all_tags=False, all_commits=False, force=False, with_deps=False, recursive=False, run_cache=False, glob=False, allow_missing=False, func=<class 'dvc.commands.data_sync.CmdDataPull'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2023-10-26 08:06:39,969 TRACE: params.yaml does not exist, it won't be used in parametrization
2023-10-26 08:06:39,971 TRACE:    16.24 ms in collecting stages from /Users/dave/Code/download-dvc-dir
2023-10-26 08:06:39,979 DEBUG: Creating external repo git@github.com:dberenbaum/coco-sample.git@ad247281096a07d3c3ea417617bf68ba491d16cb
2023-10-26 08:06:39,979 DEBUG: erepo: git clone 'git@github.com:dberenbaum/coco-sample.git' to a temporary dir
2023-10-26 08:06:42,722 TRACE:     1.91 ms in collecting stages from /
2023-10-26 08:06:42,723 TRACE:     6.08 mks in collecting stages from /annotations
2023-10-26 08:06:42,738 DEBUG: Creating external repo git@github.com:iterative/lstm_seq2seq@8aa13ed31971eae16e4148cc0cd2c62fa65c38d0
2023-10-26 08:06:42,738 DEBUG: erepo: git clone 'git@github.com:iterative/lstm_seq2seq' to a temporary dir
2023-10-26 08:06:46,391 TRACE: Context during resolution of stage download:
{'model': {'batch_size': 512, 'latent_dim': 8, 'duration': '00:00:30:00', 'max_epochs': 2, 'optim': {'lr': 0.01}}, 'data_path': 'fra.txt', 'num_samples': 1013, 'seed': 423}
2023-10-26 08:06:46,481 TRACE: Context during resolution of stage train:
{'model': {'batch_size': 512, 'latent_dim': 8, 'duration': '00:00:30:00', 'max_epochs': 2, 'optim': {'lr': 0.01}}, 'data_path': 'fra.txt', 'num_samples': 1013, 'seed': 423}
2023-10-26 08:06:46,482 TRACE:    95.37 ms in collecting stages from /
2023-10-26 08:06:46,482 TRACE:     1.63 mks in collecting stages from /.github
2023-10-26 08:06:46,482 TRACE:     1.63 mks in collecting stages from /.github/workflows
2023-10-26 08:06:46,482 TRACE:     2.25 mks in collecting stages from /conf
2023-10-26 08:06:46,482 TRACE:     1.83 mks in collecting stages from /conf/model
2023-10-26 08:06:46,482 TRACE:     2.67 mks in collecting stages from /results
Collecting                                                     |0.00 [00:06,    ?entry/s]
2023-10-26 08:06:47,627 DEBUG: Preparing to transfer data from 's3://dave-sandbox/cache' to '/Users/dave/Code/download-dvc-dir/.dvc/cache'
2023-10-26 08:06:47,627 DEBUG: Preparing to collect status from '/Users/dave/Code/download-dvc-dir/.dvc/cache'
2023-10-26 08:06:47,627 DEBUG: Collecting status from '/Users/dave/Code/download-dvc-dir/.dvc/cache'
2023-10-26 08:06:48,586 DEBUG: Preparing to transfer data from 's3://dave-sandbox/cache/files/md5' to '/Users/dave/Code/download-dvc-dir/.dvc/cache/files/md5'
2023-10-26 08:06:48,586 DEBUG: Preparing to collect status from '/Users/dave/Code/download-dvc-dir/.dvc/cache/files/md5'
2023-10-26 08:06:48,586 DEBUG: Collecting status from '/Users/dave/Code/download-dvc-dir/.dvc/cache/files/md5'
2023-10-26 08:06:48,858 DEBUG: failed to load ('test2014',) from storage local (/Users/dave/Code/download-dvc-dir/.dvc/cache) - [Errno 2] No such file or directory: '/Users/dave/Code/download-dvc-dir/.dvc/cache/5d/2fabe8cfc3f4246724d34bb9791f84.dir'
Traceback (most recent call last):
  File "/Users/dave/Code/dvc-data/src/dvc_data/index/index.py", line 552, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/Users/dave/Code/dvc-data/src/dvc_data/index/index.py", line 488, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/Code/dvc-data/src/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/local.py", line 136, in open
    return open(path, mode=mode, encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/Users/dave/Code/download-dvc-dir/.dvc/cache/5d/2fabe8cfc3f4246724d34bb9791f84.dir'

Fetching
dberenbaum added the p1-important (Important, aka current backlog of things to do), performance (improvement over resource / time consuming tasks), and A: data-sync (Related to dvc get/fetch/import/pull/push) labels Oct 30, 2023
dberenbaum changed the title from "dvc get performance" to "import pull performance" Oct 30, 2023
dberenbaum changed the title from "import pull performance" to "import: pull performance" Oct 30, 2023
efiop self-assigned this Oct 31, 2023
@efiop
Contributor

efiop commented Nov 17, 2023

For the record: Can reproduce even with small dataset from dvc-bench. Investigating further.

efiop added a commit to efiop/dvc-data that referenced this issue Nov 17, 2023
The main issue is that we don't use md5 provided by the fs (e.g. dvcfs),
which results in needless hash recomputing. We can just use tried-and-tested
`hash_file` here for now.

Fixes iterative/dvc#10059
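
To illustrate the idea in the commit message above, here is a minimal sketch of preferring a checksum the filesystem already knows over re-reading and re-hashing the file. It is not the actual `hash_file` implementation from dvc-data: the `InMemoryFS` class, its `info()`/`read_bytes()` methods, and the `file_md5` helper are all hypothetical names made up for this example.

```python
import hashlib


class InMemoryFS:
    """Hypothetical filesystem whose info() may already carry an md5."""

    def __init__(self, files, known_md5=None):
        self.files = files                  # path -> bytes
        self.known_md5 = known_md5 or {}    # path -> precomputed md5 hex digest
        self.reads = 0                      # counts full-content reads

    def info(self, path):
        meta = {"size": len(self.files[path])}
        if path in self.known_md5:
            meta["md5"] = self.known_md5[path]
        return meta

    def read_bytes(self, path):
        self.reads += 1
        return self.files[path]


def file_md5(fs, path):
    """Return the md5 for path, reusing the filesystem's value when present."""
    cached = fs.info(path).get("md5")
    if cached is not None:
        return cached  # no content read, no rehash
    # Fall back to hashing the content only when nothing is provided.
    return hashlib.md5(fs.read_bytes(path)).hexdigest()
```

With tens of thousands of entries (the log above shows about 40.8k), skipping the fallback branch for every file is where the time would be saved, assuming the filesystem really does expose the hash.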
efiop added a commit that referenced this issue Nov 17, 2023
efiop added a commit that referenced this issue Nov 17, 2023
@dberenbaum
Contributor Author

@efiop Does the example above work for you? I'm seeing it get a little further but still get stuck on fetching:

$ dvc pull -vv test2014.dvc
2023-11-17 13:15:41,213 DEBUG: v3.30.1, CPython 3.11.5 on macOS-14.1-arm64-arm-64bit
2023-11-17 13:15:41,213 DEBUG: command: /Users/dave/micromamba/envs/dvc/bin/dvc pull -vv test2014.dvc
2023-11-17 13:15:41,213 TRACE: Namespace(quiet=0, verbose=2, cprofile=False, cprofile_dump=None, yappi=False, yappi_separate_threads=False, viztracer=False, viztracer_depth=None, viztracer_async=False, pdb=False, instrument=False, instrument_open=False, show_stack=False, cd='.', cmd='pull', jobs=None, targets=['test2014.dvc'], remote=None, all_branches=False, all_tags=False, all_commits=False, force=False, with_deps=False, recursive=False, run_cache=False, glob=False, allow_missing=False, func=<class 'dvc.commands.data_sync.CmdDataPull'>, parser=DvcParser(prog='dvc', usage=None, description='Data Version Control', formatter_class=<class 'argparse.RawTextHelpFormatter'>, conflict_handler='error', add_help=False))
2023-11-17 13:15:41,404 TRACE: params.yaml does not exist, it won't be used in parametrization
2023-11-17 13:15:41,406 TRACE:    16.60 ms in collecting stages from /private/tmp/download-dvc-dir
2023-11-17 13:15:41,414 DEBUG: Creating external repo git@github.com:dberenbaum/coco-sample.git@ad247281096a07d3c3ea417617bf68ba491d16cb
2023-11-17 13:15:41,414 DEBUG: erepo: git clone 'git@github.com:dberenbaum/coco-sample.git' to a temporary dir
2023-11-17 13:15:43,050 TRACE:     2.18 ms in collecting stages from /
2023-11-17 13:15:43,051 TRACE:     6.13 mks in collecting stages from /annotations
2023-11-17 13:15:43,062 DEBUG: failed to load ('test2014',) from storage local (/private/tmp/download-dvc-dir/.dvc/cache/files/md5) - [Errno 2] No such file or directory: '/private/tmp/download-dvc-dir/.dvc/cache/files/md5/5d/2fabe8cfc3f4246724d34bb9791f84.dir'
Traceback (most recent call last):
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 582, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 518, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/local.py", line 136, in open
    return open(path, mode=mode, encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/private/tmp/download-dvc-dir/.dvc/cache/files/md5/5d/2fabe8cfc3f4246724d34bb9791f84.dir'

2023-11-17 13:15:43,068 DEBUG: failed to load ('test2014',) from storage local (/private/tmp/download-dvc-dir/.dvc/cache) - [Errno 2] No such file or directory: '/private/tmp/download-dvc-dir/.dvc/cache/5d/2fabe8cfc3f4246724d34bb9791f84.dir'
Traceback (most recent call last):
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 582, in _load_from_storage
    _load_from_object_storage(trie, entry, storage)
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/index/index.py", line 518, in _load_from_object_storage
    obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_data/hashfile/tree.py", line 193, in load
    with obj.fs.open(obj.path, "r") as fobj:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/base.py", line 228, in open
    return self.fs.open(path, mode=mode, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dave/micromamba/envs/dvc/lib/python3.11/site-packages/dvc_objects/fs/local.py", line 136, in open
    return open(path, mode=mode, encoding=encoding)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/private/tmp/download-dvc-dir/.dvc/cache/5d/2fabe8cfc3f4246724d34bb9791f84.dir'

2023-11-17 13:15:53,634 DEBUG: Creating external repo git@github.com:iterative/lstm_seq2seq@8aa13ed31971eae16e4148cc0cd2c62fa65c38d0
2023-11-17 13:15:53,635 DEBUG: erepo: git clone 'git@github.com:iterative/lstm_seq2seq' to a temporary dir
2023-11-17 13:15:55,713 TRACE: Context during resolution of stage download:
{'model': {'batch_size': 512, 'latent_dim': 8, 'duration': '00:00:30:00', 'max_epochs': 2, 'optim': {'lr': 0.01}}, 'data_path': 'fra.txt', 'num_samples': 1013, 'seed': 423}
2023-11-17 13:15:55,724 TRACE: Context during resolution of stage train:
{'model': {'batch_size': 512, 'latent_dim': 8, 'duration': '00:00:30:00', 'max_epochs': 2, 'optim': {'lr': 0.01}}, 'data_path': 'fra.txt', 'num_samples': 1013, 'seed': 423}
2023-11-17 13:15:55,725 TRACE:    16.33 ms in collecting stages from /
2023-11-17 13:15:55,725 TRACE:     1.87 mks in collecting stages from /.github
2023-11-17 13:15:55,726 TRACE:     1.67 mks in collecting stages from /.github/workflows
2023-11-17 13:15:55,726 TRACE:     2.25 mks in collecting stages from /conf
2023-11-17 13:15:55,726 TRACE:     1.87 mks in collecting stages from /conf/model
2023-11-17 13:15:55,726 TRACE:     2.62 mks in collecting stages from /results
Collecting                                                   |40.8k [00:14, 2.85kentry/s]
2023-11-17 13:15:56,344 DEBUG: Preparing to transfer data from 's3://dave-sandbox/cache' to '/private/tmp/download-dvc-dir/.dvc/cache'
2023-11-17 13:15:56,344 DEBUG: Preparing to collect status from '/private/tmp/download-dvc-dir/.dvc/cache'
2023-11-17 13:15:56,345 DEBUG: Collecting status from '/private/tmp/download-dvc-dir/.dvc/cache'
2023-11-17 13:15:56,823 DEBUG: Preparing to transfer data from 's3://dave-sandbox/cache/files/md5' to '/private/tmp/download-dvc-dir/.dvc/cache/files/md5'
2023-11-17 13:15:56,823 DEBUG: Preparing to collect status from '/private/tmp/download-dvc-dir/.dvc/cache/files/md5'
2023-11-17 13:15:56,824 DEBUG: Collecting status from '/private/tmp/download-dvc-dir/.dvc/cache/files/md5'
Fetching

@efiop efiop reopened this Nov 17, 2023
@efiop
Contributor

efiop commented Nov 17, 2023

I've modified it to work with dvc-bench to make it quicker for me, but looks like I might've missed something. Let me try again.

@efiop
Contributor

efiop commented Nov 17, 2023

So I was testing with a slightly different setup: the dataset in the data registry (not dvc-bench, but a local one derived from it) was a new one with the hash: md5 field, while your coco-sample is an old-style one, so Meta didn't know how to load md5-dos2unix properly. This is a 3.x migration problem that we ran into here, in addition to the one that got fixed. Working on a fix.
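
For readers unfamiliar with the two hash names, here is a rough sketch of why a legacy md5-dos2unix value and a plain md5 value are not interchangeable. It assumes that the legacy scheme normalizes CRLF line endings to LF before hashing, and it ignores the text-file detection that the real implementation would apply first. The function names are hypothetical and this is not DVC's code.

```python
import hashlib


def md5_plain(data: bytes) -> str:
    """Hash the bytes exactly as stored."""
    return hashlib.md5(data).hexdigest()


def md5_dos2unix(data: bytes) -> str:
    """Assumed legacy behavior: convert CRLF to LF, then hash."""
    return hashlib.md5(data.replace(b"\r\n", b"\n")).hexdigest()


crlf_text = b"line one\r\nline two\r\n"
lf_text = b"line one\nline two\n"

# Same content, different line endings: the two schemes disagree on the CRLF file.
same_for_lf = md5_plain(lf_text) == md5_dos2unix(lf_text)
differs_for_crlf = md5_plain(crlf_text) != md5_dos2unix(crlf_text)
```

Under that assumption, a hash recorded with one scheme cannot be looked up as if it were produced by the other, so the loader has to know which scheme an entry was written with.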

@dberenbaum
Contributor Author

@efiop Any status update on this?

@efiop
Contributor

efiop commented Dec 13, 2023

We've discussed this, but for the record: the only thing left here is cross-hash compatibility, which I'm in no rush to implement, as I'm still waiting for user feedback on whether this was enough to fix it for them (can't find a link yet, but will post it if I do).

dberenbaum removed the p1-important (Important, aka current backlog of things to do) label Mar 5, 2024