exp run: unnecessary hashing during experiments #10308

Open
gregstarr opened this issue Feb 17, 2024 · 1 comment

@gregstarr

Bug Report

Description

Not sure if this is a bug per se; it's probably more of a discussion. I noticed that running experiments in parallel was very slow because they took a long time to start. The cause is that DVC recomputes all the hashes for my large dataset.

DVC typically avoids recomputing hashes by using a cache stored in site_cache_dir. On Linux, the site cache dir should be something like /var/tmp/dvc/repo/{hash}. This hash is formed from several components, including the root_dir (i.e. the DVC repo dir) and the btime. The btime is roughly meant to be the creation time of the root directory, but it is instead taken from the mtime of the btime file in the .dvc/tmp folder.
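To make the dependency concrete, here is a minimal sketch of how such a token could be derived. It is not DVC's actual code: the function name `site_cache_token`, the choice of MD5, and the two-element tuple are assumptions for illustration, and DVC mixes in more components than the two named above.

```python
import hashlib


def site_cache_token(root_dir: str, btime: float) -> str:
    """Hypothetical stand-in for the per-repo token; not DVC's actual code."""
    # Assumption: the token is a hash over a tuple of components. Only
    # root_dir and btime are modeled here; the real token uses more inputs.
    payload = str((root_dir, btime)).encode()
    return hashlib.md5(payload).hexdigest()


# The base repo, with the mtime of its .dvc/tmp/btime file.
base = site_cache_token("/home/user/project", 1708000000.0)
# The same project checked out under a temp dir, with a newer btime file.
exp = site_cache_token("/tmp/exp-copy/project", 1708100000.0)

print(base)
print(exp)
print("shared cache:", base == exp)
```

Changing either input changes the token, so any checkout at a new path ends up with its own folder under repo/.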

When you run experiments in parallel, copies of the repo are made in the temp directory and the experiments are run from the copies. The site cache dir for each copy is therefore different, because the repo paths are different and the mtimes of the copied btime files are different. As a result, DVC thinks there is no cache yet and recomputes all the necessary hashes for each experiment. The evidence: I have only one DVC repo, but my site cache dir contains many cache folders.
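The effect can be simulated without DVC. The sketch below builds one fake base repo and three fake experiment copies in a throwaway directory, then derives a token for each. The helpers `token_for` and `make_repo` are hypothetical, and the token formula (MD5 over the repo path and the btime file's mtime) is a simplifying assumption rather than DVC's real implementation.

```python
import hashlib
import os
import tempfile
import time


def token_for(repo_root: str) -> str:
    """Hypothetical token: hash of the repo path and the btime file's mtime."""
    btime_file = os.path.join(repo_root, ".dvc", "tmp", "btime")
    mtime = os.stat(btime_file).st_mtime
    return hashlib.md5(str((repo_root, mtime)).encode()).hexdigest()


def make_repo(parent: str, name: str, mtime: float) -> str:
    """Create a fake repo layout containing only .dvc/tmp/btime."""
    root = os.path.join(parent, name)
    os.makedirs(os.path.join(root, ".dvc", "tmp"))
    btime_file = os.path.join(root, ".dvc", "tmp", "btime")
    open(btime_file, "w").close()
    # Set the file's atime and mtime explicitly so the demo is deterministic.
    os.utime(btime_file, (mtime, mtime))
    return root


with tempfile.TemporaryDirectory() as parent:
    now = time.time()
    base = make_repo(parent, "base", now - 3600)
    # Each experiment copy lives at a new path and gets a fresh btime file.
    copies = [make_repo(parent, f"exp{i}", now + i) for i in range(3)]
    tokens = {token_for(root) for root in [base, *copies]}

print(f"distinct cache dirs: {len(tokens)} of {1 + len(copies)}")
```

Under these assumptions every copy maps to its own cache folder, which matches the growing folder count in the listing under Reproduce.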

Unless I'm missing something, it seems like experiments should use the same site cache as the base repo.

Reproduce

  1. Look in your site cache dir and take note of the hashes.
  2. Run a bunch of experiments in parallel.
  3. See that the site cache dir now has more cache folders:
$ ls -al /scratch/tmp/starrgw1/dvc/site_cache_dir/repo/
total 72
drwxrwxrwx 18 starrgw1 starrgw1 4096 Feb 17 06:09 .
drwxrwxr-x  3 starrgw1 starrgw1 4096 Feb 15 17:11 ..
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 18:25 048e839878f97ba9324bb139fa8e4b06
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 20:04 0c53c5b78086c5438b3ee6b4aaef570d
drwxrwxr-x  4 starrgw1 starrgw1 4096 Feb 17 06:09 1f18cf09ad43f0845bea96b6b719b3ee
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 20:04 378f0eae8f9824f1f96149c481621d03
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 20:04 441bd548b8b298abffb2449dc7c1cf54
drwxrwxr-x  4 starrgw1 starrgw1 4096 Feb 16 18:45 465bff9fb0df8bd1be46b6ec24fdb069
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 20:04 511df2ed3e7fdf1d12303c5929277158
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 16:46 73887b1a621845b9038bb7d3ec4ba704
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 20:17 7f4e8b33c6bc7ef879b1491b9ed50fec
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 16:46 82c6be6b9d97c42ec7ba7569d39a9a65
drwxrwxr-x  4 starrgw1 starrgw1 4096 Feb 16 18:45 aee9b76e8f486264f0800522304b53b0
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 15 18:53 d412c540ff7f186df3641073fe15a061
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 16:46 e4efc309f726450d1b3bdb37748a60d5
drwxrwxr-x  4 starrgw1 starrgw1 4096 Feb 16 18:45 e93d6446a825241907ed374d37e1f58d
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 16:46 f0fb5078327924424b4c3ae74fe98b46
drwxrwxr-x  5 starrgw1 starrgw1 4096 Feb 16 18:25 f2d3168db34ebf88584f37903b9b3dcc
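The before/after comparison in the steps above can also be scripted. This is a small sketch using only the standard library; `cache_folders` is a hypothetical helper, and the demo runs against a throwaway directory with made-up folder names rather than a real site cache dir.

```python
import os
import tempfile


def cache_folders(repo_cache_dir: str) -> set[str]:
    """Return the names of the per-repo folders under <site_cache_dir>/repo."""
    return {entry.name for entry in os.scandir(repo_cache_dir) if entry.is_dir()}


# Demo against a throwaway directory; point cache_folders() at your own
# <site_cache_dir>/repo path to check a real setup.
with tempfile.TemporaryDirectory() as repo_cache_dir:
    os.mkdir(os.path.join(repo_cache_dir, "aaaa"))
    before = cache_folders(repo_cache_dir)

    # Stand-in for "run a bunch of experiments in parallel".
    for name in ("bbbb", "cccc"):
        os.mkdir(os.path.join(repo_cache_dir, name))

    new_folders = cache_folders(repo_cache_dir) - before

print(sorted(new_folders))
```

Any folder that appears in `new_folders` after an experiment run is a cache that was created from scratch instead of being shared with the base repo.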

Environment information

dvc doctor
DVC version: 3.38.1 (pip)
-------------------------
Platform: Python 3.10.13 on Linux-3.10.0-693.el7.x86_64-x86_64-with-glibc2.17
Subprojects:
        dvc_data = 3.7.0
        dvc_objects = 3.0.3
        dvc_render = 1.0.0
        dvc_task = 0.3.0
        scmrepo = 2.0.2
Supports:
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3)
Config:
        Global: /home/starrgw1/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: lustre on 192.168.199.212@o2ib:192.168.199.213@o2ib:/scratch
Caches: local
Remotes: local
Workspace directory: nfs on master:/home
Repo: dvc, git
Repo.site_cache_dir: /scratch/tmp/starrgw1/dvc/site_cache_dir/repo/d412c540ff7f186df3641073fe15a061
@gregstarr

possibly related: #9813

@efiop efiop self-assigned this Feb 21, 2024
@efiop efiop added the research label Feb 21, 2024
Status: Backlog