repro: Rebuilds same tree unnecessarily #9085

Open
daavoo opened this issue Feb 27, 2023 · 4 comments
Labels
- A: pipelines (related to the pipelines feature)
- p1-important (important, aka current backlog of things to do)
- performance (improvement over resource / time consuming tasks)

Comments


daavoo commented Feb 27, 2023

Given a directory tracked with dvc add/dvc import and a dvc.yaml with stages that have that directory as a dependency:

$ cat data.dvc
outs:
- md5: 6f68a8a747e41c152e7cc5fc62437727.dir
  size: 2890
  nfiles: 1000
  path: data
$ cat dvc.yaml
stages:
  foo:
    cmd: echo foo
    deps:
    - data
  bar:
    cmd: echo bar
    deps:
    - data

During a dvc repro execution, the same tree for the .dir is being built (_build_tree) multiple times during:

  • changed_outs for data.dvc
    Unless I am missing something, this is the only place where we should really call _build_tree and cache the result.
  • (for each stage) changed_deps
  • (for each stage) save_deps as part of _run_stage.
  • (for each stage) save_deps as part of save
    I don't really know why we need to call save_deps twice inside stage.run.

So, in total there are 3 unnecessary (IMO) calls to _build_tree for each stage.
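If the tree for a given .dir hash were cached after the first build, the later changed_deps / save_deps calls could reuse it instead of rebuilding. A minimal sketch of that idea (build_tree_cached and the returned structure are hypothetical, not DVC's actual API):

```python
# Hypothetical sketch: memoize the expensive directory-tree build so repeated
# lookups for the same .dir hash reuse the first result. Names are illustrative.
from functools import lru_cache

@lru_cache(maxsize=None)
def build_tree_cached(dir_hash: str) -> tuple:
    # Stand-in for the real _build_tree: pretend to walk the directory and
    # return an immutable tree of (relpath, md5) entries.
    return tuple(sorted({"data/a": "h1", "data/b": "h2"}.items()))

# changed_outs, changed_deps, and save_deps would all hit the cache:
t1 = build_tree_cached("6f68a8a747e41c152e7cc5fc62437727.dir")
t2 = build_tree_cached("6f68a8a747e41c152e7cc5fc62437727.dir")
assert t1 is t2  # second call is a cache hit, no rebuild
```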

For 100k dummy files, each of these _build_tree calls takes around 10s.

It feels like a significant overhead, especially considering that it grows with the number of files and the number of stages having them as deps.
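A back-of-the-envelope calculation, using the ~10 s per _build_tree figure reported above for 100k files (these are this issue's estimates, not a benchmark):

```python
# Redundant-call overhead grows linearly with the number of stages:
# 3 unnecessary _build_tree calls per stage, ~10 s each at 100k files.
def wasted_seconds(n_stages: int, secs_per_build: float = 10.0,
                   extra_calls_per_stage: int = 3) -> float:
    return n_stages * extra_calls_per_stage * secs_per_build

assert wasted_seconds(2) == 60.0    # the foo/bar example above: ~1 minute wasted
assert wasted_seconds(10) == 300.0  # 10 stages sharing the dep: ~5 minutes
```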

I don't know whether this is something to be addressed in https://github.com/iterative/dvc-data or in DVC itself as part of pipeline management.

@daavoo daavoo added the performance improvement over resource / time consuming tasks label Feb 27, 2023
@daavoo daavoo added the p1-important Important, aka current backlog of things to do label Jun 29, 2023

daavoo commented Jun 29, 2023

dberenbaum (Contributor) commented:

Do we know if it actually re-hashes each file each time, or does it only look that way because it iterates over the files in the dir but skips re-hashing anything unchanged? I know it's slow either way, but I want to identify the true source of the problem. cc @iterative/dvc

skshetry (Member) commented:

@dberenbaum, it does not hash; it goes through the directory and tries to look up each item's hash in the state db. And it does that one item at a time, which is why it is slow.
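The per-item state-db lookups described above can be contrasted with a single batched query. This is illustrative only (an in-memory SQLite table with a made-up schema, not DVC's actual state db):

```python
# Why N round trips to the state db are slower than one batched query.
# Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE state (path TEXT PRIMARY KEY, md5 TEXT)")
con.executemany("INSERT INTO state VALUES (?, ?)",
                [(f"data/file_{i}", f"hash{i}") for i in range(500)])

paths = [f"data/file_{i}" for i in range(500)]

# One-by-one: 500 separate queries, one per file.
slow = {p: con.execute("SELECT md5 FROM state WHERE path = ?",
                       (p,)).fetchone()[0]
        for p in paths}

# Batched: a single query with an IN clause.
placeholders = ",".join("?" * len(paths))
fast = dict(con.execute(
    f"SELECT path, md5 FROM state WHERE path IN ({placeholders})", paths))

assert slow == fast  # same answers, far fewer db round trips
```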

dbalabka commented:

@dberenbaum, I ran into the same slowness. I'm using git hooks, and it slows down every git commit even when there are no changes in the dvc lock files.
