repro: Rebuilds same tree unnecessarily #9085

Open
daavoo opened this issue Feb 27, 2023 · 4 comments
Labels
- A: pipelines (related to the pipelines feature)
- p1-important (important, aka current backlog of things to do)
- performance (improvement over resource / time consuming tasks)

Comments


daavoo commented Feb 27, 2023

Given a directory tracked with dvc add/dvc import and a dvc.yaml with stages that have that directory as a dependency:

$ cat data.dvc
outs:
- md5: 6f68a8a747e41c152e7cc5fc62437727.dir
  size: 2890
  nfiles: 1000
  path: data
$ cat dvc.yaml
stages:
  foo:
    cmd: echo foo
    deps:
    - data
  bar:
    cmd: echo bar
    deps:
    - data

During a dvc repro execution, the same tree for the .dir is being built (_build_tree) multiple times during:

  • changed_outs for data.dvc
    Unless I am missing something, this is the only place where we should really call _build_tree and cache the result.
  • (for each stage) changed_deps
  • (for each stage) save_deps as part of _run_stage.
  • (for each stage) save_deps as part of save
    I don't really know why we need to call save_deps twice inside stage.run.

So, in total there are 3 unnecessary (IMO) calls to _build_tree for each stage.
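If the tree for a given .dir hash were cached after the first build, the later changed_deps / save_deps calls could reuse it instead of rebuilding. A minimal sketch of that idea (build_tree_cached and the returned structure are hypothetical, not DVC's actual API):

```python
# Hypothetical sketch: memoize the expensive directory-tree build so repeated
# lookups for the same .dir hash reuse the first result. Names are illustrative.
from functools import lru_cache

@lru_cache(maxsize=None)
def build_tree_cached(dir_hash: str) -> tuple:
    # Stand-in for the real _build_tree: pretend to walk the directory and
    # return an immutable tree of (relpath, md5) entries.
    return tuple(sorted({"data/a": "h1", "data/b": "h2"}.items()))

# changed_outs, changed_deps, and save_deps would all hit the cache:
t1 = build_tree_cached("6f68a8a747e41c152e7cc5fc62437727.dir")
t2 = build_tree_cached("6f68a8a747e41c152e7cc5fc62437727.dir")
assert t1 is t2  # second call is a cache hit, no rebuild
```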

For 100k dummy files, each of these _build_tree calls takes around 10s.

It feels like a significant overhead, especially considering that it grows with the number of files and the number of stages having them as deps.
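A back-of-the-envelope calculation, using the ~10 s per _build_tree figure reported above for 100k files (these are this issue's estimates, not a benchmark):

```python
# Redundant-call overhead grows linearly with the number of stages:
# 3 unnecessary _build_tree calls per stage, ~10 s each at 100k files.
def wasted_seconds(n_stages: int, secs_per_build: float = 10.0,
                   extra_calls_per_stage: int = 3) -> float:
    return n_stages * extra_calls_per_stage * secs_per_build

assert wasted_seconds(2) == 60.0    # the foo/bar example above: ~1 minute wasted
assert wasted_seconds(10) == 300.0  # 10 stages sharing the dep: ~5 minutes
```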

I don't know whether this is something to be addressed in https://github.com/iterative/dvc-data or in DVC itself as part of pipeline management.

@daavoo daavoo added the performance improvement over resource / time consuming tasks label Feb 27, 2023
@daavoo daavoo added the p1-important Important, aka current backlog of things to do label Jun 29, 2023

daavoo commented Jun 29, 2023

dberenbaum (Contributor) commented:

Do we know if it actually re-hashes each file each time, or does it only look that way because it iterates over the files in the dir but skips re-hashing anything unchanged? I know it's slow either way, but I want to identify the true source of the problem. cc @iterative/dvc

skshetry (Member) commented:

@dberenbaum, it does not hash; it goes through the directory and tries to look up each item's hash in the state db. And it does that one item at a time, which is why it is slow.
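The per-item state-db lookups described above can be contrasted with a single batched query. This is illustrative only (an in-memory SQLite table with a made-up schema, not DVC's actual state db):

```python
# Why N round trips to the state db are slower than one batched query.
# Table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE state (path TEXT PRIMARY KEY, md5 TEXT)")
con.executemany("INSERT INTO state VALUES (?, ?)",
                [(f"data/file_{i}", f"hash{i}") for i in range(500)])

paths = [f"data/file_{i}" for i in range(500)]

# One-by-one: 500 separate queries, one per file.
slow = {p: con.execute("SELECT md5 FROM state WHERE path = ?",
                       (p,)).fetchone()[0]
        for p in paths}

# Batched: a single query with an IN clause.
placeholders = ",".join("?" * len(paths))
fast = dict(con.execute(
    f"SELECT path, md5 FROM state WHERE path IN ({placeholders})", paths))

assert slow == fast  # same answers, far fewer db round trips
```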

dbalabka commented:

@dberenbaum, I ran into the same slowness. I'm using git hooks, and it slows down every git commit even when there are no changes in the dvc lock files.
