repro: Rebuilds same tree unnecessarily #9085
Labels
A: pipelines
Related to the pipelines feature
p1-important
Important, aka current backlog of things to do
performance
improvement over resource / time consuming tasks
Given a directory tracked with `dvc add`/`dvc import` and a `dvc.yaml` with stages that have that directory as a dependency:

During a `dvc repro` execution, the same tree for the `.dir` is being built (`_build_tree`) multiple times, during:

- `changed_outs` for `data.dvc`.
  Unless I am missing something, this is the only place where we should really call `_build_tree` and cache the result.
- `changed_deps`
- `save_deps` as part of `_run_stage`
- `save_deps` as part of `save`.
  I don't really know why we need to call `save_deps` twice inside `stage.run`.

So, in total there are 3 unnecessary (IMO) calls to `_build_tree` for each stage.

For 100k dummy files, each of these `_build_tree` calls takes around 10s.

It feels like a significant overhead, especially considering that it grows with the number of files and the number of stages having them as deps.
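
A rough estimate of the total overhead, using only the figures reported above (3 redundant calls per stage, around 10s per call for 100k files). This is a back-of-the-envelope sketch, not a measurement: it assumes every stage depends on the same directory and that each redundant call costs the same. The function name `redundant_build_seconds` is made up for illustration.

```python
def redundant_build_seconds(
    num_stages: int,
    calls_per_stage: int = 3,        # redundant `_build_tree` calls per stage (from this report)
    seconds_per_call: float = 10.0,  # observed for ~100k dummy files (from this report)
) -> float:
    """Estimate the time spent on redundant tree builds in one `dvc repro`."""
    return num_stages * calls_per_stage * seconds_per_call


# Five stages that all depend on the same 100k-file directory.
print(redundant_build_seconds(num_stages=5))  # 150.0
```

So with 5 such stages, roughly two and a half minutes of a `dvc repro` run would go to rebuilding a tree that has not changed.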
I don't know whether this should be addressed in https://github.com/iterative/dvc-data or in DVC as part of pipeline management.
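
Wherever it ends up living, the idea of "build once and cache the result" could look roughly like the sketch below. Everything here is hypothetical: `build_tree` is a plain-Python stand-in that hashes files with SHA-256, not the real `_build_tree`, and `TreeCache` is not an existing DVC or dvc-data class. It only illustrates the caching pattern, assuming the cache lives for a single `repro` run.

```python
import hashlib
import os
import tempfile
from typing import Callable, Dict

Tree = Dict[str, str]  # relative file path -> content hash


def build_tree(path: str) -> Tree:
    """Hypothetical stand-in for `_build_tree`: hash every file under `path`."""
    tree: Tree = {}
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            with open(full, "rb") as fobj:
                digest = hashlib.sha256(fobj.read()).hexdigest()
            tree[os.path.relpath(full, path)] = digest
    return tree


class TreeCache:
    """Builds the tree for each directory at most once per cache lifetime."""

    def __init__(self, builder: Callable[[str], Tree] = build_tree) -> None:
        self._builder = builder
        self._trees: Dict[str, Tree] = {}
        self.builds = 0  # number of times the expensive builder actually ran

    def get(self, path: str) -> Tree:
        key = os.path.abspath(path)
        if key not in self._trees:
            self._trees[key] = self._builder(key)
            self.builds += 1
        return self._trees[key]

    def invalidate(self, path: str) -> None:
        """Drop a cached tree, e.g. after a stage rewrites the directory."""
        self._trees.pop(os.path.abspath(path), None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        for i in range(100):
            with open(os.path.join(data_dir, f"{i}.txt"), "w") as fobj:
                fobj.write(str(i))

        cache = TreeCache()
        # Four lookups, mirroring the four call sites listed in this issue.
        for _ in ("changed_outs", "changed_deps", "save_deps", "save_deps"):
            tree = cache.get(data_dir)
        print(len(tree), cache.builds)  # 100 1
```

A cache like this is only safe for directories that do not change during the run. Anything a stage writes to would need `invalidate` (or a key that includes some cheap change indicator) before the next lookup.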