@skshetry
Collaborator

@skshetry skshetry commented Jan 9, 2023

I haven't made many changes here, but Index collection now happens eagerly when Index.from_repo() or Index.from_file() is called (although the index itself is still created lazily).

  • The collection of stages happens inside Index rather than as part of Repo.stage.collect_repo.
  • repo._skip_graph_checks was broken. It has been fixed by introducing a Repo.ensure_graph_correctness_with(stages=stages) API that honors that attribute.
  • Stage collection is eager during Index.from_repo().
  • Minor cleanups in Repo/StageLoad/Index.

Motivation

The Index had a stages cached property, which was lazily loaded. We later extended this by adding metrics/params/plots properties, each of which was also a cached_property. With separate properties it was not possible to collect and fill all of them at the same time, but for performance reasons we wanted to load everything at once.

So I changed the implementation of all these properties to invoke Index._collect(), which would collect and fill all of the properties at the same time (even if you ask for just one).

https://github.com/iterative/dvc/blob/1d5de9c5ba5909f6eaa0911c3b2a691bbf4ea254/dvc/repo/index.py#L117-L127
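
The pattern described above can be illustrated roughly as follows. This is only a minimal sketch, not the code behind the link: `LazyIndex`, `loader`, and `collect_calls` are hypothetical names, and the sketch relies on `functools.cached_property` storing its value in the instance `__dict__`.

```python
from functools import cached_property


class LazyIndex:
    """Toy stand-in for the pre-PR design (hypothetical, not DVC's code)."""

    def __init__(self, loader):
        # `loader` is an assumed callable that returns
        # (stages, metrics, params, plots) in a single pass.
        self._loader = loader
        self.collect_calls = 0

    def _collect(self):
        # One pass fills every property at once by writing into the
        # instance __dict__, where cached_property looks first.
        self.collect_calls += 1
        stages, metrics, params, plots = self._loader()
        self.__dict__.update(
            stages=stages, metrics=metrics, params=params, plots=plots
        )

    @cached_property
    def stages(self):
        self._collect()
        return self.__dict__["stages"]

    @cached_property
    def metrics(self):
        self._collect()
        return self.__dict__["metrics"]

    @cached_property
    def params(self):
        self._collect()
        return self.__dict__["params"]

    @cached_property
    def plots(self):
        self._collect()
        return self.__dict__["plots"]
```

Asking for any one property triggers a single collection that also fills the other three, which is the behavior the paragraph above calls hackish: the interface suggests four independent values while the implementation treats them as one.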

But that felt hackish: the performance issue was solved, but the way we collected felt odd.
It looks as though we are collecting several separate things rather than a single unit. At least, the cached properties make me think of it that way.

So the single-unit idea has to hold not only in the implementation; the interface has to match it too. That is why we needed to unify this collection logic.

With this PR, the Index is loaded eagerly when Index.from_repo() is invoked, and it returns a single Index that already contains all stages/metrics/params/plots.

The other reason was the realization that the collection logic belonged in Index rather than in StageLoad, but that's more obvious.
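
For contrast, the eager design might look roughly like the sketch below. It is an illustration under assumptions, not the merged code: `EagerIndex` and the `collect` callable are hypothetical, and the real `Index.from_repo()` takes a repo rather than a callable.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Assumed shape of a collector: returns stages, metrics, params, plots.
Collector = Callable[[], Tuple[Sequence[str], ...]]


@dataclass(frozen=True)
class EagerIndex:
    """Toy index that is fully populated at construction time."""

    stages: Tuple[str, ...]
    metrics: Tuple[str, ...]
    params: Tuple[str, ...]
    plots: Tuple[str, ...]

    @classmethod
    def from_repo(cls, collect: Collector) -> "EagerIndex":
        # Collection happens here, exactly once, before the object exists.
        stages, metrics, params, plots = collect()
        return cls(tuple(stages), tuple(metrics), tuple(params), tuple(plots))
```

Because the object is a plain value with all four fields set, nothing on it needs to trigger collection later, and the interface matches the "single unit" idea.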

@skshetry skshetry requested review from dtrifiro and efiop January 9, 2023 10:52
Comment on lines +247 to +248
if callable(callback):
    callback()
Collaborator Author

When self.index is accessed, it might go on to collect stages. So the callback is there to signal that collection has completed and that we are moving on to checking the graph.

This is used in dvc add to show the "Checking graph" status output.
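
A rough sketch of how such a callback could fit together with the skip flag is shown below. Everything in it is hypothetical: `ToyRepo` and `check_graph` are illustrative names, and the "graph check" is reduced to detecting duplicate stage names, which is far simpler than the checks DVC actually performs.

```python
class ToyRepo:
    """Toy repo used only to illustrate the callback ordering."""

    def __init__(self, stages, skip_graph_checks=False):
        self._stages = list(stages)
        self._skip_graph_checks = skip_graph_checks

    @property
    def index(self):
        # Stands in for a potentially slow stage collection.
        return list(self._stages)

    def check_graph(self, new_stages, callback=None):
        # Honor the skip flag before doing any work.
        if self._skip_graph_checks:
            return

        existing = self.index  # may trigger collection

        # Collection is done; tell the caller we are about to check.
        if callable(callback):
            callback()

        names = existing + list(new_stages)
        if len(set(names)) != len(names):
            raise ValueError("duplicate stage names in graph")
```

A caller such as a command-line front end could pass a callback that updates a status line, so the message changes only once collection has finished.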

@codecov

codecov bot commented Jan 9, 2023

Codecov Report

Base: 93.63% // Head: 93.53% // Decreases project coverage by -0.09% ⚠️

Coverage data is based on head (2e790eb) compared to base (a814f04).
Patch coverage: 90.81% of modified lines in pull request are covered.

❗ Current head 2e790eb differs from pull request most recent head 1504739. Consider uploading reports for the commit 1504739 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8782      +/-   ##
==========================================
- Coverage   93.63%   93.53%   -0.10%     
==========================================
  Files         456      457       +1     
  Lines       36174    36171       -3     
  Branches     5241     5245       +4     
==========================================
- Hits        33871    33834      -37     
- Misses       1805     1832      +27     
- Partials      498      505       +7     
Impacted Files               Coverage Δ
tests/func/test_stage.py     100.00% <ø> (ø)
tests/func/test_stage_load.py 100.00% <ø> (ø)
dvc/repo/index.py            90.82% <86.50%> (-2.21%) ⬇️
dvc/dvcfile.py               96.79% <100.00%> (+0.01%) ⬆️
dvc/repo/__init__.py         92.14% <100.00%> (+0.14%) ⬆️
dvc/repo/add.py              100.00% <100.00%> (ø)
dvc/repo/imp_url.py          82.97% <100.00%> (ø)
dvc/repo/metrics/show.py     95.34% <100.00%> (-0.06%) ⬇️
dvc/repo/params/show.py      93.93% <100.00%> (-0.07%) ⬇️
dvc/repo/plots/__init__.py   87.08% <100.00%> (-0.46%) ⬇️
... and 88 more


return self.url or self.root_dir

@cached_property
def index(self):
Collaborator Author

@skshetry skshetry Jan 9, 2023

Another suggestion: as we discussed, repo.index is a full index, so an API to get a partial index probably should not live inside Index but at the Repo level, something like repo.index_view().

Since Index is now more like a dataclass, we could, in theory, create an Index with partially filled data.

def index_view(self, directory):
    return Index(self, stages_from_dir, metrics_from_dir, plots_from_dir, params_from_dir)

But the collection happens eagerly.
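
To make the idea concrete, here is one possible shape for such a view, written as a self-contained sketch. All names are hypothetical (`ToyIndex`, `_under`, and this standalone `index_view`), and it makes a simplifying assumption: the view is derived by filtering an already collected full index, whereas the snippet above suggests collecting from the directory directly.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


@dataclass(frozen=True)
class ToyIndex:
    """Toy index holding repo-relative POSIX paths (illustrative only)."""

    stages: Tuple[str, ...]
    metrics: Tuple[str, ...]
    params: Tuple[str, ...]
    plots: Tuple[str, ...]


def _under(paths, directory):
    # Keep only the paths located somewhere below `directory`.
    root = PurePosixPath(directory)
    return tuple(p for p in paths if root in PurePosixPath(p).parents)


def index_view(full, directory):
    # Build a partially filled index from the full one.
    return ToyIndex(
        stages=_under(full.stages, directory),
        metrics=_under(full.metrics, directory),
        params=_under(full.params, directory),
        plots=_under(full.plots, directory),
    )
```

Comparing path components rather than string prefixes keeps a directory such as `data` from accidentally matching `database/...`.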

Contributor

Makes sense 👍

@skshetry skshetry force-pushed the index-eager-collect branch from 2e790eb to 1504739 Compare January 11, 2023 12:45
@skshetry
Collaborator Author

@efiop, any thoughts on this?

@efiop
Contributor

efiop commented Jan 11, 2023

@skshetry Sorry for the delay, trying to get to this. I'll take a look asap.

@efiop efiop merged commit 30ec1cf into treeverse:main Jan 11, 2023
@skshetry skshetry deleted the index-eager-collect branch January 12, 2023 02:38