index: experimental persistent data index #8827

Merged
merged 1 commit into from
Feb 13, 2023

@efiop efiop commented Jan 16, 2023

Putting iterative/dvc-data#208 to use.

Note that this is experimental for now and needs to be enabled with `dvc config feature.data_index_cache true`.

For example, a sequential run of

dvc list . data/mnist/dataset/ --recursive --rev HEAD

goes down from ~16sec to ~8sec (2x improvement)

and

dvc list . data/mnist/dataset/ --rev HEAD

goes down from ~11sec to ~1.5sec (7x improvement)

codecov bot commented Jan 16, 2023

Codecov Report

Base: 93.13% // Head: 93.13% // Increases project coverage by +0.00% 🎉

Coverage data is based on head (8e1b75f) compared to base (e1acab5).
Patch has no changes to coverable lines.

❗ Current head 8e1b75f differs from pull request most recent head b1daf13. Consider uploading reports for the commit b1daf13 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8827   +/-   ##
=======================================
  Coverage   93.13%   93.13%           
=======================================
  Files         455      453    -2     
  Lines       36644    36604   -40     
  Branches     5289     5287    -2     
=======================================
- Hits        34127    34092   -35     
+ Misses       2001     1998    -3     
+ Partials      516      514    -2     
| Impacted Files | Coverage Δ | |
|---|---|---|
| dvc/testing/remote_tests.py | 43.82% <0.00%> (-8.57%) | ⬇️ |
| dvc/repo/experiments/executor/base.py | 84.16% <0.00%> (-0.23%) | ⬇️ |
| dvc/repo/experiments/utils.py | 81.87% <0.00%> (-0.21%) | ⬇️ |
| tests/func/experiments/test_utils.py | 100.00% <0.00%> (ø) | |
| tests/integration/conftest.py | | |
| tests/integration/test_studio_live_experiments.py | | |
| dvc/repo/experiments/queue/workspace.py | 82.17% <0.00%> (+0.42%) | ⬆️ |
| dvc/repo/experiments/queue/celery.py | 87.73% <0.00%> (+1.85%) | ⬆️ |


efiop commented Feb 6, 2023

For the record: switched to a tree-based id (instead of git revs), which makes it possible to use the index without git and with dirty changes. But that exposed a few more bugs on the import side, where we assume index entries have an fs instance, but the views that we generate don't have one. Need to get rid of the fs instance. Might go a route similar to odb_map.
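
To illustrate why a tree-based id can work without git and with dirty changes, here is a minimal sketch. It is not DVC's actual implementation: `tree_id` and its `path -> content hash` input are hypothetical names chosen for illustration. The only point is that an id derived from the tree's contents changes whenever the data changes and does not need a commit to exist.

```python
from __future__ import annotations

import hashlib


def tree_id(entries: dict[str, str]) -> str:
    """Return a deterministic id for a mapping of path -> content hash.

    Hypothetical helper: the id depends only on the tree contents,
    so no git revision is needed to compute it.
    """
    digest = hashlib.sha256()
    # Sort so that insertion order does not affect the id.
    for path, content_hash in sorted(entries.items()):
        digest.update(path.encode("utf-8"))
        digest.update(b"\0")
        digest.update(content_hash.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


committed = {"data/a.csv": "aaa111", "data/b.csv": "bbb222"}
dirty = {**committed, "data/b.csv": "ccc333"}  # simulated uncommitted edit

# Same contents give the same id, regardless of ordering.
assert tree_id(committed) == tree_id(dict(reversed(list(committed.items()))))
# A dirty workspace gets its own id, so it can have its own cache entry.
assert tree_id(committed) != tree_id(dirty)
```

A git rev, by contrast, only identifies committed states, which is why a rev-keyed cache cannot describe a workspace with local modifications.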

efiop added a commit to efiop/dvc-data that referenced this pull request Feb 7, 2023
This replaces `odb_map` and `remote_map` in `DataIndex`,
and `fs` and `path` in `DataIndexEntry` incorporating everything
into `Storage`, which describes where to get the data contents from
no matter how they are stored (just as plain backup in a directory
or in an object storage).

`DataIndexEntry.fs/path` were very confusing, as it was not clear what
they really represented, and they were often used unnecessarily during different
operations (for example, `checkout` mutates them instead of returning
a new local index). This also removes the unserializable `fs` instance from
`DataIndexEntry`, making it much easier to work with after loading.

The new `Storage.fs/path` concepts fit nicely into dvc's import functionality,
by giving a clear way to declare where to get the data from.

Related iterative/dvc#8827
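
As a rough illustration of the idea in this commit message, the sketch below keeps index entries free of any `fs` instance and looks up the data location through a separate storage description. All names here (`SketchStorage`, `SketchEntry`, `SketchIndex`, `fs_name`, `storage_for`) are hypothetical and are not the dvc-data API; the point is only that entries made of plain values serialize cleanly.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class SketchStorage:
    """Hypothetical description of where data contents live."""

    fs_name: str  # a filesystem name such as "local" or "s3", not an fs instance
    path: str  # root path within that filesystem


@dataclass(frozen=True)
class SketchEntry:
    """Hypothetical index entry holding only plain, serializable values."""

    key: tuple[str, ...]
    hash_value: str


@dataclass
class SketchIndex:
    """Hypothetical index: entries plus a per-prefix storage mapping."""

    entries: dict[tuple[str, ...], SketchEntry] = field(default_factory=dict)
    storages: dict[tuple[str, ...], SketchStorage] = field(default_factory=dict)

    def storage_for(self, key: tuple[str, ...]) -> SketchStorage:
        # Use the longest registered prefix of the key.
        for end in range(len(key), -1, -1):
            storage = self.storages.get(key[:end])
            if storage is not None:
                return storage
        raise KeyError(key)


index = SketchIndex()
index.storages[("data",)] = SketchStorage("s3", "bucket/cache")
entry = SketchEntry(("data", "train.csv"), "d41d8cd9")
index.entries[entry.key] = entry

# The entry round-trips through JSON because it holds no fs object.
loaded = json.loads(json.dumps(asdict(entry)))
restored = SketchEntry(tuple(loaded["key"]), loaded["hash_value"])
assert restored == entry
assert index.storage_for(entry.key).fs_name == "s3"
```

Keeping the location on the index rather than on each entry also means an operation like checkout can return a new index instead of mutating entries in place.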
efiop added a commit to iterative/dvc-data that referenced this pull request Feb 7, 2023
@efiop efiop force-pushed the fix-dvc-data-208 branch 6 times, most recently from 17fb0f3 to a926fae on February 8, 2023 02:43

efiop commented Feb 8, 2023

One last significant problem to figure out is no-.dir outputs in cloud versioning, where we don't save .dir anywhere because we have out.files saved in a dvcfile. This is easy to handle when loading the data index, but propagating changes when saving is more complicated. Looking into it...

@efiop efiop force-pushed the fix-dvc-data-208 branch 3 times, most recently from 3735388 to b765b37 on February 9, 2023 04:40
@efiop efiop force-pushed the fix-dvc-data-208 branch 3 times, most recently from e2a484d to b998e11 on February 12, 2023 23:35
@efiop efiop changed the title from "[WIP] index: use persistent data index" to "[WIP] index: experimental persistent data index" Feb 13, 2023
@efiop efiop force-pushed the fix-dvc-data-208 branch 2 times, most recently from 7241147 to 6560412 on February 13, 2023 00:29
@efiop efiop added the enhancement (Enhances DVC) and performance (improvement over resource / time consuming tasks) labels Feb 13, 2023

efiop commented Feb 13, 2023

All tests pass with it enabled by default, but I've made it opt-in for now so that it can be tried out in #8930 and #8962, just to be extra careful and to have a chance to tweak the defaults some more (e.g. storage location).

@efiop efiop changed the title from "[WIP] index: experimental persistent data index" to "index: experimental persistent data index" Feb 13, 2023
@efiop efiop marked this pull request as ready for review February 13, 2023 00:57
@efiop efiop merged commit 86a6fb1 into main Feb 13, 2023
@efiop efiop deleted the fix-dvc-data-208 branch February 13, 2023 00:57
@efiop efiop self-assigned this Feb 13, 2023