index: experimental persistent data index #8827
Codecov Report
Base: 93.13% // Head: 93.13% // Increases project coverage by

Additional details and impacted files

| Coverage Diff | main | #8827 | +/- |
|---------------|--------|--------|-----|
| Coverage | 93.13% | 93.13% | |
| Files | 455 | 453 | -2 |
| Lines | 36644 | 36604 | -40 |
| Branches | 5289 | 5287 | -2 |
| Hits | 34127 | 34092 | -35 |
| Misses | 2001 | 1998 | -3 |
| Partials | 516 | 514 | -2 |
For the record: switched to a tree-based id (instead of git revs), which makes it possible to use the index without git and with dirty changes. But that triggered a few more bugs on the import side, where we assume index entries have |
This replaces `odb_map` and `remote_map` in `DataIndex`, and `fs` and `path` in `DataIndexEntry`, incorporating everything into `Storage`, which describes where to get the data contents from no matter how they are stored (as a plain backup in a directory or in object storage). `DataIndexEntry.fs/path` were very confusing: it was not clear what they really represented, and they were often used unnecessarily during different operations (for example, `checkout` mutates them instead of returning a new local index). This also removes the unserializable `fs` instance from `DataIndexEntry`, making entries much easier to work with after loading. The new `Storage.fs/path` concepts fit nicely into dvc's import functionality by giving a clear way to declare where to get the data from. Related iterative/dvc#8827
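To illustrate the idea (not the actual dvc-data implementation), here is a rough sketch of how a `Storage`-like object could tell the index where an entry's contents live, while the entry itself stays plain and serializable. Everything here is an assumption made for illustration: the `protocol` string standing in for an `fs` instance, the `content_addressed` flag, the `StorageMap` helper, and the two-character hash prefix layout.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Key = Tuple[str, ...]


@dataclass(frozen=True)
class Storage:
    """Hypothetical: describes where to get data contents from."""

    protocol: str  # plain string instead of a live fs object (assumption)
    path: str  # root directory or object-storage prefix
    content_addressed: bool = False  # False: plain backup in a directory

    def locate(self, rel_key: Key, hash_value: Optional[str] = None) -> str:
        if self.content_addressed:
            if hash_value is None:
                raise ValueError("object storage needs the entry hash")
            # Illustrative layout only: <path>/<first 2 chars>/<rest>
            return f"{self.path}/{hash_value[:2]}/{hash_value[2:]}"
        # Plain backup: contents mirror the index key under the root.
        return "/".join((self.path, *rel_key))


@dataclass(frozen=True)
class Entry:
    """Hypothetical entry: no fs/path, only serializable fields."""

    key: Key
    hash_value: Optional[str] = None


class StorageMap:
    """Hypothetical mapping from index key prefixes to storages."""

    def __init__(self) -> None:
        self._by_prefix: Dict[Key, Storage] = {}

    def add(self, prefix: Key, storage: Storage) -> None:
        self._by_prefix[prefix] = storage

    def resolve(self, entry: Entry) -> str:
        # Longest matching prefix wins.
        for n in range(len(entry.key), -1, -1):
            storage = self._by_prefix.get(entry.key[:n])
            if storage is not None:
                return storage.locate(entry.key[n:], entry.hash_value)
        raise KeyError(entry.key)


storages = StorageMap()
storages.add(("data",), Storage("local", "/backup"))
storages.add(("models",), Storage("s3", "bucket/cache", content_addressed=True))
```

With this shape, an operation like `checkout` would only read from `storages.resolve(entry)` and would have no per-entry `fs`/`path` to mutate.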
One last significant problem to figure out is no-.dir outputs in cloud versioning, where we don't save |
Putting iterative/dvc-data#208 to use.
Note that this is experimental for now and needs to be enabled with `dvc config feature.data_index_cache true`.
For example, sequential run of
goes down from ~16sec to ~8sec (2x improvement)
and
goes down from ~11sec to ~1.5sec (7x improvement)
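
For anyone who wants to take similar measurements locally, a minimal timing sketch follows. The helper names `time_sequential_runs` and `speedup` are made up for this example, and the command passed in is a placeholder (a no-op Python invocation), not one of the commands measured above.

```python
import subprocess
import sys
import time
from typing import List, Sequence


def time_sequential_runs(cmd: Sequence[str], runs: int = 2) -> List[float]:
    """Run `cmd` several times in a row and return each wall-clock duration."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        durations.append(time.perf_counter() - start)
    return durations


def speedup(before_sec: float, after_sec: float) -> float:
    """Ratio of the old duration to the new one."""
    return before_sec / after_sec


if __name__ == "__main__":
    # Placeholder command; substitute the command you want to measure.
    placeholder_cmd = [sys.executable, "-c", "pass"]
    print(time_sequential_runs(placeholder_cmd, runs=2))
    print(speedup(16, 8))  # the ~16sec -> ~8sec case
    print(round(speedup(11, 1.5), 1))  # the ~11sec -> ~1.5sec case
```

Running the same command once with `feature.data_index_cache` disabled and once with it enabled, then passing the two durations to `speedup`, gives the kind of ratio quoted above.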