Refactor Documents and Views to better utilize Nebari #250
This also has one small change of moving when the documents are queried into the transaction. This should have no actual effect, since the integrity scanner must run for all views now before a document transaction is applied.
While working on khonsulabs/bonsaidb#225, I had to do major updates on how documents are stored. In the new storage scheme, the key in Nebari is the DocumentId, Revision::id is Nebari's SequenceId, and Revision::hash is indexed and stored as an embedded index. This means that retrieving a document needs access to the index to fully construct the Document record -- hence these new APIs. Refs: khonsulabs/bonsaidb#250
Both khonsulabs#76 and khonsulabs#225 ended up being heavily intertwined. This is not yet in its final form, but it's complete enough that unit tests are passing (aside from backwards compatibility ones).

Document Storage:

Documents are no longer serialized in a wrapper document type. Instead, the documents tree is now a versioned tree with an embedded index that stores the document's hash. The Revision's id is now the versioned tree's sequence_id.

This means that instead of simply pulling a document out of the database and deserializing it, we must pull the value and index out for a key and combine it with the key to create our document.

The other major change is introduced by the constraints of working within Nebari's modification system. Because we don't have access to the index for a key we're about to set, most of the logic for creating the OperationResult has been moved outside of the CompareSwap operation.

View Storage:

Views have been refactored to store the reduced value in Nebari through use of an embedded index. Instead of storing the entire ViewEntry structure in the view, we now only store the serialized `Vec<Entrymapping>`. The major change here is that Nebari will now reduce the stored index via the new `ViewIndexer`. The changes haven't been made to reduce/reduce_grouped yet to use Nebari's native reduce function -- but that is the inspiration for these changes.

When retrieving a view entry, we reconstruct the ViewEntry using the stored index to maintain compatibility with the existing code that worked with the ViewEntry structure.

There are a lot of remaining tasks:

- Update reduce/reduce_grouped() to use Nebari's built-in reduction.
- Remove the invalidated_entries map and make the view mapper sequence based.
- Embed the DocumentMap tree in the ViewEntries tree by creating a custom Root.
- Once all the above are done, when the view indexer is running outside of a transaction (lazy views), the view can be persisted without fsync and be 100% safe to use due to the append-only file format.
This commit removes the invalidated entries tree, and uses the sequence index of the documents tree to drive the indexing. The mapping operation is batched and performed in such a way that if new data is added to the documents tree while the operation is being performed, the indexing still uses the sequence data as of the moment the mapping job was kicked off. This guarantee allows us to track the latest indexed sequence ID in the ViewEntries embedded index. The map job starts from the ViewEntries tree's latest sequence id + 1.
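As a rough illustration of the sequence-driven mapping described above, here is a minimal sketch that does not use Nebari at all. `SequenceIndex` and `run_map_job` are hypothetical names, and a plain `BTreeMap` stands in for the documents tree's by-sequence index. The job snapshots the latest sequence once, maps everything after the last indexed sequence in batches, and returns the new high-water mark.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for the documents tree's by-sequence index:
/// sequence id -> id of the document written at that sequence.
type SequenceIndex = BTreeMap<u64, u64>;

/// Collects the documents written after `last_indexed`, up to the latest
/// sequence visible when the job starts, grouped into batches.
/// Returns the batches and the new "latest indexed sequence" value.
fn run_map_job(
    sequences: &SequenceIndex,
    last_indexed: u64,
    batch_size: usize,
) -> (Vec<Vec<u64>>, u64) {
    // Snapshot the latest sequence once. Anything written later is left
    // for the next job, which starts at this snapshot + 1.
    let snapshot = sequences.keys().next_back().copied().unwrap_or(last_indexed);
    if snapshot <= last_indexed {
        // Nothing new to map (this also avoids an inverted range below).
        return (Vec::new(), last_indexed);
    }

    let pending: Vec<u64> = sequences
        .range(last_indexed + 1..=snapshot)
        .map(|(_, document_id)| *document_id)
        .collect();

    // `chunks` panics on a size of zero, so clamp to at least one.
    let batches: Vec<Vec<u64>> = pending
        .chunks(batch_size.max(1))
        .map(|batch| batch.to_vec())
        .collect();

    (batches, snapshot)
}
```

Calling `run_map_job` again with the returned high-water mark only picks up sequences written after the previous snapshot, which is the property the commit relies on.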
This was a weird one to debug, as it only showed up on the simultaneous-connections test. Yet, the bug was unrelated to multiprocessing. Eager views are meant to always be up-to-date. This contract was broken when multiprocessing was involved, because there was a logic bug: the index being returned from TransactionTree::remove is the existing index, which means its sequence id is of the removed sequence, not the newly written sequence (document entries aren't actually removed, for history preservation). The fix is to retrieve the new sequence value and map it instead. This ensures we're actually mapping the deleted version of the entry. The reason this didn't cause issues outside of multithreading is most tests are written without specifying an access policy, which means all the queries are AccessPolicy::UpdateBefore. This meant that the preparation for queries would still index it, as it wasn't actually up-to-date.
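The shape of this bug can be reproduced with a toy model. The sketch below is not Nebari's API: `ToyVersionedTree`, `KeyIndex`, and `remove_and_sequence_to_map` are hypothetical names. It assumes only what the message above states: a removal writes a new sequence (a tombstone) but hands back the index that existed before the removal.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-key index: the sequence that last touched the key.
#[derive(Clone, Debug, PartialEq)]
struct KeyIndex {
    sequence: u64,
    deleted: bool,
}

/// Toy versioned tree: removals append a tombstone instead of erasing history.
#[derive(Default)]
struct ToyVersionedTree {
    last_sequence: u64,
    entries: BTreeMap<String, KeyIndex>,
}

impl ToyVersionedTree {
    fn set(&mut self, key: &str) -> u64 {
        self.last_sequence += 1;
        let index = KeyIndex { sequence: self.last_sequence, deleted: false };
        self.entries.insert(key.to_string(), index);
        self.last_sequence
    }

    /// Returns the index that existed *before* the removal.
    fn remove(&mut self, key: &str) -> Option<KeyIndex> {
        let existing = self.entries.get(key).filter(|index| !index.deleted).cloned()?;
        self.last_sequence += 1;
        let tombstone = KeyIndex { sequence: self.last_sequence, deleted: true };
        self.entries.insert(key.to_string(), tombstone);
        Some(existing)
    }

    fn current_sequence(&self, key: &str) -> Option<u64> {
        self.entries.get(key).map(|index| index.sequence)
    }
}

/// Removes `key` and returns the sequence an eager view should map.
fn remove_and_sequence_to_map(tree: &mut ToyVersionedTree, key: &str) -> Option<u64> {
    let removed = tree.remove(key)?;
    // Mapping `removed.sequence` would re-map the old version; look up the
    // sequence that the removal itself wrote instead.
    let written = tree.current_sequence(key)?;
    debug_assert!(written > removed.sequence);
    Some(written)
}
```

In this model, setting a key and then removing it yields sequence 1 from `remove` but sequence 2 from `remove_and_sequence_to_map`, which is the version an eager view needs to see.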
This largely enables external Roots to be written. As part of khonsulabs/bonsaidb#250, I am moving the document map into the view entries tree, which requires using two roots like the Versioned root. The biggest limiting factor was a lot of unexported functionality and types. This set of changes also removes some of the &mut requirements for some of the closures. The last major change is adding the Value associated type to Root, which allows a tree to use something other than ArcBytes on its public API surface. The two built-in trees will continue to use ArcBytes, but the ViewEntries tree in BonsaiDb will be using a custom type to prevent extra deserialization.
This change turns ViewEntries into a new Root implementor for Nebari that stores the view entries in one B+Tree, and stores the document map in another B+Tree. This pull request does not yet add the ability to query from the document map. Once that is implemented, I can remove the external document map tree which will conclude the final format changes.
This removes the document_map tree, and stores it inline in a new custom Nebari Root. This custom tree supports querying what keys a document id emitted as well as what mappings were emitted for any given key. This branch also contains several other changes: The integrity scanner can spawn a mapping job, and that mapping job must use transactions if the view is eager. This set of changes addressed that, but it was also lumped in with a refactor to change from easy_parallel to rayon. While rayon is a heavier dependency, I was noticing a *lot* of traffic on profiles for spinning up new threads. Rayon uses a persistent thread pool for work, and by embracing it here, we can start using it in other locations as well.
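The two lookups the custom tree supports (keys emitted by a document id, and mappings emitted for a key) can be sketched with two in-memory maps. This is only an illustration of the data layout: `ViewIndexSketch` and its methods are hypothetical, and the real implementation stores both structures as B+Trees inside a Nebari Root rather than in `BTreeMap`s.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory model: view entries in one map, the document
/// map in another, kept in sync on every update.
#[derive(Default)]
struct ViewIndexSketch {
    /// view key -> (document id -> emitted value)
    by_key: BTreeMap<String, BTreeMap<u64, String>>,
    /// document id -> view keys that document emitted
    by_document: BTreeMap<u64, BTreeSet<String>>,
}

impl ViewIndexSketch {
    /// Replaces everything `document` previously emitted with `mappings`.
    fn set_document_mappings(&mut self, document: u64, mappings: Vec<(String, String)>) {
        // Use the document map to find and clear stale entries.
        if let Some(old_keys) = self.by_document.remove(&document) {
            for key in old_keys {
                let now_empty = match self.by_key.get_mut(&key) {
                    Some(entries) => {
                        entries.remove(&document);
                        entries.is_empty()
                    }
                    None => false,
                };
                if now_empty {
                    self.by_key.remove(&key);
                }
            }
        }

        let mut keys = BTreeSet::new();
        for (key, value) in mappings {
            self.by_key.entry(key.clone()).or_default().insert(document, value);
            keys.insert(key);
        }
        if !keys.is_empty() {
            self.by_document.insert(document, keys);
        }
    }

    /// Which keys did this document emit?
    fn keys_for_document(&self, document: u64) -> Vec<String> {
        self.by_document
            .get(&document)
            .map(|keys| keys.iter().cloned().collect())
            .unwrap_or_default()
    }

    /// Which (document id, value) mappings were emitted for this key?
    fn mappings_for_key(&self, key: &str) -> Vec<(u64, String)> {
        self.by_key
            .get(key)
            .map(|entries| entries.iter().map(|(id, value)| (*id, value.clone())).collect())
            .unwrap_or_default()
    }
}
```

Keeping the document map next to the entries is what lets an update remove a document's old mappings without scanning every key.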
I've been starting work on a new file format that is my best theorycraft at something that could sit beneath Nebari -- https://github.com/khonsulabs/sediment. At its core is the basic idea that while fsync is happening, other transactions can proceed with updating the database, and then be batch-synced to confirm. This would make the fsyncs on each thread take on average the normal time for a sync, but now transactions will be able to be batched. That core idea is actually somewhat compatible with the append-only format, except that only one writer can be modifying the tree at any given moment. I attempted to bring this idea into Nebari without the new project today, but I ran into another issue that Sediment wouldn't suffer from: multi-file synchronizations. The reason my work today didn't do much is that each tree file is still being synced for each write. I don't have a good way to batch these operations at the moment, but it's one of the things Sediment aims to solve. I may come up with an idea in the meantime and try again -- but the more I think about Sediment the more I'm hopeful it will be able to be significantly better than an append-only format, so I probably still want to get there anyways.
This is meant to be an atomic operation, and is implemented in SQL as a single query.
The collection sequence tracking I introduced as part of the sequence-based-mapping refactor was done incorrectly -- the sequence IDs can't be published to shared state until the transaction is confirmed. The edge case was that a lazy view could start mapping while a collection had a pending transaction being applied. The collection's sequence could report a higher number than the database would return via a query due to the transaction not being written yet. This was partially a Nebari bug as well -- Tree::current_transaction_id was implemented incorrectly, while TransactionTree/TreeFile were correct.
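The rule described above (publish a sequence to shared state only once its transaction is confirmed) can be sketched with an atomic counter. `CollectionSequence` and `PendingTransaction` are hypothetical names, and this is not how BonsaiDb or Nebari structure it internally. It only shows that readers never observe a sequence belonging to a pending or abandoned transaction.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical shared state: the highest sequence that readers, such as
/// a lazy view's mapper, are allowed to see.
struct CollectionSequence {
    published: AtomicU64,
}

/// Hypothetical in-flight transaction that tracks its writes privately.
struct PendingTransaction<'a> {
    shared: &'a CollectionSequence,
    highest_written: u64,
}

impl CollectionSequence {
    fn new() -> Self {
        Self { published: AtomicU64::new(0) }
    }

    fn latest(&self) -> u64 {
        self.published.load(Ordering::Acquire)
    }

    fn begin(&self) -> PendingTransaction<'_> {
        PendingTransaction { shared: self, highest_written: self.latest() }
    }
}

impl PendingTransaction<'_> {
    /// Records a write locally; nothing is visible to readers yet.
    fn write(&mut self) -> u64 {
        self.highest_written += 1;
        self.highest_written
    }

    /// Publishes the sequence only after the transaction is confirmed.
    fn commit(self) {
        self.shared.published.fetch_max(self.highest_written, Ordering::AcqRel);
    }
}
```

Dropping a `PendingTransaction` without calling `commit` publishes nothing, so a mapper that reads `latest()` cannot run ahead of what a query would return.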
This isn't completely functional, but I was ready to merge changes in for clippy fixes from main. Still, only 2 tests are broken in bonsaidb-local currently that are expected to be working.
Closes #76.
Closes #225.
The primary goal of this PR is to improve the speed of view indexing (see #251
for more info) by tackling #76 in such a way that it can be executed safely
without `fsync`.
Now that work has been done, the goals are slightly different:
- [...] based rather than invalidated-keys based.
- [...] `fsync`, and eager views to execute fully synchronized in their transaction (although it might still be safe to be fsync-less in the transaction context, but more thought needs to be done in that direction).
Document Storage:
Documents are no longer serialized in a wrapper document type. Instead,
the documents tree is now a versioned tree with an embedded index that
stores the document's hash. The Revision's id is now the versioned
tree's sequence_id.
This means that instead of simply pulling a document out of the database
and deserializing it, we must pull the value and index out for a key and
combine it with the key to create our document.
The other major change is introduced by the constraints of working
within Nebari's modification system. Because we don't have access to the
index for a key we're about to set, most of the logic for creating the
OperationResult has been moved outside of the CompareSwap operation.
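To make the "combine key, value, and index" step concrete, here is a minimal sketch. All of the types are simplified, hypothetical stand-ins (the real BonsaiDb and Nebari types carry more information), and `document_from_entry` is an illustrative name rather than one of the new APIs.

```rust
/// Hypothetical, simplified revision: id is the tree's sequence id and
/// hash is the value kept in the embedded index.
#[derive(Debug, PartialEq)]
struct Revision {
    id: u64,
    hash: [u8; 32],
}

/// Hypothetical, simplified document record.
#[derive(Debug, PartialEq)]
struct Document {
    id: u64,
    revision: Revision,
    contents: Vec<u8>,
}

/// What a lookup in the documents tree is assumed to return for one key.
struct TreeEntry {
    sequence_id: u64,
    hash_index: [u8; 32],
    value: Vec<u8>,
}

/// Rebuilds the document from the three pieces that are now stored apart:
/// the key (document id), the embedded index (hash), and the raw value.
fn document_from_entry(key: u64, entry: TreeEntry) -> Document {
    Document {
        id: key,
        revision: Revision { id: entry.sequence_id, hash: entry.hash_index },
        contents: entry.value,
    }
}
```

Because the value no longer carries its own header, none of these three pieces can be dropped when reading a document back.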
View Storage:
Views have been refactored to store the reduced value in Nebari through
use of an embedded index. Instead of storing the entire ViewEntry
structure in the view, we now only store the serialized
`Vec<Entrymapping>`. The major change here is that Nebari will now
reduce the stored index via the new `ViewIndexer`. The changes haven't
been made to reduce/reduce_grouped yet to use Nebari's native reduce
function -- but that is the inspiration for these changes.
When retrieving a view entry, we reconstruct the ViewEntry using the
stored index to maintain compatibility with the existing code that
worked with the ViewEntry structure.
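The idea of keeping a reduced value in an embedded index can be sketched as follows, using a sum as the reduction. The names here (`EntryMappings`, `ReducedIndex`, `index_entry`, `reduce_indexes`) are hypothetical and do not reflect the `ViewIndexer` interface. The sketch only shows the two steps involved: computing an index for one key's mappings, and combining indexes from several keys into a single reduced value.

```rust
/// Hypothetical stand-in for the serialized mappings stored under one view key.
struct EntryMappings {
    values: Vec<u64>,
}

/// Hypothetical embedded index stored beside each key: the reduced value.
#[derive(Clone, Copy, Debug, PartialEq)]
struct ReducedIndex {
    sum: u64,
}

/// Computes the index for a single key from its mappings.
fn index_entry(mappings: &EntryMappings) -> ReducedIndex {
    ReducedIndex { sum: mappings.values.iter().sum() }
}

/// Combines the indexes of several keys (or child nodes) into one index,
/// without touching the stored mappings again.
fn reduce_indexes(children: &[ReducedIndex]) -> ReducedIndex {
    ReducedIndex { sum: children.iter().map(|child| child.sum).sum() }
}
```

With indexes stored next to the keys, a reduce over a key range can combine already-computed values instead of deserializing every mapping.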
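There are a lot of remaining tasks:
- Update reduce/reduce_grouped() to use Nebari's built-in reduction.
- Remove the invalidated_entries map and make the view mapper sequence based.
- Embed the DocumentMap tree in the ViewEntries tree by creating a custom Root.
- Once all the above are done, when the view indexer is running outside of a transaction (lazy views), the view can be persisted without fsync and be 100% safe to use due to the append-only file format.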