Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace Append-only format with Sediment #57

Draft
wants to merge 8 commits into
base: main
Choose a base branch
from
Draft

Replace Append-only format with Sediment #57

wants to merge 8 commits into from

Conversation

ecton
Copy link
Member

@ecton ecton commented Jul 10, 2022

Closes #58.

This TODO list is for the previous attempt at integrating Sediment before it was rewritten in December

  • transaction log caching needs to be replaced with LRUs
  • When removing or overwriting an existing value, the GrainId of the value needs to be freed.
  • Checkpointing needs to be updated to ensure readers will never have their "read" state freed. Currently Roots doesn't checkpoint at all, an TreeFile auto-checkpoints when not transactional.

ecton added 8 commits May 29, 2022 17:02
This introduces a third ActiveState that is managed on a
per-committing-transaction basis. Rather than wait to release the locks
on trees until after the transaction is fully synchronized, these
changes release the locks after the file is fsynced but before the
transaction log is synced.

Before the transaction log is synced and the locks are released, the
write state is cloned. After the transaction is confirmed, the state is
published (taking care to ensure an older state doesn't overwrite a
newer state since the OS controls waking threads).

This works, but doesn't really offer parallelism due to the inability to
batch the fsync operations. To complete this work, I need to figure out
a good way to let the transaction manager issue the fsync commands, but
be smart enough to only do it once per tree per batched commit.
The limiting factor of the last commit was that each write operation on
each tree still had its own fsync operation. This change moves the
responsibility of fsyncing to the transaction manager, which supports
batching multiple transactions together already.

Now the transaction manager keeps a unique list of paths that need to be
synchronized as part of the batch. This allows multiple pending
transactions for a single tree to pile up while waiting on the current
batch to be fsynced.

This is safe because when opening a file with a transaction manager, the
file is automatically rolled back to the last confirmed transaction id.
The transaction manager ensures each tree is fsynced before writing the
transaction log entry and fsyncing the transaction. All new data written
by pending transactions will be rolled back if a power loss occurs while
the next batch is being fsynced unless the transaction log is fully
synced.

These changes allowed a significant speedup to BonsaiDb's commerce
benchmark, and the new bottleneck is the reliance on two fsyncs.
Now returns the current read sequence rather than write.
This commit replaces the append-only format with Sediment to the point that the benchmarks function.

The benchmarks do not use the Roots type, and most of that functionality is currently commented out. There's still significant work left before BonsaiDb can be used atop this implementation.
This is still a little ugly, but all unit tests are passing. Compaction
is currently not implemented, and currently Sediment acts as an
append-only file because Roots doesn't ever checkpoint it.

This commit replaces the append-only transaction format with a
commit-log powered transaction log using Sediment. When committing
multiple trees, the transaction log thread will send the trees to the
thread pool to be committed, while it commits the transaction log
itself. Once the transaction log is fully synced, the log manager thread
waits for all other trees to finish syncing, then continues to work like
the previous implementation.

The benefit of using Sediment is that this format now supports
checkpointing the transaction log, and it also implements several other
todos in the process.

Closes #42. Closes #35. Closes #40.
Sediment's file manager now supports removing paths.
While testing BonsaiDb, I encountered several problems with the
implementation:

- Automatic checkpointing had to be disabled. When running a
  multithreaded workflow, we may have readers that require a BatchId to
  remain present until their read state has been dropped. Sediment will
  need some mechanism to track and prevent checkpointing beyond a
  certain point until in-memory handles have been dropped.

  Disabling automatic checkpointing puts Sediment at a performance
  disadvantage, as one of the benefits is the abilty to reuse space
  instead of writing to the end.
- GrainIds shouldn't ever be 0. I replaced all GrainIds with
  Option<GrainId> to prevent asking Sediment for an invalid grain.
- After getting the Commerce Benchmark running, I didn't really see an
  improvement -- given the disabling of automatic checkpointing, I
  shouldn't be surprised, but I felt like I should have seen *some*
  improvement. I noticed in profiling that sometimes the transaction
  batches were a single commit long, and so I theorized that by not
  allowing a little delay when trying to batch transactions that
  sometimes a single-transaction batch would get through, causing the
  remaining threads to block waiting on that one transaction to be
  written.

  This recv_deadline didn't move the needle, but I left it in because
  I still think this is one part of a better batching approach. My new
  theory this morning is that the transaction batcher isn't batching
  as much as it could. For example, if the transaction channel has ids
  [1, 3, 2, 4, 6, 5] in it, right now it would be committed as [1], [2, 3,
  4], [5. 6]. This is because the first time a discontiguous ID is
  detected, the currently batched list is committed. I want to refactor
  the transaction batcher to always exhaust the channel before
  determining the list of safe-to-commit transactions.

  That change would also have benefited the old format.

  # Please enter the commit message for your changes. Lines starting
Sediment now has a CheckpointGuard which prevents checkpointing beyond
a protected BatchId. The ActiveState now holds a reference to the
CheckpointGuard pointed to by the most recent comment.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Nebari v0.6: Reimplementing Nebari's Storage Layer
1 participant