Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tlf_journal: protect against crasher during resolveBranch #21720

Merged
merged 1 commit into from
Jan 6, 2020

Conversation

strib
Copy link
Contributor

@strib strib commented Dec 18, 2019

A user hit the following issue:

  1. Their client completed a large conflict resolution, and called tlfJournal.resolveBranch() to finalize it.
  2. The new commit was written to a new mdJournal directory, which was then swapped into place, making it the main MD journal.
  3. Then, KBFS restarted, before resolveBranch() had a chance to mark all the MD revision markers in the block journal as ignorable.
  4. After the restart, the journal resumed flushing blocks. Concurrently, the user wrote more revisions into the MD journal, re-using some of the same revision numbers that had previously been conflict-resolved away in step 1.
  5. While flushing blocks, when the TLF journal discovered a revision marker for an old MD revision (which had originally been resolved and squashed in step 1), it decided it was ok to flush a new revision with that same revision number. This resulted in the revision being live to other clients before the blocks associated with that revision were available, and other clients could not read the TLF.

To protect against such a race, this commit adds a new notion of a "journal ID" for the MD journal. Each time the MD journal is cleared or resolved, and then a new revision is written into it, it gets a new unique ID that is persisted to a file. Also, each MD revision marker gets labeled with the current MD journal ID. When flushing, the code ignores any MD revision marker that doesn't match the current MD journal ID. That way if KBFS crashes/restarts after clearing the MD journal, but before updating the block journal, the old MD revision markers will be properly ignored when the flush resumes.

Issue: HOTPOT-1553

A user hit the following issue:

1. Their client completed a large conflict resolution, and called
`tlfJournal.resolveBranch()` to finalize it.
2. The new commit was written to a new mdJournal directory, which was
then swapped into place, making it the main MD journal.
3. Then, KBFS restarted, _before_ `resolveBranch()` had a chance to
mark all the MD revision markers in the block journal as ignorable.
4. After the restart, the journal resumed flushing blocks.
Concurrently, the user wrote more revisions into the MD journal,
re-using some of the same revision numbers that had previously been
conflict-resolved away in step 1.
5. While flushing blocks, when the TLF journal discovered a revision
marker for an old MD revision (which had originally been resolved and
squashed in step 1), it decided it was ok to flush a _new_ revision
with that same revision number.  This resulted in the revision being
live to other clients _before_ the blocks associated with that
revision were available, and other clients could not read the TLF.

To protect against such a race, this commit adds a new notion of a
"journal ID" for the MD journal.  Each time the MD journal is cleared
or resolved, and then a new revision is written into it, it gets a new
unique ID that is persisted to a file.  Also, each MD revision marker
gets labeled with the current MD journal ID.  When flushing, the code
ignores any MD revision marker that doesn't match the current MD
journal ID.  That way if KBFS crashes/restarts after clearing the MD
journal, but before updating the block journal, the old MD revision
markers will be properly ignored when the flush resumes.

Issue: HOTPOT-1553
@strib strib merged commit 91e6a45 into master Jan 6, 2020
@strib strib deleted the strib/HOTPOT-1553-more-atomic-journal-resolve branch January 6, 2020 18:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants