Fix #1749: leaked transient X-lock [main] #1753

Merged
badrishc merged 1 commit into main from badrishc/fix-1749
Apr 30, 2026

@badrishc
Collaborator

When a SET / RMW / DELETE produces an AOF entry larger than AofPageSize, TsavoriteLog.ValidateAllocatedLength throws TsavoriteException("Entry does not fit on page") from inside the PostUpsertOperation / PostRMWOperation / PostDeleteOperation callback. The exception unwinds the finally block in InternalUpsert / InternalRMW / InternalDelete (and ContinuePendingRMW) before TransientXUnlock can run, leaving the hash bucket's transient exclusive lock held forever. Any subsequent op on the same bucket then spins indefinitely in a RETRY_LATER loop, pinning the server CPU at 100% until restart.

Wrap the Post*Operation call in a nested try/finally so TransientXUnlock always runs even on exception. The original TsavoriteException continues to propagate, preserving the observed behaviour for the user (connection closed) while leaving the server responsive.
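
The failure mode and the fix can be illustrated with a minimal, self-contained sketch. The names below (`BucketLock`, `LeakyUpsert`, `SafeUpsert`, `postOperation`) are hypothetical stand-ins and not Tsavorite's actual types or signatures; only the control-flow shape is meant to mirror the description above.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a hash bucket's transient exclusive lock.
public sealed class BucketLock
{
    private int held; // 0 = free, 1 = held

    public bool IsHeld => Volatile.Read(ref held) == 1;
    public bool TryLock() => Interlocked.CompareExchange(ref held, 1, 0) == 0;
    public void Unlock() => Volatile.Write(ref held, 0);
}

public static class LockLeakSketch
{
    // Leaky shape: the callback runs inside the finally block, ahead of the unlock.
    // If it throws, the rest of the finally block is skipped and the lock stays held.
    public static void LeakyUpsert(BucketLock bucket, Action postOperation)
    {
        if (!bucket.TryLock()) throw new InvalidOperationException("bucket busy");
        try
        {
            // ... perform the update under the lock ...
        }
        finally
        {
            postOperation();   // may throw, e.g. "Entry does not fit on page"
            bucket.Unlock();   // never reached when postOperation throws
        }
    }

    // Fixed shape: a nested try/finally guarantees the unlock,
    // while the original exception still propagates to the caller.
    public static void SafeUpsert(BucketLock bucket, Action postOperation)
    {
        if (!bucket.TryLock()) throw new InvalidOperationException("bucket busy");
        try
        {
            // ... perform the update under the lock ...
        }
        finally
        {
            try
            {
                postOperation();
            }
            finally
            {
                bucket.Unlock(); // always runs
            }
        }
    }
}
```

In this sketch a second caller fails fast on a held lock; in the real code path described above it would instead keep retrying, which is what turns the leak into a CPU-pinning hang.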

While here, drop the dead 'latchOperation' / 'LatchRelease' machinery from InternalUpsert / InternalRMW / InternalDelete (the LatchOperation enum and the ref parameter on CheckCPRConsistency*): the CheckCPRConsistency* helpers never wrote to the ref parameter ("Now we no longer need to do the bucket latching" comment confirms it), so the switch in the LatchRelease block was unreachable. Renamed the empty LatchRelease label to Done for clarity.

Adds a regression test (OversizedAofEntryDoesNotHangServer) plus an aofPageSize parameter on TestUtils.CreateGarnetServer so the test can trigger the oversize path with a 4 KB AOF page.
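
As a rough illustration of how a "does not hang" check can be expressed, the sketch below runs an operation on a worker task and reports whether it finished within a deadline. This is not the test added in this PR; `HangCheckSketch` and `CompletesWithin` are hypothetical names, and a real test would issue commands against a server created with a small AOF page size.

```csharp
using System;
using System.Threading.Tasks;

public static class HangCheckSketch
{
    // Returns true if 'operation' finishes (normally or by throwing) within
    // 'timeout', and false if it is still running when the deadline expires.
    public static bool CompletesWithin(Action operation, TimeSpan timeout)
    {
        var task = Task.Run(operation);
        try
        {
            return task.Wait(timeout);
        }
        catch (AggregateException)
        {
            // The operation threw: it completed, so it did not hang.
            return true;
        }
    }
}
```

A test built on this idea would first trigger the oversized entry (expecting an error), then assert that a follow-up operation on the same key completes within a few seconds.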

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings April 29, 2026 23:29
Contributor

Copilot AI left a comment

Pull request overview

Fixes a Tsavorite transient hash-bucket X-lock leak that could hang the server when a Post*Operation callback (notably AOF append) throws due to an oversized AOF entry, and adds a regression test to ensure the server remains responsive.

Changes:

  • Ensure TransientXUnlock always runs by wrapping PostUpsertOperation / PostRMWOperation / PostDeleteOperation (and pending RMW completion) in a nested try/finally.
  • Remove dead/unreachable latch-release machinery (LatchOperation + related plumbing) and rename the now-empty LatchRelease label to Done.
  • Add a regression test (OversizedAofEntryDoesNotHangServer) and expose aofPageSize in TestUtils.CreateGarnetServer to trigger the oversize path with small payloads.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| test/Garnet.test/TestUtils.cs | Adds optional aofPageSize to test server creation and wires it to GarnetServerOptions.AofPageSize. |
| test/Garnet.test/RespAofTests.cs | Adds regression test that reproduces the oversized-AOF-entry exception and verifies subsequent ops don’t hang. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalUpsert.cs | Ensures transient X-unlock runs even if PostUpsertOperation throws; removes dead latch logic and uses Done label. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalRMW.cs | Same exception-safe unlock pattern for RMW; removes dead latch logic and uses Done label. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalDelete.cs | Same exception-safe unlock pattern for Delete; removes dead latch logic and uses Done label. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/Helpers.cs | Removes unused LatchOperation enum. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/ContinuePending.cs | Applies the same nested try/finally pattern to ContinuePendingRMW to prevent lock leaks on callback exceptions. |

@badrishc badrishc merged commit d746a55 into main Apr 30, 2026
66 of 67 checks passed
@badrishc badrishc deleted the badrishc/fix-1749 branch April 30, 2026 02:19