Fix #1749: leaked transient X-lock [main]#1753
Merged
Conversation
When a SET / RMW / DELETE produces an AOF entry larger than AofPageSize,
TsavoriteLog.ValidateAllocatedLength throws TsavoriteException("Entry
does not fit on page") from inside the PostUpsertOperation /
PostRMWOperation / PostDeleteOperation callback. The exception unwinds
the finally block in InternalUpsert / InternalRMW / InternalDelete (and
ContinuePendingRMW) before TransientXUnlock can run, leaving the hash
bucket's transient exclusive lock held forever. Any subsequent op on the
same bucket then spins indefinitely in a RETRY_LATER loop, pinning the
server CPU at 100% until restart.
Wrap the Post*Operation call in a nested try/finally so TransientXUnlock
always runs even on exception. The original TsavoriteException continues
to propagate, preserving the observed behaviour for the user (connection
closed) while leaving the server responsive.
While here, drop the dead 'latchOperation' / 'LatchRelease' machinery
from InternalUpsert / InternalRMW / InternalDelete (the LatchOperation
enum and the ref parameter on CheckCPRConsistency*): the
CheckCPRConsistency* helpers never wrote to the ref parameter ("Now we
no longer need to do the bucket latching" comment confirms it), so the
switch in the LatchRelease block was unreachable. Renamed the empty
LatchRelease label to Done for clarity.
Adds a regression test (OversizedAofEntryDoesNotHangServer) plus an
aofPageSize parameter on TestUtils.CreateGarnetServer so the test can
trigger the oversize path with a 4 KB AOF page.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Fixes a Tsavorite transient hash-bucket X-lock leak that could hang the server when a Post*Operation callback (notably AOF append) throws due to an oversized AOF entry, and adds a regression test to ensure the server remains responsive.
Changes:
- Ensure
TransientXUnlockalways runs by wrappingPostUpsertOperation/PostRMWOperation/PostDeleteOperation(and pending RMW completion) in a nestedtry/finally. - Remove dead/unreachable latch-release machinery (
LatchOperation+ related plumbing) and rename the now-semantic label toDone. - Add a regression test (
OversizedAofEntryDoesNotHangServer) and exposeaofPageSizeinTestUtils.CreateGarnetServerto trigger the oversize path with small payloads.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| test/Garnet.test/TestUtils.cs | Adds optional aofPageSize to test server creation and wires it to GarnetServerOptions.AofPageSize. |
| test/Garnet.test/RespAofTests.cs | Adds regression test that reproduces the oversized-AOF-entry exception and verifies subsequent ops don’t hang. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalUpsert.cs | Ensures transient X-unlock runs even if PostUpsertOperation throws; removes dead latch logic and uses Done label. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalRMW.cs | Same exception-safe unlock pattern for RMW; removes dead latch logic and uses Done label. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalDelete.cs | Same exception-safe unlock pattern for Delete; removes dead latch logic and uses Done label. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/Helpers.cs | Removes unused LatchOperation enum. |
| libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/ContinuePending.cs | Applies the same nested try/finally pattern to ContinuePendingRMW to prevent lock leaks on callback exceptions. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
TedHartMS
approved these changes
Apr 30, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When a SET / RMW / DELETE produces an AOF entry larger than AofPageSize, TsavoriteLog.ValidateAllocatedLength throws TsavoriteException("Entry does not fit on page") from inside the PostUpsertOperation / PostRMWOperation / PostDeleteOperation callback. The exception unwinds the finally block in InternalUpsert / InternalRMW / InternalDelete (and ContinuePendingRMW) before TransientXUnlock can run, leaving the hash bucket's transient exclusive lock held forever. Any subsequent op on the same bucket then spins indefinitely in a RETRY_LATER loop, pinning the server CPU at 100% until restart.
Wrap the Post*Operation call in a nested try/finally so TransientXUnlock always runs even on exception. The original TsavoriteException continues to propagate, preserving the observed behaviour for the user (connection closed) while leaving the server responsive.
While here, drop the dead 'latchOperation' / 'LatchRelease' machinery from InternalUpsert / InternalRMW / InternalDelete (the LatchOperation enum and the ref parameter on CheckCPRConsistency*): the CheckCPRConsistency* helpers never wrote to the ref parameter ("Now we no longer need to do the bucket latching" comment confirms it), so the switch in the LatchRelease block was unreachable. Renamed the empty LatchRelease label to Done for clarity.
Adds a regression test (OversizedAofEntryDoesNotHangServer) plus an aofPageSize parameter on TestUtils.CreateGarnetServer so the test can trigger the oversize path with a 4 KB AOF page.