Hold per-DB checkpoint locks until all general-BGSAVE per-DB checkpoints complete#1796
Open
badrishc wants to merge 1 commit into
Open
Hold per-DB checkpoint locks until all general-BGSAVE per-DB checkpoints complete#1796badrishc wants to merge 1 commit into
badrishc wants to merge 1 commit into
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR fixes a race in multi-database background checkpointing where a general BGSAVE could release an individual DB’s checkpoint lock as soon as that DB finished, allowing a subsequent BGSAVE <dbId> to sometimes succeed mid-flight and flake MultiDatabaseSaveInProgressTest (notably in fast CI Release runs). The change centralizes ownership of per-DB lock resumption so locks remain held until the full general save completes.
Changes:
- Moved per-DB checkpoint lock resumption responsibility out of
RunPausedCheckpointAsyncand into its callers. - Updated the general/per-DB
BGSAVEhelper to resume all pre-paused DB checkpoint locks only afterTask.WhenAllcompletes. - Adjusted the AOF-size-driven checkpoint path to resume the paused DB lock in its outer
finally.
06fcc46 to
7a46983
Compare
Collaborator
Author
|
Addressed in 7a46983: reworded the doc comment so the parameter range reads as plain prose instead of mixing |
…nts complete Fixes a residual race in PR #1767 that caused MultiDatabaseSaveInProgressTest to flake in CI Release builds. The general BGSAVE path synchronously paused all per-DB checkpoint locks before returning 'Background saving started', but the per-DB checkpoint helper released each per-DB lock as soon as that single DB's checkpoint completed - not when the entire general save finished. If DB 0's checkpoint completed before the test's 'BGSAVE 0' arrived over the wire, BGSAVE 0 would succeed instead of failing with 'ERR checkpoint already in progress'. Locally the test takes 6-7s and the race never loses; in CI Release it ran in 1s and reliably failed. See https://github.com/microsoft/garnet/actions/runs/25757540604/job/75650328662. Fix: - RunPausedCheckpointsAndReleaseLocksAsync (used by both general and per-DB BGSAVE) resumes ALL pre-paused DBs in its outer finally, after Task.WhenAll. So per-DB locks are held until ALL per-DB checkpoints complete, not just each individual one. A per-DB BGSAVE issued mid-flight reliably observes the in-progress checkpoint. - The per-DB checkpoint inner work is now a local async function TakeOneCheckpointAsync that performs only (TakeCheckpointAsync + UpdateLastSaveData) without resuming. - Pre-fill checkpointTasks[] with Task.CompletedTask so the catch path can safely double-await even if the synchronous task-creation loop throws partway through. The double-await ensures we never resume a per-DB lock while its checkpoint is still running. - Remove the handedOffCount partial-resume bookkeeping that's no longer needed. - The previously-shared RunPausedCheckpointAsync helper is removed - its only other caller (TaskCheckpointBasedOnAofSizeLimitAsync) now inlines the same try/checkpoint/ update/finally/resume sequence so its single-DB pause-resume lifecycle is visible in one place. Net effect: - General BGSAVE: per-DB locks held until ALL per-DB checkpoints complete, so any per-DB BGSAVE issued mid-flight reliably fails with 'checkpoint already in progress'. - Per-DB BGSAVE alone (single-DB path through RunPausedCheckpointsAndReleaseLocksAsync with pausedCount=1): unchanged - that single per-DB lock is still released exactly when that single checkpoint completes. - AOF-size-driven checkpoint: behaviorally unchanged (lock cleanup inlined). - Other legal scenarios (per-DB then per-DB on different DB, per-DB then general, general blocks general) preserved. Verification: 10/10 runs in Release config of MultiDatabaseSaveInProgressTest + MultiDatabaseGeneralSaveBlocksGeneralSaveTest, full MultiDatabaseTests suite (31/31) passes locally. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
7a46983 to
da825f2
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a residual race in #1767 that caused
MultiDatabaseTests.MultiDatabaseSaveInProgressTestto flake in CI Release builds (e.g. https://github.com/microsoft/garnet/actions/runs/25757540604/job/75650328662).Root cause
#1767 made the general
BGSAVEsynchronously pause all per-DB checkpoint locks (TryPauseCheckpoints(id)) before returningBackground saving started, so a subsequentBGSAVE <dbId>would observe the in-progress checkpoint and fail. ButRunPausedCheckpointAsyncreleased each per-DB lock infinallyas soon as that single DB's checkpoint completed, not when the entire general save finished.In the failing test:
If DB 0's checkpoint completed before
BGSAVE 0arrived over the wire, BGSAVE 0 succeeded and the assertion failed. Locally the test takes 6-7s and the race never loses; in CI Release it ran in 1s and reliably failed.Fix
In
libs/server/Databases/MultiDatabaseManager.cs:RunPausedCheckpointAsync: removed theResumeCheckpoints(dbId)from itsfinallyblock — caller now owns the resume.RunPausedCheckpointsAndReleaseLocksAsync(used by both general and per-DB BGSAVE): resumes all pre-paused DBs in its outerfinally, afterTask.WhenAll. Pre-fillscheckpointTasks[]withTask.CompletedTaskand double-awaits in the catch block so a synchronous task-creation throw cannot leave a per-DB checkpoint running while its lock is being resumed. ThehandedOffCountpartial-resume logic is removed — no longer needed since the helper no longer self-resumes.TaskCheckpointBasedOnAofSizeLimitAsync(the only other caller ofRunPausedCheckpointAsync, used by AOF-size-driven checkpoints): hoistspausedDbIdto outer scope and callsResumeCheckpoints(pausedDbId)in its outerfinally.Net effect
checkpoint already in progress. ✓pausedCount=1): unchanged — that single lock is still released exactly when that single checkpoint completes.multiDbCheckpointingLock)Verification
MultiDatabaseSaveInProgressTestpass in Release config locally.MultiDatabaseTestssuite (31/31) passes locally.