Hold per-DB checkpoint locks until all general-BGSAVE per-DB checkpoints complete by badrishc · Pull Request #1796 · microsoft/garnet

badrishc · 2026-05-12T22:21:02Z

Summary

Fixes a residual race in #1767 that caused MultiDatabaseTests.MultiDatabaseSaveInProgressTest to flake in CI Release builds (e.g. https://github.com/microsoft/garnet/actions/runs/25757540604/job/75650328662).

Root cause

#1767 made the general BGSAVE synchronously pause all per-DB checkpoint locks (TryPauseCheckpoints(id)) before returning "Background saving started", so that a subsequent BGSAVE <dbId> would observe the in-progress checkpoint and fail. But RunPausedCheckpointAsync released each per-DB lock in its finally block as soon as that single DB's checkpoint completed, not when the entire general save finished.

In the failing test:

// Issue general background save
res = db1.Execute("BGSAVE");
ClassicAssert.AreEqual("Background saving started", res.ToString());

// Issue background save to DB 0 while general save is in progress - illegal
Assert.Throws<RedisServerException>(() => db1.Execute("BGSAVE", "0"),
    Encoding.ASCII.GetString(CmdStrings.RESP_ERR_CHECKPOINT_ALREADY_IN_PROGRESS));

If DB 0's checkpoint completed before BGSAVE 0 arrived over the wire, the DB 0 lock had already been released, so BGSAVE 0 succeeded and the assertion failed. Locally the test takes 6-7s and never loses the race; in CI Release it ran in 1s and reliably failed.
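The race can be reproduced deterministically in a small model. The sketch below is not Garnet's code: CheckpointLockModel, RaceDemo, and the releaseEachLockEarly switch are hypothetical names, and a per-DB integer flag stands in for the real checkpoint lock. It only illustrates why releasing each lock as its own checkpoint finishes lets a later BGSAVE 0 through.

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical model: one flag per DB stands in for the per-DB checkpoint lock.
public sealed class CheckpointLockModel
{
    private readonly int[] paused; // 0 = free, 1 = checkpoint in progress

    public CheckpointLockModel(int dbCount) => paused = new int[dbCount];

    // Plays the role of TryPauseCheckpoints(id): succeeds only if the lock is free.
    public bool TryPause(int dbId) =>
        Interlocked.CompareExchange(ref paused[dbId], 1, 0) == 0;

    // Plays the role of ResumeCheckpoints(id).
    public void Resume(int dbId) => Volatile.Write(ref paused[dbId], 0);
}

public static class RaceDemo
{
    // Returns true if a per-DB save on DB 0 is accepted after DB 0's checkpoint
    // has finished but while DB 1's checkpoint is still running.
    public static async Task<bool> PerDbSaveAcceptedMidFlightAsync(bool releaseEachLockEarly)
    {
        var locks = new CheckpointLockModel(2);

        // The general save pauses every DB before replying to the client.
        locks.TryPause(0);
        locks.TryPause(1);

        var db1StillRunning = new TaskCompletionSource<bool>();

        async Task RunCheckpointAsync(int dbId, Task work)
        {
            try { await work; }
            finally { if (releaseEachLockEarly) locks.Resume(dbId); } // old behavior
        }

        Task db0 = RunCheckpointAsync(0, Task.CompletedTask);      // fast DB
        Task db1 = RunCheckpointAsync(1, db1StillRunning.Task);    // slow DB

        // The per-DB save for DB 0 arrives now, in the middle of the general save.
        bool accepted = locks.TryPause(0);

        db1StillRunning.SetResult(true);
        await Task.WhenAll(db0, db1);

        // New behavior: release everything only after all checkpoints are done.
        if (!releaseEachLockEarly) { locks.Resume(0); locks.Resume(1); }
        return accepted;
    }
}
```

With releaseEachLockEarly set to true the model accepts the mid-flight save (the flaky outcome); with false it refuses it, which is what the test expects.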

Fix

In libs/server/Databases/MultiDatabaseManager.cs:

- RunPausedCheckpointAsync: removed the ResumeCheckpoints(dbId) call from its finally block; the caller now owns the resume.
- RunPausedCheckpointsAndReleaseLocksAsync (used by both general and per-DB BGSAVE): resumes all pre-paused DBs in its outer finally, after Task.WhenAll. Pre-fills checkpointTasks[] with Task.CompletedTask and double-awaits in the catch block so a synchronous task-creation throw cannot leave a per-DB checkpoint running while its lock is being resumed. The handedOffCount partial-resume logic is removed; it is no longer needed since the helper no longer self-resumes.
- TaskCheckpointBasedOnAofSizeLimitAsync (the only other caller of RunPausedCheckpointAsync, used by AOF-size-driven checkpoints): hoists pausedDbId to outer scope and calls ResumeCheckpoints(pausedDbId) in its outer finally.
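The pre-fill and double-await pattern described above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the code in MultiDatabaseManager.cs: HoldLocksUntilAllDone, startCheckpoint, and resume are hypothetical stand-ins for the helper, the per-DB checkpoint call, and ResumeCheckpoints(dbId).

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class HoldLocksUntilAllDone
{
    // pausedDbIds: DBs whose locks the caller has already paused.
    // startCheckpoint: hypothetical delegate that starts one DB's checkpoint
    //                  and may throw synchronously before returning a task.
    // resume: hypothetical stand-in for ResumeCheckpoints(dbId).
    public static async Task RunAsync(
        IReadOnlyList<int> pausedDbIds,
        Func<int, Task> startCheckpoint,
        Action<int> resume)
    {
        // Pre-fill so every slot is awaitable even if the loop below throws partway.
        var tasks = new Task[pausedDbIds.Count];
        Array.Fill(tasks, Task.CompletedTask);

        try
        {
            for (int i = 0; i < tasks.Length; i++)
                tasks[i] = startCheckpoint(pausedDbIds[i]);

            await Task.WhenAll(tasks);
        }
        catch
        {
            // Second await: checkpoints that did start must finish before any
            // lock is resumed. Their own failures are ignored here because the
            // original exception is rethrown below.
            try { await Task.WhenAll(tasks); } catch { }
            throw;
        }
        finally
        {
            // Single release point, reached only after every started task is done.
            foreach (int dbId in pausedDbIds)
                resume(dbId);
        }
    }
}
```

The design choice this illustrates is a single owner for the release: because no per-task code resumes a lock, there is nothing to track about which locks were already handed off.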

Net effect

- General BGSAVE: per-DB locks are held until ALL per-DB checkpoints complete, so any per-DB BGSAVE issued mid-flight reliably fails with "checkpoint already in progress". ✓
- Per-DB BGSAVE alone (single-DB path with pausedCount=1): unchanged. That single lock is still released exactly when that single checkpoint completes.
- AOF-size-driven checkpoint: unchanged. It still releases the per-DB lock when its checkpoint completes (just resumed in the caller's finally instead of in the helper).
- Other legal scenarios preserved:
  - per-DB then per-DB on different DB → both succeed
  - per-DB then general → general succeeds (skips already-paused DBs)
  - general then general → second one fails (guarded by multiDbCheckpointingLock)
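The admission rules listed above can be written down as a small state model. This is a sketch only: SaveAdmissionModel and its members are hypothetical, and it assumes that a per-DB save consults only its own per-DB lock while a general save additionally takes a single global flag standing in for multiDbCheckpointingLock.

```csharp
using System.Collections.Generic;
using System.Threading;

// Hypothetical admission model for per-DB and general background saves.
public sealed class SaveAdmissionModel
{
    private readonly int[] dbPaused;   // per-DB lock: 0 = free, 1 = held
    private int generalInProgress;     // stand-in for multiDbCheckpointingLock

    public SaveAdmissionModel(int dbCount) => dbPaused = new int[dbCount];

    // Per-DB save: admitted only if that DB's lock is free.
    public bool TryStartPerDbSave(int dbId) =>
        Interlocked.CompareExchange(ref dbPaused[dbId], 1, 0) == 0;

    public void FinishPerDbSave(int dbId) => Volatile.Write(ref dbPaused[dbId], 0);

    // General save: refused if another general save is running; otherwise it
    // pauses every DB it can and skips DBs that are already paused.
    public bool TryStartGeneralSave(out List<int> pausedByThisSave)
    {
        pausedByThisSave = new List<int>();
        if (Interlocked.CompareExchange(ref generalInProgress, 1, 0) != 0)
            return false;

        for (int dbId = 0; dbId < dbPaused.Length; dbId++)
            if (TryStartPerDbSave(dbId))
                pausedByThisSave.Add(dbId);
        return true;
    }

    // Releases only the locks this general save took, and only at the very end.
    public void FinishGeneralSave(List<int> pausedByThisSave)
    {
        foreach (int dbId in pausedByThisSave)
            Volatile.Write(ref dbPaused[dbId], 0);
        Volatile.Write(ref generalInProgress, 0);
    }
}
```

In this model a per-DB save on DB 2 issued after a general save has started is refused until FinishGeneralSave runs, which corresponds to the first bullet of the net effect.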

Verification

15/15 runs of MultiDatabaseSaveInProgressTest pass in Release config locally.
Full MultiDatabaseTests suite (31/31) passes locally.
Reviewed by GPT-5.5 code-review agent — no findings.

Copilot

Pull request overview

This PR fixes a race in multi-database background checkpointing where a general BGSAVE could release an individual DB’s checkpoint lock as soon as that DB finished, allowing a subsequent BGSAVE <dbId> to sometimes succeed mid-flight and flake MultiDatabaseSaveInProgressTest (notably in fast CI Release runs). The change centralizes ownership of per-DB lock resumption so locks remain held until the full general save completes.

Changes:

Moved per-DB checkpoint lock resumption responsibility out of RunPausedCheckpointAsync and into its callers.
Updated the general/per-DB BGSAVE helper to resume all pre-paused DB checkpoint locks only after Task.WhenAll completes.
Adjusted the AOF-size-driven checkpoint path to resume the paused DB lock in its outer finally.
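For the single-DB path, moving the resume into the caller's outer finally amounts to declaring the paused DB id outside the try block. The sketch below assumes a sentinel of -1 for "nothing paused"; SingleDbCheckpoint, pickAndPauseDb, takeCheckpoint, and resume are hypothetical names, not Garnet APIs.

```csharp
using System;
using System.Threading.Tasks;

public static class SingleDbCheckpoint
{
    // pickAndPauseDb: hypothetical; returns the id of the DB it paused, or -1 if none.
    // takeCheckpoint: hypothetical; runs that DB's checkpoint.
    // resume: hypothetical stand-in for ResumeCheckpoints(dbId).
    public static async Task RunAsync(
        Func<int> pickAndPauseDb,
        Func<int, Task> takeCheckpoint,
        Action<int> resume)
    {
        // Declared outside the try so the finally block can see which DB,
        // if any, was actually paused.
        int pausedDbId = -1;
        try
        {
            pausedDbId = pickAndPauseDb();
            if (pausedDbId < 0)
                return; // nothing was paused, so nothing to resume

            await takeCheckpoint(pausedDbId);
        }
        finally
        {
            // The caller owns the release, on success and on failure alike.
            if (pausedDbId >= 0)
                resume(pausedDbId);
        }
    }
}
```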

badrishc · 2026-05-13T02:08:17Z

Addressed in 7a46983: reworded the doc comment so the parameter range reads as plain prose instead of mixing [0..N) half-open notation with self-closing <paramref/> XML tags.

…nts complete

Fixes a residual race in PR #1767 that caused MultiDatabaseSaveInProgressTest to flake in CI Release builds.

The general BGSAVE path synchronously paused all per-DB checkpoint locks before returning 'Background saving started', but the per-DB checkpoint helper released each per-DB lock as soon as that single DB's checkpoint completed - not when the entire general save finished. If DB 0's checkpoint completed before the test's 'BGSAVE 0' arrived over the wire, BGSAVE 0 would succeed instead of failing with 'ERR checkpoint already in progress'. Locally the test takes 6-7s and never loses the race; in CI Release it ran in 1s and reliably failed. See https://github.com/microsoft/garnet/actions/runs/25757540604/job/75650328662.

Fix:
- RunPausedCheckpointsAndReleaseLocksAsync (used by both general and per-DB BGSAVE) resumes ALL pre-paused DBs in its outer finally, after Task.WhenAll. So per-DB locks are held until ALL per-DB checkpoints complete, not just each individual one. A per-DB BGSAVE issued mid-flight reliably observes the in-progress checkpoint.
- The per-DB checkpoint inner work is now a local async function TakeOneCheckpointAsync that performs only (TakeCheckpointAsync + UpdateLastSaveData) without resuming.
- Pre-fill checkpointTasks[] with Task.CompletedTask so the catch path can safely double-await even if the synchronous task-creation loop throws partway through. The double-await ensures we never resume a per-DB lock while its checkpoint is still running.
- Remove the handedOffCount partial-resume bookkeeping that's no longer needed.
- The previously-shared RunPausedCheckpointAsync helper is removed - its only other caller (TaskCheckpointBasedOnAofSizeLimitAsync) now inlines the same try/checkpoint/update/finally/resume sequence so its single-DB pause-resume lifecycle is visible in one place.

Net effect:
- General BGSAVE: per-DB locks held until ALL per-DB checkpoints complete, so any per-DB BGSAVE issued mid-flight reliably fails with 'checkpoint already in progress'.
- Per-DB BGSAVE alone (single-DB path through RunPausedCheckpointsAndReleaseLocksAsync with pausedCount=1): unchanged - that single per-DB lock is still released exactly when that single checkpoint completes.
- AOF-size-driven checkpoint: behaviorally unchanged (lock cleanup inlined).
- Other legal scenarios (per-DB then per-DB on different DB, per-DB then general, general blocks general) preserved.

Verification: 10/10 runs in Release config of MultiDatabaseSaveInProgressTest + MultiDatabaseGeneralSaveBlocksGeneralSaveTest, full MultiDatabaseTests suite (31/31) passes locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 12, 2026 22:21

Copilot started reviewing on behalf of badrishc May 12, 2026 22:22

Copilot AI reviewed May 12, 2026

Comment thread libs/server/Databases/MultiDatabaseManager.cs Outdated

badrishc force-pushed the badrishc/fix-multidb-bgsave-race branch 3 times, most recently from 06fcc46 to 7a46983, May 13, 2026 02:08

badrishc force-pushed the badrishc/fix-multidb-bgsave-race branch from 7a46983 to da825f2, May 13, 2026 02:12

badrishc wants to merge 1 commit into main from badrishc/fix-multidb-bgsave-race