
Fix MultiDatabaseSaveInProgressTest #1767

Merged
badrishc merged 7 commits into dev from badrishc/multidb-test-fix
May 8, 2026

Conversation

@badrishc
Collaborator

@badrishc badrishc commented May 4, 2026

The general BGSAVE path returned 'Background saving started' before its async helper had pause-locked any per-DB checkpoint locks. A subsequent BGSAVE <dbId> could win the race against TryPauseCheckpoints(dbId) and succeed, instead of failing with 'ERR checkpoint already in progress'.

This caused MultiDatabaseTests.MultiDatabaseSaveInProgressTest to flake in CI. SingleDatabaseManager.TakeCheckpointAsync already does the right thing by pausing synchronously before returning, which is why the single-DB equivalent test (SeSaveInProgressTest) does not flake.
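The race can be illustrated with a minimal Python sketch (a `threading.Lock` standing in for the per-DB checkpoint lock; all names are illustrative, not Garnet's):

```python
import threading, time

class CheckpointManager:
    """Toy model of BGSAVE admission; names are illustrative, not Garnet's."""

    def __init__(self):
        self.checkpoint_lock = threading.Lock()  # stands in for the per-DB pause lock

    def bgsave_broken(self):
        # Bug shape: reply "started" BEFORE anything is locked; a second
        # BGSAVE issued right after this returns can still win the race.
        threading.Thread(target=self._pause_then_checkpoint).start()
        return "Background saving started"

    def bgsave_fixed(self):
        # Fix shape: acquire synchronously, then hand off to the worker.
        if not self.checkpoint_lock.acquire(blocking=False):
            return "ERR checkpoint already in progress"
        threading.Thread(target=self._checkpoint_prelocked).start()
        return "Background saving started"

    def _pause_then_checkpoint(self):
        if self.checkpoint_lock.acquire(blocking=False):
            self._checkpoint_prelocked()

    def _checkpoint_prelocked(self):
        try:
            time.sleep(0.2)  # simulated checkpoint work
        finally:
            self.checkpoint_lock.release()

mgr = CheckpointManager()
first = mgr.bgsave_fixed()
second = mgr.bgsave_fixed()  # issued while the first save is in flight
```

With the fixed ordering, the second caller observes the lock deterministically; with the broken ordering, it depends on whether the worker thread got scheduled first.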

Fix:

  • Restructure MultiDatabaseManager.TakeCheckpointAsync(bool, ILogger, CancellationToken) so the synchronous portion now acquires databasesContentLock, multiDbCheckpointingLock (when multi-db), AND calls TryPauseCheckpoints(dbId) for every active DB before returning. Compactly store only successfully paused DB IDs in dbIdsToCheckpoint. Roll back any partially-acquired state on exception in the sync phase.
  • Add alreadyPaused parameter to TakeDatabasesCheckpointAsync so the new caller can skip the inline TryPauseCheckpoints. Add a catch fallback that resumes pre-paused DB IDs not yet handed off to per-DB helpers, preventing stranded locks.
  • Existing AOF-size-driven caller (TaskCheckpointBasedOnAofSizeLimitAsync) is unaffected; it uses the default alreadyPaused=false.

Regression test: MultiDatabaseGeneralSaveBlocksGeneralSaveTest verifies that a second general BGSAVE while one is in flight (multi-db) reliably returns 'ERR checkpoint already in progress' (multiDbCheckpointingLock is now held synchronously).
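The synchronous admission phase (pause every active DB, roll back partial state on failure) follows a pattern that can be sketched as below; `try_pause` and `resume` are hypothetical stand-ins for the per-DB TryPauseCheckpoints/ResumeCheckpoints calls:

```python
def pause_all_or_rollback(db_ids, try_pause, resume):
    """Pause every DB's checkpoint lock, or resume the ones already
    paused and re-raise if any acquisition fails partway through."""
    paused = []
    try:
        for db_id in db_ids:
            if not try_pause(db_id):
                raise RuntimeError("ERR checkpoint already in progress")
            paused.append(db_id)
        return paused
    except Exception:
        for db_id in reversed(paused):
            resume(db_id)
        raise
```

The same shape covers both failure modes mentioned above: a contended per-DB lock and an unexpected exception in the sync phase.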

Copilot AI review requested due to automatic review settings May 4, 2026 22:18
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes a race in multi-database background checkpointing where a general BGSAVE could return before per-DB checkpoint locks were synchronously paused, causing MultiDatabaseSaveInProgressTest to flake.

Changes:

  • Restructures MultiDatabaseManager.TakeCheckpointAsync to synchronously acquire the multi-DB checkpoint lock and pause per-DB checkpoints before returning.
  • Extends TakeDatabasesCheckpointAsync with an alreadyPaused option and adds rollback logic to avoid stranded per-DB pause locks on early failure/exception.
  • Adds a regression test ensuring a second general BGSAVE reliably fails while a multi-DB checkpoint is in progress.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

  • test/Garnet.test/MultiDatabaseTests.cs: Adds regression coverage for concurrent general BGSAVE during an active multi-DB background save.
  • libs/server/Databases/MultiDatabaseManager.cs: Moves critical lock acquisition/pause steps into the synchronous portion of checkpoint initiation and adds safer rollback paths.


@badrishc badrishc marked this pull request as draft May 5, 2026 18:06
@badrishc badrishc force-pushed the badrishc/multidb-test-fix branch from 31f3212 to 6402d96 Compare May 7, 2026 22:08
badrishc and others added 5 commits May 7, 2026 18:10
The general BGSAVE path returned 'Background saving started' before its
async helper had pause-locked any per-DB checkpoint locks. A subsequent
BGSAVE <dbId> could win the race against TryPauseCheckpoints(dbId) and
succeed, instead of failing with 'ERR checkpoint already in progress'.

This caused MultiDatabaseTests.MultiDatabaseSaveInProgressTest to flake
in CI. SingleDatabaseManager.TakeCheckpointAsync already does the right
thing by pausing synchronously before returning, which is why the
single-DB equivalent test (SeSaveInProgressTest) does not flake.

Fix:
- Restructure MultiDatabaseManager.TakeCheckpointAsync(bool, ILogger,
  CancellationToken) so the synchronous portion now acquires
  databasesContentLock, multiDbCheckpointingLock (when multi-db), AND
  calls TryPauseCheckpoints(dbId) for every active DB before returning.
  Compactly store only successfully paused DB IDs in dbIdsToCheckpoint.
  Roll back any partially-acquired state on exception in the sync phase.
- Add alreadyPaused parameter to TakeDatabasesCheckpointAsync so the new
  caller can skip the inline TryPauseCheckpoints. Add a catch fallback
  that resumes pre-paused DB IDs not yet handed off to per-DB helpers,
  preventing stranded locks.
- Existing AOF-size-driven caller (TaskCheckpointBasedOnAofSizeLimitAsync)
  is unaffected; it uses the default alreadyPaused=false.

Regression test: MultiDatabaseGeneralSaveBlocksGeneralSaveTest verifies
that a second general BGSAVE while one is in flight (multi-db) reliably
returns 'ERR checkpoint already in progress' (multiDbCheckpointingLock
is now held synchronously).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sAsync

The test has been timing out in CI. Set an explicit 180s cancellation
timeout so the shared ClusterTestContext.cts is configured accordingly
and polling loops (BackOff(cts.Token)) can exit cleanly instead of
hanging until the test runner kills the process.

Matches existing convention in this file (lines 357, 1291).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
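The cancellation-bounded polling described above can be sketched as a deadline-based loop (a Python analogue, not the actual BackOff/cts implementation):

```python
import time

def back_off_until(condition, timeout_s, poll_interval_s=0.01):
    """Poll `condition`, but give up at the deadline instead of hanging
    until an external runner kills the process."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(poll_interval_s)
    return False
```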
GPT 5.5 review of the prior commit flagged that handing pre-paused DB
IDs through the shared instance fields dbIdsToCheckpoint and
checkpointTasks is unsafe: HandleDatabaseAdded reallocates both fields
without coordinating with multiDbCheckpointingLock, so a SELECT that
adds a new active DB between the synchronous pause phase and the async
helper resuming after Task.Yield could swap in a fresh zero-initialized
array. The async helper would then read default 0 entries as 'paused
DB IDs', leaking the lock on the actually-paused DBs and double-resuming
DB 0 (which spins forever in SingleWriterMultiReaderLock.WriteUnlock
when called on an unlocked lock).

Fix:
- Remove the shared dbIdsToCheckpoint and checkpointTasks instance
  fields and the matching reallocation block in HandleDatabaseAdded.
- TakeCheckpointAsync(general) now allocates a local pausedDbIds buffer
  and passes it explicitly to its async helper / TakeDatabasesCheckpointAsync.
- TaskCheckpointBasedOnAofSizeLimitAsync allocates a local 1-element
  buffer (the loop breaks after the first oversized AOF anyway).
- TakeDatabasesCheckpointAsync now takes (int[] dbIds, int dbIdsCount)
  and allocates its own checkpointTasks array of exactly dbIdsCount,
  explicitly assigning Task.CompletedTask in the skip branch so
  Task.WhenAll never sees a null.

These buffers are tiny and BGSAVE / AOF-size-driven checkpoints are not
hot paths, so per-operation allocation is acceptable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
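The local-buffer hand-off can be sketched in Python (illustrative names; the real code passes an int[] plus a count):

```python
import asyncio

async def take_checkpoint(db_ids, try_pause, checkpoint_one, resume):
    # Pause synchronously and keep the paused IDs in a LOCAL buffer, so a
    # concurrent HandleDatabaseAdded-style reallocation of shared state
    # cannot swap the list out from under the async helper.
    paused = [d for d in db_ids if try_pause(d)]
    try:
        await asyncio.gather(*(checkpoint_one(d) for d in paused))
    finally:
        for d in paused:
            resume(d)
    return paused
```

Because the buffer is owned by this call, no other code path can observe or replace it mid-flight, and only genuinely paused IDs are ever resumed.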
1. Test variable naming clarity (was 'db1=GetDatabase(0)' / 'db2=GetDatabase(1)';
   now 'db0' / 'db1' to match the underlying Redis db index, and updated comment).

2. Assert.Throws<T>(action, message) treats the second argument as a *failure*
   message, not the expected exception message - it never validated that the
   server returned 'ERR checkpoint already in progress'. Capture the exception
   and assert on Message explicitly.

3. LASTSAVE wait loop hardened: capture baseline as long (no 2038 truncation),
   wait for advance past baseline, and add a 30s bounded timeout with a final
   ClassicAssert.Greater so a hang fails the test instead of stalling CI.

4/5. Add 'contentLockAlreadyHeld' parameter to TakeDatabasesCheckpointAsync so
    callers that already hold databasesContentLock as a reader skip the
    redundant nested re-acquisition. The lock is reentrant for readers (just
    a counter), so this was correct but redundant work; both call sites
    (TakeCheckpointAsync general and TaskCheckpointBasedOnAofSizeLimitAsync)
    now pass contentLockAlreadyHeld: true.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
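Point 2 is a general pitfall: asserting only that an exception was thrown validates nothing about its message. The fix pattern (capture, then assert on the message explicitly) sketched in Python with a stubbed server call:

```python
def bgsave_while_save_in_progress():
    # Stub standing in for the server call under test.
    raise RuntimeError("ERR checkpoint already in progress")

def assert_throws_with_message(action, expected_substring):
    """Capture the exception and assert on its message explicitly; a bare
    'it threw' check would pass for ANY exception of the right type."""
    try:
        action()
    except Exception as e:
        assert expected_substring in str(e), f"unexpected message: {e}"
        return e
    raise AssertionError("expected an exception, but none was raised")

err = assert_throws_with_message(bgsave_while_save_in_progress,
                                 "checkpoint already in progress")
```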
Per the cleanup plan in plan.md, restructure the checkpoint code to remove
the accumulated complexity from the previous fix:

- Drop the alreadyPaused and contentLockAlreadyHeld parameters by adopting
  a single convention: the caller is responsible for synchronously
  acquiring databasesContentLock (read), multiDbCheckpointingLock (write,
  if multi-db), and pausing per-DB checkpoint locks before handing off
  to the shared async runner.

- Replace the dual-flag TakeDatabasesCheckpointAsync helper with two
  small, single-purpose runners:
    RunPausedCheckpointsAndReleaseLocksAsync(pausedDbIds, count,
      multiDbLockHeld, …) — used by the background-capable entry points
      (general BGSAVE, per-DB BGSAVE), runs all pre-paused per-DB
      checkpoints in parallel and releases the outer locks in finally.
    RunPausedCheckpointAsync(db, dbId, …) — single per-DB checkpoint +
      LASTSAVE update + per-DB lock resume; used by AOF-size-driven path.

- Per-DB BGSAVE (TakeCheckpointAsync(bool, int, …)) and
  TakeOnDemandCheckpointAsync now also take databasesContentLock as a
  reader. Without this, a concurrent swap-db can move the GarnetDatabase
  out from under an in-flight per-DB checkpoint: UpdateLastSaveData looks
  up databases.Map[dbId] at write time and would record LASTSAVE on the
  swapped wrapper, and a second BGSAVE for the same dbId would race
  against the in-flight checkpoint because the swapped slot has a fresh
  CheckpointingLock.

- TakeOnDemandCheckpointAsync also now guards ResumeCheckpoints behind
  the checkpointsPaused flag - otherwise it would unconditionally
  WriteUnlock a per-DB CheckpointingLock that TryPauseCheckpoints had
  refused to acquire (corrupting the lock in debug, spinning forever in
  release).

Net diff: -52 lines.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
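The runner convention above can be sketched with asyncio (illustrative names; each per-DB task resumes its own lock, and the shared runner releases the outer locks in finally, as this commit describes):

```python
import asyncio

async def checkpoint_and_resume_one(db_id, checkpoint_one, resume):
    # Per-DB runner: checkpoint, then resume that DB's pause lock.
    try:
        await checkpoint_one(db_id)
    finally:
        resume(db_id)

async def run_paused_checkpoints_and_release_locks(
        paused_db_ids, checkpoint_one, resume, release_outer_locks):
    # Shared runner: callers have already paused the per-DB locks and
    # taken the outer locks; run the checkpoints in parallel, release
    # the outer locks in finally.
    try:
        await asyncio.gather(*(checkpoint_and_resume_one(d, checkpoint_one, resume)
                               for d in paused_db_ids))
    finally:
        release_outer_locks()
```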
@badrishc badrishc force-pushed the badrishc/multidb-test-fix branch from 6402d96 to 81b3968 Compare May 8, 2026 01:10
@badrishc badrishc marked this pull request as ready for review May 8, 2026 19:46
badrishc and others added 2 commits May 8, 2026 13:48
Address PR 1767 review nit from @vazois: in this codebase the convention
is (CancellationToken token = default, ILogger logger = null) — token
first, logger last (matches CommitToAofAsync, WaitForCommitToAofAsync,
DoCompactionAsync, TaskCheckpointBasedOnAofSizeLimitAsync). The two
TakeCheckpointAsync overloads were the odd ones out.

Reordered:
- IDatabaseManager.TakeCheckpointAsync(bool, ...) and (bool, int, ...)
- DatabaseManagerBase abstract methods + SingleDatabaseManager and
  MultiDatabaseManager overrides
- StoreWrapper.TakeCheckpointAsync(bool, ...) and (bool, int, ...)
- New private helpers RunPausedCheckpointsAndReleaseLocksAsync and
  RunPausedCheckpointAsync
- AofProcessor positional callers

Also dropped the dbId = -1 default on StoreWrapper's per-DB dispatcher
overload to avoid resolution ambiguity now that both overloads' second
parameter is CancellationToken with a default.

Kept the 'Async' suffix per .NET convention (any Task-returning method,
even when the implementation doesn't use the async keyword).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Address @vazois follow-up: the prior commit's workaround (dropping the
'dbId = -1' default on the per-DB overload to disambiguate the two
overloads after the param reorder) was structurally fragile - any new
defaultable parameter could re-introduce ambiguity.

Replace the two overloads with a single method that accepts an optional
'int dbId = -1' meaning 'all active databases':

    Task<bool> TakeCheckpointAsync(bool background, int dbId = -1,
                                   CancellationToken token = default,
                                   ILogger logger = null);

Changes:
- IDatabaseManager: one method instead of two.
- DatabaseManagerBase: one abstract method.
- SingleDatabaseManager: one override; validates dbId is -1 or 0 (the
  only DB this manager knows about).
- MultiDatabaseManager: one override that branches internally on
  dbId == -1 (all active) vs dbId >= 0 (single DB). Both paths now share
  the same try/catch admission/rollback structure.
- StoreWrapper: one wrapper method (was two).

No call sites needed updating - the two existing positional callers
(AdminCommands, AofProcessor) all use named arguments or pass dbId
explicitly, and dbId == -1 (the new default) is what the prior
'no-dbId' overload meant.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
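The unified dispatch can be sketched as below (token/logger omitted for brevity; `checkpoint_all_active` and `checkpoint_single` are hypothetical placeholders for the two internal paths):

```python
import asyncio

async def checkpoint_all_active(background):
    return "all"              # placeholder for the multi-DB path

async def checkpoint_single(background, db_id):
    return ("single", db_id)  # placeholder for the per-DB path

async def take_checkpoint(background, db_id=-1):
    # db_id == -1 means "all active databases"; anything >= 0 targets one DB.
    if db_id == -1:
        return await checkpoint_all_active(background)
    if db_id < 0:
        raise ValueError(f"invalid dbId {db_id}")
    return await checkpoint_single(background, db_id)
```

A single optional parameter with a sentinel default avoids overload-resolution ambiguity entirely, at the cost of validating the sentinel at runtime.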
@badrishc badrishc merged commit 6bc8124 into dev May 8, 2026
22 of 23 checks passed
@badrishc badrishc deleted the badrishc/multidb-test-fix branch May 8, 2026 22:35
badrishc added a commit that referenced this pull request May 12, 2026
…nts complete

Fixes a residual race in PR #1767 that caused MultiDatabaseSaveInProgressTest to flake
in CI Release builds. The general BGSAVE path synchronously paused all per-DB
checkpoint locks before returning 'Background saving started', but the per-DB checkpoint
helper released each per-DB lock as soon as that single DB's checkpoint completed - not
when the entire general save finished. If DB 0's checkpoint completed before the test's
'BGSAVE 0' arrived over the wire, BGSAVE 0 would succeed instead of failing with
'ERR checkpoint already in progress'. Locally the test takes 6-7s and the race never
loses; in CI Release it ran in 1s and reliably failed.

See https://github.com/microsoft/garnet/actions/runs/25757540604/job/75650328662.

Fix (in RunPausedCheckpointsAndReleaseLocksAsync only):
- Replace the per-DB call to the self-resuming RunPausedCheckpointAsync with a local
  async function TakeOneCheckpointAsync that performs only (TakeCheckpointAsync +
  UpdateLastSaveData), without resuming the per-DB lock.
- Resume ALL pre-paused per-DB locks in the outer finally, after Task.WhenAll - so
  per-DB locks are held until ALL per-DB checkpoints complete, not just each individual
  one. A per-DB BGSAVE issued mid-flight reliably observes the in-progress checkpoint.
- Pre-fill checkpointTasks[] with Task.CompletedTask so the catch path can safely
  double-await even if the synchronous task-creation loop throws partway through. The
  double-await ensures we never resume a per-DB lock while its checkpoint is still
  running.
- Remove the handedOffCount partial-resume bookkeeping that's no longer needed.

RunPausedCheckpointAsync (used by TaskCheckpointBasedOnAofSizeLimitAsync) is unchanged
and remains self-contained (pause-resume covered by its own finally), so the AOF-size-
driven path needs no edits.

Net effect:
- General BGSAVE: per-DB locks held until ALL per-DB checkpoints complete, so any
  per-DB BGSAVE issued mid-flight reliably fails with 'checkpoint already in progress'.
- Per-DB BGSAVE alone (single-DB path through RunPausedCheckpointsAndReleaseLocksAsync
  with pausedCount=1): unchanged - that single per-DB lock is still released exactly
  when that single checkpoint completes.
- AOF-size-driven checkpoint: untouched.
- Other legal scenarios (per-DB then per-DB on different DB, per-DB then general,
  general blocks general) preserved.

Verification: 15/15 runs in Release config, full MultiDatabaseTests suite (31/31)
passes locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
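The revised lifecycle in this commit can be sketched in Python with asyncio (illustrative names; `checkpoint_one`, `resume`, and `release_outer_locks` are hypothetical stand-ins):

```python
import asyncio

async def run_paused_checkpoints_and_release_locks(
        paused_db_ids, checkpoint_one, resume, release_outer_locks):
    loop = asyncio.get_running_loop()

    def completed():
        f = loop.create_future()
        f.set_result(None)
        return f

    # Pre-fill with completed futures so the except path can safely await
    # every slot even if task creation throws partway through.
    checkpoint_tasks = [completed() for _ in paused_db_ids]
    try:
        try:
            for i, d in enumerate(paused_db_ids):
                checkpoint_tasks[i] = asyncio.ensure_future(checkpoint_one(d))
            await asyncio.gather(*checkpoint_tasks)
        except Exception:
            # Double-await: never resume a per-DB lock while its
            # checkpoint might still be running.
            await asyncio.gather(*checkpoint_tasks, return_exceptions=True)
            raise
    finally:
        # Resume ALL pre-paused locks only after every checkpoint is done.
        for d in paused_db_ids:
            resume(d)
        release_outer_locks()
```

Holding every per-DB lock until the gather completes is what makes a mid-flight per-DB BGSAVE observe the in-progress save deterministically.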
badrishc added a commit that referenced this pull request May 13, 2026
…nts complete

Fixes a residual race in PR #1767 that caused MultiDatabaseSaveInProgressTest to flake
in CI Release builds. The general BGSAVE path synchronously paused all per-DB
checkpoint locks before returning 'Background saving started', but the per-DB checkpoint
helper released each per-DB lock as soon as that single DB's checkpoint completed - not
when the entire general save finished. If DB 0's checkpoint completed before the test's
'BGSAVE 0' arrived over the wire, BGSAVE 0 would succeed instead of failing with
'ERR checkpoint already in progress'. Locally the test takes 6-7s and the race never
loses; in CI Release it ran in 1s and reliably failed.

See https://github.com/microsoft/garnet/actions/runs/25757540604/job/75650328662.

Fix:
- RunPausedCheckpointsAndReleaseLocksAsync (used by both general and per-DB BGSAVE)
  resumes ALL pre-paused DBs in its outer finally, after Task.WhenAll. So per-DB locks
  are held until ALL per-DB checkpoints complete, not just each individual one. A
  per-DB BGSAVE issued mid-flight reliably observes the in-progress checkpoint.
- The per-DB checkpoint inner work is now a local async function TakeOneCheckpointAsync
  that performs only (TakeCheckpointAsync + UpdateLastSaveData) without resuming.
- Pre-fill checkpointTasks[] with Task.CompletedTask so the catch path can safely
  double-await even if the synchronous task-creation loop throws partway through. The
  double-await ensures we never resume a per-DB lock while its checkpoint is still
  running.
- Remove the handedOffCount partial-resume bookkeeping that's no longer needed.
- The previously-shared RunPausedCheckpointAsync helper is removed - its only other
  caller (TaskCheckpointBasedOnAofSizeLimitAsync) now inlines the same try/checkpoint/
  update/finally/resume sequence so its single-DB pause-resume lifecycle is visible
  in one place.

Net effect:
- General BGSAVE: per-DB locks held until ALL per-DB checkpoints complete, so any
  per-DB BGSAVE issued mid-flight reliably fails with 'checkpoint already in progress'.
- Per-DB BGSAVE alone (single-DB path through RunPausedCheckpointsAndReleaseLocksAsync
  with pausedCount=1): unchanged - that single per-DB lock is still released exactly
  when that single checkpoint completes.
- AOF-size-driven checkpoint: behaviorally unchanged (lock cleanup inlined).
- Other legal scenarios (per-DB then per-DB on different DB, per-DB then general,
  general blocks general) preserved.

Verification: 10/10 runs in Release config of MultiDatabaseSaveInProgressTest +
MultiDatabaseGeneralSaveBlocksGeneralSaveTest, full MultiDatabaseTests suite (31/31)
passes locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>