Acmshanyi/cpr snapshot#19
Merged
Merged
Conversation
added 2 commits
April 20, 2026 22:41
badrishc
added a commit
to microsoft/garnet
that referenced
this pull request
May 12, 2026
Two deadlocks were identified in the prior X-lock-based OnFlush design: 1. ReadRangeIndex + CopyReadsToTail self-deadlock via PreStage. 2. RestoreTree's RIRESTORE RMW could trigger synchronous OnFlush on the same thread that holds the per-key X-lock -> shard collision -> hang. Root cause: `OnFlush` may fire as a deferred epoch action on a thread holding any RI lock. The per-key X-lock walks all shards via `CalculateIndex` so it collides with any S-lock the same thread already holds. The deferred firing is unavoidable at the Tsavorite layer. Fix: - Upgrade to bf-tree 0.5 (PR microsoft/bf-tree#19, CPR snapshot). CPR is concurrent-safe with workers, eliminating the need for per-key X-lock during OnFlush of a live tree. Memory-backed trees are also CPR-snapshot capable in 0.5 (unified API with disk-backed). - OnFlush live case: per-tree atomic (TreeEntry.SnapshotInProgress) serializes against concurrent SnapshotAllTreesForCheckpoint. No per-key lock taken. - OnFlush cold case (stub.TreeHandle == 0, just-CAS'd at tail before RestoreTree activated): per-key SHARED RI lock blocks RestoreTree's X-lock from registering a new tree mid-copy. S-vs-S compatible across threads -> no self-deadlock. If a live tree exists for the key under another stub, snapshot via that handle (CPR); else File.Copy(data.bftree). - SnapshotAllTreesForCheckpoint: same per-tree atomic, no per-key X-lock. - RestoreTree: split lock from RMW. X-lock held only for file-recovery + RegisterIndex; released BEFORE issuing RIRESTORE RMW. RMW on tombstoned key returns NOTFOUND (NeedInitialUpdate=false). - DisposeTreeUnderLock + RegisterIndex loser-disposal: defer bfTree.Dispose() and file deletion via storeEpoch.BumpCurrentEpoch(...) so concurrent readers using TreeHandle complete before native free. - Startup guard: EnableRangeIndexPreview=true is incompatible with CopyReadsToTail=true (would deadlock per-key shared+exclusive on the same thread); GarnetServer throws GarnetException at construction time. - File scheme: <hash>.scratch.cpr is bftree's CPR snapshot scratch path (configured at construction, written by every cpr_snapshot). OnFlush and SnapshotAllTreesForCheckpoint File.Copy scratch -> final destination (Copy not Move because bftree caches an FD on the configured path). Native layer: - Cargo.toml: bumped bf-tree 0.4 -> 0.5. - src/lib.rs: replaced bftree_snapshot/_memory + bftree_new_from_snapshot with bftree_cpr_snapshot, bftree_new_from_cpr_snapshot, bftree_are_all_threads_in_next_version. bftree_create accepts snapshot_file_path (configures use_snapshot=true). - BfTreeService.cs / NativeBfTreeMethods.cs: matching managed bindings. RecoverFromCprSnapshot is the unified disk+memory recovery API. - runtimes/linux-x64/native/libbftree_garnet.so refreshed. Tests: - All 61 existing RespRangeIndexTests pass. - Added RICopyReadsToTailIncompatibleStartupGuardTest for the new guard. - TestUtils.CreateGarnetServer accepts copyReadsToTail parameter. Docs: - website/docs/dev/range-index-resp-api.md: updated lifecycle table, file layout, lazy-restore flow, checkpoint flow, plus new Deadlock Safety section explaining the three invariants. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc
added a commit
to microsoft/garnet
that referenced
this pull request
May 13, 2026
Two deadlocks were identified in the prior X-lock-based OnFlush design: 1. ReadRangeIndex + CopyReadsToTail self-deadlock via PreStage. 2. RestoreTree's RIRESTORE RMW could trigger synchronous OnFlush on the same thread that holds the per-key X-lock -> shard collision -> hang. Root cause: `OnFlush` may fire as a deferred epoch action on a thread holding any RI lock. The per-key X-lock walks all shards via `CalculateIndex` so it collides with any S-lock the same thread already holds. The deferred firing is unavoidable at the Tsavorite layer. Fix: - Upgrade to bf-tree 0.5 (PR microsoft/bf-tree#19, CPR snapshot). CPR is concurrent-safe with workers, eliminating the need for per-key X-lock during OnFlush of a live tree. Memory-backed trees are also CPR-snapshot capable in 0.5 (unified API with disk-backed). - OnFlush live case: per-tree atomic (TreeEntry.SnapshotInProgress) serializes against concurrent SnapshotAllTreesForCheckpoint. No per-key lock taken. - OnFlush cold case (stub.TreeHandle == 0, just-CAS'd at tail before RestoreTree activated): per-key SHARED RI lock blocks RestoreTree's X-lock from registering a new tree mid-copy. S-vs-S compatible across threads -> no self-deadlock. If a live tree exists for the key under another stub, snapshot via that handle (CPR); else File.Copy(data.bftree). - SnapshotAllTreesForCheckpoint: same per-tree atomic, no per-key X-lock. - RestoreTree: split lock from RMW. X-lock held only for file-recovery + RegisterIndex; released BEFORE issuing RIRESTORE RMW. RMW on tombstoned key returns NOTFOUND (NeedInitialUpdate=false). - DisposeTreeUnderLock + RegisterIndex loser-disposal: defer bfTree.Dispose() and file deletion via storeEpoch.BumpCurrentEpoch(...) so concurrent readers using TreeHandle complete before native free. - Startup guard: EnableRangeIndexPreview=true is incompatible with CopyReadsToTail=true (would deadlock per-key shared+exclusive on the same thread); GarnetServer throws GarnetException at construction time. - File scheme: <hash>.scratch.cpr is bftree's CPR snapshot scratch path (configured at construction, written by every cpr_snapshot). OnFlush and SnapshotAllTreesForCheckpoint File.Copy scratch -> final destination (Copy not Move because bftree caches an FD on the configured path). Native layer: - Cargo.toml: bumped bf-tree 0.4 -> 0.5. - src/lib.rs: replaced bftree_snapshot/_memory + bftree_new_from_snapshot with bftree_cpr_snapshot, bftree_new_from_cpr_snapshot, bftree_are_all_threads_in_next_version. bftree_create accepts snapshot_file_path (configures use_snapshot=true). - BfTreeService.cs / NativeBfTreeMethods.cs: matching managed bindings. RecoverFromCprSnapshot is the unified disk+memory recovery API. - runtimes/linux-x64/native/libbftree_garnet.so refreshed. Tests: - All 61 existing RespRangeIndexTests pass. - Added RICopyReadsToTailIncompatibleStartupGuardTest for the new guard. - TestUtils.CreateGarnetServer accepts copyReadsToTail parameter. Docs: - website/docs/dev/range-index-resp-api.md: updated lifecycle table, file layout, lazy-restore flow, checkpoint flow, plus new Deadlock Safety section explaining the three invariants. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc
added a commit
to microsoft/garnet
that referenced
this pull request
May 13, 2026
Two deadlocks were identified in the prior X-lock-based OnFlush design: 1. ReadRangeIndex + CopyReadsToTail self-deadlock via PreStage. 2. RestoreTree's RIRESTORE RMW could trigger synchronous OnFlush on the same thread that holds the per-key X-lock -> shard collision -> hang. Root cause: `OnFlush` may fire as a deferred epoch action on a thread holding any RI lock. The per-key X-lock walks all shards via `CalculateIndex` so it collides with any S-lock the same thread already holds. The deferred firing is unavoidable at the Tsavorite layer. Fix: - Upgrade to bf-tree 0.5 (PR microsoft/bf-tree#19, CPR snapshot). CPR is concurrent-safe with workers, eliminating the need for per-key X-lock during OnFlush of a live tree. Memory-backed trees are also CPR-snapshot capable in 0.5 (unified API with disk-backed). - OnFlush live case: per-tree atomic (TreeEntry.SnapshotInProgress) serializes against concurrent SnapshotAllTreesForCheckpoint. No per-key lock taken. - OnFlush cold case (stub.TreeHandle == 0, just-CAS'd at tail before RestoreTree activated): per-key SHARED RI lock blocks RestoreTree's X-lock from registering a new tree mid-copy. S-vs-S compatible across threads -> no self-deadlock. If a live tree exists for the key under another stub, snapshot via that handle (CPR); else File.Copy(data.bftree). - SnapshotAllTreesForCheckpoint: same per-tree atomic, no per-key X-lock. - RestoreTree: split lock from RMW. X-lock held only for file-recovery + RegisterIndex; released BEFORE issuing RIRESTORE RMW. RMW on tombstoned key returns NOTFOUND (NeedInitialUpdate=false). - DisposeTreeUnderLock + RegisterIndex loser-disposal: defer bfTree.Dispose() and file deletion via storeEpoch.BumpCurrentEpoch(...) so concurrent readers using TreeHandle complete before native free. - Startup guard: EnableRangeIndexPreview=true is incompatible with CopyReadsToTail=true (would deadlock per-key shared+exclusive on the same thread); GarnetServer throws GarnetException at construction time. - File scheme: <hash>.scratch.cpr is bftree's CPR snapshot scratch path (configured at construction, written by every cpr_snapshot). OnFlush and SnapshotAllTreesForCheckpoint File.Copy scratch -> final destination (Copy not Move because bftree caches an FD on the configured path). Native layer: - Cargo.toml: bumped bf-tree 0.4 -> 0.5. - src/lib.rs: replaced bftree_snapshot/_memory + bftree_new_from_snapshot with bftree_cpr_snapshot, bftree_new_from_cpr_snapshot, bftree_are_all_threads_in_next_version. bftree_create accepts snapshot_file_path (configures use_snapshot=true). - BfTreeService.cs / NativeBfTreeMethods.cs: matching managed bindings. RecoverFromCprSnapshot is the unified disk+memory recovery API. - runtimes/linux-x64/native/libbftree_garnet.so refreshed. Tests: - All 61 existing RespRangeIndexTests pass. - Added RICopyReadsToTailIncompatibleStartupGuardTest for the new guard. - TestUtils.CreateGarnetServer accepts copyReadsToTail parameter. Docs: - website/docs/dev/range-index-resp-api.md: updated lifecycle table, file layout, lazy-restore flow, checkpoint flow, plus new Deadlock Safety section explaining the three invariants. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc
added a commit
to microsoft/garnet
that referenced
this pull request
May 14, 2026
Two deadlocks were identified in the prior X-lock-based OnFlush design: 1. ReadRangeIndex + CopyReadsToTail self-deadlock via PreStage. 2. RestoreTree's RIRESTORE RMW could trigger synchronous OnFlush on the same thread that holds the per-key X-lock -> shard collision -> hang. Root cause: `OnFlush` may fire as a deferred epoch action on a thread holding any RI lock. The per-key X-lock walks all shards via `CalculateIndex` so it collides with any S-lock the same thread already holds. The deferred firing is unavoidable at the Tsavorite layer. Fix: - Upgrade to bf-tree 0.5 (PR microsoft/bf-tree#19, CPR snapshot). CPR is concurrent-safe with workers, eliminating the need for per-key X-lock during OnFlush of a live tree. Memory-backed trees are also CPR-snapshot capable in 0.5 (unified API with disk-backed). - OnFlush live case: per-tree atomic (TreeEntry.SnapshotInProgress) serializes against concurrent SnapshotAllTreesForCheckpoint. No per-key lock taken. - OnFlush cold case (stub.TreeHandle == 0, just-CAS'd at tail before RestoreTree activated): per-key SHARED RI lock blocks RestoreTree's X-lock from registering a new tree mid-copy. S-vs-S compatible across threads -> no self-deadlock. If a live tree exists for the key under another stub, snapshot via that handle (CPR); else File.Copy(data.bftree). - SnapshotAllTreesForCheckpoint: same per-tree atomic, no per-key X-lock. - RestoreTree: split lock from RMW. X-lock held only for file-recovery + RegisterIndex; released BEFORE issuing RIRESTORE RMW. RMW on tombstoned key returns NOTFOUND (NeedInitialUpdate=false). - DisposeTreeUnderLock + RegisterIndex loser-disposal: defer bfTree.Dispose() and file deletion via storeEpoch.BumpCurrentEpoch(...) so concurrent readers using TreeHandle complete before native free. - Startup guard: EnableRangeIndexPreview=true is incompatible with CopyReadsToTail=true (would deadlock per-key shared+exclusive on the same thread); GarnetServer throws GarnetException at construction time. - File scheme: <hash>.scratch.cpr is bftree's CPR snapshot scratch path (configured at construction, written by every cpr_snapshot). OnFlush and SnapshotAllTreesForCheckpoint File.Copy scratch -> final destination (Copy not Move because bftree caches an FD on the configured path). Native layer: - Cargo.toml: bumped bf-tree 0.4 -> 0.5. - src/lib.rs: replaced bftree_snapshot/_memory + bftree_new_from_snapshot with bftree_cpr_snapshot, bftree_new_from_cpr_snapshot, bftree_are_all_threads_in_next_version. bftree_create accepts snapshot_file_path (configures use_snapshot=true). - BfTreeService.cs / NativeBfTreeMethods.cs: matching managed bindings. RecoverFromCprSnapshot is the unified disk+memory recovery API. - runtimes/linux-x64/native/libbftree_garnet.so refreshed. Tests: - All 61 existing RespRangeIndexTests pass. - Added RICopyReadsToTailIncompatibleStartupGuardTest for the new guard. - TestUtils.CreateGarnetServer accepts copyReadsToTail parameter. Docs: - website/docs/dev/range-index-resp-api.md: updated lifecycle table, file layout, lazy-restore flow, checkpoint flow, plus new Deadlock Safety section explaining the three invariants. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
badrishc
added a commit
to microsoft/garnet
that referenced
this pull request
May 14, 2026
…ment (#1780) * Tsavorite: add PostCopyToTail, OnTruncate triggers; OnFlush(addr); RMWInfo.SourceAddress Foundational Tsavorite-side changes for BfTree (RangeIndex) compaction lifecycle integration: - IRecordTriggers.OnFlush gains a logicalAddress parameter so applications can name per-flush snapshot files immutably. - New IRecordTriggers.PostCopyToTail(in src, srcAddr, ref dst, dstAddr) trigger fires from TryCopyToTail after CAS-success but BEFORE UnsealAndValidate, so dst is sealed during the callback (concurrent readers see SkipOnScan and retry). Re-introduces the post-CAS hook that PR 5168a2fab removed (PostSingleWriter), needed by RangeIndex compaction path. - New IRecordTriggers.OnTruncate(newBA) trigger fires from AllocatorBase.TruncateUntilAddressBlocking AFTER device truncation completes; also from the recovery-time trim path. Lets applications clean up external state (e.g. per-flush snapshot files) tied to log addresses below the new BeginAddress. - RMWInfo.SourceAddress added so RIPROMOTE PostCopyUpdater can plumb the source address (Address gets reassigned to the destination address by the time CopyUpdater runs). - PendingContext.originalAddress is now also used by TryCopyToTail's PostCopyToTail fire path to surface the source logical address. Set at three call sites: CompactionConditionalCopyToTail (currentAddress), InternalRead's CopyFromImmutable (recSrc.LogicalAddress), and ContinuePending (request.logicalAddress). - SpanByteScanIterator and ObjectScanIterator now invoke OnDiskRead in the disk-frame branch so disk-source scan records arrive at PostCopyToTail with invalidated TreeHandle bytes. Storage tier baseline preserved: BasicLockTests, SpanByteLogCompactionTests, and BasicStorageTests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: refactor for compaction lifecycle (Guid keying, two roots, address-tagged flush, PostCTT) Refactors RangeIndexManager and GarnetRecordTriggers to fix three defects in BfTree lifecycle integration: DEFECTS FIXED - Recovery sees post-checkpoint BfTree state (silent corruption): a single deterministic flush.bftree per key was overwritten on every flush, so a recover-to-prior-checkpoint read the wrong state. Replaced with immutable, address-tagged <hash>.<addr:x16>.flush.bftree files. - Compaction's TryCopyToTail leaves dst with a dangling TreeHandle: the previous PostSingleWriter hook was removed in 5168a2fa. Re-introduce via the new IRecordTriggers.PostCopyToTail trigger; for live-source the trigger transfers ownership (clears src.TreeHandle), for disk-source it pre-stages data.bftree from the per-flush snapshot and registers a pending entry. - Shift/ShiftForced (and ad-hoc Log.Truncate) leak orphan flush files: handled by the new OnTruncate trigger, which deletes per-flush snapshots whose addr < newBA and cleans orphaned .tmp files. LATENT BUG ALSO FIXED - RIPROMOTE'd dst with TreeHandle=0 captured by checkpoint had no snapshot file → recovery to that checkpoint failed. RIPROMOTE PostCopyUpdater now branches on src.TreeHandle: !=0 live transfer (existing behavior); ==0 cold case pre-stages data.bftree from <srcAddr>.flush.bftree and registers a pending entry so the checkpoint snapshot captures it correctly. KEY DESIGN CHANGES - liveIndexes re-keyed by Guid (XxHash128 of the key bytes, same scheme as the on-disk filename prefix). Hot-path access only for checkpoint coordination (WaitForTreeCheckpoint), gated by checkpointInProgress short-circuit. - TreeEntry.Tree is nullable: null = pending (data.bftree pre-staged but no native tree opened yet). Pending entries are upgraded to activated on first RestoreTree call. - TWO roots, mirroring Tsavorite's hlog/cpr-checkpoints separation: riLogRoot = {LogDir ?? CheckpointDir ?? cwd}/Store/rangeindex/ (working+flush) cprDir/<token>/rangeindex/ (per-checkpoint) Per-checkpoint snapshots are deleted automatically when Tsavorite removes the parent token directory; PurgeOldCheckpointSnapshots removed. - Flat layout (no per-key subdirectories): single Directory.CreateDirectory at startup, single EnumerateFiles for OnTruncate, no orphan-dir cleanup needed. - RestoreTree (renamed from RestoreTreeFromFlush) is now trivial: opens data.bftree directly. Pre-staging always happened earlier (PostCopyToTail-cold, RIPROMOTE-cold, OnRecoverySnapshotRead). - OnRecoverySnapshotRead pre-stages data.bftree from cpr-checkpoints/<recoveredToken>/ rangeindex/<hash>.bftree DURING recovery (snapshot files may be deleted post-recovery). - Storage tier optional: works without LogDir (falls back to CheckpointDir → cwd). API CHANGES (Tsavorite) - IRecordTriggers.OnFlush now takes (ref LogRecord, long logicalAddress). - New IRecordTriggers.PostCopyToTail<TSource>(in src, srcAddr, ref dst, dstAddr). - New IRecordTriggers.OnTruncate(long newBeginAddress). - RMWInfo.SourceAddress added; preserved through CopyUpdater so PostCopyUpdater can see the source address (rmwInfo.Address gets reassigned to dst by then). - PendingContext.originalAddress is now also used by TryCopyToTail's PostCopyToTail fire path; set at three call sites (compaction, CopyReadsToTail, ContinuePending). - SpanByteScanIterator and ObjectScanIterator now invoke OnDiskRead in the disk-frame branch. API CHANGES (Garnet) - BfTreeService.SnapshotToFileByPtr(handle, dataPath, targetPath) — static helper for OnFlush which has only the stub's TreeHandle, not the managed wrapper. - RangeIndexManager.IsTreeLive removed; metrics use stub.TreeHandle != 0 directly. - RangeIndexManager constructor signature: (enabled, riLogRoot, cprDir, removeOutdated, logger). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Tests: trigger ext tests; update RecordLifecycleTests + RangeIndex disk-cleanup tests for new layout Tsavorite: new RecordTriggersExtTests (9 tests) covering OnFlush(addr), OnTruncate, PostCopyToTail, plus default-trigger flag verification. Updates RecordLifecycleTests' LifecycleRecordTriggers.OnFlush to the new (ref LogRecord, long) signature so the existing CallOnFlushGatesOnFlushInvocation test fires correctly. Garnet: updates RIDiskFileCleanupOnDeleteTest and RIDiskFileCleanupOnDeleteAfterEvictionAndRestoreTest to expect the new flat file layout under {LogDir}/Store/rangeindex/<hash>.* instead of the old {CheckpointDir}/checkpoints/rangeindex/<hash>/ per-key directories. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Tests: add Defect 1 reproduction (flush files immutable) and OnTruncate orphan-tmp cleanup RIFlushFilesAreImmutablePerAddressTest: directly verifies the on-disk invariant that fixes Defect 1. Inserts v1, force-flushes (creates <hash>.<A1>.flush.bftree), captures file content, mutates to v2, force-flushes again. Asserts the v1 file STILL exists with byte-identical content (proving immutability) AND that a NEW v2 file was created at a distinct address (proving second flush did not overwrite). Pre-fix would fail: single deterministic flush.bftree was overwritten on every flush. RIOnTruncateRemovesOrphanTmpFilesTest: drops a bogus .data.bftree.tmp file in the riLogRoot, triggers ShiftBeginAddress(_, truncateLog: true), and asserts the orphan tmp file is deleted. Verifies the new IRecordTriggers.OnTruncate trigger fires post- device-truncate and that GarnetRecordTriggers.OnTruncate cleans .tmp files. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * doc: update range-index-resp-api.md for compaction lifecycle integration Updates the IRecordTriggers Lifecycle table for the new triggers (PostCopyToTail, OnTruncate) and the OnFlush(addr) signature change. Adds new sections: - File Layout (two roots): riLogRoot for log-tied files, cprDir/<token>/rangeindex/ for per-checkpoint snapshots; both flat layouts (no per-key subdirs). - liveIndexes — Guid-keyed single dictionary; activated vs pending entries. - liveIndexes hot-path discipline (only WaitForTreeCheckpoint, gated by short-circuit). - Updated Lazy Promote to describe the new RIPROMOTE PostCopyUpdater cold branch. - Updated Lazy Restore to describe the trivial RestoreTree (data.bftree direct open). - Updated Checkpoint Consistency: SnapshotPending=1 on activated AND pending entries; pending → File.Copy(data.bftree → snapshot path). - New Recovery Flow section: OnRecoverySnapshotRead pre-stages above-FUA stubs DURING recovery; below-FUA stubs handled lazily via RIPROMOTE-cold + per-flush snapshots. - New Compaction Lifecycle section: PostCopyToTail live transfer vs cold pre-stage. - New Scratchpad: Key Plumbing for PostCopyToTail (originalAddress + RMWInfo.SourceAddress). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: drop StorageBackend==Disk guard from DisposeTreeUnderLock cleanup The previous guard 'if (deleteFiles && stub.StorageBackend == Disk) DeleteTreeFiles(key)' was a fragile defense that: (a) Fails defense in depth: DeleteTreeFiles is already idempotent (enumerate-and-delete on a hash prefix; no-op if no files match), so the backend check adds no protection while creating a path-dependent leak. (b) Forecloses future memory-snapshot support: the native bftree library exposes bftree_snapshot_memory / bftree_recover_memory (currently NotSupported but planned). When that ships, a memory-backed tree could legitimately have on-disk artifacts via OnFlush / PreStageAndRegisterPending. Without removing this guard, DEL on such a key would leak files. (c) Fails to clean up stale prior-incarnation files: any <hash>.* file existing on disk when the key is re-created (or when the backend value is mutated through a future code path) would be permanently unreclaimable on DEL. Drop the guard. DeleteTreeFiles always cleans up any <hash>.* files in riLogRoot when deleteFiles=true, regardless of current StorageBackend. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: restore xmldoc + inline comments on ComputeLeafPageSize/RoundUpToPowerOf2 These were inadvertently dropped during the larger refactor of RangeIndexManager.cs. Restoring the original docs: - ComputeLeafPageSize: full <param>/<returns> describing the size buckets. - ComputeLeafPageSize body: '2.5x, capped at 32KB' and 'Round up to next power of 2' inline comments at the bucket-computation steps. - RoundUpToPowerOf2: <summary> describing the bit-manipulation algorithm. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: simplify pre-stage paths to direct File.Copy (drop .tmp + File.Move pattern) Both PreStageAndRegisterPending and RebuildFromSnapshotIfPending used a .tmp + File.Move atomic-rename pattern that on closer inspection adds no safety: PreStageAndRegisterPending callers (PostCopyToTail, RIPROMOTE PostCopyUpdater) run BEFORE UnsealAndValidate on the destination record, so concurrent readers see SkipOnScan and retry — no read race possible. Tsavorite's hash-bucket CAS serializes concurrent writers per key (only one CopyUpdater/TryCopyToTail succeeds per CAS event), so two PreStage calls cannot race on the same data.bftree file. OnEvict closes the prior native BfTree before any cold-restore can happen, so no live native fd is open on data.bftree when this runs. RebuildFromSnapshotIfPending runs DURING recovery, which is single-threaded with no concurrent readers. A crash mid-copy is self-healing: the next recovery attempt re-fires OnRecoverySnapshotRead for the same stub and overwrites any partial file before any RestoreTree can observe it. Both functions now use direct File.Copy(src, dst, overwrite: true). Drop the LogTmpPath helper (no longer used); keep the .tmp suffix cleanup in OnTruncateImpl as defense in depth for any legacy or future atomic-rename code paths. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: fix race in PreStageAndRegisterPending — acquire per-key exclusive lock Critical concurrency fix exposed by the question 'is RIPROMOTE running under exclusive lock?'. The answer is no: • ReadRangeIndex RELEASES the shared lock before invoking PromoteToTail. • There is no TryPromoteSharedLock upgrade anywhere in RangeIndex code. • RIPROMOTE's RMW (CopyUpdater + PostCopyUpdater) runs with no rangeIndex lock held. Combined with: CASRecordIntoChain in Helpers.cs:177 unseals dst IMMEDIATELY on CAS success, so by the time PostCopyUpdater fires, concurrent readers can already observe dst with TreeHandle == 0 and invoke RestoreTree, which opens data.bftree directly. Race scenario (cold case): 1. Thread A: PostCopyUpdater starts PreStageAndRegisterPending — File.Copy partway done. 2. Thread B: ReadRangeIndex sees the unsealed dst with TreeHandle=0, releases shared lock, acquires exclusive lock for RestoreTree. 3. Thread B: RestoreTree opens data.bftree → reads partial file → corruption. Atomic-rename .tmp + File.Move doesn't fix this either — Thread B might see 'file doesn't exist' before the rename completes and return spurious NOTFOUND. Fix: PreStageAndRegisterPending now acquires the per-key EXCLUSIVE rangeIndex lock for the duration of File.Copy AND the pending-entry registration. This blocks RestoreTree's exclusive acquire until pre-stage completes; Thread B then sees the fully-written data.bftree. A direct File.Copy is sufficient under this lock — the exclusive lock serializes against any reader that would open data.bftree, against other concurrent PreStageAndRegisterPending calls for the same key, and crash-mid-copy is self-healing via OnRecoverySnapshotRead or RIPROMOTE-cold re-pre-stage. RebuildFromSnapshotIfPending continues to use direct File.Copy without a lock — recovery is single-threaded with no concurrent readers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: address GPT-5.5 PR review (3 issues; FlagTransferred, OnFlush invariant, OnTruncate) ISSUE 1 (BLOCKER): Stale source eviction removed the live entry. After PostCopyToTail-live or RIPROMOTE-live, the source record's TreeHandle was cleared but the record otherwise stayed in the chain. When the source page later evicted, OnEvict→DisposeTreeUnderLock saw TreeHandle==0 and hit the early branch, removing the liveIndexes entry that NOW belonged to the destination — losing checkpoint coverage and DEL-time native-tree disposal. Fix: Add RangeIndexStub.FlagTransferred = 4 (uses one of the reserved Flag bits; no stub-layout change). Both transfer paths now set this flag on src in BOTH live and cold branches: • GarnetRecordTriggers.PostCopyToTail (live + cold). • RMWMethods.PostCopyUpdater (RIPROMOTE live + cold). DisposeTreeUnderLock checks FlagTransferred FIRST — when set, no-op (the entry belongs to a newer destination record). For DEL/UNLINK (deleteFiles=true) we still proceed with file cleanup. GarnetRecordTriggers.OnFlush also checks FlagTransferred — a transferred-out src must NOT snapshot a stale view (TreeHandle was cleared, and any snapshot from data.bftree would race with dst's active mutations). Regression test: RIDisposeTreeUnderLockNoOpsOnTransferredSourceTest. Verified the test fails when the FlagTransferred check is removed. ISSUE 2 (MAJOR): Pre-stage failures left dst in a broken state. PreStageAndRegisterPending logged a warning and returned silently on missing source flush file or copy failure. The caller (PostCopyToTail / PostCopyUpdater) cleared FlagFlushed unconditionally, leaving an above-FUA dst stub with TreeHandle=0, FlagFlushed=0, no data.bftree, and no pending entry — violating Invariants 4 and 5. Same problem in OnFlush: when TreeHandle=0 and data.bftree was missing, OnFlush silently skipped the snapshot but still set FlagFlushed → unrestorable state. Fix: PreStageAndRegisterPending logs as ERROR (not warning) on source-missing and explicitly does NOT fall back to any other flush file (per user direction: recovering the wrong tree version is worse than failing). The destination's RestoreTree will return NOTFOUND for the affected key, surfacing the data loss explicitly. OnFlush also logs as ERROR and DOES NOT set FlagFlushed when the source data.bftree is missing — the next access routes through RestoreTree which returns NOTFOUND for the key while leaving the record otherwise consistent. ISSUE 3 (MAJOR): Cluster mode disabled all RangeIndex truncation cleanup. OnTruncateImpl returned early when removeOutdatedCheckpoints=false (cluster mode), leaking obsolete <hash>.<addr>.flush.bftree files and stale .tmp files indefinitely. Per-flush files are LOG-tied (their lifetime tracks log addresses), not checkpoint-tied — they are safe to delete once Tsavorite's BeginAddress passes their address, regardless of cluster mode. removeOutdatedCheckpoints controls CHECKPOINT-token-tied snapshot retention only. cleanup is unconditional. All 61 RangeIndex tests + 35 Tsavorite tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: DEL preserves per-flush snapshot files (LOG-tied lifetime) Bug: DeleteTreeFiles indiscriminately deleted ALL <hash>.* files in riLogRoot, including <hash>.<addr:x16>.flush.bftree files that may still be required to recover an OLDER checkpoint taken BEFORE the DEL operation. Pre-fix data-loss flow: T=0: RI.CREATE key K, write, flush → <hash>.<A1>.flush.bftree T=1: SAVE (checkpoint C1) T=2: more writes, more flushes → <hash>.<A2>.flush.bftree T=3: DEL K → ALL <hash>.* files deleted, including <A1>.flush.bftree T=4: crash; recover to C1 → recovered log has stub at A1 with FlagFlushed=1 T=5: RI.GET → RIPROMOTE-cold → looks for <hash>.<A1>.flush.bftree → MISSING → silent data loss for the recovered key. Fix: DeleteTreeFiles now only deletes the WORKING files: • <hash>.data.bftree • <hash>.data.bftree.tmp (defensive, in case of any legacy atomic-rename leftover) Per-flush snapshot files (<hash>.<addr:x16>.flush.bftree) are LOG-tied: they may be needed to recover an older checkpoint, and are reclaimed by OnTruncate only when Tsavorite's BeginAddress passes their address — at which point no checkpoint can recover the pre-delete state anyway. Per-checkpoint snapshots under cpr-checkpoints/<token>/rangeindex/<hash>.bftree are already not touched here (different root); Tsavorite removes them with the parent token directory. Side effect: DeleteTreeFiles no longer iterates the directory — direct File.Delete on two known paths. The ONLY remaining directory iteration in production code is now OnTruncateImpl (which is unavoidable: it scans for files matching addr<newBA). Updated tests: • RIDiskFileCleanupOnDeleteTest now only asserts data.bftree is removed. • RIDiskFileCleanupOnDeleteAfterEvictionAndRestoreTest now asserts that the flush.bftree files PRESERVE their count after DEL (not zero). • New RIDeletePreservesPerFlushSnapshotFilesTest explicitly asserts the LOG-tied lifetime contract: DEL preserves <hash>.<addr>.flush.bftree files. All 62 RangeIndex tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: remove .tmp dead-code distraction No production path produces <hash>.data.bftree.tmp files. Every file-write path in RangeIndexManager uses direct File.Copy(overwrite: true): • PreStageAndRegisterPending (under per-key exclusive lock) • RebuildFromSnapshotIfPending (single-threaded recovery) • Per-checkpoint snapshot capture Verified across libs/ that no File.Move calls exist anywhere, and the Rust BfTree library does not produce .tmp files either. The .tmp handling that existed was speculative cleanup for 'legacy or future atomic-rename code paths' (per its own doc comment). There is no legacy deployment of this format (this PR introduces it), and there is no plan to migrate to a .tmp + Move pattern (current design relies instead on: - Crash-safety of partial data.bftree via re-pre-stage on next access / next OnRecoverySnapshotRead overwrite. - The per-key exclusive lock serializing readers vs the in-flight Copy.) Removed: • DeleteTreeFiles: deletes only <hash>.data.bftree (one File.Delete, no helper closure, no enumeration). Updated xmldoc. • OnTruncateImpl: dropped the .data.bftree.tmp branch (~6 lines + comment); only enumerates / matches <hash>.<addr:x16>.flush.bftree now. • RebuildFromSnapshotIfPending: dropped 'cf. .tmp + File.Move' aside from xmldoc; kept the crash-safety rationale. • GarnetRecordTriggers.OnDiskRead-cold branch: dropped '(atomic via .tmp + File.Move)' aside. • DisposeTreeUnderLock xmldoc: simplified deleteFiles description (no longer references .tmp). • Test RIOnTruncateRemovesOrphanTmpFilesTest: removed (purely synthetic; asserted behavior of code with no production caller). All 61 RangeIndex tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: consolidate stub flag accessors (RecordInfo style) Refactor RangeIndexStub flag handling to mirror Tsavorite's RecordInfo pattern: typed properties for each bit, private backing field, centralized reset, and explicit reservation of bits for future use. Changes ------- RangeIndexStub: • New top-of-struct comment block documenting the full bit layout: [Unused7][Unused6][Unused5][Pinned*][Modified*][Transferred][Recovered][Flushed] • Bit constants: kFlushedBitMask, kRecoveredBitMask, kTransferredBitMask (private const byte; mask-only form since byte is small enough that offset/mask split adds no value). • Bits 3 (Modified) and 4 (Pinned) RESERVED — masks deferred until those semantics actually exist (avoids inviting premature usage). Bits 5..7 free for future expansion → ≥5 bits headroom after Modified+Pinned land. • IsFlushed / IsRecovered / IsTransferred are now SETTABLE properties (was readonly-get only) — get returns (flags & mask) != 0; set uses the standard ternary pattern. Inlined by JIT, byte-identical codegen vs the prior 'stub.Flags |= mask;' calls. • 'public byte Flags' renamed to 'private byte flags' — all writes must now go through typed properties or ResetFlags(), centralizing invariant control as more bits gain semantics. • New ResetFlags() instance method (AggressiveInlining) for the two legitimate full-reset callsites. • Old 'internal const byte FlagFlushed/FlagRecovered/FlagTransferred' members removed. RangeIndexManager.Index.cs: • CreateIndex: 'stub.Flags = 0' → 'stub.ResetFlags()'. • RecreateIndex: '&= ~FlagRecovered' → 'stub.IsRecovered = false'. • Static span-based wrappers (SetFlushedFlag, ClearFlushedFlag, SetTransferredFlag, MarkRecoveredFromCheckpoint) reimplemented via the new properties. Helper signatures + AggressiveInlining unchanged so all callsites compile without modification and produce identical machine code. • Composite helpers (ClearTreeHandle, InvalidateStub, MarkRecoveredFromCheckpoint) kept as separate semantic entry points — their underlying writes may converge today but encode different invariants (ownership transfer vs disk-load staleness vs recovery marker). RangeIndexManager.cs: • HandleRangeIndexCreateReplay (AOF replay): 'stub.Flags = 0' → 'stub.ResetFlags()'. • Doc xref FlagFlushed → IsFlushed. RangeIndexOps.cs: • Object initializer 'Flags = 0' line removed (struct default-initializes the private flags byte to 0). RMWMethods.cs / RespRangeIndexTests.cs: • Comment / message references to FlagFlushed/FlagRecovered/FlagTransferred updated to IsFlushed/IsRecovered/IsTransferred. Wire format ----------- Unchanged. Byte 33 (the flags byte) keeps the same bit assignments. Existing checkpoint snapshots, AOF entries, and on-disk records remain readable. The test that constructs a raw 35-byte stub via 'staleStub[33] = 4' continues to exercise the on-the-wire format directly (now with corrected boolean wording in its assertion messages). Perf ---- Zero. Property setters/getters are inlined; literal true/false at every callsite folds the setter ternary at JIT time → same OR/AND/STORE as before. 'private' is purely a compile-time access modifier with no runtime cost. All 61 RangeIndex tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: comment cleanup + Tsavorite format fix Comments -------- Make all RangeIndex comments self-contained, present-tense, and accurate: • Replace stale FlagFlushed/FlagRecovered/FlagTransferred references in comments and doc strings with the current property names IsFlushed/ IsRecovered/IsTransferred (also normalize 'FlagX=1' to 'IsX=true'). • Test RIFlushFilesAreImmutablePerAddressTest: rewrite xmldoc as a behavioral spec for the per-flush snapshot file immutability invariant (drop 'Defect 1', 'Pre-fix behavior', 'Post-fix behavior'); rename test key 'defect1key' → 'flushtestkey'; remove 'Defect 1 invariant' label from assertion message. • Test RIDisposeTreeUnderLockNoOpsOnTransferredSourceTest: rewrite xmldoc and assertion messages as behavioral specs (drop 'GPT-5.5-flagged blocker', 'The bug: pre-fix...', 'The fix: ...'). The discriminating contrast (without IsTransferred → entry IS removed) remains as a present-tense spec check. • GarnetRecordTriggers.OnFlush cold branch: rephrase 'Invariant 1+5' / 'violates Invariant 5' as 'per-flush snapshot invariant has been violated' and inline what the consequence is. Format ------ Tsavorite test file RecordTriggersExtTests.cs had a trailing newline beyond the canonical single-newline EOF; removed. Verification ------------ • dotnet format Garnet.slnx --verify-no-changes → clean • dotnet format Tsavorite.slnx --verify-no-changes → clean • dotnet build test/Garnet.test → 0 warnings, 0 errors • RespRangeIndexTests (61 tests) → all pass Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ci: point setup-rust-toolchain at the bftree workspace Intermittent CI failure on this PR (and any PR touching native bindings): Error: cargo.exe failed with exit code 101 command: cargo metadata --all-features --format-version 1 --no-deps stderr: error: could not find Cargo.toml in D:\a\garnet\garnet or any parent directory Root cause ---------- actions-rust-lang/setup-rust-toolchain@v1 defaults to cache: true with cache-workspaces: '. -> target'. Internally it invokes Swatinem/rust-cache, which runs 'cargo metadata' from the repo root to build the cache key. The Garnet repo has no root Cargo.toml — the only Cargo workspace lives at libs/native/bftree-garnet — so the metadata call exits 101. The failure is intermittent because the cache action sometimes recovers (falls back to the lockfile-based key it already discovered) and sometimes surfaces the exit 101 as a step failure. Fix --- Pass cache-workspaces: libs/native/bftree-garnet to the action so rust-cache runs cargo metadata from the directory that actually contains Cargo.toml. The cache-key derivation continues to work and the spurious exit-101 error goes away. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: restore per-key exclusive lock around OnFlush snapshot Regression introduced in commit 30885d33b ('RangeIndex: refactor for compaction lifecycle'): the dev branch's OnFlush trigger delegated to RangeIndexManager. SnapshotTreeForFlush(), which acquired the per-key exclusive lock around the native bftree snapshot call. The refactor inlined the snapshot body directly into the OnFlush trigger as a raw BfTreeService.SnapshotToFileByPtr(...) call with NO lock — silently dropping the serialization guarantee. Why this matters ---------------- At the moment OnFlush fires on a record, that record: - Is the chain head (no RIPROMOTE has occurred yet — IsTransferred=false). - Has TreeHandle pointing at the live native tree. - Has IsFlushed=false (about to be set TO true by OnFlush at the end). Worker threads doing RI.SET / RI.DEL on this record use the same TreeHandle to call native InsertByPtr / DeleteByPtr while holding the per-key SHARED lock. Without the per-key exclusive lock in OnFlush, the native SnapshotToFileByPtr can run concurrently with InsertByPtr on the same tree. If the BfTree native lib does not fully serialize snapshot vs insert internally, this races and can hang or corrupt. This is the root cause of the rare RIConcurrentOpsWithCheckpointTest hang in CI. The test maximizes the race window by having 4 worker threads tight-looping RI.SET on a single key while a 5th thread takes a SAVE checkpoint — the IN_PROGRESS → WAIT_FLUSH transition in Tsavorite's checkpoint state machine triggers OnFlush per record, racing with workers that entered before SetCheckpointBarrier set checkpointInProgress=true. The IsTransferred early-return at the top of OnFlush is NOT enough on its own. It only short-circuits re-firings of OnFlush on a stale source AFTER the source has been transferred to a tail destination via PostCopyToTail or RIPROMOTE PostCopyUpdater. For the FIRST OnFlush on any record, IsTransferred=false because the transfer is triggered BY IsFlushed=true, which OnFlush is the one to set. Fix --- Centralize the snapshot-for-flush logic in RangeIndexManager.SnapshotTreeForFlush(key, valueSpan, logicalAddress) — mirroring the dev branch pattern. Acquire the per-key exclusive lock around the snapshot/copy body. The OnFlush trigger is now a thin delegation. This restores the serialization guarantee against: - Concurrent worker InsertByPtr/DeleteByPtr on the same TreeHandle. - Concurrent SnapshotAllTreesForCheckpoint on the same TreeHandle (which already holds the same per-key exclusive lock). - Concurrent PreStageAndRegisterPending data.bftree rewrite (relevant to the cold-path File.Copy(data.bftree -> flush.bftree)). The IsFlushed flag is now set INSIDE the lock body on success, and is explicitly NOT set on the cold-path-data.bftree-missing path — preserving the existing 'route through RestoreTree -> NOTFOUND' contract for unrestorable records. (This is a small improvement over the prior code, which set IsFlushed unconditionally after the native call returned.) Test safety ----------- Added [CancelAfter(120_000)] to RIConcurrentOpsWithCheckpointTest so any remaining or unrelated hang in this stress test surfaces as an NUnit timeout failure with a stack trace, in addition to CI's existing --blame-hang-timeout dump. Verification ------------ - dotnet format Garnet.slnx --verify-no-changes -> clean - dotnet build test/Garnet.test -> 0 warnings, 0 errors - 61 RespRangeIndexTests -> all pass Note: holding additional fixes pending the dev-vs-PR design audit currently in progress (will land as separate commits if/when surfacing other lost design aspects). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: address dev-vs-PR design audit (3 findings) Restore three protective measures that were silently weakened during the compaction-lifecycle refactor (commit 30885d33b). 1. SnapshotTreeForFlush: don't gate memory-backend re-anchor on riLogRoot --------------------------------------------------------------------------- The first line of SnapshotTreeForFlush early-returned on 'string.IsNullOrEmpty(riLogRoot)' BEFORE reaching the 'StorageBackend == Memory' branch that calls SetFlushedFlag. For a memory-backed RI tree in any configuration where riLogRoot is null/empty (unit tests instantiating RangeIndexManager(true) directly, or any future deployment without a log dir), the memory branch was unreachable. Without IsFlushed=true, ReadRangeIndex never triggers PromoteToTail (RIPROMOTE) — the stub stays frozen in the read-only region. On first page eviction, OnEvict -> DisposeTreeUnderLock disposes the live BfTree (no IsTransferred guard fires). All subsequent accesses route through RestoreTree -> no data.bftree (memory tree never had one) -> silent NOTFOUND on a key that was supposed to live in memory. In production today this is unreachable because GarnetServer.cs:321 always computes a non-empty riLogRoot. Defense-in-depth fix: reorder so the memory branch fires regardless of riLogRoot, and only gate disk-backed paths on the riLogRoot check. 2. DisposeTreeUnderLock: take per-key lock around cold-eviction TryRemove ------------------------------------------------------------------------- The cold-eviction branch (TreeHandle==0, deleteFiles=false) early-returned with 'liveIndexes.TryRemove(KeyId(key), out _)' OUTSIDE the per-key exclusive lock. Every other liveIndexes mutator (RegisterIndex, UnregisterIndex, PreStageAndRegisterPending) holds the lock — this single instance broke the discipline. ConcurrentDictionary is thread-safe at the operation level, but invariants spanning multiple operations need external serialization. Race scenario: SetCheckpointBarrier marks pending entry's SnapshotPending=1, then a concurrent cold eviction TryRemoves it before SnapshotAllTreesForCheckpoint captures it. Result: silent checkpoint coverage gap for the affected key — recovery from that checkpoint produces NOTFOUND. Fix: acquire the lock first, then dispatch on TreeHandle/deleteFiles inside the locked block. Cost: one extra lock acquisition on the cold-eviction path, with no shared-lock contention because the page is being torn down. 3. RestoreTree: re-add Debug.Assert for missing data.bftree ----------------------------------------------------------- Dev had Debug.Assert(File.Exists(restorePath), '...') for the recovered-stub restore path. The PR's two-branch restore (checkpoint vs flush) was replaced by a single 'always use data.bftree' path, and the assert was softened to a LogWarning + return false. The data.bftree pre-staging invariant ('every above-FUA stub with TreeHandle=0 must have a valid data.bftree') is load-bearing for the lazy- restore chain. If pre-staging breaks (in PostCopyToTail-cold, RIPROMOTE PostCopyUpdater-cold, or OnRecoverySnapshotRead), the current code surfaces the failure as a key silently returning NOTFOUND. The original Debug.Assert would have caught it loudly in Debug-build CI. Fix: restore the Debug-build assert immediately before the LogWarning. Release builds: behavior unchanged. Debug builds (CI): assertion failure with stack trace, immediately surfacing any pre-stage chain bug. Skipped from audit ------------------ - Finding 4 (StorageBackend == Disk guard dropped from DeleteTreeFiles call): pure defensive-coding loss; DeleteTreeFiles guards via File.Exists, memory trees no-op. Not worth a fix. - Finding 5 (PurgeOldCheckpointSnapshots dropped, cleanup delegated to Tsavorite): needs verification that Tsavorite's per-token directory cleanup correctly reaches the rangeindex/ subdir. Investigation deferred — not a fix. Verification ------------ - dotnet format Garnet.slnx --verify-no-changes -> clean - dotnet build test/Garnet.test -> 0 warnings, 0 errors - 61 RespRangeIndexTests -> all pass Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: migrate to bf-tree 0.5 CPR snapshot, fix OnFlush deadlocks Two deadlocks were identified in the prior X-lock-based OnFlush design: 1. ReadRangeIndex + CopyReadsToTail self-deadlock via PreStage. 2. RestoreTree's RIRESTORE RMW could trigger synchronous OnFlush on the same thread that holds the per-key X-lock -> shard collision -> hang. Root cause: `OnFlush` may fire as a deferred epoch action on a thread holding any RI lock. The per-key X-lock walks all shards via `CalculateIndex` so it collides with any S-lock the same thread already holds. The deferred firing is unavoidable at the Tsavorite layer. Fix: - Upgrade to bf-tree 0.5 (PR microsoft/bf-tree#19, CPR snapshot). CPR is concurrent-safe with workers, eliminating the need for per-key X-lock during OnFlush of a live tree. Memory-backed trees are also CPR-snapshot capable in 0.5 (unified API with disk-backed). - OnFlush live case: per-tree atomic (TreeEntry.SnapshotInProgress) serializes against concurrent SnapshotAllTreesForCheckpoint. No per-key lock taken. - OnFlush cold case (stub.TreeHandle == 0, just-CAS'd at tail before RestoreTree activated): per-key SHARED RI lock blocks RestoreTree's X-lock from registering a new tree mid-copy. S-vs-S compatible across threads -> no self-deadlock. If a live tree exists for the key under another stub, snapshot via that handle (CPR); else File.Copy(data.bftree). - SnapshotAllTreesForCheckpoint: same per-tree atomic, no per-key X-lock. - RestoreTree: split lock from RMW. X-lock held only for file-recovery + RegisterIndex; released BEFORE issuing RIRESTORE RMW. RMW on tombstoned key returns NOTFOUND (NeedInitialUpdate=false). - DisposeTreeUnderLock + RegisterIndex loser-disposal: defer bfTree.Dispose() and file deletion via storeEpoch.BumpCurrentEpoch(...) so concurrent readers using TreeHandle complete before native free. - Startup guard: EnableRangeIndexPreview=true is incompatible with CopyReadsToTail=true (would deadlock per-key shared+exclusive on the same thread); GarnetServer throws GarnetException at construction time. - File scheme: <hash>.scratch.cpr is bftree's CPR snapshot scratch path (configured at construction, written by every cpr_snapshot). OnFlush and SnapshotAllTreesForCheckpoint File.Copy scratch -> final destination (Copy not Move because bftree caches an FD on the configured path). Native layer: - Cargo.toml: bumped bf-tree 0.4 -> 0.5. - src/lib.rs: replaced bftree_snapshot/_memory + bftree_new_from_snapshot with bftree_cpr_snapshot, bftree_new_from_cpr_snapshot, bftree_are_all_threads_in_next_version. bftree_create accepts snapshot_file_path (configures use_snapshot=true). - BfTreeService.cs / NativeBfTreeMethods.cs: matching managed bindings. RecoverFromCprSnapshot is the unified disk+memory recovery API. - runtimes/linux-x64/native/libbftree_garnet.so refreshed. Tests: - All 61 existing RespRangeIndexTests pass. - Added RICopyReadsToTailIncompatibleStartupGuardTest for the new guard. - TestUtils.CreateGarnetServer accepts copyReadsToTail parameter. Docs: - website/docs/dev/range-index-resp-api.md: updated lifecycle table, file layout, lazy-restore flow, checkpoint flow, plus new Deadlock Safety section explaining the three invariants. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: drop redundant X-lock from PreStageAndRegisterPending PreStage's X-lock was a defensive carryover. With atomic-rename via a per-srcAddr staging file, no concurrent reader of data.bftree can observe a partial file. Reasoning: 1. PreStage is always called with srcAddr = the chain head (the source the upper Tsavorite op is copying from). The chain head's flush file captures the latest content; data.bftree already contains that same content (because data.bftree is updated only by live-tree mutations, PreStage from the same chain head, or post-recovery pre-staging — all of which converge to the latest flushed state). So PreStage's File.Copy is semantically a no-op overwrite. 2. Atomic-rename via File.Copy(srcFlush -> staging) + File.Move(staging, dataPath, overwrite: true) gives bit-level safety: a racing RestoreTree opening data.bftree gets either the OLD complete file or the NEW complete file, never partial. Concurrent ops on a different stub for the same key are handled by: - Same-key op-thread access: serialized by Tsavorite's bucket lock. - Page-callback access on the older source A: blocked by epoch protection during PreStage's window. After PreStage returns, PostCopyToTail sets IsTransferred=true on A, so subsequent OnEvict on A no-ops via the IsTransferred short-circuit in DisposeTreeUnderLock. CopyReadsToTail incompatibility kept (and reworded): the X-lock was not the only obstacle. CopyReadsToTail interacts with PromoteToTail / RestoreTree paths in ways that hang after ShiftReadOnlyAddress; that investigation is tracked separately. Test RICopyReadsToTailIncompatibleStartupGuardTest documents the gating. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Revert "RangeIndex: drop redundant X-lock from PreStageAndRegisterPending" This reverts commit 089eb30c083369dbddc0e1b4eb8dfadfc098c275. * RangeIndex: propagate RecordType in PostCopyToTail (CTT bug fix) PostCopyToTail had a chicken-and-egg bug for any path that goes through Tsavorite's CopyReadsToTail or compaction CTT: - CTT allocates dst at tail and copies the source's value bytes, but does NOT carry the RecordDataHeader.RecordType through. After CTT, dst.RecordType == 0 by default. - Our PostCopyToTail callback's early-return checked dstLogRecord.RecordDataHeader.RecordType against RangeIndexManager.RangeIndexRecordType — and bailed out, so we never set the RecordType ourselves either. - Result: a Read-with-RangeIndex-cmd on the freshly CTT'd record fails the type-mismatch guard in ReadMethods.CheckRecordTypeMismatch and returns WRONGTYPE. With CopyReadsToTail=true on a flushed RI stub, ReadRangeIndex retried into PromoteToTail and the loop hung the GET. Fix: - Check srcLogRecord.RecordType (the source carries it correctly) for the early-return. If src is a RangeIndex record, we know dst should be one too. - Explicitly set dst.RecordDataHeader.RecordType = RangeIndexRecordType before any other dst mutation. RecordDataHeader is a pointer-wrapping struct; the setter writes through to the underlying memory even when invoked on a struct value. This also fixes any future compaction path that CTTs RangeIndex stubs. CopyReadsToTail+EnableRangeIndexPreview startup guard kept: the cold-CTT path still self-deadlocks because PreStageAndRegisterPending takes a per-key X-lock under the reader's shared RI lock. The guard text is updated to cite the precise reason. Live-CTT path (src.TreeHandle != 0) is now correct and would work alone, but we don't split the guard because operationally users either evict or they don't. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: dedicated Read_RangeIndex API to suppress CTT per-op RangeIndex performs its own controlled promotion via RIPROMOTE RMW (propagates RecordType, manages TreeHandle ownership in PostCopyUpdater, pre-stages data.bftree under per-key X-lock). Allowing Tsavorite's automatic CopyReadsToTail / CopyReadsToReadCache to race with that path caused two distinct failures when CopyReadsToTail=true was enabled at the session/KV level: 1. CTT does not propagate RecordDataHeader.RecordType to the destination record. After CTT, dst.RecordType == 0; the next Read with a RangeIndex command failed CheckRecordTypeMismatch and returned WRONGTYPE — ReadRangeIndex retried into PromoteToTail forever and the GET hung. 2. PostCopyToTail-cold takes the per-key X-lock under the reader's shared RI lock — same-shard self-deadlock. Fix: per-operation suppression of CTT for RangeIndex stub Reads. Tsavorite changes: - libs/storage/Tsavorite/cs/src/core/Index/Common/OperationOptions.cs: ReadOptions.{CopyOptions,KeyHash} setters relaxed from internal to public so callers in Garnet.server can construct ReadOptions with per-op overrides. ReadOptions was always designed to support per-op overrides (see hierarchy doc on ReadCopyFrom); the internal setter was the only thing in the way. Garnet.server changes: - libs/server/Storage/Session/MainStore/AdvancedOps.cs: new Read_RangeIndex<TStringContext> method that builds a ReadOptions with CopyOptions = ReadCopyOptions.None and dispatches to the standard Read pipeline. Read_MainStore is unchanged so other callers (Bitmap, HLL, etc.) incur zero overhead. - libs/server/Resp/RangeIndex/RangeIndexManager.Locking.cs: ReadRangeIndex hot path + RestoreTree's re-check call Read_RangeIndex. - libs/server/Storage/Session/MainStore/RangeIndexOps.cs: RIEXISTS uses Read_RangeIndex. Startup guard removed: EnableRangeIndexPreview is now compatible with CopyReadsToTail=true. Bonus: PostCopyToTail's prior RecordType-propagation fix (commit 5d46bf7d3) remains correct and covers any future code path that does CTT a RangeIndex stub through means other than ReadRangeIndex (e.g., compaction). Replaces RICopyReadsToTailIncompatibleStartupGuardTest with RICopyReadsToTailCompatibleTest that exercises the full path (insert -> ShiftReadOnlyAddress -> 50 reads). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: address PR review feedback (cleanups + doc fixes) Fixes from the PR review pass on #1780: - Remove unused 'removeOutdatedCheckpoints' field/parameter from RangeIndexManager and its callsite in GarnetServer.cs. The field was set but never read; OnTruncate's per-flush cleanup is unconditional (LOG-tied lifetime), independent of any checkpoint-retention policy. Doc on OnTruncateImpl updated to reflect this. - Drop redundant Directory.CreateDirectory(riLogRoot) calls inside CreateBfTree, PreStageAndRegisterPending, RebuildFromSnapshotIfPending, and SnapshotTreeForFlush. The constructor already creates riLogRoot once. (Per-checkpoint snapshot-dir CreateDirectory in SnapshotAllTreesForCheckpoint stays — that path is per-token and needs creation each time.) - Fix unsafe span pinning in RIDisposeTreeUnderLockNoOpsOnTransferred- SourceTest. RangeIndexManager hashes the key via PinnedSpanByte.FromPinnedSpan, which captures a raw pointer assuming the source is GC-pinned. The test was passing a span over a managed byte[] from Encoding.ASCII.GetBytes(), which can be relocated by the GC during hashing. Pin via 'fixed' for the duration of each DisposeTreeUnderLock call. - Doc updates in website/docs/dev/range-index-resp-api.md: * Replace stale FlagFlushed/FlagRecovered terminology with the actual property names IsFlushed/IsRecovered. * Add IsTransferred to the Flags field documentation. * Update PostCopyToTail row to mention RecordType propagation. * Update OnDispose row to clarify per-flush snapshots are preserved (LOG-tied; cleaned by OnTruncate when BeginAddress passes addr). * Drop stale '.tmp' atomic-rename references in OnTruncate row, PreStageAndRegisterPending bullet, Compaction Lifecycle, and Recovery Flow. Current code uses File.Copy(overwrite: true) under the per-key X-lock, not .tmp + File.Move. * Update RecoverFromSnapshot reference to RecoverFromCprSnapshot. Items flagged but not addressed in this commit (see PR thread): - Tsavorite TryCopyToTail unseal ordering (Copilot #1) — would require splitting CASRecordIntoChain's unseal step; architectural Tsavorite change. - srcLogicalAddress=0 read-cache fallback (Copilot #2) — false alarm: LogAddress.kInvalidAddress IS 0L, so the existing check 'srcLogicalAddress != kInvalidAddress' correctly handles uninitialized originalAddress. - SnapshotToFileByPtr overwrite=true (Copilot #7) — obsolete: that method no longer exists in current code (replaced by CprSnapshotByPtr in the bf-tree 0.5 migration). - AddOrUpdate atomicity in RegisterIndex (tiagonapoli) — concurrent RegisterIndex on the same key is impossible: it is called only from RestoreTree (under the per-key X-lock) and from RICREATE (single-shot per key). The reviewer acknowledged this was theoretical. - Hard-failing init on directory-create or empty riLogRoot (tiagonapoli) — defensive changes to init contract; flagged for owner review. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: fail-fast init, atomic activation, strict recovery checks Robustness improvements addressing PR review feedback: RegisterIndex (#9): - Refactored from AddOrUpdate to TryAdd + Interlocked.CompareExchange on the pending-entry activation. This is robust to caller-invariant violations: even if two threads were to call RegisterIndex on the same key (today impossible — RICREATE is single-shot on Record.Created=true; RestoreTree holds the per-key X-lock), the CompareExchange ensures only one wins the activation; the loser logs an error and defer-disposes its duplicate via storeEpoch. - Added DisposeBfTreeDeferred helper to consolidate the defer-via-storeEpoch / synchronous-fallback dispose pattern. Constructor fail-fast (#10, #12): - riLogRoot is now required when enabled=true; throws ArgumentException if null or empty. Production GarnetServer always provides one via the LogDir → CheckpointDir → cwd fallback chain. - Directory.CreateDirectory(riLogRoot) failures now throw IOException with a wrapped exception and a helpful message pointing at parent-dir / permission issues — surfaces misconfiguration at server startup rather than at first RangeIndex use. - When enabled=false, the manager remains inert (riLogRoot can be null/empty); existing defensive checks in runtime methods cover that edge case. RebuildFromSnapshotIfPending (#14): - Recovery state (cprDir non-empty + non-empty recoveredCheckpointToken) is now required; throws InvalidOperationException with diagnostic context if missing. This indicates OnRecoverySnapshotRead fired without OnRecovery(token) having captured the token first — a wiring bug that would otherwise silently lose the recovered tree. - Disabled state still returns silently (manager is inert). - Below-FUA stubs (snapshot file absent) remain expected and log-debug. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Tests: migrate RecordTriggersExtTests off Allure (post-rebase fixup) The Allure-removal commit on main (d6d488885) migrated all test classes in test/ and Tsavorite.test from AllureTestBase to TestBase, but missed RecordTriggersExtTests.cs which we added in this PR. The build broke on CI: CS0246: 'Allure' could not be found CS0246: 'AllureNUnit' could not be found CS0246: 'AllureTestBase' could not be found Same migration pattern applied: - Drop 'using Allure.NUnit;'. - Drop '[AllureNUnit]' attribute. - Inherit from 'TestBase' instead of 'AllureTestBase'. This was missed during the rebase because Tsavorite.test lives in the separate Tsavorite.slnx (per repo onboarding doc); my post-rebase 'dotnet build Garnet.slnx' did not exercise Tsavorite test compilation. Verified now via 'dotnet build libs/storage/Tsavorite/cs/Tsavorite.slnx' and the 9 RecordTriggersExt tests pass. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * update bftree * ci: drop Rust toolchain setup step Main no longer installs Rust on CI runners — the workflow relies on the prebuilt native bftree binaries checked into runtimes/<rid>/native/. Our BfTreeInterop.csproj already handles missing cargo gracefully (errors only when BOTH cargo is unavailable AND no prebuilt exists for the platform). With prebuilts present for linux-x64 and win-x64 (the two CI matrix platforms), the Rust setup step is dead weight. Saves ~30s per CI job × the build+test matrix. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * GarnetServer: drop stray blank line after EnableRangeIndexPreview cluster guard * RangeIndex: gate trigger Call* properties on IsEnabled (perf) Audit for perf-regression-when-disabled (the default config). Issue: rangeIndexManager is always constructed by GarnetServer (even when EnableRangeIndexPreview=false) so it can serve a few queries that error out cleanly when RI commands are issued. As a side effect, our trigger gates were 'rangeIndexManager != null' — always true — so Tsavorite was firing OnFlush, OnEvict, OnDiskRead, PostCopyToTail, OnTruncate callbacks on every relevant record/operation, only to do an early-return on the RecordType check. Per-record overhead, multiplied across the log on every flush/evict/CTT. Fix: change the gates to 'rangeIndexManager is { IsEnabled: true }'. With RI disabled, the gates return false and Tsavorite skips the callback entirely (verified — Tsavorite checks the gate at every callsite: ObjectAllocatorImpl.OnFlush, AllocatorBase.OnEvict / OnTruncate, AllocatorBase / AllocatorScan / *ScanIterator.OnDiskRead, TryCopyToTail.PostCopyToTail). Audit of other hot-path additions in this PR (no further changes needed): - ReadMethods.SingleReader: type-mismatch check guarded by 'recordType != 0' short-circuit. Normal GET on normal string: zero overhead. - UpsertMethods.InPlaceWriter: same 'recordType != 0' short-circuit. - RMWMethods: RangeIndex command cases only entered for RI commands. - Read_RangeIndex: separate API; never invoked from non-RI hot paths. - OnDispose / OnRecovery / OnRecoverySnapshotRead / OnCheckpoint: no Call* gates available in Tsavorite, but their bodies do RecordType early-returns and fire on rare ops (dispose, recovery, checkpoint). All 62 RangeIndex tests still pass after the gate change. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * RangeIndex: construct manager only when enabled; drop IsEnabled flag Cleaner alternative to the previous 'always-construct + IsEnabled flag' pattern. With this change: - GarnetServer constructs RangeIndexManager only when EnableRangeIndexPreview=true; passes null otherwise. The fallback-path computation for riLogRoot moved inside the conditional. - RangeIndexManager constructor drops the 'enabled' parameter — it always validates riLogRoot non-empty and creates the directory (the call only happens in the enabled case now). Drops the IsEnabled property. - Trigger gates (CallOnFlush etc.) simplify back to 'rangeIndexManager != null' — same effect as the previous 'is { IsEnabled: true }' but reads more naturally. - RespServerSessionRangeIndex feature-gate checks change from '!sw.rangeIndexManager.IsEnabled' to 'sw.rangeIndexManager is null'. - RebuildFromSnapshotIfPending drops its 'if (!IsEnabled) return;' prelude; the manager itself doesn't exist when disabled. Behavior is identical: RangeIndex remains disabled by default, the trigger gates return false → Tsavorite skips all RangeIndex callbacks → zero per-op overhead on non-RangeIndex workloads. The change is about code clarity and removing a dual-state object that was just a historical artifact. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
CPR snapshot of Bf-tree.
Retiring original stop-the-world only snapshot code due to duplicity.