Refactor disk-based replication checkpoint shipping and safe hlog segment truncation#1773
Merged
Conversation
Consolidate file segment and metadata transmission into a single CLUSTER SNAPSHOT_DATA <token> <type> <startAddress> <data> command. A startAddress of -1 signals a single-message payload (e.g., metadata) committed directly. Any other startAddress indicates a streamed file segment where empty data signals end-of-stream. The previous CLUSTER SEND_CKPT_FILE_SEGMENT and SEND_CKPT_METADATA commands are not removed but are deprecated in favor of SNAPSHOT_DATA. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ship BfTree snapshot files (snapshot.{token}.bftree) during checkpoint
synchronization to replicas. This enables replicas to lazily restore
RangeIndex trees from checkpoint snapshots after recovery.
New types following the existing ISnapshotDataSource/ISnapshotTransmitSource/
ISnapshotReader pattern:
- RangeIndexFileDataSource: reads .bftree files via FileStream
- RangeIndexFileTransmitSource: sends chunks with per-file key hash header
- RangeIndexCheckpointReader: enumerates snapshot files for a checkpoint token
- RangeIndexFileSink: writes received .bftree data to disk on replicas
Wire protocol: for each RINDEX_SNAPSHOT file, a header message (startAddress=-1)
carries the 32-char hex key hash directory name, followed by file data chunks,
followed by an empty EOT packet. Receiver validates key hash format.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… receiving a cluster snapshot data request
…cation Add cluster-aware purging to PurgeOldCheckpointSnapshots so that in cluster mode, snapshot deletion is deferred to CheckpointStore which verifies no active readers hold the checkpoint entry. CheckpointStore now calls PurgeOldCheckpointSnapshots with enforceClusterSafety:true after confirming reader safety, both in DeleteOutdatedCheckpoints and PurgeAllCheckpointsExceptEntry. - Add clusterEnabled field to RangeIndexManager constructor - Add enforceClusterSafety parameter to PurgeOldCheckpointSnapshots - Wire CheckpointStore to purge BfTree snapshots alongside HLOG/index - Pass clusterEnabled from GarnetServer to RangeIndexManager Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t truncation Introduce PerformInternalCleanup property on ICheckpointManager to control whether Tsavorite performs internal cleanup of checkpoint snapshot files and hybrid log segments during the checkpoint state machine. When false, the external layer (cluster mode) manages cleanup with reader-safety checks. - Add PerformInternalCleanup to ICheckpointManager interface - Add virtual property to DeviceLogCommitCheckpointManager (default: true) - Override as false in GarnetClusterCheckpointManager - Guard CleanupLogCheckpoint/CleanupIndexCheckpoint in Checkpoint.cs - Activate safe ShiftBeginAddress in CheckpointStore.DeleteOutdatedCheckpoints using the oldest active checkpoint's begin address as the truncation boundary - Add ClusterReplicationHlogSegmentCleanupTest to validate hlog segment truncation does not interfere with concurrent replica sync Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the ISnapshotReader and ISnapshotTransmitSource implementations for RangeIndex BfTree checkpoints, along with the receive-side sink and helper methods for enumerating/purging BfTree snapshot files during replication. Deleted: - RangeIndexCheckpointReader.cs - RangeIndexFileTransmitSource.cs - RangeIndexFileSink.cs Cleaned up: - CheckpointFileType: removed RINDEX_SNAPSHOT enum value - CheckpointStore: removed PurgeOldCheckpointSnapshots calls - ReceiveCheckpointHandler: removed RINDEX_SNAPSHOT handling - ReplicaSyncSession: removed RangeIndex reader registration - RangeIndexManager.cs: moved replication methods to partial file - StoreWrapper: removed public RangeIndexManager property - GarnetServer: reverted clusterEnabled parameter The RangeIndexManager.Replication.cs partial file is retained as the separation of AOF replication methods remains useful. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t for backward compatibility Revert the wire format of CLUSTER SEND_CKPT_FILE_SEGMENT to accept 5 args (including segmentId) so older primaries on dev can still replicate to newer replicas. The segmentId is parsed but not used since disk-based replication now goes through CLUSTER SNAPSHOT_DATA. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Refactors disk-based replication checkpoint shipping by introducing chunk-based snapshot reader/transmit/source abstractions and a unified CLUSTER SNAPSHOT_DATA receive path, while also adding a mechanism to prevent Tsavorite from internally cleaning up/truncating checkpoints in cluster mode so that Garnet can enforce reader-safe segment truncation.
Changes:
- Added unified snapshot shipping pipeline (
ISnapshotReader/ISnapshotDataSource/ISnapshotTransmitSource) and unified receive handler (ProcessSnapshotData) + newCLUSTER SNAPSHOT_DATAcommand. - Introduced
ICheckpointManager.PerformInternalCleanupto disable Tsavorite internal checkpoint cleanup in cluster mode; moved safe hlog truncation toCheckpointStore.DeleteOutdatedCheckpoints. - Added/updated cluster + ACL tests and command metadata to cover the new internal command and cleanup behavior.
Reviewed changes
Copilot reviewed 35 out of 36 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| test/Garnet.test/Resp/ACL/RespCommandTests.cs | Adds ACL coverage for CLUSTER SNAPSHOT_DATA when cluster is disabled. |
| test/Garnet.test.cluster/ReplicationTests/ClusterReplicationBaseTests.cs | Adds replication test validating hlog segment cleanup during concurrent sync. |
| test/Garnet.test.cluster/ClusterNegativeTests.cs | Updates wrong-arity coverage for replication-related CLUSTER subcommands. |
| playground/CommandInfoUpdater/SupportedCommand.cs | Registers `CLUSTER |
| playground/CommandInfoUpdater/GarnetCommandsInfo.json | Adds generated command metadata entry for CLUSTER_SNAPSHOT_DATA. |
| libs/storage/Tsavorite/cs/src/core/Index/Recovery/ICheckpointManager.cs | Adds PerformInternalCleanup switch to checkpoint manager contract. |
| libs/storage/Tsavorite/cs/src/core/Index/Recovery/Checkpoint.cs | Gates Tsavorite cleanup/truncation on PerformInternalCleanup. |
| libs/storage/Tsavorite/cs/src/core/Index/CheckpointManagement/DeviceLogCommitCheckpointManager.cs | Implements PerformInternalCleanup defaulting to true. |
| libs/server/Resp/RangeIndex/RangeIndexManager.Replication.cs | Moves RangeIndex AOF replication methods into a new partial file. |
| libs/server/Resp/RangeIndex/RangeIndexManager.cs | Removes replication/AOF code now hosted in partial. |
| libs/server/Resp/Parser/RespCommand.cs | Adds CLUSTER_SNAPSHOT_DATA enum + parsing. |
| libs/server/Resp/CmdStrings.cs | Adds snapshot_data command string. |
| libs/resources/RespCommandsInfo.json | Adds resource metadata for `CLUSTER |
| libs/cluster/Session/RespClusterReplicationCommands.cs | Adds network handler for CLUSTER SNAPSHOT_DATA; refactors existing ckpt receive paths to new handler APIs. |
| libs/cluster/Session/ClusterCommands.cs | Wires CLUSTER_SNAPSHOT_DATA dispatch. |
| libs/cluster/Server/Replication/ReplicaOps/ReplicaDiskbasedSync.cs | Renames/adjusts device factory method used by receiver (CreateCheckpointDevice). |
| libs/cluster/Server/Replication/ReplicaOps/ReceiveCheckpointHandler.cs | Removes old receive handler implementation (superseded). |
| libs/cluster/Server/Replication/ReplicaOps/DiskbasedReplication/ReceiveCheckpointHandler.cs | Adds new unified receive handler with sink abstractions. |
| libs/cluster/Server/Replication/ReplicaOps/DiskbasedReplication/MetadataDataSink.cs | Implements metadata sink that commits checkpoint metadata to manager. |
| libs/cluster/Server/Replication/ReplicaOps/DiskbasedReplication/ISnapshotDataSink.cs | Defines sink interface for chunk-based writes. |
| libs/cluster/Server/Replication/ReplicaOps/DiskbasedReplication/FileDataSink.cs | Implements device-backed sink for checkpoint file segment writes. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/TsavoriteMetadataTransmitSource.cs | Sends metadata via unified SNAPSHOT_DATA protocol. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/TsavoriteMetadataSource.cs | Implements in-memory metadata data source. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/TsavoriteCheckpointReader.cs | Implements Tsavorite-backed snapshot reader producing file+metadata transmit sources. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/SnapshotTransmissionDriver.cs | Orchestrates sending checkpoint data across sources/readers. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/ReplicaSyncSession.cs | Refactors primary-side replication sync session to use new transmission driver. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/ISnapshotTransmitSource.cs | Defines transmit source interface for sending data via GarnetClientSession. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/ISnapshotReader.cs | Defines snapshot reader interface returning transmit sources. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/ISnapshotDataSource.cs | Defines chunk-based data source interface. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/FileTransmitSource.cs | Implements file segment transmission via unified SNAPSHOT_DATA. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/FileDataSource.cs | Implements device-backed chunk reads for file transmission. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/DataSourceReadResult.cs | Adds common result struct for device-backed vs memory-backed reads. |
| libs/cluster/Server/Replication/GarnetClusterCheckpointManager.cs | Disables Tsavorite internal cleanup in cluster mode (PerformInternalCleanup=false). |
| libs/cluster/Server/Replication/CheckpointStore.cs | Adds reader-safe hlog begin-address shifting during checkpoint pruning. |
| libs/cluster/Garnet.cluster.csproj | Fixes PackageReference formatting. |
| libs/client/ClientSession/GarnetClientSessionReplicationExtensions.cs | Adds client helper to send CLUSTER SNAPSHOT_DATA. |
Comments suppressed due to low confidence (1)
libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/ReplicaSyncSession.cs:142
- Local variable name
tsavoriteSnaphotReaderis misspelled ("Snaphot"). Rename totsavoriteSnapshotReader(or similar) for clarity and to avoid propagating the typo into future code changes.
tiagonapoli
reviewed
May 13, 2026
tiagonapoli
reviewed
May 14, 2026
tiagonapoli
reviewed
May 14, 2026
tiagonapoli
reviewed
May 14, 2026
tiagonapoli
reviewed
May 14, 2026
…oid read operation writing to a corrupted memory in the event of a timeout.
tiagonapoli
reviewed
May 14, 2026
tiagonapoli
reviewed
May 14, 2026
tiagonapoli
reviewed
May 14, 2026
- IOCallbackContext — just a SectorAlignedMemory buffer field that roots the pinned byte[] for GC safety while IO is in-flight - Callback — unconditionally releases the semaphore (with ObjectDisposedException guard) - Caller — single WaitAsync call; on timeout throws, on cancellation propagates naturally; buffer abandoned in both cases (GC collects after IO completes) - IO error — just throws; buffer abandoned same as other error paths Rationale: Timeout/cancellation/IO-error are all catastrophic for the replication session. No subsequent reads will use the shared semaphore, so stale counts are harmless. Abandoned buffers stay alive via the IOCallbackContext reference chain until the callback fires, then GC collects them. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Apply the same buffer abandonment pattern from FileDataSource to FileDataSink: - On timeout/cancellation, buffer is abandoned (not returned to pool); GC collects after IO completes - Callback unconditionally releases the semaphore (with ObjectDisposedException guard) - Buffer is only returned to the pool on the successful path Extract IOCallbackContext into a shared class under Replication/ and reuse a single instance per FileDataSource/FileDataSink instead of allocating per IO. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
tiagonapoli
approved these changes
May 14, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Refactors the checkpoint shipping pipeline for disk-based replication, introducing clean abstractions for reading and transmitting Tsavorite checkpoint data. Also adds safe hybrid log segment truncation to prevent deletion of segments actively being read by syncing replicas.
Key changes
Checkpoint shipping abstractions (send side):
Checkpoint shipping abstractions (receive side):
Safe hlog segment truncation (PerformInternalCleanup):
RangeIndexManager refactoring:
Testing