Refactor disk-based replication checkpoint shipping and safe hlog segment truncation#1773

Merged
vazois merged 45 commits into main from vazois/fsync-ref
May 14, 2026
Contributor

@vazois vazois commented May 6, 2026

Summary

Refactors the checkpoint shipping pipeline for disk-based replication, introducing clean abstractions for reading and transmitting Tsavorite checkpoint data. Also adds safe hybrid log segment truncation to prevent deletion of segments actively being read by syncing replicas.

Key changes

Checkpoint shipping abstractions (send side):

  • ISnapshotReader / ISnapshotTransmitSource / ISnapshotDataSource interfaces
  • TsavoriteSnapshotReader / TsavoriteCheckpointReader for reading checkpoint files
  • FileTransmitSource / TsavoriteMetadataTransmitSource for transmitting data
  • SnapshotTransmissionDriver orchestrating the send pipeline
  • ReplicaSyncSession refactored to use the new abstractions
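
The layering above (a reader yields transmit sources, each transmit source drains a chunk-based data source, and a driver walks all of them) can be sketched roughly as follows. This is a minimal Python illustration of the pattern, not the C# code in this PR: the class names `InMemorySource`, `ChunkedTransmitSource`, and `ListSnapshotReader`, every method name, and the `send` callback are hypothetical. The empty trailing chunk follows the end-of-stream convention described in the commit messages.

```python
from typing import Callable, Iterable, Iterator, List

# Hypothetical send callback: (source name, chunk bytes) -> None.
Send = Callable[[str, bytes], None]


class InMemorySource:
    """Hypothetical chunk-based data source (plays the role of ISnapshotDataSource)."""

    def __init__(self, payload: bytes, chunk_size: int) -> None:
        self.payload = payload
        self.chunk_size = chunk_size

    def read_chunks(self) -> Iterator[bytes]:
        # Slice the payload into fixed-size chunks; the last one may be shorter.
        for offset in range(0, len(self.payload), self.chunk_size):
            yield self.payload[offset:offset + self.chunk_size]


class ChunkedTransmitSource:
    """Hypothetical transmit source (plays the role of ISnapshotTransmitSource)."""

    def __init__(self, name: str, source: InMemorySource) -> None:
        self.name = name
        self.source = source

    def transmit(self, send: Send) -> None:
        for chunk in self.source.read_chunks():
            send(self.name, chunk)
        # An empty chunk marks end-of-stream for a streamed file.
        send(self.name, b"")


class ListSnapshotReader:
    """Hypothetical reader (plays the role of ISnapshotReader)."""

    def __init__(self, sources: List[ChunkedTransmitSource]) -> None:
        self.sources = sources

    def get_transmit_sources(self) -> Iterator[ChunkedTransmitSource]:
        return iter(self.sources)


def drive_transmission(readers: Iterable[ListSnapshotReader], send: Send) -> None:
    """Walk every reader and transmit each of its sources in order."""
    for reader in readers:
        for transmit_source in reader.get_transmit_sources():
            transmit_source.transmit(send)


sent = []
reader = ListSnapshotReader([ChunkedTransmitSource("hlog", InMemorySource(b"abcdef", 4))])
drive_transmission([reader], lambda name, chunk: sent.append((name, chunk)))
```

Keeping the driver unaware of what a source contains is what lets file segments and metadata share one send loop.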

Checkpoint shipping abstractions (receive side):

  • ISnapshotDataSink interface
  • FileDataSink / MetadataDataSink implementations
  • Unified ReceiveCheckpointHandler with ProcessSnapshotData entry point
  • Unified CLUSTER SNAPSHOT_DATA command (previous per-type commands are deprecated but not removed)

Safe hlog segment truncation (PerformInternalCleanup):

  • Added PerformInternalCleanup property to ICheckpointManager interface
  • GarnetClusterCheckpointManager overrides it as false, so Tsavorite skips internal cleanup
  • CheckpointStore.DeleteOutdatedCheckpoints() calls ShiftBeginAddress with the oldest active checkpoint begin address as the safe truncation boundary
  • Ensures hlog segments are not deleted while replicas are actively reading them
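
As an illustration of the truncation rule above, here is a small Python sketch of how a safe boundary might be chosen. It assumes, purely for illustration, that entries are ordered oldest first, that an entry is retained while it has readers or is the newest, and that begin addresses never decrease; `CheckpointEntry`, `active_readers`, and `delete_outdated_checkpoints` are hypothetical names, not the actual CheckpointStore API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CheckpointEntry:
    token: str
    begin_address: int      # hlog begin address captured by this checkpoint
    active_readers: int = 0  # replicas currently reading this checkpoint


def delete_outdated_checkpoints(
    entries: List[CheckpointEntry],
) -> Tuple[List[CheckpointEntry], Optional[int]]:
    """Return the retained entries and the address it is safe to truncate to.

    Assumes `entries` is ordered oldest first.
    """
    if not entries:
        return [], None
    newest = entries[-1]
    # Keep anything a replica is still reading, plus the newest checkpoint.
    retained = [e for e in entries if e.active_readers > 0 or e is newest]
    # The oldest retained entry bounds how far the log may be truncated.
    return retained, retained[0].begin_address


entries = [
    CheckpointEntry("A", begin_address=0, active_readers=1),
    CheckpointEntry("B", begin_address=100),
    CheckpointEntry("C", begin_address=250),
]
retained_busy, boundary_busy = delete_outdated_checkpoints(entries)

entries[0].active_readers = 0  # the syncing replica finishes with checkpoint A
retained_idle, boundary_idle = delete_outdated_checkpoints(entries)
```

While a replica still reads checkpoint A the boundary stays at A's begin address; once it releases A, the boundary can advance to the newest checkpoint.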

RangeIndexManager refactoring:

  • Moved AOF replication methods to RangeIndexManager.Replication.cs partial file

Testing

  • Added ClusterReplicationHlogSegmentCleanupTest validating hlog segment truncation during concurrent replica sync (25+ stable runs)
  • All existing replication tests pass

vazois and others added 26 commits April 27, 2026 16:27
Consolidate file segment and metadata transmission into a single
CLUSTER SNAPSHOT_DATA <token> <type> <startAddress> <data> command.
A startAddress of -1 signals a single-message payload (e.g., metadata)
committed directly. Any other startAddress indicates a streamed file
segment where empty data signals end-of-stream.

The previous CLUSTER SEND_CKPT_FILE_SEGMENT and SEND_CKPT_METADATA
commands are not removed but are deprecated in favor of SNAPSHOT_DATA.
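
The dispatch rule described in this commit message can be sketched on the receive side as follows. This is a Python illustration under stated assumptions, not the actual ReceiveCheckpointHandler: it treats startAddress as a byte offset into the destination file, keeps everything in memory, and uses the hypothetical names `SnapshotReceiver` and `process_snapshot_data`.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (checkpoint token, file type)


class SnapshotReceiver:
    """Hypothetical in-memory receiver for SNAPSHOT_DATA-style messages."""

    def __init__(self) -> None:
        self.committed: Dict[Key, bytes] = {}         # single-message payloads
        self.open_streams: Dict[Key, bytearray] = {}  # files still being streamed
        self.completed: Dict[Key, bytes] = {}         # fully received files

    def process_snapshot_data(
        self, token: str, file_type: str, start_address: int, data: bytes
    ) -> str:
        key = (token, file_type)
        if start_address == -1:
            # Single-message payload (e.g. metadata): commit it directly.
            self.committed[key] = bytes(data)
            return "committed"
        if len(data) == 0:
            # Empty data on a streamed segment signals end-of-stream.
            self.completed[key] = bytes(self.open_streams.pop(key, bytearray()))
            return "end-of-stream"
        # Assumption: start_address is the byte offset of this chunk.
        buffer = self.open_streams.setdefault(key, bytearray())
        end = start_address + len(data)
        if len(buffer) < end:
            buffer.extend(b"\x00" * (end - len(buffer)))
        buffer[start_address:end] = data
        return "chunk"


receiver = SnapshotReceiver()
results = [
    receiver.process_snapshot_data("tok", "metadata", -1, b"meta"),
    receiver.process_snapshot_data("tok", "hlog", 0, b"abc"),
    receiver.process_snapshot_data("tok", "hlog", 3, b"de"),
    receiver.process_snapshot_data("tok", "hlog", 5, b""),
]
```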

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Ship BfTree snapshot files (snapshot.{token}.bftree) during checkpoint
synchronization to replicas. This enables replicas to lazily restore
RangeIndex trees from checkpoint snapshots after recovery.

New types following the existing ISnapshotDataSource/ISnapshotTransmitSource/
ISnapshotReader pattern:
- RangeIndexFileDataSource: reads .bftree files via FileStream
- RangeIndexFileTransmitSource: sends chunks with per-file key hash header
- RangeIndexCheckpointReader: enumerates snapshot files for a checkpoint token
- RangeIndexFileSink: writes received .bftree data to disk on replicas

Wire protocol: for each RINDEX_SNAPSHOT file, a header message (startAddress=-1)
carries the 32-char hex key hash directory name, followed by file data chunks,
followed by an empty EOT packet. Receiver validates key hash format.
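
A rough Python sketch of this framing and of the receiver-side key hash check is shown below. The case-insensitive hex match, the start address used for the EOT packet, and the names `is_valid_key_hash` and `frame_snapshot_file` are assumptions made for illustration; the commit message only states that the header carries a 32-character hex key hash and that the receiver validates its format.

```python
import re
from typing import Iterator, Tuple

# Exactly 32 hex characters. Accepting both cases is an assumption.
_KEY_HASH = re.compile(r"[0-9a-fA-F]{32}")


def is_valid_key_hash(name: str) -> bool:
    """Reject anything that is not a bare 32-char hex string (e.g. a path)."""
    return _KEY_HASH.fullmatch(name) is not None


def frame_snapshot_file(
    key_hash: str, payload: bytes, chunk_size: int
) -> Iterator[Tuple[int, bytes]]:
    """Yield (startAddress, data) messages: header, data chunks, then EOT."""
    if not is_valid_key_hash(key_hash):
        raise ValueError("key hash must be 32 hex characters")
    # Header message: startAddress = -1 carries the key hash directory name.
    yield -1, key_hash.encode("ascii")
    for offset in range(0, len(payload), chunk_size):
        yield offset, payload[offset:offset + chunk_size]
    # Empty data marks end of transmission; this offset is an assumption.
    yield len(payload), b""


key = "0123456789abcdef0123456789abcdef"
frames = list(frame_snapshot_file(key, b"abcdef", 4))
```

Validating the hash before using it as a directory name keeps a malformed or hostile header from steering writes outside the snapshot directory.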

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cation

Add cluster-aware purging to PurgeOldCheckpointSnapshots so that in
cluster mode, snapshot deletion is deferred to CheckpointStore which
verifies no active readers hold the checkpoint entry. CheckpointStore
now calls PurgeOldCheckpointSnapshots with enforceClusterSafety:true
after confirming reader safety, both in DeleteOutdatedCheckpoints and
PurgeAllCheckpointsExceptEntry.

- Add clusterEnabled field to RangeIndexManager constructor
- Add enforceClusterSafety parameter to PurgeOldCheckpointSnapshots
- Wire CheckpointStore to purge BfTree snapshots alongside HLOG/index
- Pass clusterEnabled from GarnetServer to RangeIndexManager
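
One way to read the deferral rule in this commit message is sketched below in Python. The function signature and the token-list representation are hypothetical; only the behaviour (skip the purge in cluster mode unless the caller has already confirmed reader safety) is taken from the text above.

```python
from typing import List


def purge_old_checkpoint_snapshots(
    snapshot_tokens: List[str],
    current_token: str,
    cluster_enabled: bool,
    enforce_cluster_safety: bool = False,
) -> List[str]:
    """Return the snapshot tokens that would be purged by this call."""
    if cluster_enabled and not enforce_cluster_safety:
        # Cluster mode: defer to the checkpoint store, which calls back with
        # enforce_cluster_safety=True once no reader holds the entry.
        return []
    return [token for token in snapshot_tokens if token != current_token]


tokens = ["t1", "t2", "t3"]
standalone = purge_old_checkpoint_snapshots(tokens, "t3", cluster_enabled=False)
deferred = purge_old_checkpoint_snapshots(tokens, "t3", cluster_enabled=True)
confirmed = purge_old_checkpoint_snapshots(
    tokens, "t3", cluster_enabled=True, enforce_cluster_safety=True
)
```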

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t truncation

Introduce PerformInternalCleanup property on ICheckpointManager to control
whether Tsavorite performs internal cleanup of checkpoint snapshot files and
hybrid log segments during the checkpoint state machine. When false, the
external layer (cluster mode) manages cleanup with reader-safety checks.

- Add PerformInternalCleanup to ICheckpointManager interface
- Add virtual property to DeviceLogCommitCheckpointManager (default: true)
- Override as false in GarnetClusterCheckpointManager
- Guard CleanupLogCheckpoint/CleanupIndexCheckpoint in Checkpoint.cs
- Activate safe ShiftBeginAddress in CheckpointStore.DeleteOutdatedCheckpoints
  using the oldest active checkpoint's begin address as the truncation boundary
- Add ClusterReplicationHlogSegmentCleanupTest to validate hlog segment
  truncation does not interfere with concurrent replica sync
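
The property-plus-guard arrangement listed above can be illustrated with a short Python sketch. The class and function names are stand-ins for DeviceLogCommitCheckpointManager, GarnetClusterCheckpointManager, and the guarded cleanup calls in Checkpoint.cs; the bodies are hypothetical and only model the control flow.

```python
from typing import List


class DeviceLogCheckpointManager:
    """Stand-in for the default manager: internal cleanup is enabled."""

    def __init__(self) -> None:
        self.cleaned_tokens: List[str] = []

    @property
    def perform_internal_cleanup(self) -> bool:
        return True

    def cleanup_checkpoint(self, token: str) -> None:
        self.cleaned_tokens.append(token)


class ClusterCheckpointManager(DeviceLogCheckpointManager):
    """Stand-in for the cluster manager: the external layer owns cleanup."""

    @property
    def perform_internal_cleanup(self) -> bool:
        return False


def on_checkpoint_completed(manager: DeviceLogCheckpointManager, old_token: str) -> bool:
    """Guarded cleanup step; returns True if internal cleanup ran."""
    if not manager.perform_internal_cleanup:
        return False  # cluster mode: cleanup happens later, with reader checks
    manager.cleanup_checkpoint(old_token)
    return True


standalone_manager = DeviceLogCheckpointManager()
cluster_manager = ClusterCheckpointManager()
ran_standalone = on_checkpoint_completed(standalone_manager, "old")
ran_cluster = on_checkpoint_completed(cluster_manager, "old")
```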

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the ISnapshotReader and ISnapshotTransmitSource implementations for
RangeIndex BfTree checkpoints, along with the receive-side sink and helper
methods for enumerating/purging BfTree snapshot files during replication.

Deleted:
- RangeIndexCheckpointReader.cs
- RangeIndexFileTransmitSource.cs
- RangeIndexFileSink.cs

Cleaned up:
- CheckpointFileType: removed RINDEX_SNAPSHOT enum value
- CheckpointStore: removed PurgeOldCheckpointSnapshots calls
- ReceiveCheckpointHandler: removed RINDEX_SNAPSHOT handling
- ReplicaSyncSession: removed RangeIndex reader registration
- RangeIndexManager.cs: moved replication methods to partial file
- StoreWrapper: removed public RangeIndexManager property
- GarnetServer: reverted clusterEnabled parameter

The RangeIndexManager.Replication.cs partial file is retained as the
separation of AOF replication methods remains useful.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t for backward compatibility

Revert the wire format of CLUSTER SEND_CKPT_FILE_SEGMENT to accept 5 args
(including segmentId) so older primaries on dev can still replicate to
newer replicas. The segmentId is parsed but not used since disk-based
replication now goes through CLUSTER SNAPSHOT_DATA.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@vazois vazois marked this pull request as ready for review May 7, 2026 23:43
Copilot AI review requested due to automatic review settings May 7, 2026 23:43
Contributor

Copilot AI left a comment

Pull request overview

Refactors disk-based replication checkpoint shipping by introducing chunk-based snapshot reader/transmit/source abstractions and a unified CLUSTER SNAPSHOT_DATA receive path, while also adding a mechanism to prevent Tsavorite from internally cleaning up/truncating checkpoints in cluster mode so that Garnet can enforce reader-safe segment truncation.

Changes:

  • Added unified snapshot shipping pipeline (ISnapshotReader/ISnapshotDataSource/ISnapshotTransmitSource) and unified receive handler (ProcessSnapshotData) + new CLUSTER SNAPSHOT_DATA command.
  • Introduced ICheckpointManager.PerformInternalCleanup to disable Tsavorite internal checkpoint cleanup in cluster mode; moved safe hlog truncation to CheckpointStore.DeleteOutdatedCheckpoints.
  • Added/updated cluster + ACL tests and command metadata to cover the new internal command and cleanup behavior.

Reviewed changes

Copilot reviewed 35 out of 36 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| test/Garnet.test/Resp/ACL/RespCommandTests.cs | Adds ACL coverage for CLUSTER SNAPSHOT_DATA when cluster is disabled. |
| test/Garnet.test.cluster/ReplicationTests/ClusterReplicationBaseTests.cs | Adds replication test validating hlog segment cleanup during concurrent sync. |
| test/Garnet.test.cluster/ClusterNegativeTests.cs | Updates wrong-arity coverage for replication-related CLUSTER subcommands. |
| playground/CommandInfoUpdater/SupportedCommand.cs | Registers `CLUSTER |
| playground/CommandInfoUpdater/GarnetCommandsInfo.json | Adds generated command metadata entry for CLUSTER_SNAPSHOT_DATA. |
| libs/storage/Tsavorite/cs/src/core/Index/Recovery/ICheckpointManager.cs | Adds PerformInternalCleanup switch to checkpoint manager contract. |
| libs/storage/Tsavorite/cs/src/core/Index/Recovery/Checkpoint.cs | Gates Tsavorite cleanup/truncation on PerformInternalCleanup. |
| libs/storage/Tsavorite/cs/src/core/Index/CheckpointManagement/DeviceLogCommitCheckpointManager.cs | Implements PerformInternalCleanup defaulting to true. |
| libs/server/Resp/RangeIndex/RangeIndexManager.Replication.cs | Moves RangeIndex AOF replication methods into a new partial file. |
| libs/server/Resp/RangeIndex/RangeIndexManager.cs | Removes replication/AOF code now hosted in partial. |
| libs/server/Resp/Parser/RespCommand.cs | Adds CLUSTER_SNAPSHOT_DATA enum + parsing. |
| libs/server/Resp/CmdStrings.cs | Adds snapshot_data command string. |
| libs/resources/RespCommandsInfo.json | Adds resource metadata for `CLUSTER |
| libs/cluster/Session/RespClusterReplicationCommands.cs | Adds network handler for CLUSTER SNAPSHOT_DATA; refactors existing ckpt receive paths to new handler APIs. |
| libs/cluster/Session/ClusterCommands.cs | Wires CLUSTER_SNAPSHOT_DATA dispatch. |
| libs/cluster/Server/Replication/ReplicaOps/ReplicaDiskbasedSync.cs | Renames/adjusts device factory method used by receiver (CreateCheckpointDevice). |
| libs/cluster/Server/Replication/ReplicaOps/ReceiveCheckpointHandler.cs | Removes old receive handler implementation (superseded). |
| libs/cluster/Server/Replication/ReplicaOps/DiskbasedReplication/ReceiveCheckpointHandler.cs | Adds new unified receive handler with sink abstractions. |
| libs/cluster/Server/Replication/ReplicaOps/DiskbasedReplication/MetadataDataSink.cs | Implements metadata sink that commits checkpoint metadata to manager. |
| libs/cluster/Server/Replication/ReplicaOps/DiskbasedReplication/ISnapshotDataSink.cs | Defines sink interface for chunk-based writes. |
| libs/cluster/Server/Replication/ReplicaOps/DiskbasedReplication/FileDataSink.cs | Implements device-backed sink for checkpoint file segment writes. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/TsavoriteMetadataTransmitSource.cs | Sends metadata via unified SNAPSHOT_DATA protocol. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/TsavoriteMetadataSource.cs | Implements in-memory metadata data source. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/TsavoriteCheckpointReader.cs | Implements Tsavorite-backed snapshot reader producing file+metadata transmit sources. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/SnapshotTransmissionDriver.cs | Orchestrates sending checkpoint data across sources/readers. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/ReplicaSyncSession.cs | Refactors primary-side replication sync session to use new transmission driver. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/ISnapshotTransmitSource.cs | Defines transmit source interface for sending data via GarnetClientSession. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/ISnapshotReader.cs | Defines snapshot reader interface returning transmit sources. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/ISnapshotDataSource.cs | Defines chunk-based data source interface. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/FileTransmitSource.cs | Implements file segment transmission via unified SNAPSHOT_DATA. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/FileDataSource.cs | Implements device-backed chunk reads for file transmission. |
| libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/DataSourceReadResult.cs | Adds common result struct for device-backed vs memory-backed reads. |
| libs/cluster/Server/Replication/GarnetClusterCheckpointManager.cs | Disables Tsavorite internal cleanup in cluster mode (PerformInternalCleanup=false). |
| libs/cluster/Server/Replication/CheckpointStore.cs | Adds reader-safe hlog begin-address shifting during checkpoint pruning. |
| libs/cluster/Garnet.cluster.csproj | Fixes PackageReference formatting. |
| libs/client/ClientSession/GarnetClientSessionReplicationExtensions.cs | Adds client helper to send CLUSTER SNAPSHOT_DATA. |

Comments suppressed due to low confidence (1)

libs/cluster/Server/Replication/PrimaryOps/DiskbasedReplication/ReplicaSyncSession.cs:142

  • Local variable name tsavoriteSnaphotReader is misspelled ("Snaphot"). Rename to tsavoriteSnapshotReader (or similar) for clarity and to avoid propagating the typo into future code changes.

@vazois vazois changed the base branch from dev to main May 11, 2026 22:37
vazois and others added 6 commits May 14, 2026 10:59
 - IOCallbackContext — just a SectorAlignedMemory buffer field that roots the pinned byte[] for GC safety while IO is in-flight
 - Callback — unconditionally releases the semaphore (with ObjectDisposedException guard)
 - Caller — single WaitAsync call; on timeout throws, on cancellation propagates naturally; buffer abandoned in both cases (GC collects after IO completes)
 - IO error — just throws; buffer abandoned same as other error paths

Rationale: Timeout/cancellation/IO-error are all catastrophic for the replication session. No subsequent reads will use the shared semaphore, so stale counts are harmless. Abandoned buffers stay alive via the IOCallbackContext reference chain until the callback fires, then GC collects them.
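
The abandonment pattern described above can be approximated in Python as follows. This is a sketch under stated assumptions rather than the FileDataSource code: `ChunkReader`, the list-based buffer pool, and the `start_io(buffer, callback)` shape are hypothetical, a `threading.Semaphore` stands in for the async semaphore, and there is no equivalent of the ObjectDisposedException guard.

```python
import threading
from typing import Callable, List, Optional


class IOCallbackContext:
    """Keeps a reference to the in-flight buffer so it is not collected early."""

    def __init__(self) -> None:
        self.buffer: Optional[bytearray] = None


class ChunkReader:
    """Hypothetical reader that waits for one asynchronous IO at a time."""

    def __init__(self, pool: List[bytearray], timeout: float) -> None:
        self.pool = pool
        self.timeout = timeout
        self.semaphore = threading.Semaphore(0)
        self.context = IOCallbackContext()  # one shared instance, reused per IO

    def _callback(self) -> None:
        # Always signal completion, even if the caller already gave up.
        self.semaphore.release()

    def read(self, start_io: Callable[[bytearray, Callable[[], None]], None]) -> bytes:
        buffer = self.pool.pop() if self.pool else bytearray(4)
        self.context.buffer = buffer
        start_io(buffer, self._callback)
        if not self.semaphore.acquire(timeout=self.timeout):
            # Timed out: the IO may still write into the buffer, so abandon it
            # instead of returning it to the pool.
            raise TimeoutError("read did not complete in time")
        data = bytes(buffer)
        self.pool.append(buffer)  # only the successful path recycles the buffer
        return data


def fast_io(buffer: bytearray, done: Callable[[], None]) -> None:
    buffer[0:2] = b"ok"
    done()


def slow_io(buffer: bytearray, done: Callable[[], None]) -> None:
    threading.Timer(0.3, done).start()  # completes after the caller times out


fast_reader = ChunkReader([bytearray(4)], timeout=1.0)
fast_result = fast_reader.read(fast_io)

slow_reader = ChunkReader([bytearray(4)], timeout=0.05)
try:
    slow_reader.read(slow_io)
    timed_out = False
except TimeoutError:
    timed_out = True
```

After the timeout the late callback still releases the semaphore, leaving a stale count; as the rationale notes, that is harmless when the session is torn down and the reader is never used again.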

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Apply the same buffer abandonment pattern from FileDataSource to FileDataSink:
 - On timeout/cancellation, buffer is abandoned (not returned to pool); GC collects after IO completes
 - Callback unconditionally releases the semaphore (with ObjectDisposedException guard)
 - Buffer is only returned to the pool on the successful path

Extract IOCallbackContext into a shared class under Replication/ and reuse a single
instance per FileDataSource/FileDataSink instead of allocating per IO.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@vazois vazois merged commit 5e1c098 into main May 14, 2026
194 checks passed
@vazois vazois deleted the vazois/fsync-ref branch May 14, 2026 23:33

3 participants