Port diskless replication dedup fix to new AofSyncDriver architecture #1794

Merged: vazois merged 3 commits into main from users/matrembl/simon-aoffixv2 on May 14, 2026

@Mathos1432
Contributor

Why is this change being made?

Ports Simon Nattress's diskless-replication fix onto main. The fix originally landed on release/v1 against the architecture that predates #1556; on main, #1556 (Parallel Replication) replaced the AofSyncTaskInfo/AofTaskStore files with AofSyncDriver/AofSyncDriverStore/AofSyncTask.

On main, the dedup bug still exists in the new code: TryAddReplicationDrivers (the diskless path) compares existing sync drivers against rss.replicaNodeId, but for diskless replication that field is null — the diskless ReplicaSyncSession carries the node ID on replicaSyncMetadata.originNodeId instead. As a result dedup never matches, every call adds a new driver, numDrivers grows unboundedly, and the resulting RoleInfo[] from INFO REPLICATION eventually exceeds the network output buffer.
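
The mismatch described above can be sketched in isolation. The types below are hypothetical, simplified stand-ins and not the actual Garnet classes; only the member names replicaNodeId, replicaSyncMetadata.originNodeId, and RemoteNodeId are taken from this description. What the real TryAddReplicationDrivers does when it finds a match is not shown here; this sketch assumes it simply skips the add.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the diskless sync metadata.
public sealed class SyncMetadata
{
    public string originNodeId;
}

// Hypothetical stand-in for ReplicaSyncSession with its two node ID fields.
public sealed class ReplicaSyncSession
{
    public string replicaNodeId;             // null on the diskless path
    public SyncMetadata replicaSyncMetadata; // carries the node ID on the diskless path
}

// Hypothetical stand-in for a per-replica sync driver.
public sealed class SyncDriver
{
    public string RemoteNodeId { get; }
    public SyncDriver(string remoteNodeId) => RemoteNodeId = remoteNodeId;
}

public static class DedupSketch
{
    // Buggy variant: compares against replicaNodeId, which is null for
    // diskless sessions, so no existing driver ever matches.
    public static void AddBuggy(List<SyncDriver> drivers, ReplicaSyncSession rss)
    {
        if (!drivers.Any(d => d.RemoteNodeId == rss.replicaNodeId))
            drivers.Add(new SyncDriver(rss.replicaSyncMetadata.originNodeId));
    }

    // Fixed variant: compares against the same field the driver was created with.
    public static void AddFixed(List<SyncDriver> drivers, ReplicaSyncSession rss)
    {
        var nodeId = rss.replicaSyncMetadata.originNodeId;
        if (!drivers.Any(d => d.RemoteNodeId == nodeId))
            drivers.Add(new SyncDriver(nodeId));
    }
}
```

In this sketch, calling AddBuggy five times with the same diskless session leaves five drivers in the list, while AddFixed leaves one.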

The companion GarnetClientSession-leak fix from the original release/v1 commits does not need porting: the new AofSyncTask.Dispose() (introduced in #1556) already disposes garnetClient directly, so that code path was correct from the start.
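
A minimal sketch of the single-disposal-site pattern referred to above, assuming hypothetical FakeClient and SyncTaskSketch types in place of GarnetClientSession and AofSyncTask. The idempotence guard is an assumption of this sketch, not something the PR states about the real code.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the client session owned by a sync task.
public sealed class FakeClient : IDisposable
{
    public int DisposeCount { get; private set; }
    public void Dispose() => DisposeCount++;
}

// Hypothetical owner type: Dispose() is the only place the client is released.
public sealed class SyncTaskSketch : IDisposable
{
    private readonly CancellationTokenSource cts = new CancellationTokenSource();
    private readonly FakeClient client;
    private int disposed;

    public SyncTaskSketch(FakeClient client) => this.client = client;

    // The run loop's finally block only logs; it does not dispose the client.
    public async Task RunAsync()
    {
        try
        {
            await Task.Delay(Timeout.Infinite, cts.Token);
        }
        catch (OperationCanceledException)
        {
            // Expected when Dispose() cancels the token.
        }
        finally
        {
            Console.WriteLine("sync task exited");
        }
    }

    // Works whether or not RunAsync was ever started, so a task that
    // failed to be added still releases its client.
    public void Dispose()
    {
        if (Interlocked.Exchange(ref disposed, 1) != 0) return; // assumed guard
        cts.Cancel();
        client.Dispose();
        cts.Dispose();
    }
}
```

Because cleanup does not depend on the run loop having started, disposing a task that was never run still releases the client exactly once.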

What changed?

  • libs/cluster/Server/Replication/PrimaryOps/AofOperations/AofSyncDriverStore.cs — In TryAddReplicationDrivers (diskless path), compare syncDriver.RemoteNodeId against rss.replicaSyncMetadata.originNodeId instead of rss.replicaNodeId. The singular TryAddReplicationDriver (disk-based / CLUSTER AOFSYNC) is unaffected.
  • libs/cluster/Server/Replication/PrimaryOps/AofOperations/AofSyncDriver.cs — Update the terminal log message in RunAsync's finally block to remove the misleading "client disposed" wording; the client is no longer disposed in this finally (it's disposed by AofSyncTask.Dispose()).

The three test-only commits from the original release/v1 series (1f2db6c6a, 36836f11f, fae32e4656) were skipped — they modify AofSyncTaskInfoTests.cs, which targets the deleted AofSyncTaskInfo API and does not apply to the new architecture.

How was this validated?

  • Local build: dotnet build libs/cluster/Garnet.cluster.csproj -c Debug — 0 warnings, 0 errors
  • Local cluster replication tests: dotnet test test/Garnet.test.cluster -f net10.0 -c Debug --filter "FullyQualifiedName~Replication" — 283 passed, 0 failed, 123 skipped (14m 15s)
  • Integration pipeline: link to be added

nattress and others added 2 commits May 12, 2026 15:01
Two related bugs in AofTaskStore caused unbounded accumulation
of AofSyncTaskInfo tasks on clusters using diskless replication.

1. TryAddReplicationTasks (the diskless path) compared existing
   tasks against rss.replicaNodeId for dedup. ReplicaSyncSession
   has two node ID fields: replicaNodeId (set by the disk-based
   constructor, null for diskless) and replicaSyncMetadata.originNodeId
   (set by the diskless constructor). The AofSyncTaskInfo was
   created with originNodeId, but dedup compared against the null
   replicaNodeId — so it never matched and every call added a new
   task. Over time numTasks grew unboundedly, inflating the
   RoleInfo[] from INFO REPLICATION until the response exceeded
   the network output buffer.

   Fix: use rss.replicaSyncMetadata.originNodeId in the dedup
   comparison. The singular TryAddReplicationTask (disk-based
   and CLUSTER AOFSYNC) is unaffected.

2. AofSyncTaskInfo.Dispose() did not dispose its owned
   GarnetClientSession. When ReplicaSyncTaskAsync is running,
   CTS cancellation causes it to exit and the finally block
   cleans up. But when ReplicaSyncTaskAsync has not yet started
   (e.g. the task fails to be added), Dispose() is the only
   cleanup path and the session was leaked.

   Fix: add garnetClient?.Dispose() to AofSyncTaskInfo.Dispose()
   and remove the redundant call from ReplicaSyncTaskAsync's
   finally block, giving a single disposal site.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify that the client disposal is no longer happening in the finally block.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@Mathos1432 Mathos1432 marked this pull request as ready for review May 14, 2026 17:38
Copilot AI review requested due to automatic review settings May 14, 2026 17:38
@vazois vazois merged commit 5c78314 into main May 14, 2026
143 of 144 checks passed
@vazois vazois deleted the users/matrembl/simon-aoffixv2 branch May 14, 2026 22:27

4 participants