Port diskless replication dedup fix to new AofSyncDriver architecture#1794
Merged
Conversation
Two related bugs in AofTaskStore caused unbounded accumulation of AofSyncTaskInfo tasks on clusters using diskless replication. 1. TryAddReplicationTasks (the diskless path) compared existing tasks against rss.replicaNodeId for dedup. ReplicaSyncSession has two node ID fields: replicaNodeId (set by the disk-based constructor, null for diskless) and replicaSyncMetadata.originNodeId (set by the diskless constructor). The AofSyncTaskInfo was created with originNodeId, but dedup compared against the null replicaNodeId — so it never matched and every call added a new task. Over time numTasks grew unboundedly, inflating the RoleInfo[] from INFO REPLICATION until the response exceeded the network output buffer. Fix: use rss.replicaSyncMetadata.originNodeId in the dedup comparison. The singular TryAddReplicationTask (disk-based and CLUSTER AOFSYNC) is unaffected. 2. AofSyncTaskInfo.Dispose() did not dispose its owned GarnetClientSession. When ReplicaSyncTaskAsync is running, CTS cancellation causes it to exit and the finally block cleans up. But when ReplicaSyncTaskAsync has not yet started (e.g. the task fails to be added), Dispose() is the only cleanup path and the session was leaked. Fix: add garnetClient?.Dispose() to AofSyncTaskInfo.Dispose() and remove the redundant call from ReplicaSyncTaskAsync's finally block, giving a single disposal site. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Clarify that the client disposal is no longer happening in the finally block. Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
badrishc
approved these changes
May 14, 2026
vazois
approved these changes
May 14, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why is this change being made?
Ports Simon Nattress's diskless-replication fix (originally landed on
release/v1in PR #1556's predecessor architecture) ontomain, where theAofSyncTaskInfo/AofTaskStorefiles were replaced byAofSyncDriver/AofSyncDriverStore/AofSyncTaskin #1556 (Parallel Replication).On
main, the dedup bug still exists in the new code:TryAddReplicationDrivers(the diskless path) compares existing sync drivers againstrss.replicaNodeId, but for diskless replication that field is null — the disklessReplicaSyncSessioncarries the node ID onreplicaSyncMetadata.originNodeIdinstead. As a result dedup never matches, every call adds a new driver,numDriversgrows unboundedly, and the resultingRoleInfo[]fromINFO REPLICATIONeventually exceeds the network output buffer.The companion
GarnetClientSession-leak fix from the original release/v1 commits does not need porting: the newAofSyncTask.Dispose()(introduced in #1556) already disposesgarnetClientdirectly, so that code path was correct from the start.What changed?
libs/cluster/Server/Replication/PrimaryOps/AofOperations/AofSyncDriverStore.cs— InTryAddReplicationDrivers(diskless path), comparesyncDriver.RemoteNodeIdagainstrss.replicaSyncMetadata.originNodeIdinstead ofrss.replicaNodeId. The singularTryAddReplicationDriver(disk-based /CLUSTER AOFSYNC) is unaffected.libs/cluster/Server/Replication/PrimaryOps/AofOperations/AofSyncDriver.cs— Update the terminal log message inRunAsync'sfinallyblock to remove the misleading "client disposed" wording; the client is no longer disposed in this finally (it's disposed byAofSyncTask.Dispose()).The three test-only commits from the original release/v1 series (
1f2db6c6a,36836f11f,fae32e4656) were skipped — they modifyAofSyncTaskInfoTests.cs, which targets the deletedAofSyncTaskInfoAPI and does not apply to the new architecture.How was this validated?
dotnet build libs/cluster/Garnet.cluster.csproj -c Debug— 0 warnings, 0 errorsdotnet test test/Garnet.test.cluster -f net10.0 -c Debug --filter "FullyQualifiedName~Replication"— 283 passed, 0 failed, 123 skipped (14m 15s)