Rework SimpleSetupCluster in cluster tests, and longer timeouts for select tests#1762
Merged
kevin-montrose merged 9 commits intomainfrom May 1, 2026
Merged
Conversation
…ughout; async's a few more tests; adds some null checks
…elAfter] so ClusterTestContext knows of it
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves cluster test reliability by making cluster setup more deterministic/async-aware and by extending timeouts for a few long-running cluster/vector migration tests. It also adds additional “wait until stable” checks during SimpleSetupCluster and updates several test helpers to use async StackExchange.Redis APIs.
Changes:
- Reworked
ClusterTestUtils.SimpleSetupClusterinto an async implementation with additional stabilization checks between meet/slot/replication steps. - Updated multiple cluster tests to use async setup and increased timeouts (via
[CancelAfter(120_000)]) for select slow/flaky tests. - Converted several ClusterTestUtils helpers (connect/reconnect, nodes parsing, replication polling) to async variants and updated callsites.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test/Garnet.test.cluster/VectorSets/ClusterVectorSetTests.cs | Converts several tests and cluster setup wrapper to async; adds longer [CancelAfter] timeouts for select tests. |
| test/Garnet.test.cluster/ReplicationTests/ClusterReplicationBaseTests.cs | Converts select tests/helpers to async waiting APIs (e.g., connected replica polling). |
| test/Garnet.test.cluster/ClusterTestUtils.cs | Core change: async SimpleSetupClusterAsync, new async helpers, and additional stabilization checks during cluster formation. |
| test/Garnet.test.cluster/ClusterTestContext.cs | Uses NUnit timeout metadata to size the context CTS; makes replica attach/sync helper async-aware. |
| test/Garnet.test.cluster/ClusterNegativeTests.cs | Updates tests to await async connected-replica polling. |
| test/Garnet.test.cluster/ClusterMigrateTests.cs | Updates one test to await async “pick any other node” helper. |
| libs/cluster/Session/RespClusterBasicCommands.cs | Wraps CLUSTER GOSSIP handling in try/catch with logging; retains epoch-release pattern around merge. |
| libs/cluster/Server/Replication/ReplicationManager.cs | Makes logging null-safe (logger?.) in resync path. |
Comments suppressed due to low confidence (2)
test/Garnet.test.cluster/ClusterTestUtils.cs:2348
ClusterNodesAsync(IPEndPoint ...)is markedasyncbut doesn'tawaitanything and just calls the synchronousserver.ClusterNodes(). WithTreatWarningsAsErrors=true, this will produce CS1998 and fail the build. Either call/await the real async API (e.g.,await server.ClusterNodesAsync()if available) or removeasyncand return a completed Task from the synchronous result.
public async Task<ClusterConfiguration> ClusterNodesAsync(IPEndPoint endPoint, ILogger logger = null)
{
try
{
var server = redis.GetServer(endPoint);
return server.ClusterNodes();
}
test/Garnet.test.cluster/ClusterTestContext.cs:813
- The loops that wait for replica recovery/AOF sync and validate replica data use
for (var i = primary_count; i < replica_count; i++), butreplica_countis a count, not the upper node index. Forprimary_count=1, replica_count=1this loop never runs, so the method can return without waiting/validating any replicas. Usei < primary_count + replica_count(matching the earlier loops in this method) for the recovery/AOF sync and validation loops as well.
// Wait for recovery and AofSync
for (var i = primary_count; i < replica_count; i++)
{
clusterTestUtils.WaitForReplicaRecovery(i, logger);
clusterTestUtils.WaitForReplicaAofSync(0, i, logger);
}
await clusterTestUtils.WaitForConnectedReplicaCountAsync(0, replica_count, logger: logger).ConfigureAwait(false);
// Validate data on replicas
for (var i = primary_count; i < replica_count; i++)
{
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
vazois
approved these changes
May 1, 2026
badrishc
approved these changes
May 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Reworks
ClusterTestUtils.SimpleSetupClusterto be async and to pause and very state between each step - this increases reliability and help with diagnosing failures as we know the cluster is in a stable state before beginning the "real" test.Also increases timeouts (to 2 minutes) for
RepeatedCreateDeleteAsync,MigrateVectorSetWhileModifyingAsync, andMigrateVectorStressAsync- these tests occasionally take nearly 1 minute in Debug builds on CI machines, presumably some failures just legitimately ran out of time.As part of increasing timeouts, makes
ClusterTestContextaware of[CancelAfter]and uses its value during setup if available.TODO:
devversion ([DEV] Rework SimpleSetupCluster in cluster tests, and longer timeouts for select tests #1763)