
Rework SimpleSetupCluster in cluster tests, and longer timeouts for select tests#1762

Merged
kevin-montrose merged 9 commits into main from users/kmontrose/testSimpleClusterRework
May 1, 2026

@kevin-montrose kevin-montrose commented May 1, 2026

Reworks ClusterTestUtils.SimpleSetupCluster to be async and to pause and verify state between each step. This increases reliability and helps with diagnosing failures, as we know the cluster is in a stable state before beginning the "real" test.
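A minimal sketch of the "apply a step, then wait until the cluster looks stable" pattern described above. It is not the code in this PR: `WaitUntilStableAsync`, `RunStepsAsync`, the step names, and the 10 ms poll interval are all hypothetical, and the stability checks are simulated with a counter rather than real cluster queries.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: polls `isStable` until it returns true or the token is cancelled.
static async Task WaitUntilStableAsync(Func<Task<bool>> isStable, TimeSpan pollInterval, CancellationToken ct)
{
    while (!await isStable())
    {
        // Task.Delay throws an OperationCanceledException if the token fires while waiting.
        await Task.Delay(pollInterval, ct);
    }
}

// Hypothetical driver: applies each setup step, then waits for that step's check to pass.
static async Task RunStepsAsync(
    IEnumerable<(string Name, Func<Task> Apply, Func<Task<bool>> IsStable)> steps,
    CancellationToken ct)
{
    foreach (var step in steps)
    {
        await step.Apply();
        try
        {
            await WaitUntilStableAsync(step.IsStable, TimeSpan.FromMilliseconds(10), ct);
        }
        catch (OperationCanceledException)
        {
            // Naming the step makes a setup failure easier to diagnose than a bare timeout.
            throw new TimeoutException($"Cluster did not stabilize after step '{step.Name}'");
        }
    }
}

// Simulated usage: the "meet" step reports stable on the third poll.
var polls = 0;
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
await RunStepsAsync(new (string, Func<Task>, Func<Task<bool>>)[]
{
    ("meet", () => Task.CompletedTask, () => Task.FromResult(++polls >= 3)),
}, cts.Token);
Console.WriteLine($"stable after {polls} polls"); // prints "stable after 3 polls"
```

Because each wait is tied to a named step, a failure during setup points at the step that never settled instead of surfacing later inside the test body.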

Also increases timeouts (to 2 minutes) for RepeatedCreateDeleteAsync, MigrateVectorSetWhileModifyingAsync, and MigrateVectorStressAsync. These tests occasionally take nearly 1 minute in Debug builds on CI machines, so some past failures presumably just ran out of time.

As part of increasing timeouts, makes ClusterTestContext aware of [CancelAfter] and uses its value during setup if available.
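As a rough illustration of "use the [CancelAfter] value during setup if available", the sketch below sizes a `CancellationTokenSource` from an optional timeout. How the value is read from NUnit is deliberately left out; `ResolveSetupTimeout` and the 60-second fallback are assumptions made for the example, not values taken from ClusterTestContext.

```csharp
using System;
using System.Threading;

// Hypothetical helper: prefer the test's declared timeout, else fall back to a default.
static TimeSpan ResolveSetupTimeout(int? cancelAfterMs, TimeSpan fallback)
    => cancelAfterMs is > 0 ? TimeSpan.FromMilliseconds(cancelAfterMs.Value) : fallback;

// A test marked [CancelAfter(120_000)] would yield a 2 minute budget.
var withAttribute = ResolveSetupTimeout(120_000, TimeSpan.FromSeconds(60));

// A test without the attribute gets the (assumed) fallback.
var withoutAttribute = ResolveSetupTimeout(null, TimeSpan.FromSeconds(60));

// The context-wide token source is then sized from the resolved value.
using var cts = new CancellationTokenSource(withAttribute);
Console.WriteLine($"{withAttribute.TotalSeconds}s / {withoutAttribute.TotalSeconds}s");
```

Sizing the setup budget from the same number as the test timeout keeps a long-running test from being cut short by a shorter, unrelated setup timeout.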

TODO:

Copilot AI review requested due to automatic review settings May 1, 2026 14:41

Copilot AI left a comment

Pull request overview

This PR improves cluster test reliability by making cluster setup more deterministic/async-aware and by extending timeouts for a few long-running cluster/vector migration tests. It also adds additional “wait until stable” checks during SimpleSetupCluster and updates several test helpers to use async StackExchange.Redis APIs.

Changes:

  • Reworked ClusterTestUtils.SimpleSetupCluster into an async implementation with additional stabilization checks between meet/slot/replication steps.
  • Updated multiple cluster tests to use async setup and increased timeouts (via [CancelAfter(120_000)]) for select slow/flaky tests.
  • Converted several ClusterTestUtils helpers (connect/reconnect, nodes parsing, replication polling) to async variants and updated callsites.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/Garnet.test.cluster/VectorSets/ClusterVectorSetTests.cs | Converts several tests and cluster setup wrapper to async; adds longer [CancelAfter] timeouts for select tests. |
| test/Garnet.test.cluster/ReplicationTests/ClusterReplicationBaseTests.cs | Converts select tests/helpers to async waiting APIs (e.g., connected replica polling). |
| test/Garnet.test.cluster/ClusterTestUtils.cs | Core change: async SimpleSetupClusterAsync, new async helpers, and additional stabilization checks during cluster formation. |
| test/Garnet.test.cluster/ClusterTestContext.cs | Uses NUnit timeout metadata to size the context CTS; makes replica attach/sync helper async-aware. |
| test/Garnet.test.cluster/ClusterNegativeTests.cs | Updates tests to await async connected-replica polling. |
| test/Garnet.test.cluster/ClusterMigrateTests.cs | Updates one test to await async “pick any other node” helper. |
| libs/cluster/Session/RespClusterBasicCommands.cs | Wraps CLUSTER GOSSIP handling in try/catch with logging; retains epoch-release pattern around merge. |
| libs/cluster/Server/Replication/ReplicationManager.cs | Makes logging null-safe (logger?.) in resync path. |
Comments suppressed due to low confidence (2)

test/Garnet.test.cluster/ClusterTestUtils.cs:2348

  • ClusterNodesAsync(IPEndPoint ...) is marked async but doesn't await anything and just calls the synchronous server.ClusterNodes(). With TreatWarningsAsErrors=true, this will produce CS1998 and fail the build. Either call/await the real async API (e.g., await server.ClusterNodesAsync() if available) or remove async and return a completed Task from the synchronous result.
        public async Task<ClusterConfiguration> ClusterNodesAsync(IPEndPoint endPoint, ILogger logger = null)
        {
            try
            {
                var server = redis.GetServer(endPoint);
                return server.ClusterNodes();
            }
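The second option mentioned in the comment above (dropping `async` and returning a completed task) could look roughly like this. `FetchClusterNodesSync` is a stand-in that returns a plain string; it is not the StackExchange.Redis call from the snippet, and the method names are hypothetical.

```csharp
using System;
using System.Threading.Tasks;

// Stand-in for the synchronous call in the snippet above; not the real server API.
static string FetchClusterNodesSync() => "node-a master";

// No `async` keyword here, so there is no await-less async method for CS1998 to flag.
static Task<string> FetchClusterNodesAsync()
{
    try
    {
        // Wrap the synchronous result in an already-completed task.
        return Task.FromResult(FetchClusterNodesSync());
    }
    catch (Exception ex)
    {
        // Surface failures through the task, as an async method would.
        return Task.FromException<string>(ex);
    }
}

Console.WriteLine(await FetchClusterNodesAsync()); // prints "node-a master"
```

Callers can still `await` the method, so call sites do not change; the trade-off is that the work runs synchronously on the calling thread.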

test/Garnet.test.cluster/ClusterTestContext.cs:813

  • The loops that wait for replica recovery/AOF sync and validate replica data use for (var i = primary_count; i < replica_count; i++), but replica_count is a count, not the upper node index. For primary_count=1, replica_count=1 this loop never runs, so the method can return without waiting/validating any replicas. Use i < primary_count + replica_count (matching the earlier loops in this method) for the recovery/AOF sync and validation loops as well.
            // Wait for recovery and AofSync
            for (var i = primary_count; i < replica_count; i++)
            {
                clusterTestUtils.WaitForReplicaRecovery(i, logger);
                clusterTestUtils.WaitForReplicaAofSync(0, i, logger);
            }

            await clusterTestUtils.WaitForConnectedReplicaCountAsync(0, replica_count, logger: logger).ConfigureAwait(false);

            // Validate data on replicas
            for (var i = primary_count; i < replica_count; i++)
            {
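To make the off-by-bound issue described above concrete, this small sketch lists which node indices each loop bound would visit. `ReplicaIndices` is a hypothetical helper written only for this illustration; it mirrors the loop header from the snippet and nothing else.

```csharp
using System;
using System.Collections.Generic;

// Returns the node indices the loop would visit for a given upper bound choice.
static List<int> ReplicaIndices(int primaryCount, int replicaCount, bool useSuggestedBound)
{
    // Bound from the snippet: replicaCount. Suggested bound: primaryCount + replicaCount.
    var upper = useSuggestedBound ? primaryCount + replicaCount : replicaCount;
    var visited = new List<int>();
    for (var i = primaryCount; i < upper; i++)
    {
        visited.Add(i);
    }
    return visited;
}

// With 1 primary and 1 replica, the replica lives at node index 1.
Console.WriteLine(ReplicaIndices(1, 1, useSuggestedBound: false).Count); // prints 0
Console.WriteLine(ReplicaIndices(1, 1, useSuggestedBound: true).Count);  // prints 1
```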

Comment thread test/Garnet.test.cluster/ClusterTestUtils.cs Outdated
Comment thread test/Garnet.test.cluster/ClusterTestUtils.cs
kevin-montrose merged commit 38cac99 into main May 1, 2026
47 of 49 checks passed
kevin-montrose deleted the users/kmontrose/testSimpleClusterRework branch May 1, 2026 22:02
4 participants