
Conversation

@Julien-Ben Julien-Ben (Collaborator) commented Oct 8, 2025

Fix flaky e2e_multi_cluster_sharded_snippets test

Problem

The e2e_multi_cluster_sharded_snippets test fails intermittently when the Kubernetes API server times out during resource creation.
Example run: https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_12f405afd0f823091430f0be8f4ac21d87a9559c_25_10_05_20_58_10/files?execution=0&sorts=STATUS%3AASC

What I noticed in my investigation:

  1. The test deploys 5 sharded MongoDB clusters simultaneously (~75-100 services across 3 clusters)
  2. Around 7-8 minutes in, the K8s API server times out on a service update operation
  3. The operator marks the resource as Failed with the error: "the server was unable to return a response in the time allotted, but may still be processing the request"
  4. The test immediately fails
  5. Minutes later, the resource actually reaches Running (the timeout was transient)

Investigation

  • The operator issues hundreds of K8s API operations during reconciliation
  • This overloads the kind cluster's API server
  • The K8s API timeouts are transient: services and pods are created successfully, just more slowly than expected
  • After being marked Failed, resources recover within 4-5 minutes

This Fix

Add two K8s API timeout patterns to the intermediate_events list in mongodb.py (see the sketch after the list):

  • "but may still be processing the request" (server-side timeout)
  • "Client.Timeout exceeded while awaiting headers" (client-side timeout)

Effect:

  • When the operator marks the resource as Failed with a K8s API timeout error, the test skips the failure
  • The test continues waiting for the resource to reach Running
  • The test passes once the resource recovers (which it does)

This is the same pattern used for other transient failures like agent registration timeouts and Ops Manager connection issues.
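
As a rough illustration of the waiting logic, with names that are assumptions rather than the actual helpers in the test framework:

# Hypothetical helper: a Failed phase is treated as transient when its
# status message matches any known intermediate-event pattern.
def is_intermediate_event(failure_message: str) -> bool:
    return any(pattern in failure_message for pattern in intermediate_events)

# Inside the polling loop (illustrative):
#   if phase == Phase.Failed and is_intermediate_event(message):
#       continue  # transient failure: keep waiting for Running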

Proper Fix (Future Work)

The operator should not mark resources as Failed on a K8s API timeout. Instead, it could, for example (see the sketch after the list):

  1. Detect K8s API timeout errors
  2. Retry with exponential backoff
  3. Only mark Failed after multiple consecutive timeouts
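
The operator itself is written in Go, but the retry idea fits in a few lines of Python; every name below is illustrative:

import time

def call_with_backoff(op, is_transient, max_attempts=5, base_delay=1.0):
    """Retry op() on transient errors with exponential backoff; only
    raise (and mark the resource Failed) after max_attempts consecutive
    transient failures."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception as err:
            if not is_transient(err) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...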

Proof of work

Ran 4 patches to check for flakiness after the fix:

  1. Patch 1
  2. Patch 2
  3. Patch 3
  4. Patch 4

All patches reached Running despite intermediate failures like:

[2025/10/08 15:03:38.199] DEBUG    2025-10-08 13:03:38,198 [mongodb_utils_state]  Found intermediate event in failure: Client.Timeout exceeded while awaiting headers in Failed to create configmap: a-1759927824-grtlr6pj55z/pod-template-shards-0-hostname-override in cluster: kind-e2e-cluster-1, err: Put "https://10.97.0.1/api/v1/namespaces/a-1759927824-grtlr6pj55z/configmaps/pod-template-shards-0-hostname-override?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers). Skipping the failure state

The test now properly skips these transient API timeout failures and waits for resources to recover.

backup_minio tests are failing, but they are failing on many other branches too.

@Julien-Ben Julien-Ben added the skip-changelog label Oct 8, 2025

github-actions bot commented Oct 8, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.5.0 Release Notes

New Features

  • Improve automation agent certificate rotation: the agent now restarts automatically when its certificate is renewed, allowing seamless certificate updates without manual Pod restarts.

Bug Fixes

  • MongoDBMultiCluster: fix a resource getting stuck in the Pending state when any clusterSpecList item has 0 members. A value of 0 members is now handled correctly, as it is in the MongoDB resource.
  • MultiClusterSharded: blocked removing a non-zero-member cluster from a MongoDB resource. This prevents scaling down a member cluster when its current configuration is unavailable, which could lead to unexpected issues.

@Julien-Ben Julien-Ben marked this pull request as ready for review October 9, 2025 07:57
@Julien-Ben Julien-Ben requested a review from a team as a code owner October 9, 2025 07:57
@m1kola m1kola (Contributor) commented Oct 9, 2025

Glad that we will have one less flake :) Thanks a lot for fixing it.

I don't understand why there is this logic in the first place. We are checking "did we get into a Failed state?" while waiting for something else. What is the point?

I think if we drop this logic, it will solve the issue as well. But maybe there is some reason behind it and I'm just not seeing it.

@MaciejKaras MaciejKaras (Collaborator) left a comment

Nice investigation! :)

@Julien-Ben (Collaborator, Author)

@m1kola

I don't understand why there is this logic in the first place. We are checking "did we get into a Failed state?" while waiting for something else. What is the point?

IIUC the aim is to ignore the operator when it puts a resource in the "Failed" phase for a reason that is transient. It happens all the time when waiting for agents to reach the running phase.

@m1kola m1kola (Contributor) commented Oct 9, 2025

IIUC the aim is to ignore the operator when it puts a resource in the "Failed" phase for a reason that is transient. It happens all the time when waiting for agents to reach the running phase.

@Julien-Ben yes, I understand that bit. But why do we raise "Got into Failed phase while waiting for Running" at all? If we are waiting for the running state, we should be ignoring Failed (for any reason) or whatever other state might happen while we are waiting.

Basically why not something like this?

import re
from typing import Optional

# Phase is the resource phase enum from the test framework.

def in_desired_state(
    current_state: Phase,
    desired_state: Phase,
    current_generation: int,
    observed_generation: int,
    current_message: str,
    msg_regexp: Optional[str] = None,
) -> bool:
    """Returns true if the current_state is equal to the desired state; any
    other state (including Failed) is simply treated as "not there yet".
    Optionally checks if the message matches the specified regexp."""
    if current_state is None:
        return False

    if current_generation != observed_generation:
        # We shouldn't check the status further if the Operator hasn't started working on the new spec yet
        return False

    is_in_desired_state = current_state == desired_state
    if msg_regexp is not None:
        regexp = re.compile(msg_regexp)
        is_in_desired_state = (
            is_in_desired_state and current_message is not None and regexp.match(current_message) is not None
        )

    return is_in_desired_state

@Julien-Ben (Collaborator, Author)

@m1kola
Ha, I think this is the difference:

we should be ignoring Failed (for any reason) or whatever other state might happen while we are waiting

In most cases it's better to fail immediately rather than always waiting for the timeout.
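
To make the trade-off concrete, a hypothetical sketch (reusing the illustrative is_intermediate_event helper from above, not the framework's real code):

# Fail fast on genuine Failed states, but keep waiting on transient ones.
def assert_not_hard_failed(phase: Phase, message: str) -> None:
    """Raise immediately on a genuine Failed state; let transient
    (intermediate-event) failures pass so the wait loop keeps polling."""
    if phase == Phase.Failed and not is_intermediate_event(message):
        raise AssertionError("Got into Failed phase while waiting for Running")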

@Julien-Ben Julien-Ben merged commit 6538475 into master Oct 9, 2025
29 of 37 checks passed
@Julien-Ben Julien-Ben deleted the sharded-snippets-flakiness-investigation branch October 9, 2025 15:38