
Conversation

@Julien-Ben Julien-Ben (Collaborator) commented Oct 8, 2025

Fix flaky e2e_multi_cluster_sharded_snippets test

Problem

The e2e_multi_cluster_sharded_snippets test fails intermittently when the Kubernetes API server times out during resource creation.
Example run: https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_12f405afd0f823091430f0be8f4ac21d87a9559c_25_10_05_20_58_10/files?execution=0&sorts=STATUS%3AASC

What I noticed in my investigation:

  1. The test deploys 5 sharded MongoDB clusters simultaneously (~75-100 services across 3 clusters)
  2. Around 7-8 minutes in, the K8s API server times out on a service update operation
  3. The operator marks the resource as Failed with the error: "the server was unable to return a response in the time allotted, but may still be processing the request"
  4. The test immediately fails
  5. Minutes later, the resource actually reaches Running (the timeout was transient)

Investigation

  • The operator issues hundreds of K8s API operations during reconciliation
  • This overloads the kind cluster's API server
  • The K8s API timeouts are transient: services and pods are created successfully, just more slowly than expected
  • After being marked Failed, resources recover within 4-5 minutes

This Fix

Add two K8s API timeout patterns to the intermediate_events list in mongodb.py (see the sketch after the list):

  • "but may still be processing the request" (server-side timeout)
  • "Client.Timeout exceeded while awaiting headers" (client-side timeout)

Effect:

  • When the operator marks the resource as Failed with a K8s API timeout error, the test skips the failure
  • The test continues waiting for the resource to reach Running
  • The test passes once the resource recovers (which it does)

This is the same pattern used for other transient failures like agent registration timeouts and Ops Manager connection issues.
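
As a rough illustration of the waiting logic, with names that are assumptions rather than the actual helpers in the test framework:

# Hypothetical helper: a Failed phase is treated as transient when its
# status message matches any known intermediate-event pattern.
def is_intermediate_event(failure_message: str) -> bool:
    return any(pattern in failure_message for pattern in intermediate_events)

# Inside the polling loop (illustrative):
#   if phase == Phase.Failed and is_intermediate_event(message):
#       continue  # transient failure: keep waiting for Running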

Proper Fix (Future Work)

The operator should not mark resources as Failed on a K8s API timeout. Instead, it could, for example (see the sketch after the list):

  1. Detect K8s API timeout errors
  2. Retry with exponential backoff
  3. Only mark Failed after multiple consecutive timeouts
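
The operator itself is written in Go, but the retry idea fits in a few lines of Python; every name below is illustrative:

import time

def call_with_backoff(op, is_transient, max_attempts=5, base_delay=1.0):
    """Retry op() on transient errors with exponential backoff; only
    raise (and mark the resource Failed) after max_attempts consecutive
    transient failures."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception as err:
            if not is_transient(err) or attempt == max_attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...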

Proof of work

Ran 4 patches to check for flakiness after the fix:

  1. Patch 1
  2. Patch 2
  3. Patch 3
  4. Patch 4

All patches reached Running despite intermediate failures like:

[2025/10/08 15:03:38.199] DEBUG    2025-10-08 13:03:38,198 [mongodb_utils_state]  Found intermediate event in failure: Client.Timeout exceeded while awaiting headers in Failed to create configmap: a-1759927824-grtlr6pj55z/pod-template-shards-0-hostname-override in cluster: kind-e2e-cluster-1, err: Put "https://10.97.0.1/api/v1/namespaces/a-1759927824-grtlr6pj55z/configmaps/pod-template-shards-0-hostname-override?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers). Skipping the failure state

The test now properly skips these transient API timeout failures and waits for resources to recover.

backup_minio tests are failing, but they are failing on many other branches too.

@Julien-Ben Julien-Ben added the skip-changelog label Oct 8, 2025

github-actions bot commented Oct 8, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.5.0 Release Notes

New Features

  • Improve automation agent certificate rotation: the agent now restarts automatically when its certificate is renewed, allowing seamless certificate updates without manual Pod restarts.

Bug Fixes

  • MongoDBMultiCluster: fix a resource getting stuck in the Pending state when any clusterSpecList item has 0 members. A value of 0 members is now handled correctly, as it is in the MongoDB resource.
  • MultiClusterSharded: blocked removing a non-zero-member cluster from a MongoDB resource. This prevents scaling down a member cluster when its current configuration is unavailable, which could lead to unexpected issues.

@Julien-Ben Julien-Ben marked this pull request as ready for review October 9, 2025 07:57
@Julien-Ben Julien-Ben requested a review from a team as a code owner October 9, 2025 07:57
@m1kola m1kola (Contributor) commented Oct 9, 2025

Glad that we will have one less flake :) Thanks a lot for fixing it.

I don't understand why there is this logic in the first place. We are checking "did we get into a Failed state?" while waiting for something else. What is the point?

I think if we drop this logic, it will solve the issue as well. But maybe there is some reason behind it and I'm just not seeing it.

@MaciejKaras MaciejKaras (Collaborator) left a comment

Nice investigation! :)

@Julien-Ben (Collaborator, Author)

@m1kola

I don't understand why there is this logic in the first place. We are checking "did we get into a Failed state?" while waiting for something else. What is the point?

IIUC the aim is to ignore the operator when it puts a resource in the "Failed" phase for a reason that is transient. It happens all the time when waiting for agents to reach the running phase.

@m1kola m1kola (Contributor) commented Oct 9, 2025

IIUC the aim is to ignore the operator when it puts a resource in the "Failed" phase for a reason that is transient. It happens all the time when waiting for agents to reach the running phase.

@Julien-Ben yes, I understand that bit. But why do we raise "Got into Failed phase while waiting for Running" at all? If we are waiting for the running state, we should be ignoring Failed (for any reason) or whatever other state might happen while we are waiting.

Basically why not something like this?

import re
from typing import Optional

# Phase is the resource phase enum from the test framework.

def in_desired_state(
    current_state: Phase,
    desired_state: Phase,
    current_generation: int,
    observed_generation: int,
    current_message: str,
    msg_regexp: Optional[str] = None,
) -> bool:
    """Returns true if the current_state is equal to the desired state; any
    other state (including Failed) is simply treated as "not there yet".
    Optionally checks if the message matches the specified regexp."""
    if current_state is None:
        return False

    if current_generation != observed_generation:
        # We shouldn't check the status further if the Operator hasn't started working on the new spec yet
        return False

    is_in_desired_state = current_state == desired_state
    if msg_regexp is not None:
        regexp = re.compile(msg_regexp)
        is_in_desired_state = (
            is_in_desired_state and current_message is not None and regexp.match(current_message) is not None
        )

    return is_in_desired_state

@Julien-Ben (Collaborator, Author)

@m1kola
Ha, I think this is the difference:

we should be ignoring Failed (for any reason) or whatever other state might happen while we are waiting

In most cases it's better to fail immediately rather than always waiting for the timeout.
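
To make the trade-off concrete, a hypothetical sketch (reusing the illustrative is_intermediate_event helper from above, not the framework's real code):

# Fail fast on genuine Failed states, but keep waiting on transient ones.
def assert_not_hard_failed(phase: Phase, message: str) -> None:
    """Raise immediately on a genuine Failed state; let transient
    (intermediate-event) failures pass so the wait loop keeps polling."""
    if phase == Phase.Failed and not is_intermediate_event(message):
        raise AssertionError("Got into Failed phase while waiting for Running")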

@Julien-Ben Julien-Ben merged commit 6538475 into master Oct 9, 2025
29 of 37 checks passed
@Julien-Ben Julien-Ben deleted the sharded-snippets-flakiness-investigation branch October 9, 2025 15:38