CLOUDP-350185 Fix flaky e2e_multi_cluster_sharded_snippets test #503
Conversation
MCK 1.5.0 Release Notes
New Features
Bug Fixes
Glad that we will have one less flake :) Thanks a lot for fixing it. I don't understand why there is this logic in the first place. We are checking "did we get into a Failed state?" while waiting for something else. What is the point? I think if we drop this logic, it will solve the issue as well. But maybe there is some reason behind it and I'm just not seeing it.
Nice investigation! :)
IIUC the aim is to ignore cases where the operator puts a resource in the "Failed" phase, but for a reason that is transient. It happens all the time when waiting for agents to reach the Running phase.
@Julien-Ben yes, I understand that bit. But why do we raise "Got into Failed phase while waiting for Running" at all? If we are waiting for the Running state, we should be ignoring Failed (for any reason) or whatever other state might happen while we are waiting. Basically, why not something like this?

```python
import re
from typing import Optional

# Phase is the CRD status phase enum from the test framework.
def in_desired_state(
    current_state: Phase,
    desired_state: Phase,
    current_generation: int,
    observed_generation: int,
    current_message: str,
    msg_regexp: Optional[str] = None,
) -> bool:
    """Returns true if the current_state is equal to the desired state.
    Optionally checks if the message matches the specified regexp expression."""
    if current_state is None:
        return False
    if current_generation != observed_generation:
        # We shouldn't check the status further if the Operator hasn't started working on the new spec yet
        return False

    is_in_desired_state = current_state == desired_state
    if msg_regexp is not None:
        regexp = re.compile(msg_regexp)
        is_in_desired_state = (
            is_in_desired_state and current_message is not None and regexp.match(current_message) is not None
        )

    return is_in_desired_state
```
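For context, a minimal sketch of the kind of polling loop that would call this helper; with the fail-fast check dropped, a transient Failed phase simply means another iteration of the loop. The `wait_for_phase` name and the `resource` accessors (`get_status`, `get_generation`) are hypothetical, not the real test-framework API.

```python
import time

def wait_for_phase(resource, desired_state: Phase, timeout: int = 600, interval: int = 5) -> None:
    """Polls the resource status until it reaches desired_state or the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        status = resource.get_status()  # hypothetical accessor returning the CRD status
        if in_desired_state(
            current_state=status.phase,
            desired_state=desired_state,
            current_generation=resource.get_generation(),
            observed_generation=status.observed_generation,
            current_message=status.message,
        ):
            return
        time.sleep(interval)
    raise AssertionError(f"Resource did not reach {desired_state} within {timeout}s")
```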
@m1kola In most cases it's better to fail immediately rather than always waiting for the timeout.
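A minimal sketch of that trade-off, assuming a waiter that fails fast on the Failed phase unless the failure message matches a known transient pattern. `INTERMEDIATE_EVENTS`, `should_keep_waiting`, and the `Phase.Failed` value are illustrative names, not the actual implementation.

```python
# Illustrative allowlist of transient failure messages (name and contents assumed).
INTERMEDIATE_EVENTS = (
    "but may still be processing the request",
    "Client.Timeout exceeded while awaiting headers",
)

def should_keep_waiting(current_state: Phase, current_message: str) -> bool:
    """Fail fast on a terminal Failed phase, but keep waiting on known transient failures."""
    if current_state != Phase.Failed:
        return True
    if current_message and any(pattern in current_message for pattern in INTERMEDIATE_EVENTS):
        # Transient failure: the operator is expected to recover, so keep polling.
        return True
    raise AssertionError(f"Got into Failed phase while waiting for Running: {current_message}")
```

This preserves fast feedback on genuine failures while tolerating transient ones, which is effectively what the `intermediate_events` mechanism described below does.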
Fix flaky e2e_multi_cluster_sharded_snippets test
Problem

The `e2e_multi_cluster_sharded_snippets` test fails intermittently when the Kubernetes API server times out during resource creation.

Example run: https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_snippets_12f405afd0f823091430f0be8f4ac21d87a9559c_25_10_05_20_58_10/files?execution=0&sorts=STATUS%3AASC

Investigation

The failing runs show the resource marked as Failed with the message "the server was unable to return a response in the time allotted, but may still be processing the request", so the test fails fast on this transient error instead of waiting for the resource to recover.
This Fix

Add K8s API timeout patterns to the `intermediate_events` list in `mongodb.py`:

- "but may still be processing the request" (server-side timeout)
- "Client.Timeout exceeded while awaiting headers" (client-side timeout)

Effect: these messages are now treated as transient intermediate events, so the test keeps waiting instead of failing fast. This is the same pattern used for other transient failures like agent registration timeouts and Ops Manager connection issues.
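A rough sketch of the change; the existing contents of the list in `mongodb.py` are paraphrased here, and only the two new timeout patterns come from this description.

```python
intermediate_events = (
    # ... existing transient-failure patterns (agent registration, Ops Manager connectivity, ...) ...
    "but may still be processing the request",          # K8s API server-side timeout
    "Client.Timeout exceeded while awaiting headers",   # K8s API client-side timeout
)
```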
Proper Fix (Future Work)
The operator should not mark resources as Failed on a K8s API timeout. Instead, for example:
Proof of work

Ran 4 patches to check for flakiness after the fix. All patches reached the Running phase despite intermediate failures such as the API timeouts described above. The test now properly skips these transient timeout failures and waits for the resources to recover.
The `backup_minio` tests are failing, but they fail in many other branches too.