OCPBUGS-79471: fix(test): reduce e2e autorepair flakes from unhealthy node conditions (#8278)
Conversation
WaitForReadyNodesByNodePool only checked NodeReady=True, so the test proceeded with replacement nodes that had kubelet pressure conditions (MemoryPressure, DiskPressure, PIDPressure). CAPI then set MachineNodeHealthyCondition=False, causing AllNodesHealthy=False and failing validateNodePoolConditions.

Add per-node pressure predicates to WaitForReadyNodesByNodePool so the test waits for a fully healthy replacement. Also add WithInterval(15s) to validateNodePoolConditions to reduce API polling from ~400 to ~80 GETs over 20 min, eliminating client rate limiter errors.

Refs: OCPBUGS-79471
Signed-off-by: rutvik23
Commit-Message-Assisted-by: Claude (via Claude Code)
Skipping CI for Draft Pull Request.
@rutvik23: This pull request references Jira Issue OCPBUGS-79471, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
📝 Walkthrough

This pull request modifies two end-to-end test files to enhance test assertions and control polling behavior. The first change extends the node pool autorepair test to verify that replacement nodes have kubelet pressure conditions explicitly cleared (NodeMemoryPressure, NodeDiskPressure, NodePIDPressure set to False) alongside the existing replica count validation. Test logging is also improved to include the AWS instance ID of terminated nodes. The second change adjusts the polling interval in node pool condition validation from the default cadence to 15 seconds while maintaining the 20-minute timeout window.

🚥 Pre-merge checks: ❌ 3 failed (2 warnings, 1 inconclusive), ✅ 7 passed
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/e2e/nodepool_autorepair_test.go (1)
Line 64: Minor: log wording reads awkwardly. "Terminating AWS Instance with an autorepair NodePool" is grammatically odd. Consider "Terminating AWS Instance for autorepair NodePool" or "... on autorepair-enabled NodePool".
📒 Files selected for processing (2)
- test/e2e/nodepool_autorepair_test.go
- test/e2e/nodepool_test.go
```go
// Ensure replacement nodes are free of kubelet pressure conditions
// (MemoryPressure, DiskPressure, PIDPressure). A replacement node
// can be NodeReady=True but still have pressure conditions, which
// causes CAPI to set MachineNodeHealthyCondition=False
// (NodeConditionsFailed). This makes the NodePool's AllNodesHealthy
// condition False and fails validateNodePoolConditions downstream.
// By checking pressure conditions here, we wait for the replacement
// node to be fully healthy before proceeding.
e2eutil.WithPredicates(
	e2eutil.ConditionPredicate[*corev1.Node](e2eutil.Condition{
		Type:   string(corev1.NodeMemoryPressure),
		Status: metav1.ConditionFalse,
	}),
	e2eutil.ConditionPredicate[*corev1.Node](e2eutil.Condition{
		Type:   string(corev1.NodeDiskPressure),
		Status: metav1.ConditionFalse,
	}),
	e2eutil.ConditionPredicate[*corev1.Node](e2eutil.Condition{
		Type:   string(corev1.NodePIDPressure),
		Status: metav1.ConditionFalse,
	}),
),
)
```
Address missing pressure condition handling in wait predicate.
ConditionPredicate returns false when a condition is not present in the node's status, which means a freshly-registered replacement node whose kubelet hasn't yet published MemoryPressure, DiskPressure, or PIDPressure will cause the wait to spin indefinitely. These conditions should be treated as matching False (no pressure) when absent. Either modify ConditionPredicate to accept an optional "treat missing as satisfied" mode for these scenarios, or switch to a collection predicate that checks node.Status.Conditions directly and tolerates missing conditions.
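The tolerant behavior this comment proposes can be sketched in a self-contained way. The `NodeCondition` struct and `pressureClear` helper below are stand-ins for illustration only, not the real corev1 or e2eutil APIs: an absent condition is treated as False (no pressure), so a freshly-registered node still passes.

```go
package main

import "fmt"

// Minimal stand-ins for the corev1 types; the real test would use
// corev1.NodeCondition and the e2eutil predicate helpers.
type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

type NodeCondition struct {
	Type   string
	Status ConditionStatus
}

// pressureClear reports whether the given pressure condition is clear.
// A missing condition is treated as "no pressure" (False), so a node
// whose kubelet has not yet published the condition still passes.
func pressureClear(conds []NodeCondition, condType string) bool {
	for _, c := range conds {
		if c.Type == condType {
			return c.Status == ConditionFalse
		}
	}
	return true // absent => treated as False (no pressure)
}

func main() {
	conds := []NodeCondition{{Type: "MemoryPressure", Status: ConditionFalse}}
	fmt.Println(pressureClear(conds, "MemoryPressure")) // true: explicitly False
	fmt.Println(pressureClear(conds, "DiskPressure"))   // true: absent, tolerated
	conds = append(conds, NodeCondition{Type: "DiskPressure", Status: ConditionTrue})
	fmt.Println(pressureClear(conds, "DiskPressure")) // false: pressure present
}
```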
The window where a node exists without pressure conditions populated is extremely brief (sub-second). By the time the old node is removed, the new node appears, and it passes NodeReady=True, kubelet has already populated all conditions. The built-in count and "old node gone" predicates keep the poll going long enough that this race is effectively unreachable. In the theoretical worst case, just one extra poll cycle (3 seconds) would occur if a freshly-registered node happens to be listed before kubelet populates conditions, so this should not be a flake risk.
@rutvik23, your reasoning is sound, and I'll withdraw the concern.
The critical insight is that NodeReady=True is a stronger prerequisite than the pressure conditions being present. Kubelet evaluates all node conditions—including MemoryPressure, DiskPressure, and PIDPressure—as part of computing NodeReady. By the time a replacement node reaches NodeReady=True, all three pressure conditions will already be populated in node.Status.Conditions. The sub-second window you describe is effectively gated away by the existing NodeReady check.
My original concern about "spinning indefinitely" was overstated. Even in the theoretical worst case, ConditionPredicate returns (false, reason, nil) for a missing condition (i.e., keep retrying, not an error), so it would result in at most one extra poll iteration—not a stuck wait. Given a 20-minute timeout and the two collection predicates already acting as guards, this is a non-issue in practice.
The implementation is correct as-is.
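The retry-versus-error semantics described above can be illustrated with a minimal, self-contained sketch. The `predicate` shape and `waitFor` loop here are hypothetical stand-ins for the e2eutil wait machinery: returning `(false, reason, nil)` means "keep retrying", while a non-nil error aborts the wait immediately.

```go
package main

import (
	"errors"
	"fmt"
)

// predicate mirrors the (matched, reason, err) shape discussed above:
// err aborts the wait, while matched=false just means "retry".
type predicate func() (bool, string, error)

// waitFor polls the predicate up to maxPolls times, stopping early on
// a match or on a hard error.
func waitFor(p predicate, maxPolls int) error {
	for i := 0; i < maxPolls; i++ {
		ok, reason, err := p()
		if err != nil {
			return err // hard failure: stop immediately
		}
		if ok {
			return nil
		}
		fmt.Println("retrying:", reason)
	}
	return errors.New("timed out")
}

func main() {
	calls := 0
	// Condition appears missing on the first poll, present on the second:
	// the wait retries once instead of failing.
	err := waitFor(func() (bool, string, error) {
		calls++
		if calls == 1 {
			return false, "condition MemoryPressure not found", nil
		}
		return true, "", nil
	}, 10)
	fmt.Println(err, calls) // <nil> 2
}
```

This is why a missing condition costs at most one extra poll iteration rather than producing a stuck wait.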
Codecov Report: ✅ All modified and coverable lines are covered by tests.

Additional details and impacted files:

```
@@           Coverage Diff           @@
##             main    #8278   +/-  ##
=======================================
  Coverage   35.62%   35.62%
=======================================
  Files         767      767
  Lines       93333    93333
=======================================
  Hits        33248    33248
  Misses      57396    57396
  Partials     2689     2689
```
/cc @sjenning /cc @jparrill /cc @bryan-cox
@rutvik23: GitHub didn't allow me to request PR reviews from the following users: /cc. Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/jira refresh
@sjenning: This pull request references Jira Issue OCPBUGS-79471, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/lgtm

Scheduling tests matching the
Test Results
- e2e-aws
- e2e-aks
/test e2e-aks-4-22
/verified by e2e
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, rutvik23. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
@bryan-cox: This PR has been marked as verified.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/test e2e-kubevirt-aws-ovn-reduced
Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error Summary: Both TestCreateCluster and TestAutoscaling failed because zero KubeVirt worker nodes joined the hosted clusters within the 45-minute timeout. The root cause is an ODF/Ceph storage infrastructure issue: the DataVolumes backing the KubeVirt VM root disks were stuck.

Root Cause: The failure chain is:

The same pattern affected both TestCreateCluster (expected 2 nodes, got 0) and TestAutoscaling (expected 1 node, got 0) identically, confirming the problem is infrastructure-wide rather than test-specific. PR #8278 is not the cause; it only modifies e2e test code.

Recommendations

Evidence
/hold Revision c4cf980 was retested 3 times: holding
/retest
/hold cancel
@rutvik23: all tests passed! Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
@rutvik23: Jira Issue Verification Checks: Jira Issue OCPBUGS-79471. Jira Issue OCPBUGS-79471 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
What this PR does / why we need it:
Summary

- Add per-node pressure predicates (MemoryPressure=False, DiskPressure=False, PIDPressure=False) to the autorepair test's WaitForReadyNodesByNodePool call, ensuring the replacement node is fully healthy before proceeding
- Add WithInterval(15s) to validateNodePoolConditions to reduce API server polling and eliminate client rate limiter errors
Problem

The TestNodePoolAutoRepair e2e test flakes when a replacement node comes up NodeReady=True but with kubelet pressure conditions. CAPI sets MachineNodeHealthyCondition=False (NodeConditionsFailed), which propagates to AllNodesHealthy=False on the NodePool. validateNodePoolConditions then times out after 20 minutes.

Additionally, validateNodePoolConditions polls every 3s (default), generating ~400 API GETs over 20 min, triggering client rate limiter errors.
Root cause

WaitForReadyNodesByNodePool only verified NodeReady=True and did not check kubelet health conditions. The test passed at ~7 min with an unhealthy node, leaving validateNodePoolConditions to discover the problem with insufficient time for the node to stabilize.
Design decisions

- Reuse ConditionPredicate/WithPredicates instead of custom helpers: the framework already logs predicate failures, giving specific diagnostics (e.g., wanted MemoryPressure=False, got True)
- The count predicate in WaitForNReadyNodesWithOptions already ensures the old node is fully removed
- The 15s interval matches WaitForNodePoolConfigUpdateComplete (15-20s, with explicit comments about rate limiting)
- validateNodePoolConditions already serves as a 20-min post-Run() stabilization check
- Node-level pressure checks are functionally equivalent to what CAPI evaluates, avoiding extra client access in the test interface
Which issue(s) this PR fixes:
Fixes OCPBUGS-79471
Special notes for your reviewer:
Checklist: