OCPBUGS-79471: fix(test): reduce e2e autorepair flakes from unhealthy node conditions #8278

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from rutvik23:OCPBUGS-79471
Apr 23, 2026

Conversation

@rutvik23
Contributor

@rutvik23 rutvik23 commented Apr 18, 2026

What this PR does / why we need it:

Summary

  • Add per-node kubelet health predicates (MemoryPressure=False,
    DiskPressure=False, PIDPressure=False) to the autorepair test's
    WaitForReadyNodesByNodePool call, ensuring the replacement node is
    fully healthy before proceeding
  • Add WithInterval(15s) to validateNodePoolConditions to reduce
    API server polling and eliminate client rate limiter errors

Problem

  • The TestNodePoolAutoRepair e2e test flakes when a replacement node
    comes up NodeReady=True but with kubelet pressure conditions. CAPI
    sets MachineNodeHealthyCondition=False (NodeConditionsFailed),
    which propagates to AllNodesHealthy=False on the NodePool.
    validateNodePoolConditions then times out after 20 minutes.

  • Additionally, validateNodePoolConditions polls every 3s (default),
    generating ~400 API GETs over 20 min, triggering client rate limiter
    errors.

Root cause

  • WaitForReadyNodesByNodePool only verified NodeReady=True and did
    not check kubelet health conditions. The test passed at ~7 min with an
    unhealthy node, leaving validateNodePoolConditions to discover the
    problem with insufficient time for the node to stabilize.

Design decisions

  • Reused ConditionPredicate/WithPredicates instead of custom
    helpers — the framework already logs predicate failures, giving
    specific diagnostics (e.g., wanted MemoryPressure=False, got True)
  • Kept original collection predicates unchanged — the built-in
    count predicate in WaitForNReadyNodesWithOptions already ensures
    the old node is fully removed
  • 15s polling interval matches the pattern in
    WaitForNodePoolConfigUpdateComplete (15-20s with explicit comments
    about rate limiting)
  • No stabilization window (Consistently) — validateNodePoolConditions
    already serves as a 20-min post-Run() stabilization check
  • No direct Machine condition check — would require management
    client access in the test interface; node-level pressure checks are
    functionally equivalent to what CAPI evaluates

Which issue(s) this PR fixes:

Fixes OCPBUGS-79471

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Tests
    • Enhanced node pool autorepair test with improved logging including AWS instance IDs for better diagnostics.
    • Added validation for kubelet pressure conditions to ensure proper node health verification.
    • Optimized node pool status polling to reduce test execution overhead while maintaining reliability.

WaitForReadyNodesByNodePool only checked NodeReady=True, so the test
proceeded with replacement nodes that had kubelet pressure conditions
(MemoryPressure, DiskPressure, PIDPressure). CAPI then set
MachineNodeHealthyCondition=False, causing AllNodesHealthy=False and
failing validateNodePoolConditions.

Add per-node pressure predicates to WaitForReadyNodesByNodePool so the
test waits for a fully healthy replacement. Also add WithInterval(15s)
to validateNodePoolConditions to reduce API polling from ~400 to ~80
GETs over 20 min, eliminating client rate limiter errors.

Refs: OCPBUGS-79471

Signed-off-by: rutvik23
Commit-Message-Assisted-by: Claude (via Claude Code)
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci
Contributor

openshift-ci Bot commented Apr 18, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 18, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 18, 2026
@openshift-ci-robot

@rutvik23: This pull request references Jira Issue OCPBUGS-79471, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

[quoted PR description omitted; identical to the description above]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Apr 18, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 18, 2026

📝 Walkthrough

Walkthrough

This pull request modifies two end-to-end test files to enhance test assertions and control polling behavior. The first change extends the node pool autorepair test to verify that replacement nodes have kubelet pressure conditions explicitly cleared (NodeMemoryPressure, NodeDiskPressure, NodePIDPressure set to False) alongside existing replica count validation. Test logging is also improved to include the AWS instance ID of terminated nodes. The second change adjusts the polling interval in node pool condition validation from the default cadence to 15 seconds while maintaining the 20-minute timeout window.

🚥 Pre-merge checks | ✅ 7 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning The autorepair test violates the Timeouts quality requirement due to a critical bug: ConditionPredicate returns false when kubelet pressure conditions are absent from freshly-registered replacement nodes, causing indefinite spinning. Modify the predicate to treat missing kubelet pressure conditions as satisfied, and add a failure message to line 74 describing why the provider ID is required.
Single Node Openshift (Sno) Test Compatibility ❓ Inconclusive Unable to verify test structure without access to actual code repository files or shell execution environment. Provide the actual content of test/e2e/nodepool*.go files or direct access to the repository to assess test declarations and structure.
✅ Passed checks (7 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main changes: reducing e2e test flakes by adding kubelet health condition checks and adjusting polling intervals.
Stable And Deterministic Test Names ✅ Passed The pull request does not introduce dynamic or unstable test names. All test case names are static string literals with no dynamic values such as node names, instance IDs, timestamps, or generated identifiers.
Microshift Test Compatibility ✅ Passed PR modifies existing tests only; no new Ginkgo e2e tests are added.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only E2E test files, adding kubelet pressure condition checks and adjusting polling intervals without affecting deployment manifests or operator code.
Ote Binary Stdout Contract ✅ Passed All changes are contained within Ginkgo test blocks (It(), test utility calls) with no process-level stdout writes in main(), init(), or BeforeSuite(). Predicates and wait condition modifications occur only within test code where stdout is properly intercepted.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This pull request does not add any new Ginkgo e2e tests. The changes only modify existing helper methods without IPv4 assumptions or external connectivity requirements.


@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Apr 18, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
test/e2e/nodepool_autorepair_test.go (1)

64-64: Minor: log wording reads awkwardly.

"Terminating AWS Instance with an autorepair NodePool" is grammatically odd. Consider "Terminating AWS Instance for autorepair NodePool" or "... on autorepair-enabled NodePool".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/nodepool_autorepair_test.go` at line 64, Update the awkward t.Log
message "Terminating AWS Instance with an autorepair NodePool" to a clearer
phrasing such as "Terminating AWS Instance for autorepair-enabled NodePool"
(locate the t.Log(...) call in the nodepool_autorepair_test.go test and replace
the string accordingly to improve grammar and clarity).


📥 Commits

Reviewing files that changed from the base of the PR and between ad1df60 and c4cf980.

📒 Files selected for processing (2)
  • test/e2e/nodepool_autorepair_test.go
  • test/e2e/nodepool_test.go

Comment on lines +96 to 118
// Ensure replacement nodes are free of kubelet pressure conditions
// (MemoryPressure, DiskPressure, PIDPressure). A replacement node
// can be NodeReady=True but still have pressure conditions, which
// causes CAPI to set MachineNodeHealthyCondition=False
// (NodeConditionsFailed). This makes the NodePool's AllNodesHealthy
// condition False and fails validateNodePoolConditions downstream.
// By checking pressure conditions here, we wait for the replacement
// node to be fully healthy before proceeding.
	e2eutil.WithPredicates(
		e2eutil.ConditionPredicate[*corev1.Node](e2eutil.Condition{
			Type:   string(corev1.NodeMemoryPressure),
			Status: metav1.ConditionFalse,
		}),
		e2eutil.ConditionPredicate[*corev1.Node](e2eutil.Condition{
			Type:   string(corev1.NodeDiskPressure),
			Status: metav1.ConditionFalse,
		}),
		e2eutil.ConditionPredicate[*corev1.Node](e2eutil.Condition{
			Type:   string(corev1.NodePIDPressure),
			Status: metav1.ConditionFalse,
		}),
	),
)
Contributor

@coderabbitai coderabbitai Bot Apr 18, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect ConditionPredicate implementation to see how missing conditions are handled.
fd -t f 'predicates?.go' test/e2e/util
rg -nP -C5 'func ConditionPredicate\b'

Repository: openshift/hypershift

Length of output: 898


🏁 Script executed:

cat -n test/e2e/util/eventually.go | sed -n '564,590p'

Repository: openshift/hypershift

Length of output: 835


Address missing pressure condition handling in wait predicate.

ConditionPredicate returns false when a condition is not present in the node's status, which means a freshly-registered replacement node whose kubelet hasn't yet published MemoryPressure, DiskPressure, or PIDPressure will cause the wait to spin indefinitely. These conditions should be treated as matching False (no pressure) when absent. Either modify ConditionPredicate to accept an optional "treat missing as satisfied" mode for these scenarios, or switch to a collection predicate that checks node.Status.Conditions directly and tolerates missing conditions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/nodepool_autorepair_test.go` around lines 96 - 118, The current use
of e2eutil.ConditionPredicate for NodeMemoryPressure/DiskPressure/PIDPressure
will fail when a replacement node lacks those conditions because
ConditionPredicate returns false if a condition is missing; update the wait
predicate to treat missing pressure conditions as satisfied by either (a) adding
a new e2eutil.ConditionPredicate option/mode "treatMissingAsFalse" and use it
for the three checks, or (b) replace the three ConditionPredicate calls with a
single custom predicate (via e2eutil.WithPredicates) that inspects
node.Status.Conditions for each Type (corev1.NodeMemoryPressure,
corev1.NodeDiskPressure, corev1.NodePIDPressure) and treats absent entries as
metav1.ConditionFalse so the wait proceeds for freshly-registered nodes; locate
and update usages of e2eutil.ConditionPredicate and the wait invocation around
corev1.Node to implement this tolerant behaviour.

Contributor Author

@rutvik23 rutvik23 Apr 19, 2026


The window where a node exists without pressure conditions populated is extremely brief (sub-second). By the time the old node is removed AND the new node appears AND passes NodeReady=True, kubelet has already populated all conditions. The built-in count + "old node gone" predicates keep the poll going long enough that this race is effectively unreachable. In theory, maybe just one extra poll cycle (3 seconds) would occur if a freshly-registered node happens to be listed before kubelet populates conditions. Should not be a flake risk ideally.

Contributor


@rutvik23, your reasoning is sound, and I'll withdraw the concern.

The critical insight is that NodeReady=True is a stronger prerequisite than the pressure conditions being present. Kubelet evaluates all node conditions—including MemoryPressure, DiskPressure, and PIDPressure—as part of computing NodeReady. By the time a replacement node reaches NodeReady=True, all three pressure conditions will already be populated in node.Status.Conditions. The sub-second window you describe is effectively gated away by the existing NodeReady check.

My original concern about "spinning indefinitely" was overstated. Even in the theoretical worst case, ConditionPredicate returns (false, reason, nil) for a missing condition (i.e., keep retrying, not an error), so it would result in at most one extra poll iteration—not a stuck wait. Given a 20-minute timeout and the two collection predicates already acting as guards, this is a non-issue in practice.

The implementation is correct as-is.

@codecov

codecov Bot commented Apr 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 35.62%. Comparing base (ad1df60) to head (c4cf980).
⚠️ Report is 42 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8278   +/-   ##
=======================================
  Coverage   35.62%   35.62%           
=======================================
  Files         767      767           
  Lines       93333    93333           
=======================================
  Hits        33248    33248           
  Misses      57396    57396           
  Partials     2689     2689           

@rutvik23
Contributor Author

/cc @sjenning /cc @jparrill /cc @bryan-cox

@openshift-ci
Contributor

openshift-ci Bot commented Apr 19, 2026

@rutvik23: GitHub didn't allow me to request PR reviews from the following users: /cc.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @sjenning /cc @jparrill /cc @bryan-cox

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sjenning
Contributor

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 20, 2026
@openshift-ci-robot

@sjenning: This pull request references Jira Issue OCPBUGS-79471, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sjenning sjenning marked this pull request as ready for review April 20, 2026 15:37
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 20, 2026
@sjenning
Contributor

/lgtm

@openshift-ci openshift-ci Bot requested a review from enxebre April 20, 2026 15:38
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@cwbotbot

cwbotbot commented Apr 20, 2026

Test Results

e2e-aws

e2e-aks

@bryan-cox
Member

/test e2e-aks-4-22

Member

@bryan-cox bryan-cox left a comment


/approve

@bryan-cox
Member

/verified by e2e

@openshift-ci
Contributor

openshift-ci Bot commented Apr 20, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, rutvik23

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2026
@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 20, 2026
@openshift-ci-robot

@bryan-cox: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD f61ce8f and 2 for PR HEAD c4cf980 in total

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 9a6b857 and 1 for PR HEAD c4cf980 in total

@rutvik23
Contributor Author

/test e2e-kubevirt-aws-ovn-reduced

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD d0a4024 and 0 for PR HEAD c4cf980 in total

@hypershift-jira-solve-ci


Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Failed to wait for 2 nodes to become ready in 45m0s: context deadline exceeded
observed *v1.Node collection invalid: expected 2 nodes, got 0

DataVolume create-cluster-4hkq7-5kq6r-5fm5x-rhcos is not ready; Phase: Pending
target PVC Pending and Waiting for a volume to be created either by the external
provisioner 'openshift-storage.rbd.csi.ceph.com' or manually by the system administrator.

Summary

Both TestCreateCluster and TestAutoscaling failed because zero KubeVirt worker nodes joined the hosted clusters within the 45-minute timeout. The root cause is an ODF/Ceph storage infrastructure issue: the DataVolumes backing the KubeVirt VM root disks were stuck in Pending phase for the entire test duration because the Ceph RBD CSI provisioner (openshift-storage.rbd.csi.ceph.com) never created the underlying PersistentVolumes. Without root disks, the VirtualMachines could not start, so no worker nodes ever registered. This failure is unrelated to the PR changes, which only modify autorepair test node-health predicates and polling intervals.

Root Cause

The failure chain is:

  1. ODF/Ceph CSI provisioner unresponsive — The openshift-storage.rbd.csi.ceph.com CSI provisioner did not create PVs for the DataVolume PVCs, despite the StorageCluster reporting as deployed and the ocs-storagecluster-ceph-rbd storage class being present and set as default.

  2. DataVolumes stuck in Pending — Both DataVolumes (create-cluster-4hkq7-5kq6r-5fm5x-rhcos and create-cluster-4hkq7-5kq6r-lzxbt-rhcos) were created at 11:34:56Z and immediately entered Pending phase with the condition: "target PVC Pending and Waiting for a volume to be created either by the external provisioner 'openshift-storage.rbd.csi.ceph.com'". They remained in this state for the entire 45-minute test window.

  3. KubeVirt VMs could not start — Since the DataVolumes provide the RHCOS root disk image for each VM, the VirtualMachines could not boot. The KubeVirtMachine resources reported DVPending and VMProvisioned=False.

  4. No worker nodes registered — Without running VMs, no kubelet could start, so zero nodes joined the hosted cluster's API server. The test waited 45 minutes for nodes and timed out.

  5. Hosted cluster incomplete — With no worker nodes, cluster operators requiring node-level components (console, dns, image-registry, ingress, monitoring, network, etc.) could not become available, leaving ClusterVersionAvailable=False and ClusterVersionSucceeding=False.

The same pattern affected both TestCreateCluster (expected 2 nodes, got 0) and TestAutoscaling (expected 1 node, got 0) identically, confirming the problem is infrastructure-wide rather than test-specific.

PR #8278 is not the cause. The PR only modifies test/e2e/nodepool_autorepair_test.go (adding kubelet pressure condition checks to WaitForReadyNodesByNodePool) and test/e2e/nodepool_test.go (adding WithInterval(15s) to reduce polling). Neither file affects KubeVirt VM creation, DataVolume provisioning, or storage configuration. The TestCreateCluster and TestAutoscaling tests do not use the modified autorepair predicates.

Recommendations
  1. Retry the job — This is an infrastructure flake caused by the ODF/Ceph storage backend not provisioning volumes. The PR changes are unrelated.

  2. If retry fails again, investigate the Ceph cluster health on the management cluster:

    • Check if the csi-rbdplugin-provisioner pods in openshift-storage namespace are running and not crash-looping
    • Check the Ceph cluster health status (ceph -s) via the rook-ceph-tools pod
    • Review events in the openshift-storage namespace for Ceph OSD or MON errors
  3. Consider filing an ODF infrastructure bug if this pattern recurs across multiple unrelated PRs, as it indicates a systemic issue with the Ceph RBD provisioner in the CI environment.

Evidence
Evidence Detail
Failed Tests TestCreateCluster/ValidateHostedCluster (2803.42s), TestAutoscaling/ValidateHostedCluster (2805.39s)
Failed Step e2e-kubevirt-aws-ovn-reduced-run-e2e-local (53m6s)
Node Count Expected 2 (TestCreateCluster) + 1 (TestAutoscaling), got 0 + 0
DataVolume Phase Pending from 11:34:57Z through test timeout (~45 min stuck)
DV Pending Reason PVC waiting for openshift-storage.rbd.csi.ceph.com to provision volume
DataVolume Name create-cluster-4hkq7-5kq6r-5fm5x-rhcos (and -lzxbt-rhcos)
Storage Class ocs-storagecluster-ceph-rbd (default, provisioner: openshift-storage.rbd.csi.ceph.com)
ODF Install Completed successfully; StorageCluster reported ready
NodePool Condition AllMachinesReady=False, reason: DVPending
Machine Phase PendingInfrastructureReady=False (DVPending), NodeHealthy=False (WaitingForNodeRef)
KubeVirtMachine VMProvisioned=False, Ready=False
HostedCluster Conditions ClusterVersionAvailable=False, DataPlaneConnectionAvailable=Unknown (NoWorkerNodesAvailable)
PR Impact None — changes limited to autorepair test predicates and polling interval

@openshift-merge-bot
Contributor

/hold

Revision c4cf980 was retested 3 times: holding

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 21, 2026
@rutvik23
Contributor Author

/retest

@bryan-cox
Member

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 23, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 23, 2026

@rutvik23: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 06193dc into openshift:main Apr 23, 2026
36 checks passed
@openshift-ci-robot

@rutvik23: Jira Issue Verification Checks: Jira Issue OCPBUGS-79471
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-79471 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

What this PR does / why we need it:

Summary

  • Add per-node kubelet health predicates (MemoryPressure=False,
    DiskPressure=False, PIDPressure=False) to the autorepair test's
    WaitForReadyNodesByNodePool call, ensuring the replacement node is
    fully healthy before proceeding
  • Add WithInterval(15s) to validateNodePoolConditions to reduce
    API server polling and eliminate client rate limiter errors

Problem

  • The TestNodePoolAutoRepair e2e test flakes when a replacement node
    comes up NodeReady=True but with kubelet pressure conditions. CAPI
    sets MachineNodeHealthyCondition=False (NodeConditionsFailed),
    which propagates to AllNodesHealthy=False on the NodePool.
    validateNodePoolConditions then times out after 20 minutes.

  • Additionally, validateNodePoolConditions polls every 3s (default),
    generating ~400 API GETs over 20 min, triggering client rate limiter
    errors.

Root cause

  • WaitForReadyNodesByNodePool only verified NodeReady=True and did
    not check kubelet health conditions. The test passed at ~7 min with an
    unhealthy node, leaving validateNodePoolConditions to discover the
    problem with insufficient time for the node to stabilize.

Design decisions

  • Reused ConditionPredicate/WithPredicates instead of custom
    helpers — the framework already logs predicate failures, giving
    specific diagnostics (e.g., wanted MemoryPressure=False, got True)
  • Kept original collection predicates unchanged — the built-in
    count predicate in WaitForNReadyNodesWithOptions already ensures
    the old node is fully removed
  • 15s polling interval matches the pattern in
    WaitForNodePoolConfigUpdateComplete (15-20s with explicit comments
    about rate limiting)
  • No stabilization window (Consistently) needed — validateNodePoolConditions
    already serves as a 20-min post-Run() stabilization check
  • No direct Machine condition check — would require management
    client access in the test interface; node-level pressure checks are
    functionally equivalent to what CAPI evaluates

Which issue(s) this PR fixes:

Fixes OCPBUGS-79471

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Tests
      • Enhanced node pool autorepair test with improved logging including AWS instance IDs for better diagnostics.
      • Added validation for kubelet pressure conditions to ensure proper node health verification.
      • Optimized node pool status polling to reduce test execution overhead while maintaining reliability.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/testing: Indicates the PR includes changes for e2e testing.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants