OCPBUGS-79471: fix(test): reduce e2e autorepair flakes from unhealthy node conditions #8278

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from rutvik23:OCPBUGS-79471
Apr 23, 2026

Conversation

@rutvik23
Contributor

@rutvik23 rutvik23 commented Apr 18, 2026

What this PR does / why we need it:

Summary

  • Add per-node kubelet health predicates (MemoryPressure=False,
    DiskPressure=False, PIDPressure=False) to the autorepair test's
    WaitForReadyNodesByNodePool call, ensuring the replacement node is
    fully healthy before proceeding
  • Add WithInterval(15s) to validateNodePoolConditions to reduce
    API server polling and eliminate client rate limiter errors

Problem

  • The TestNodePoolAutoRepair e2e test flakes when a replacement node
    comes up NodeReady=True but with kubelet pressure conditions. CAPI
    sets MachineNodeHealthyCondition=False (NodeConditionsFailed),
    which propagates to AllNodesHealthy=False on the NodePool.
    validateNodePoolConditions then times out after 20 minutes.

  • Additionally, validateNodePoolConditions polls every 3s (default),
    generating ~400 API GETs over 20 min, triggering client rate limiter
    errors.

Root cause

  • WaitForReadyNodesByNodePool only verified NodeReady=True and did
    not check kubelet health conditions. The test passed at ~7 min with an
    unhealthy node, leaving validateNodePoolConditions to discover the
    problem with insufficient time for the node to stabilize.

Design decisions

  • Reused ConditionPredicate/WithPredicates instead of custom
    helpers — the framework already logs predicate failures, giving
    specific diagnostics (e.g., wanted MemoryPressure=False, got True)
  • Kept original collection predicates unchanged — the built-in
    count predicate in WaitForNReadyNodesWithOptions already ensures
    the old node is fully removed
  • 15s polling interval matches the pattern in
    WaitForNodePoolConfigUpdateComplete (15-20s with explicit comments
    about rate limiting)
  • No stabilization window (Consistently) — validateNodePoolConditions
    already serves as a 20-min post-Run() stabilization check
  • No direct Machine condition check — would require management
    client access in the test interface; node-level pressure checks are
    functionally equivalent to what CAPI evaluates

Which issue(s) this PR fixes:

Fixes OCPBUGS-79471

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Tests
    • Enhanced node pool autorepair test with improved logging including AWS instance IDs for better diagnostics.
    • Added validation for kubelet pressure conditions to ensure proper node health verification.
    • Optimized node pool status polling to reduce test execution overhead while maintaining reliability.

WaitForReadyNodesByNodePool only checked NodeReady=True, so the test
proceeded with replacement nodes that had kubelet pressure conditions
(MemoryPressure, DiskPressure, PIDPressure). CAPI then set
MachineNodeHealthyCondition=False, causing AllNodesHealthy=False and
failing validateNodePoolConditions.

Add per-node pressure predicates to WaitForReadyNodesByNodePool so the
test waits for a fully healthy replacement. Also add WithInterval(15s)
to validateNodePoolConditions to reduce API polling from ~400 to ~80
GETs over 20 min, eliminating client rate limiter errors.

Refs: OCPBUGS-79471

Signed-off-by: rutvik23
Commit-Message-Assisted-by: Claude (via Claude Code)
@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci
Contributor

openshift-ci Bot commented Apr 18, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 18, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 18, 2026
@openshift-ci-robot

@rutvik23: This pull request references Jira Issue OCPBUGS-79471, which is invalid:

  • expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "4.22.0" instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

[quoted PR description omitted; identical to the description above]

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Apr 18, 2026
@coderabbitai
Contributor

coderabbitai Bot commented Apr 18, 2026

📝 Walkthrough

Walkthrough

This pull request modifies two end-to-end test files to enhance test assertions and control polling behavior. The first change extends the node pool autorepair test to verify that replacement nodes have kubelet pressure conditions explicitly cleared (NodeMemoryPressure, NodeDiskPressure, NodePIDPressure set to False) alongside existing replica count validation. Test logging is also improved to include the AWS instance ID of terminated nodes. The second change adjusts the polling interval in node pool condition validation from the default cadence to 15 seconds while maintaining the 20-minute timeout window.

🚥 Pre-merge checks | ✅ 7 | ❌ 3

❌ Failed checks (2 warnings, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning The autorepair test violates the Timeouts quality requirement due to a critical bug: ConditionPredicate returns false when kubelet pressure conditions are absent from freshly-registered replacement nodes, causing indefinite spinning. Modify the predicate to treat missing kubelet pressure conditions as satisfied, and add a failure message to line 74 describing why the provider ID is required.
Single Node Openshift (Sno) Test Compatibility ❓ Inconclusive Unable to verify test structure without access to actual code repository files or shell execution environment. Provide the actual content of test/e2e/nodepool*.go files or direct access to the repository to assess test declarations and structure.
✅ Passed checks (7 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main changes: reducing e2e test flakes by adding kubelet health condition checks and adjusting polling intervals.
Stable And Deterministic Test Names ✅ Passed The pull request does not introduce dynamic or unstable test names. All test case names are static string literals with no dynamic values such as node names, instance IDs, timestamps, or generated identifiers.
Microshift Test Compatibility ✅ Passed PR modifies existing tests only; no new Ginkgo e2e tests are added.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies only E2E test files, adding kubelet pressure condition checks and adjusting polling intervals without affecting deployment manifests or operator code.
Ote Binary Stdout Contract ✅ Passed All changes are contained within Ginkgo test blocks (It(), test utility calls) with no process-level stdout writes in main(), init(), or BeforeSuite(). Predicates and wait condition modifications occur only within test code where stdout is properly intercepted.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed This pull request does not add any new Ginkgo e2e tests. The changes only modify existing helper methods without IPv4 assumptions or external connectivity requirements.


@openshift-ci openshift-ci Bot added area/testing Indicates the PR includes changes for e2e testing and removed do-not-merge/needs-area labels Apr 18, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
test/e2e/nodepool_autorepair_test.go (1)

64-64: Minor: log wording reads awkwardly.

"Terminating AWS Instance with an autorepair NodePool" is grammatically odd. Consider "Terminating AWS Instance for autorepair NodePool" or "... on autorepair-enabled NodePool".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/nodepool_autorepair_test.go` at line 64, Update the awkward t.Log
message "Terminating AWS Instance with an autorepair NodePool" to a clearer
phrasing such as "Terminating AWS Instance for autorepair-enabled NodePool"
(locate the t.Log(...) call in the nodepool_autorepair_test.go test and replace
the string accordingly to improve grammar and clarity).


📥 Commits

Reviewing files that changed from the base of the PR and between ad1df60 and c4cf980.

📒 Files selected for processing (2)
  • test/e2e/nodepool_autorepair_test.go
  • test/e2e/nodepool_test.go

Comment on lines +96 to 118
// Ensure replacement nodes are free of kubelet pressure conditions
// (MemoryPressure, DiskPressure, PIDPressure). A replacement node
// can be NodeReady=True but still have pressure conditions, which
// causes CAPI to set MachineNodeHealthyCondition=False
// (NodeConditionsFailed). This makes the NodePool's AllNodesHealthy
// condition False and fails validateNodePoolConditions downstream.
// By checking pressure conditions here, we wait for the replacement
// node to be fully healthy before proceeding.
	e2eutil.WithPredicates(
		e2eutil.ConditionPredicate[*corev1.Node](e2eutil.Condition{
			Type:   string(corev1.NodeMemoryPressure),
			Status: metav1.ConditionFalse,
		}),
		e2eutil.ConditionPredicate[*corev1.Node](e2eutil.Condition{
			Type:   string(corev1.NodeDiskPressure),
			Status: metav1.ConditionFalse,
		}),
		e2eutil.ConditionPredicate[*corev1.Node](e2eutil.Condition{
			Type:   string(corev1.NodePIDPressure),
			Status: metav1.ConditionFalse,
		}),
	),
)
Contributor

@coderabbitai coderabbitai Bot Apr 18, 2026


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect ConditionPredicate implementation to see how missing conditions are handled.
fd -t f 'predicates?.go' test/e2e/util
rg -nP -C5 'func ConditionPredicate\b'

Repository: openshift/hypershift

Length of output: 898


🏁 Script executed:

cat -n test/e2e/util/eventually.go | sed -n '564,590p'

Repository: openshift/hypershift

Length of output: 835


Address missing pressure condition handling in wait predicate.

ConditionPredicate returns false when a condition is not present in the node's status, which means a freshly-registered replacement node whose kubelet hasn't yet published MemoryPressure, DiskPressure, or PIDPressure will cause the wait to spin indefinitely. These conditions should be treated as matching False (no pressure) when absent. Either modify ConditionPredicate to accept an optional "treat missing as satisfied" mode for these scenarios, or switch to a collection predicate that checks node.Status.Conditions directly and tolerates missing conditions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/nodepool_autorepair_test.go` around lines 96 - 118, The current use
of e2eutil.ConditionPredicate for NodeMemoryPressure/DiskPressure/PIDPressure
will fail when a replacement node lacks those conditions because
ConditionPredicate returns false if a condition is missing; update the wait
predicate to treat missing pressure conditions as satisfied by either (a) adding
a new e2eutil.ConditionPredicate option/mode "treatMissingAsFalse" and use it
for the three checks, or (b) replace the three ConditionPredicate calls with a
single custom predicate (via e2eutil.WithPredicates) that inspects
node.Status.Conditions for each Type (corev1.NodeMemoryPressure,
corev1.NodeDiskPressure, corev1.NodePIDPressure) and treats absent entries as
metav1.ConditionFalse so the wait proceeds for freshly-registered nodes; locate
and update usages of e2eutil.ConditionPredicate and the wait invocation around
corev1.Node to implement this tolerant behaviour.

Contributor Author

@rutvik23 rutvik23 Apr 19, 2026


The window where a node exists without pressure conditions populated is extremely brief (sub-second). By the time the old node is removed AND the new node appears AND passes NodeReady=True, kubelet has already populated all conditions. The built-in count + "old node gone" predicates keep the poll going long enough that this race is effectively unreachable. In theory, maybe just one extra poll cycle (3 seconds) would occur if a freshly-registered node happens to be listed before kubelet populates conditions. Should not be a flake risk ideally.

Contributor


@rutvik23, your reasoning is sound, and I'll withdraw the concern.

The critical insight is that NodeReady=True is a stronger prerequisite than the pressure conditions being present. Kubelet evaluates all node conditions—including MemoryPressure, DiskPressure, and PIDPressure—as part of computing NodeReady. By the time a replacement node reaches NodeReady=True, all three pressure conditions will already be populated in node.Status.Conditions. The sub-second window you describe is effectively gated away by the existing NodeReady check.

My original concern about "spinning indefinitely" was overstated. Even in the theoretical worst case, ConditionPredicate returns (false, reason, nil) for a missing condition (i.e., keep retrying, not an error), so it would result in at most one extra poll iteration—not a stuck wait. Given a 20-minute timeout and the two collection predicates already acting as guards, this is a non-issue in practice.

The implementation is correct as-is.

@codecov

codecov Bot commented Apr 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 35.62%. Comparing base (ad1df60) to head (c4cf980).
⚠️ Report is 42 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #8278   +/-   ##
=======================================
  Coverage   35.62%   35.62%           
=======================================
  Files         767      767           
  Lines       93333    93333           
=======================================
  Hits        33248    33248           
  Misses      57396    57396           
  Partials     2689     2689           

@rutvik23
Contributor Author

/cc @sjenning /cc @jparrill /cc @bryan-cox

@openshift-ci
Contributor

openshift-ci Bot commented Apr 19, 2026

@rutvik23: GitHub didn't allow me to request PR reviews from the following users: /cc.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @sjenning /cc @jparrill /cc @bryan-cox

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@sjenning
Contributor

/jira refresh

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Apr 20, 2026
@openshift-ci-robot

@sjenning: This pull request references Jira Issue OCPBUGS-79471, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)
Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@sjenning sjenning marked this pull request as ready for review April 20, 2026 15:37
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 20, 2026
@sjenning
Contributor

/lgtm

@openshift-ci openshift-ci Bot requested a review from enxebre April 20, 2026 15:38
@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Apr 20, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@cwbotbot

cwbotbot commented Apr 20, 2026

Test Results

e2e-aws

e2e-aks

@bryan-cox
Member

/test e2e-aks-4-22

Member

@bryan-cox bryan-cox left a comment


/approve

@bryan-cox
Member

/verified by e2e

@openshift-ci
Contributor

openshift-ci Bot commented Apr 20, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bryan-cox, rutvik23

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 20, 2026
@openshift-ci-robot openshift-ci-robot added the verified Signifies that the PR passed pre-merge verification criteria label Apr 20, 2026
@openshift-ci-robot

@bryan-cox: This PR has been marked as verified by e2e.

Details

In response to this:

/verified by e2e

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD f61ce8f and 2 for PR HEAD c4cf980 in total

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD 9a6b857 and 1 for PR HEAD c4cf980 in total

@rutvik23
Contributor Author

/test e2e-kubevirt-aws-ovn-reduced

@openshift-merge-bot
Contributor

/retest-required

Remaining retests: 0 against base HEAD d0a4024 and 0 for PR HEAD c4cf980 in total

@hypershift-jira-solve-ci


Test Failure Analysis Complete

Job Information

Test Failure Analysis

Error

Failed to wait for 2 nodes to become ready in 45m0s: context deadline exceeded
observed *v1.Node collection invalid: expected 2 nodes, got 0

DataVolume create-cluster-4hkq7-5kq6r-5fm5x-rhcos is not ready; Phase: Pending
target PVC Pending and Waiting for a volume to be created either by the external
provisioner 'openshift-storage.rbd.csi.ceph.com' or manually by the system administrator.

Summary

Both TestCreateCluster and TestAutoscaling failed because zero KubeVirt worker nodes joined the hosted clusters within the 45-minute timeout. The root cause is an ODF/Ceph storage infrastructure issue: the DataVolumes backing the KubeVirt VM root disks were stuck in Pending phase for the entire test duration because the Ceph RBD CSI provisioner (openshift-storage.rbd.csi.ceph.com) never created the underlying PersistentVolumes. Without root disks, the VirtualMachines could not start, so no worker nodes ever registered. This failure is unrelated to the PR changes, which only modify autorepair test node-health predicates and polling intervals.

Root Cause

The failure chain is:

  1. ODF/Ceph CSI provisioner unresponsive — The openshift-storage.rbd.csi.ceph.com CSI provisioner did not create PVs for the DataVolume PVCs, despite the StorageCluster reporting as deployed and the ocs-storagecluster-ceph-rbd storage class being present and set as default.

  2. DataVolumes stuck in Pending — Both DataVolumes (create-cluster-4hkq7-5kq6r-5fm5x-rhcos and create-cluster-4hkq7-5kq6r-lzxbt-rhcos) were created at 11:34:56Z and immediately entered Pending phase with the condition: "target PVC Pending and Waiting for a volume to be created either by the external provisioner 'openshift-storage.rbd.csi.ceph.com'". They remained in this state for the entire 45-minute test window.

  3. KubeVirt VMs could not start — Since the DataVolumes provide the RHCOS root disk image for each VM, the VirtualMachines could not boot. The KubeVirtMachine resources reported DVPending and VMProvisioned=False.

  4. No worker nodes registered — Without running VMs, no kubelet could start, so zero nodes joined the hosted cluster's API server. The test waited 45 minutes for nodes and timed out.

  5. Hosted cluster incomplete — With no worker nodes, cluster operators requiring node-level components (console, dns, image-registry, ingress, monitoring, network, etc.) could not become available, leaving ClusterVersionAvailable=False and ClusterVersionSucceeding=False.

The same pattern affected both TestCreateCluster (expected 2 nodes, got 0) and TestAutoscaling (expected 1 node, got 0) identically, confirming the problem is infrastructure-wide rather than test-specific.

PR #8278 is not the cause. The PR only modifies test/e2e/nodepool_autorepair_test.go (adding kubelet pressure condition checks to WaitForReadyNodesByNodePool) and test/e2e/nodepool_test.go (adding WithInterval(15s) to reduce polling). Neither file affects KubeVirt VM creation, DataVolume provisioning, or storage configuration. The TestCreateCluster and TestAutoscaling tests do not use the modified autorepair predicates.

Recommendations
  1. Retry the job — This is an infrastructure flake caused by the ODF/Ceph storage backend not provisioning volumes. The PR changes are unrelated.

  2. If retry fails again, investigate the Ceph cluster health on the management cluster:

    • Check if the csi-rbdplugin-provisioner pods in openshift-storage namespace are running and not crash-looping
    • Check the Ceph cluster health status (ceph -s) via the rook-ceph-tools pod
    • Review events in the openshift-storage namespace for Ceph OSD or MON errors
  3. Consider filing an ODF infrastructure bug if this pattern recurs across multiple unrelated PRs, as it indicates a systemic issue with the Ceph RBD provisioner in the CI environment.

Evidence
Evidence Detail
Failed Tests TestCreateCluster/ValidateHostedCluster (2803.42s), TestAutoscaling/ValidateHostedCluster (2805.39s)
Failed Step e2e-kubevirt-aws-ovn-reduced-run-e2e-local (53m6s)
Node Count Expected 2 (TestCreateCluster) + 1 (TestAutoscaling), got 0 + 0
DataVolume Phase Pending from 11:34:57Z through test timeout (~45 min stuck)
DV Pending Reason PVC waiting for openshift-storage.rbd.csi.ceph.com to provision volume
DataVolume Name create-cluster-4hkq7-5kq6r-5fm5x-rhcos (and -lzxbt-rhcos)
Storage Class ocs-storagecluster-ceph-rbd (default, provisioner: openshift-storage.rbd.csi.ceph.com)
ODF Install Completed successfully; StorageCluster reported ready
NodePool Condition AllMachinesReady=False, reason: DVPending
Machine Phase PendingInfrastructureReady=False (DVPending), NodeHealthy=False (WaitingForNodeRef)
KubeVirtMachine VMProvisioned=False, Ready=False
HostedCluster Conditions ClusterVersionAvailable=False, DataPlaneConnectionAvailable=Unknown (NoWorkerNodesAvailable)
PR Impact None — changes limited to autorepair test predicates and polling interval

@openshift-merge-bot
Contributor

/hold

Revision c4cf980 was retested 3 times: holding

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 21, 2026
@rutvik23
Contributor Author

/retest

@bryan-cox
Member

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 23, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 23, 2026

@rutvik23: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-merge-bot openshift-merge-bot Bot merged commit 06193dc into openshift:main Apr 23, 2026
36 checks passed
@openshift-ci-robot

@rutvik23: Jira Issue Verification Checks: Jira Issue OCPBUGS-79471
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-79471 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

What this PR does / why we need it:

Summary

  • Add per-node kubelet health predicates (MemoryPressure=False,
    DiskPressure=False, PIDPressure=False) to the autorepair test's
    WaitForReadyNodesByNodePool call, ensuring the replacement node is
    fully healthy before proceeding
  • Add WithInterval(15s) to validateNodePoolConditions to reduce
    API server polling and eliminate client rate limiter errors

Problem

  • The TestNodePoolAutoRepair e2e test flakes when a replacement node
    comes up NodeReady=True but with kubelet pressure conditions. CAPI
    sets MachineNodeHealthyCondition=False (NodeConditionsFailed),
    which propagates to AllNodesHealthy=False on the NodePool.
    validateNodePoolConditions then times out after 20 minutes.

  • Additionally, validateNodePoolConditions polls every 3s (default),
    generating ~400 API GETs over 20 min, triggering client rate limiter
    errors.

Root cause

  • WaitForReadyNodesByNodePool only verified NodeReady=True and did
    not check kubelet health conditions. The test passed at ~7 min with an
    unhealthy node, leaving validateNodePoolConditions to discover the
    problem with insufficient time for the node to stabilize.

Design decisions

  • Reused ConditionPredicate/WithPredicates instead of custom
    helpers — the framework already logs predicate failures, giving
    specific diagnostics (e.g., wanted MemoryPressure=False, got True)
  • Kept original collection predicates unchanged — the built-in
    count predicate in WaitForNReadyNodesWithOptions already ensures
    the old node is fully removed
  • 15s polling interval matches the pattern in
    WaitForNodePoolConfigUpdateComplete (15-20s with explicit comments
    about rate limiting)
  • No stabilization window (Consistently) needed — validateNodePoolConditions
    already serves as a 20-min post-Run() stabilization check
  • No direct Machine condition check — would require management
    client access in the test interface; node-level pressure checks are
    functionally equivalent to what CAPI evaluates

Which issue(s) this PR fixes:

Fixes OCPBUGS-79471

Special notes for your reviewer:

Checklist:

  • Subject and description added to both commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Tests
      • Enhanced node pool autorepair test with improved logging including AWS instance IDs for better diagnostics.
      • Added validation for kubelet pressure conditions to ensure proper node health verification.
      • Optimized node pool status polling to reduce test execution overhead while maintaining reliability.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • area/testing: Indicates the PR includes changes for e2e testing.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • verified: Signifies that the PR passed pre-merge verification criteria.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants