
OCPBUGS-79544: test: add monitortest to detect pods stuck in Pending state#31045

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from bitoku:monitortest-stuck-pending-pods on Apr 27, 2026

Conversation

@bitoku
Contributor

@bitoku bitoku commented Apr 21, 2026

Add a JUnit evaluation to the pod-lifecycle monitortest that scans intervals for pods with PodWasPending reason and "never completed" message, indicating pods that entered Pending and never left it. This reliably detects stuck image pulls and scheduling failures across all cluster configurations by leveraging already-collected interval data rather than brittle node-level inspection.

Assisted-by: Claude Code https://claude.com/claude-code
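
The check works purely on the constructed interval data. Below is a minimal sketch of the filtering step, assuming the monitorapi field and constant names commonly used in this repository (Source, SourcePodState, Message.Reason, Message.HumanMessage); the merged code may differ in detail.

package watchpods

import (
	"strings"

	"github.com/openshift/origin/pkg/monitor/monitorapi"
)

// filterStuckPendingIntervals keeps only the intervals describing a pod that
// entered Pending and never left it.
func filterStuckPendingIntervals(finalIntervals monitorapi.Intervals) monitorapi.Intervals {
	var stuck monitorapi.Intervals
	for _, interval := range finalIntervals {
		// Only pod-state intervals are relevant to this check.
		if interval.Source != monitorapi.SourcePodState {
			continue
		}
		// The constructed interval for a pod that was Pending carries the
		// PodWasPending reason; "never completed" marks pods that never left it.
		if string(interval.Message.Reason) != "PodWasPending" {
			continue
		}
		if !strings.Contains(interval.Message.HumanMessage, "never completed") {
			continue
		}
		stuck = append(stuck, interval)
	}
	return stuck
}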

Summary by CodeRabbit

  • New Features

    • Added detection and reporting for pods stuck in Pending state that never complete, including detailed per-pod failure summaries and counts.
  • Tests

    • Added comprehensive tests for stuck-pending detection covering empty intervals, completed pods, single/multiple stuck pods, filtering cases, and ensuring paired passing entries for flake tracking.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai

coderabbitai Bot commented Apr 21, 2026

Walkthrough

Adds a new function that scans monitor intervals for pods that were Pending and never completed, emits JUnit test cases (a failure paired with a pass), and integrates those results into the existing watchpods monitor output flow (including a behavior change when w.podInformer is nil).

Changes

  • Stuck-pending detection & formatter — pkg/monitortests/node/watchpods/stuck_pending_pods.go
    New stuckPendingPodsJunit(finalIntervals) implementation: filters intervals for Source=SourcePodState with Reason=PodWasPending and a "never completed" message, formats pod/namespace and UTC timestamps, and returns JUnit test case(s) with combined SystemOut and FailureOutput plus a companion passing entry (a hedged sketch of the failure/pass pairing follows after this list).
  • Integration / informer behavior — pkg/monitortests/node/watchpods/monitortest.go
    Modifies EvaluateTestsFromConstructedIntervals to initialize the result with stuckPendingPodsJunit(finalIntervals) and return it when w.podInformer is nil; when the informer is present, the informer-derived JUnit cases are appended to the precomputed stuck-pending results (the prior local empty slice was removed).
  • Unit tests — pkg/monitortests/node/watchpods/stuck_pending_pods_test.go
    New table-driven tests exercising empty intervals, completed pods, single/multiple stuck pods, source/reason filtering, and mixed scenarios; they assert failure-output contents, count summaries, and the presence of the pass/flake companion case.
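
A hedged sketch of the failure/pass pairing described in the first item above, not the PR's exact code: the test name matches the one reported by risk analysis later in this thread, while the junitapi import path and field names are assumptions based on the review diff below.

package watchpods

import (
	"fmt"
	"strings"

	"github.com/openshift/origin/pkg/test/ginkgo/junitapi"
)

func stuckPendingPodsJunitSketch(failures []string) []*junitapi.JUnitTestCase {
	const testName = "[Monitor:pod-lifecycle][sig-node] pods should not be stuck in Pending state forever"

	success := &junitapi.JUnitTestCase{Name: testName}
	if len(failures) == 0 {
		// Nothing stuck: emit a single passing case.
		return []*junitapi.JUnitTestCase{success}
	}

	output := fmt.Sprintf("%d pod(s) stuck in Pending state:\n%s",
		len(failures), strings.Join(failures, "\n"))
	failure := &junitapi.JUnitTestCase{
		Name:          testName,
		SystemOut:     output,
		FailureOutput: &junitapi.FailureOutput{Output: output},
	}
	// Emitting a passing case with the same name alongside the failure lets the
	// aggregation record the result as a flake while still surfacing the details.
	return []*junitapi.JUnitTestCase{failure, success}
}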

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: The title clearly and specifically describes the main change: adding a monitortest to detect pods stuck in Pending state, with the JIRA reference providing context.
  • Linked Issues check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names — ✅ Passed: The PR adds standard Go unit tests using table-driven subtests, not Ginkgo-style BDD tests with It(), Describe(), Context(), or When() constructs.
  • Test Structure And Quality — ✅ Passed: Custom check for Ginkgo test code is not applicable; the tests use standard Go testing.T with table-driven patterns, not the Ginkgo BDD framework.
  • Microshift Test Compatibility — ✅ Passed: The PR adds standard Go unit tests and monitor test framework code, not Ginkgo e2e tests. The custom check targets Ginkgo patterns (It, Describe, Context, When), which are absent.
  • Single Node Openshift (Sno) Test Compatibility — ✅ Passed: The PR adds standard Go unit tests and monitoring framework utilities, not Ginkgo e2e tests. No SNO compatibility issues are present.
  • Topology-Aware Scheduling Compatibility — ✅ Passed: The PR adds monitoring test code to detect pods stuck in Pending state. No scheduling constraints, node selectors, affinity rules, or topology assumptions are introduced.
  • Ote Binary Stdout Contract — ✅ Passed: The pull request introduces code with no stdout writes in process-level code, maintaining compliance with the OTE Binary Stdout Contract.
  • Ipv6 And Disconnected Network Test Compatibility — ✅ Passed: This pull request does not introduce new Ginkgo e2e tests. The code adds unit tests using Go's standard testing package with t.Run() subtests, not Ginkgo patterns, and contains no hardcoded IPv4 addresses or IPv4-only network assumptions.


Comment @coderabbitai help to get the list of available commands and usage tips.

@bitoku bitoku changed the title from "test: add monitortest to detect pods stuck in Pending state" to "OCPBUGS-79544: test: add monitortest to detect pods stuck in Pending state" on Apr 21, 2026
@openshift-ci openshift-ci Bot requested review from deads2k and p0lyn0mial April 21, 2026 07:30
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) labels on Apr 21, 2026
@openshift-ci-robot

@bitoku: This pull request references Jira Issue OCPBUGS-79544, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @asahay19

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Add a JUnit evaluation to the pod-lifecycle monitortest that scans intervals for pods with PodWasPending reason and "never completed" message, indicating pods that entered Pending and never left it. This reliably detects stuck image pulls and scheduling failures across all cluster configurations by leveraging already-collected interval data rather than brittle node-level inspection.

Assisted-by: Claude Code https://claude.com/claude-code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/monitortests/node/watchpods/stuck_pending_pods.go (1)

24-27: Prefer shared pod locator formatting and full timestamp format

This string construction is slightly ad-hoc. Consider using the shared pod locator formatter and RFC3339 timestamps so logs are unambiguous across years.

Proposed refactor
 		for _, interval := range stuckPods {
-			pod := interval.Locator.Keys[monitorapi.LocatorPodKey]
-			namespace := interval.Locator.Keys[monitorapi.LocatorNamespaceKey]
-			failures = append(failures, fmt.Sprintf("ns/%s pod/%s was Pending from %s to %s and never completed",
-				namespace, pod, interval.From.UTC().Format("01-02T15:04:05Z"), interval.To.UTC().Format("01-02T15:04:05Z")))
+			failures = append(failures, fmt.Sprintf("%s was Pending from %s to %s and never completed",
+				monitorapi.NonUniquePodLocatorFrom(interval.Locator),
+				interval.From.UTC().Format(time.RFC3339),
+				interval.To.UTC().Format(time.RFC3339)))
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/monitortests/node/watchpods/stuck_pending_pods.go` around lines 24 - 27,
Replace the ad-hoc fmt.Sprintf construction with the shared pod locator
formatter and RFC3339 timestamps: build the locator from interval.Locator (using
monitorapi.LocatorPodKey and monitorapi.LocatorNamespaceKey or pass
interval.Locator into the shared formatter), call the shared formatter (e.g.,
FormatPodLocator or the existing pod locator formatter) to produce the "ns/...
pod/..." text, and format interval.From and interval.To with time.RFC3339 before
appending to failures instead of using "01-02T15:04:05Z".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/monitortests/node/watchpods/monitortest.go`:
- Around line 90-91: The current early return skips calling
stuckPendingPodsJunit when w.podInformer is nil; update the control flow in the
function containing w.podInformer to always invoke
stuckPendingPodsJunit(finalIntervals) and append its results to ret regardless
of informer presence (e.g., move or duplicate the append call so it executes
before any return that depends on w.podInformer), while keeping the existing
early return for other informer-dependent checks intact; reference
stuckPendingPodsJunit and w.podInformer to locate where to adjust control flow.
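
An illustrative control-flow sketch of the suggested fix, shown as a fragment of the watchpods package; the receiver type and the informerDerivedJunits helper are placeholders, not the repository's actual identifiers.

func (w *podWatcher) EvaluateTestsFromConstructedIntervals(
	ctx context.Context,
	finalIntervals monitorapi.Intervals,
) ([]*junitapi.JUnitTestCase, error) {
	// Evaluate stuck-pending pods from the interval data unconditionally, so
	// the check still runs when the pod informer was never started.
	ret := stuckPendingPodsJunit(finalIntervals)

	if w.podInformer == nil {
		// No informer: return the interval-derived results instead of an
		// empty slice as before.
		return ret, nil
	}

	// Informer present: append the informer-derived JUnit cases to the
	// precomputed stuck-pending results.
	ret = append(ret, informerDerivedJunits(w)...)
	return ret, nil
}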


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ea6f1c14-a420-4a4a-9184-6d261aaa7067

📥 Commits

Reviewing files that changed from the base of the PR and between ff48d7a and cfd6bd3.

📒 Files selected for processing (3)
  • pkg/monitortests/node/watchpods/monitortest.go
  • pkg/monitortests/node/watchpods/stuck_pending_pods.go
  • pkg/monitortests/node/watchpods/stuck_pending_pods_test.go

Comment thread on pkg/monitortests/node/watchpods/monitortest.go (outdated)

Commit message:
Add a JUnit evaluation to the pod-lifecycle monitortest that scans
intervals for pods with PodWasPending reason and "never completed"
message, indicating pods that entered Pending and never left it.
This reliably detects stuck image pulls and scheduling failures
across all cluster configurations by leveraging already-collected
interval data rather than brittle node-level inspection.

Assisted-by: Claude Code <https://claude.com/claude-code>
@bitoku bitoku force-pushed the monitortest-stuck-pending-pods branch from cfd6bd3 to 6ec90cf on April 21, 2026 09:45

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
pkg/monitortests/node/watchpods/stuck_pending_pods_test.go (1)

117-143: Tighten flake-pattern assertions to enforce cardinality.

The current checks only validate existence, so duplicate or unexpected failing entries could still pass. Consider asserting exactly one failing case and exactly one matching pass case.

Proposed assertion tightening
-				var failCase *junitapi.JUnitTestCase
-				for _, tc := range junits {
-					if tc.FailureOutput != nil {
-						failCase = tc
-						break
-					}
-				}
-				require.NotNil(t, failCase, "expected a failing test case")
+				var failCases []*junitapi.JUnitTestCase
+				for _, tc := range junits {
+					if tc.FailureOutput != nil {
+						failCases = append(failCases, tc)
+					}
+				}
+				require.Len(t, failCases, 1, "expected exactly one failing test case")
+				failCase := failCases[0]
 				assert.Contains(t, failCase.FailureOutput.Output, "stuck in Pending state")
 				if tt.wantSubstr != "" {
 					assert.Contains(t, failCase.FailureOutput.Output, tt.wantSubstr)
 				}
 				if tt.wantCount > 0 {
 					assert.Contains(t, failCase.FailureOutput.Output,
 						fmt.Sprintf("%d pod(s)", tt.wantCount))
 				}
 
 				// Verify the flake pattern: both a failure and a pass with the same test name
-				var hasPass bool
+				var matchingPassCount int
 				for _, tc := range junits {
 					if tc.FailureOutput == nil && tc.Name == failCase.Name {
-						hasPass = true
+						matchingPassCount++
 					}
 				}
-				assert.True(t, hasPass, "expected a matching pass entry for flake pattern")
+				assert.Equal(t, 1, matchingPassCount, "expected exactly one matching pass entry for flake pattern")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/monitortests/node/watchpods/stuck_pending_pods_test.go` around lines 117
- 143, The flake-pattern check currently only asserts existence of a failing
case and a matching pass; tighten it by counting entries in junits: compute
failCount as number of tc where tc.FailureOutput != nil and
tc.FailureOutput.Output contains "stuck in Pending state" (and
tt.wantSubstr/tt.wantCount when applicable), and compute passCount as number of
tc where tc.FailureOutput == nil and tc.Name equals the failing test name; then
require.Equal(t, 1, failCount) and require.Equal(t, 1, passCount) instead of the
current boolean/existence checks (references: junits, failCase,
tc.FailureOutput, tc.Name, hasPass).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7003f2dc-a5fe-4c25-bf4e-c8ce054ba10d

📥 Commits

Reviewing files that changed from the base of the PR and between cfd6bd3 and 6ec90cf.

📒 Files selected for processing (3)
  • pkg/monitortests/node/watchpods/monitortest.go
  • pkg/monitortests/node/watchpods/stuck_pending_pods.go
  • pkg/monitortests/node/watchpods/stuck_pending_pods_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/monitortests/node/watchpods/monitortest.go
  • pkg/monitortests/node/watchpods/stuck_pending_pods.go

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@dgoodwin
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm label (Indicates that a PR is ready to be merged.) on Apr 21, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bitoku, dgoodwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Apr 21, 2026
@bitoku
Contributor Author

bitoku commented Apr 21, 2026

/retest

@openshift-ci
Contributor

openshift-ci Bot commented Apr 21, 2026

@bitoku: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-trt

openshift-trt Bot commented Apr 21, 2026

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: 6ec90cf

  • pull-ci-openshift-origin-main-e2e-aws-ovn-fips — High: "[Monitor:pod-lifecycle][sig-node] pods should not be stuck in Pending state forever" is a new test that was not present in all runs against the current commit.

New tests seen in this PR at sha: 6ec90cf

  • "[Monitor:pod-lifecycle][sig-node] pods should not be stuck in Pending state forever" [Total: 13, Pass: 13, Fail: 0, Flake: 0]

@bitoku
Contributor Author

bitoku commented Apr 27, 2026

/verified by CI

@openshift-ci-robot openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria.) on Apr 27, 2026
@openshift-ci-robot

@bitoku: This PR has been marked as verified by CI.

Details

In response to this:

/verified by CI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot openshift-merge-bot Bot merged commit 653bd69 into openshift:main Apr 27, 2026
21 checks passed
@openshift-ci-robot

@bitoku: Jira Issue Verification Checks: Jira Issue OCPBUGS-79544
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-79544 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

Add a JUnit evaluation to the pod-lifecycle monitortest that scans intervals for pods with PodWasPending reason and "never completed" message, indicating pods that entered Pending and never left it. This reliably detects stuck image pulls and scheduling failures across all cluster configurations by leveraging already-collected interval data rather than brittle node-level inspection.

Assisted-by: Claude Code https://claude.com/claude-code

Summary by CodeRabbit

  • New Features

  • Added detection and reporting for pods stuck in Pending state that never complete, including detailed per-pod failure summaries and counts.

  • Tests

  • Added comprehensive tests for stuck-pending detection covering empty intervals, completed pods, single/multiple stuck pods, filtering cases, and ensuring paired passing entries for flake tracking.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bitoku
Contributor Author

bitoku commented Apr 27, 2026

/cherry-pick release-4.22 release-4.21

@openshift-cherrypick-robot

@bitoku: new pull request created: #31073

Details

In response to this:

/cherry-pick release-4.22 release-4.21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


Labels

  • approved — Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug — Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference — Indicates that this PR references a valid Jira ticket of any type.
  • lgtm — Indicates that a PR is ready to be merged.
  • verified — Signifies that the PR passed pre-merge verification criteria.
