
OCPBUGS-79544: test: add monitortest to detect pods stuck in Pending state#31045

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from bitoku:monitortest-stuck-pending-pods on Apr 27, 2026

Conversation

@bitoku
Contributor

@bitoku bitoku commented Apr 21, 2026

Add a JUnit evaluation to the pod-lifecycle monitortest that scans intervals for pods with PodWasPending reason and "never completed" message, indicating pods that entered Pending and never left it. This reliably detects stuck image pulls and scheduling failures across all cluster configurations by leveraging already-collected interval data rather than brittle node-level inspection.

Assisted-by: Claude Code https://claude.com/claude-code
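
The check works purely on the constructed interval data. Below is a minimal sketch of the filtering step, assuming the monitorapi field and constant names commonly used in this repository (Source, SourcePodState, Message.Reason, Message.HumanMessage); the merged code may differ in detail.

package watchpods

import (
	"strings"

	"github.com/openshift/origin/pkg/monitor/monitorapi"
)

// filterStuckPendingIntervals keeps only the intervals describing a pod that
// entered Pending and never left it.
func filterStuckPendingIntervals(finalIntervals monitorapi.Intervals) monitorapi.Intervals {
	var stuck monitorapi.Intervals
	for _, interval := range finalIntervals {
		// Only pod-state intervals are relevant to this check.
		if interval.Source != monitorapi.SourcePodState {
			continue
		}
		// The constructed interval for a pod that was Pending carries the
		// PodWasPending reason; "never completed" marks pods that never left it.
		if string(interval.Message.Reason) != "PodWasPending" {
			continue
		}
		if !strings.Contains(interval.Message.HumanMessage, "never completed") {
			continue
		}
		stuck = append(stuck, interval)
	}
	return stuck
}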

Summary by CodeRabbit

  • New Features

    • Added detection and reporting for pods stuck in Pending state that never complete, including detailed per-pod failure summaries and counts.
  • Tests

    • Added comprehensive tests for stuck-pending detection covering empty intervals, completed pods, single/multiple stuck pods, filtering cases, and ensuring paired passing entries for flake tracking.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai

coderabbitai Bot commented Apr 21, 2026

Walkthrough

Adds a new function that scans monitor intervals for pods that were Pending and never completed, emits JUnit test cases (a failure paired with a pass), and integrates those results into the existing watchpods monitor output flow (including a behavior change when w.podInformer is nil).

Changes

  • Stuck-pending detection & formatter — pkg/monitortests/node/watchpods/stuck_pending_pods.go
    New stuckPendingPodsJunit(finalIntervals) implementation: filters intervals for Source=SourcePodState with Reason=PodWasPending and a "never completed" message, formats pod/namespace and UTC timestamps, and returns JUnit test case(s) with combined SystemOut and FailureOutput plus a companion passing entry (a hedged sketch of the failure/pass pairing follows after this list).
  • Integration / informer behavior — pkg/monitortests/node/watchpods/monitortest.go
    Modifies EvaluateTestsFromConstructedIntervals to initialize the result with stuckPendingPodsJunit(finalIntervals) and return it when w.podInformer is nil; when the informer is present, the informer-derived JUnit cases are appended to the precomputed stuck-pending results (the prior local empty slice was removed).
  • Unit tests — pkg/monitortests/node/watchpods/stuck_pending_pods_test.go
    New table-driven tests exercising empty intervals, completed pods, single/multiple stuck pods, source/reason filtering, and mixed scenarios; they assert failure-output contents, count summaries, and the presence of the pass/flake companion case.
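
A hedged sketch of the failure/pass pairing described in the first item above, not the PR's exact code: the test name matches the one reported by risk analysis later in this thread, while the junitapi import path and field names are assumptions based on the review diff below.

package watchpods

import (
	"fmt"
	"strings"

	"github.com/openshift/origin/pkg/test/ginkgo/junitapi"
)

func stuckPendingPodsJunitSketch(failures []string) []*junitapi.JUnitTestCase {
	const testName = "[Monitor:pod-lifecycle][sig-node] pods should not be stuck in Pending state forever"

	success := &junitapi.JUnitTestCase{Name: testName}
	if len(failures) == 0 {
		// Nothing stuck: emit a single passing case.
		return []*junitapi.JUnitTestCase{success}
	}

	output := fmt.Sprintf("%d pod(s) stuck in Pending state:\n%s",
		len(failures), strings.Join(failures, "\n"))
	failure := &junitapi.JUnitTestCase{
		Name:          testName,
		SystemOut:     output,
		FailureOutput: &junitapi.FailureOutput{Output: output},
	}
	// Emitting a passing case with the same name alongside the failure lets the
	// aggregation record the result as a flake while still surfacing the details.
	return []*junitapi.JUnitTestCase{failure, success}
}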

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check — ✅ Passed: The title clearly and specifically describes the main change: adding a monitortest to detect pods stuck in Pending state, with the JIRA reference providing context.
  • Linked Issues check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names — ✅ Passed: The PR adds standard Go unit tests using table-driven subtests, not Ginkgo-style BDD tests with It(), Describe(), Context(), or When() constructs.
  • Test Structure And Quality — ✅ Passed: Custom check for Ginkgo test code is not applicable; the tests use standard Go testing.T with table-driven patterns, not the Ginkgo BDD framework.
  • Microshift Test Compatibility — ✅ Passed: The PR adds standard Go unit tests and monitor test framework code, not Ginkgo e2e tests. The custom check targets Ginkgo patterns (It, Describe, Context, When), which are absent.
  • Single Node Openshift (Sno) Test Compatibility — ✅ Passed: The PR adds standard Go unit tests and monitoring framework utilities, not Ginkgo e2e tests. No SNO compatibility issues are present.
  • Topology-Aware Scheduling Compatibility — ✅ Passed: The PR adds monitoring test code to detect pods stuck in Pending state. No scheduling constraints, node selectors, affinity rules, or topology assumptions are introduced.
  • Ote Binary Stdout Contract — ✅ Passed: The pull request introduces code with no stdout writes in process-level code, maintaining compliance with the OTE Binary Stdout Contract.
  • Ipv6 And Disconnected Network Test Compatibility — ✅ Passed: This pull request does not introduce new Ginkgo e2e tests. The code adds unit tests using Go's standard testing package with t.Run() subtests, not Ginkgo patterns, and contains no hardcoded IPv4 addresses or IPv4-only network assumptions.


Comment @coderabbitai help to get the list of available commands and usage tips.

@bitoku bitoku changed the title from "test: add monitortest to detect pods stuck in Pending state" to "OCPBUGS-79544: test: add monitortest to detect pods stuck in Pending state" on Apr 21, 2026
@openshift-ci openshift-ci Bot requested review from deads2k and p0lyn0mial April 21, 2026 07:30
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) labels on Apr 21, 2026
@openshift-ci-robot

@bitoku: This pull request references Jira Issue OCPBUGS-79544, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @asahay19

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Add a JUnit evaluation to the pod-lifecycle monitortest that scans intervals for pods with PodWasPending reason and "never completed" message, indicating pods that entered Pending and never left it. This reliably detects stuck image pulls and scheduling failures across all cluster configurations by leveraging already-collected interval data rather than brittle node-level inspection.

Assisted-by: Claude Code https://claude.com/claude-code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
pkg/monitortests/node/watchpods/stuck_pending_pods.go (1)

24-27: Prefer shared pod locator formatting and full timestamp format

This string construction is slightly ad-hoc. Consider using the shared pod locator formatter and RFC3339 timestamps so logs are unambiguous across years.

Proposed refactor
 		for _, interval := range stuckPods {
-			pod := interval.Locator.Keys[monitorapi.LocatorPodKey]
-			namespace := interval.Locator.Keys[monitorapi.LocatorNamespaceKey]
-			failures = append(failures, fmt.Sprintf("ns/%s pod/%s was Pending from %s to %s and never completed",
-				namespace, pod, interval.From.UTC().Format("01-02T15:04:05Z"), interval.To.UTC().Format("01-02T15:04:05Z")))
+			failures = append(failures, fmt.Sprintf("%s was Pending from %s to %s and never completed",
+				monitorapi.NonUniquePodLocatorFrom(interval.Locator),
+				interval.From.UTC().Format(time.RFC3339),
+				interval.To.UTC().Format(time.RFC3339)))
 		}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/monitortests/node/watchpods/stuck_pending_pods.go` around lines 24 - 27,
Replace the ad-hoc fmt.Sprintf construction with the shared pod locator
formatter and RFC3339 timestamps: build the locator from interval.Locator (using
monitorapi.LocatorPodKey and monitorapi.LocatorNamespaceKey or pass
interval.Locator into the shared formatter), call the shared formatter (e.g.,
FormatPodLocator or the existing pod locator formatter) to produce the "ns/...
pod/..." text, and format interval.From and interval.To with time.RFC3339 before
appending to failures instead of using "01-02T15:04:05Z".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/monitortests/node/watchpods/monitortest.go`:
- Around line 90-91: The current early return skips calling
stuckPendingPodsJunit when w.podInformer is nil; update the control flow in the
function containing w.podInformer to always invoke
stuckPendingPodsJunit(finalIntervals) and append its results to ret regardless
of informer presence (e.g., move or duplicate the append call so it executes
before any return that depends on w.podInformer), while keeping the existing
early return for other informer-dependent checks intact; reference
stuckPendingPodsJunit and w.podInformer to locate where to adjust control flow.
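
An illustrative control-flow sketch of the suggested fix, shown as a fragment of the watchpods package; the receiver type and the informerDerivedJunits helper are placeholders, not the repository's actual identifiers.

func (w *podWatcher) EvaluateTestsFromConstructedIntervals(
	ctx context.Context,
	finalIntervals monitorapi.Intervals,
) ([]*junitapi.JUnitTestCase, error) {
	// Evaluate stuck-pending pods from the interval data unconditionally, so
	// the check still runs when the pod informer was never started.
	ret := stuckPendingPodsJunit(finalIntervals)

	if w.podInformer == nil {
		// No informer: return the interval-derived results instead of an
		// empty slice as before.
		return ret, nil
	}

	// Informer present: append the informer-derived JUnit cases to the
	// precomputed stuck-pending results.
	ret = append(ret, informerDerivedJunits(w)...)
	return ret, nil
}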


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: ea6f1c14-a420-4a4a-9184-6d261aaa7067

📥 Commits

Reviewing files that changed from the base of the PR and between ff48d7a and cfd6bd3.

📒 Files selected for processing (3)
  • pkg/monitortests/node/watchpods/monitortest.go
  • pkg/monitortests/node/watchpods/stuck_pending_pods.go
  • pkg/monitortests/node/watchpods/stuck_pending_pods_test.go

Comment thread on pkg/monitortests/node/watchpods/monitortest.go (outdated)

Commit message:
Add a JUnit evaluation to the pod-lifecycle monitortest that scans
intervals for pods with PodWasPending reason and "never completed"
message, indicating pods that entered Pending and never left it.
This reliably detects stuck image pulls and scheduling failures
across all cluster configurations by leveraging already-collected
interval data rather than brittle node-level inspection.

Assisted-by: Claude Code <https://claude.com/claude-code>
@bitoku bitoku force-pushed the monitortest-stuck-pending-pods branch from cfd6bd3 to 6ec90cf on April 21, 2026 09:45

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
pkg/monitortests/node/watchpods/stuck_pending_pods_test.go (1)

117-143: Tighten flake-pattern assertions to enforce cardinality.

The current checks only validate existence, so duplicate or unexpected failing entries could still pass. Consider asserting exactly one failing case and exactly one matching pass case.

Proposed assertion tightening
-				var failCase *junitapi.JUnitTestCase
-				for _, tc := range junits {
-					if tc.FailureOutput != nil {
-						failCase = tc
-						break
-					}
-				}
-				require.NotNil(t, failCase, "expected a failing test case")
+				var failCases []*junitapi.JUnitTestCase
+				for _, tc := range junits {
+					if tc.FailureOutput != nil {
+						failCases = append(failCases, tc)
+					}
+				}
+				require.Len(t, failCases, 1, "expected exactly one failing test case")
+				failCase := failCases[0]
 				assert.Contains(t, failCase.FailureOutput.Output, "stuck in Pending state")
 				if tt.wantSubstr != "" {
 					assert.Contains(t, failCase.FailureOutput.Output, tt.wantSubstr)
 				}
 				if tt.wantCount > 0 {
 					assert.Contains(t, failCase.FailureOutput.Output,
 						fmt.Sprintf("%d pod(s)", tt.wantCount))
 				}
 
 				// Verify the flake pattern: both a failure and a pass with the same test name
-				var hasPass bool
+				var matchingPassCount int
 				for _, tc := range junits {
 					if tc.FailureOutput == nil && tc.Name == failCase.Name {
-						hasPass = true
+						matchingPassCount++
 					}
 				}
-				assert.True(t, hasPass, "expected a matching pass entry for flake pattern")
+				assert.Equal(t, 1, matchingPassCount, "expected exactly one matching pass entry for flake pattern")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/monitortests/node/watchpods/stuck_pending_pods_test.go` around lines 117
- 143, The flake-pattern check currently only asserts existence of a failing
case and a matching pass; tighten it by counting entries in junits: compute
failCount as number of tc where tc.FailureOutput != nil and
tc.FailureOutput.Output contains "stuck in Pending state" (and
tt.wantSubstr/tt.wantCount when applicable), and compute passCount as number of
tc where tc.FailureOutput == nil and tc.Name equals the failing test name; then
require.Equal(t, 1, failCount) and require.Equal(t, 1, passCount) instead of the
current boolean/existence checks (references: junits, failCase,
tc.FailureOutput, tc.Name, hasPass).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7003f2dc-a5fe-4c25-bf4e-c8ce054ba10d

📥 Commits

Reviewing files that changed from the base of the PR and between cfd6bd3 and 6ec90cf.

📒 Files selected for processing (3)
  • pkg/monitortests/node/watchpods/monitortest.go
  • pkg/monitortests/node/watchpods/stuck_pending_pods.go
  • pkg/monitortests/node/watchpods/stuck_pending_pods_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/monitortests/node/watchpods/monitortest.go
  • pkg/monitortests/node/watchpods/stuck_pending_pods.go

@openshift-merge-bot
Contributor

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

@dgoodwin
Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm label (Indicates that a PR is ready to be merged.) on Apr 21, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 21, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bitoku, dgoodwin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Apr 21, 2026
@bitoku
Contributor Author

bitoku commented Apr 21, 2026

/retest

@openshift-ci
Contributor

openshift-ci Bot commented Apr 21, 2026

@bitoku: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-trt

openshift-trt Bot commented Apr 21, 2026

Risk analysis has seen new tests most likely introduced by this PR.
Please ensure that new tests meet guidelines for naming and stability.

New Test Risks for sha: 6ec90cf

  • pull-ci-openshift-origin-main-e2e-aws-ovn-fips — High: "[Monitor:pod-lifecycle][sig-node] pods should not be stuck in Pending state forever" is a new test that was not present in all runs against the current commit.

New tests seen in this PR at sha: 6ec90cf

  • "[Monitor:pod-lifecycle][sig-node] pods should not be stuck in Pending state forever" [Total: 13, Pass: 13, Fail: 0, Flake: 0]

@bitoku
Contributor Author

bitoku commented Apr 27, 2026

/verified by CI

@openshift-ci-robot openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria.) on Apr 27, 2026
@openshift-ci-robot

@bitoku: This PR has been marked as verified by CI.

Details

In response to this:

/verified by CI

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-merge-bot openshift-merge-bot Bot merged commit 653bd69 into openshift:main Apr 27, 2026
21 checks passed
@openshift-ci-robot

@bitoku: Jira Issue Verification Checks: Jira Issue OCPBUGS-79544
✔️ This pull request was pre-merge verified.
✔️ All associated pull requests have merged.
✔️ All associated, merged pull requests were pre-merge verified.

Jira Issue OCPBUGS-79544 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓

Details

In response to this:

Add a JUnit evaluation to the pod-lifecycle monitortest that scans intervals for pods with PodWasPending reason and "never completed" message, indicating pods that entered Pending and never left it. This reliably detects stuck image pulls and scheduling failures across all cluster configurations by leveraging already-collected interval data rather than brittle node-level inspection.

Assisted-by: Claude Code https://claude.com/claude-code

Summary by CodeRabbit

  • New Features

  • Added detection and reporting for pods stuck in Pending state that never complete, including detailed per-pod failure summaries and counts.

  • Tests

  • Added comprehensive tests for stuck-pending detection covering empty intervals, completed pods, single/multiple stuck pods, filtering cases, and ensuring paired passing entries for flake tracking.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@bitoku
Contributor Author

bitoku commented Apr 27, 2026

/cherry-pick release-4.22 release-4.21

@openshift-cherrypick-robot

@bitoku: new pull request created: #31073

Details

In response to this:

/cherry-pick release-4.22 release-4.21

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


Labels

  • approved — Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug — Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference — Indicates that this PR references a valid Jira ticket of any type.
  • lgtm — Indicates that a PR is ready to be merged.
  • verified — Signifies that the PR passed pre-merge verification criteria.
