OCPBUGS-79544: test: add monitortest to detect pods stuck in Pending state #31045
Conversation
Walkthrough: Adds a new function that scans monitor intervals for pods that entered Pending and "never completed", emits JUnit test cases (failure + pass flake pattern), and integrates those results into the existing watchpods monitor output flow (including a behavior change when the pod informer is unavailable).
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 11 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
@bitoku: This pull request references Jira Issue OCPBUGS-79544, which is valid. 3 validation(s) were run on this bug
Requesting review from QA contact. The bug has been updated to refer to the pull request using the external bug tracker.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/monitortests/node/watchpods/stuck_pending_pods.go (1)
24-27: Prefer shared pod locator formatting and full timestamp format. This string construction is slightly ad-hoc. Consider using the shared pod locator formatter and RFC3339 timestamps so logs are unambiguous across years.
Proposed refactor
```diff
 for _, interval := range stuckPods {
-	pod := interval.Locator.Keys[monitorapi.LocatorPodKey]
-	namespace := interval.Locator.Keys[monitorapi.LocatorNamespaceKey]
-	failures = append(failures, fmt.Sprintf("ns/%s pod/%s was Pending from %s to %s and never completed",
-		namespace, pod, interval.From.UTC().Format("01-02T15:04:05Z"), interval.To.UTC().Format("01-02T15:04:05Z")))
+	failures = append(failures, fmt.Sprintf("%s was Pending from %s to %s and never completed",
+		monitorapi.NonUniquePodLocatorFrom(interval.Locator),
+		interval.From.UTC().Format(time.RFC3339),
+		interval.To.UTC().Format(time.RFC3339)))
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/monitortests/node/watchpods/stuck_pending_pods.go` around lines 24 - 27, Replace the ad-hoc fmt.Sprintf construction with the shared pod locator formatter and RFC3339 timestamps: build the locator from interval.Locator (using monitorapi.LocatorPodKey and monitorapi.LocatorNamespaceKey or pass interval.Locator into the shared formatter), call the shared formatter (e.g., FormatPodLocator or the existing pod locator formatter) to produce the "ns/... pod/..." text, and format interval.From and interval.To with time.RFC3339 before appending to failures instead of using "01-02T15:04:05Z".
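An aside on why the review calls the original layout ambiguous: Go time layouts are literal reference strings, so `"01-02T15:04:05Z"` omits the year entirely (the trailing `Z` is a literal character, not a zone designator), while `time.RFC3339` retains both. A minimal standalone illustration; the date is arbitrary:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ts := time.Date(2026, time.February, 17, 10, 30, 0, 0, time.UTC)

	// Ad-hoc layout from the original code: no year, literal trailing 'Z'.
	fmt.Println(ts.Format("01-02T15:04:05Z")) // 02-17T10:30:00Z

	// time.RFC3339 keeps the year and a proper zone designator.
	fmt.Println(ts.Format(time.RFC3339)) // 2026-02-17T10:30:00Z
}
```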
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/monitortests/node/watchpods/monitortest.go`:
- Around line 90-91: The current early return skips calling
stuckPendingPodsJunit when w.podInformer is nil; update the control flow in the
function containing w.podInformer to always invoke
stuckPendingPodsJunit(finalIntervals) and append its results to ret regardless
of informer presence (e.g., move or duplicate the append call so it executes
before any return that depends on w.podInformer), while keeping the existing
early return for other informer-dependent checks intact; reference
stuckPendingPodsJunit and w.podInformer to locate where to adjust control flow.
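A minimal sketch of the control flow that prompt describes, using stand-in types: only `stuckPendingPodsJunit`, `finalIntervals`, and the `podInformer` field come from the review; every other name here is illustrative, not the repository's actual code.

```go
package watchpods

// Stand-ins for the real monitorapi/junitapi types.
type Interval struct{}
type JUnitTestCase struct{ Name string }

func stuckPendingPodsJunit(intervals []Interval) []*JUnitTestCase {
	return nil // real scan elided
}

type podWatcher struct {
	podInformer interface{} // nil when the informer never started
}

func (w *podWatcher) junitResults(finalIntervals []Interval) []*JUnitTestCase {
	ret := []*JUnitTestCase{}

	// Interval-based evaluation depends only on already-collected
	// intervals, so it runs before any informer-dependent early return.
	ret = append(ret, stuckPendingPodsJunit(finalIntervals)...)

	if w.podInformer == nil {
		// Informer-dependent checks are skipped, but the interval-based
		// results above are still returned.
		return ret
	}

	// ... informer-backed checks append to ret here ...
	return ret
}
```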
---
Nitpick comments:
In `@pkg/monitortests/node/watchpods/stuck_pending_pods.go`:
- Around line 24-27: Replace the ad-hoc fmt.Sprintf construction with the shared
pod locator formatter and RFC3339 timestamps: build the locator from
interval.Locator (using monitorapi.LocatorPodKey and
monitorapi.LocatorNamespaceKey or pass interval.Locator into the shared
formatter), call the shared formatter (e.g., FormatPodLocator or the existing
pod locator formatter) to produce the "ns/... pod/..." text, and format
interval.From and interval.To with time.RFC3339 before appending to failures
instead of using "01-02T15:04:05Z".
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: ea6f1c14-a420-4a4a-9184-6d261aaa7067
📒 Files selected for processing (3)
- pkg/monitortests/node/watchpods/monitortest.go
- pkg/monitortests/node/watchpods/stuck_pending_pods.go
- pkg/monitortests/node/watchpods/stuck_pending_pods_test.go
Add a JUnit evaluation to the pod-lifecycle monitortest that scans intervals for pods with PodWasPending reason and "never completed" message, indicating pods that entered Pending and never left it. This reliably detects stuck image pulls and scheduling failures across all cluster configurations by leveraging already-collected interval data rather than brittle node-level inspection. Assisted-by: Claude Code <https://claude.com/claude-code>
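A rough, self-contained sketch of the evaluation this description outlines, with stand-in types in place of the real `monitorapi`/`junitapi` ones; the test name and struct shapes are assumptions, while the reason/message matching and the failure-plus-pass flake pattern come from the PR itself.

```go
package watchpods

import (
	"fmt"
	"strings"
)

// Simplified stand-ins for monitorapi.Interval and junitapi.JUnitTestCase.
type interval struct {
	Reason, Message, Locator, From, To string
}
type junitTestCase struct {
	Name          string
	FailureOutput string // empty means the case passed
}

const stuckPendingTestName = "[sig-node] pods should not stay Pending and never complete"

func stuckPendingPodsJunit(finalIntervals []interval) []*junitTestCase {
	var failures []string
	for _, iv := range finalIntervals {
		// Key off the PodWasPending reason plus the "never completed"
		// message that interval collection already records.
		if iv.Reason == "PodWasPending" && strings.Contains(iv.Message, "never completed") {
			failures = append(failures,
				fmt.Sprintf("%s was Pending from %s to %s and never completed", iv.Locator, iv.From, iv.To))
		}
	}
	if len(failures) == 0 {
		return []*junitTestCase{{Name: stuckPendingTestName}}
	}
	fail := &junitTestCase{
		Name: stuckPendingTestName,
		FailureOutput: fmt.Sprintf("%d pod(s) stuck in Pending state:\n%s",
			len(failures), strings.Join(failures, "\n")),
	}
	// Emit a pass with the same name alongside the failure so the result
	// is treated as a flake instead of hard-failing the job.
	return []*junitTestCase{fail, {Name: stuckPendingTestName}}
}
```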
Force-pushed from cfd6bd3 to 6ec90cf
🧹 Nitpick comments (1)
pkg/monitortests/node/watchpods/stuck_pending_pods_test.go (1)
117-143: Tighten flake-pattern assertions to enforce cardinality. Current checks validate existence, but duplicate unexpected failing entries could still pass. Consider asserting exactly one failing case and exactly one matching pass case.
Proposed assertion tightening
```diff
-	var failCase *junitapi.JUnitTestCase
-	for _, tc := range junits {
-		if tc.FailureOutput != nil {
-			failCase = tc
-			break
-		}
-	}
-	require.NotNil(t, failCase, "expected a failing test case")
+	var failCases []*junitapi.JUnitTestCase
+	for _, tc := range junits {
+		if tc.FailureOutput != nil {
+			failCases = append(failCases, tc)
+		}
+	}
+	require.Len(t, failCases, 1, "expected exactly one failing test case")
+	failCase := failCases[0]
 	assert.Contains(t, failCase.FailureOutput.Output, "stuck in Pending state")
 	if tt.wantSubstr != "" {
 		assert.Contains(t, failCase.FailureOutput.Output, tt.wantSubstr)
 	}
 	if tt.wantCount > 0 {
 		assert.Contains(t, failCase.FailureOutput.Output, fmt.Sprintf("%d pod(s)", tt.wantCount))
 	}
 	// Verify the flake pattern: both a failure and a pass with the same test name
-	var hasPass bool
+	var matchingPassCount int
 	for _, tc := range junits {
 		if tc.FailureOutput == nil && tc.Name == failCase.Name {
-			hasPass = true
+			matchingPassCount++
 		}
 	}
-	assert.True(t, hasPass, "expected a matching pass entry for flake pattern")
+	assert.Equal(t, 1, matchingPassCount, "expected exactly one matching pass entry for flake pattern")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/monitortests/node/watchpods/stuck_pending_pods_test.go` around lines 117 - 143, The flake-pattern check currently only asserts existence of a failing case and a matching pass; tighten it by counting entries in junits: compute failCount as number of tc where tc.FailureOutput != nil and tc.FailureOutput.Output contains "stuck in Pending state" (and tt.wantSubstr/tt.wantCount when applicable), and compute passCount as number of tc where tc.FailureOutput == nil and tc.Name equals the failing test name; then require.Equal(t, 1, failCount) and require.Equal(t, 1, passCount) instead of the current boolean/existence checks (references: junits, failCase, tc.FailureOutput, tc.Name, hasPass).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/monitortests/node/watchpods/stuck_pending_pods_test.go`:
- Around line 117-143: The flake-pattern check currently only asserts existence
of a failing case and a matching pass; tighten it by counting entries in junits:
compute failCount as number of tc where tc.FailureOutput != nil and
tc.FailureOutput.Output contains "stuck in Pending state" (and
tt.wantSubstr/tt.wantCount when applicable), and compute passCount as number of
tc where tc.FailureOutput == nil and tc.Name equals the failing test name; then
require.Equal(t, 1, failCount) and require.Equal(t, 1, passCount) instead of the
current boolean/existence checks (references: junits, failCase,
tc.FailureOutput, tc.Name, hasPass).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 7003f2dc-a5fe-4c25-bf4e-c8ce054ba10d
📒 Files selected for processing (3)
- pkg/monitortests/node/watchpods/monitortest.go
- pkg/monitortests/node/watchpods/stuck_pending_pods.go
- pkg/monitortests/node/watchpods/stuck_pending_pods_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/monitortests/node/watchpods/monitortest.go
- pkg/monitortests/node/watchpods/stuck_pending_pods.go
Scheduling required tests:
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull request has been approved by: bitoku, dgoodwin. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
/retest
@bitoku: all tests passed! Full PR test history. Your PR dashboard.
Risk analysis has seen new tests most likely introduced by this PR. New Test Risks for sha: 6ec90cf
New tests seen in this PR at sha: 6ec90cf
/verified by CI
@bitoku: This PR has been marked as verified by CI.
@bitoku: Jira Issue Verification Checks: Jira Issue OCPBUGS-79544 has been moved to the MODIFIED state and will move to the VERIFIED state when the change is available in an accepted nightly payload. 🕓
/cherry-pick release-4.22 release-4.21
@bitoku: new pull request created: #31073
Add a JUnit evaluation to the pod-lifecycle monitortest that scans intervals for pods with PodWasPending reason and "never completed" message, indicating pods that entered Pending and never left it. This reliably detects stuck image pulls and scheduling failures across all cluster configurations by leveraging already-collected interval data rather than brittle node-level inspection.
Assisted-by: Claude Code https://claude.com/claude-code
Summary by CodeRabbit
New Features
- Pods that enter Pending and never complete are now detected from collected monitor intervals and reported as JUnit results using the failure-plus-pass flake pattern.
Tests
- Unit tests covering the stuck-pending-pod detection and its flake reporting.