payload agent: also proceed when payload reaches terminal state by stbenjam · Pull Request #79400 · openshift/release

stbenjam · 2026-05-18T10:36:36Z

Summary

The release controller can leave blocking jobs stuck as Pending even after they complete in Prow, causing the payload agent to wait indefinitely (up to 10h timeout)
The wait loop now also checks the payload phase — if it's Accepted or Rejected, proceed with analysis even if some jobs are still Pending per the release controller
A 1-hour grace period from payload creation guards against a release controller bug where the phase can be briefly wrong at creation time
Companion PR for the fetch_payloads.py fix that cross-checks stale Pending jobs against Prow: fetch-payloads: resolve stale Pending jobs against Prow openshift-eng/ai-helpers#481

Test plan

Verified date parsing from payload tag works in UBI9 container
Confirmed the problematic payload 4.20.0-0.nightly-2026-05-18-050422 would be handled correctly — Rejected with 1 stale Pending job, well past the grace period

🤖 Generated with Claude Code

Payload Agent: proceed when payload reaches terminal phase instead of waiting on stale Pending jobs

This change updates the payload agent used by OpenShift's release CI (claude payload agent step in openshift/release) so it no longer remains blocked waiting for jobs that the release controller left Pending after they actually finished in Prow.

Practical effect

The agent's blocking-job wait loop now also polls the release-controller's payload phase. If the payload reaches a terminal phase (Accepted or Rejected), the agent will move forward with analysis rather than waiting for potentially stale Pending job statuses reported by the release controller.
A 1-hour grace period (based on payload creation time parsed from PAYLOAD_TAG) is applied: during the first hour after payload creation the agent still prefers job completion, protecting against transient incorrect phases at creation time; after the grace period, a terminal phase will allow the agent to proceed.
Improved polling logs now include the current payload phase and counts of succeeded/pending/failed blocking jobs; jobs without started Prow URLs are treated as “not started”.
On an overall wait timeout the agent logs the error and proceeds to analysis (instead of exiting non‑zero), so analysis runs even when pending jobs remain.

Why this matters

Prevents the payload agent from waiting up to the previous timeout (hours) due to stale Pending statuses when the release controller has already marked the payload terminal.
Ensures payload analysis completes and the release pipeline moves forward despite release-controller or stale-job discrepancies.

A companion fix updates fetch_payloads.py to cross-check stale Pending jobs against Prow (openshift-eng/ai-helpers PR referenced in the change notes).

Infrastructure affected

The claude payload agent step in the OpenShift CI configuration (openshift/release) — this is part of the payload analysis tooling used for OpenShift nightly/release qualification.

The release controller can leave blocking jobs stuck as Pending even after they complete in Prow. Previously, the agent waited for all blocking jobs to leave Pending state, which could hang indefinitely. Now also check the payload phase (Accepted/Rejected) as a signal to proceed. A 1-hour grace period from payload creation guards against a release controller bug where the phase can be briefly wrong at creation. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-05-18T10:37:03Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5d517fc2-fde9-457e-9fd9-a2378dbc0cfa

📥 Commits

Reviewing files that changed from the base of the PR and between 95555dc and 77015ca.

📒 Files selected for processing (1)

ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh

🚧 Files skipped from review as they are similar to previous changes (1)

ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh

Walkthrough

A single shell script's release-controller polling loop was updated to track the payload's reported phase and conditionally trust terminal phases only after a configurable grace period from payload creation, with extended logging and early-exit handling based on phase and blocking-job state.

Changes

Claude Payload Agent Phase Tracking

Layer / File(s)	Summary
Phase tracking setup and configuration `ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh`	Added polling/timeout constants for phase handling, grace-period configuration, and payload creation timestamp extraction from tag metadata to enable grace-period validation of terminal phases.
Phase polling extraction and terminal-phase decision logic `ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh`	Extended polling loop to extract and log payload phase from release-controller response, compute payload age relative to creation, delay trusting phase until grace period elapses, then conditionally exit successfully on `Accepted` or proceed with analysis on `Rejected`/failure conditions while reporting pending and failed blocking-job counts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 12

✅ Passed checks (12 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately reflects the main change: adding logic to proceed with analysis when the payload reaches a terminal phase (Accepted/Rejected), which is the core improvement described in the PR objectives.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR modifies only shell scripts and configuration files. No Ginkgo test files are involved, so this check for Ginkgo test name stability is not applicable.
Test Structure And Quality	✅ Passed	The PR modifies only a shell script, not Ginkgo test code. The custom check applies to Go testing patterns and is not applicable to this PR which contains no test code.
Microshift Test Compatibility	✅ Passed	No Ginkgo e2e tests added. PR modifies only a bash shell script for CI infrastructure payload agent. MicroShift test check not applicable.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	Custom check for SNO compatibility is not applicable. PR modifies only a bash shell script (openshift-claude-payload-agent-commands.sh) for CI payload handling, not any Ginkgo e2e tests.
Topology-Aware Scheduling Compatibility	✅ Passed	PR modifies a CI script to improve payload processing. No deployment manifests, operator code, or controllers with scheduling constraints are introduced.
Ote Binary Stdout Contract	✅ Passed	This check applies only to OTE binaries (Go/Ginkgo test code). The PR modifies a bash shell script used for CI/CD payload analysis, not an OTE binary or test code. Check is not applicable.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	No Ginkgo e2e tests are added in this PR. The only modified file is a bash shell script for the payload agent, not a test file. The custom check is not applicable.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh`:
- Around line 140-155: The branch treating PHASE "Accepted" or "Rejected"
currently only starts analysis when FAILED > 0, causing the loop to keep polling
if the release-controller advanced to "Rejected" while blockingJobs are stale;
modify the logic in the PHASE handling (inside the loop where PHASE,
PAYLOAD_CREATED, PHASE_GRACE_PERIOD, PAYLOAD_AGE, REMAINING are computed) to
treat "Rejected" as immediately terminal: if [[ "${PHASE}" == "Rejected" ]];
then print the terminal message about ${PENDING} and ${FAILED}/${TOTAL} and
immediately start analysis (i.e., break into the analysis path) regardless of
FAILED, while preserving the existing early-exit for "Accepted" (exit 0) and the
grace-period check for both phases.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7513c6bc-b9d3-4a28-bb99-814f349ebbcf

📥 Commits

Reviewing files that changed from the base of the PR and between c897a49 and 95555dc.

📒 Files selected for processing (1)

ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh

coderabbitai · 2026-05-18T10:41:26Z

+    elif [[ "${PHASE}" == "Accepted" || "${PHASE}" == "Rejected" ]]; then
+        NOW=$(date +%s)
+        PAYLOAD_AGE=$(( NOW - PAYLOAD_CREATED ))
+        if [[ "${PAYLOAD_CREATED}" -gt 0 && "${PAYLOAD_AGE}" -lt "${PHASE_GRACE_PERIOD}" ]]; then
+            REMAINING=$(( (PHASE_GRACE_PERIOD - PAYLOAD_AGE) / 60 ))
+            echo "  Payload phase is ${PHASE} but payload is only $((PAYLOAD_AGE / 60))m old (grace period: $((PHASE_GRACE_PERIOD / 60))m). Waiting ${REMAINING}m more before trusting phase..."
+        else
+            echo ""
+            echo "Payload reached terminal state (${PHASE}) with ${PENDING} job(s) still pending."
+            if [[ "${FAILED}" -gt 0 ]]; then
+                echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."
+                break
+            elif [[ "${PHASE}" == "Accepted" ]]; then
+                echo "Payload was accepted. No analysis needed."
+                exit 0
+            fi


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat Rejected as terminal without waiting for FAILED to converge.

This branch still requires FAILED > 0 before it breaks into analysis. If the release-controller phase has already advanced to Rejected while blockingJobs is still stale, the loop keeps polling until timeout, which is the failure mode this change is trying to avoid.

Suggested fix

- if [[ "${FAILED}" -gt 0 ]]; then + if [[ "${PHASE}" == "Rejected" || "${FAILED}" -gt 0 ]]; then echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..." break elif [[ "${PHASE}" == "Accepted" ]]; then echo "Payload was accepted. No analysis needed." exit 0

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

elif [[ "${PHASE}" == "Accepted" || "${PHASE}" == "Rejected" ]]; then

NOW=$(date +%s)

PAYLOAD_AGE=$(( NOW - PAYLOAD_CREATED ))

if [[ "${PAYLOAD_CREATED}" -gt 0 && "${PAYLOAD_AGE}" -lt "${PHASE_GRACE_PERIOD}" ]]; then

REMAINING=$(( (PHASE_GRACE_PERIOD - PAYLOAD_AGE) / 60 ))

echo " Payload phase is ${PHASE} but payload is only $((PAYLOAD_AGE / 60))m old (grace period: $((PHASE_GRACE_PERIOD / 60))m). Waiting ${REMAINING}m more before trusting phase..."

else

echo ""

echo "Payload reached terminal state (${PHASE}) with ${PENDING} job(s) still pending."

if [[ "${FAILED}" -gt 0 ]]; then

echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."

break

elif [[ "${PHASE}" == "Accepted" ]]; then

echo "Payload was accepted. No analysis needed."

exit 0

fi

elif [[ "${PHASE}" == "Accepted" || "${PHASE}" == "Rejected" ]]; then

NOW=$(date +%s)

PAYLOAD_AGE=$(( NOW - PAYLOAD_CREATED ))

if [[ "${PAYLOAD_CREATED}" -gt 0 && "${PAYLOAD_AGE}" -lt "${PHASE_GRACE_PERIOD}" ]]; then

REMAINING=$(( (PHASE_GRACE_PERIOD - PAYLOAD_AGE) / 60 ))

echo " Payload phase is ${PHASE} but payload is only $((PAYLOAD_AGE / 60))m old (grace period: $((PHASE_GRACE_PERIOD / 60))m). Waiting ${REMAINING}m more before trusting phase..."

else

echo ""

echo "Payload reached terminal state (${PHASE}) with ${PENDING} job(s) still pending."

if [[ "${PHASE}" == "Rejected" || "${FAILED}" -gt 0 ]]; then

echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."

break

elif [[ "${PHASE}" == "Accepted" ]]; then

echo "Payload was accepted. No analysis needed."

exit 0

fi

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh` around lines 140 - 155, The branch treating PHASE "Accepted" or "Rejected" currently only starts analysis when FAILED > 0, causing the loop to keep polling if the release-controller advanced to "Rejected" while blockingJobs are stale; modify the logic in the PHASE handling (inside the loop where PHASE, PAYLOAD_CREATED, PHASE_GRACE_PERIOD, PAYLOAD_AGE, REMAINING are computed) to treat "Rejected" as immediately terminal: if [[ "${PHASE}" == "Rejected" ]]; then print the terminal message about ${PENDING} and ${FAILED}/${TOTAL} and immediately start analysis (i.e., break into the analysis path) regardless of FAILED, while preserving the existing early-exit for "Accepted" (exit 0) and the grace-period check for both phases.

If not a single job shows up as failed, there's little reason to analyze. The changes below this will still analyze on exit if we do hit the timeout just in case there is something to look at.

@stbenjam, that's a fair point — if the release controller shows Rejected but FAILED=0, the job data is likely stale and there isn't a concrete failure signal to analyze anyway. Kicking off analysis without any failed jobs to examine would mostly generate noise. And with the timeout path now breaking into analysis rather than exiting, there's still a safety net if the stale state never resolves. Makes sense, I'll withdraw the suggestion.

🧠 Learnings used

Learnt from: tthvo Repo: openshift/release PR: 79214 File: ci-operator/step-registry/upi/install/aws/usgov/private/cluster/upi-install-aws-usgov-private-cluster-commands.sh:834-836 Timestamp: 2026-05-13T04:27:10.115Z Learning: When reviewing shell scripts that invoke the AWS CLI, do not flag use of `--filter` as incorrect if the intended argument is `--filters` (plural). AWS CLI supports prefix matching for argument names, so `--filter` is an accepted alias for `--filters` for relevant `aws ec2 describe-*` commands (e.g., `describe-availability-zones`, `describe-instances`, `describe-subnets`). Both forms should be treated as equivalent.

Learnt from: ggiguash Repo: openshift/release PR: 79349 File: ci-operator/step-registry/openshift/edge-tooling/lvms-ci/post/openshift-edge-tooling-lvms-ci-post-commands.sh:3-3 Timestamp: 2026-05-18T09:59:38.396Z Learning: When reviewing step-registry shell command scripts in this repository for `set -x`, only treat `set -x` as a potential security issue if the script handles sensitive data. Flag when `set -x` could cause secrets/credentials/tokens/API keys/kubeconfig contents (or other sensitive values) to be printed to logs—for example, when the script reads sensitive env vars or files, constructs/uses authenticated requests with tokens, processes kubeconfig, or exports/echoes values that are secret. If the script performs only non-sensitive operations (e.g., building URLs and downloading public artifacts with curl without credentials or secret material), `set -x` may be used without special concern and should not be singled out.

openshift-merge-bot · 2026-05-18T10:44:07Z

[REHEARSALNOTIFIER]
@stbenjam: no rehearsable tests are affected by this change

Note: If this PR includes changes to step registry files (ci-operator/step-registry/) and you expected jobs to be found, try rebasing your PR onto the base branch. This helps pj-rehearse accurately detect changes when the base branch has moved forward.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

openshift-ci · 2026-05-18T10:49:18Z

@stbenjam: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

petr-muller · 2026-05-18T15:33:43Z

+            if [[ "${FAILED}" -gt 0 ]]; then
+                echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."
+                break


Is it possible for a payload to be rejected with a single blocking job that failing in Prow but pending in RC (=> FAILED == 0 when Rejected)? We would not reach this break in that case.

IOW, should we just bail out on terminal phase, and just log the # of failing jobs?

Haha I see Coderabbit was first

openshift-ci · 2026-05-18T15:39:39Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~ci-operator/step-registry/openshift/claude/payload/agent/OWNERS~~ [petr-muller,stbenjam]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci Bot requested review from smg247 and xueqzhan May 18, 2026 10:39

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2026

openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label May 18, 2026

payload agent: analyze on timeout instead of exiting

77015ca

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai Bot reviewed May 18, 2026

View reviewed changes

petr-muller reviewed May 18, 2026

View reviewed changes

petr-muller approved these changes May 18, 2026

View reviewed changes

openshift-ci Bot assigned petr-muller May 18, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 18, 2026

openshift-merge-bot Bot merged commit 2a6fa44 into openshift:main May 18, 2026
10 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

payload agent: also proceed when payload reaches terminal state#79400

payload agent: also proceed when payload reaches terminal state#79400
openshift-merge-bot[bot] merged 2 commits into
openshift:mainfrom
stbenjam:fix-payload-agent-terminal-phase

stbenjam commented May 18, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 18, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 18, 2026 •

edited

Loading

Uh oh!

stbenjam May 18, 2026

Uh oh!

coderabbitai Bot May 18, 2026

Uh oh!

openshift-merge-bot Bot commented May 18, 2026

Uh oh!

openshift-ci Bot commented May 18, 2026

Uh oh!

petr-muller May 18, 2026 •

edited

Loading

Uh oh!

petr-muller May 18, 2026

Uh oh!

openshift-ci Bot commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

stbenjam commented May 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Payload Agent: proceed when payload reaches terminal phase instead of waiting on stale Pending jobs

Uh oh!

coderabbitai Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

stbenjam May 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-merge-bot Bot commented May 18, 2026

Uh oh!

openshift-ci Bot commented May 18, 2026

Uh oh!

petr-muller May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

petr-muller May 18, 2026

Choose a reason for hiding this comment

Uh oh!

openshift-ci Bot commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

stbenjam commented May 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 18, 2026 •

edited

Loading

coderabbitai Bot May 18, 2026 •

edited

Loading

petr-muller May 18, 2026 •

edited

Loading