Skip to content

payload agent: also proceed when payload reaches terminal state#79400

Merged
openshift-merge-bot[bot] merged 2 commits into
openshift:mainfrom
stbenjam:fix-payload-agent-terminal-phase
May 18, 2026
Merged

payload agent: also proceed when payload reaches terminal state#79400
openshift-merge-bot[bot] merged 2 commits into
openshift:mainfrom
stbenjam:fix-payload-agent-terminal-phase

Conversation

@stbenjam
Copy link
Copy Markdown
Member

@stbenjam stbenjam commented May 18, 2026

Summary

  • The release controller can leave blocking jobs stuck as Pending even after they complete in Prow, causing the payload agent to wait indefinitely (up to 10h timeout)
  • The wait loop now also checks the payload phase — if it's Accepted or Rejected, proceed with analysis even if some jobs are still Pending per the release controller
  • A 1-hour grace period from payload creation guards against a release controller bug where the phase can be briefly wrong at creation time
  • Companion PR for the fetch_payloads.py fix that cross-checks stale Pending jobs against Prow: fetch-payloads: resolve stale Pending jobs against Prow openshift-eng/ai-helpers#481

Test plan

  • Verified date parsing from payload tag works in UBI9 container
  • Confirmed the problematic payload 4.20.0-0.nightly-2026-05-18-050422 would be handled correctly — Rejected with 1 stale Pending job, well past the grace period

🤖 Generated with Claude Code

Payload Agent: proceed when payload reaches terminal phase instead of waiting on stale Pending jobs

This change updates the payload agent used by OpenShift's release CI (claude payload agent step in openshift/release) so it no longer remains blocked waiting for jobs that the release controller left Pending after they actually finished in Prow.

Practical effect

  • The agent's blocking-job wait loop now also polls the release-controller's payload phase. If the payload reaches a terminal phase (Accepted or Rejected), the agent will move forward with analysis rather than waiting for potentially stale Pending job statuses reported by the release controller.
  • A 1-hour grace period (based on payload creation time parsed from PAYLOAD_TAG) is applied: during the first hour after payload creation the agent still prefers job completion, protecting against transient incorrect phases at creation time; after the grace period, a terminal phase will allow the agent to proceed.
  • Improved polling logs now include the current payload phase and counts of succeeded/pending/failed blocking jobs; jobs without started Prow URLs are treated as “not started”.
  • On an overall wait timeout the agent logs the error and proceeds to analysis (instead of exiting non‑zero), so analysis runs even when pending jobs remain.

Why this matters

  • Prevents the payload agent from waiting up to the previous timeout (hours) due to stale Pending statuses when the release controller has already marked the payload terminal.
  • Ensures payload analysis completes and the release pipeline moves forward despite release-controller or stale-job discrepancies.

Related

  • A companion fix updates fetch_payloads.py to cross-check stale Pending jobs against Prow (openshift-eng/ai-helpers PR referenced in the change notes).

Infrastructure affected

  • The claude payload agent step in the OpenShift CI configuration (openshift/release) — this is part of the payload analysis tooling used for OpenShift nightly/release qualification.

The release controller can leave blocking jobs stuck as Pending even
after they complete in Prow. Previously, the agent waited for all
blocking jobs to leave Pending state, which could hang indefinitely.

Now also check the payload phase (Accepted/Rejected) as a signal to
proceed. A 1-hour grace period from payload creation guards against a
release controller bug where the phase can be briefly wrong at creation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 18, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5d517fc2-fde9-457e-9fd9-a2378dbc0cfa

📥 Commits

Reviewing files that changed from the base of the PR and between 95555dc and 77015ca.

📒 Files selected for processing (1)
  • ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh

Walkthrough

A single shell script's release-controller polling loop was updated to track the payload's reported phase and conditionally trust terminal phases only after a configurable grace period from payload creation, with extended logging and early-exit handling based on phase and blocking-job state.

Changes

Claude Payload Agent Phase Tracking

Layer / File(s) Summary
Phase tracking setup and configuration
ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh
Added polling/timeout constants for phase handling, grace-period configuration, and payload creation timestamp extraction from tag metadata to enable grace-period validation of terminal phases.
Phase polling extraction and terminal-phase decision logic
ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh
Extended polling loop to extract and log payload phase from release-controller response, compute payload age relative to creation, delay trusting phase until grace period elapses, then conditionally exit successfully on Accepted or proceed with analysis on Rejected/failure conditions while reporting pending and failed blocking-job counts.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 12
✅ Passed checks (12 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: adding logic to proceed with analysis when the payload reaches a terminal phase (Accepted/Rejected), which is the core improvement described in the PR objectives.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR modifies only shell scripts and configuration files. No Ginkgo test files are involved, so this check for Ginkgo test name stability is not applicable.
Test Structure And Quality ✅ Passed The PR modifies only a shell script, not Ginkgo test code. The custom check applies to Go testing patterns and is not applicable to this PR which contains no test code.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests added. PR modifies only a bash shell script for CI infrastructure payload agent. MicroShift test check not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed Custom check for SNO compatibility is not applicable. PR modifies only a bash shell script (openshift-claude-payload-agent-commands.sh) for CI payload handling, not any Ginkgo e2e tests.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies a CI script to improve payload processing. No deployment manifests, operator code, or controllers with scheduling constraints are introduced.
Ote Binary Stdout Contract ✅ Passed This check applies only to OTE binaries (Go/Ginkgo test code). The PR modifies a bash shell script used for CI/CD payload analysis, not an OTE binary or test code. Check is not applicable.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. The only modified file is a bash shell script for the payload agent, not a test file. The custom check is not applicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from smg247 and xueqzhan May 18, 2026 10:39
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 18, 2026
@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label May 18, 2026
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh`:
- Around line 140-155: The branch treating PHASE "Accepted" or "Rejected"
currently only starts analysis when FAILED > 0, causing the loop to keep polling
if the release-controller advanced to "Rejected" while blockingJobs are stale;
modify the logic in the PHASE handling (inside the loop where PHASE,
PAYLOAD_CREATED, PHASE_GRACE_PERIOD, PAYLOAD_AGE, REMAINING are computed) to
treat "Rejected" as immediately terminal: if [[ "${PHASE}" == "Rejected" ]];
then print the terminal message about ${PENDING} and ${FAILED}/${TOTAL} and
immediately start analysis (i.e., break into the analysis path) regardless of
FAILED, while preserving the existing early-exit for "Accepted" (exit 0) and the
grace-period check for both phases.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 7513c6bc-b9d3-4a28-bb99-814f349ebbcf

📥 Commits

Reviewing files that changed from the base of the PR and between c897a49 and 95555dc.

📒 Files selected for processing (1)
  • ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh

Comment on lines +140 to +155
elif [[ "${PHASE}" == "Accepted" || "${PHASE}" == "Rejected" ]]; then
NOW=$(date +%s)
PAYLOAD_AGE=$(( NOW - PAYLOAD_CREATED ))
if [[ "${PAYLOAD_CREATED}" -gt 0 && "${PAYLOAD_AGE}" -lt "${PHASE_GRACE_PERIOD}" ]]; then
REMAINING=$(( (PHASE_GRACE_PERIOD - PAYLOAD_AGE) / 60 ))
echo " Payload phase is ${PHASE} but payload is only $((PAYLOAD_AGE / 60))m old (grace period: $((PHASE_GRACE_PERIOD / 60))m). Waiting ${REMAINING}m more before trusting phase..."
else
echo ""
echo "Payload reached terminal state (${PHASE}) with ${PENDING} job(s) still pending."
if [[ "${FAILED}" -gt 0 ]]; then
echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."
break
elif [[ "${PHASE}" == "Accepted" ]]; then
echo "Payload was accepted. No analysis needed."
exit 0
fi
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Treat Rejected as terminal without waiting for FAILED to converge.

This branch still requires FAILED > 0 before it breaks into analysis. If the release-controller phase has already advanced to Rejected while blockingJobs is still stale, the loop keeps polling until timeout, which is the failure mode this change is trying to avoid.

Suggested fix
-            if [[ "${FAILED}" -gt 0 ]]; then
+            if [[ "${PHASE}" == "Rejected" || "${FAILED}" -gt 0 ]]; then
                 echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."
                 break
             elif [[ "${PHASE}" == "Accepted" ]]; then
                 echo "Payload was accepted. No analysis needed."
                 exit 0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
elif [[ "${PHASE}" == "Accepted" || "${PHASE}" == "Rejected" ]]; then
NOW=$(date +%s)
PAYLOAD_AGE=$(( NOW - PAYLOAD_CREATED ))
if [[ "${PAYLOAD_CREATED}" -gt 0 && "${PAYLOAD_AGE}" -lt "${PHASE_GRACE_PERIOD}" ]]; then
REMAINING=$(( (PHASE_GRACE_PERIOD - PAYLOAD_AGE) / 60 ))
echo " Payload phase is ${PHASE} but payload is only $((PAYLOAD_AGE / 60))m old (grace period: $((PHASE_GRACE_PERIOD / 60))m). Waiting ${REMAINING}m more before trusting phase..."
else
echo ""
echo "Payload reached terminal state (${PHASE}) with ${PENDING} job(s) still pending."
if [[ "${FAILED}" -gt 0 ]]; then
echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."
break
elif [[ "${PHASE}" == "Accepted" ]]; then
echo "Payload was accepted. No analysis needed."
exit 0
fi
elif [[ "${PHASE}" == "Accepted" || "${PHASE}" == "Rejected" ]]; then
NOW=$(date +%s)
PAYLOAD_AGE=$(( NOW - PAYLOAD_CREATED ))
if [[ "${PAYLOAD_CREATED}" -gt 0 && "${PAYLOAD_AGE}" -lt "${PHASE_GRACE_PERIOD}" ]]; then
REMAINING=$(( (PHASE_GRACE_PERIOD - PAYLOAD_AGE) / 60 ))
echo " Payload phase is ${PHASE} but payload is only $((PAYLOAD_AGE / 60))m old (grace period: $((PHASE_GRACE_PERIOD / 60))m). Waiting ${REMAINING}m more before trusting phase..."
else
echo ""
echo "Payload reached terminal state (${PHASE}) with ${PENDING} job(s) still pending."
if [[ "${PHASE}" == "Rejected" || "${FAILED}" -gt 0 ]]; then
echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."
break
elif [[ "${PHASE}" == "Accepted" ]]; then
echo "Payload was accepted. No analysis needed."
exit 0
fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/openshift/claude/payload/agent/openshift-claude-payload-agent-commands.sh`
around lines 140 - 155, The branch treating PHASE "Accepted" or "Rejected"
currently only starts analysis when FAILED > 0, causing the loop to keep polling
if the release-controller advanced to "Rejected" while blockingJobs are stale;
modify the logic in the PHASE handling (inside the loop where PHASE,
PAYLOAD_CREATED, PHASE_GRACE_PERIOD, PAYLOAD_AGE, REMAINING are computed) to
treat "Rejected" as immediately terminal: if [[ "${PHASE}" == "Rejected" ]];
then print the terminal message about ${PENDING} and ${FAILED}/${TOTAL} and
immediately start analysis (i.e., break into the analysis path) regardless of
FAILED, while preserving the existing early-exit for "Accepted" (exit 0) and the
grace-period check for both phases.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If not a single job shows up as failed, there's little reason to analyze. The changes below this will still analyze on exit if we do hit the timeout just in case there is something to look at.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stbenjam, that's a fair point — if the release controller shows Rejected but FAILED=0, the job data is likely stale and there isn't a concrete failure signal to analyze anyway. Kicking off analysis without any failed jobs to examine would mostly generate noise. And with the timeout path now breaking into analysis rather than exiting, there's still a safety net if the stale state never resolves. Makes sense, I'll withdraw the suggestion.


🧠 Learnings used
Learnt from: tthvo
Repo: openshift/release PR: 79214
File: ci-operator/step-registry/upi/install/aws/usgov/private/cluster/upi-install-aws-usgov-private-cluster-commands.sh:834-836
Timestamp: 2026-05-13T04:27:10.115Z
Learning: When reviewing shell scripts that invoke the AWS CLI, do not flag use of `--filter` as incorrect if the intended argument is `--filters` (plural). AWS CLI supports prefix matching for argument names, so `--filter` is an accepted alias for `--filters` for relevant `aws ec2 describe-*` commands (e.g., `describe-availability-zones`, `describe-instances`, `describe-subnets`). Both forms should be treated as equivalent.

Learnt from: ggiguash
Repo: openshift/release PR: 79349
File: ci-operator/step-registry/openshift/edge-tooling/lvms-ci/post/openshift-edge-tooling-lvms-ci-post-commands.sh:3-3
Timestamp: 2026-05-18T09:59:38.396Z
Learning: When reviewing step-registry shell command scripts in this repository for `set -x`, only treat `set -x` as a potential security issue if the script handles sensitive data. Flag when `set -x` could cause secrets/credentials/tokens/API keys/kubeconfig contents (or other sensitive values) to be printed to logs—for example, when the script reads sensitive env vars or files, constructs/uses authenticated requests with tokens, processes kubeconfig, or exports/echoes values that are secret. If the script performs only non-sensitive operations (e.g., building URLs and downloading public artifacts with curl without credentials or secret material), `set -x` may be used without special concern and should not be singled out.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

[REHEARSALNOTIFIER]
@stbenjam: no rehearsable tests are affected by this change

Note: If this PR includes changes to step registry files (ci-operator/step-registry/) and you expected jobs to be found, try rebasing your PR onto the base branch. This helps pj-rehearse accurately detect changes when the base branch has moved forward.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 18, 2026

@stbenjam: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +149 to +151
if [[ "${FAILED}" -gt 0 ]]; then
echo "${FAILED}/${TOTAL} blocking jobs failed. Starting analysis..."
break
Copy link
Copy Markdown
Member

@petr-muller petr-muller May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for a payload to be rejected with a single blocking job that failing in Prow but pending in RC (=> FAILED == 0 when Rejected)? We would not reach this break in that case.

IOW, should we just bail out on terminal phase, and just log the # of failing jobs?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha I see Coderabbit was first

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 18, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 18, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, stbenjam

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit 2a6fa44 into openshift:main May 18, 2026
10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. rehearsals-ack Signifies that rehearsal jobs have been acknowledged

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants