OSAC: add AAP job failure diagnostics to e2e-vmaas CI (2) by omer-vishlitzky · Pull Request #79651 · openshift/release

omer-vishlitzky · 2026-05-23T11:59:26Z

Summary

e2e-vmaas has a ~50% failure rate: AAP provision jobs crash (rc=None), and there is no visibility into the crash reason
Adds diagnostics to the gather and test steps to capture the missing data
Duplicate PR for additional rehearsal coverage

Changes

Gather step (osac-project-gather-commands.sh):

Query AAP REST API for failed/errored jobs — saves result_traceback, job_explanation, stdout per job
Collect automation-job pod descriptions (exit codes, OOMKill events)
Collect instance group capacity data
Collect VirtualNetwork/Subnet/SecurityGroup YAML status
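
Once the gather step has saved the failed-job listing, the fields named above can be condensed into a short triage summary. The following is a minimal sketch, assuming the saved file has a top-level `results` array whose entries carry `id`, `status`, `job_explanation` and `result_traceback`. The function name `summarize_failed_jobs` and the sample payload are hypothetical and not part of this PR.

```python
import json


def summarize_failed_jobs(payload, max_chars=200):
    """Return one triage line per job in a saved AAP job-list response."""
    try:
        data = json.loads(payload)
    except json.JSONDecodeError:
        # The saved file may be empty or hold a non-JSON error page.
        return [f"<not JSON: {payload[:max_chars]!r}>"]
    if not isinstance(data, dict):
        return ["<unexpected JSON shape: no top-level object>"]
    lines = []
    for job in data.get("results", []):
        explanation = (job.get("job_explanation") or "").strip()
        traceback = (job.get("result_traceback") or "").strip()
        # The last traceback line usually names the exception that was raised.
        last_tb_line = traceback.splitlines()[-1] if traceback else ""
        reason = explanation or last_tb_line or "<no explanation recorded>"
        lines.append(f"job {job.get('id')} [{job.get('status')}]: {reason[:max_chars]}")
    return lines


# Hypothetical payload for illustration; real responses carry many more fields.
SAMPLE = json.dumps({
    "results": [
        {"id": 41, "status": "error", "job_explanation": "",
         "result_traceback": "Traceback (most recent call last):\n  ...\nConnectionError: socket closed"},
        {"id": 42, "status": "failed", "job_explanation": "Task was marked as failed",
         "result_traceback": ""},
    ]
})

if __name__ == "__main__":
    for line in summarize_failed_jobs(SAMPLE):
        print(line)
```

Preferring `job_explanation` and falling back to the last traceback line is a design choice of this sketch, not something the PR prescribes.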

Test step (osac-project-cluster-tool-test-commands.sh):

Pre-test resource baseline: node resources, pod count, LVM thin pool, disk usage
Monkey-patch poll_until via conftest.py injection to dump resource state at exact moment of timeout
Post-test trap handler: node/pod resources, stuck resources, automation-job pod exit codes
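
The monkey-patching approach can be illustrated in isolation. This is a minimal sketch under stated assumptions, not the helper from the PR: the real `poll_until` signature is not shown here, so `fake_poll_until`, the position of the `description` argument, the keyword table and the `dump` callback are all hypothetical stand-ins.

```python
import functools
import types

# Assumed keyword-to-kind mapping; the real helper's matching rules are not shown in this PR text.
RESOURCE_KEYWORDS = {
    "compute instance": "computeinstance",
    "virtual network": "virtualnetwork",
    "subnet": "subnet",
    "security group": "securitygroup",
}


def infer_resource(description):
    """Guess which resource kind a poll description refers to, or None."""
    text = description.lower().replace("_", " ")
    for keyword, kind in RESOURCE_KEYWORDS.items():
        if keyword in text or keyword.replace(" ", "") in text:
            return kind
    return None


def with_timeout_diagnostics(poll, dump):
    """Wrap a polling function so `dump` runs when it raises TimeoutError."""
    @functools.wraps(poll)
    def wrapper(*args, **kwargs):
        try:
            return poll(*args, **kwargs)
        except TimeoutError:
            # Assumes the description is the second positional argument or a keyword.
            description = str(kwargs.get("description", args[1] if len(args) > 1 else ""))
            dump(infer_resource(description), description)
            raise  # the test must still fail with the original timeout
    return wrapper


def fake_poll_until(condition, description, attempts=3):
    """Stand-in for the real poll_until: try a few times, then time out."""
    for _ in range(attempts):
        if condition():
            return True
    raise TimeoutError(f"timed out waiting for {description}")


# Stand-in for the tests.core.runner module, patched the way a conftest.py could do it.
runner = types.SimpleNamespace(poll_until=fake_poll_until)
dumped = []
runner.poll_until = with_timeout_diagnostics(
    runner.poll_until, lambda kind, desc: dumped.append((kind, desc))
)
```

One caveat of patching a module attribute this way: code that ran `from tests.core.runner import poll_until` before the patch keeps a reference to the original function, so the patch has to be applied before such imports or to each importing module.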

Test plan

Rehearsal job passes (step scripts are syntactically valid)
On a failing run: verify AAP job artifacts appear in gathered logs
On a failing run: verify timeout diagnostics appear in build log

This PR enhances OpenShift CI for the OSAC project (e2e-vmaas jobs) by adding targeted diagnostics during boot, test, and gather steps to capture missing data when AAP provision jobs crash or time out. It is a duplicate PR to increase rehearsal coverage and contains only CI/infrastructure script changes (no public API or function signature changes).

Systems affected

OpenShift CI step-registry for the OSAC project: ci-operator/step-registry/osac-project (gather, cluster-tool/test, cluster-tool/boot).
Artifacts and logs produced by e2e-vmaas CI runs (additional files captured to job artifacts).

Key practical changes and impact

Gather step (ci-operator/.../gather/osac-project-gather-commands.sh)
- Adds an AAP job diagnostics phase that creates an aap-jobs/ artifact directory and, when the osac-aap route and API token secret exist, queries the AAP Automation Controller REST API to save recent failed/errored jobs (failed-jobs.json), per-job detail JSON, per-job stdout text, job events, and instance group data.
- Captures pod JSON/describe output for pods labeled ansible_job.
- Adds virtualization/networking YAML outputs for VirtualNetwork, Subnet, and SecurityGroup and retains existing CNV, computeinstance, node and cluster operator diagnostics.
- New collection commands tolerate missing data (use "|| true") to avoid aborting artifact gathering.
Test step (ci-operator/.../cluster-tool/test/osac-project-cluster-tool-test-commands.sh)
- Records a pre-test resource baseline on the remote test host: node metrics (oc adm top nodes), pod count in the test namespace, LVM thin pool stats (lvs), and disk usage (df -h /home).
- Injects a generated Python helper at /tmp/patch_helpers.py on the test host and mounts it into the test container; the container appends it to tests/conftest.py before running pytest. The helper monkey-patches tests.core.runner.poll_until to catch TimeoutError, infer a likely resource type from the poll description, and dump targeted timeout-time diagnostics (resource YAML, related ansible_job pods, warning events, node/pod metrics) to stderr, then re-raises the timeout.
- Adds a post-test EXIT trap that copies junit results back and runs post-test diagnostics on the remote host: node and top pod resources, lists stuck computeinstance/virtualnetwork/subnet/securitygroup, and prints automation-job pod terminated exit codes and termination reasons.
Boot/refresh step (ci-operator/.../cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh)
- Writes patched installer helper scripts to /tmp on the machine and mounts them into the installer run. The provided prepare-aap-patched.sh:
  - Captures the raw HTTP response from the AAP gateway token creation and validates the token before relying on jq, printing a truncated raw response on parse failure to avoid opaque jq errors.
  - Stores the created token in secret osac-aap-api-token so downstream steps (gather/test) can query the Automation Controller API.
- The boot flow also uses the patched multi-phase refresh logic, applied through the mounted scripts during the refresh/rollout phases. It adds more robust waits and ordering for the AAP and fulfillment components and waits for the controller-task rollout to stabilize.

Test plan / expected behavior

Rehearsal runs validate CI script syntax; this PR was duplicated to increase rehearsal coverage.
On failing runs, job artifacts should include AAP job outputs (failed-job JSON, per-job stdout/events, pod JSON/describe) and timeout-time diagnostics emitted by the monkey-patch (printed to build log/stderr) to aid root-cause analysis.

Risk/impact

Low risk: changes add logging, artifact collection, and safer error handling only. Commands are designed to tolerate missing data so CI runs are not made more fragile.

Primary files changed

ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh
ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh
ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh

e2e-vmaas has a 50% failure rate with provision jobs crashing (rc=None) but no visibility into why. Add diagnostics to capture the crash reason from AAP API and cluster state on failure.

Gather step: query AAP REST API for failed job details (result_traceback, job_explanation, stdout), collect automation-job pod descriptions (exit codes, OOMKill events), and instance group capacity. Also collect VirtualNetwork/Subnet/SecurityGroup status.

Test step: add pre-test resource baseline (node resources, storage), monkey-patch poll_until to dump resource state at exact moment of timeout, and collect post-test diagnostics in the trap handler.

coderabbitai · 2026-05-23T11:59:33Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

Adds pre-test baselines, a Python helper that instruments test timeouts to emit targeted Kubernetes diagnostics, mounts that helper into the test container, expands remote post-test diagnostics (node/pod/resource listings and ansible-job container exit info), collects AAP failed-job details and pod diagnostics, exports virtualization networking YAMLs, and patches/mounts enhanced boot/refresh scripts.

Changes

Enhanced Test Diagnostics and Artifact Collection

Layer / File(s)	Summary
Pre-test baseline `ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh`	Captures initial node metrics, namespace pod count, `lvs` output, and `/home` disk usage before tests.
Timeout instrumentation helper `ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh`	Generates `/tmp/patch_helpers.py` that monkey-patches `tests.core.runner.poll_until` to catch `TimeoutError`, infer resource type from the poll description, and emit resource-specific YAML, related pods, warning events, and metrics before re-raising.
Test execution with helper integration and remote post-test diagnostics `ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh`	Mounts the helper into the vmaas test container, appends it into `tests/conftest.py`, runs `pytest` with JUnit output, and expands `collect_artifacts` to SSH into the remote host and collect node/pod metrics, top pods by memory, lists of computeinstance/virtualnetwork/subnet/securitygroup, and per-container terminated exit codes/reasons for ansible-job pods.
AAP failed-job collection and pod diagnostics `ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh`	Creates `aap-jobs/`, retrieves AAP route host and decoded API token, and—if present—uses authenticated `curl` calls to download failed/error job listings, per-job details (including stdout and events) and instance group data. Exports each `ansible_job`-labeled pod's JSON and `describe` output.
Networking resource exports `ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh`	Writes YAML artifacts for virtual networks, subnets, and security groups in the target namespace.
Boot: prepare-aap patch `ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh`	Creates a patched `prepare-aap` script that captures the raw AAP gateway token response (`AAP_RESPONSE`) and improves parsing/error reporting when the response is not valid JSON.
Boot: refresh wait and installer mount `ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh`	Writes a patched `refresh-after-snapshot.sh` implementing a multi-phase refresh workflow and mounts patched scripts into the installer container run so the updated logic executes during boot/refresh.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

openshift/release#79422: Related changes touching refresh/prepare AAP refresh logic and installer script overrides.

Suggested labels

lgtm, rehearsals-ack

Suggested reviewers

danmanor
eranco74

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (11 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately captures the main focus of the PR: adding AAP job failure diagnostics to the e2e-vmaas CI pipeline, as evidenced by the gather step diagnostics collection and test step monitoring enhancements.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR modifies CI orchestration shell scripts only, not test code. No Ginkgo test definitions found in any modified files. Check is not applicable to CI configuration changes.
Test Structure And Quality	✅ Passed	PR modifies only shell scripts (.sh), not Ginkgo test files (.go). Ginkgo test structure check is not applicable to this repository's CI configuration scripts.
Microshift Test Compatibility	✅ Passed	This PR modifies CI/CD automation scripts (shell scripts in ci-operator/step-registry), not Ginkgo e2e tests. No new Ginkgo tests are added, so MicroShift compatibility check is not applicable.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	This PR modifies only CI/CD shell scripts (no Ginkgo e2e tests added). Check is not applicable since it targets Ginkgo test definitions, not infrastructure/diagnostic scripts.
Topology-Aware Scheduling Compatibility	✅ Passed	Changes are in CI test/gather scripts that collect diagnostics. No new deployment manifests, pod affinity rules, nodeSelector constraints, or other scheduling constraints are introduced.
Ote Binary Stdout Contract	✅ Passed	This PR modifies only shell scripts in CI/CD configuration, not OTE (Go test) binaries. The OTE stdout contract check does not apply to shell scripts.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	PR contains only bash CI/CD infrastructure scripts (in ci-operator/step-registry), not new Ginkgo e2e tests. Custom check applies only to new Ginkgo tests (It/Describe/Context/When).

omer-vishlitzky · 2026-05-23T12:01:53Z

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

openshift-merge-bot · 2026-05-23T12:01:56Z

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh`:
- Around line 106-111: The test loop prints only r.stdout so error output is
lost; update the cmds loop to print both stdout and stderr from subprocess.run
(e.g., include r.stderr in the print call) and handle timeouts by catching
subprocess.TimeoutExpired and printing e.stdout and e.stderr as well;
specifically modify the for cmd in cmds: block that calls subprocess.run(...) to
emit f"--- {cmd} ---\n{r.stdout}\n{r.stderr}" to sys.stderr (truncated if
desired) and in the except Exception as e: branch, detect
subprocess.TimeoutExpired (or print getattr(e, 'stdout', '') and getattr(e,
'stderr', '')) so timeout failures also dump stderr.
- Around line 103-104: The timeout-diagnostics commands use invalid invocations
("kubectl adm top nodes" and "kubectl adm top pods -n {ns} --sort-by=memory");
change them to the correct kubectl subcommand names ("kubectl top nodes" and
"kubectl top pods -n {ns} --sort-by=memory") so the functions that build the
diagnostics list (the strings containing "kubectl adm top nodes" and "kubectl
adm top pods -n {ns} --sort-by=memory") will run successfully and produce
node/pod metrics.

In
`@ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh`:
- Around line 92-126: The new AAP gather block runs without errexit so setup
failures can be skipped; add set -e (preferably set -o errexit or set -euo
pipefail consistent with other step scripts) at the start of that remote heredoc
before the AAP block to enable errexit, and then remove the blanket "|| true"
only from critical setup commands (e.g. mkdir -p "${ARTIFACT_DIR}/aap-jobs",
AAP_ROUTE=$(oc get route ...), AAP_TOKEN=$(oc get secret ...)) so those fail the
script on error; keep "|| true" on the best-effort collectors/outputs (the curl
and oc get pod/json/describe lines that write into ${ARTIFACT_DIR}) to preserve
diagnostics gathering.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: a692b313-7c18-407b-95af-8ef3092d920a

📥 Commits

Reviewing files that changed from the base of the PR and between 1ccb439 and 0300d79.

📒 Files selected for processing (2)

ci-operator/step-registry/osac-project/cluster-tool/test/osac-project-cluster-tool-test-commands.sh
ci-operator/step-registry/osac-project/gather/osac-project-gather-commands.sh

coderabbitai · 2026-05-23T12:06:23Z

+        "kubectl adm top nodes",
+        f"kubectl adm top pods -n {ns} --sort-by=memory",


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Does kubectl support the subcommand sequence kubectl adm top, or is the correct syntax kubectl top? If oc is used instead, what is the correct oc adm top syntax?

💡 Result:

The correct syntax for the Kubernetes command-line tool is kubectl top [1][2]. The sequence kubectl adm top is not a valid command [1][3]. When using the OpenShift CLI (oc), the correct syntax is oc adm top [4][5]. This command is used to analyze resources managed by the platform and provides usage statistics for various resources, such as nodes and pods [4][6]. Common examples for the OpenShift CLI include:

- oc adm top node: Displays resource (CPU/memory) usage of nodes [5][6].
- oc adm top pod: Displays resource (CPU/memory) usage of pods [5][7].

The oc adm top command internally utilizes kubectl top functionality to retrieve these metrics [8].

Citations:

1: https://kubernetes.io/docs/reference/kubectl/generated/kubectl_top/

2: https://signoz.io/blog/kubectl-top/

3: https://manpages.debian.org/bullseye/kubernetes-client/kubectl-top.1.en.html

4: https://manpages.opensuse.org/Tumbleweed/oc/oc-adm-top.1.en.html

5: https://docs.okd.io/latest/cli_reference/openshift_cli/administrator-cli-commands.html

6: https://github.com/RedHatWorkshops/openshiftv3-ops-workshop/blob/master/oc_adm_top.md

7: https://manpages.opensuse.org/Tumbleweed/oc/oc-adm-top-pod.1

8: https://www.redhat.com/en/blog/using-oc-adm-top-to-monitor-memory-usage

Fix invalid kubectl metrics commands in timeout diagnostics
kubectl adm top ... is not a valid kubectl invocation, so the timeout diagnostics will fail and produce no node/pod metrics.

Suggested fix

-        "kubectl adm top nodes",
-        f"kubectl adm top pods -n {ns} --sort-by=memory",
+        "kubectl top nodes",
+        f"kubectl top pods -n {ns} --sort-by=memory",

coderabbitai · 2026-05-23T12:06:23Z

+    for cmd in cmds:
+        try:
+            r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=15)
+            print(f"--- {cmd} ---\n{r.stdout[:3000]}", file=sys.stderr, flush=True)
+        except Exception as e:
+            print(f"--- {cmd} FAILED: {e} ---", file=sys.stderr, flush=True)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Include stderr in the timeout dump.

subprocess.run(..., capture_output=True) captures both streams, but this code only prints r.stdout. Any failed kubectl call becomes an empty section instead of showing the actual error.

Suggested fix

     for cmd in cmds:
         try:
             r = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=15)
-            print(f"--- {cmd} ---\n{r.stdout[:3000]}", file=sys.stderr, flush=True)
+            output = "\n".join(part for part in (r.stdout.strip(), r.stderr.strip()) if part)
+            print(
+                f"--- {cmd} (rc={r.returncode}) ---\n{output[:3000]}",
+                file=sys.stderr,
+                flush=True,
+            )
         except Exception as e:
             print(f"--- {cmd} FAILED: {e} ---", file=sys.stderr, flush=True)

coderabbitai · 2026-05-23T12:06:23Z

+echo "=== Collecting AAP job failure diagnostics ==="
+mkdir -p "${ARTIFACT_DIR}/aap-jobs"
+
+AAP_ROUTE=$(oc get route osac-aap -n "${E2E_NAMESPACE}" -o jsonpath='{.spec.host}' 2>/dev/null || true)
+AAP_TOKEN=$(oc get secret osac-aap-api-token -n "${E2E_NAMESPACE}" -o jsonpath='{.data.token}' 2>/dev/null | base64 -d || true)
+
+if [[ -n "${AAP_ROUTE}" && -n "${AAP_TOKEN}" ]]; then
+    AUTH="Authorization: Bearer ${AAP_TOKEN}"
+    BASE="https://${AAP_ROUTE}/api/controller/v2"
+
+    curl -sk -H "${AUTH}" "${BASE}/jobs/?status__in=error,failed&order_by=-finished&page_size=20" \
+        > "${ARTIFACT_DIR}/aap-jobs/failed-jobs.json" 2>&1 || true
+
+    for JOB_ID in $(jq -r '.results[].id' "${ARTIFACT_DIR}/aap-jobs/failed-jobs.json" 2>/dev/null | head -10); do
+        curl -sk -H "${AUTH}" "${BASE}/jobs/${JOB_ID}/" \
+            > "${ARTIFACT_DIR}/aap-jobs/job-${JOB_ID}-detail.json" 2>&1 || true
+        curl -sk -H "${AUTH}" "${BASE}/jobs/${JOB_ID}/stdout/?format=txt" \
+            > "${ARTIFACT_DIR}/aap-jobs/job-${JOB_ID}-stdout.txt" 2>&1 || true
+        curl -sk -H "${AUTH}" "${BASE}/jobs/${JOB_ID}/job_events/?order_by=-counter&page_size=30" \
+            > "${ARTIFACT_DIR}/aap-jobs/job-${JOB_ID}-events.json" 2>&1 || true
+    done
+
+    curl -sk -H "${AUTH}" "${BASE}/instance_groups/" \
+        > "${ARTIFACT_DIR}/aap-jobs/instance-groups.json" 2>&1 || true
+fi
+
+for POD in $(oc get pods -n "${E2E_NAMESPACE}" -l ansible_job --no-headers -o custom-columns=NAME:.metadata.name 2>/dev/null); do
+    oc get pod "${POD}" -n "${E2E_NAMESPACE}" -o json > "${ARTIFACT_DIR}/aap-jobs/pod-${POD}.json" 2>&1 || true
+    oc describe pod "${POD}" -n "${E2E_NAMESPACE}" > "${ARTIFACT_DIR}/aap-jobs/pod-${POD}-describe.txt" 2>&1 || true
+done
+
+echo "=== Collecting networking resource status ==="
+oc get virtualnetwork -n "${E2E_NAMESPACE}" -o yaml > "${ARTIFACT_DIR}/virtualnetworks.yaml" 2>&1 || true
+oc get subnet -n "${E2E_NAMESPACE}" -o yaml > "${ARTIFACT_DIR}/subnets.yaml" 2>&1 || true
+oc get securitygroup -n "${E2E_NAMESPACE}" -o yaml > "${ARTIFACT_DIR}/securitygroups.yaml" 2>&1 || true


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Run the new gather diagnostics under errexit too.

This new block still executes inside a remote heredoc that defaults to nounset + pipefail only, so setup failures here can be skipped silently unless each command is manually guarded. Please enable set -o errexit for that remote shell and keep || true only on the best-effort collectors.

Suggested fix

 timeout -s 9 10m ssh -F "${SHARED_DIR}/ssh_config" ci_machine bash -s "${E2E_NAMESPACE}" "${REMOTE_ARTIFACT_DIR}" <<'REMOTE_EOF'
+set -o errexit
 set -o nounset
 set -o pipefail

As per coding guidelines "Step registry script files must use set -euo pipefail (without -x) as default and only enable -x when actively debugging".

openshift-ci · 2026-05-23T13:24:41Z

@omer-vishlitzky: all tests passed!

omer-vishlitzky · 2026-05-23T14:31:50Z

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

The AAP operator triggers controller-task deployment rollouts during the refresh-after-snapshot sequence. If tests start before the rollout completes, the old pod is terminated mid-test, its Redis sidecar socket vanishes, and running AAP jobs crash with redis.exceptions.ConnectionError on /var/run/redis/redis.sock. Wait for the rollout to finish after refresh, before declaring boot complete. This matches how the refresh script already waits for fulfillment deployment rollouts.

omer-vishlitzky · 2026-05-23T15:11:47Z

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

openshift-merge-bot · 2026-05-23T15:11:49Z

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci · 2026-05-23T15:11:50Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: omer-vishlitzky

Needs approval from an approver in each of these files:

~~ci-operator/step-registry/osac-project/cluster-tool/boot/OWNERS~~ [omer-vishlitzky]
~~ci-operator/step-registry/osac-project/cluster-tool/test/OWNERS~~ [omer-vishlitzky]
~~ci-operator/step-registry/osac-project/gather/OWNERS~~ [omer-vishlitzky]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

omer-vishlitzky · 2026-05-23T18:05:26Z

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

openshift-merge-bot · 2026-05-23T18:05:28Z

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Embed verbatim copies of refresh-after-snapshot.sh and prepare-aap.sh from osac-installer main, with two fixes:

- prepare-aap.sh: capture curl response before jq so we see what AAP returns when it responds with non-JSON (was causing silent "parse error: Invalid numeric literal" crash).
- refresh-after-snapshot.sh: after step [8/8], wait for the AutomationController to reach Successful status before declaring refresh complete. The AAP operator triggers multiple async controller-task rollouts that kill the Redis sidecar socket on the old pod, crashing in-flight provision jobs. Waiting for Successful ensures all rollouts are done.

omer-vishlitzky · 2026-05-23T19:59:28Z

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

openshift-merge-bot · 2026-05-23T19:59:30Z

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

The heredoc-created scripts need chmod +x before being mounted into the container, otherwise the refresh script fails with "Permission denied" (exit code 126).

omer-vishlitzky · 2026-05-23T20:59:30Z

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

openshift-merge-bot · 2026-05-23T20:59:33Z

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh (2)

213-225: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid printing the raw token endpoint response.

Lines 219 and 224 log the body returned by an authenticated token-creation call. Even truncated, that can expose the AAP route or token material in CI logs. Please redact sensitive fields before logging, or persist the body outside the log stream.

🔒 Proposed fix

 AAP_RESPONSE=$(curl -sk -X POST \
     -u "admin:${AAP_ADMIN_PASSWORD}" \
     -H "Content-Type: application/json" \
     -d '{"description": "osac-operator", "scope": "write"}' \
     "${AAP_URL}/api/gateway/v1/tokens/")
+REDACTED_AAP_RESPONSE=$(printf '%s' "${AAP_RESPONSE}" | sed -E \
+    -e 's/"token"[[:space:]]*:[[:space:]]*"[^"]*"/"token":"<redacted>"/g' \
+    -e 's#https?://[^"[:space:]]+#<redacted-url>#g')
 AAP_TOKEN=$(echo "${AAP_RESPONSE}" | jq -r '.token') || {
-    echo "ERROR: AAP gateway returned non-JSON response: ${AAP_RESPONSE:0:500}"
+    echo "ERROR: AAP gateway returned non-JSON response: ${REDACTED_AAP_RESPONSE:0:500}"
     exit 1
 }

 if [[ -z "${AAP_TOKEN}" || "${AAP_TOKEN}" == "null" ]]; then
-    echo "Failed to create AAP API token. Response: ${AAP_RESPONSE:0:500}"
+    echo "Failed to create AAP API token. Response: ${REDACTED_AAP_RESPONSE:0:500}"
     exit 1
 fi

As per coding guidelines, "Protect sensitive information in step registry scripts - never echo or print passwords, tokens, API keys, cluster URLs, or kubeconfig contents".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh`
around lines 213 - 225, The script currently echoes raw AAP_RESPONSE and slices
it into logs when token creation fails; update the token creation/check block
(the AAP_RESPONSE and AAP_TOKEN handling) to never print raw response contents:
instead parse and mask sensitive fields (e.g., remove or replace .token and any
URL fields via jq) before any echo, or write the full response to a secure
file/secret store and only log a non-sensitive stub like "REDACTED_RESPONSE" or
a masked summary; ensure the failure messages referencing AAP_RESPONSE use the
masked summary variable rather than the raw AAP_RESPONSE.
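
The masking idea in this finding can be sketched outside the shell script. This is an illustration under assumptions, not the proposed fix: the set of sensitive field names in `SENSITIVE_KEYS` is a guess at what a token response might contain, and `redact` and `loggable_summary` are hypothetical names.

```python
import json
import re

# Assumed field names; the actual AAP gateway response schema is not shown in this PR.
SENSITIVE_KEYS = {"token", "refresh_token", "password"}
URL_RE = re.compile(r"https?://[^\s\"']+")


def redact(value):
    """Recursively mask sensitive fields and any URL found in string values."""
    if isinstance(value, dict):
        return {
            key: "<redacted>" if key.lower() in SENSITIVE_KEYS else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        return URL_RE.sub("<redacted-url>", value)
    return value


def loggable_summary(raw, limit=500):
    """Return a string that is safe to print for a raw token-endpoint response."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        # Unparseable bodies cannot be masked field by field, so log only their size.
        return f"<non-JSON response, {len(raw)} characters>"
    return json.dumps(redact(parsed), sort_keys=True)[:limit]


if __name__ == "__main__":
    print(loggable_summary('{"token": "abc123", "url": "https://aap.example.com/api/x", "id": 7}'))
    print(loggable_summary("<html>502 Bad Gateway</html>"))
```

Logging only the size of a non-JSON body is the conservative choice in this sketch. It gives up the visibility into unexpected responses that the patched prepare-aap script was written to provide, so writing the raw body to an artifact file instead may be the better trade-off.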

433-436: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use the condition status as the readiness signal (not reason) (ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh:433-436). Kubernetes conditions use status as the satisfaction/truth value (True/False/Unknown); reason is diagnostic text and can change independently.

Proposed fix

-retry_until 300 10 '[[ "$(oc get automationcontroller osac-aap-controller -n '"${INSTALLER_NAMESPACE}"' -o jsonpath='"'"'{.status.conditions[?(@.type=="Successful")].reason}'"'"' 2>/dev/null)" == "Successful" ]]' || {
+retry_until 300 10 '[[ "$(oc get automationcontroller osac-aap-controller -n '"${INSTALLER_NAMESPACE}"' -o jsonpath='"'"'{.status.conditions[?(@.type=="Successful")].status}'"'"' 2>/dev/null)" == "True" ]]' || {
     echo "WARNING: AAP operator did not reach Successful state, waiting for controller-task rollout instead..."
     oc rollout status deployment/osac-aap-controller-task -n "${INSTALLER_NAMESPACE}" --timeout=300s
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh`
around lines 433 - 436, The check in the retry_until call uses the condition's
reason instead of its truth value; change the jsonpath used in the oc get
command inside retry_until to read
.status.conditions[?(@.type=="Successful")].status and compare against "True"
(i.e. update the oc get automationcontroller osac-aap-controller ... -o jsonpath
to use .status rather than .reason) so readiness is based on condition.status;
keep the existing fallback that logs the warning and calls oc rollout status
deployment/osac-aap-controller-task -n "${INSTALLER_NAMESPACE}" --timeout=300s
unchanged.
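To make the polling pattern concrete, here is a rough sketch of a `retry_until TIMEOUT INTERVAL COMMAND` helper. The real `retry_until` in the step script is not shown in this thread, so this implementation is an assumption, and `check_ready` is a hypothetical stand-in for the `oc get ... -o jsonpath=...status` comparison against `"True"`.

```bash
#!/usr/bin/env bash
# Assumed implementation: evaluate COMMAND every INTERVAL seconds until it
# succeeds or TIMEOUT seconds have elapsed.
retry_until() {
    local timeout="$1" interval="$2" cmd="$3"
    local deadline=$(( SECONDS + timeout ))
    until eval "${cmd}"; do
        if (( SECONDS >= deadline )); then
            return 1
        fi
        sleep "${interval}"
    done
}

# Stand-in for the condition check: the status becomes "True" on the third poll.
polls=0
check_ready() {
    polls=$(( polls + 1 ))
    local status="Unknown"
    if (( polls >= 3 )); then
        status="True"
    fi
    [[ "${status}" == "True" ]]
}

if retry_until 5 0 'check_ready'; then
    echo "ready after ${polls} polls"
fi
```

Comparing against the condition's `status` keeps the loop independent of whatever diagnostic text the operator writes into `reason`.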

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 11628275-9d16-4378-b957-454d5080714c

📥 Commits

Reviewing files that changed from the base of the PR and between 3099928 and d331cd8.

📒 Files selected for processing (1)

ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh

The AAP operator triggers cascading controller-task rollouts after kustomize apply. The final rollout is triggered by the AAP gateway finishing its rollout (updates credentials.py checksum). The old sequential refresh finished before this final rollout, causing tests to run on a controller-task pod that gets killed mid-test.

Restructure the refresh into parallel phases:

- Phase 1: Keycloak sync + fulfillment credentials (parallel)
- Phase 2: Kustomize apply
- Phase 3: Fulfillment rollouts + AAP config + AAP controller wait + AAP gateway rollout (all parallel); the gateway rollout is NEW
- Phase 4: prepare-aap + prepare-fulfillment
- Phase 5: Restart fulfillment + tenant
- Phase 6: Wait for final controller-task rollout (should be done already since gateway completed in Phase 3)

Add timestamps and deployment status diagnostics throughout.
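The parallel-phase structure described in this commit message could look roughly like the sketch below. It is only an illustration: `run_phase`, the task functions, and the `ts` format are hypothetical names, and the actual refresh script may be organized differently.

```bash
#!/usr/bin/env bash
# Timestamp helper for log lines (format is an assumption).
ts() { date -u +%H:%M:%S; }

# Start every task of a phase in the background, then wait for all of them.
# Returns non-zero if any task failed.
run_phase() {
    local name="$1"; shift
    local pids=() rc=0 task pid
    echo "$(ts) [${name}] starting: $*"
    for task in "$@"; do
        "${task}" &
        pids+=("$!")
    done
    for pid in "${pids[@]}"; do
        wait "${pid}" || rc=1
    done
    echo "$(ts) [${name}] finished (rc=${rc})"
    return "${rc}"
}

# Hypothetical tasks; background jobs run in subshells, so they report via a file.
phase_log=$(mktemp)
sync_keycloak()       { echo "keycloak synced" >> "${phase_log}"; }
refresh_fulfillment() { echo "fulfillment credentials refreshed" >> "${phase_log}"; }
failing_task()        { return 3; }

phase1_rc=0
run_phase "phase-1" sync_keycloak refresh_fulfillment || phase1_rc=$?
phasex_rc=0
run_phase "phase-x" sync_keycloak failing_task || phasex_rc=$?
```

Waiting on each PID individually, rather than calling a bare `wait`, is what lets the phase report a failure from any one of its tasks.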

omer-vishlitzky · 2026-05-23T23:20:56Z

/pj-rehearse pull-ci-osac-project-osac-operator-main-e2e-vmaas pull-ci-osac-project-fulfillment-service-main-e2e-vmaas pull-ci-osac-project-osac-test-infra-main-e2e-vmaas pull-ci-osac-project-osac-installer-main-e2e-vmaas pull-ci-osac-project-osac-aap-main-e2e-vmaas

openshift-merge-bot · 2026-05-23T23:20:58Z

@omer-vishlitzky: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-merge-bot · 2026-05-23T23:24:07Z

[REHEARSALNOTIFIER]
@omer-vishlitzky: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name	Repo	Type	Reason
pull-ci-osac-project-osac-aap-main-e2e-vmaas	osac-project/osac-aap	presubmit	Registry content changed
pull-ci-osac-project-osac-installer-main-e2e-vmaas	osac-project/osac-installer	presubmit	Registry content changed
pull-ci-osac-project-osac-test-infra-main-e2e-vmaas	osac-project/osac-test-infra	presubmit	Registry content changed
pull-ci-osac-project-osac-operator-main-e2e-vmaas	osac-project/osac-operator	presubmit	Registry content changed
pull-ci-osac-project-fulfillment-service-main-e2e-vmaas	osac-project/fulfillment-service	presubmit	Registry content changed
periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-subnet-lifecycle	N/A	periodic	Registry content changed
periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-virtual-network-lifecycle	N/A	periodic	Registry content changed
periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-creation	N/A	periodic	Registry content changed
periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-api-fields	N/A	periodic	Registry content changed
periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-cli-fields	N/A	periodic	Registry content changed
periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-delete-during-provision	N/A	periodic	Registry content changed
periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-restart	N/A	periodic	Registry content changed
periodic-ci-osac-project-osac-test-infra-main-e2e-metal-vmaas-compute-instance-restart-negative	N/A	periodic	Registry content changed

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh (2)

211-227: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don’t print the raw AAP token response to the job log.

Lines 219 and 224 dump part of an authenticated token-creation response to stdout. That endpoint can return freshly minted, non-redacted secret material, so the failure path should log only status/sanitized fields and keep the raw body off the build log.

🔒 Suggested hardening

-AAP_RESPONSE=$(curl -sk -X POST \
-    -u "admin:${AAP_ADMIN_PASSWORD}" \
-    -H "Content-Type: application/json" \
-    -d '{"description": "osac-operator", "scope": "write"}' \
-    "${AAP_URL}/api/gateway/v1/tokens/")
-AAP_TOKEN=$(echo "${AAP_RESPONSE}" | jq -r '.token') || {
-    echo "ERROR: AAP gateway returned non-JSON response: ${AAP_RESPONSE:0:500}"
-    exit 1
-}
-
-if [[ -z "${AAP_TOKEN}" || "${AAP_TOKEN}" == "null" ]]; then
-    echo "Failed to create AAP API token. Response: ${AAP_RESPONSE:0:500}"
-    exit 1
-fi
+TMP_AAP_RESPONSE=$(mktemp)
+HTTP_CODE=$(curl -sk -o "${TMP_AAP_RESPONSE}" -w '%{http_code}' -X POST \
+    -u "admin:${AAP_ADMIN_PASSWORD}" \
+    -H "Content-Type: application/json" \
+    -d '{"description": "osac-operator", "scope": "write"}' \
+    "${AAP_URL}/api/gateway/v1/tokens/")
+AAP_TOKEN=$(jq -r '.token // empty' "${TMP_AAP_RESPONSE}") || {
+    echo "ERROR: AAP gateway returned non-JSON response (HTTP ${HTTP_CODE})"
+    exit 1
+}
+
+if [[ -z "${AAP_TOKEN}" ]]; then
+    echo "ERROR: Failed to create AAP API token (HTTP ${HTTP_CODE})"
+    exit 1
+fi

As per coding guidelines "Protect sensitive information in step registry scripts - never echo or print passwords, tokens, API keys, cluster URLs, or kubeconfig contents".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh`
around lines 211 - 227, The script currently prints parts of the authenticated
token response (AAP_RESPONSE and AAP_TOKEN) to stdout on error; instead stop
emitting raw response bodies or token values. Modify the curl invocation and
error paths around AAP_RESPONSE/AAP_TOKEN so you capture HTTP status separately
(e.g. via curl -w) and on failure log only the HTTP status and/or sanitized JSON
fields (extract non-secret fields like "status" or "message" with jq) or a fixed
redacted message; never echo the AAP_RESPONSE or AAP_TOKEN contents (replace the
two echo lines that reference AAP_RESPONSE and any checks that would print
AAP_TOKEN with a redacted/failure message). Ensure the code paths that exit on
error return useful diagnostics without revealing token or secret material.

262-519: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Run refresh-after-snapshot.sh with Bash, not sh.

osac-project-cluster-tool-boot-commands.sh mounts a patched refresh-after-snapshot.sh that is explicitly a bash script (#!/usr/bin/env bash) and uses bash-only constructs ([[ ... ]], local, arrays, (( ... ))), but it’s executed as sh scripts/refresh-after-snapshot.sh (lines 529-540).

🛠️ Minimal fix

-    bash -c "${COMPONENT_OVERRIDE_CMD}${AAP_OVERRIDE_CMD}cd /installer && sh scripts/refresh-after-snapshot.sh"
+    bash -c "${COMPONENT_OVERRIDE_CMD}${AAP_OVERRIDE_CMD}cd /installer && bash scripts/refresh-after-snapshot.sh"

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh`
around lines 262 - 519, The script refresh-after-snapshot.sh is a Bash script
(uses [[ ]], local, arrays, (( ))), but it is invoked with sh; change the
invocation in osac-project-cluster-tool-boot-commands.sh so it runs with Bash
instead of sh (replace the call of "sh scripts/refresh-after-snapshot.sh" with
an invocation using bash, e.g. "bash scripts/refresh-after-snapshot.sh" or
execute the script directly ensuring the shebang is honored), so that bash-only
constructs in refresh-after-snapshot.sh execute correctly.
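Beyond changing the invocation, a script can also protect itself against being started with `sh`. The sketch below shows that complementary guard; it is an assumption on top of the review, not something the PR contains, and `refresh-demo.sh` is a made-up stand-in for the real script.

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)

# The guard uses only POSIX syntax, so a plain `sh` can parse and run it
# before it ever reaches the bash-only constructs further down.
cat > "${workdir}/refresh-demo.sh" <<'EOF'
#!/usr/bin/env bash
if [ -z "${BASH_VERSION:-}" ]; then
    exec bash "$0" "$@"
fi
phases=(keycloak kustomize gateway)
if [[ "${#phases[@]}" -eq 3 ]]; then
    echo "bash constructs ok: ${phases[*]}"
fi
EOF

# Deliberately invoked with sh, mirroring the problematic call.
demo_output=$(sh "${workdir}/refresh-demo.sh")
echo "${demo_output}"
```

Invoking the script with `bash` directly, as the review proposes, remains the simpler fix; the guard only helps when callers cannot all be updated.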

ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5ce003ef-820d-4a7e-a465-a206980fe880

📥 Commits

Reviewing files that changed from the base of the PR and between d331cd8 and cdbef02.

📒 Files selected for processing (1)

ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh

coderabbitai · 2026-05-23T23:29:22Z

+echo "$(ts) [post] Controller-task checksum annotations:"
+oc get pod "${CONTROLLER_TASK_POD}" -n "${INSTALLER_NAMESPACE}" -o jsonpath='{.metadata.annotations}' 2>/dev/null | jq -r 'to_entries[] | select(.key | startswith("checksum-")) | "  \(.key): \(.value)"' 2>/dev/null || true


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh"
# Show the exact context around the reported lines
nl -ba "$FILE" | sed -n '480,540p'

Repository: openshift/release

Length of output: 104

🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh"
# Extract only the checksum annotation dump line(s)
rg -n "checksum-\"|checksum annotations|Controller-task checksum" "$FILE"

Repository: openshift/release

Length of output: 350

🏁 Script executed:

#!/bin/bash
set -euo pipefail
FILE="ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh"
# Print the exact lines (no nl dependency)
sed -n '505,525p' "$FILE"
# Show if there are other checksum annotation dumps that use jq (context/pattern)
rg -n "checksum-.*annotations|Controller-task checksum|to_entries\\[]|startswith\$\"checksum-\"\$" "$FILE"

Repository: openshift/release

Length of output: 1933

🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n --fixed-string "-o jsonpath='{.metadata.annotations}'" -S .

Repository: openshift/release

Length of output: 509

🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Search for other occurrences of the same jsonpath annotations pattern
rg -n -- "-o jsonpath='\\{\\.metadata\\.annotations\\}'" .

Repository: openshift/release

Length of output: 593

🏁 Script executed:

#!/bin/bash
set -euo pipefail
# Inspect the other file that uses -o jsonpath='{.metadata.annotations}'
FILE2="ci-operator/step-registry/openshift-qe/move-pods-infra/openshift-qe-move-pods-infra-commands.sh"
sed -n '1,120p' "$FILE2"
# Search for any jq pipelines using jsonpath annotations
rg -n "jsonpath='\\{\\.metadata\\.annotations\\}'[^\\n]*\\|[^\\n]*jq" -S .
# Search for jq to_entries on metadata.annotations in repo
rg -n "to_entries\\[\\]|to_entries\\[]|to_entries\\[\\]|to_entries" -S ci-operator/step-registry

Repository: openshift/release

Length of output: 25271

🌐 Web query:

kubectl -o jsonpath '{.metadata.annotations}' output format

💡 Result:

The correct kubectl JSONPath usage for annotations is:

kubectl get -o jsonpath='{.metadata.annotations}'

Notes:

1) kubectl supports the -o jsonpath= output format; the JSONPath template is enclosed in curly braces and you can directly reference object fields like .metadata.annotations. [1][2]
2) If an annotation value contains special characters (notably dots in annotation keys), you may need bracket-style JSONPath field access or escaping. Kubernetes’ kubectl JSONPath docs show bracket/dot alternatives (e.g., {['kind']} and escaping termination characters), and there are known cases where bracket access/escaping is required for keys with dots. [2][3]

If you want a specific annotation key value (rather than the whole annotations map), use bracket notation:

kubectl get -o jsonpath='{.metadata.annotations["key"]}'

And for keys that contain dots, prefer bracket notation with escaping, e.g.:

kubectl get -o jsonpath='{.metadata.annotations["external-dns.alpha.kubernetes.io/hostname"]}' [2][3]

Example (all pods in a namespace, print their annotations map):

kubectl get pods -n -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations}{"\n"}{end}' [2]

Citations:

1: https://kubernetes.io/docs/reference/kubectl/

2: https://v1-30.docs.kubernetes.io/docs/reference/kubectl/jsonpath

3: JSONpath fails to return keys containing dots in a map kubernetes/kubernetes#23386

🌐 Web query:

kubectl -o jsonpath '{.metadata.annotations}' output format jq to_entries

💡 Result:

Use jq, but you can’t pipe “kubectl -o jsonpath …” directly into jq to get an object unless the jsonpath output is valid JSON.

1) If you want the annotations as a JSON object (recommended), use kubectl -o json and then jq:

kubectl get -o json | jq -r '.metadata.annotations | to_entries'

2) If you insist on jsonpath, you must emit valid JSON.
- kubectl can output just the raw annotation map values, but JSONPath output is not JSON by default; jq expects JSON.
- Kubernetes documents that kubectl JSONPath can format output, but jq is typically used when you need operations not supported by kubectl JSONPath (e.g., regex); the docs also emphasize that regex matching isn’t supported in kubectl JSONPath and suggest using jq on -o json instead [1][2].

3) Typical working jq for entries, from a pod/object JSON response:

kubectl get -o json | jq -r '.metadata.annotations | to_entries[] | "\(.key)=\(.value)"'

Why your exact idea usually fails:
- A kubectl JSONPath template returns a string (or concatenation) rather than a proper JSON object suitable for jq 'to_entries' unless you carefully craft it to produce JSON text. For annotations, the easiest approach is to request JSON from kubectl and let jq do to_entries.

Citations:
- Kubernetes kubectl JSONPath reference (mentions kubectl JSONPath output behavior and recommends jq for cases JSONPath can’t handle, e.g., regex; and gives the canonical approach of using -o json | jq when needed) [1][2].

Citations:

1: https://v1-30.docs.kubernetes.io/docs/reference/kubectl/jsonpath

2: https://kubernetes.io/docs/reference/kubectl/jsonpath/

🌐 Web query:

kubectl get pod -o jsonpath='{.metadata.annotations}' example output

💡 Result:

To retrieve the annotations of a pod using kubectl and JSONPath, you can use the following command:

kubectl get pod -o jsonpath='{.metadata.annotations}'

Example Output: If a pod has annotations such as "kubernetes.io/config.seen" and "example.com/version", the output will be a map (dictionary) of key-value pairs:

map[example.com/version:1.0.0 kubernetes.io/config.seen:2026-05-23T10:00:00Z]

Important Considerations:

1. Accessing Specific Annotations: If you need to access a specific annotation key that contains dots (e.g., "kubernetes.io/created-by"), you must escape the dots with a backslash to prevent the JSONPath parser from interpreting them as child operators [1][2]. Example:

kubectl get pod -o jsonpath='{.metadata.annotations.kubernetes\.io/created-by}'

2. Handling Special Characters: If escaping does not work for complex keys, you can use bracket notation with single quotes [1][3]:

kubectl get pod -o jsonpath="{.metadata.annotations['kubernetes.io/created-by']}"

3. Formatting: The output of {.metadata.annotations} is the raw map representation provided by the Go template engine used in kubectl. If you require structured JSON output for further processing, it is often more reliable to pipe the output to a tool like jq [4][3]:

kubectl get pod -o json | jq '.metadata.annotations'

Citations:

1: Unable to kubectl get -o jsonpath annotation value kubernetes/kubectl#25

2: https://kubernetes.io/docs/reference/kubectl/jsonpath/

3: https://stackoverflow.com/questions/64014047/extract-information-of-kubernetes-annotations-with-jsonpath

4: https://dev.to/sachincool/kubectl-jsonpath-extract-exactly-what-you-need-2hg4

Fix jq input for checksum annotation dump

oc get pod ... -o jsonpath='{.metadata.annotations}' prints a non-JSON “map[...]” style string, so jq doesn’t receive a JSON object; since jq stderr is discarded and the command has || true, the checksum annotation dump can end up printing nothing.

🩹 Suggested fix

-oc get pod "${CONTROLLER_TASK_POD}" -n "${INSTALLER_NAMESPACE}" -o jsonpath='{.metadata.annotations}' 2>/dev/null | jq -r 'to_entries[] | select(.key | startswith("checksum-")) | "  \(.key): \(.value)"' 2>/dev/null || true
+oc get pod "${CONTROLLER_TASK_POD}" -n "${INSTALLER_NAMESPACE}" -o json 2>/dev/null \
+    | jq -r '.metadata.annotations // {} | to_entries[] | select(.key | startswith("checksum-")) | "  \(.key): \(.value)"' 2>/dev/null || true

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/osac-project/cluster-tool/boot/osac-project-cluster-tool-boot-commands.sh`
around lines 513 - 514, The oc command is outputting a non-JSON "map[…]" when
using -o jsonpath, so jq receives invalid input; change the invocation to
request full JSON and let jq extract annotations: call oc get pod
"${CONTROLLER_TASK_POD}" -n "${INSTALLER_NAMESPACE}" -o json (instead of -o
jsonpath) and pipe that to jq -r '.metadata.annotations | to_entries[] |
select(.key | startswith("checksum-")) | "  \(.key): \(.value)"' so jq gets a
real object; keep the final || true if you want to tolerate missing pods but
avoid redirecting jq stderr into /dev/null so errors are visible.
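The jq filter from the suggested fix can be exercised without a cluster. This is a minimal sketch that assumes `jq` is installed; the `sample_pod` document is made up and only mimics the shape of `oc get pod ... -o json` output. Requesting `-o json` also avoids depending on how a particular client version renders a map in jsonpath output.

```bash
#!/usr/bin/env bash
# Made-up pod document with two checksum annotations and one unrelated one.
sample_pod='{"metadata":{"annotations":{"checksum-config":"def","checksum-secret":"abc","other":"x"}}}'

# Same filter as the suggested fix: default to {} when annotations are absent,
# then print only the checksum-* entries.
list_checksums() {
    jq -r '.metadata.annotations // {} | to_entries[] | select(.key | startswith("checksum-")) | "  \(.key): \(.value)"'
}

checksums=$(printf '%s' "${sample_pod}" | list_checksums)
echo "${checksums}"

# A pod without annotations yields no output and no error.
no_annotations=$(printf '%s' '{"metadata":{}}' | list_checksums)
```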

openshift-ci Bot requested review from danmanor and jhernand May 23, 2026 11:59

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 23, 2026

coderabbitai Bot reviewed May 23, 2026

OSAC: fix permission denied on patched scripts

d331cd8

The heredoc-created scripts need chmod +x before being mounted into the container, otherwise the refresh script fails with "Permission denied" (exit code 126).
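The failure mode described in this commit can be reproduced locally. The sketch below is an illustration under assumptions: the file name `patched.sh` is hypothetical and the script is run directly rather than mounted into a container.

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)

# A heredoc creates the file with the default mode, which has no execute bit.
cat > "${workdir}/patched.sh" <<'EOF'
#!/usr/bin/env bash
echo "patched script ran"
EOF

# Direct execution fails with "Permission denied", reported as exit code 126.
rc_before=0
"${workdir}/patched.sh" 2>/dev/null || rc_before=$?

# After chmod +x the same invocation succeeds.
chmod +x "${workdir}/patched.sh"
patched_output=$("${workdir}/patched.sh")
echo "before chmod: rc=${rc_before}; after chmod: ${patched_output}"
```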

coderabbitai Bot reviewed May 23, 2026

omer-vishlitzky closed this May 23, 2026

omer-vishlitzky reopened this May 23, 2026

coderabbitai Bot reviewed May 23, 2026

		"kubectl adm top nodes",
		f"kubectl adm top pods -n {ns} --sort-by=memory",
