Create GS baremetal Agent-Based Install steps, chain, and reference health workflow by sg-rh · Pull Request #78959 · openshift/release

sg-rh · 2026-05-07T00:06:59Z

Add ABI bare-metal BMC install steps, chain, and a reference workflow (step-registry)

Adds step-registry support for Agent-Based Installation (ABI) on bare metal without a bastion. Site-specific values (hosts, networks, BMC addresses, VIPs, tunnel ports) come from the cluster profile and env vars—not from hardcoded logic in the step scripts.

Out of scope for this PR: `ci-operator/config` job definitions (interop, `gs-bm`, `metal-redhat-gs`, localnet). Those land in follow-up PRs.

Layout (step registry)

Path	Role
`abi/conf/bm/`	Configuration step `abi-conf-bm`
`abi/install/bmc/`	BMC install step `abi-install-bmc`
`abi/chains/bm--bmc/`	Chain `abi-chains-bm--bmc` (`abi-conf-bm` → `abi-install-bmc`)
`abi/workflows/bm--bmc--cluster-health/`	Reference workflow `abi-workflows-bm--bmc--cluster-health`

Docs: ci-operator/step-registry/abi/README.md · Registry: abi-conf-bm, abi-install-bmc

Step: `abi-conf-bm` (configuration / Day-0 & Day-1)

Role: Prepare everything abi-install-bmc needs before any ISO is built or any BMC is touched.

Base image: ci/baremetal-qe-base:latest (includes nmstatectl for NMState in agent-config).

Cluster profile must provide: pull-secret, ssh-publickey, cred--bmc--usr, cred--bmc--pwd, and ${OCP__ABI__CFG_FN} (default ocp--abi--cfg.yaml).
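A minimal pre-flight sketch for the requirement above, assuming the listed files sit directly under ${CLUSTER_PROFILE_DIR}. The function name CheckProfileFiles is hypothetical and is not part of this PR:

```shell
# Hypothetical helper: verify that every required cluster-profile file
# exists and is non-empty, reporting all missing files before failing.
function CheckProfileFiles () {
    typeset dir="$1" f=''
    typeset -i missing=0
    shift
    for f in "$@"; do
        [ -s "${dir}/${f}" ] || {
            echo "Missing or empty cluster-profile file: ${f}" >&2
            ((++missing))
        }
    done
    # Succeeds only when nothing was missing.
    ((missing == 0))
}

# Possible call at the top of the conf step (file names taken from the
# list above; the default config name is the one stated in this PR):
#   CheckProfileFiles "${CLUSTER_PROFILE_DIR}" \
#       pull-secret ssh-publickey cred--bmc--usr cred--bmc--pwd \
#       "${OCP__ABI__CFG_FN:-ocp--abi--cfg.yaml}"
```

Reporting every missing file in one pass, rather than stopping at the first, saves a rerun when several secrets are absent from a new cluster profile.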

What it does:

Day-0 — cluster configuration
- Create a minimal install-config.yaml and agent-config.yaml template.
- Apply structured overrides from OCP__ABI__CFG / ${CLUSTER_PROFILE_DIR}/${OCP__ABI__CFG_FN} (UpdateCfg Day0: merge/replace YAML/JSON per README).
- Run optional OCP__ABI__DAY0_SCRIPTS_YAML hooks (Scripts: list; schema in OpenShift-LP-QE--Tools).
Day-1 — manifests before ISO
- Run openshift-install agent create cluster-manifests.
- Apply Day-1 overrides from OCP__ABI__CFG (UpdateCfg Day1).
- Run optional OCP__ABI__DAY1_SCRIPTS_YAML hooks.
BMC handoff
- Build BMC metadata for the install step (e.g. ocp--bmc--info.json).
- Strip .bmc credentials from agent-config.yaml so secrets are not carried in the tarball.
Pack for install step
- Pack the workspace into ocpClusterInf.tgz in ${SHARED_DIR} for abi-install-bmc.
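The pack and unpack handoff between the two steps can be sketched as follows. This is only an illustration: the helper names are hypothetical, and the archive layout (paths relative to the workspace root) is an assumption, not something the PR description states.

```shell
# Hypothetical helpers illustrating the ${SHARED_DIR} handoff.
# Assumption: the tarball stores paths relative to the workspace root.

# Conf step side: pack the prepared workspace.
function PackWorkspace () {
    typeset workDir="$1" sharedDir="$2"
    tar -czf "${sharedDir}/ocpClusterInf.tgz" -C "${workDir}" .
}

# Install step side: unpack into the cluster directory.
function UnpackWorkspace () {
    typeset sharedDir="$1" clusterDir="$2"
    mkdir -p "${clusterDir}"
    tar -xzf "${sharedDir}/ocpClusterInf.tgz" -C "${clusterDir}"
}

# Possible usage:
#   PackWorkspace   "${OCP__ABI__CLUSTER_DIR}" "${SHARED_DIR}"   # abi-conf-bm
#   UnpackWorkspace "${SHARED_DIR}" "${OCP__ABI__CLUSTER_DIR}"   # abi-install-bmc
```

Because the archive travels through ${SHARED_DIR}, stripping the .bmc credentials before packing (the previous item) is what keeps secrets out of it.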

Key env vars (see abi-conf-bm-ref.yaml for full list): OCP__ABI__BM__CLS_NAME, OCP__ABI__BM__BASE_DOM, OCP__ABI__CFG_FN, OCP__ABI__DAY0_SCRIPTS_YAML, OCP__ABI__DAY1_SCRIPTS_YAML, OCP__ABI__MIN_ISO, OCP__ABI__TUN_SVC__DP_BASE_URL, OCP__ABI__TUN_SVC__DP_PORT.

Step: `abi-install-bmc` (install / BMC / Day-1.5 & Day-2)

Role: Build and deliver the agent ISO, install the cluster via Redfish virtual media and a Chisel tunnel (no bastion), wait for the cluster to be ready, and publish artifacts.

Base image: ci/baremetal-qe-base:latest.

Cluster profile must provide: ssh-privatekey, ${OCP__ABI__CFG_FN}. Secrets: Chisel (/var/run/secrets/chisel), optional Bitwarden vault (BW__OBJ_NAME).

What it does:

Workspace
- Unpack ${SHARED_DIR}/ocpClusterInf.tgz into ${OCP__ABI__CLUSTER_DIR} (must match abi-conf-bm).
ISO build & serve
- openshift-install agent create image.
- Serve the ISO with an HTTP server that supports Range requests (required for agent minimal ISO / partial fetch).
- Expose the ISO on the Chisel data plane using OCP__ABI__TUN_SVC__DP_BASE_URL, OCP__ABI__TUN_SVC__DP_PORT, and OCP__ABI__TEAM_NAME (Chisel basic-auth files under /secret/chisel).
BMC / Redfish (per host)
- Use metadata from the conf step; vendor-specific Redfish handling in abi-install-bmc-commands.sh.
- Virtual media insert, boot order, power on/off, retries, stuck-boot detection, ISO eject / cleanup where implemented.
Install waits
- openshift-install agent wait-for bootstrap-complete and wait-for install-complete with configurable retries (OCP__ABI__WAIT__BOOTSTRAP__TRY, OCP__ABI__WAIT__CLUSTER__TRY, OCP__ABI__WAIT__NODE_READY__M).
Day-1.5 / Day-2
- Optional actions from OCP__ABI__CFG (e.g. Day-1.5 node provisioning flags).
- OCP__ABI__DAY2_SCRIPTS_YAML via BuildCustomScriptsFromYAML (OpenShift-LP-QE--Tools).
Artifacts
- Admin kubeconfig, ocp.tgz, logs → ${ARTIFACT_DIR}; handoff via ${SHARED_DIR}.
- Optional upload of kubeconfig to Bitwarden when BW__OBJ_NAME is set.
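The Range requirement noted under "ISO build & serve" can be checked before the ISO URL is handed to a BMC. A minimal sketch, assuming curl is available in the step image; ProbeIsoRange is a hypothetical helper and not the probe used in abi-install-bmc-commands.sh:

```shell
# Hypothetical helper: succeed only if the server honours HTTP Range
# requests for the given URL (status 206 Partial Content).
function ProbeIsoRange () {
    typeset url="$1" code=''
    # Request the first byte only; discard the body, print the status code.
    code="$(curl -sS -o /dev/null -r 0-0 --max-time 10 \
        -w '%{http_code}' "${url}")" || true
    [ "${code}" = 206 ]
}

# Possible usage (isoURL is a placeholder for the tunnelled ISO URL):
#   ProbeIsoRange "${isoURL}" || echo "ISO server lacks Range support" >&2
```

A server that ignores the Range header typically answers 200 with the full body, which this check treats as a failure.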

Key env vars (see abi-install-bmc-ref.yaml for full list): OCP__ABI__TUN_SVC__CP_URL, OCP__ABI__TUN_SVC__DP_*, OCP__ABI__TEAM_NAME, OCP__ABI__WAIT__*, OCP__ABI__DAY2_SCRIPTS_YAML, BW__OBJ_NAME.

Chain: `abi-chains-bm--bmc`

Runs in order:

abi-conf-bm
abi-install-bmc

Reference workflow: `abi-workflows-bm--bmc--cluster-health`

Phase	Content
pre	`abi-chains-bm--bmc` (install), then `cucushift-installer-check-cluster-health`
post	empty (teardown TBD)
test	not defined in workflow

Post-install health check is part of setup (last step in pre), not workflow test, so follow-up jobs can set steps.test to their own e2e while still inheriting install + health. Same pattern as install frameworks like ipi-aws.

Example follow-up job (not in this PR):

steps:
  cluster_profile: metal-redhat-gs
  env: { ... }
  test:
    - ref: your-interop-or-e2e-step
  workflow: abi-workflows-bm--bmc--cluster-health

coderabbitai · 2026-05-07T00:07:32Z

Walkthrough

This PR adds Agent-Based Installer (ABI) bare-metal testing infrastructure to RedHatQE interop-testing CI. It defines two new ABI steps (abi-conf-bm for cluster configuration and abi-install-bmc for installation), a workflow to orchestrate them, test job definitions, and updates to CI reporting configuration.

Changes

ABI Step Registry and Workflow Infrastructure

Layer / File(s)	Summary
Ownership & Documentation `ci-operator/step-registry/abi/OWNERS`, `ci-operator/step-registry/abi/README.md`	Establishes ABI area ownership under `cspi-qe-ocp-lp` and documents Day-0/Day-1/Day-2 phases, configuration merge semantics, and script injection patterns.
Configuration Step `ci-operator/step-registry/abi/conf/OWNERS`, `ci-operator/step-registry/abi/conf/bm/OWNERS`, `ci-operator/step-registry/abi/conf/bm/abi-conf-bm-ref.yaml`, `ci-operator/step-registry/abi/conf/bm/abi-conf-bm-ref.metadata.json`, `ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh`	Defines the `abi-conf-bm` step which generates `install-config.yaml` and `agent-config.yaml`, extracts BMC connectivity info, applies Day-0 overrides, enables minimal ISO mode when needed, and packages cluster manifests.
Installation Step `ci-operator/step-registry/abi/install/OWNERS`, `ci-operator/step-registry/abi/install/bmc/OWNERS`, `ci-operator/step-registry/abi/install/bmc/abi-install-bmc-ref.yaml`, `ci-operator/step-registry/abi/install/bmc/abi-install-bmc-ref.metadata.json`, `ci-operator/step-registry/abi/install/bmc/abi-install-bmc-commands.sh`	Defines the `abi-install-bmc` step which serves the agent ISO, establishes a Chisel reverse tunnel for BMC access, configures Redfish boot/virtual media, monitors installation phases, orchestrates Day-1.5 scaling, and applies Day-2 customization scripts.
Workflow Orchestration `ci-operator/step-registry/gs-baremetal/OWNERS`, `ci-operator/step-registry/gs-baremetal/README.md`, `ci-operator/step-registry/gs-baremetal/agent-install/OWNERS`, `ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.yaml`, `ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.metadata.json`	Defines the `gs-baremetal-agent-install` workflow that chains `abi-conf-bm` and `abi-install-bmc` steps for the `metal-redhat-gs` cluster profile and documents the contract for external test chaining.
Test Job & Reporting `ci-operator/config/RedHatQE/interop-testing/.config.prowgen`, `ci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__abi-test.yaml`, `ci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__ciOpEmul--imgRetr.yaml`	Adds scheduled `abi-test` jobs (`abi-test-conf-bm` and `abi-test-install-bmc` with 1h and 3h timeouts), a related image-retrieval test, and updates prowgen Slack reporting to include `abi-test` in `#forum-p2p-cspi`, `#cnv-release-4-20-z`, and `#cnv-release-4-21-z` channels.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (11 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names	✅ Passed	PR contains no Ginkgo test definitions. Test/step names present (abi-test-conf-bm, abi-test-install-bmc, img-retr) are static and deterministic without dynamic information.
Test Structure And Quality	✅ Passed	No Ginkgo test code in PR. Check is not applicable—PR contains CI configs and Bash scripts only, no Go tests.
Microshift Test Compatibility	✅ Passed	No Ginkgo e2e tests are added in this PR. It adds only CI infrastructure (step-registry scripts, YAML configs, docs). The MicroShift compatibility check applies only when e2e tests are added.
Single Node Openshift (Sno) Test Compatibility	✅ Passed	No Ginkgo e2e tests are present in this PR. Changes are CI configuration, step-registry infrastructure, scripts, and documentation only. SNO compatibility check not applicable.
Topology-Aware Scheduling Compatibility	✅ Passed	PR adds CI step-registry and testing infrastructure only. No production Kubernetes manifests with scheduling constraints (affinity, topology spread, PDB) are introduced. Check is not applicable.
Ote Binary Stdout Contract	✅ Passed	Not applicable. PR adds Bash scripts, YAML/JSON configs, and docs—no Go OTE binaries subject to stdout contract.
Ipv6 And Disconnected Network Test Compatibility	✅ Passed	This PR adds no Ginkgo e2e tests. All files are CI/CD infrastructure (YAML configs, bash scripts, markdown docs, JSON metadata). The custom check is not applicable.
Title check	✅ Passed	The title accurately describes the main changes: introducing Agent-Based Install (ABI) steps, workflow, and reference configuration for the gs-baremetal cluster profile.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.


coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (1)

ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.yaml (1)
15-15: ⚡ Quick win

Use a latest docs URL to avoid stale workflow guidance.

Line 15 pins documentation to OCP 4.12, which can drift from current installer behavior. Prefer the .../latest/... docs path used elsewhere in this PR.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.yaml`
at line 15, Update the pinned Red Hat docs URL in the
gs-baremetal-agent-install-workflow.yaml comment so it uses the version-agnostic
path; specifically replace the segment "4.12" in the URL
"https://docs.redhat.com/en/documentation/openshift_container_platform/4.12/html/installing_an_on-premise_cluster_with_the_agent-based-installer/preparing-to-install-with-the-agent-based-installer"
with "latest" so the comment references the .../latest/... docs path and won’t
drift from current installer behavior.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__abi-test.yaml`:
- Line 25: The cron entry "0 23 31 2 *" is invalid (Feb 31) and prevents the job
from ever running; locate both occurrences of that exact string in the YAML (the
cron keys present in the ABI test job) and replace each with a valid cron
expression that matches the intended cadence (for example, use a valid day in
February like the 28th or choose a monthly/last-day-safe schedule that covers
February correctly); ensure both instances are updated consistently so the job
actually runs.

In `@ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh`:
- Around line 16-21: The scripts fetched via curl in the eval blocks
(BuildCustomScriptsFromYAML.sh and EnsureReqs.sh) currently reference the "main"
branch; change both URLs to use immutable commit SHAs instead (replace "main" in
the two curl URLs with the specific commit hash for each script), update any
other instances of the same pattern (e.g., in abi-install-bmc-commands.sh and
other step-registry scripts), and verify the fetched versions still work with
the functions invoked (EnsureReqs yq and whatever BuildCustomScriptsFromYAML
provides) before committing.

In `@ci-operator/step-registry/abi/install/bmc/abi-install-bmc-commands.sh`:
- Around line 399-408: The current vendor-prep case forces a hard-fail for
non-Dell BMCs by using "(* ) false;;" which aborts the install; update the case
for bmcVend so only the Dell branch runs the PATCH and all other vendors simply
no-op and continue (replace the default "(* ) false;;" with a no-op branch like
"(* ) ;;" or an explicit comment/no-op), keeping the RedfishAPIcall invocation
and variables (bmcInfo, bmcURL, bmcMgrId, bmcVend) unchanged.
- Around line 18-26: The step directly evals remote scripts from GitHub main
(BuildCustomScriptsFromYAML.sh, Vault--BitWarden--UploadAttachment.sh,
EnsureReqs.sh) which is unsafe; pin or vendor them and stop executing main at
runtime. Fetch and commit the exact versions (or reference a specific commit
tag) of those three scripts into the repo (or update CI to download by SHA and
verify checksum), then change the calls that use eval "$(curl ...)" to source
the local vendored files (or curl the pinned raw URL and verify checksum before
sourcing). Also update the EnsureReqs invocation (EnsureReqs yq chisel) to call
the vendored EnsureReqs implementation so runtime behavior is unchanged but
reproducible and auditable.
- Around line 67-90: The RedfishAPIcall function always returns success because
of the trailing "true"; remove that final "true" so curl failures propagate and
callers (e.g., the boot-override PATCH fallback logic) can detect errors and run
the fallback path; locate the function named RedfishAPIcall and delete the
concluding "true" so the function returns the actual exit status from curl.

In `@ci-operator/step-registry/abi/README.md`:
- Around line 6-11: The markdown table in the README (the block starting with
the header row "| Step                | Reference (source of truth)..." )
violates MD058; add a single blank line immediately before the table and a
single blank line immediately after the table so it is separated from
surrounding paragraphs, then re-run the linter to verify MD058 is resolved.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 2635734a-3225-4d33-8956-d96852617a62

📥 Commits

Reviewing files that changed from the base of the PR and between 78291f3 and c4c67fa.

⛔ Files ignored due to path filters (2)

ci-operator/jobs/RedHatQE/interop-testing/RedHatQE-interop-testing-master-periodics.yaml is excluded by !ci-operator/jobs/**
ci-operator/jobs/RedHatQE/interop-testing/RedHatQE-interop-testing-master-presubmits.yaml is excluded by !ci-operator/jobs/**

📒 Files selected for processing (20)

ci-operator/config/RedHatQE/interop-testing/.config.prowgen
ci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__abi-test.yaml
ci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__ciOpEmul--imgRetr.yaml
ci-operator/step-registry/abi/OWNERS
ci-operator/step-registry/abi/README.md
ci-operator/step-registry/abi/conf/OWNERS
ci-operator/step-registry/abi/conf/bm/OWNERS
ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh
ci-operator/step-registry/abi/conf/bm/abi-conf-bm-ref.metadata.json
ci-operator/step-registry/abi/conf/bm/abi-conf-bm-ref.yaml
ci-operator/step-registry/abi/install/OWNERS
ci-operator/step-registry/abi/install/bmc/OWNERS
ci-operator/step-registry/abi/install/bmc/abi-install-bmc-commands.sh
ci-operator/step-registry/abi/install/bmc/abi-install-bmc-ref.metadata.json
ci-operator/step-registry/abi/install/bmc/abi-install-bmc-ref.yaml
ci-operator/step-registry/gs-baremetal/OWNERS
ci-operator/step-registry/gs-baremetal/README.md
ci-operator/step-registry/gs-baremetal/agent-install/OWNERS
ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.metadata.json
ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.yaml

💤 Files with no reviewable changes (1)

ci-operator/step-registry/gs-baremetal/OWNERS

sg-rh · 2026-05-07T01:25:59Z

Supersedes #76365

openshift/release#76365 was closed after the GitHub PR page repeatedly failed to load, which blocked review and updates. This PR carries the same ABI bare-metal step-registry / workflow work forward.

sg-rh · 2026-05-07T01:38:32Z

/uncc @dfrazzette @dshchedr

sg-rh · 2026-05-07T01:51:27Z

@coderabbitai pause

coderabbitai · 2026-05-07T01:51:33Z

✅ Actions performed

Reviews paused.

openshift-merge-bot · 2026-05-13T15:59:38Z

@sg-rh, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

couldn't prepare candidate: couldn't rebase candidate onto 255a78bafd817794898efe6e5f3910ba11b5434f due to conflicts

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

openshift-merge-bot · 2026-05-13T16:34:45Z

@sg-rh, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

couldn't prepare candidate: couldn't rebase candidate onto 951f9ff028faad4d22c3d44433bf60eb936315a1 due to conflicts


etirta · 2026-05-17T04:55:44Z

Tests performed:

test--01-abi-conf-bm.log:
- Testing full ISO generation.
- Testing minimal ISO generation.
test--02-abi-install-bmc.log:
- Deploying IENG Goldman Sachs HW Cluster, with the following scenarios:
  - Node gs04, BMC wipe emulation:
    - 1st boot: emulating stuck at boot phase / failed boot.
    - 2nd boot: success.
  - Node gs05, boot into prev. OS on HDD:
    - 1st boot: SSH Key rejection (emulating prev. Cluster was deployed using different SSH Key).
    - 2nd boot: success.
  - Node gs06, boot into prev. OS on HDD:
    - 1st boot: Detect prev. OS.
    - 2nd boot: success.
  - Node gs02, gs03, gs01: No emulation, normal successful boot.

ci-artifacts.zip

openshift-ci · 2026-05-19T05:26:34Z

@sg-rh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/ci-operator-config-metadata	`3ceda05`	link	true	`/test ci-operator-config-metadata`


etirta · 2026-05-19T13:26:13Z

@coderabbitai resume

coderabbitai · 2026-05-19T13:26:23Z

✅ Actions performed

Reviews resumed.

amp-rh

Nice work on the step registry structure and documentation. The README, ref YAML env var docs, and architecture are all well done. A few questions and suggestions inline, mostly around debuggability of retry loops and failure-mode handling on shared bare-metal infrastructure.

amp-rh · 2026-05-19T19:29:46Z

+            WipeDisks "${tPID}" "${bmcInfo}" \
+                "${bmcURL}" "${bmcSysId}" "${bmcMgrId}" \
+                "${diskWipeMethod}" && break
+            ((--tryLeft))
+        done
+        # Restore Boot Order.
+        [ -z "${diskWipeMethod}" ] || {
+            RedfishAPIcall "${bmcInfo}" "${bmcURL}" PATCH \
+                "Systems/${bmcSysId}" \
+                -d '{"Boot": {
+                    "BootSourceOverrideEnabled": "Disabled",
+                    "BootSourceOverrideTarget": "None"
+                }}'
+        }


suggestion (debuggability): All three retry loops in this script (ISO URL probe at L407, bootstrap-complete wait here, install-complete wait at L562) share the same pattern where exhausted retries produce a cryptic failure.

When tryLeft decrements to 0, ((--tryLeft)) evaluates to ((0)) which returns exit code 1 under set -e, killing the subshell. The CI log will show something like:

abi-install-bmc-commands.sh: line 525: ((: --tryLeft: 0

...which doesn't tell the person triaging that bootstrap-complete exhausted N attempts.

A small guard after each while loop would make failures much more debuggable:

((tryLeft)) || { echo "FATAL: bootstrap-complete failed after ${OCP__ABI__WAIT__BOOTSTRAP__TRY} attempts" >&2; exit 1; }

(Same pattern recommended for the install-complete and ISO probe loops.)

As I mentioned in my response to @shakyav regarding similar concerns, it is actually quite clear already:

etirtara@14c692f90aab.4 2026-05-19 w21 19:54:37 "/wrk--edttjRedHat.RHP-k8s--dev/local/repo/CSPI-QE/ocp-ci-docs"
$ bash -exc "$(cat - 0<<'EOF'
typeset -i tryLeft=2
while ((tryLeft)); do
    : foo
    ((--tryLeft))
done
: success
true
EOF
)"
+ typeset -i tryLeft=2
+ (( tryLeft ))
+ : foo
+ (( --tryLeft ))
+ (( tryLeft ))
+ : foo
+ (( --tryLeft ))

The whole purpose of having both xtrace and errexit enabled is to reduce the need for instrumenting explicit debug information in the code.
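For readers comparing the two positions in this thread, the guard variant could be sketched as below. RetryOrDie is a hypothetical helper, not code from this PR; it relies on the same bash behaviour discussed above, namely that ((--tryLeft)) returns a non-zero status once the counter reaches 0.

```shell
# Hypothetical helper: run a command up to N times and emit an explicit
# message when every attempt has failed.
function RetryOrDie () {
    typeset -i tryMax="$1" tryLeft="$1"
    shift
    while ((tryLeft)); do
        "$@" && return 0
        # Decrementing to 0 makes ((...)) return 1; `|| true` keeps
        # errexit from firing before the message below is printed.
        ((--tryLeft)) || true
    done
    echo "FATAL: $* failed after ${tryMax} attempts" >&2
    return 1
}

# Possible usage:
#   RetryOrDie "${OCP__ABI__WAIT__BOOTSTRAP__TRY}" \
#       openshift-install agent wait-for bootstrap-complete
```

The trade-off is the one debated here: the xtrace output already shows where the loop stopped, while the explicit message spares the reader from counting iterations in the log.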

amp-rh · 2026-05-19T19:30:21Z

+    ' 0< "${bmcInfo}")
+} |& tee "${ARTIFACT_DIR}/ocp--installer--bmc.log") & taskPIDs+=($!)
+# Wait for BootStrap Node to finish.
+(
+    typeset -i tryLeft="${OCP__ABI__WAIT__BOOTSTRAP__TRY}"
+    while ((tryLeft)); do
+        openshift-install agent wait-for bootstrap-complete && break
+        ((--tryLeft))
+    done
+)
+cp -f "${OCP__ABI__CLUSTER_DIR}/auth/kubeconfig" "${SHARED_DIR}/kubeconfig-minimal"
+
+# Day-1.5 Phase.
+(
+    typeset cfgKey='' cfgVal=''
+    export KUBECONFIG="${OCP__ABI__CLUSTER_DIR}/auth/kubeconfig"
+    while IFS=$'\t' read -r cfgKey cfgVal; do
+        case ${cfgKey} in
+          (NodeProv)
+            [ "${cfgVal}" = false ] && {
+                # Workers are provisioned by ABI. No
+                #   BareMetalHost CRDs or Ironic
+                #   provisioning network.
+                while true; do
+                    oc -n openshift-machine-api \
+                        scale MachineSets \
+                        --replicas 0 --all \
+                    && break || sleep 60
+                done
+            }
+            ;;
+        esac
+    done 0< <(


question: The Day-1.5 subshell runs in the background (& taskPIDs+=($!)) and performs oc scale MachineSets --replicas 0 --all when NodeProv: false. If this fails, the install proceeds through wait-for install-complete, Day-2 scripts, and reports success, but the cluster has both ABI-provisioned workers and an active MachineSet controller potentially fighting over them.

The EXIT trap kills/waits background tasks but doesn't propagate their exit codes. Is this intentional? If Day-1.5 failure should be fatal, a wait + exit-code check between install-complete and the virtual media eject block would catch it:

wait "${day15_pid}" || { echo "FATAL: Day-1.5 phase failed" >&2; exit 1; }

If oc scale MachineSets --replicas 0 --all fails, one of the Cluster Operators (either baremetal, control-plane-machine-set, or both) will fail to become Ready. Consequently, agent wait-for install-complete will eventually time out.

The purpose of the Day-1.5 phase is to perform the necessary adjustments to prevent agent wait-for install-complete from timing out when we lack the full infrastructure required for an ABI-based OCP deployment. That being said, configuring the manifest during the Day-1 phase remains the preferred approach. If we acquire more information down the road on how to properly configure the manifest properties, we can migrate to the Day-1 method and deprecate the Day-1.5 phase entirely.

The broader architectural question is whether we should fail fast and exit with an error during Day-1.5. It is a good idea in theory, and merely propagating the exit code is easy. The real challenge is propagating the failure early enough to intercept and cancel the ongoing agent wait-for install-complete command, and we currently have no mechanism for either the interception or the cancellation.
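The simple exit-code propagation mentioned in this thread (not the early cancellation, which remains unsolved) might look like the sketch below. WaitBgTask is a hypothetical helper name; the PID must belong to a child of the calling shell.

```shell
# Hypothetical helper: wait for one background task and surface its
# exit status with a readable message.
function WaitBgTask () {
    typeset name="$1" pid="$2"
    typeset -i es=0
    wait "${pid}" || es=$?
    ((es == 0)) || echo "FATAL: ${name} failed (exit ${es})" >&2
    return ${es}
}

# Possible usage after install-complete (day15PID is a placeholder for
# the PID captured when the Day-1.5 subshell was started):
#   WaitBgTask "Day-1.5 phase" "${day15PID}" || exit 1
```

This only reports the failure after wait-for install-complete has returned, so it does not shorten a run that is already doomed.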

amp-rh · 2026-05-19T19:30:45Z

+    pre:
+      - chain: abi-chains-bm--bmc
+      - chain: cucushift-installer-check-cluster-health
+    post: []


question: post: [] means that if the job fails during bootstrap wait or install-complete, there's no cleanup step to eject virtual media or restore boot order on the BMC hosts. The eject loop in abi-install-bmc-commands.sh (L573-580) and the boot-order restore (L503-510) only execute on the happy path.

On shared bare-metal infrastructure, this could leave ISOs mounted and boot order set to CD, potentially affecting the next job that uses these hosts. Is there a plan for the teardown step mentioned in the PR description ("teardown TBD"), or is BMC state recovery handled externally?

We configure an EXIT trap in the abi-install-bmc step so that when the step script terminates, it tears down both the secure tunnel and the HTTP server, rendering the virtual CD (VCD) ISO mount invalid. Additionally, a safeguard is executed at the very beginning to eject any existing VCD as the first action.

Furthermore, since this runs on bare metal within our own lab infrastructure, a strict teardown is mostly moot. In fact, skipping the teardown could be beneficial for post-mortem debugging if needed.
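The EXIT-trap teardown described in this reply can be illustrated with a small sketch. It is not the trap from abi-install-bmc-commands.sh: StopBackgroundTasks is a hypothetical name, and taskPIDs simply mirrors the array name visible in the diff quoted earlier in this thread.

```shell
# Hypothetical sketch: stop every recorded background task (for example
# the tunnel client and the HTTP server) when the script exits.
typeset -a taskPIDs=()

function StopBackgroundTasks () {
    typeset pid=''
    for pid in "${taskPIDs[@]}"; do
        kill "${pid}" 2>/dev/null || true   # an already-exited task is fine
        wait "${pid}" 2>/dev/null || true   # reap it; its status is ignored
    done
    taskPIDs=()
}
trap StopBackgroundTasks EXIT

# Possible usage: record each background task as it starts.
#   some-long-running-server & taskPIDs+=($!)
```

Once the server behind the ISO URL is gone, the BMC may still list the virtual media as inserted, which is why the reply above also mentions the eject performed at the start of the next run.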

amp-rh · 2026-05-19T19:31:09Z

+function openshift-install () {
+    typeset -i es=0
+    {
+        echo \
+"$(date -Iseconds)|${FUNCNAME[0]@Q} ${*@Q}"$'\n'"$(printf '%.0s-' {1..80})"
+        command openshift-install \
+            --dir "${OCP__ABI__CLUSTER_DIR}/" \
+            --log-level "${OCP__ABI__INSTLR_LOG_LEVEL}" \
+            "$@" 2>&1 || es=$?
+        echo "$(printf '%.0s=' {1..80})"
+        exit ${es}
+    } | tee -a "${ARTIFACT_DIR}/ocp--installer--cluster.log"
+    return ${PIPESTATUS[0]}
+}


nit: This openshift-install wrapper function is identical in both scripts (conf L35-48, install L50-62). If the logging format or argument handling needs updating, it must be changed in two places. Given that both scripts already source shared libraries from OpenShift-LP-QE--Tools, this could live there as well, or be extracted to a small helper that both scripts source from the step registry.

That is a valid point. My concern is that a different installation process might not always share or require these exact same CLI parameters. While being called more than once is a good enough reason to create a function, moving it into an external library requires a much broader use case.

Introducing an external library carries a higher cost and increases our dependency burden; it needs stronger justification if it is only going to be used by two Step scripts.

I am not opposing the idea right off the bat, as I generally prefer to do things right from the get-go, but unfortunately, time is not on our side for this particular PR. Let's keep this idea in mind for when we start developing more Conf and Install steps for other targets in the future.

Thanks for the great suggestion!

openshift-merge-bot · 2026-05-19T21:12:17Z

[REHEARSALNOTIFIER]
@sg-rh: no rehearsable tests are affected by this change

Note: If this PR includes changes to step registry files (ci-operator/step-registry/) and you expected jobs to be found, try rebasing your PR onto the base branch. This helps pj-rehearse accurately detect changes when the base branch has moved forward.


shakyav · 2026-05-19T22:42:52Z

/lgtm

openshift-ci · 2026-05-19T22:46:27Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sg-rh, shakyav

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci Bot requested review from dfrazzette and dshchedr May 7, 2026 00:07

openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2026

coderabbitai Bot reviewed May 7, 2026

openshift-ci Bot removed request for dfrazzette and dshchedr May 7, 2026 01:38

openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 13, 2026

sg-rh force-pushed the abi-ocp-bm branch from 56fedbc to da64291 Compare May 13, 2026 16:33

sg-rh force-pushed the abi-ocp-bm branch from da64291 to 94e1a31 Compare May 13, 2026 17:07

openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 13, 2026

etirta force-pushed the abi-ocp-bm branch 10 times, most recently from ac15872 to 7bc41c1 Compare May 16, 2026 18:30

openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label May 18, 2026

sg-rh force-pushed the abi-ocp-bm branch from 05ff5a5 to b43e1a7 Compare May 18, 2026 23:08

sg-rh changed the title ~~Create gs-baremetal Agent-Based Install workflow and scripts~~ Create GS baremetal Agent-Based Install steps, chain, and sample health workflow May 18, 2026

sg-rh force-pushed the abi-ocp-bm branch 3 times, most recently from 3df3552 to 6a63324 Compare May 19, 2026 05:20

oharan2 reviewed May 19, 2026

Comment thread ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh

oharan2 reviewed May 19, 2026

Comment thread ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh

oharan2 suggested changes May 19, 2026

Comment thread ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh

openshift-ci Bot assigned oharan2 May 19, 2026

oharan2 reviewed May 19, 2026

Comment thread ci-operator/step-registry/abi/README.md

sg-rh changed the title ~~Create GS baremetal Agent-Based Install steps, chain, and sample health workflow~~ Create GS baremetal Agent-Based Install steps, chain, and reference health workflow May 19, 2026

shakyav reviewed May 19, 2026

Comment thread ci-operator/step-registry/abi/install/bmc/abi-install-bmc-commands.sh

shakyav reviewed May 19, 2026

Comment thread ci-operator/step-registry/abi/install/bmc/abi-install-bmc-commands.sh

shakyav reviewed May 19, 2026

Comment thread ci-operator/step-registry/abi/install/bmc/abi-install-bmc-commands.sh

amp-rh reviewed May 19, 2026

Create bare-metal BMC install steps, chain, and sample health workflow

4c7dbff

sg-rh force-pushed the abi-ocp-bm branch from 6a63324 to 4c7dbff Compare May 19, 2026 21:09

openshift-ci Bot assigned shakyav May 19, 2026

openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 19, 2026

openshift-merge-bot Bot merged commit c9c6271 into openshift:main May 19, 2026
11 checks passed

coderabbitai Bot mentioned this pull request May 20, 2026

[WIP] Add GS bare-metal ABI and localnet jobs for OCP 4.20 #79496

Open

wgahnagl pushed a commit to wgahnagl/release that referenced this pull request May 20, 2026

Create bare-metal BMC install steps, chain, and sample health workflow (openshift#78959)

6f2aa86

coderabbitai Bot commented May 7, 2026 (edited)

❌ Failed checks (1 warning)

coderabbitai Bot left a comment

sg-rh commented May 7, 2026

Supersedes #76365

sg-rh commented May 7, 2026

sg-rh commented May 7, 2026

coderabbitai Bot commented May 7, 2026

openshift-merge-bot Bot commented May 13, 2026

openshift-merge-bot Bot commented May 13, 2026

etirta commented May 17, 2026

openshift-ci Bot commented May 19, 2026 (edited)

etirta commented May 19, 2026

coderabbitai Bot commented May 19, 2026

amp-rh left a comment

openshift-merge-bot Bot commented May 19, 2026

shakyav commented May 19, 2026

openshift-ci Bot commented May 19, 2026
