Create GS baremetal Agent-Based Install steps, chain, and reference health workflow#78959
Conversation
WalkthroughThis PR adds Agent-Based Installer (ABI) bare-metal testing infrastructure to RedHatQE interop-testing CI. It defines two new ABI steps ( ChangesABI Step Registry and Workflow Infrastructure
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes 🚥 Pre-merge checks | ✅ 11 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (1)
ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.yaml (1)
15-15: ⚡ Quick winUse a
latestdocs URL to avoid stale workflow guidance.Line 15 pins documentation to OCP 4.12, which can drift from current installer behavior. Prefer the
.../latest/...docs path used elsewhere in this PR.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.yaml` at line 15, Update the pinned Red Hat docs URL in the gs-baremetal-agent-install-workflow.yaml comment so it uses the version-agnostic path; specifically replace the segment "4.12" in the URL "https://docs.redhat.com/en/documentation/openshift_container_platform/4.12/html/installing_an_on-premise_cluster_with_the_agent-based-installer/preparing-to-install-with-the-agent-based-installer" with "latest" so the comment references the .../latest/... docs path and won’t drift from current installer behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@ci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__abi-test.yaml`:
- Line 25: The cron entry "0 23 31 2 *" is invalid (Feb 31) and prevents the job
from ever running; locate both occurrences of that exact string in the YAML (the
cron keys present in the ABI test job) and replace each with a valid cron
expression that matches the intended cadence (for example, use a valid day in
February like the 28th or choose a monthly/last-day-safe schedule that covers
February correctly); ensure both instances are updated consistently so the job
actually runs.
In `@ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh`:
- Around line 16-21: The scripts fetched via curl in the eval blocks
(BuildCustomScriptsFromYAML.sh and EnsureReqs.sh) currently reference the "main"
branch; change both URLs to use immutable commit SHAs instead (replace "main" in
the two curl URLs with the specific commit hash for each script), update any
other instances of the same pattern (e.g., in abi-install-bmc-commands.sh and
other step-registry scripts), and verify the fetched versions still work with
the functions invoked (EnsureReqs yq and whatever BuildCustomScriptsFromYAML
provides) before committing.
In `@ci-operator/step-registry/abi/install/bmc/abi-install-bmc-commands.sh`:
- Around line 399-408: The current vendor-prep case forces a hard-fail for
non-Dell BMCs by using "(* ) false;;" which aborts the install; update the case
for bmcVend so only the Dell branch runs the PATCH and all other vendors simply
no-op and continue (replace the default "(* ) false;;" with a no-op branch like
"(* ) ;;" or an explicit comment/no-op), keeping the RedfishAPIcall invocation
and variables (bmcInfo, bmcURL, bmcMgrId, bmcVend) unchanged.
- Around line 18-26: The step directly evals remote scripts from GitHub main
(BuildCustomScriptsFromYAML.sh, Vault--BitWarden--UploadAttachment.sh,
EnsureReqs.sh) which is unsafe; pin or vendor them and stop executing main at
runtime. Fetch and commit the exact versions (or reference a specific commit
tag) of those three scripts into the repo (or update CI to download by SHA and
verify checksum), then change the calls that use eval "$(curl ...)" to source
the local vendored files (or curl the pinned raw URL and verify checksum before
sourcing). Also update the EnsureReqs invocation (EnsureReqs yq chisel) to call
the vendored EnsureReqs implementation so runtime behavior is unchanged but
reproducible and auditable.
- Around line 67-90: The RedfishAPIcall function always returns success because
of the trailing "true"; remove that final "true" so curl failures propagate and
callers (e.g., the boot-override PATCH fallback logic) can detect errors and run
the fallback path; locate the function named RedfishAPIcall and delete the
concluding "true" so the function returns the actual exit status from curl.
In `@ci-operator/step-registry/abi/README.md`:
- Around line 6-11: The markdown table in the README (the block starting with
the header row "| Step | Reference (source of truth)..." )
violates MD058; add a single blank line immediately before the table and a
single blank line immediately after the table so it is separated from
surrounding paragraphs, then re-run the linter to verify MD058 is resolved.
---
Nitpick comments:
In
`@ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.yaml`:
- Line 15: Update the pinned Red Hat docs URL in the
gs-baremetal-agent-install-workflow.yaml comment so it uses the version-agnostic
path; specifically replace the segment "4.12" in the URL
"https://docs.redhat.com/en/documentation/openshift_container_platform/4.12/html/installing_an_on-premise_cluster_with_the_agent-based-installer/preparing-to-install-with-the-agent-based-installer"
with "latest" so the comment references the .../latest/... docs path and won’t
drift from current installer behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 2635734a-3225-4d33-8956-d96852617a62
⛔ Files ignored due to path filters (2)
ci-operator/jobs/RedHatQE/interop-testing/RedHatQE-interop-testing-master-periodics.yamlis excluded by!ci-operator/jobs/**ci-operator/jobs/RedHatQE/interop-testing/RedHatQE-interop-testing-master-presubmits.yamlis excluded by!ci-operator/jobs/**
📒 Files selected for processing (20)
ci-operator/config/RedHatQE/interop-testing/.config.prowgenci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__abi-test.yamlci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__ciOpEmul--imgRetr.yamlci-operator/step-registry/abi/OWNERSci-operator/step-registry/abi/README.mdci-operator/step-registry/abi/conf/OWNERSci-operator/step-registry/abi/conf/bm/OWNERSci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.shci-operator/step-registry/abi/conf/bm/abi-conf-bm-ref.metadata.jsonci-operator/step-registry/abi/conf/bm/abi-conf-bm-ref.yamlci-operator/step-registry/abi/install/OWNERSci-operator/step-registry/abi/install/bmc/OWNERSci-operator/step-registry/abi/install/bmc/abi-install-bmc-commands.shci-operator/step-registry/abi/install/bmc/abi-install-bmc-ref.metadata.jsonci-operator/step-registry/abi/install/bmc/abi-install-bmc-ref.yamlci-operator/step-registry/gs-baremetal/OWNERSci-operator/step-registry/gs-baremetal/README.mdci-operator/step-registry/gs-baremetal/agent-install/OWNERSci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.metadata.jsonci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.yaml
💤 Files with no reviewable changes (1)
- ci-operator/step-registry/gs-baremetal/OWNERS
Supersedes #76365openshift/release#76365 was closed after the GitHub PR page repeatedly failed to load, which blocked review and updates. This PR carries the same ABI bare-metal step-registry / workflow work forward. |
|
/uncc @dfrazzette @dshchedr |
|
@coderabbitai pause |
✅ Actions performedReviews paused. |
|
@sg-rh, Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
|
@sg-rh, Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
ac15872 to
7bc41c1
Compare
|
Test performed:
|
3df3552 to
6a63324
Compare
|
@sg-rh: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
@coderabbitai resume |
✅ Actions performedReviews resumed. |
amp-rh
left a comment
There was a problem hiding this comment.
Nice work on the step registry structure and documentation. The README, ref YAML env var docs, and architecture are all well done. A few questions and suggestions inline, mostly around debuggability of retry loops and failure-mode handling on shared bare-metal infrastructure.
| WipeDisks "${tPID}" "${bmcInfo}" \ | ||
| "${bmcURL}" "${bmcSysId}" "${bmcMgrId}" \ | ||
| "${diskWipeMethod}" && break | ||
| ((--tryLeft)) | ||
| done | ||
| # Restore Boot Order. | ||
| [ -z "${diskWipeMethod}" ] || { | ||
| RedfishAPIcall "${bmcInfo}" "${bmcURL}" PATCH \ | ||
| "Systems/${bmcSysId}" \ | ||
| -d '{"Boot": { | ||
| "BootSourceOverrideEnabled": "Disabled", | ||
| "BootSourceOverrideTarget": "None" | ||
| }}' | ||
| } |
There was a problem hiding this comment.
suggestion (debuggability): All three retry loops in this script (ISO URL probe at L407, bootstrap-complete wait here, install-complete wait at L562) share the same pattern where exhausted retries produce a cryptic failure.
When tryLeft decrements to 0, ((--tryLeft)) evaluates to ((0)) which returns exit code 1 under set -e, killing the subshell. The CI log will show something like:
abi-install-bmc-commands.sh: line 525: ((: --tryLeft: 0
...which doesn't tell the person triaging that bootstrap-complete exhausted N attempts.
A small guard after each while loop would make failures much more debuggable:
((tryLeft)) || { echo "FATAL: bootstrap-complete failed after ${OCP__ABI__WAIT__BOOTSTRAP__TRY} attempts" >&2; exit 1; }(Same pattern recommended for the install-complete and ISO probe loops.)
There was a problem hiding this comment.
As I mentioned in my response to @shakyav regarding similar concerns, it is actually quite clear already:
etirtara@14c692f90aab.4 2026-05-19 w21 19:54:37 "/wrk--edttjRedHat.RHP-k8s--dev/local/repo/CSPI-QE/ocp-ci-docs"
$ bash -exc "$(cat - 0<<'EOF'
typeset -i tryLeft=2
while ((tryLeft)); do
: foo
((--tryLeft))
done
: success
true
EOF
)"
+ typeset -i tryLeft=2
+ (( tryLeft ))
+ : foo
+ (( --tryLeft ))
+ (( tryLeft ))
+ : foo
+ (( --tryLeft ))The whole purpose of having both xtrace and errexit enabled is to reduce the need for instrumenting explicit debug information in the code.
| ' 0< "${bmcInfo}") | ||
| } |& tee "${ARTIFACT_DIR}/ocp--installer--bmc.log") & taskPIDs+=($!) | ||
| # Wait for BootStrap Node to finish. | ||
| ( | ||
| typeset -i tryLeft="${OCP__ABI__WAIT__BOOTSTRAP__TRY}" | ||
| while ((tryLeft)); do | ||
| openshift-install agent wait-for bootstrap-complete && break | ||
| ((--tryLeft)) | ||
| done | ||
| ) | ||
| cp -f "${OCP__ABI__CLUSTER_DIR}/auth/kubeconfig" "${SHARED_DIR}/kubeconfig-minimal" | ||
|
|
||
| # Day-1.5 Phase. | ||
| ( | ||
| typeset cfgKey='' cfgVal='' | ||
| export KUBECONFIG="${OCP__ABI__CLUSTER_DIR}/auth/kubeconfig" | ||
| while IFS=$'\t' read -r cfgKey cfgVal; do | ||
| case ${cfgKey} in | ||
| (NodeProv) | ||
| [ "${cfgVal}" = false ] && { | ||
| # Workers are provisioned by ABI. No | ||
| # BareMetalHost CRDs or Ironic | ||
| # provisioning network. | ||
| while true; do | ||
| oc -n openshift-machine-api \ | ||
| scale MachineSets \ | ||
| --replicas 0 --all \ | ||
| && break || sleep 60 | ||
| done | ||
| } | ||
| ;; | ||
| esac | ||
| done 0< <( |
There was a problem hiding this comment.
question: The Day-1.5 subshell runs in the background (& taskPIDs+=($!)) and performs oc scale MachineSets --replicas 0 --all when NodeProv: false. If this fails, the install proceeds through wait-for install-complete, Day-2 scripts, and reports success, but the cluster has both ABI-provisioned workers and an active MachineSet controller potentially fighting over them.
The EXIT trap kills/waits background tasks but doesn't propagate their exit codes. Is this intentional? If Day-1.5 failure should be fatal, a wait + exit-code check between install-complete and the virtual media eject block would catch it:
wait "${day15_pid}" || { echo "FATAL: Day-1.5 phase failed" >&2; exit 1; }There was a problem hiding this comment.
If oc scale MachineSets --replicas 0 --all fails, one of the Cluster Operators (either baremetal, control-plane-machine-set, or both) will fail to become Ready. Consequently, agent wait-for install-complete will eventually time out.
The purpose of the Day-1.5 phase is to perform the necessary adjustments to prevent agent wait-for install-complete from timing out when we lack the full infrastructure required for an ABI-based OCP deployment. That being said, configuring the manifest during the Day-1 phase remains the preferred approach. If we acquire more information down the road on how to properly configure the manifest properties, we can migrate to the Day-1 method and deprecate the Day-1.5 phase entirely.
The broader architectural question is: should we fail fast and exit with an error during Day-1.5? While this is a good idea in theory, a mere propagation is easy; the real challenge is propagating the failure early enough to intercept and cancel the ongoing agent wait-for install-complete command, neither of which we currently have a mechanism to do.
| pre: | ||
| - chain: abi-chains-bm--bmc | ||
| - chain: cucushift-installer-check-cluster-health | ||
| post: [] |
There was a problem hiding this comment.
question: post: [] means that if the job fails during bootstrap wait or install-complete, there's no cleanup step to eject virtual media or restore boot order on the BMC hosts. The eject loop in abi-install-bmc-commands.sh (L573-580) and the boot-order restore (L503-510) only execute on the happy path.
On shared bare-metal infrastructure, this could leave ISOs mounted and boot order set to CD, potentially affecting the next job that uses these hosts. Is there a plan for the teardown step mentioned in the PR description ("teardown TBD"), or is BMC state recovery handled externally?
There was a problem hiding this comment.
We configure an EXIT trap in the abi-install-bmc step so that when the step script terminates, it tears down both the secure tunnel and the HTTP server, rendering the virtual CD (VCD) ISO mount invalid. Additionally, a safeguard is executed at the very beginning to eject any existing VCD as the first action.
Furthermore, since this runs on bare metal within our own lab infrastructure, a strict teardown is mostly moot. In fact, skipping the teardown could be beneficial for post-mortem debugging if needed.
| function openshift-install () { | ||
| typeset -i es=0 | ||
| { | ||
| echo \ | ||
| "$(date -Iseconds)|${FUNCNAME[0]@Q} ${*@Q}"$'\n'"$(printf '%.0s-' {1..80})" | ||
| command openshift-install \ | ||
| --dir "${OCP__ABI__CLUSTER_DIR}/" \ | ||
| --log-level "${OCP__ABI__INSTLR_LOG_LEVEL}" \ | ||
| "$@" 2>&1 || es=$? | ||
| echo "$(printf '%.0s=' {1..80})" | ||
| exit ${es} | ||
| } | tee -a "${ARTIFACT_DIR}/ocp--installer--cluster.log" | ||
| return ${PIPESTATUS[0]} | ||
| } |
There was a problem hiding this comment.
nit: This openshift-install wrapper function is identical in both scripts (conf L35-48, install L50-62). If the logging format or argument handling needs updating, it must be changed in two places. Given that both scripts already source shared libraries from OpenShift-LP-QE--Tools, this could live there as well, or be extracted to a small helper that both scripts source from the step registry.
There was a problem hiding this comment.
That is a valid point. My concern is that a different installation process might not always share or require these exact same CLI parameters. While being called more than once is a good enough reason to create a function, moving it into an external library requires a much broader use case.
Introducing an external library carries a higher cost and increases our dependency burden; it needs stronger justification if it is only going to be used by two Step scripts.
I am not opposing the idea right off the bat, as I generally prefer to do things right from the get-go, but unfortunately, time is not on our side for this particular PR. Let's keep this idea in mind for when we start developing more Conf and Install steps for other targets in the future.
Thanks for the great suggestion!
|
[REHEARSALNOTIFIER] Note: If this PR includes changes to step registry files ( Interacting with pj-rehearseComment: Once you are satisfied with the results of the rehearsals, comment: |
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: sg-rh, shakyav The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Add ABI bare-metal BMC install steps, chain, and a reference workflow (step-registry)
Adds step-registry support for Agent-Based Installation (ABI) on bare metal without a bastion. Site-specific values (hosts, networks, BMC addresses, VIPs, tunnel ports) come from the cluster profile and env vars—not from hardcoded logic in the step scripts.
Out of scope for this PR:
ci-operator/configjob definitions (interop,gs-bm,metal-redhat-gs, localnet). Those land in follow-up PRs.Layout (step registry)
abi/conf/bm/abi-conf-bmabi/install/bmc/abi-install-bmcabi/chains/bm--bmc/abi-chains-bm--bmc(abi-conf-bm→abi-install-bmc)abi/workflows/bm--bmc--cluster-health/abi-workflows-bm--bmc--cluster-healthDocs:
ci-operator/step-registry/abi/README.md· Registry:abi-conf-bm,abi-install-bmcStep:
abi-conf-bm(configuration / Day-0 & Day-1)Role: Prepare everything abi-install-bmc needs before any ISO is built or any BMC is touched.
Base image:
ci/baremetal-qe-base:latest(includesnmstatectlfor NMState inagent-config).Cluster profile must provide:
pull-secret,ssh-publickey,cred--bmc--usr,cred--bmc--pwd, and${OCP__ABI__CFG_FN}(defaultocp--abi--cfg.yaml).What it does:
Day-0 — cluster configuration
install-config.yamlandagent-config.yamltemplate.OCP__ABI__CFG/${CLUSTER_PROFILE_DIR}/${OCP__ABI__CFG_FN}(UpdateCfg Day0: merge/replace YAML/JSON per README).OCP__ABI__DAY0_SCRIPTS_YAMLhooks (Scripts:list; schema in OpenShift-LP-QE--Tools).Day-1 — manifests before ISO
openshift-install agent create cluster-manifests.OCP__ABI__CFG(UpdateCfg Day1).OCP__ABI__DAY1_SCRIPTS_YAMLhooks.**BMC handoff **
ocp--bmc--info.json)..bmccredentials fromagent-config.yamlso secrets are not carried in the tarball.Pack for install step
ocpClusterInf.tgzin${SHARED_DIR}forabi-install-bmc.Key env vars (see
abi-conf-bm-ref.yamlfor full list):OCP__ABI__BM__CLS_NAME,OCP__ABI__BM__BASE_DOM,OCP__ABI__CFG_FN,OCP__ABI__DAY0_SCRIPTS_YAML,OCP__ABI__DAY1_SCRIPTS_YAML,OCP__ABI__MIN_ISO,OCP__ABI__TUN_SVC__DP_BASE_URL,OCP__ABI__TUN_SVC__DP_PORT.Step:
abi-install-bmc(install / BMC / Day-1.5 & Day-2)Role: Build and deliver the agent ISO, install the cluster via Redfish virtual media and a Chisel tunnel (no bastion), wait for the cluster to be ready, and publish artifacts.
Base image:
ci/baremetal-qe-base:latest.Cluster profile must provide:
ssh-privatekey,${OCP__ABI__CFG_FN}. Secrets: Chisel (/var/run/secrets/chisel), optional Bitwarden vault (BW__OBJ_NAME).What it does:
Workspace
${SHARED_DIR}/ocpClusterInf.tgzinto${OCP__ABI__CLUSTER_DIR}(must matchabi-conf-bm).ISO build & serve
openshift-install agent create image.OCP__ABI__TUN_SVC__DP_BASE_URL,OCP__ABI__TUN_SVC__DP_PORT, andOCP__ABI__TEAM_NAME(Chisel basic-auth files under/secret/chisel).BMC / Redfish (per host)
abi-install-bmc-commands.sh.Install waits
openshift-install agent wait-for bootstrap-completeandwait-for install-completewith configurable retries (OCP__ABI__WAIT__BOOTSTRAP__TRY,OCP__ABI__WAIT__CLUSTER__TRY,OCP__ABI__WAIT__NODE_READY__M).Day-1.5 / Day-2
OCP__ABI__CFG(e.g. Day-1.5 node provisioning flags).OCP__ABI__DAY2_SCRIPTS_YAMLvia BuildCustomScriptsFromYAML (OpenShift-LP-QE--Tools).Artifacts
kubeconfig,ocp.tgz, logs →${ARTIFACT_DIR}; handoff via${SHARED_DIR}.BW__OBJ_NAMEis set.Key env vars (see
abi-install-bmc-ref.yamlfor full list):OCP__ABI__TUN_SVC__CP_URL,OCP__ABI__TUN_SVC__DP_*,OCP__ABI__TEAM_NAME,OCP__ABI__WAIT__*,OCP__ABI__DAY2_SCRIPTS_YAML,BW__OBJ_NAME.Chain:
abi-chains-bm--bmcRuns in order:
abi-conf-bmabi-install-bmcReference workflow:
abi-workflows-bm--bmc--cluster-healthabi-chains-bm--bmc(install), thencucushift-installer-check-cluster-healthPost-install health check is part of setup (last step in
pre), not workflowtest, so follow-up jobs can setsteps.testto their own e2e while still inheriting install + health. Same pattern as install frameworks like ipi-aws.Example follow-up job (not in this PR):