Create GS baremetal Agent-Based Install steps, chain, and reference health workflow #78959

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from sg-rh:abi-ocp-bm
May 19, 2026

@sg-rh sg-rh commented May 7, 2026

Add ABI bare-metal BMC install steps, chain, and a reference workflow (step-registry)

Adds step-registry support for Agent-Based Installation (ABI) on bare metal without a bastion. Site-specific values (hosts, networks, BMC addresses, VIPs, tunnel ports) come from the cluster profile and env vars—not from hardcoded logic in the step scripts.

Out of scope for this PR: ci-operator/config job definitions (interop, gs-bm, metal-redhat-gs, localnet). Those land in follow-up PRs.

Layout (step registry)

| Path | Role |
| --- | --- |
| abi/conf/bm/ | Configuration step abi-conf-bm |
| abi/install/bmc/ | BMC install step abi-install-bmc |
| abi/chains/bm--bmc/ | Chain abi-chains-bm--bmc (abi-conf-bm → abi-install-bmc) |
| abi/workflows/bm--bmc--cluster-health/ | Reference workflow abi-workflows-bm--bmc--cluster-health |

Docs: ci-operator/step-registry/abi/README.md · Registry: abi-conf-bm, abi-install-bmc


Step: abi-conf-bm (configuration / Day-0 & Day-1)

Role: Prepare everything abi-install-bmc needs before any ISO is built or any BMC is touched.

Base image: ci/baremetal-qe-base:latest (includes nmstatectl for NMState in agent-config).

Cluster profile must provide: pull-secret, ssh-publickey, cred--bmc--usr, cred--bmc--pwd, and ${OCP__ABI__CFG_FN} (default ocp--abi--cfg.yaml).

What it does:

  1. Day-0 — cluster configuration

    • Create a minimal install-config.yaml and agent-config.yaml template.
    • Apply structured overrides from OCP__ABI__CFG / ${CLUSTER_PROFILE_DIR}/${OCP__ABI__CFG_FN} (UpdateCfg Day0: merge/replace YAML/JSON per README).
    • Run optional OCP__ABI__DAY0_SCRIPTS_YAML hooks (Scripts: list; schema in OpenShift-LP-QE--Tools).
  2. Day-1 — manifests before ISO

    • Run openshift-install agent create cluster-manifests.
    • Apply Day-1 overrides from OCP__ABI__CFG (UpdateCfg Day1).
    • Run optional OCP__ABI__DAY1_SCRIPTS_YAML hooks.
  3. BMC handoff

    • Build BMC metadata for the install step (e.g. ocp--bmc--info.json).
    • Strip .bmc credentials from agent-config.yaml so secrets are not carried in the tarball.
  4. Pack for install step

    • Pack the workspace into ocpClusterInf.tgz in ${SHARED_DIR} for abi-install-bmc.

Key env vars (see abi-conf-bm-ref.yaml for full list): OCP__ABI__BM__CLS_NAME, OCP__ABI__BM__BASE_DOM, OCP__ABI__CFG_FN, OCP__ABI__DAY0_SCRIPTS_YAML, OCP__ABI__DAY1_SCRIPTS_YAML, OCP__ABI__MIN_ISO, OCP__ABI__TUN_SVC__DP_BASE_URL, OCP__ABI__TUN_SVC__DP_PORT.
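
The pack-and-handoff described in item 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the step's actual code: the temporary directories stand in for the conf workspace, ${SHARED_DIR}, and ${OCP__ABI__CLUSTER_DIR}, and the file contents are placeholders.

```shell
#!/usr/bin/env bash
# Sketch of the conf -> install handoff. All paths are stand-ins created for
# the demo, not the real CI directories.
workDir="$(mktemp -d)"      # stands in for the conf step's workspace
SHARED_DIR="$(mktemp -d)"   # ci-operator normally provides this
clusterDir="$(mktemp -d)"   # stands in for ${OCP__ABI__CLUSTER_DIR}

# Placeholder content; the real files are generated by abi-conf-bm.
echo 'apiVersion: v1' > "${workDir}/install-config.yaml"
echo 'apiVersion: v1beta1' > "${workDir}/agent-config.yaml"

# Conf step: pack the whole workspace into one archive in the shared dir.
tar -C "${workDir}" -czf "${SHARED_DIR}/ocpClusterInf.tgz" .

# Install step: unpack the archive into the cluster directory.
tar -C "${clusterDir}" -xzf "${SHARED_DIR}/ocpClusterInf.tgz"
ls "${clusterDir}"
```

Packing relative to the workspace root (`-C … .`) keeps the archive free of absolute paths, so the install step can unpack it into whatever directory it uses.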


Step: abi-install-bmc (install / BMC / Day-1.5 & Day-2)

Role: Build and deliver the agent ISO, install the cluster via Redfish virtual media and a Chisel tunnel (no bastion), wait for the cluster to be ready, and publish artifacts.

Base image: ci/baremetal-qe-base:latest.

Cluster profile must provide: ssh-privatekey, ${OCP__ABI__CFG_FN}. Secrets: Chisel (/var/run/secrets/chisel), optional Bitwarden vault (BW__OBJ_NAME).

What it does:

  1. Workspace

    • Unpack ${SHARED_DIR}/ocpClusterInf.tgz into ${OCP__ABI__CLUSTER_DIR} (must match abi-conf-bm).
  2. ISO build & serve

    • openshift-install agent create image.
    • Serve the ISO with an HTTP server that supports Range requests (required for agent minimal ISO / partial fetch).
    • Expose the ISO on the Chisel data plane using OCP__ABI__TUN_SVC__DP_BASE_URL, OCP__ABI__TUN_SVC__DP_PORT, and OCP__ABI__TEAM_NAME (Chisel basic-auth files under /secret/chisel).
  3. BMC / Redfish (per host)

    • Use metadata from the conf step; vendor-specific Redfish handling in abi-install-bmc-commands.sh.
    • Virtual media insert, boot order, power on/off, retries, stuck-boot detection, ISO eject / cleanup where implemented.
  4. Install waits

    • openshift-install agent wait-for bootstrap-complete and wait-for install-complete with configurable retries (OCP__ABI__WAIT__BOOTSTRAP__TRY, OCP__ABI__WAIT__CLUSTER__TRY, OCP__ABI__WAIT__NODE_READY__M).
  5. Day-1.5 / Day-2

    • Optional actions from OCP__ABI__CFG (e.g. Day-1.5 node provisioning flags).
    • OCP__ABI__DAY2_SCRIPTS_YAML via BuildCustomScriptsFromYAML (OpenShift-LP-QE--Tools).
  6. Artifacts

    • Admin kubeconfig, ocp.tgz, logs → ${ARTIFACT_DIR}; handoff via ${SHARED_DIR}.
    • Optional upload of kubeconfig to Bitwarden when BW__OBJ_NAME is set.

Key env vars (see abi-install-bmc-ref.yaml for full list): OCP__ABI__TUN_SVC__CP_URL, OCP__ABI__TUN_SVC__DP_*, OCP__ABI__TEAM_NAME, OCP__ABI__WAIT__*, OCP__ABI__DAY2_SCRIPTS_YAML, BW__OBJ_NAME.
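
To make the per-host Redfish sequence in item 3 more concrete, here is a dry-run sketch that only prints the requests. It assumes the standard DMTF Redfish VirtualMedia.InsertMedia action and Boot override properties. The helper name redfish_dry_run, the resource IDs, and both URLs are invented for illustration, authentication is omitted, and real BMCs differ by vendor (which is why the step keeps vendor-specific handling).

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the Redfish requests instead of sending them.
# redfish_dry_run, the IDs ("1", "Cd"), and the URLs are hypothetical.
redfish_dry_run() {
    local bmcURL=$1 method=$2 path=$3 body=$4
    printf 'curl -k -X %s -H "Content-Type: application/json" -d %q %s\n' \
        "${method}" "${body}" "${bmcURL}/redfish/v1/${path}"
}

bmcURL='https://bmc.example.invalid'
isoURL='https://tunnel.example.invalid:8080/agent.x86_64.iso'

# 1. Attach the ISO as virtual media.
insertCmd="$(redfish_dry_run "${bmcURL}" POST \
    'Managers/1/VirtualMedia/Cd/Actions/VirtualMedia.InsertMedia' \
    "{\"Image\": \"${isoURL}\", \"Inserted\": true}")"

# 2. Ask the system to boot from the virtual CD on the next boot only.
bootCmd="$(redfish_dry_run "${bmcURL}" PATCH 'Systems/1' \
    '{"Boot": {"BootSourceOverrideEnabled": "Once", "BootSourceOverrideTarget": "Cd"}}')"

printf '%s\n%s\n' "${insertCmd}" "${bootCmd}"
```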


Chain: abi-chains-bm--bmc

Runs in order:

  1. abi-conf-bm
  2. abi-install-bmc

Reference workflow: abi-workflows-bm--bmc--cluster-health

| Phase | Content |
| --- | --- |
| pre | abi-chains-bm--bmc (install), then cucushift-installer-check-cluster-health |
| post | empty (teardown TBD) |
| test | not defined in workflow |

The post-install health check runs as the last step of pre rather than as the workflow's test, so follow-up jobs can set steps.test to their own e2e suite while still inheriting the install and the health check. This is the same pattern used by install frameworks such as ipi-aws.

Example follow-up job (not in this PR):

steps:
  cluster_profile: metal-redhat-gs
  env: { ... }
  test:
    - ref: your-interop-or-e2e-step
  workflow: abi-workflows-bm--bmc--cluster-health

coderabbitai Bot commented May 7, 2026

Walkthrough

This PR adds Agent-Based Installer (ABI) bare-metal testing infrastructure to RedHatQE interop-testing CI. It defines two new ABI steps (abi-conf-bm for cluster configuration and abi-install-bmc for installation), a workflow to orchestrate them, test job definitions, and updates to CI reporting configuration.

Changes

ABI Step Registry and Workflow Infrastructure

| Layer | File(s) | Summary |
| --- | --- | --- |
| Ownership & Documentation | ci-operator/step-registry/abi/OWNERS, ci-operator/step-registry/abi/README.md | Establishes ABI area ownership under cspi-qe-ocp-lp and documents Day-0/Day-1/Day-2 phases, configuration merge semantics, and script injection patterns. |
| Configuration Step | ci-operator/step-registry/abi/conf/OWNERS, ci-operator/step-registry/abi/conf/bm/OWNERS, ci-operator/step-registry/abi/conf/bm/abi-conf-bm-ref.yaml, ci-operator/step-registry/abi/conf/bm/abi-conf-bm-ref.metadata.json, ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh | Defines the abi-conf-bm step which generates install-config.yaml and agent-config.yaml, extracts BMC connectivity info, applies Day-0 overrides, enables minimal ISO mode when needed, and packages cluster manifests. |
| Installation Step | ci-operator/step-registry/abi/install/OWNERS, ci-operator/step-registry/abi/install/bmc/OWNERS, ci-operator/step-registry/abi/install/bmc/abi-install-bmc-ref.yaml, ci-operator/step-registry/abi/install/bmc/abi-install-bmc-ref.metadata.json, ci-operator/step-registry/abi/install/bmc/abi-install-bmc-commands.sh | Defines the abi-install-bmc step which serves the agent ISO, establishes a Chisel reverse tunnel for BMC access, configures Redfish boot/virtual media, monitors installation phases, orchestrates Day-1.5 scaling, and applies Day-2 customization scripts. |
| Workflow Orchestration | ci-operator/step-registry/gs-baremetal/OWNERS, ci-operator/step-registry/gs-baremetal/README.md, ci-operator/step-registry/gs-baremetal/agent-install/OWNERS, ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.yaml, ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.metadata.json | Defines the gs-baremetal-agent-install workflow that chains abi-conf-bm and abi-install-bmc steps for the metal-redhat-gs cluster profile and documents the contract for external test chaining. |
| Test Job & Reporting | ci-operator/config/RedHatQE/interop-testing/.config.prowgen, ci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__abi-test.yaml, ci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__ciOpEmul--imgRetr.yaml | Adds scheduled abi-test jobs (abi-test-conf-bm and abi-test-install-bmc with 1h and 3h timeouts), a related image-retrieval test, and updates prowgen Slack reporting to include abi-test in #forum-p2p-cspi, #cnv-release-4-20-z, and #cnv-release-4-21-z channels. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (11 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | PR contains no Ginkgo test definitions. Test/step names present (abi-test-conf-bm, abi-test-install-bmc, img-retr) are static and deterministic without dynamic information. |
| Test Structure And Quality | ✅ Passed | No Ginkgo test code in PR. Check is not applicable: PR contains CI configs and Bash scripts only, no Go tests. |
| Microshift Test Compatibility | ✅ Passed | No Ginkgo e2e tests are added in this PR. It adds only CI infrastructure (step-registry scripts, YAML configs, docs). The MicroShift compatibility check applies only when e2e tests are added. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No Ginkgo e2e tests are present in this PR. Changes are CI configuration, step-registry infrastructure, scripts, and documentation only. SNO compatibility check not applicable. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR adds CI step-registry and testing infrastructure only. No production Kubernetes manifests with scheduling constraints (affinity, topology spread, PDB) are introduced. Check is not applicable. |
| Ote Binary Stdout Contract | ✅ Passed | Not applicable. PR adds Bash scripts, YAML/JSON configs, and docs; no Go OTE binaries subject to stdout contract. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | This PR adds no Ginkgo e2e tests. All files are CI/CD infrastructure (YAML configs, bash scripts, markdown docs, JSON metadata). The custom check is not applicable. |
| Title check | ✅ Passed | The title accurately describes the main changes: introducing Agent-Based Install (ABI) steps, workflow, and reference configuration for the gs-baremetal cluster profile. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@openshift-ci openshift-ci Bot requested review from dfrazzette and dshchedr May 7, 2026 00:07
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (1)
ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.yaml (1)

15-15: ⚡ Quick win

Use a latest docs URL to avoid stale workflow guidance.

Line 15 pins documentation to OCP 4.12, which can drift from current installer behavior. Prefer the .../latest/... docs path used elsewhere in this PR.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.yaml`
at line 15, Update the pinned Red Hat docs URL in the
gs-baremetal-agent-install-workflow.yaml comment so it uses the version-agnostic
path; specifically replace the segment "4.12" in the URL
"https://docs.redhat.com/en/documentation/openshift_container_platform/4.12/html/installing_an_on-premise_cluster_with_the_agent-based-installer/preparing-to-install-with-the-agent-based-installer"
with "latest" so the comment references the .../latest/... docs path and won’t
drift from current installer behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@ci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__abi-test.yaml`:
- Line 25: The cron entry "0 23 31 2 *" is invalid (Feb 31) and prevents the job
from ever running; locate both occurrences of that exact string in the YAML (the
cron keys present in the ABI test job) and replace each with a valid cron
expression that matches the intended cadence (for example, use a valid day in
February like the 28th or choose a monthly/last-day-safe schedule that covers
February correctly); ensure both instances are updated consistently so the job
actually runs.

In `@ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh`:
- Around line 16-21: The scripts fetched via curl in the eval blocks
(BuildCustomScriptsFromYAML.sh and EnsureReqs.sh) currently reference the "main"
branch; change both URLs to use immutable commit SHAs instead (replace "main" in
the two curl URLs with the specific commit hash for each script), update any
other instances of the same pattern (e.g., in abi-install-bmc-commands.sh and
other step-registry scripts), and verify the fetched versions still work with
the functions invoked (EnsureReqs yq and whatever BuildCustomScriptsFromYAML
provides) before committing.

In `@ci-operator/step-registry/abi/install/bmc/abi-install-bmc-commands.sh`:
- Around line 399-408: The current vendor-prep case forces a hard-fail for
non-Dell BMCs by using "(* ) false;;" which aborts the install; update the case
for bmcVend so only the Dell branch runs the PATCH and all other vendors simply
no-op and continue (replace the default "(* ) false;;" with a no-op branch like
"(* ) ;;" or an explicit comment/no-op), keeping the RedfishAPIcall invocation
and variables (bmcInfo, bmcURL, bmcMgrId, bmcVend) unchanged.
- Around line 18-26: The step directly evals remote scripts from GitHub main
(BuildCustomScriptsFromYAML.sh, Vault--BitWarden--UploadAttachment.sh,
EnsureReqs.sh) which is unsafe; pin or vendor them and stop executing main at
runtime. Fetch and commit the exact versions (or reference a specific commit
tag) of those three scripts into the repo (or update CI to download by SHA and
verify checksum), then change the calls that use eval "$(curl ...)" to source
the local vendored files (or curl the pinned raw URL and verify checksum before
sourcing). Also update the EnsureReqs invocation (EnsureReqs yq chisel) to call
the vendored EnsureReqs implementation so runtime behavior is unchanged but
reproducible and auditable.
- Around line 67-90: The RedfishAPIcall function always returns success because
of the trailing "true"; remove that final "true" so curl failures propagate and
callers (e.g., the boot-override PATCH fallback logic) can detect errors and run
the fallback path; locate the function named RedfishAPIcall and delete the
concluding "true" so the function returns the actual exit status from curl.

In `@ci-operator/step-registry/abi/README.md`:
- Around line 6-11: The markdown table in the README (the block starting with
the header row "| Step                | Reference (source of truth)..." )
violates MD058; add a single blank line immediately before the table and a
single blank line immediately after the table so it is separated from
surrounding paragraphs, then re-run the linter to verify MD058 is resolved.
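
The "download by SHA and verify checksum before sourcing" idea raised for the remote helper scripts could look roughly like this. It is only a sketch: source_verified and EnsureReqsDemo are hypothetical names, and a local temporary file stands in for a script fetched from a commit-pinned raw URL.

```shell
#!/usr/bin/env bash
# Sketch: source a helper script only if its SHA-256 matches a pinned value.
# The local file stands in for a download; all names here are hypothetical.
source_verified() {
    local file=$1 expected=$2 actual
    actual="$(sha256sum "${file}" | cut -d' ' -f1)"
    if [ "${actual}" != "${expected}" ]; then
        echo "checksum mismatch for ${file}" >&2
        return 1
    fi
    # shellcheck disable=SC1090
    source "${file}"
}

helper="$(mktemp)"
echo 'EnsureReqsDemo() { echo "requirements ok"; }' > "${helper}"
pinned="$(sha256sum "${helper}" | cut -d' ' -f1)"   # would be recorded at review time

source_verified "${helper}" "${pinned}" && EnsureReqsDemo

# A modified file is rejected instead of being executed.
echo 'echo tampered' >> "${helper}"
source_verified "${helper}" "${pinned}" || echo "refused to source"
```

In a real step the pinned digest would live next to the pinned commit SHA in the script, so changing either becomes a reviewable diff.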

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 2635734a-3225-4d33-8956-d96852617a62

📥 Commits

Reviewing files that changed from the base of the PR and between 78291f3 and c4c67fa.

⛔ Files ignored due to path filters (2)
  • ci-operator/jobs/RedHatQE/interop-testing/RedHatQE-interop-testing-master-periodics.yaml is excluded by !ci-operator/jobs/**
  • ci-operator/jobs/RedHatQE/interop-testing/RedHatQE-interop-testing-master-presubmits.yaml is excluded by !ci-operator/jobs/**
📒 Files selected for processing (20)
  • ci-operator/config/RedHatQE/interop-testing/.config.prowgen
  • ci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__abi-test.yaml
  • ci-operator/config/RedHatQE/interop-testing/RedHatQE-interop-testing-master__ciOpEmul--imgRetr.yaml
  • ci-operator/step-registry/abi/OWNERS
  • ci-operator/step-registry/abi/README.md
  • ci-operator/step-registry/abi/conf/OWNERS
  • ci-operator/step-registry/abi/conf/bm/OWNERS
  • ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh
  • ci-operator/step-registry/abi/conf/bm/abi-conf-bm-ref.metadata.json
  • ci-operator/step-registry/abi/conf/bm/abi-conf-bm-ref.yaml
  • ci-operator/step-registry/abi/install/OWNERS
  • ci-operator/step-registry/abi/install/bmc/OWNERS
  • ci-operator/step-registry/abi/install/bmc/abi-install-bmc-commands.sh
  • ci-operator/step-registry/abi/install/bmc/abi-install-bmc-ref.metadata.json
  • ci-operator/step-registry/abi/install/bmc/abi-install-bmc-ref.yaml
  • ci-operator/step-registry/gs-baremetal/OWNERS
  • ci-operator/step-registry/gs-baremetal/README.md
  • ci-operator/step-registry/gs-baremetal/agent-install/OWNERS
  • ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.metadata.json
  • ci-operator/step-registry/gs-baremetal/agent-install/gs-baremetal-agent-install-workflow.yaml
💤 Files with no reviewable changes (1)
  • ci-operator/step-registry/gs-baremetal/OWNERS

Comment thread ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh
Comment thread ci-operator/step-registry/abi/install/bmc/abi-install-bmc-commands.sh Outdated
Comment thread ci-operator/step-registry/abi/README.md

sg-rh commented May 7, 2026

Supersedes #76365

openshift/release#76365 was closed after the GitHub PR page repeatedly failed to load, which blocked review and updates. This PR carries the same ABI bare-metal step-registry / workflow work forward.

sg-rh commented May 7, 2026

/uncc @dfrazzette @dshchedr

@openshift-ci openshift-ci Bot removed request for dfrazzette and dshchedr May 7, 2026 01:38

sg-rh commented May 7, 2026

@coderabbitai pause

coderabbitai Bot commented May 7, 2026

✅ Actions performed

Reviews paused.

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 13, 2026
@openshift-merge-bot

@sg-rh, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

couldn't prepare candidate: couldn't rebase candidate onto 255a78bafd817794898efe6e5f3910ba11b5434f due to conflicts
Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@openshift-merge-bot

@sg-rh, pj-rehearse: unable to determine affected jobs. This could be due to a branch that needs to be rebased. ERROR:

couldn't prepare candidate: couldn't rebase candidate onto 951f9ff028faad4d22c3d44433bf60eb936315a1 due to conflicts

@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 13, 2026
@etirta etirta force-pushed the abi-ocp-bm branch 10 times, most recently from ac15872 to 7bc41c1 Compare May 16, 2026 18:30

etirta commented May 17, 2026

Tests performed:

  • test--01-abi-conf-bm.log:
    • Testing full ISO generation.
    • Testing minimal ISO generation.
  • test--02-abi-install-bmc.log
    • Deploying IENG Goldman Sachs HW Cluster, with the following scenarios:
      • Node gs04, BMC wipe emulation:
        • 1st boot: emulating stuck at boot phase / failed boot.
        • 2nd boot: success.
      • Node gs05, boot into prev. OS on HDD:
        • 1st boot: SSH Key rejection (emulating prev. Cluster was deployed using different SSH Key).
        • 2nd boot: success.
      • Node gs06, boot into prev. OS on HDD:
        • 1st boot: Detect prev. OS.
        • 2nd boot: success.
      • Nodes gs02, gs03, gs01: No emulation, normal successful boot.

ci-artifacts.zip

@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label May 18, 2026
@sg-rh sg-rh changed the title Create gs-baremetal Agent-Based Install workflow and scripts Create GS baremetal Agent-Based Install steps, chain, and sample health workflow May 18, 2026
@sg-rh sg-rh force-pushed the abi-ocp-bm branch 3 times, most recently from 3df3552 to 6a63324 Compare May 19, 2026 05:20

openshift-ci Bot commented May 19, 2026

@sg-rh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/ci-operator-config-metadata | 3ceda05 | link | true | /test ci-operator-config-metadata |

Comment thread ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh
Comment thread ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh
Comment thread ci-operator/step-registry/abi/conf/bm/abi-conf-bm-commands.sh

etirta commented May 19, 2026

@coderabbitai resume

coderabbitai Bot commented May 19, 2026

✅ Actions performed

Reviews resumed.

Comment thread ci-operator/step-registry/abi/README.md
@sg-rh sg-rh changed the title Create GS baremetal Agent-Based Install steps, chain, and sample health workflow Create GS baremetal Agent-Based Install steps, chain, and reference health workflow May 19, 2026

@amp-rh amp-rh left a comment

Nice work on the step registry structure and documentation. The README, ref YAML env var docs, and architecture are all well done. A few questions and suggestions inline, mostly around debuggability of retry loops and failure-mode handling on shared bare-metal infrastructure.

Comment on lines +512 to +525
        WipeDisks "${tPID}" "${bmcInfo}" \
            "${bmcURL}" "${bmcSysId}" "${bmcMgrId}" \
            "${diskWipeMethod}" && break
        ((--tryLeft))
    done
    # Restore Boot Order.
    [ -z "${diskWipeMethod}" ] || {
        RedfishAPIcall "${bmcInfo}" "${bmcURL}" PATCH \
            "Systems/${bmcSysId}" \
            -d '{"Boot": {
                "BootSourceOverrideEnabled": "Disabled",
                "BootSourceOverrideTarget": "None"
            }}'
    }

suggestion (debuggability): All three retry loops in this script (ISO URL probe at L407, bootstrap-complete wait here, install-complete wait at L562) share the same pattern where exhausted retries produce a cryptic failure.

When tryLeft decrements to 0, ((--tryLeft)) evaluates to ((0)) which returns exit code 1 under set -e, killing the subshell. The CI log will show something like:

abi-install-bmc-commands.sh: line 525: ((: --tryLeft: 0

...which doesn't tell the person triaging that bootstrap-complete exhausted N attempts.

A small guard after each while loop would make failures much more debuggable:

((tryLeft)) || { echo "FATAL: bootstrap-complete failed after ${OCP__ABI__WAIT__BOOTSTRAP__TRY} attempts" >&2; exit 1; }

(Same pattern recommended for the install-complete and ISO probe loops.)
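
One detail worth noting about the guard above: under errexit, the `((--tryLeft))` that reaches zero terminates the shell at the decrement itself, so a check placed after the loop would not be reached in the exhausted case. A minimal sketch of a loop that avoids this and still names what failed is shown below. The retry and flaky functions are hypothetical and not part of the PR.

```shell
#!/usr/bin/env bash
# Sketch: a bounded retry loop that is errexit-safe and reports exhaustion.
# retry and flaky are hypothetical names used only for this illustration.
retry() {
    local -i maxTry=$1 tryLeft=$1; shift
    local label=$1; shift
    while ((tryLeft > 0)); do
        "$@" && return 0
        tryLeft=$((tryLeft - 1))   # plain assignment: status 0 even at zero
    done
    echo "FATAL: ${label} failed after ${maxTry} attempts" >&2
    return 1
}

# Demo command that succeeds on its third call.
calls=0
flaky() { calls=$((calls + 1)); ((calls >= 3)); }

retry 5 'flaky demo' flaky && echo "succeeded after ${calls} calls"
retry 2 'always-failing demo' false || echo "gave up"
```

The decrement is written as an assignment because an arithmetic command such as `((--tryLeft))` returns a non-zero status whenever its result is zero.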

Reply (Contributor):

As I mentioned in my response to @shakyav regarding similar concerns, it is actually quite clear already:

etirtara@14c692f90aab.4 2026-05-19 w21 19:54:37 "/wrk--edttjRedHat.RHP-k8s--dev/local/repo/CSPI-QE/ocp-ci-docs"
$ bash -exc "$(cat - 0<<'EOF'
    typeset -i tryLeft=2
    while ((tryLeft)); do
        : foo
        ((--tryLeft))
    done
    : success
    true
EOF
)"
+ typeset -i tryLeft=2
+ (( tryLeft ))
+ : foo
+ (( --tryLeft ))
+ (( tryLeft ))
+ : foo
+ (( --tryLeft ))

The whole purpose of having both xtrace and errexit enabled is to reduce the need for instrumenting explicit debug information in the code.

Comment on lines +528 to +560
' 0< "${bmcInfo}")
} |& tee "${ARTIFACT_DIR}/ocp--installer--bmc.log") & taskPIDs+=($!)
# Wait for BootStrap Node to finish.
(
    typeset -i tryLeft="${OCP__ABI__WAIT__BOOTSTRAP__TRY}"
    while ((tryLeft)); do
        openshift-install agent wait-for bootstrap-complete && break
        ((--tryLeft))
    done
)
cp -f "${OCP__ABI__CLUSTER_DIR}/auth/kubeconfig" "${SHARED_DIR}/kubeconfig-minimal"

# Day-1.5 Phase.
(
    typeset cfgKey='' cfgVal=''
    export KUBECONFIG="${OCP__ABI__CLUSTER_DIR}/auth/kubeconfig"
    while IFS=$'\t' read -r cfgKey cfgVal; do
        case ${cfgKey} in
        (NodeProv)
            [ "${cfgVal}" = false ] && {
                # Workers are provisioned by ABI. No
                # BareMetalHost CRDs or Ironic
                # provisioning network.
                while true; do
                    oc -n openshift-machine-api \
                        scale MachineSets \
                        --replicas 0 --all \
                        && break || sleep 60
                done
            }
            ;;
        esac
    done 0< <(

question: The Day-1.5 subshell runs in the background (& taskPIDs+=($!)) and performs oc scale MachineSets --replicas 0 --all when NodeProv: false. If this fails, the install proceeds through wait-for install-complete, Day-2 scripts, and reports success, but the cluster has both ABI-provisioned workers and an active MachineSet controller potentially fighting over them.

The EXIT trap kills/waits background tasks but doesn't propagate their exit codes. Is this intentional? If Day-1.5 failure should be fatal, a wait + exit-code check between install-complete and the virtual media eject block would catch it:

wait "${day15_pid}" || { echo "FATAL: Day-1.5 phase failed" >&2; exit 1; }
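
For context, the wait-and-check pattern suggested here works as sketched below. The day15_task function is a stand-in for the real Day-1.5 subshell, and the status value 3 is arbitrary.

```shell
#!/usr/bin/env bash
# Sketch: capture a background task's PID and check its exit status later.
# day15_task is a stand-in for the real Day-1.5 subshell.
day15_task() { sleep 1; return 3; }

day15_task & day15_pid=$!

# ... foreground work (for example, waiting for install-complete) runs here ...

day15_status=0
wait "${day15_pid}" || day15_status=$?
if ((day15_status != 0)); then
    echo "Day-1.5 phase failed with status ${day15_status}" >&2
fi
```

`wait` with an explicit PID returns that job's exit status even if the job finished long before `wait` was called, which is what makes a late check possible.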

Reply (Contributor):

If oc scale MachineSets --replicas 0 --all fails, one of the Cluster Operators (either baremetal, control-plane-machine-set, or both) will fail to become Ready. Consequently, agent wait-for install-complete will eventually time out.

The purpose of the Day-1.5 phase is to perform the necessary adjustments to prevent agent wait-for install-complete from timing out when we lack the full infrastructure required for an ABI-based OCP deployment. That being said, configuring the manifest during the Day-1 phase remains the preferred approach. If we acquire more information down the road on how to properly configure the manifest properties, we can migrate to the Day-1 method and deprecate the Day-1.5 phase entirely.

The broader architectural question is whether we should fail fast and exit with an error during Day-1.5. It is a good idea in theory, and merely propagating the exit code is easy. The real challenge is propagating the failure early enough to intercept and cancel the ongoing agent wait-for install-complete command, and we currently have no mechanism to do either.
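
As a point of reference for the cancellation problem, bash 4.3 and later offer `wait -n`, which returns as soon as any background job finishes. The sketch below is one conceivable shape for such a mechanism, not something the PR implements. Both background jobs are stand-ins, and a real openshift-install invocation would likely need process-group handling that is not shown.

```shell
#!/usr/bin/env bash
# Sketch (needs bash >= 4.3 for `wait -n`): run the long wait and a side
# task as background jobs, and cancel the long wait if the side task fails
# first. Both jobs are stand-ins, not the step's real commands.
sleep 30 & waitPID=$!        # stands in for wait-for install-complete
( sleep 1; exit 1 ) &        # stands in for a failing Day-1.5 phase

firstStatus=0
wait -n || firstStatus=$?    # returns when either job finishes

result='ok'
if ((firstStatus != 0)) && kill -0 "${waitPID}" 2>/dev/null; then
    # A job failed while the long wait is still running: cancel the wait.
    kill "${waitPID}"
    wait "${waitPID}" 2>/dev/null || true
    result='cancelled'
fi
echo "${result}"
```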

pre:
- chain: abi-chains-bm--bmc
- chain: cucushift-installer-check-cluster-health
post: []

question: post: [] means that if the job fails during bootstrap wait or install-complete, there's no cleanup step to eject virtual media or restore boot order on the BMC hosts. The eject loop in abi-install-bmc-commands.sh (L573-580) and the boot-order restore (L503-510) only execute on the happy path.

On shared bare-metal infrastructure, this could leave ISOs mounted and boot order set to CD, potentially affecting the next job that uses these hosts. Is there a plan for the teardown step mentioned in the PR description ("teardown TBD"), or is BMC state recovery handled externally?


We configure an EXIT trap in the abi-install-bmc step so that when the step script terminates, it tears down both the secure tunnel and the HTTP server, rendering the virtual CD (VCD) ISO mount invalid. Additionally, a safeguard is executed at the very beginning to eject any existing VCD as the first action.

Furthermore, since this runs on bare metal within our own lab infrastructure, a strict teardown is mostly moot. In fact, skipping the teardown could be beneficial for post-mortem debugging if needed.
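Should BMC state recovery on the failure path ever become necessary, the same EXIT-trap approach could be extended to cover the eject. The sketch below is only an illustration: `eject_virtual_media`, `run_install_with_cleanup`, and the `HOSTS` variable are hypothetical names, and the real step would call its own BMC tooling instead of `echo`.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the per-host virtual media eject.
eject_virtual_media() {
    echo "eject ${1}"
}

# The parenthesized body runs in a subshell, so the trap is scoped to this call.
run_install_with_cleanup() (
    # Fires on success and on failure alike; a failed eject does not mask the result.
    trap 'for h in ${HOSTS}; do eject_virtual_media "${h}" || true; done' EXIT
    rc=0
    "$@" || rc=$?
    exit "${rc}"
)
```

For example, `HOSTS='a b' run_install_with_cleanup false` still runs the eject for both hosts and returns the failing command's status.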

Comment on lines +35 to +48
function openshift-install () {
    typeset -i es=0
    {
        echo \
            "$(date -Iseconds)|${FUNCNAME[0]@Q} ${*@Q}"$'\n'"$(printf '%.0s-' {1..80})"
        command openshift-install \
            --dir "${OCP__ABI__CLUSTER_DIR}/" \
            --log-level "${OCP__ABI__INSTLR_LOG_LEVEL}" \
            "$@" 2>&1 || es=$?
        echo "$(printf '%.0s=' {1..80})"
        exit ${es}
    } | tee -a "${ARTIFACT_DIR}/ocp--installer--cluster.log"
    return ${PIPESTATUS[0]}
}
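The exit-status handling in this wrapper can be seen in isolation in the reduced sketch below. `log_and_run` and `LOG_FILE` are hypothetical names used only for illustration; the structure mirrors the wrapper, minus the log banners. The key points are that the brace group runs in a subshell because it is a pipeline stage, so `exit` ends only that stage, and that `PIPESTATUS[0]` recovers the command's status rather than `tee`'s.

```shell
#!/usr/bin/env bash
# Reduced illustration of the wrapper's status handling (hypothetical names).
log_and_run() {
    local -i es=0
    {
        "$@" 2>&1 || es=$?
        exit "${es}"            # ends only the subshell for this pipeline stage
    } | tee -a "${LOG_FILE:-/dev/null}"
    return "${PIPESTATUS[0]}"   # status of the left-hand side, not of tee
}
```

Without the `PIPESTATUS[0]` lookup, the function would return `tee`'s status, which is almost always 0.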

nit: This openshift-install wrapper function is identical in both scripts (conf L35-48, install L50-62). If the logging format or argument handling needs updating, it must be changed in two places. Given that both scripts already source shared libraries from OpenShift-LP-QE--Tools, this could live there as well, or be extracted to a small helper that both scripts source from the step registry.


That is a valid point. My concern is that a different installation process might not always share or require these exact same CLI parameters. While being called more than once is a good enough reason to create a function, moving it into an external library requires a much broader use case.

Introducing an external library carries a higher cost and increases our dependency burden; it needs stronger justification if it is only going to be used by two Step scripts.

I am not opposing the idea right off the bat, as I generally prefer to do things right from the get-go, but unfortunately, time is not on our side for this particular PR. Let's keep this idea in mind for when we start developing more Conf and Install steps for other targets in the future.

Thanks for the great suggestion!

@openshift-merge-bot

[REHEARSALNOTIFIER]
@sg-rh: no rehearsable tests are affected by this change

Note: If this PR includes changes to step registry files (ci-operator/step-registry/) and you expected jobs to be found, try rebasing your PR onto the base branch. This helps pj-rehearse accurately detect changes when the base branch has moved forward.

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@shakyav

shakyav commented May 19, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm label May 19, 2026
@openshift-ci

openshift-ci Bot commented May 19, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sg-rh, shakyav

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot Bot merged commit c9c6271 into openshift:main May 19, 2026
11 checks passed

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
lgtm: Indicates that a PR is ready to be merged.
rehearsals-ack: Signifies that rehearsal jobs have been acknowledged.


5 participants