Fix RHCOS version mismatch for Nutanix IPI installation #73946
openshift-merge-bot[bot] merged 1 commit into openshift:master
Conversation
|
@rrasouli could you help? |
|
/assign @rrasouli |
|
/lgtm |
|
/cc @jcpowermac This PR fixes the Nutanix IPI installation issue where RHCOS version mismatch causes installation failures. Could you help review and approve? Thanks! |
| EXTRACT_MANIFEST_INCLUDED: "true" | ||
| OVERRIDE_RHCOS_IMAGE: https://rhcos.mirror.openshift.com/art/storage/prod/streams/rhel-9.6/builds/9.6.20260117-0/x86_64/rhcos-9.6.20260117-0-nutanix.x86_64.qcow2 | ||
| TEST_FILTERS: ~ChkUpgrade&;~ConnectedOnly&;Smokerun& | ||
| TEST_FILTERS: ~ChkUpgrade&;~ConnectedOnly&;60944|76765& |
No need to exclude 60944 and 76765 since we have fixed them, they are not backported yet
Add OVERRIDE_RHCOS_IMAGE environment variable to force the use of the correct RHCOS version (9.6.20260117-0) for the debug-winc-nutanix-ipi test.

Root cause:
- The installer binary (commit a6c94ff2, built 2026-01-14) has embedded RHCOS stream metadata pointing to version 9.6.20251212-1 (built 2025-12-12)
- However, the release payload 4.21.0-0.nightly-2026-01-22-192129 expects machine-os version 9.6.20260117-0 (built 2026-01-17)
- This 36-day version gap causes API Server startup failure due to missing kernel modules, CRI-O updates, and other dependencies required by Kubernetes 1.34.2

The mismatch manifested as:
- Bootstrap phase: SUCCESS (basic etcd service works)
- Control plane VMs: Created successfully
- API Server: FAILED to start (missing required runtime components)
- Install result: Timeout waiting for https://api...:6443

Solution:
Override the RHCOS image URL to use the correct version that matches the release payload's machine-os requirement.

Also declare the OVERRIDE_RHCOS_IMAGE parameter in the ipi-conf-nutanix step reference to satisfy CI validation requirements.
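The 36-day figure in the root cause can be checked from the two version strings alone. The following is a minimal sketch, assuming RHCOS versions follow the layout `<major>.<minor>.<YYYYMMDD>-<build>` seen in this PR; the function names are hypothetical and are not part of the installer or of the CI tooling.

```python
from datetime import date, datetime


def rhcos_build_date(version: str) -> date:
    """Extract the build date from an RHCOS version such as '9.6.20260117-0'.

    Assumption: the third dot-separated field is '<YYYYMMDD>-<build number>'.
    """
    stamp = version.split(".")[2].split("-")[0]
    return datetime.strptime(stamp, "%Y%m%d").date()


def build_gap_days(installer_version: str, payload_version: str) -> int:
    """Days between the installer's embedded RHCOS build and the payload's."""
    delta = rhcos_build_date(payload_version) - rhcos_build_date(installer_version)
    return delta.days


# Versions quoted in the commit message above.
gap = build_gap_days("9.6.20251212-1", "9.6.20260117-0")
print(gap)  # 36
```

A non-zero gap does not by itself prove an incompatibility; it only flags that the installer and the release payload disagree about the machine-os build.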
6f71590 to 51e84b9
|
/lgtm |
|
[REHEARSALNOTIFIER] A total of 596 jobs have been affected by this change. |
|
/lgtm |
|
Hi @jianlinliu, could you check whether this is good to go? Thanks. |
Please check #73946 (comment) to find the proper owner to review and approve. |
|
@rvanderp3 @jcpowermac @JoelSpeed Could you help review? This fixes RHCOS version mismatch for Nutanix |
|
/approve |
|
@liangxia could you help? Thanks! |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: JoelSpeed, liangxia, rrasouli, weinliu |
|
/rehearsals-ack |
|
Could someone with write access please add the |
|
/label rehearsals-ack |
|
@weinliu: The label(s) DetailsIn response to this:
|
/pj-rehearse ack |
|
@weinliu: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel. |
|
@weinliu: all tests passed! |
PR openshift#76620 added debug-winc-nutanix-ipi jobs to main, release-4.22, and release-4.23 but omitted the critical OVERRIDE_RHCOS_IMAGE environment variable that was already present in release-4.21 (added by PR openshift#73946).

Without this override, Nutanix IPI installations fail during bootstrap because of RHCOS version mismatches between the installer's embedded metadata and the release payload's expected machine-os version. This manifests as timeout errors (exit 28) in the ipi-conf-nutanix-context step when trying to communicate with the Nutanix Prism Central API.

This fix adds OVERRIDE_RHCOS_IMAGE pointing to RHCOS 9.6.20260323-1 (same version used in working 4.21 jobs) to prevent API server startup failures during cluster bootstrap.

Fixes: openshift#76620
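A regression like the one described above (a job copied to new branches without its override) could be caught by a small check over the parsed job configs. This is only a sketch: it assumes each branch's config has already been loaded into plain dictionaries shaped like `{"as": ..., "steps": {"env": {...}}}`, and the function name, sample data, and placeholder URL are hypothetical.

```python
def branches_missing_override(configs, job_name="debug-winc-nutanix-ipi",
                              var="OVERRIDE_RHCOS_IMAGE"):
    """Return the branches whose `job_name` test lacks a non-empty `var`.

    `configs` maps a branch name to a list of test dicts. The dict shape is
    an assumption made for this sketch, not a guaranteed schema.
    """
    missing = []
    for branch, tests in configs.items():
        for test in tests:
            if test.get("as") != job_name:
                continue
            env = test.get("steps", {}).get("env", {})
            if not env.get(var):
                missing.append(branch)
    return sorted(missing)


# Hypothetical, hand-written sample data for illustration only.
sample = {
    "release-4.21": [{"as": "debug-winc-nutanix-ipi",
                      "steps": {"env": {"OVERRIDE_RHCOS_IMAGE": "https://example.invalid/rhcos.qcow2"}}}],
    "release-4.22": [{"as": "debug-winc-nutanix-ipi",
                      "steps": {"env": {"EXTRACT_MANIFEST_INCLUDED": "true"}}}],
    "main": [{"as": "debug-winc-nutanix-ipi", "steps": {}}],
}
print(branches_missing_override(sample))  # ['main', 'release-4.22']
```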
Problem Description
This PR addresses a critical RHCOS version incompatibility in Nutanix IPI cluster installations.
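Root Cause:
- The installer binary (commit a6c94ff2, built 2026-01-14) has embedded RHCOS stream metadata pointing to version 9.6.20251212-1 (built 2025-12-12)
- The release payload 4.21.0-0.nightly-2026-01-22-192129 expects machine-os version 9.6.20260117-0 (built 2026-01-17)
- This 36-day version gap causes API Server startup failure due to missing kernel modules, CRI-O updates, and other dependencies required by Kubernetes 1.34.2
Symptoms:
- Bootstrap phase: SUCCESS (basic etcd service works)
- Control plane VMs: Created successfully
- API Server: FAILED to start (missing required runtime components)
- Install result: Timeout waiting for https://api...:6443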
Solution
Add OVERRIDE_RHCOS_IMAGE environment variable to force the use of the correct RHCOS version that matches the release payload's machine-os requirement.
Changes
ci-operator/config/.../openshift-openshift-tests-private-release-4.21.yaml
- OVERRIDE_RHCOS_IMAGE pointing to the correct build (9.6.20260117-0)
- TEST_FILTERS: ~ChkUpgrade&;~ConnectedOnly&;Smokerun& (per @rrasouli's feedback, no need to exclude 60944 and 76765 since they are fixed)
- debug-winc-nutanix-ipi job configuration
ci-operator/step-registry/ipi/conf/nutanix/ipi-conf-nutanix-ref.yaml
- OVERRIDE_RHCOS_IMAGE parameter to satisfy CI validation requirements
ci-operator/jobs/.../openshift-openshift-tests-private-release-4.21-presubmits.yaml
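The override value used in this PR follows a visible pattern: stream, build version, architecture, and platform all appear in the URL. The sketch below rebuilds that URL from a version string. It assumes the stream name is `rhel-<major>.<minor>` and that other builds use the same path layout on the mirror, which this PR does not confirm; the function name is hypothetical.

```python
def rhcos_nutanix_image_url(version: str, arch: str = "x86_64") -> str:
    """Build a Nutanix qcow2 URL for an RHCOS version such as '9.6.20260117-0'.

    Assumption: the path layout matches the single URL shown in this PR.
    """
    major, minor, _ = version.split(".", 2)
    stream = f"rhel-{major}.{minor}"
    base = "https://rhcos.mirror.openshift.com/art/storage/prod/streams"
    return (f"{base}/{stream}/builds/{version}/{arch}/"
            f"rhcos-{version}-nutanix.{arch}.qcow2")


print(rhcos_nutanix_image_url("9.6.20260117-0"))
```

For 9.6.20260117-0 this reproduces the OVERRIDE_RHCOS_IMAGE value set in the job config. Deriving the URL from the payload's machine-os version, rather than hard-coding it, would keep the two from drifting apart, though a generated URL should still be checked for existence before use.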
/cc @rrasouli
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>