OCPBUGS-82060: Fix GCP ARM capacity exhaustion in multi-arch jobs #77483
jianlinliu wants to merge 1 commit into
@jianlinliu: This pull request references Jira Issue OCPBUGS-82060, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
067d338 to b9be8e2
b9be8e2 to 1c5d6a5
@jianlinliu: This pull request references Jira Issue OCPBUGS-82060, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
@jianlinliu: This pull request references Jira Issue OCPBUGS-82060, which is valid. 3 validation(s) were run on this bug
a83558f to bc06f12
/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-upgrade-gcp-ovn-multi-x-ax periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-gcp-ovn-multi-x-x-to-a-x
@jianlinliu: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.
05544a0 to 28d3298
/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-upgrade-gcp-ovn-multi-x-ax periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-gcp-ovn-multi-x-x-to-a-x periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a periodic-ci-openshift-multiarch-main-nightly-4.17-upgrade-from-nightly-4.16-ocp-e2e-upgrade-gcp-ovn-multi-x-ax
/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a
4595e1e to 7af8931
df8cf0e to b94e64d
/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-gcp-ovn-multi-x-x-to-a-x periodic-ci-openshift-multiarch-main-nightly-4.17-upgrade-from-stable-4.16-ocp-e2e-upgrade-gcp-ovn-multi-a-a periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a
@jianlinliu: job(s): periodic-ci-openshift-multiarch-main-nightly-4.17-upgrade-from-stable-4.16-ocp-e2e-upgrade-gcp-ovn-multi-a-a, periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a either don't exist or were not found to be affected, and cannot be rehearsed
/pj-rehearse abort
b94e64d to 623b787
623b787 to ee6dd22
This change addresses GCP ARM instance capacity issues in multi-arch CI jobs through zone randomization, instance sizing optimization, balanced worker configuration, and schedule distribution.

Root Cause:
- GCP T2A ARM instances experienced capacity exhaustion in the us-central1-a zone
- ipi-install-heterogeneous always used the first machineset, concentrating load
- Larger instances (standard-4) have lower availability than smaller instances
- Unbalanced worker configuration (3 AMD64 + 2 ARM64) in heterogeneous jobs
- All GCP ARM jobs were scheduled at the same time (Sunday 11:00), creating resource contention

Solutions:
1. Zone randomization: Random machineset selection distributes ARM instances across zones to reduce capacity pressure
2. Instance sizing optimization: Use smaller instances (t2a-standard-2) for additional workers and migration infra to improve availability
3. Balanced worker configuration: Set COMPUTE_NODE_REPLICAS to 2 for GCP heterogeneous jobs to create a balanced 2+2 worker layout (2 AMD64 + 2 ARM64)
4. Schedule distribution: Change to interval-based scheduling (168h) for releases 4.19-5.0 to prevent all jobs from running simultaneously
5. Disk type compatibility: Add ADDITIONAL_WORKER_DISK_TYPE parameter for heterogeneous workers

Why T2A (not C4A/N4A):
- C4A and N4A only support Hyperdisk, NOT Persistent Disk
- OpenShift monitoring/logging/registry use pd-standard PVCs by default
- Using C4A/N4A causes "pd-standard disk type cannot be used by c4a-standard-4 machine type" errors during pod volume attachment
- T2A supports BOTH Persistent Disk (pd-standard, pd-balanced, pd-ssd) AND Hyperdisk, ensuring full OpenShift compatibility

Changes:
- ipi-install-heterogeneous: Random machineset selection for zone distribution
- New ADDITIONAL_WORKER_DISK_TYPE parameter for heterogeneous additional workers
- Update GCP multi-arch configs (4.17-5.0):
  * COMPUTE_NODE_TYPE: t2a-standard-4 (unchanged, larger instance)
  * ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → t2a-standard-2 (smaller)
  * COMPUTE_NODE_REPLICAS: "2" (for GCP heterogeneous jobs, balanced layout)
  * MIGRATION_CP_MACHINE_TYPE: t2a-standard-4 (unchanged)
  * MIGRATION_INFRA_MACHINE_TYPE: t2a-standard-4 → t2a-standard-2 (4.20-5.0)
- Update job schedules (4.19-5.0):
  * Changed from cron (Sunday 11:00) to interval: 168h
  * Prevents 18+ jobs from running simultaneously
  * Jobs will naturally distribute based on completion times

Instance Type Configuration:
- All releases (4.17-5.0) use the T2A (Tau) processor
- COMPUTE_NODE_TYPE: t2a-standard-4 (4 vCPU, larger instance)
- ADDITIONAL_WORKER_VM_TYPE: t2a-standard-2 (2 vCPU, smaller instance)
- MIGRATION_CP_MACHINE_TYPE: t2a-standard-4 (unchanged)
- MIGRATION_INFRA_MACHINE_TYPE: t2a-standard-2 (2 vCPU, optimized for 4.20-5.0)

Worker Configuration for GCP Heterogeneous Jobs:
- Before: 3 AMD64 workers + 2 ARM64 additional workers = 5 total
- After: 2 AMD64 workers + 2 ARM64 additional workers = 4 total
- Benefits: Balanced configuration, reduced resources, faster installation

Schedule Configuration:
- 4.17-4.18: No GCP heterogeneous jobs, or existing schedules kept
- 4.19-5.0: interval: 168h (weekly, distributed execution)
- Prevents resource contention from simultaneous job execution

Disk Type Configuration:
- T2A supports both Persistent Disk and Hyperdisk
- No explicit disk type parameters needed (defaults to pd-standard)
- ADDITIONAL_WORKER_DISK_TYPE parameter added to ref for future flexibility

Benefits:
- Reduces capacity exhaustion through zone randomization
- Smaller instances (standard-2) have better availability than standard-4
- Balanced worker layout (2+2) is more efficient than the unbalanced one (3+2)
- Reduced total resource consumption (4 workers vs 5)
- Distributed job execution prevents resource contention
- Full OpenShift compatibility with default storage (pd-standard)
- Lower resource consumption for additional workers and migration infra

Related: OCPBUGS-82060
ee6dd22 to a289177
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: jianlinliu
Needs approval from an approver in each of these files:
[REHEARSALNOTIFIER]
A total of 773 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.
/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-gcp-ovn-multi-x-x-to-a-x periodic-ci-openshift-multiarch-main-nightly-4.17-upgrade-from-stable-4.16-ocp-e2e-upgrade-gcp-ovn-multi-a-a periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-upgrade-gcp-ovn-multi-x-ax
@jianlinliu: job(s): periodic-ci-openshift-multiarch-main-nightly-4.17-upgrade-from-stable-4.16-ocp-e2e-upgrade-gcp-ovn-multi-a-a, periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a either don't exist or were not found to be affected, and cannot be rehearsed
@jianlinliu: The following tests failed, say
Closing, please open a separate request if this work is still required.
@jianlinliu: This pull request references Jira Issue OCPBUGS-82060. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.
Summary
This PR addresses Component Readiness regression 37839, in which GCP multi-arch jobs were failing due to ARM instance capacity exhaustion. It does so through zone randomization, instance sizing optimization, balanced worker configuration, and schedule distribution.
Root Causes
Solutions
1. Zone Randomization
Random machineset selection in ipi-install-heterogeneous distributes ARM instances across zones to reduce capacity pressure.
2. Instance Sizing Optimization
Use smaller instances (standard-2) for additional workers and migration infra to improve availability and reduce resource consumption.
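As a rough illustration of the resource reduction, the sketch below totals the ARM vCPUs requested by the additional workers before and after the change. The vCPU counts and the replica count of 2 come from this PR's description; the function and variable names are hypothetical.

```python
# vCPU counts per machine type, as stated in this PR's description.
VCPUS = {"t2a-standard-2": 2, "t2a-standard-4": 4}


def total_vcpus(vm_type: str, replicas: int) -> int:
    """Return the total vCPUs requested for `replicas` instances of `vm_type`."""
    return VCPUS[vm_type] * replicas


# Two ARM64 additional workers, per the heterogeneous layout in this PR.
ARM_REPLICAS = 2
before = total_vcpus("t2a-standard-4", ARM_REPLICAS)
after = total_vcpus("t2a-standard-2", ARM_REPLICAS)
print(f"ARM vCPUs for additional workers: before={before}, after={after}")
```

Halving the per-instance size halves the ARM vCPU request per job, which is the capacity pressure this change aims to reduce.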
3. Balanced Worker Configuration
Set COMPUTE_NODE_REPLICAS: "2" for GCP heterogeneous jobs to create a balanced 2+2 worker layout instead of the unbalanced 3+2 configuration.
Before:
After:
Benefits:
4. Schedule Distribution
Change job scheduling from cron to interval-based (168h) for releases 4.19-5.0 to prevent simultaneous execution.
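To make the interval arithmetic concrete, here is a minimal sketch assuming the next run of a job is placed a fixed 168 hours after that job's own previous run, rather than at a shared wall-clock time. The job names and timestamps are hypothetical, and this is not a model of how the scheduler is actually implemented.

```python
from datetime import datetime, timedelta

# Interval value taken from this PR (interval: 168h).
INTERVAL = timedelta(hours=168)


def next_run(last_run: datetime) -> datetime:
    """Return the next run time, one interval after the previous run."""
    return last_run + INTERVAL


# Hypothetical previous-run times for three jobs.
last_runs = {
    "job-a": datetime(2025, 1, 5, 11, 0),
    "job-b": datetime(2025, 1, 5, 14, 30),
    "job-c": datetime(2025, 1, 6, 2, 15),
}

next_runs = {name: next_run(t) for name, t in last_runs.items()}
for name, t in next_runs.items():
    print(name, t.isoformat())
```

Because each job keeps its own offset, jobs whose previous runs differed stay staggered, whereas a shared cron expression would start them all at the same moment.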
Before:
After:
interval: 168h (weekly interval)
Benefits:
5. Why T2A (not C4A/N4A)?
C4A and N4A Compatibility Issue:
T2A Advantages:
GCP ARM Instance Disk Support Comparison:
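The disk support described in this PR's commit message can be summarized as a small lookup. This is a sketch that encodes only the claims made in the PR (T2A: Persistent Disk and Hyperdisk; C4A and N4A: Hyperdisk only); the "hyperdisk" label is a placeholder for the Hyperdisk family, and the helper names are hypothetical.

```python
# Disk support per ARM machine family, as claimed in this PR's commit message.
# "hyperdisk" is a placeholder label for the Hyperdisk family as a whole.
SUPPORTED_DISKS = {
    "t2a": {"pd-standard", "pd-balanced", "pd-ssd", "hyperdisk"},
    "c4a": {"hyperdisk"},
    "n4a": {"hyperdisk"},
}


def machine_family(machine_type: str) -> str:
    """Extract the family prefix, e.g. 't2a' from 't2a-standard-4'."""
    return machine_type.split("-", 1)[0]


def can_attach(machine_type: str, disk_type: str) -> bool:
    """Return True if the PR's table says this disk type works on this machine type."""
    return disk_type in SUPPORTED_DISKS[machine_family(machine_type)]


for mt in ("t2a-standard-4", "c4a-standard-4"):
    print(mt, "pd-standard:", can_attach(mt, "pd-standard"))
```

Under these assumptions, a pd-standard PVC is attachable on t2a-standard-4 but not on c4a-standard-4, which matches the volume attachment error quoted in the commit message.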
Changes
ipi-install-heterogeneous Step
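The idea behind the random machineset selection can be sketched as follows. This is an illustration only, not the step's actual script; the machineset names and the `pick_machineset` helper are hypothetical.

```python
import random


def pick_machineset(machinesets, rng=random):
    """Pick one machineset uniformly at random instead of always taking the first."""
    if not machinesets:
        raise ValueError("no machinesets available")
    return rng.choice(machinesets)


# Hypothetical machineset names, one per zone.
machinesets = ["worker-a", "worker-b", "worker-c"]

# A seeded generator keeps this demonstration repeatable.
rng = random.Random(0)
picks = [pick_machineset(machinesets, rng) for _ in range(300)]
for name in machinesets:
    print(name, picks.count(name))
```

Always taking `machinesets[0]` would place every job's ARM workers in the same zone; a uniform random choice spreads them across all listed zones over many runs.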
Multi-arch Configs (4.17-5.0)
All Releases 4.17-5.0 (T2A processor with optimized sizing):
- COMPUTE_NODE_TYPE: t2a-standard-4 (unchanged, 4 vCPU)
- COMPUTE_NODE_REPLICAS: "2" (for GCP heterogeneous jobs, balanced layout)
- ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → t2a-standard-2 (optimized to 2 vCPU)
MIGRATION jobs:
- MIGRATION_CP_MACHINE_TYPE: t2a-standard-4 (unchanged)
- MIGRATION_INFRA_MACHINE_TYPE: t2a-standard-4 → t2a-standard-2 (4.20-5.0, optimized to 2 vCPU)
Job Schedules:
- interval: 168h (weekly interval, distributed execution)
Modified Files (27 total)
Benefits
Reduces capacity exhaustion by:
Full OpenShift compatibility:
Optimal resource allocation:
Improved scheduling efficiency:
Better cluster efficiency:
Future flexibility:
Related Issues