OCPBUGS-82060: Fix GCP ARM capacity exhaustion in multi-arch jobs by jianlinliu · Pull Request #77483 · openshift/release

jianlinliu · 2026-04-07T14:22:41Z

Summary

This PR addresses Component Readiness regression 37839, in which GCP multi-arch jobs were failing due to ARM instance capacity exhaustion. The fix combines zone randomization, instance sizing optimization, a balanced worker configuration, and schedule distribution.

Root Causes

GCP T2A ARM capacity exhaustion in the us-central1-a zone, causing machines to get stuck in the PROVISIONING state
ipi-install-heterogeneous always used the first machineset, concentrating load in a single zone
Larger instances have lower availability: standard-4 instances are more likely to hit capacity limits
Unbalanced worker configuration: heterogeneous jobs used 3 AMD64 + 2 ARM64 workers (5 total)
Schedule contention: 18 GCP ARM jobs were all scheduled at Sunday 11:00 UTC, creating heavy resource contention

Solutions

1. Zone Randomization

Random machineset selection in ipi-install-heterogeneous distributes ARM instances across zones to reduce capacity pressure.
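
The idea behind random machineset selection can be illustrated with a minimal sketch. The actual step is a shell script (ipi-install-heterogeneous-commands.sh) whose contents are not shown in this thread, so the Python below only illustrates the technique. The machineset names and the pick_machineset helper are hypothetical.

```python
import random


def pick_machineset(machinesets, rng=random):
    """Return (index, name) of a uniformly chosen worker machineset.

    Each machineset is assumed to live in a different zone, so a
    uniform choice spreads additional workers across zones.
    """
    if not machinesets:
        raise ValueError("no worker machinesets available")
    index = rng.randrange(len(machinesets))
    return index, machinesets[index]


# Hypothetical machineset names, one per zone.
machinesets = [
    "ci-cluster-worker-a",
    "ci-cluster-worker-b",
    "ci-cluster-worker-c",
]

index, name = pick_machineset(machinesets)
# Log the choice so a failed run can be traced back to its zone.
print(f"Selected machineset index {index}: {name}")
```

Logging the selected index mirrors the logging the step is described as adding, so a capacity failure can still be attributed to a specific zone.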

2. Instance Sizing Optimization

Use smaller instances (standard-2) for additional workers and migration infra to improve availability and reduce resource consumption.

3. Balanced Worker Configuration

Set COMPUTE_NODE_REPLICAS: "2" for GCP heterogeneous jobs to create balanced 2+2 worker layout instead of unbalanced 3+2 configuration.

Before:

3 AMD64 workers (default COMPUTE_NODE_REPLICAS: 3)
2 ARM64 additional workers (default ADDITIONAL_WORKERS: 2)
Total: 5 workers

After:

2 AMD64 workers (COMPUTE_NODE_REPLICAS: "2")
2 ARM64 additional workers (ADDITIONAL_WORKERS: 2)
Total: 4 workers

Benefits:

✅ More balanced heterogeneous cluster (2+2 vs 3+2)
✅ Reduced total resource consumption (4 workers vs 5)
✅ Faster cluster installation (one less worker to provision)
✅ Lower cluster overhead (less kubelet/container runtime load)
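
The before/after worker arithmetic above can be written out as a small sketch. The WorkerLayout class is a hypothetical helper for illustration only; the replica counts are the ones quoted in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerLayout:
    """Worker counts for a heterogeneous job (illustrative helper)."""

    amd64: int  # COMPUTE_NODE_REPLICAS
    arm64: int  # ADDITIONAL_WORKERS

    @property
    def total(self) -> int:
        return self.amd64 + self.arm64

    @property
    def balanced(self) -> bool:
        return self.amd64 == self.arm64


before = WorkerLayout(amd64=3, arm64=2)
after = WorkerLayout(amd64=2, arm64=2)

for label, layout in (("before", before), ("after", after)):
    print(f"{label}: {layout.amd64}+{layout.arm64} = {layout.total} workers, "
          f"balanced={layout.balanced}")
```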

4. Schedule Distribution

Change job scheduling from cron to interval-based (168h) for releases 4.19-5.0 to prevent simultaneous execution.

Before:

All 18 jobs ran at Sunday 11:00 UTC (cron: 0 11 * * 0)
3 jobs per release × 6 releases = 18 concurrent jobs
Massive resource contention spike every Sunday

After:

All releases 4.19-5.0: Changed to interval: 168h (weekly interval)
Jobs naturally distribute based on completion times
Prevents simultaneous resource contention

Benefits:

✅ Eliminates weekly resource contention spike
✅ Zone randomization more effective when jobs don't overlap
✅ Better capacity distribution throughout the week
✅ Reduced likelihood of hitting zone capacity limits
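
A rough sketch of the difference between the two schedules follows. It assumes that, under interval scheduling, each job's next run is its previous run plus 168h. This is an assumption about the scheduler, not something confirmed in this thread. The staggered last-run times are made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(hours=168)
JOB_COUNT = 18  # 3 jobs per release x 6 releases


def next_cron_start(now):
    """Next Sunday 11:00 UTC strictly after `now` (cron: 0 11 * * 0)."""
    candidate = now.replace(hour=11, minute=0, second=0, microsecond=0)
    # datetime.weekday(): Monday is 0, Sunday is 6.
    candidate += timedelta(days=(6 - candidate.weekday()) % 7)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def peak_simultaneous(starts):
    """Largest number of jobs sharing the exact same start time."""
    return max(Counter(starts).values())


now = datetime(2026, 4, 7, 14, 0, tzinfo=timezone.utc)

# Cron: every job resolves to the same instant.
cron_starts = [next_cron_start(now) for _ in range(JOB_COUNT)]

# Interval: next run = last run + 168h, so existing offsets carry forward.
# Hypothetical last-run times, spaced 9 hours apart.
last_starts = [now - timedelta(hours=9 * i) for i in range(JOB_COUNT)]
interval_starts = [start + INTERVAL for start in last_starts]

print("cron peak:", peak_simultaneous(cron_starts))
print("interval peak:", peak_simultaneous(interval_starts))
```

Under this model, the interval schedule preserves whatever offset already exists between jobs rather than forcing them back onto a shared wall-clock time.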

5. Why T2A (not C4A/N4A)?

C4A and N4A Compatibility Issue:

C4A and N4A only support Hyperdisk, NOT Persistent Disk
OpenShift monitoring, logging, and registry use pd-standard PVCs by default

Rehearsal testing with C4A failed with:

AttachVolume.Attach failed for volume "pvc-0ad4471f-65fe-40c6-8850-8768b0a91e07"
rpc error: code = InvalidArgument desc = Failed to Attach: failed cloud service
attach disk call: googleapi: Error 400: pd-standard disk type cannot be used by
c4a-standard-4 machine type., badRequest

T2A Advantages:

✅ Supports BOTH Persistent Disk (pd-standard, pd-balanced, pd-ssd) AND Hyperdisk
✅ Full OpenShift compatibility with default storage classes
✅ No breaking changes to monitoring, logging, or registry
✅ Proven reliability and stability

GCP ARM Instance Disk Support Comparison:

| Instance Type | Persistent Disk | Hyperdisk | OpenShift Compatible |
| --- | --- | --- | --- |
| T2A (Tau) | ✅ Yes (pd-standard, pd-balanced, pd-ssd) | ✅ Yes | ✅ Yes |
| C4A (Axion Compute) | ❌ No | ✅ Yes (hyperdisk-balanced, hyperdisk-extreme) | ❌ No |
| N4A (Axion General) | ❌ No | ✅ Yes (hyperdisk-balanced, hyperdisk-throughput) | ❌ No |
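
The comparison above can be expressed as a small pre-flight check. This sketch encodes the table exactly as stated in this description and has not been verified against GCP documentation. The can_attach and disk_family helpers are hypothetical.

```python
# Disk families per machine series, as listed in the table above.
SUPPORTED_DISK_FAMILIES = {
    "t2a": {"pd", "hyperdisk"},
    "c4a": {"hyperdisk"},
    "n4a": {"hyperdisk"},
}


def disk_family(disk_type):
    """Map a disk type such as 'pd-standard' to its family."""
    if disk_type.startswith("pd-"):
        return "pd"
    if disk_type.startswith("hyperdisk-"):
        return "hyperdisk"
    raise ValueError(f"unknown disk type: {disk_type}")


def can_attach(machine_type, disk_type):
    """True if the table says this machine series accepts this disk family."""
    series = machine_type.split("-", 1)[0]
    return disk_family(disk_type) in SUPPORTED_DISK_FAMILIES[series]


# The combination from the failed rehearsal, and the one this PR keeps.
print(can_attach("c4a-standard-4", "pd-standard"))
print(can_attach("t2a-standard-2", "pd-standard"))
```

A check of this kind would flag the c4a-standard-4 plus pd-standard combination before a PVC attach is attempted.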

Changes

ipi-install-heterogeneous Step

Random machineset selection for zone distribution
New ADDITIONAL_WORKER_DISK_TYPE parameter for GCP disk type configuration

Multi-arch Configs (4.17-5.0)

All Releases 4.17-5.0 (T2A processor with optimized sizing):

COMPUTE_NODE_TYPE: t2a-standard-4 (unchanged, 4 vCPU)
COMPUTE_NODE_REPLICAS: "2" (for GCP heterogeneous jobs, balanced layout)
ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → t2a-standard-2 (optimized to 2 vCPU)
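
As a worked example of the sizing change, the ARM vCPUs requested by the additional workers of one heterogeneous job can be computed as follows. It assumes the default of 2 additional workers and reads the vCPU count from the machine type suffix, which matches the 4 vCPU and 2 vCPU figures quoted above. The vcpus helper is hypothetical.

```python
ADDITIONAL_WORKERS = 2  # default quoted in this description


def vcpus(machine_type):
    """Return the vCPU count encoded in a name like 't2a-standard-4'."""
    _series, _family, size = machine_type.split("-")
    return int(size)


before = ADDITIONAL_WORKERS * vcpus("t2a-standard-4")
after = ADDITIONAL_WORKERS * vcpus("t2a-standard-2")

print(f"ARM vCPUs for additional workers per job: {before} -> {after}")
```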

MIGRATION jobs:

MIGRATION_CP_MACHINE_TYPE: t2a-standard-4 (unchanged)
MIGRATION_INFRA_MACHINE_TYPE: t2a-standard-4 → t2a-standard-2 (4.20-5.0, optimized to 2 vCPU)

Job Schedules:

4.19-5.0: Changed to interval: 168h (weekly interval, distributed execution)
Eliminates simultaneous execution and resource contention

Modified Files (27 total)

ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.17*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.18*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.19*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.20*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.21*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.22*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.23*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-5.0*.yaml (3 files)
ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh
ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-ref.yaml
ci-operator/jobs/openshift/multiarch/openshift-multiarch-main-periodics.yaml

Benefits

Reduces capacity exhaustion by:
- Spreading load across multiple GCP zones (random machineset selection)
- Using smaller instances (standard-2 vs standard-4) for additional workers and migration infra
- Smaller instances are less likely to hit zone capacity limits
- Balanced worker configuration reduces total resource consumption
- Distributed job execution prevents simultaneous resource spikes
Full OpenShift compatibility:
- T2A supports both Persistent Disk and Hyperdisk
- Works with default pd-standard storage class
- No changes required to monitoring, logging, or registry
Optimal resource allocation:
- Compute nodes use larger instances (t2a-standard-4, 4 vCPU)
- Additional workers use smaller instances (t2a-standard-2, 2 vCPU)
- Migration infra uses smaller instances (t2a-standard-2, 2 vCPU)
- GCP heterogeneous jobs: balanced 2+2 worker layout
Improved scheduling efficiency:
- Interval-based scheduling (168h) prevents simultaneous execution
- Jobs naturally distribute throughout the week
- Zone randomization more effective without overlapping demand
- Eliminates weekly resource contention spike
Better cluster efficiency:
- Faster cluster installation (4 workers vs 5)
- Lower cluster overhead
- More balanced heterogeneous cluster architecture
Future flexibility:
- ADDITIONAL_WORKER_DISK_TYPE parameter allows easy migration to Hyperdisk if needed
- If the OpenShift default storage class changes to hyperdisk-balanced, C4A/N4A can be reconsidered

Related Issues

Fixes Component Readiness regression: https://sippy-auth.dptools.openshift.org/sippy-ng/component_readiness/regressions/37839
JIRA Bug: https://issues.redhat.com/browse/OCPBUGS-82060

openshift-ci-robot · 2026-04-07T14:30:06Z

@jianlinliu: This pull request references Jira Issue OCPBUGS-82060, which is invalid:

expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Summary

This PR addresses Component Readiness regression 37839 where GCP multi-arch jobs were failing with "verify all machines should be in Running state" due to T2A ARM instance capacity exhaustion.

The test was failing across multiple GCP multi-arch variants (4.22, 4.21, and ARM64/multi architectures) with machines getting stuck in PROVISIONING state and then disappearing with "Instance not found on provider" errors. Root cause analysis showed GCP T2A ARM capacity exhaustion in us-central1-a zone.

Changes

1. Randomize zone selection in heterogeneous install step

Modified ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh to randomly select a worker machineset instead of always using the first one

Each machineset is in a different GCP zone, distributing load across zones to avoid zone-specific capacity issues

Added logging to show which machineset index was selected

2. Distribute GCP ARM instance types across releases

Updated all nightly config files for multiarch jobs to use different ARM instance types per release:

Release 4.21: t2a-standard-4 → t2a-standard-2 (smaller for better availability, known Tau processor)

Release 4.22: t2a-standard-4 → c4a-standard-4 (newer Axion generation, addresses regression 37839)

Release 4.23: t2a-standard-4 → c4a-standard-2 (newest Axion + smallest for best availability)

Release 5.0: t2a-standard-4 → c4a-standard-2 (newest Axion + smallest for best availability)

Older releases (≤4.20): unchanged (t2a-standard-4)

Modified files:
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.21*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.22*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.23*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-5.0*.yaml (3 files)

Benefits

Reduces capacity exhaustion by spreading load across:

Multiple GCP zones (via random machineset selection)

Multiple ARM instance types (t2a-standard-2, t2a-standard-4, c4a-standard-2, c4a-standard-4)

Improved availability:

C4A (Axion) instances generally have better availability than T2A (Tau)

Smaller instance types (standard-2) are less likely to hit capacity limits than standard-4

Cost savings: Using smaller instances where appropriate (standard-2 vs standard-4)

Forward compatibility: Newer releases use newer Axion processors with better availability

Test Plan

Monitor multi-arch GCP jobs across all affected releases (4.21, 4.22, 4.23, 5.0)

Verify heterogeneous worker provisioning succeeds across different zone selections

Watch for reduced "verify all machines should be in Running state" failures

Related Issues

Fixes Component Readiness regression: https://sippy-auth.dptools.openshift.org/sippy-ng/component_readiness/regressions/37839

🤖 Generated with Claude Code

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2026-04-07T15:03:00Z

@jianlinliu: This pull request references Jira Issue OCPBUGS-82060, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

Summary

This PR addresses Component Readiness regression 37839 where GCP multi-arch jobs were failing with "verify all machines should be in Running state" due to T2A ARM instance capacity exhaustion.

The test was failing across multiple GCP multi-arch variants (4.22, 4.21, and ARM64/multi architectures) with machines getting stuck in PROVISIONING state and then disappearing with "Instance not found on provider" errors. Root cause analysis showed GCP T2A ARM capacity exhaustion in us-central1-a zone.

Changes

1. Randomize zone selection in heterogeneous install step

Modified ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh to randomly select a worker machineset instead of always using the first one

Each machineset is in a different GCP zone, distributing load across zones to avoid zone-specific capacity issues

Added logging to show which machineset index was selected

2. Distribute GCP ARM instance types across releases

Strategy: Use bigger instances (standard-4) for COMPUTE_NODE_TYPE and smaller instances (standard-2) for ADDITIONAL_WORKER_VM_TYPE to reduce capacity pressure while maintaining sufficient resources.

Release 4.21 (Tau processor):

COMPUTE_NODE_TYPE: t2a-standard-4 (unchanged, 4 vCPU)

ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → t2a-standard-2 (2 vCPU, smaller)

MIGRATION_CP/INFRA_MACHINE_TYPE: t2a-standard-4 (unchanged, 4 vCPU)

Release 4.22 (newer Axion processor):

COMPUTE_NODE_TYPE: t2a-standard-4 → c4a-standard-4 (4 vCPU, newer generation)

ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → c4a-standard-2 (2 vCPU, smaller)

MIGRATION_CP/INFRA_MACHINE_TYPE: t2a-standard-4 → c4a-standard-4 (4 vCPU)

Release 4.23 (newest Axion):

COMPUTE_NODE_TYPE: t2a-standard-4 → c4a-standard-4 (4 vCPU)

ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → c4a-standard-2 (2 vCPU, smaller)

MIGRATION_CP/INFRA_MACHINE_TYPE: t2a-standard-4 → c4a-standard-4 (4 vCPU)

Release 5.0 (same as 4.23):

COMPUTE_NODE_TYPE: t2a-standard-4 → c4a-standard-4 (4 vCPU)

ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → c4a-standard-2 (2 vCPU, smaller)

MIGRATION_CP/INFRA_MACHINE_TYPE: t2a-standard-4 → c4a-standard-4 (4 vCPU)

Older releases (≤4.20): unchanged (t2a-standard-4)

Modified files:
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.21*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.22*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.23*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-5.0*.yaml (3 files)
ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh (1 file)

Benefits

Reduces capacity exhaustion by:

Spreading load across multiple GCP zones (via random machineset selection)

Using smaller instances for additional heterogeneous workers

Distributing across multiple ARM instance types (T2A Tau and C4A Axion)

Improved availability:

C4A (Axion) instances generally have better availability than T2A (Tau)

Smaller instance types (standard-2) are less likely to hit capacity limits than standard-4

Additional workers sized appropriately to reduce capacity pressure

Optimal resource allocation:

Compute nodes maintain full resources (standard-4, 4 vCPU)

Additional heterogeneous workers use smaller instances (standard-2, 2 vCPU) sufficient for testing

Cost optimization:

Smaller instances for additional workers reduce costs

Progressive migration to newer Axion processors

Forward compatibility:

Newer releases (4.22, 4.23, 5.0) use newest Axion processors with best availability

Test Plan

Monitor multi-arch GCP jobs across all affected releases (4.21, 4.22, 4.23, 5.0)

Verify heterogeneous worker provisioning succeeds across different zone selections and instance types

Watch for reduced "verify all machines should be in Running state" failures

Confirm jobs complete successfully with the new instance type distribution

Related Issues

Fixes Component Readiness regression: https://sippy-auth.dptools.openshift.org/sippy-ng/component_readiness/regressions/37839

JIRA Bug filed: https://issues.redhat.com/browse/OCPBUGS-82060

openshift-ci-robot · 2026-04-07T15:05:31Z

@jianlinliu: This pull request references Jira Issue OCPBUGS-82060, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

Summary

This PR addresses Component Readiness regression 37839 where GCP multi-arch jobs were failing with "verify all machines should be in Running state" due to T2A ARM instance capacity exhaustion.

The test was failing across multiple GCP multi-arch variants (4.22, 4.21, and ARM64/multi architectures) with machines getting stuck in PROVISIONING state and then disappearing with "Instance not found on provider" errors. Root cause analysis showed GCP T2A ARM capacity exhaustion in us-central1-a zone.

Changes

1. Randomize zone selection in heterogeneous install step

Modified ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh to randomly select a worker machineset instead of always using the first one

Each machineset is in a different GCP zone, distributing load across zones to avoid zone-specific capacity issues

Added logging to show which machineset index was selected

2. Distribute GCP ARM instance types across releases

Strategy: Use bigger instances (standard-4) for COMPUTE_NODE_TYPE and smaller instances (standard-2) for ADDITIONAL_WORKER_VM_TYPE to reduce capacity pressure while maintaining sufficient resources.

Release 4.21 (Tau processor):

COMPUTE_NODE_TYPE: t2a-standard-4 (unchanged, 4 vCPU, Tau)

ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → t2a-standard-2 (2 vCPU, Tau, smaller)

MIGRATION_CP/INFRA_MACHINE_TYPE: t2a-standard-4 (unchanged, 4 vCPU, Tau)

Release 4.22 (Axion processor):

COMPUTE_NODE_TYPE: t2a-standard-4 → c4a-standard-4 (4 vCPU, Axion)

ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → c4a-standard-2 (2 vCPU, Axion, smaller)

MIGRATION_CP/INFRA_MACHINE_TYPE: t2a-standard-4 → c4a-standard-4 (4 vCPU, Axion)

Release 4.23 (Axion processor):

COMPUTE_NODE_TYPE: t2a-standard-4 → c4a-standard-4 (4 vCPU, Axion)

ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → c4a-standard-2 (2 vCPU, Axion, smaller)

MIGRATION_CP/INFRA_MACHINE_TYPE: t2a-standard-4 → c4a-standard-4 (4 vCPU, Axion)

Release 5.0 (Axion processor):

COMPUTE_NODE_TYPE: t2a-standard-4 → c4a-standard-4 (4 vCPU, Axion)

ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → c4a-standard-2 (2 vCPU, Axion, smaller)

MIGRATION_CP/INFRA_MACHINE_TYPE: t2a-standard-4 → c4a-standard-4 (4 vCPU, Axion)

Older releases (≤4.20): Tau processor (t2a-standard-4, unchanged)

Modified files:
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.21*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.22*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.23*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-5.0*.yaml (3 files)
ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh (1 file)

Processor Distribution

Tau (T2A): Releases 4.21 and older (≤4.21)

Axion (C4A): Releases 4.22, 4.23, 5.0

Benefits

Reduces capacity exhaustion by:

Spreading load across multiple GCP zones (via random machineset selection)

Using smaller instances for additional heterogeneous workers

Distributing across multiple ARM instance types (T2A Tau and C4A Axion)

Improved availability:

C4A (Axion) instances generally have better availability than T2A (Tau)

Smaller instance types (standard-2) are less likely to hit capacity limits than standard-4

Additional workers sized appropriately to reduce capacity pressure

Optimal resource allocation:

Compute nodes maintain full resources (standard-4, 4 vCPU)

Additional heterogeneous workers use smaller instances (standard-2, 2 vCPU) sufficient for testing

Cost optimization:

Smaller instances for additional workers reduce costs

Progressive migration to newer Axion processors

Forward compatibility:

Newer releases (4.22, 4.23, 5.0) use newest Axion processors with best availability

Test Plan

Monitor multi-arch GCP jobs across all affected releases (4.21, 4.22, 4.23, 5.0)

Verify heterogeneous worker provisioning succeeds across different zone selections and instance types

Watch for reduced "verify all machines should be in Running state" failures

Confirm jobs complete successfully with the new instance type distribution

Related Issues

Fixes Component Readiness regression: https://sippy-auth.dptools.openshift.org/sippy-ng/component_readiness/regressions/37839

JIRA Bug filed: https://issues.redhat.com/browse/OCPBUGS-82060

jianlinliu · 2026-04-07T15:25:15Z

/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-upgrade-gcp-ovn-multi-x-ax periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-gcp-ovn-multi-x-x-to-a-x

openshift-ci-robot · 2026-04-07T15:25:18Z

@jianlinliu: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

jianlinliu · 2026-04-08T06:40:42Z

/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-upgrade-gcp-ovn-multi-x-ax periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-gcp-ovn-multi-x-x-to-a-x periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a periodic-ci-openshift-multiarch-main-nightly-4.17-upgrade-from-nightly-4.16-ocp-e2e-upgrade-gcp-ovn-multi-x-ax

openshift-ci-robot · 2026-04-08T06:40:46Z

@jianlinliu: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci-robot · 2026-04-08T07:21:13Z

@jianlinliu: This pull request references Jira Issue OCPBUGS-82060, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

Summary

This PR addresses Component Readiness regression 37839 where GCP multi-arch jobs were failing due to ARM instance capacity exhaustion and C4A disk type compatibility issues.

Root Causes

GCP T2A ARM capacity exhaustion in us-central1-a zone causing machines to get stuck in PROVISIONING state

C4A instances incompatible with pd-ssd disk type (require hyperdisk-balanced)

ipi-install-heterogeneous always used first machineset, concentrating load in single zone

Solutions

1. Zone Randomization

Random machineset selection in ipi-install-heterogeneous distributes ARM instances across zones to reduce capacity pressure.

2. Instance Type Distribution

Spread load across processor families:

Releases 4.17-4.21: T2A (Tau) processor

Releases 4.22-5.0: C4A (Axion) processor

3. Disk Type Compatibility

New parameters ensure C4A instances use hyperdisk-balanced:

COMPUTE_DISK_TYPE: Applied to main workers via install-config.yaml

ADDITIONAL_WORKER_DISK_TYPE: Applied to heterogeneous additional workers via machineset

Changes

ipi-install-heterogeneous Step

Random machineset selection for zone distribution

New ADDITIONAL_WORKER_DISK_TYPE parameter for GCP disk type configuration

ipi-conf-gcp Chain

Include ipi-conf-gcp-osdisk-disktype step

Remove duplicate refs from chains that already include ipi-conf-gcp

Multi-arch Configs (4.17-5.0)

Releases 4.17-4.21 (Tau processor, smaller additional workers):

COMPUTE_NODE_TYPE: t2a-standard-4 (unchanged)

ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → t2a-standard-2 (smaller)

Releases 4.22-5.0 (Axion processor, smaller additional workers):

COMPUTE_NODE_TYPE: t2a-standard-4 → c4a-standard-4

ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → c4a-standard-2 (smaller)

COMPUTE_DISK_TYPE: hyperdisk-balanced (for main workers)

ADDITIONAL_WORKER_DISK_TYPE: hyperdisk-balanced (for heterogeneous workers)

MIGRATION jobs (unchanged):

MIGRATION_CP_MACHINE_TYPE: t2a-standard-4 (unchanged)

MIGRATION_INFRA_MACHINE_TYPE: t2a-standard-4 (unchanged)

Modified Files (32 total)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.17*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.18*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.19*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.20*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.21*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.22*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.23*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-5.0*.yaml (3 files)
ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh
ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-ref.yaml
ci-operator/step-registry/ipi/conf/gcp/ipi-conf-gcp-chain.yaml
ci-operator/step-registry/cucushift/installer/rehearse/gcp/ipi/*-provision-chain.yaml (4 files)
ci-operator/step-registry/openshift/e2e/gcp/csi/custom-worker/openshift-e2e-gcp-csi-custom-worker-workflow.yaml

Benefits

Reduces capacity exhaustion by:

Spreading load across multiple GCP zones (random machineset selection)

Using smaller instances for additional heterogeneous workers

Distributing across T2A (4.17-4.21) and C4A (4.22-5.0) processor families

Fixes C4A compatibility:

Ensures C4A instances use hyperdisk-balanced disk type

Prevents pd-ssd incompatibility errors

Optimal resource allocation:

Compute nodes use larger instances (standard-4, 4 vCPU)

Additional workers use smaller instances (standard-2, 2 vCPU)

Better availability:

C4A (Axion) instances have better availability than T2A

Smaller instance types are less likely to hit capacity limits

Related Issues

Fixes Component Readiness regression: https://sippy-auth.dptools.openshift.org/sippy-ng/component_readiness/regressions/37839

JIRA Bug: https://issues.redhat.com/browse/OCPBUGS-82060

jianlinliu · 2026-04-08T09:22:22Z

/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a

openshift-ci-robot · 2026-04-08T09:22:26Z

@jianlinliu: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci-robot · 2026-04-08T12:15:15Z

@jianlinliu: This pull request references Jira Issue OCPBUGS-82060, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

Summary

This PR addresses Component Readiness regression 37839 where GCP multi-arch jobs were failing due to ARM instance capacity exhaustion and C4A disk type compatibility issues.

Root Causes

GCP T2A ARM capacity exhaustion in us-central1-a zone causing machines to get stuck in PROVISIONING state

C4A instances incompatible with pd-ssd disk type (require hyperdisk-balanced)

ipi-install-heterogeneous always used first machineset, concentrating load in single zone

Solutions

1. Zone Randomization

Random machineset selection in ipi-install-heterogeneous distributes ARM instances across zones to reduce capacity pressure.

2. Instance Type Migration

Migrate all releases (4.17-5.0) to C4A (Axion) processor to spread load across processor families and improve availability.

3. Disk Type Compatibility

New parameters ensure C4A instances use hyperdisk-balanced:

COMPUTE_DISK_TYPE: Applied to main workers via install-config.yaml

ADDITIONAL_WORKER_DISK_TYPE: Applied to heterogeneous additional workers via machineset

4. Instance Sizing Optimization

Use smaller instances for additional workers and migration infra to reduce resource consumption and improve availability.

Changes

ipi-install-heterogeneous Step

Random machineset selection for zone distribution

New ADDITIONAL_WORKER_DISK_TYPE parameter for GCP disk type configuration

ipi-conf-gcp Chain

Include ipi-conf-gcp-osdisk-disktype step

Remove duplicate refs from chains that already include ipi-conf-gcp

Multi-arch Configs (4.17-5.0)

All Releases 4.17-5.0 (Axion processor with optimized sizing):

COMPUTE_NODE_TYPE: t2a-standard-4 → c4a-standard-4

ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → c4a-standard-2 (smaller instance)

COMPUTE_DISK_TYPE: hyperdisk-balanced (for main workers)

ADDITIONAL_WORKER_DISK_TYPE: hyperdisk-balanced (for heterogeneous workers)

MIGRATION jobs:

MIGRATION_CP_MACHINE_TYPE: t2a-standard-4 (unchanged)

MIGRATION_INFRA_MACHINE_TYPE: t2a-standard-4 → t2a-standard-2 (4.20-5.0, optimized)

Modified Files (32 total)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.17*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.18*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.19*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.20*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.21*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.22*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.23*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-5.0*.yaml (3 files)
ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh
ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-ref.yaml
ci-operator/step-registry/ipi/conf/gcp/ipi-conf-gcp-chain.yaml
ci-operator/step-registry/cucushift/installer/rehearse/gcp/ipi/*-provision-chain.yaml (4 files)
ci-operator/step-registry/openshift/e2e/gcp/csi/custom-worker/openshift-e2e-gcp-csi-custom-worker-workflow.yaml

Benefits

Reduces capacity exhaustion by:

Spreading load across multiple GCP zones (random machineset selection)

Using smaller instances for additional heterogeneous workers and migration infra

Migrating all releases to C4A (Axion) processor

Fixes C4A compatibility:

Ensures C4A instances use hyperdisk-balanced disk type

Prevents pd-ssd incompatibility errors

Optimal resource allocation:

Compute nodes use larger instances (standard-4, 4 vCPU)

Additional workers use smaller instances (standard-2, 2 vCPU)

Migration infra uses smaller instances (standard-2, 2 vCPU)

Better availability:

C4A (Axion) instances have better availability than T2A

Smaller instance types are less likely to hit capacity limits

Unified processor family (C4A) simplifies capacity planning

Related Issues

Fixes Component Readiness regression: https://sippy-auth.dptools.openshift.org/sippy-ng/component_readiness/regressions/37839

JIRA Bug: https://issues.redhat.com/browse/OCPBUGS-82060

jianlinliu · 2026-04-08T14:10:54Z

/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-gcp-ovn-multi-x-x-to-a-x periodic-ci-openshift-multiarch-main-nightly-4.17-upgrade-from-stable-4.16-ocp-e2e-upgrade-gcp-ovn-multi-a-a periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a

openshift-ci-robot · 2026-04-08T14:11:13Z

@jianlinliu: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci-robot · 2026-04-08T14:14:30Z

@jianlinliu: job(s): periodic-ci-openshift-multiarch-main-nightly-4.17-upgrade-from-stable-4.16-ocp-e2e-upgrade-gcp-ovn-multi-a-a, periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a either don't exist or were not found to be affected, and cannot be rehearsed

jianlinliu · 2026-04-08T14:25:22Z

/pj-rehearse abort

openshift-ci-robot · 2026-04-08T14:25:25Z

@jianlinliu: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci-robot · 2026-04-08T14:28:48Z

@jianlinliu: This pull request references Jira Issue OCPBUGS-82060, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Details

In response to this:

Summary

This PR addresses Component Readiness regression 37839 where GCP multi-arch jobs were failing due to ARM instance capacity exhaustion through zone randomization, instance sizing optimization, balanced worker configuration, and schedule distribution.

Root Causes

GCP T2A ARM capacity exhaustion in us-central1-a zone causing machines to get stuck in PROVISIONING state

ipi-install-heterogeneous always used first machineset, concentrating load in single zone

Larger instances have lower availability - standard-4 instances more likely to hit capacity limits

Unbalanced worker configuration - heterogeneous jobs used 3 AMD64 + 2 ARM64 workers (5 total)

Schedule contention - 18 GCP ARM jobs all scheduled at Sunday 11:00 UTC, creating massive resource contention

Solutions

1. Zone Randomization

Random machineset selection in ipi-install-heterogeneous distributes ARM instances across zones to reduce capacity pressure.
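
As a rough illustration of the idea (not the step's actual shell code), random selection among per-zone machinesets could look like the following Python sketch. The machineset names and the `pick_machineset` helper are hypothetical and exist only for this example.

```python
import random

# Hypothetical per-zone worker machinesets; real names come from the cluster.
MACHINESETS = [
    "ci-cluster-worker-a",
    "ci-cluster-worker-b",
    "ci-cluster-worker-c",
    "ci-cluster-worker-f",
]


def pick_machineset(machinesets, rng=random):
    """Return one machineset chosen uniformly at random.

    Always taking machinesets[0] pins every job to the same zone;
    a uniform choice spreads the additional ARM workers across zones.
    """
    if not machinesets:
        raise ValueError("no machinesets available")
    return rng.choice(machinesets)


if __name__ == "__main__":
    print(pick_machineset(MACHINESETS))
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which may help when debugging a specific run.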

2. Instance Sizing Optimization

Use smaller instances (standard-2) for additional workers and migration infra to improve availability and reduce resource consumption.

3. Balanced Worker Configuration

Set COMPUTE_NODE_REPLICAS: "2" for GCP heterogeneous jobs to create balanced 2+2 worker layout instead of unbalanced 3+2 configuration.

Before:

3 AMD64 workers (default COMPUTE_NODE_REPLICAS: 3)

2 ARM64 additional workers (default ADDITIONAL_WORKERS: 2)

Total: 5 workers

After:

2 AMD64 workers (COMPUTE_NODE_REPLICAS: "2")

2 ARM64 additional workers (ADDITIONAL_WORKERS: 2)

Total: 4 workers

Benefits:

✅ More balanced heterogeneous cluster (2+2 vs 3+2)

✅ Reduced total resource consumption (4 workers vs 5)

✅ Faster cluster installation (one less worker to provision)

✅ Lower cluster overhead (less kubelet/container runtime load)
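
The before/after worker counts can be checked with a small Python sketch. It assumes only the defaults and the override quoted above; the dictionary layout and the `worker_layout` helper are illustrative, not part of the CI configuration.

```python
# Defaults as quoted in this PR description; the dict layout is illustrative.
DEFAULT_ENV = {"COMPUTE_NODE_REPLICAS": "3", "ADDITIONAL_WORKERS": "2"}


def worker_layout(env_overrides=None):
    """Return (amd64_workers, arm64_workers, total) for a heterogeneous job."""
    env = {**DEFAULT_ENV, **(env_overrides or {})}
    amd64 = int(env["COMPUTE_NODE_REPLICAS"])
    arm64 = int(env["ADDITIONAL_WORKERS"])
    return amd64, arm64, amd64 + arm64


if __name__ == "__main__":
    print("before:", worker_layout())
    print("after: ", worker_layout({"COMPUTE_NODE_REPLICAS": "2"}))
```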

4. Schedule Distribution

Change job scheduling from cron to interval-based (168h) for releases 4.20-5.0 to prevent simultaneous execution.

Before (Releases 4.19-5.0):

All 18 jobs ran at Sunday 11:00 UTC (cron: 0 11 * * 0)

3 jobs per release × 6 releases = 18 concurrent jobs

Massive resource contention spike every Sunday

After:

4.19: Keeps cron schedule (Sunday 11:00 UTC)

4.20-5.0: Changed to interval: 168h (weekly interval)

Jobs naturally distribute based on completion times

Prevents simultaneous resource contention

Benefits:

✅ Eliminates weekly resource contention spike

✅ Zone randomization more effective when jobs don't overlap

✅ Better capacity distribution throughout the week

✅ Reduced likelihood of hitting zone capacity limits
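
The difference between the two schedule types can be sketched in Python. This is a simplified model, assuming that an interval job is next triggered a fixed 168h after its own previous run; the job names and previous start times are made up for the example.

```python
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(hours=168)


def next_cron_run(now):
    """Next Sunday 11:00 UTC strictly after `now` (cron: 0 11 * * 0)."""
    days_ahead = (6 - now.weekday()) % 7  # Monday is 0, Sunday is 6
    candidate = now.replace(hour=11, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=days_ahead)
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


def next_interval_run(last_run):
    """Next run under the assumed interval model: previous run plus 168h."""
    return last_run + INTERVAL


# Hypothetical previous start times for three jobs.
LAST_RUNS = {
    "job-a": datetime(2026, 4, 5, 11, 0, tzinfo=timezone.utc),
    "job-b": datetime(2026, 4, 6, 3, 30, tzinfo=timezone.utc),
    "job-c": datetime(2026, 4, 7, 19, 45, tzinfo=timezone.utc),
}

if __name__ == "__main__":
    now = datetime(2026, 4, 8, 14, 0, tzinfo=timezone.utc)
    for name, last in LAST_RUNS.items():
        print(name, "cron:", next_cron_run(now), "interval:", next_interval_run(last))
```

Under the cron schedule every job shares the same next start time, while under the interval model each job keeps whatever offset its previous run had.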

5. Why T2A (not C4A/N4A)?

C4A and N4A Compatibility Issue:

C4A and N4A only support Hyperdisk, NOT Persistent Disk

OpenShift monitoring, logging, and registry use pd-standard PVCs by default

Rehearse testing with C4A failed with:

AttachVolume.Attach failed for volume "pvc-0ad4471f-65fe-40c6-8850-8768b0a91e07"
rpc error: code = InvalidArgument desc = Failed to Attach: failed cloud service
attach disk call: googleapi: Error 400: pd-standard disk type cannot be used by
c4a-standard-4 machine type., badRequest

T2A Advantages:

✅ Supports BOTH Persistent Disk (pd-standard, pd-balanced, pd-ssd) AND Hyperdisk

✅ Full OpenShift compatibility with default storage classes

✅ No breaking changes to monitoring, logging, or registry

✅ Proven reliability and stability

GCP ARM Instance Disk Support Comparison:

Instance Type	Persistent Disk	Hyperdisk	OpenShift Compatible
T2A (Tau)	✅ Yes (pd-standard, pd-balanced, pd-ssd)	✅ Yes	✅ Yes
C4A (Axion Compute)	❌ No	✅ Yes (hyperdisk-balanced, hyperdisk-extreme)	❌ No
N4A (Axion General)	❌ No	✅ Yes (hyperdisk-balanced, hyperdisk-throughput)	❌ No
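
A pre-flight check along the following lines could catch the attach failure quoted above before a job is launched. This is a minimal Python sketch that encodes only the Persistent Disk column of the comparison as stated in this PR (it is not re-verified against GCP documentation here); the helper names are hypothetical.

```python
# Persistent Disk support per machine family, as stated in this PR's table.
PERSISTENT_DISK_SUPPORT = {"t2a": True, "c4a": False, "n4a": False}


def machine_family(machine_type):
    """Extract the family prefix, e.g. 'c4a' from 'c4a-standard-4'."""
    return machine_type.split("-", 1)[0].lower()


def pd_attach_allowed(machine_type, disk_type):
    """Return True if a pd-* disk can be attached to the machine type.

    Only Persistent Disk types are modeled; other disk types are rejected
    rather than guessed at.
    """
    if not disk_type.startswith("pd-"):
        raise ValueError("only Persistent Disk (pd-*) types are modeled")
    family = machine_family(machine_type)
    if family not in PERSISTENT_DISK_SUPPORT:
        raise KeyError(f"unknown machine family: {family}")
    return PERSISTENT_DISK_SUPPORT[family]


if __name__ == "__main__":
    for mt in ("t2a-standard-2", "c4a-standard-4"):
        print(mt, "pd-standard:", pd_attach_allowed(mt, "pd-standard"))
```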

Changes

ipi-install-heterogeneous Step

Random machineset selection for zone distribution

New ADDITIONAL_WORKER_DISK_TYPE parameter for GCP disk type configuration

Multi-arch Configs (4.17-5.0)

All Releases 4.17-5.0 (T2A processor with optimized sizing):

COMPUTE_NODE_TYPE: t2a-standard-4 (unchanged, 4 vCPU)

COMPUTE_NODE_REPLICAS: "2" (for GCP heterogeneous jobs, balanced layout)

ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → t2a-standard-2 (optimized to 2 vCPU)

MIGRATION jobs:

MIGRATION_CP_MACHINE_TYPE: t2a-standard-4 (unchanged)

MIGRATION_INFRA_MACHINE_TYPE: t2a-standard-4 → t2a-standard-2 (4.20-5.0, optimized to 2 vCPU)

Job Schedules:

4.19: cron: 0 11 * * 0 (Sunday 11:00 UTC, unchanged)

4.20-5.0: interval: 168h (weekly interval, distributed execution)

Modified Files (26 total)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.17*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.18*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.19*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.20*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.21*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.22*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-4.23*.yaml (3 files)
ci-operator/config/openshift/multiarch/openshift-multiarch-main__nightly-5.0*.yaml (3 files)
ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-commands.sh
ci-operator/step-registry/ipi/install/heterogeneous/ipi-install-heterogeneous-ref.yaml

Benefits

Reduces capacity exhaustion by:

Spreading load across multiple GCP zones (random machineset selection)

Using smaller instances (standard-2 vs standard-4) for additional workers and migration infra

Smaller instances are generally easier to obtain when a zone is short on capacity

Balanced worker configuration reduces total resource consumption

Distributed job execution prevents simultaneous resource spikes

Full OpenShift compatibility:

T2A supports both Persistent Disk and Hyperdisk

Works with default pd-standard storage class

No changes required to monitoring, logging, or registry

Optimal resource allocation:

Compute nodes use larger instances (t2a-standard-4, 4 vCPU)

Additional workers use smaller instances (t2a-standard-2, 2 vCPU)

Migration infra uses smaller instances (t2a-standard-2, 2 vCPU)

GCP heterogeneous jobs: balanced 2+2 worker layout

Improved scheduling efficiency:

Interval-based scheduling (168h) prevents simultaneous execution

Jobs naturally distribute throughout the week

Zone randomization more effective without overlapping demand

Eliminates weekly resource contention spike

Better cluster efficiency:

Faster cluster installation (4 workers vs 5)

Lower cluster overhead

More balanced heterogeneous cluster architecture

Future flexibility:

ADDITIONAL_WORKER_DISK_TYPE parameter allows easy migration to Hyperdisk if needed

If OpenShift defaults change to hyperdisk-balanced, can reconsider C4A/N4A

Related Issues

Fixes Component Readiness regression: https://sippy-auth.dptools.openshift.org/sippy-ng/component_readiness/regressions/37839

JIRA Bug: https://issues.redhat.com/browse/OCPBUGS-82060


This change addresses GCP ARM instance capacity issues in multi-arch CI jobs through zone randomization, instance sizing optimization, balanced worker configuration, and schedule distribution.

Root Cause:
- GCP T2A ARM instances experiencing capacity exhaustion in us-central1-a zone
- ipi-install-heterogeneous always used first machineset, concentrating load
- Larger instances (standard-4) have lower availability than smaller instances
- Unbalanced worker configuration (3 AMD64 + 2 ARM64) in heterogeneous jobs
- All GCP ARM jobs scheduled at same time (Sunday 11:00), creating resource contention

Solutions:
1. Zone randomization: Random machineset selection distributes ARM instances across zones to reduce capacity pressure
2. Instance sizing optimization: Use smaller instances (t2a-standard-2) for additional workers and migration infra to improve availability
3. Balanced worker configuration: Set COMPUTE_NODE_REPLICAS to 2 for GCP heterogeneous jobs to create balanced 2+2 worker layout (2 AMD64 + 2 ARM64)
4. Schedule distribution: Change to interval-based scheduling (168h) for releases 4.19-5.0 to prevent all jobs from running simultaneously
5. Disk type compatibility: Add ADDITIONAL_WORKER_DISK_TYPE parameter for heterogeneous workers

Why T2A (not C4A/N4A):
- C4A and N4A only support Hyperdisk, NOT Persistent Disk
- OpenShift monitoring/logging/registry use pd-standard PVCs by default
- Using C4A/N4A causes: "pd-standard disk type cannot be used by c4a-standard-4 machine type" errors during pod volume attachment
- T2A supports BOTH Persistent Disk (pd-standard, pd-balanced, pd-ssd) AND Hyperdisk, ensuring full OpenShift compatibility

Changes:
- ipi-install-heterogeneous: Random machineset selection for zone distribution
- New ADDITIONAL_WORKER_DISK_TYPE parameter for heterogeneous additional workers
- Update GCP multi-arch configs (4.17-5.0):
  * COMPUTE_NODE_TYPE: t2a-standard-4 (unchanged, larger instance)
  * ADDITIONAL_WORKER_VM_TYPE: t2a-standard-4 → t2a-standard-2 (smaller)
  * COMPUTE_NODE_REPLICAS: "2" (for GCP heterogeneous jobs, balanced layout)
  * MIGRATION_CP_MACHINE_TYPE: t2a-standard-4 (unchanged)
  * MIGRATION_INFRA_MACHINE_TYPE: t2a-standard-4 → t2a-standard-2 (4.20-5.0)
- Update job schedules (4.19-5.0):
  * Changed from cron (Sunday 11:00) to interval: 168h
  * Prevents 18+ jobs from running simultaneously
  * Jobs will naturally distribute based on completion times

Instance Type Configuration:
- All releases (4.17-5.0) use T2A (Tau) processor
- COMPUTE_NODE_TYPE: t2a-standard-4 (4 vCPU, larger instance)
- ADDITIONAL_WORKER_VM_TYPE: t2a-standard-2 (2 vCPU, smaller instance)
- MIGRATION_CP_MACHINE_TYPE: t2a-standard-4 (unchanged)
- MIGRATION_INFRA_MACHINE_TYPE: t2a-standard-2 (2 vCPU, optimized for 4.20-5.0)

Worker Configuration for GCP Heterogeneous Jobs:
- Before: 3 AMD64 workers + 2 ARM64 additional workers = 5 total
- After: 2 AMD64 workers + 2 ARM64 additional workers = 4 total
- Benefits: Balanced configuration, reduced resources, faster installation

Schedule Configuration:
- 4.17-4.18: No GCP heterogeneous jobs or keep existing schedules
- 4.19-5.0: interval: 168h (weekly, distributed execution)
- Prevents resource contention from simultaneous job execution

Disk Type Configuration:
- T2A supports both Persistent Disk and Hyperdisk
- No explicit disk type parameters needed (defaults to pd-standard)
- ADDITIONAL_WORKER_DISK_TYPE parameter added to ref for future flexibility

Benefits:
- Reduces capacity exhaustion through zone randomization
- Smaller instances (standard-2) have exponentially better availability
- Balanced worker layout (2+2) more efficient than unbalanced (3+2)
- Reduced total resource consumption (4 workers vs 5)
- Distributed job execution prevents resource contention
- Full OpenShift compatibility with default storage (pd-standard)
- Lower resource consumption for additional workers and migration infra

Related: OCPBUGS-82060

openshift-ci · 2026-04-08T14:42:59Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jianlinliu
Once this PR has been reviewed and has the lgtm label, please assign tvardema for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

ci-operator/config/openshift/multiarch/OWNERS
ci-operator/jobs/openshift/multiarch/OWNERS
~~ci-operator/step-registry/ipi/install/heterogeneous/OWNERS~~ [jianlinliu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci-robot · 2026-04-08T14:46:47Z

[REHEARSALNOTIFIER]
@jianlinliu: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

Test name	Repo	Type	Reason
pull-ci-outrigger-project-multiarch-tuning-operator-main-ocp420-e2e-gcp	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-main-ocp419-e2e-azure	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-main-ocp417-e2e-gcp	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-main-ocp417-azure-mto-heterogeneous-perfscale	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-main-ocp417-azure-heterogeneous-perfscale	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-main-ocp416-e2e-gcp	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-main-ocp418-e2e-azure	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-main-ocp422-e2e-gcp	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-main-ocp421-e2e-gcp	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-v1.x-ocp416-e2e-gcp	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-v1.x-ocp420-e2e-gcp	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-v1.x-ocp419-e2e-azure	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-v1.x-ocp417-e2e-gcp	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-v1.x-ocp421-e2e-gcp	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-v0.0.1-e2e-gcp-multi-operator-olm	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-v0.0.1-e2e-gcp-multi-operator	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-v1.x-ocp422-e2e-gcp	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-v1.x-ocp418-e2e-azure	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-v0.9-e2e-gcp-multi-operator-olm	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-outrigger-project-multiarch-tuning-operator-v0.9-e2e-gcp-multi-operator	outrigger-project/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-openshift-multiarch-tuning-operator-main-ocp416-e2e-gcp	openshift/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-openshift-multiarch-tuning-operator-main-ocp420-e2e-gcp	openshift/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-openshift-multiarch-tuning-operator-main-ocp417-e2e-gcp	openshift/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-openshift-multiarch-tuning-operator-main-ocp417-azure-mto-heterogeneous-perfscale	openshift/multiarch-tuning-operator	presubmit	Registry content changed
pull-ci-openshift-multiarch-tuning-operator-main-ocp417-azure-heterogeneous-perfscale	openshift/multiarch-tuning-operator	presubmit	Registry content changed

A total of 773 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

jianlinliu · 2026-04-08T15:00:03Z

/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-gcp-ovn-multi-x-x-to-a-x periodic-ci-openshift-multiarch-main-nightly-4.17-upgrade-from-stable-4.16-ocp-e2e-upgrade-gcp-ovn-multi-a-a periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-upgrade-gcp-ovn-multi-x-ax

openshift-ci-robot · 2026-04-08T15:00:07Z

@jianlinliu: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

openshift-ci-robot · 2026-04-08T15:03:43Z

@jianlinliu: job(s): periodic-ci-openshift-multiarch-main-nightly-4.17-upgrade-from-stable-4.16-ocp-e2e-upgrade-gcp-ovn-multi-a-a, periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a either don't exist or were not found to be affected, and cannot be rehearsed

openshift-ci · 2026-04-08T18:10:36Z

@jianlinliu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/rehearse/periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a	`28d3298`	link	unknown	`/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-upgrade-from-stable-4.21-ocp-e2e-upgrade-gcp-ovn-multi-a-a`
ci/rehearse/periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-gcp-ovn-multi-x-x-to-a-x	`a289177`	link	unknown	`/pj-rehearse periodic-ci-openshift-multiarch-main-nightly-4.22-ocp-e2e-gcp-ovn-multi-x-x-to-a-x`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-bot · 2026-04-09T19:05:47Z

Closing, please open a separate request if this work is still required.

openshift-ci-robot · 2026-04-09T19:05:53Z

@jianlinliu: This pull request references Jira Issue OCPBUGS-82060. The bug has been updated to no longer refer to the pull request using the external bug tracker. All external bug links have been closed. The bug has been moved to the NEW state.


openshift-ci Bot requested review from lihongan and oliver-smakal April 7, 2026 14:24

jianlinliu changed the title ~~Fix GCP ARM capacity exhaustion in multi-arch jobs~~ OCPBUGS-82060: Fix GCP ARM capacity exhaustion in multi-arch jobs Apr 7, 2026

openshift-ci-robot added the jira/valid-reference and jira/invalid-bug labels Apr 7, 2026

jianlinliu force-pushed the fix-gcp-arm-capacity-multiarch branch 4 times, most recently from 067d338 to b9be8e2 Compare April 7, 2026 14:50

openshift-ci-robot added the jira/valid-bug label and removed the jira/invalid-bug label Apr 7, 2026

jianlinliu force-pushed the fix-gcp-arm-capacity-multiarch branch from b9be8e2 to 1c5d6a5 Compare April 7, 2026 15:02

jianlinliu force-pushed the fix-gcp-arm-capacity-multiarch branch 2 times, most recently from a83558f to bc06f12 Compare April 7, 2026 15:14

jianlinliu force-pushed the fix-gcp-arm-capacity-multiarch branch 7 times, most recently from 05544a0 to 28d3298 Compare April 8, 2026 06:28

jianlinliu force-pushed the fix-gcp-arm-capacity-multiarch branch 3 times, most recently from 4595e1e to 7af8931 Compare April 8, 2026 12:06

jianlinliu force-pushed the fix-gcp-arm-capacity-multiarch branch 4 times, most recently from df8cf0e to b94e64d Compare April 8, 2026 13:56

jianlinliu force-pushed the fix-gcp-arm-capacity-multiarch branch from b94e64d to 623b787 Compare April 8, 2026 14:27

jianlinliu force-pushed the fix-gcp-arm-capacity-multiarch branch from 623b787 to ee6dd22 Compare April 8, 2026 14:34

jianlinliu force-pushed the fix-gcp-arm-capacity-multiarch branch from ee6dd22 to a289177 Compare April 8, 2026 14:42

openshift-bot closed this Apr 9, 2026

barbacbd mentioned this pull request Apr 15, 2026

OCPBUGS-82060: Distribute GCP ARM64 instance types across releases to reduce quota exhaustion #77809

Merged
