SREP-4345: Fix ROSA CI stability for account-roles and osd-cluster-ready#77500

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:main from dustman9000:fix-rosa-ci-stability on Apr 8, 2026

@dustman9000
Member

Summary

Two fixes for systemic ROSA CI failures affecting all HCP and Classic jobs:

1. Account-roles version fallback (fixes 4.23/5.0 jobs)

When nightly builds for unreleased OCP versions (4.23, 5.0) don't have IAM policies published in ROSA yet, rosa create account-roles --version 4.23 fails. The fix detects unavailable versions and falls back to the latest available version in that channel group.
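
The fallback can be sketched as follows. This is a minimal illustration in Python, not the script from this PR: the function names and the hard-coded list of available versions are hypothetical, and how the real step obtains that list is not shown here. Versions are compared numerically so that 4.10 sorts above 4.9.

```python
from typing import Iterable, Optional, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '4.23' into (4, 23) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def resolve_account_roles_version(requested: str, available: Iterable[str]) -> Optional[str]:
    """Return `requested` if it is available, else the highest available version.

    Returns None when no version is available at all.
    """
    candidates = list(available)
    if requested in candidates:
        return requested
    if not candidates:
        return None
    return max(candidates, key=version_key)


# Hypothetical data: 4.23 has no IAM policies published yet.
available = ["4.20", "4.21", "4.22"]
print(resolve_account_roles_version("4.23", available))  # 4.22
print(resolve_account_roles_version("4.21", available))  # 4.21
```

Returning the requested version unchanged when it exists keeps released versions on their existing path; only unreleased versions take the fallback.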

Affected jobs (all 100% failing):

  • periodic-ci-openshift-release-main-nightly-4.23-e2e-rosa-hcp-ovn
  • periodic-ci-openshift-release-main-nightly-4.23-e2e-rosa-sts-ovn
  • periodic-ci-openshift-release-main-nightly-5.0-e2e-rosa-sts-ovn

2. Increase osd-cluster-ready timeout from 60m to 120m (fixes Classic STS)

The osd-cluster-ready job requires 20 consecutive health checks including certificate validation. certman-operator cert provisioning on staging is slow enough to reset the check counter repeatedly, exceeding the 60m timeout.
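
To see why a slow certificate check can exhaust the timeout, the consecutive-check requirement can be modeled as a streak counter that resets on any failure. The sketch below is an illustrative model in Python, not code from osd-cluster-ready; the function name and the sample history are made up.

```python
from typing import Iterable, Optional


def checks_until_ready(results: Iterable[bool], required: int = 20) -> Optional[int]:
    """Return the 1-based index of the check that completes `required`
    consecutive passes, or None if that streak is never reached."""
    streak = 0
    for index, passed in enumerate(results, start=1):
        # Any failed check resets the streak to zero.
        streak = streak + 1 if passed else 0
        if streak >= required:
            return index
    return None


# 19 passes, one failure, then 20 passes: ready on check 40, not check 20.
history = [True] * 19 + [False] + [True] * 20
print(checks_until_ready(history))  # 40
```

In this model a single failure at check 20 doubles the number of checks needed, which is how repeated resets can push the job past a fixed timeout.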

Affected jobs (all 100% failing):

  • periodic-ci-openshift-release-main-nightly-4.18-e2e-rosa-sts-ovn
  • periodic-ci-openshift-release-main-nightly-4.19-e2e-rosa-sts-ovn
  • periodic-ci-openshift-release-main-nightly-4.20-e2e-rosa-sts-ovn
  • periodic-ci-openshift-release-main-nightly-4.21-e2e-rosa-sts-ovn
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-rosa-sts-ovn

Jira: https://redhat.atlassian.net/browse/SREP-4345

@openshift-ci-robot added the jira/valid-reference label (Indicates that this PR references a valid Jira ticket of any type.) on Apr 7, 2026
@openshift-ci-robot
Contributor

openshift-ci-robot commented Apr 7, 2026

@dustman9000: This pull request references SREP-4345 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.22.0" version, but no target version was set.

@openshift-ci bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Apr 7, 2026
@openshift-ci bot requested review from gaol and tzhou5 on April 7, 2026 17:38
@dustman9000 force-pushed the fix-rosa-ci-stability branch from cb04460 to 2528751 on April 7, 2026 17:47
Two fixes for systemic ROSA CI failures:

1. Account-roles version fallback: when nightly builds for unreleased
   OCP versions (4.23, 5.0) don't have IAM policies published yet,
   fall back to the latest available version instead of failing.

2. Increase osd-cluster-ready timeout from 60m to 120m: certman-operator
   cert delivery via Hive can be slow on staging (DNS validation), and
   the osd-cluster-ready job crashes on transient errors (log.Fatal),
   burning time in exponential backoff. Doubling the timeout gives more
   headroom while the upstream crash-loop fix is worked on in
   openshift/osd-cluster-ready.
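
The time lost to crash-looping can be estimated with simple arithmetic. The sketch below, in Python, sums capped exponential backoff delays; the 10-second base and 300-second cap are hypothetical placeholders, not values taken from this PR or from osd-cluster-ready.

```python
from typing import List


def backoff_delays(restarts: int, base_s: int = 10, cap_s: int = 300) -> List[int]:
    """Delay in seconds before each restart: the base doubles every time, up to the cap.

    base_s and cap_s are assumed values chosen for illustration only.
    """
    return [min(base_s * 2 ** attempt, cap_s) for attempt in range(restarts)]


delays = backoff_delays(8)
print(delays)       # [10, 20, 40, 80, 160, 300, 300, 300]
print(sum(delays))  # 1210
```

With these assumed parameters, eight restarts alone consume 1210 seconds, roughly 20 minutes, before any time spent on health checks is counted.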
@dustman9000 force-pushed the fix-rosa-ci-stability branch from 2528751 to 7327e69 on April 7, 2026 18:10
@openshift-ci-robot
Contributor

[REHEARSALNOTIFIER]
@dustman9000: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| pull-ci-openshift-svt-master-reliability-v2-rosa-4.17-nightly-x86-reliability-v2-20h | openshift/svt | presubmit | Registry content changed |
| pull-ci-openshift-svt-master-reliability-v2-rosa-4.17-nightly-x86-reliability-v2-1h | openshift/svt | presubmit | Registry content changed |
| pull-ci-openshift-svt-master-reliability-v2-rosa_hcp-4.17-nightly-x86-reliability-v2-20h | openshift/svt | presubmit | Registry content changed |
| pull-ci-openshift-svt-master-reliability-v2-rosa_hcp-4.17-nightly-x86-reliability-v2-1h | openshift/svt | presubmit | Registry content changed |
| pull-ci-rh-ecosystem-edge-neuron-ci-main-4.19-stable-aws-neuron-operator-e2e | rh-ecosystem-edge/neuron-ci | presubmit | Registry content changed |
| pull-ci-rh-ecosystem-edge-neuron-ci-main-4.20-stable-aws-neuron-operator-e2e | rh-ecosystem-edge/neuron-ci | presubmit | Registry content changed |
| pull-ci-openshift-file-integrity-operator-master-e2e-rosa | openshift/file-integrity-operator | presubmit | Registry content changed |
| pull-ci-openshift-file-integrity-operator-release-5.0-e2e-rosa | openshift/file-integrity-operator | presubmit | Registry content changed |
| pull-ci-openshift-file-integrity-operator-release-4.23-e2e-rosa | openshift/file-integrity-operator | presubmit | Registry content changed |
| pull-ci-openshift-file-integrity-operator-release-4.22-e2e-rosa | openshift/file-integrity-operator | presubmit | Registry content changed |
| pull-ci-openshift-file-integrity-operator-release-4.21-e2e-rosa | openshift/file-integrity-operator | presubmit | Registry content changed |
| pull-ci-openshift-file-integrity-operator-release-4.20-e2e-rosa | openshift/file-integrity-operator | presubmit | Registry content changed |
| pull-ci-openshift-file-integrity-operator-release-4.19-e2e-rosa | openshift/file-integrity-operator | presubmit | Registry content changed |
| pull-ci-openshift-file-integrity-operator-release-4.18-e2e-rosa | openshift/file-integrity-operator | presubmit | Registry content changed |
| pull-ci-openshift-file-integrity-operator-release-4.17-e2e-rosa | openshift/file-integrity-operator | presubmit | Registry content changed |
| pull-ci-openshift-aws-load-balancer-operator-main-e2e-aws-rosa-operator | openshift/aws-load-balancer-operator | presubmit | Registry content changed |
| pull-ci-openshift-aws-load-balancer-operator-release-1.2-e2e-aws-rosa-operator | openshift/aws-load-balancer-operator | presubmit | Registry content changed |
| pull-ci-openshift-aws-load-balancer-operator-release-1.1-e2e-aws-rosa-operator | openshift/aws-load-balancer-operator | presubmit | Registry content changed |
| pull-ci-openshift-aws-load-balancer-operator-release-1.0-e2e-aws-rosa-operator | openshift/aws-load-balancer-operator | presubmit | Registry content changed |
| pull-ci-CSPI-QE-MSI-single-cluster-smoke-v4.14-single-cluster-rosa-4-14-candidate-smoke | CSPI-QE/MSI | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.21-nightly-x86-cluster-density-v2-249nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.21-nightly-x86-control-plane-120nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.21-nightly-x86-control-plane-24nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.21-nightly-x86-data-path-9nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |
| pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-rosa-4.21-nightly-x86-node-density-heavy-24nodes | openshift-eng/ocp-qe-perfscale-ci | presubmit | Registry content changed |

A total of 292 jobs have been affected by this change. The above listing is non-exhaustive and limited to 25 jobs.

A full list of affected jobs can be found here

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@dustman9000
Member Author

/pj-rehearse periodic-ci-openshift-release-main-nightly-4.21-e2e-rosa-sts-ovn

@openshift-ci-robot
Contributor

@dustman9000: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@dustman9000
Member Author

/pj-rehearse periodic-ci-openshift-release-main-nightly-4.21-e2e-rosa-sts-ovn

@openshift-ci-robot
Contributor

@dustman9000: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@dustman9000
Member Author

/pj-rehearse ack

@openshift-ci-robot
Contributor

@dustman9000: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-ci-robot added the rehearsals-ack label (Signifies that rehearsal jobs have been acknowledged) on Apr 8, 2026
@joshbranham
Contributor

/lgtm

@openshift-ci bot added the lgtm label (Indicates that a PR is ready to be merged.) on Apr 8, 2026
@openshift-ci
Contributor

openshift-ci bot commented Apr 8, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dustman9000, joshbranham

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Contributor

openshift-ci bot commented Apr 8, 2026

@dustman9000: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/rehearse/periodic-ci-openshift-release-main-nightly-4.21-e2e-rosa-sts-ovn | 7327e69 | link | unknown | /pj-rehearse periodic-ci-openshift-release-main-nightly-4.21-e2e-rosa-sts-ovn |

@openshift-merge-bot bot merged commit 757aec7 into openshift:main on Apr 8, 2026
9 of 10 checks passed
dustman9000 added a commit to dustman9000/release that referenced this pull request on Apr 9, 2026
PR openshift#77500 introduced a version fallback block that references
CLUSTER_SWITCH before it was defined, causing an unbound variable
error with set -o nounset. Move the CLUSTER_SWITCH assignment
before the fallback block so it is available when needed.