OCPBUGS-94187: Backport HA topology deployment scaling to release-4.22 by tmshort · Pull Request #758 · openshift/operator-framework-operator-controller

tmshort · 2026-06-30T19:16:50Z

Summary

Backport of OCPBUGS-62517 fix to release-4.22.

Bug: During OCP upgrades, ClusterOperator olm goes Available=False because the operator-controller and catalogd deployments each run with only 1 replica. When the single pod is replaced during a rolling update, there is a brief window where no pod is available.

Fix: For HighlyAvailable topologies, cluster-olm-operator overrides replicas to 2 and enables a PodDisruptionBudget with minAvailable: 1. Pod anti-affinity spreads replicas across different nodes.
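
The effect of `minAvailable: 1` at different replica counts can be worked through with a little arithmetic. The sketch below is a simplified model, not the Kubernetes disruption controller's code: it assumes the number of voluntary evictions a PDB permits is the healthy pod count minus `minAvailable`, floored at zero. The function name `disruptions_allowed` is hypothetical.

```python
def disruptions_allowed(healthy_pods: int, min_available: int) -> int:
    """Simplified PDB model: voluntary evictions permitted right now
    without dropping below min_available healthy pods."""
    return max(0, healthy_pods - min_available)


# HA topology after this change: 2 replicas, minAvailable: 1.
# One pod may be evicted (e.g. during a node drain) while the other serves.
print(disruptions_allowed(healthy_pods=2, min_available=1))  # 1

# The same PDB on a single-replica deployment would permit no voluntary
# evictions at all (assumption: this is one reason the PDB stays disabled
# by default for non-HA topologies).
print(disruptions_allowed(healthy_pods=1, min_available=1))  # 0
```

Under this model the HA configuration always leaves one pod serving while the other is evicted, whereas a single replica with the same PDB would block every voluntary eviction.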

Commits

UPSTREAM: <carry>: OCPBUGS-62517: Set replicas=1, PDB, and pod anti-affinity for HA topology
- Adds replicas: 1 default to openshift/helm/catalogd.yaml and openshift/helm/operator-controller.yaml (cluster-olm-operator overrides to 2 for HA)
- Adds podDisruptionBudget.enabled: false default (cluster-olm-operator enables for HA)
- Adds pod anti-affinity (preferredDuringSchedulingIgnoredDuringExecution, weight: 100, by hostname) to manifests
UPSTREAM: <carry>: use AlwaysAllow UnhealthyPodEvictionPolicy option in PDBs (#2688)
- Adds unhealthyPodEvictionPolicy: AlwaysAllow to PDB Helm templates
- Allows eviction of unhealthy pods during node drain even when PDB would otherwise block it
UPSTREAM: <carry>: add OLMv1 topology-based deployment scaling e2e test
- Verifies HA topologies get replicas=2 + PDB, non-HA topologies keep replicas=1 + no PDB

Test plan

- Verify CI passes for upgrade jobs
- Check that periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade passes the "clusteroperator/olm should not change condition/Available" test
- On HA cluster: operator-controller and catalogd deployments have 2 replicas and a PDB
- On SNO cluster: replicas=1, no PDB (non-HA behavior unchanged)

Paired PR: cluster-olm-operator release-4.22 (HA topology detection and scaling logic)

Jira: https://redhat.atlassian.net/browse/OCPBUGS-94187

🤖 Generated with Claude Code

…ffinity for HA topology

Rolling updates in HighlyAvailable clusters leave catalogd and operator-controller unavailable when the only running pod is evicted before its replacement is ready.

Fix by defaulting replicas=1 and PDB disabled in the static Helm values (safe for SNO/External topologies, passes the SNO conformance test that asserts exactly one replica in SingleReplica topology mode). Add pod anti-affinity to prefer scheduling replicas on different nodes.

cluster-olm-operator detects the cluster's ControlPlaneTopology at startup and overrides these values to replicas=2 and PDB enabled when a HighlyAvailable topology is detected, then re-renders the manifests before starting controllers.

When a topology change is observed at runtime (exceedingly rare), the operator exits so its deployment controller restarts it, triggering a fresh Helm render with the correct values for the new topology.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>

…in PDBs (#2688)

Allow eviction of unhealthy (not ready) pods even if there are no disruptions allowed on a PodDisruptionBudget. This can help to drain/maintain a node and recover without a manual intervention when multiple instances of nodes or pods are misbehaving.

Upstream commit: 869124a
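
The difference between the two `unhealthyPodEvictionPolicy` values can be sketched as a small decision function. This is a simplified model of the documented Kubernetes behavior, not the API server's eviction code; the function name `eviction_permitted` and its parameters are hypothetical and chosen only for illustration.

```python
def eviction_permitted(pod_ready: bool,
                       disruptions_allowed: int,
                       current_healthy: int,
                       desired_healthy: int,
                       policy: str = "IfHealthyBudget") -> bool:
    """Simplified model of whether a PDB lets a running pod be evicted."""
    if pod_ready:
        # Healthy pods are always subject to the disruption budget.
        return disruptions_allowed > 0
    if policy == "AlwaysAllow":
        # Unhealthy pods may be evicted regardless of the budget.
        return True
    if policy == "IfHealthyBudget":
        # Unhealthy pods may be evicted only while the application
        # still has at least the desired number of healthy pods.
        return current_healthy >= desired_healthy
    raise ValueError(f"unknown policy: {policy}")


# Both replicas are not ready, so the budget is exhausted (0 disruptions).
stuck = dict(pod_ready=False, disruptions_allowed=0,
             current_healthy=0, desired_healthy=1)
print(eviction_permitted(**stuck, policy="IfHealthyBudget"))  # False
print(eviction_permitted(**stuck, policy="AlwaysAllow"))      # True
```

In the stuck case shown, the default policy would block a node drain until someone intervenes, while `AlwaysAllow` lets the unhealthy pods be evicted and rescheduled.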

openshift-ci-robot · 2026-06-30T19:16:58Z

@tmshort: This pull request references Jira Issue OCPBUGS-94187, which is valid. The bug has been moved to the POST state.

7 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.22.0) matches configured target version for branch (4.22.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)
release note type set to "Release Note Not Required"
dependent bug Jira Issue OCPBUGS-62517 is in the state Verified, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
dependent Jira Issue OCPBUGS-62517 targets the "5.0.0" version, which is one of the valid target versions: 5.0.0
bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Jira (jkeister@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

coderabbitai · 2026-06-30T19:18:31Z

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

tmshort · 2026-06-30T19:21:46Z

/test e2e-gcp-ovn-upgrade

openshift-ci · 2026-06-30T19:27:21Z

@tmshort: This PR was included in a payload test run from openshift/cluster-olm-operator#215
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b3ce9cb0-74b9-11f1-8508-8fb8e861b82f-0

tmshort · 2026-07-01T01:11:28Z

/retest

tmshort · 2026-07-01T03:51:42Z

These tests will fail until openshift/cluster-olm-operator#215 is merged.

openshift-ci · 2026-07-01T03:55:08Z

@tmshort: This PR was included in a payload test run from openshift/cluster-olm-operator#215
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a4ca57d0-7500-11f1-812b-c867d5ec9129-0

tmshort · 2026-07-01T13:18:28Z

https://pr-payload-tests.ci.openshift.org/runs/ci/a4ca57d0-7500-11f1-812b-c867d5ec9129-0 passed with both PRs.

Adds a new test that verifies cluster-olm-operator correctly configures operator-controller and catalogd deployments based on the cluster's control plane topology:
- HA topologies (HighlyAvailable, HighlyAvailableArbiter, DualReplica): replicas=2 with a PodDisruptionBudget present
- Non-HA topologies (SingleReplica/SNO, External): replicas=1, no PDB

Also registers policyv1 in the test scheme to support PDB list queries.

Assisted-by: claude
Signed-off-by: Todd Short <tshort@redhat.com>
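
The expectations this test encodes amount to a lookup from control plane topology to deployment shape. The sketch below restates that mapping in Python for illustration only; the actual test is part of the Go e2e suite, and the names `expected_scaling` and `deployment_matches` are hypothetical. Rejecting unknown topology values is an assumption of this sketch, not something the commit message specifies.

```python
# Topology names are taken from the commit message above.
HA_TOPOLOGIES = {"HighlyAvailable", "HighlyAvailableArbiter", "DualReplica"}
NON_HA_TOPOLOGIES = {"SingleReplica", "External"}


def expected_scaling(topology: str) -> dict:
    """Return the replica count and PDB presence expected for a topology."""
    if topology in HA_TOPOLOGIES:
        return {"replicas": 2, "pdb_present": True}
    if topology in NON_HA_TOPOLOGIES:
        return {"replicas": 1, "pdb_present": False}
    # Assumption: unknown topologies are treated as an error here.
    raise ValueError(f"unrecognized topology: {topology}")


def deployment_matches(topology: str, replicas: int, pdb_count: int) -> bool:
    """Compare observed deployment state against the expectation."""
    want = expected_scaling(topology)
    return replicas == want["replicas"] and (pdb_count > 0) == want["pdb_present"]


print(deployment_matches("HighlyAvailable", replicas=2, pdb_count=1))  # True
print(deployment_matches("SingleReplica", replicas=2, pdb_count=0))    # False
```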

tmshort · 2026-07-01T14:09:44Z

/test default-catalog-consistency

pedjak · 2026-07-01T15:04:01Z

/lgtm

rashmigottipati

/lgtm
/approve

openshift-ci · 2026-07-01T17:45:10Z

@tmshort: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci · 2026-07-01T18:15:08Z

@tmshort: This PR was included in a payload test run from openshift/cluster-olm-operator#215
trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c8134f30-7578-11f1-8e7d-1d1eca22b8ec-0

tmshort · 2026-07-02T00:37:20Z

https://pr-payload-tests.ci.openshift.org/runs/ci/c8134f30-7578-11f1-8e7d-1d1eca22b8ec-0 passed

rashmigottipati

/verified by @rashmigottipati

openshift-ci-robot · 2026-07-02T01:04:58Z

@rashmigottipati: This PR has been marked as verified by @rashmigottipati.

openshift-ci · 2026-07-02T01:05:05Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rashmigottipati, tmshort

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~DOWNSTREAM_OWNERS~~ [tmshort]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tmshort and others added 2 commits June 30, 2026 15:06

openshift-ci-robot added the jira/valid-reference (Indicates that this PR references a valid Jira ticket of any type.) and jira/valid-bug (Indicates that a referenced Jira bug is valid for the branch this PR is targeting.) labels Jun 30, 2026

openshift-ci Bot requested review from bentito and fgiudici June 30, 2026 19:17

openshift-ci Bot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) Jun 30, 2026

tmshort mentioned this pull request Jun 30, 2026

OCPBUGS-94187: Scale to replicas=2 and enable PDB on HighlyAvailable topology [release-4.22] openshift/cluster-olm-operator#215

Open

4 tasks

tmshort force-pushed the ocpbugs-94187-release-4.22 branch from 1d73978 to a786e3a Compare July 1, 2026 13:30

openshift-ci Bot assigned pedjak Jul 1, 2026

openshift-ci Bot added the lgtm label (Indicates that a PR is ready to be merged.) Jul 1, 2026

rashmigottipati approved these changes Jul 1, 2026

openshift-ci Bot assigned rashmigottipati Jul 1, 2026

rashmigottipati approved these changes Jul 2, 2026

openshift-ci-robot added the verified label (Signifies that the PR passed pre-merge verification criteria) Jul 2, 2026
