OCPBUGS-94187: Backport HA topology deployment scaling to release-4.22#758
tmshort wants to merge 3 commits into
…ffinity for HA topology

Rolling updates in HighlyAvailable clusters leave catalogd and operator-controller unavailable when the only running pod is evicted before its replacement is ready.

Fix by defaulting replicas=1 and PDB disabled in the static Helm values (safe for SNO/External topologies, passes the SNO conformance test that asserts exactly one replica in SingleReplica topology mode). Add pod anti-affinity to prefer scheduling replicas on different nodes.

cluster-olm-operator detects the cluster's ControlPlaneTopology at startup and overrides these values to replicas=2 and PDB enabled when a HighlyAvailable topology is detected, then re-renders the manifests before starting controllers.

When a topology change is observed at runtime (exceedingly rare), the operator exits so its deployment controller restarts it, triggering a fresh Helm render with the correct values for the new topology.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
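The override logic in this commit message can be sketched roughly as follows. This is a minimal illustration, not the cluster-olm-operator code: the `helmValues` type and the functions `staticDefaults`, `valuesFor`, and `mustRestart` are hypothetical names, and only the HighlyAvailable case named in the commit message is modeled.

```go
package main

import "fmt"

// helmValues is a hypothetical, simplified stand-in for the Helm values
// that control each deployment; the real values files have more fields.
type helmValues struct {
	Replicas   int
	PDBEnabled bool
}

// staticDefaults returns the defaults described in the commit message:
// one replica and no PodDisruptionBudget.
func staticDefaults() helmValues {
	return helmValues{Replicas: 1, PDBEnabled: false}
}

// valuesFor applies the HighlyAvailable override on top of the defaults.
// Assumption: only the topology named in the commit message is treated as HA.
func valuesFor(topology string) helmValues {
	v := staticDefaults()
	if topology == "HighlyAvailable" {
		v.Replicas = 2
		v.PDBEnabled = true
	}
	return v
}

// mustRestart reports whether the topology observed at runtime differs
// from the one used for the startup render. In that case the operator
// would exit so that a restart triggers a fresh render.
func mustRestart(startupTopology, observedTopology string) bool {
	return startupTopology != observedTopology
}

func main() {
	for _, t := range []string{"SingleReplica", "External", "HighlyAvailable"} {
		fmt.Printf("%s: %+v\n", t, valuesFor(t))
	}
	fmt.Println("restart needed:", mustRestart("SingleReplica", "HighlyAvailable"))
}
```

Keeping the safe single-replica values as the static default means a cluster whose topology is never overridden still satisfies the SNO conformance check.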
…in PDBs (#2688)

Allow eviction of unhealthy (not ready) pods even if there are no disruptions allowed on a PodDisruptionBudget. This can help to drain or maintain a node and recover without manual intervention when multiple nodes or pods are misbehaving.

Upstream commit: 869124a
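To illustrate the effect of this policy, here is a deliberately simplified model of the eviction decision under `AlwaysAllow`. It is a sketch, not the Kubernetes eviction implementation: the function name `canEvictAlwaysAllow` and its parameters are hypothetical, and the other policy value (`IfHealthyBudget`) is not modeled.

```go
package main

import "fmt"

// canEvictAlwaysAllow is a simplified, hypothetical model of how a
// PodDisruptionBudget with unhealthyPodEvictionPolicy: AlwaysAllow
// treats an eviction request.
//
//   - A pod that is not ready may be evicted regardless of the budget.
//   - A ready pod may be evicted only if the budget currently allows
//     at least one disruption.
func canEvictAlwaysAllow(podReady bool, disruptionsAllowed int) bool {
	if !podReady {
		return true
	}
	return disruptionsAllowed > 0
}

func main() {
	// An unhealthy pod on a node being drained, with the budget exhausted.
	fmt.Println("unhealthy pod, budget 0:", canEvictAlwaysAllow(false, 0))
	// A healthy pod with the budget exhausted stays protected.
	fmt.Println("healthy pod, budget 0:", canEvictAlwaysAllow(true, 0))
	// A healthy pod with budget available may be evicted.
	fmt.Println("healthy pod, budget 1:", canEvictAlwaysAllow(true, 1))
}
```

The first case is the one the commit targets: a node drain is not blocked by a pod that was already failing its readiness check.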
@tmshort: This pull request references Jira Issue OCPBUGS-94187, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (jkeister@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
/test e2e-gcp-ovn-upgrade
@tmshort: This PR was included in a payload test run from openshift/cluster-olm-operator#215
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b3ce9cb0-74b9-11f1-8508-8fb8e861b82f-0
/retest
These tests will fail until openshift/cluster-olm-operator#215 is merged.
@tmshort: This PR was included in a payload test run from openshift/cluster-olm-operator#215
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a4ca57d0-7500-11f1-812b-c867d5ec9129-0
https://pr-payload-tests.ci.openshift.org/runs/ci/a4ca57d0-7500-11f1-812b-c867d5ec9129-0 passed with both PRs.
Adds a new test that verifies cluster-olm-operator correctly configures operator-controller and catalogd deployments based on the cluster's control plane topology:
- HA topologies (HighlyAvailable, HighlyAvailableArbiter, DualReplica): replicas=2 with a PodDisruptionBudget present
- Non-HA topologies (SingleReplica/SNO, External): replicas=1, no PDB

Also registers policyv1 in the test scheme to support PDB list queries.

Assisted-by: claude
Signed-off-by: Todd Short <tshort@redhat.com>
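The expectations this test checks can be written as a small table. The sketch below uses only the standard library and is not the e2e test itself, which queries a live cluster. The names `expectation`, `expectedFor`, and `verify` are hypothetical, and the observed replica and PDB counts are passed in as plain integers.

```go
package main

import "fmt"

// expectation describes what the test description says each deployment
// should look like for a given control plane topology.
type expectation struct {
	Replicas int
	PDB      bool
}

// expectedFor maps a topology name to its expected state, using the
// grouping given in the commit message.
func expectedFor(topology string) (expectation, error) {
	switch topology {
	case "HighlyAvailable", "HighlyAvailableArbiter", "DualReplica":
		return expectation{Replicas: 2, PDB: true}, nil
	case "SingleReplica", "External":
		return expectation{Replicas: 1, PDB: false}, nil
	}
	return expectation{}, fmt.Errorf("unknown topology %q", topology)
}

// verify compares observed values (hypothetical inputs standing in for
// what a real test would read from the cluster) with the expectation.
func verify(topology string, replicas, pdbCount int) error {
	want, err := expectedFor(topology)
	if err != nil {
		return err
	}
	if replicas != want.Replicas {
		return fmt.Errorf("%s: got %d replicas, want %d", topology, replicas, want.Replicas)
	}
	if (pdbCount > 0) != want.PDB {
		return fmt.Errorf("%s: got %d PDBs, want PDB present=%t", topology, pdbCount, want.PDB)
	}
	return nil
}

func main() {
	fmt.Println(verify("DualReplica", 2, 1))   // matches the HA expectation
	fmt.Println(verify("SingleReplica", 2, 0)) // reports a replica mismatch
}
```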
1d73978 to a786e3a
/test default-catalog-consistency
/lgtm
@tmshort: all tests passed!
@tmshort: This PR was included in a payload test run from openshift/cluster-olm-operator#215
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c8134f30-7578-11f1-8e7d-1d1eca22b8ec-0
rashmigottipati left a comment
/verified by @rashmigottipati
@rashmigottipati: This PR has been marked as verified.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: rashmigottipati, tmshort
Summary
Backport of OCPBUGS-62517 fix to release-4.22.

Bug: During OCP upgrades, ClusterOperator olm goes Available=False because the operator-controller and catalogd deployments each run with only 1 replica. When the single pod is replaced during a rolling update, there is a brief window where no pod is available.

Fix: For HighlyAvailable topologies, cluster-olm-operator overrides replicas to 2 and enables a PodDisruptionBudget with minAvailable: 1. Pod anti-affinity spreads replicas across different nodes.

Commits
- UPSTREAM: <carry>: OCPBUGS-62517: Set replicas=1, PDB, and pod anti-affinity for HA topology
  - replicas: 1 default to openshift/helm/catalogd.yaml and openshift/helm/operator-controller.yaml (cluster-olm-operator overrides to 2 for HA)
  - podDisruptionBudget.enabled: false default (cluster-olm-operator enables for HA)
- UPSTREAM: <carry>: use AlwaysAllow UnhealthyPodEvictionPolicy option in PDBs (#2688)
  - unhealthyPodEvictionPolicy: AlwaysAllow to PDB Helm templates
- UPSTREAM: <carry>: add OLMv1 topology-based deployment scaling e2e test

Test plan
- periodic-ci-openshift-release-main-ci-4.22-e2e-gcp-ovn-upgrade passes
- clusteroperator/olm should not change condition/Available test

Paired PR: cluster-olm-operator release-4.22 (HA topology detection and scaling logic)
Jira: https://redhat.atlassian.net/browse/OCPBUGS-94187
🤖 Generated with Claude Code