
WIP: OCPBUGS-62517 ha replicas#709

Open
tmshort wants to merge 2 commits into openshift:main from tmshort:ocpbugs-62517-ha-replicas

Conversation

@tmshort
Contributor

@tmshort tmshort commented Apr 24, 2026

Summary by CodeRabbit

  • New Features

    • Added Pod Disruption Budgets for catalogd and operator-controller deployments to ensure at least one pod remains available during service disruptions, improving availability and resilience.
  • Chores

    • Increased replica counts from 1 to 2 for both catalogd-controller-manager and operator-controller-controller-manager deployments.
    • Updated Helm chart configurations to reflect new replica settings.

tmshort added 2 commits April 23, 2026 15:40
…tionBudget

With replicas=1, rolling updates cause Available=False on the olm
ClusterOperator because there is a window where no pod is ready.
Setting replicas=2 ensures at least one pod remains available throughout
the rolling update (maxUnavailable=0, maxSurge=1), keeping the
ClusterOperator Available=True.

Re-enable the PodDisruptionBudget (minAvailable=1) which was previously
disabled because it would deadlock node drains with a single replica.
With 2 replicas, the PDB correctly protects against voluntary disruptions
while still allowing node drains.

This replaces the previous <carry> that set podDisruptionBudget.enabled=false.

Fixes: https://issues.redhat.com/browse/OCPBUGS-62517
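
The rollout settings and re-enabled PDB described in this commit message can be sketched as manifest fragments like the following. The replicas, strategy, and minAvailable values come from the commit message and change summary; the metadata names and selector labels are illustrative and may differ from the actual files in openshift/catalogd/:

```yaml
# Sketch of the rollout-safe Deployment settings described above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalogd-controller-manager
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take the last ready pod away
      maxSurge: 1         # bring the new pod up before removing an old one
---
# Re-enabled PDB: with two replicas, a node drain can evict one pod
# without dropping below minAvailable, so drains no longer deadlock.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: catalogd-controller-manager
spec:
  minAvailable: 1
  selector:
    matchLabels:            # selector labels are illustrative
      control-plane: catalogd-controller-manager
```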
…azily

catalogd's catalog HTTP server previously called net.Listen eagerly at
startup on every pod, even non-leaders that never called http.Serve. With
replicas > 1 this caused ~50% of catalog requests to queue indefinitely in
the kernel accept backlog.

Replace manager.Server with a custom catalogServerRunnable that binds the
port lazily inside Start() (called on every pod since NeedLeaderElection
returns false) and closes a ready channel to signal readiness. A /readyz
check selects on that channel so all pods become ready immediately. Non-leader
pods serve requests but return 404 (empty local cache); callers are expected
to retry. This avoids the rolling-update deadlock that would occur if Start()
were gated on leadership.

Upstream PR: operator-framework/operator-controller#2674

Fixes: https://issues.redhat.com/browse/OCPBUGS-62517
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 24, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

Walkthrough

The catalog server startup is refactored from controller-runtime's manager.Server to a custom catalogServerRunnable, with every pod now listening immediately (via NeedLeaderElection() == false). Readiness checks are added, and shutdown is handled by a goroutine. Simultaneously, both catalogd and operator-controller deployments are scaled from 1 to 2 replicas with PodDisruptionBudgets enforcing minAvailable: 1.

Changes

Cohort / File(s) Summary
Catalog Server Refactor
internal/catalogd/serverutil/serverutil.go
Replaces manager.Server (leader-only gating) with custom catalogServerRunnable. Start() now closes a readiness channel before serving, spawns a shutdown goroutine monitoring context cancellation, and ignores http.ErrServerClosed. Parameter renamed from tlsFileWatcher to cw.
Catalogd Deployment Scaling & PDB
openshift/catalogd/manifests.yaml, openshift/catalogd/manifests-experimental.yaml
Increases catalogd-controller-manager Deployment replicas from 1 to 2 and adds PodDisruptionBudget with minAvailable: 1.
Catalogd Helm Configuration
openshift/helm/catalogd.yaml
Removes podDisruptionBudget.enabled: false and sets replicas: 2.
Operator-Controller Deployment Scaling & PDB
openshift/operator-controller/manifests.yaml, openshift/operator-controller/manifests-experimental.yaml
Increases operator-controller-controller-manager Deployment replicas from 1 to 2 and adds PodDisruptionBudget with minAvailable: 1.
Operator-Controller Helm Configuration
openshift/helm/operator-controller.yaml
Removes podDisruptionBudget.enabled: false and sets replicas: 2.

Sequence Diagram(s)

sequenceDiagram
    participant Mgr as Manager
    participant CSR as catalogServerRunnable
    participant ReadyChk as Readiness<br/>Checker
    participant HTTPSrv as http.Server
    participant SDGo as Shutdown<br/>Goroutine
    participant Ctx as Context

    Mgr->>CSR: Start(ctx)
    activate CSR
    
    CSR->>SDGo: spawn goroutine
    activate SDGo
    
    CSR->>ReadyChk: close readyzChan
    Note over ReadyChk: Readiness check now passes
    
    CSR->>HTTPSrv: Serve on CatalogAddr
    activate HTTPSrv
    
    Note over HTTPSrv: Serving requests,<br/>TLS from cert watcher
    
    Mgr->>ReadyChk: AddReadyzCheck("catalog-server", ...)
    
    alt Server shutdown scenario
        Ctx->>Ctx: Context cancelled
        SDGo->>Ctx: <ctx.Done() received>
        SDGo->>HTTPSrv: Shutdown(30s timeout)
        HTTPSrv->>HTTPSrv: graceful close
        HTTPSrv-->>SDGo: returns
        SDGo->>CSR: (goroutine exits)
        deactivate SDGo
    end
    
    HTTPSrv-->>CSR: error or nil (closed ignored)
    deactivate HTTPSrv
    CSR-->>Mgr: 
    deactivate CSR

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 50.00%, below the required threshold of 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title 'WIP: OCPBUGS-62517 ha replicas' clearly summarizes the main changes: scaling deployments to HA (high availability) with 2 replicas and adding Pod Disruption Budgets.
  • Linked Issues Check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names — ✅ Passed: No test files were modified in this PR; all changes are to Go source code and manifest files.
  • Test Structure And Quality — ✅ Passed: This PR modifies serverutil.go and manifest/Helm files but contains no Ginkgo test code, making this check not applicable.
  • Microshift Test Compatibility — ✅ Passed: PR contains no new Ginkgo e2e tests; only production code and manifest changes were made.
  • Single Node OpenShift (SNO) Test Compatibility — ✅ Passed: This PR adds no new Ginkgo e2e tests; changes consist of refactoring catalog server registration logic, updating Kubernetes manifests for high availability, and updating Helm chart configuration.
  • Topology-Aware Scheduling Compatibility — ✅ Passed: This PR adjusts replica counts to 2 and enables a PodDisruptionBudget with minAvailable: 1, compatible with all OpenShift topologies without new problematic scheduling constraints.
  • OTE Binary Stdout Contract — ✅ Passed: PR refactors catalog server registration and updates manifest files without introducing process-level stdout writes.
  • IPv6 And Disconnected Network Test Compatibility — ✅ Passed: This PR adds no new Ginkgo e2e tests, so the check is not applicable and passes.


@openshift-ci
Contributor

openshift-ci Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tmshort

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from pedjak and perdasilva April 24, 2026 21:07
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 24, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/catalogd/serverutil/serverutil.go`:
- Around line 67-68: The comment for catalogServerRunnable is stale: it states
"leader-only Runnable" while NeedLeaderElection() returns false (so it runs on
all pods); update the comment above catalogServerRunnable to remove or correct
the "leader-only" wording and accurately describe that the runnable runs on all
pods and lazily binds the catalog HTTP port in Start(), ensuring the comment
matches the behavior of NeedLeaderElection() and Start().
- Around line 78-87: PopulateCache in
internal/operator-controller/catalogmetadata/client/client.go is caching non-200
responses (notably 404s from non-leader pods) which causes permanent failures
during rollouts; update PopulateCache to not cache error responses and to
implement retries for transient/non-leader 404s (use a small retry loop with
exponential backoff and/or attempt alternate endpoints), only writing to the
local cache when a 2xx response is received; keep existing behavior that
Start()/NeedLeaderElection() returns false but ensure PopulateCache retries
across requests before giving up and then surface a final error without
persisting the 404 into the cache.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: def80a5a-2570-4718-a4c6-821b6a09c784

📥 Commits

Reviewing files that changed from the base of the PR and between 9b9eb03 and cf8b48a.

📒 Files selected for processing (7)
  • internal/catalogd/serverutil/serverutil.go
  • openshift/catalogd/manifests-experimental.yaml
  • openshift/catalogd/manifests.yaml
  • openshift/helm/catalogd.yaml
  • openshift/helm/operator-controller.yaml
  • openshift/operator-controller/manifests-experimental.yaml
  • openshift/operator-controller/manifests.yaml

Comment on lines +67 to +68
// catalogServerRunnable is a leader-only Runnable that binds the catalog HTTP port
// lazily inside Start(), so non-leader pods never hold the listen socket.

⚠️ Potential issue | 🟡 Minor

Stale comment contradicts implementation.

The comment says "leader-only Runnable" but NeedLeaderElection() returns false at line 87, meaning this runs on all pods. The comment should be updated to match the implementation.

📝 Proposed fix
-// catalogServerRunnable is a leader-only Runnable that binds the catalog HTTP port
-// lazily inside Start(), so non-leader pods never hold the listen socket.
+// catalogServerRunnable is a Runnable that binds the catalog HTTP port on all pods.
+// NeedLeaderElection returns false so all replicas serve catalog requests immediately.

Comment on lines +78 to +87
// NeedLeaderElection returns false so the catalog server starts on every pod
// immediately, regardless of leadership. This is required for rolling updates:
// if Start() were gated on leadership, a new pod could not win the leader lease
// (held by the still-running old pod) and therefore could never pass the
// catalog-server readiness check, deadlocking the rollout.
//
// Non-leader pods serve the catalog HTTP port but have an empty local cache
// (only the leader's reconciler downloads catalog content), so requests to a
// non-leader return 404. Callers are expected to retry.
func (r *catalogServerRunnable) NeedLeaderElection() bool { return false }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for catalog client code and retry/backoff patterns
rg -n -A5 'catalogd-service|/api/v1|ClusterCatalog' --type go | head -100

# Look for retry logic in catalog client
ast-grep --pattern $'func $FUNC($$$) {
  $$$
  retry($$$)
  $$$
}'

Repository: openshift/operator-framework-operator-controller

Length of output: 6971


🏁 Script executed:

#!/bin/bash
# Find files related to catalog client implementation
fd -e go | xargs rg -l 'catalogd' | grep -E '(client|catalog)' | head -20

# Search for HTTP client code making requests to catalogd
rg -n 'http.*Get|http\.Client' --type go | grep -i catalog | head -20

# Look for where catalog data is fetched
rg -n 'catalogd-service|/api/v1' --type go --no-heading | head -30

Repository: openshift/operator-framework-operator-controller

Length of output: 7441


🏁 Script executed:

#!/bin/bash
# Look for the actual catalog client implementation
find . -path ./test -prune -o -name '*catalog*.go' -type f -print | head -20

# Search for retry logic and backoff patterns
rg -n 'retry|backoff|exponential|MaxRetries' --type go | head -30

Repository: openshift/operator-framework-operator-controller

Length of output: 5753


🏁 Script executed:

#!/bin/bash
# Search in internal/operator-controller for catalog fetching logic
rg -n 'catalogd\|/api/v1/catalogs\|ClusterCatalog' internal/operator-controller --type go -B2 -A5 | head -100

Repository: openshift/operator-framework-operator-controller

Length of output: 74


🏁 Script executed:

#!/bin/bash
# Read the catalog client implementation
cat -n internal/operator-controller/catalogmetadata/client/client.go

# Also check the test file to understand expected behavior
cat -n internal/operator-controller/catalogmetadata/client/client_test.go | head -300

Repository: openshift/operator-framework-operator-controller

Length of output: 17978


Client lacks retry logic for 404 responses from non-leader pods.

The catalog client in internal/operator-controller/catalogmetadata/client/client.go does not implement retry logic. When PopulateCache() receives a non-200 status code (including 404 from non-leader pods), it caches the error without retrying. This means a request that hits a non-leader pod will cache the 404 error permanently rather than retrying against the leader pod, breaking the rolling update mechanism that depends on client retries.


@openshift-ci
Contributor

openshift-ci Bot commented Apr 25, 2026

@tmshort: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/openshift-e2e-aws — commit cf8b48a, required — rerun with /test openshift-e2e-aws
  • ci/prow/e2e-aws-upgrade-ovn-single-node — commit cf8b48a, not required — rerun with /test e2e-aws-upgrade-ovn-single-node

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

