
WIP: OCPBUGS-62517 ha replicas#709

Open
tmshort wants to merge 2 commits into openshift:main from tmshort:ocpbugs-62517-ha-replicas

Conversation

@tmshort
Contributor

@tmshort tmshort commented Apr 24, 2026

Summary by CodeRabbit

  • New Features

    • Added Pod Disruption Budgets for catalogd and operator-controller deployments to ensure at least one pod remains available during service disruptions, improving availability and resilience.
  • Chores

    • Increased replica counts from 1 to 2 for both catalogd-controller-manager and operator-controller-controller-manager deployments.
    • Updated Helm chart configurations to reflect new replica settings.

tmshort added 2 commits April 23, 2026 15:40
…tionBudget

With replicas=1, rolling updates cause Available=False on the olm
ClusterOperator because there is a window where no pod is ready.
Setting replicas=2 ensures at least one pod remains available throughout
the rolling update (maxUnavailable=0, maxSurge=1), keeping the
ClusterOperator Available=True.

Re-enable the PodDisruptionBudget (minAvailable=1) which was previously
disabled because it would deadlock node drains with a single replica.
With 2 replicas, the PDB correctly protects against voluntary disruptions
while still allowing node drains.

This replaces the previous <carry> that set podDisruptionBudget.enabled=false.

Fixes: https://issues.redhat.com/browse/OCPBUGS-62517
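
The rollout settings and re-enabled PDB described in this commit message can be sketched as manifest fragments like the following. The replicas, strategy, and minAvailable values come from the commit message and change summary; the metadata names and selector labels are illustrative and may differ from the actual files in openshift/catalogd/:

```yaml
# Sketch of the rollout-safe Deployment settings described above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalogd-controller-manager
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never take the last ready pod away
      maxSurge: 1         # bring the new pod up before removing an old one
---
# Re-enabled PDB: with two replicas, a node drain can evict one pod
# without dropping below minAvailable, so drains no longer deadlock.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: catalogd-controller-manager
spec:
  minAvailable: 1
  selector:
    matchLabels:            # selector labels are illustrative
      control-plane: catalogd-controller-manager
```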
…azily

catalogd's catalog HTTP server previously called net.Listen eagerly at
startup on every pod, even non-leaders that never called http.Serve. With
replicas > 1 this caused ~50% of catalog requests to queue indefinitely in
the kernel accept backlog.

Replace manager.Server with a custom catalogServerRunnable that binds the
port lazily inside Start() (called on every pod since NeedLeaderElection
returns false) and closes a ready channel to signal readiness. A /readyz
check selects on that channel so all pods become ready immediately. Non-leader
pods serve requests but return 404 (empty local cache); callers are expected
to retry. This avoids the rolling-update deadlock that would occur if Start()
were gated on leadership.

Upstream PR: operator-framework/operator-controller#2674

Fixes: https://issues.redhat.com/browse/OCPBUGS-62517
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 24, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 24, 2026

Walkthrough

The catalog server startup is refactored from controller-runtime's manager.Server to a custom catalogServerRunnable, with every pod now listening immediately (via NeedLeaderElection() == false). Readiness checks are added, and shutdown is handled by a goroutine. Simultaneously, both catalogd and operator-controller deployments are scaled from 1 to 2 replicas with PodDisruptionBudgets enforcing minAvailable: 1.

Changes

Cohort / File(s) Summary
Catalog Server Refactor
internal/catalogd/serverutil/serverutil.go
Replaces manager.Server (leader-only gating) with custom catalogServerRunnable. Start() now closes a readiness channel before serving, spawns a shutdown goroutine monitoring context cancellation, and ignores http.ErrServerClosed. Parameter renamed from tlsFileWatcher to cw.
Catalogd Deployment Scaling & PDB
openshift/catalogd/manifests.yaml, openshift/catalogd/manifests-experimental.yaml
Increases catalogd-controller-manager Deployment replicas from 1 to 2 and adds PodDisruptionBudget with minAvailable: 1.
Catalogd Helm Configuration
openshift/helm/catalogd.yaml
Removes podDisruptionBudget.enabled: false and sets replicas: 2.
Operator-Controller Deployment Scaling & PDB
openshift/operator-controller/manifests.yaml, openshift/operator-controller/manifests-experimental.yaml
Increases operator-controller-controller-manager Deployment replicas from 1 to 2 and adds PodDisruptionBudget with minAvailable: 1.
Operator-Controller Helm Configuration
openshift/helm/operator-controller.yaml
Removes podDisruptionBudget.enabled: false and sets replicas: 2.

Sequence Diagram(s)

sequenceDiagram
    participant Mgr as Manager
    participant CSR as catalogServerRunnable
    participant ReadyChk as Readiness<br/>Checker
    participant HTTPSrv as http.Server
    participant SDGo as Shutdown<br/>Goroutine
    participant Ctx as Context

    Mgr->>CSR: Start(ctx)
    activate CSR
    
    CSR->>SDGo: spawn goroutine
    activate SDGo
    
    CSR->>ReadyChk: close readyzChan
    Note over ReadyChk: Readiness check now passes
    
    CSR->>HTTPSrv: Serve on CatalogAddr
    activate HTTPSrv
    
    Note over HTTPSrv: Serving requests,<br/>TLS from cert watcher
    
    Mgr->>ReadyChk: AddReadyzCheck("catalog-server", ...)
    
    alt Server shutdown scenario
        Ctx->>Ctx: Context cancelled
        SDGo->>Ctx: <ctx.Done() received>
        SDGo->>HTTPSrv: Shutdown(30s timeout)
        HTTPSrv->>HTTPSrv: graceful close
        HTTPSrv-->>SDGo: returns
        SDGo->>CSR: (goroutine exits)
        deactivate SDGo
    end
    
    HTTPSrv-->>CSR: error or nil (closed ignored)
    deactivate HTTPSrv
    CSR-->>Mgr: 
    deactivate CSR

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~28 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 50.00%, below the required threshold of 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: The title 'WIP: OCPBUGS-62517 ha replicas' clearly summarizes the main changes: scaling deployments to HA (high availability) with 2 replicas and adding Pod Disruption Budgets.
  • Linked Issues Check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check — ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names — ✅ Passed: No test files were modified in this PR; all changes are to Go source code and manifest files.
  • Test Structure And Quality — ✅ Passed: This PR modifies serverutil.go and manifest/Helm files but contains no Ginkgo test code, making this check not applicable.
  • Microshift Test Compatibility — ✅ Passed: PR contains no new Ginkgo e2e tests; only production code and manifest changes were made.
  • Single Node OpenShift (SNO) Test Compatibility — ✅ Passed: This PR adds no new Ginkgo e2e tests; changes consist of refactoring catalog server registration logic, updating Kubernetes manifests for high availability, and updating Helm chart configuration.
  • Topology-Aware Scheduling Compatibility — ✅ Passed: This PR adjusts replica counts to 2 and enables a PodDisruptionBudget with minAvailable: 1, compatible with all OpenShift topologies without new problematic scheduling constraints.
  • OTE Binary Stdout Contract — ✅ Passed: PR refactors catalog server registration and updates manifest files without introducing process-level stdout writes.
  • IPv6 And Disconnected Network Test Compatibility — ✅ Passed: This PR adds no new Ginkgo e2e tests, so the check is not applicable and passes.


@openshift-ci
Contributor

openshift-ci Bot commented Apr 24, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tmshort

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot requested review from pedjak and perdasilva April 24, 2026 21:07
@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 24, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@internal/catalogd/serverutil/serverutil.go`:
- Around line 67-68: The comment for catalogServerRunnable is stale: it states
"leader-only Runnable" while NeedLeaderElection() returns false (so it runs on
all pods); update the comment above catalogServerRunnable to remove or correct
the "leader-only" wording and accurately describe that the runnable runs on all
pods and lazily binds the catalog HTTP port in Start(), ensuring the comment
matches the behavior of NeedLeaderElection() and Start().
- Around line 78-87: PopulateCache in
internal/operator-controller/catalogmetadata/client/client.go is caching non-200
responses (notably 404s from non-leader pods) which causes permanent failures
during rollouts; update PopulateCache to not cache error responses and to
implement retries for transient/non-leader 404s (use a small retry loop with
exponential backoff and/or attempt alternate endpoints), only writing to the
local cache when a 2xx response is received; keep existing behavior that
Start()/NeedLeaderElection() returns false but ensure PopulateCache retries
across requests before giving up and then surface a final error without
persisting the 404 into the cache.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: def80a5a-2570-4718-a4c6-821b6a09c784

📥 Commits

Reviewing files that changed from the base of the PR and between 9b9eb03 and cf8b48a.

📒 Files selected for processing (7)
  • internal/catalogd/serverutil/serverutil.go
  • openshift/catalogd/manifests-experimental.yaml
  • openshift/catalogd/manifests.yaml
  • openshift/helm/catalogd.yaml
  • openshift/helm/operator-controller.yaml
  • openshift/operator-controller/manifests-experimental.yaml
  • openshift/operator-controller/manifests.yaml

Comment on lines +67 to +68
// catalogServerRunnable is a leader-only Runnable that binds the catalog HTTP port
// lazily inside Start(), so non-leader pods never hold the listen socket.

⚠️ Potential issue | 🟡 Minor

Stale comment contradicts implementation.

The comment says "leader-only Runnable" but NeedLeaderElection() returns false at line 87, meaning this runs on all pods. The comment should be updated to match the implementation.

📝 Proposed fix
-// catalogServerRunnable is a leader-only Runnable that binds the catalog HTTP port
-// lazily inside Start(), so non-leader pods never hold the listen socket.
+// catalogServerRunnable is a Runnable that binds the catalog HTTP port on all pods.
+// NeedLeaderElection returns false so all replicas serve catalog requests immediately.

Comment on lines +78 to +87
// NeedLeaderElection returns false so the catalog server starts on every pod
// immediately, regardless of leadership. This is required for rolling updates:
// if Start() were gated on leadership, a new pod could not win the leader lease
// (held by the still-running old pod) and therefore could never pass the
// catalog-server readiness check, deadlocking the rollout.
//
// Non-leader pods serve the catalog HTTP port but have an empty local cache
// (only the leader's reconciler downloads catalog content), so requests to a
// non-leader return 404. Callers are expected to retry.
func (r *catalogServerRunnable) NeedLeaderElection() bool { return false }

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for catalog client code and retry/backoff patterns
rg -n -A5 'catalogd-service|/api/v1|ClusterCatalog' --type go | head -100

# Look for retry logic in catalog client
ast-grep --pattern $'func $FUNC($$$) {
  $$$
  retry($$$)
  $$$
}'

Repository: openshift/operator-framework-operator-controller

Length of output: 6971


🏁 Script executed:

#!/bin/bash
# Find files related to catalog client implementation
fd -e go | xargs rg -l 'catalogd' | grep -E '(client|catalog)' | head -20

# Search for HTTP client code making requests to catalogd
rg -n 'http.*Get|http\.Client' --type go | grep -i catalog | head -20

# Look for where catalog data is fetched
rg -n 'catalogd-service|/api/v1' --type go --no-heading | head -30

Repository: openshift/operator-framework-operator-controller

Length of output: 7441


🏁 Script executed:

#!/bin/bash
# Look for the actual catalog client implementation
find . -path ./test -prune -o -name '*catalog*.go' -type f -print | head -20

# Search for retry logic and backoff patterns
rg -n 'retry|backoff|exponential|MaxRetries' --type go | head -30

Repository: openshift/operator-framework-operator-controller

Length of output: 5753


🏁 Script executed:

#!/bin/bash
# Search in internal/operator-controller for catalog fetching logic
rg -n 'catalogd\|/api/v1/catalogs\|ClusterCatalog' internal/operator-controller --type go -B2 -A5 | head -100

Repository: openshift/operator-framework-operator-controller

Length of output: 74


🏁 Script executed:

#!/bin/bash
# Read the catalog client implementation
cat -n internal/operator-controller/catalogmetadata/client/client.go

# Also check the test file to understand expected behavior
cat -n internal/operator-controller/catalogmetadata/client/client_test.go | head -300

Repository: openshift/operator-framework-operator-controller

Length of output: 17978


Client lacks retry logic for 404 responses from non-leader pods.

The catalog client in internal/operator-controller/catalogmetadata/client/client.go does not implement retry logic. When PopulateCache() receives a non-200 status code (including 404 from non-leader pods), it caches the error without retrying. This means a request that hits a non-leader pod will cache the 404 error permanently rather than retrying against the leader pod, breaking the rolling update mechanism that depends on client retries.


@openshift-ci
Contributor

openshift-ci Bot commented Apr 25, 2026

@tmshort: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

  • ci/prow/openshift-e2e-aws — commit cf8b48a, required — rerun with /test openshift-e2e-aws
  • ci/prow/e2e-aws-upgrade-ovn-single-node — commit cf8b48a, not required — rerun with /test e2e-aws-upgrade-ovn-single-node

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

