🐛 Fix catalogd HA readiness #2674
Conversation
The catalog HTTP server has `OnlyServeWhenLeader: true`, so only the leader pod should serve catalog content. Previously, `net.Listen` was called eagerly at startup for all pods: the listen socket was bound on non-leaders even though `http.Serve` was never called, causing TCP connections to queue without being served. With replicas > 1 this made ~50% of catalog content requests fail silently.

Replace `manager.Server` with a custom Runnable (`catalogServerRunnable`) in serverutil that:

- Binds the catalog port lazily inside `Start()`, which is only called on the leader by controller-runtime's leader-election machinery.
- Closes a ready channel once the listener is established, and registers a channel-select readiness check via `AddReadyzCheck`, so non-leader pods fail the `/readyz` probe and are excluded from Service endpoints.

This keeps the `cmd/catalogd/main.go` health/readiness setup identical to `cmd/operator-controller/main.go` (`healthz.Ping` for both liveness and readiness); the catalog-server readiness check is an implementation detail of `serverutil.AddCatalogServerToManager`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Fixes catalogd HA behavior and adds an experimental e2e scenario to exercise multi-replica failover, aiming to prevent catalog serving stalls/unreadiness during upgrades.
Changes:
- Introduces HA-gated e2e steps and a new `@CatalogdHA` feature scenario that deletes the catalogd leader pod and waits for leader re-election.
- Updates experimental manifests/helm values to run `catalogd` and `operator-controller` with 2 replicas.
- Reworks catalogd HTTP server integration to use a custom runnable plus a readiness check.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `internal/catalogd/serverutil/serverutil.go` | Replaces controller-runtime `manager.Server` usage with a custom runnable and a readyz check. |
| `test/e2e/steps/steps.go` | Registers new Godog steps for the HA leader-failover scenario. |
| `test/e2e/steps/hooks.go` | Adds a CatalogdHA feature gate and enables it based on node count. |
| `test/e2e/steps/ha_steps.go` | Implements steps to force-delete the leader pod and detect a newly elected leader. |
| `test/e2e/features/ha.feature` | Adds the `@CatalogdHA` scenario validating that the catalog continues serving after leader disruption. |
| `manifests/experimental.yaml` | Sets catalogd and operator-controller replicas to 2 for experimental installs. |
| `manifests/experimental-e2e.yaml` | Sets catalogd and operator-controller replicas to 2 for experimental-e2e installs. |
| `helm/experimental.yaml` | Sets Helm experimental values to deploy both components with 2 replicas. |
| `Makefile` | Increases the default e2e timeout and bumps the experimental-e2e timeout to 25m. |
The experimental e2e suite uses a 2-node kind cluster, making it a natural fit to validate HA behaviour. Set replicas=2 for both components in helm/experimental.yaml so the experimental and experimental-e2e manifests exercise the multi-replica path end-to-end. This is safe for operator-controller (no leader-only HTTP servers) and for catalogd now that the catalog server starts on all pods via NeedLeaderElection=false, preventing the rolling-update deadlock that would arise if the server were leader-only.

Also adds a @CatalogdHA experimental e2e scenario that force-deletes the catalogd leader pod and verifies that a new leader is elected and the catalog resumes serving. The scenario is gated on a 2-node cluster (detected in BeforeSuite and reflected in the featureGates map), so it is automatically skipped in the standard 1-node e2e suite.

The experimental e2e timeout is bumped from 20m to 25m to accommodate leader re-election time (~163s worst case).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Force-pushed from 8c2f948 to 99db2dc.
Codecov Report ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2674 +/- ##
==========================================
+ Coverage 67.97% 68.00% +0.03%
==========================================
Files 144 144
Lines 10573 10596 +23
==========================================
+ Hits 7187 7206 +19
- Misses 2866 2869 +3
- Partials 520 521 +1
Flags with carried forward coverage won't be shown.
Fixes catalogd to permit multiple replicas. Adds tests to the experimental-e2e suite, which runs on multiple nodes and, consequently, with multiple replicas. This is part of a fix to avoid OLMv1 becoming unready when a cluster is upgraded.
Related to: OCPBUGS-62517
catalogd's catalog HTTP server previously called net.Listen eagerly at startup on every pod, even non-leaders that never called http.Serve. With replicas > 1 this caused ~50% of catalog requests to queue indefinitely in the kernel accept backlog.
Fix: replace manager.Server with a custom catalogServerRunnable that binds the port lazily inside Start() (only called on the leader) and closes a channel to signal readiness. A /readyz check selects on that channel, so non-leader pods fail the probe and are excluded from Service endpoints. cmd/catalogd/main.go health/readiness setup is now identical to cmd/operator-controller/main.go.
With that fix in place, helm/experimental.yaml is updated to set replicas: 2 for both components so the experimental (2-node kind) e2e suite exercises the multi-replica path. A new @CatalogdHA scenario force-deletes the catalogd leader pod and asserts that a new leader is elected and the catalog resumes serving. The scenario is automatically skipped in the standard 1-node suite (gated via BeforeSuite node-count detection in featureGates). The experimental e2e timeout is bumped from 20m to 25m to accommodate worst-case leader re-election (~163s).