🐛 Fix catalogd HA readiness #2674
Conversation
The catalog HTTP server has `OnlyServeWhenLeader: true`, so only the leader pod should serve catalog content. Previously, `net.Listen` was called eagerly at startup for all pods: the listen socket was bound on non-leaders even though `http.Serve` was never called, causing TCP connections to queue without being served. With replicas > 1 this made ~50% of catalog content requests fail silently.

Replace `manager.Server` with a custom Runnable (`catalogServerRunnable`) in serverutil that:

- Binds the catalog port lazily inside `Start()`, which is only called on the leader by controller-runtime's leader-election machinery.
- Closes a ready channel once the listener is established, and registers a channel-select readiness check via `AddReadyzCheck`, so non-leader pods fail the `/readyz` probe and are excluded from Service endpoints.

This keeps the `cmd/catalogd/main.go` health/readiness setup identical to `cmd/operator-controller/main.go` (`healthz.Ping` for both liveness and readiness); the catalog-server readiness check is an implementation detail of `serverutil.AddCatalogServerToManager`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Pull request overview
Fixes catalogd HA behavior and adds an experimental e2e scenario to exercise multi-replica failover, aiming to prevent catalog serving stalls/unreadiness during upgrades.
Changes:
- Introduces HA-gated e2e steps and a new `@CatalogdHA` feature scenario that deletes the catalogd leader pod and waits for leader re-election.
- Updates experimental manifests/helm values to run `catalogd` and `operator-controller` with 2 replicas.
- Reworks catalogd HTTP server integration to use a custom runnable plus a readiness check.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `internal/catalogd/serverutil/serverutil.go` | Replaces controller-runtime `manager.Server` usage with a custom runnable and a readyz check. |
| `test/e2e/steps/steps.go` | Registers new Godog steps for the HA leader-failover scenario. |
| `test/e2e/steps/hooks.go` | Adds a CatalogdHA feature gate and enables it based on node count. |
| `test/e2e/steps/ha_steps.go` | Implements steps to force-delete the leader pod and detect a newly elected leader. |
| `test/e2e/features/ha.feature` | Adds the `@CatalogdHA` scenario validating that the catalog continues serving after leader disruption. |
| `manifests/experimental.yaml` | Sets catalogd and operator-controller replicas to 2 for experimental installs. |
| `manifests/experimental-e2e.yaml` | Sets catalogd and operator-controller replicas to 2 for experimental-e2e installs. |
| `helm/experimental.yaml` | Sets Helm experimental values to deploy both components with 2 replicas. |
| `Makefile` | Increases the default e2e timeout and bumps the experimental-e2e timeout to 25m. |
The experimental e2e suite uses a 2-node kind cluster, making it a natural fit to validate HA behaviour. Set replicas=2 for both components in helm/experimental.yaml so the experimental and experimental-e2e manifests exercise the multi-replica path end-to-end. This is safe for operator-controller (no leader-only HTTP servers) and for catalogd now that the catalog server starts on all pods via NeedLeaderElection=false, preventing the rolling-update deadlock that would arise if the server were leader-only.

Also adds a @CatalogdHA experimental e2e scenario that force-deletes the catalogd leader pod and verifies that a new leader is elected and the catalog resumes serving. The scenario is gated on a 2-node cluster (detected in BeforeSuite and reflected in the featureGates map), so it is automatically skipped in the standard 1-node e2e suite.

The experimental e2e timeout is bumped from 20m to 25m to accommodate leader re-election time (~163s worst case).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Force-pushed from 8c2f948 to 99db2dc.
Codecov Report ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2674 +/- ##
==========================================
+ Coverage 67.97% 68.00% +0.03%
==========================================
Files 144 144
Lines 10573 10596 +23
==========================================
+ Hits 7187 7206 +19
- Misses 2866 2869 +3
- Partials 520 521 +1
Flags with carried forward coverage won't be shown.
Fixes catalogd to permit multiple replicas. Adds tests to the experimental-e2e suite, which runs on multiple nodes and, consequently, with multiple replicas. This is part of a fix to avoid OLMv1 becoming unready when a cluster is upgraded.
Related to: OCPBUGS-62517
catalogd's catalog HTTP server previously called net.Listen eagerly at startup on every pod, even non-leaders that never called http.Serve. With replicas > 1 this caused ~50% of catalog requests to queue indefinitely in the kernel accept backlog.
Fix: replace manager.Server with a custom catalogServerRunnable that binds the port lazily inside Start() (only called on the leader) and closes a channel to signal readiness. A /readyz check selects on that channel, so non-leader pods fail the probe and are excluded from Service endpoints. cmd/catalogd/main.go health/readiness setup is now identical to cmd/operator-controller/main.go.
With that fix in place, helm/experimental.yaml is updated to set replicas: 2 for both components so the experimental (2-node kind) e2e suite exercises the multi-replica path. A new @CatalogdHA scenario force-deletes the catalogd leader pod and asserts that a new leader is elected and the catalog resumes serving. The scenario is automatically skipped in the standard 1-node suite (gated via BeforeSuite node-count detection in featureGates). The experimental e2e timeout is bumped from 20m to 25m to accommodate worst-case leader re-election (~163s).