Conversation
…tionBudget

With replicas=1, rolling updates cause Available=False on the olm ClusterOperator because there is a window where no pod is ready. Setting replicas=2 ensures at least one pod remains available throughout the rolling update (maxUnavailable=0, maxSurge=1), keeping the ClusterOperator Available=True. Re-enable the PodDisruptionBudget (minAvailable=1), which was previously disabled because it would deadlock node drains with a single replica. With 2 replicas, the PDB correctly protects against voluntary disruptions while still allowing node drains. This replaces the previous <carry> that set podDisruptionBudget.enabled=false.

Fixes: https://issues.redhat.com/browse/OCPBUGS-62517
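As a sketch, the settings this commit message describes would look roughly like the following manifest fragment. The resource names and selector labels here are assumptions for illustration, not the actual values from the openshift/helm charts:

```yaml
# Hypothetical sketch of the replicas=2 + PDB combination described above;
# the real names and labels in openshift/helm/ may differ.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: catalogd-controller-manager        # assumed name
spec:
  replicas: 2                              # at least one pod stays ready during a rollout
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0                    # never remove a ready pod first
      maxSurge: 1                          # bring the new pod up before dropping the old one
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: catalogd-controller-manager        # assumed name
spec:
  minAvailable: 1                          # safe with 2 replicas: node drains can still proceed
  selector:
    matchLabels:
      control-plane: catalogd-controller-manager   # assumed label
```

With replicas=1 the same PDB would block every voluntary eviction, which is exactly the node-drain deadlock the earlier <carry> avoided by disabling it.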
…azily

catalogd's catalog HTTP server previously called net.Listen eagerly at startup on every pod, even non-leaders that never called http.Serve. With replicas > 1 this caused ~50% of catalog requests to queue indefinitely in the kernel accept backlog. Replace manager.Server with a custom catalogServerRunnable that binds the port lazily inside Start() (called on every pod since NeedLeaderElection returns false) and closes a ready channel to signal readiness. A /readyz check selects on that channel so all pods become ready immediately. Non-leader pods serve requests but return 404 (empty local cache); callers are expected to retry. This avoids the rolling-update deadlock that would occur if Start() were gated on leadership.

Upstream PR: operator-framework/operator-controller#2674
Fixes: https://issues.redhat.com/browse/OCPBUGS-62517
Walkthrough
The catalog server startup is refactored from controller-runtime's

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Mgr as Manager
    participant CSR as catalogServerRunnable
    participant ReadyChk as Readiness<br/>Checker
    participant HTTPSrv as http.Server
    participant SDGo as Shutdown<br/>Goroutine
    participant Ctx as Context
    Mgr->>CSR: Start(ctx)
    activate CSR
    CSR->>SDGo: spawn goroutine
    activate SDGo
    CSR->>ReadyChk: close readyzChan
    Note over ReadyChk: Readiness check now passes
    CSR->>HTTPSrv: Serve on CatalogAddr
    activate HTTPSrv
    Note over HTTPSrv: Serving requests,<br/>TLS from cert watcher
    Mgr->>ReadyChk: AddReadyzCheck("catalog-server", ...)
    alt Server shutdown scenario
        Ctx->>Ctx: Context cancelled
        SDGo->>Ctx: <ctx.Done() received>
        SDGo->>HTTPSrv: Shutdown(30s timeout)
        HTTPSrv->>HTTPSrv: graceful close
        HTTPSrv-->>SDGo: returns
        SDGo->>CSR: (goroutine exits)
        deactivate SDGo
    end
    HTTPSrv-->>CSR: error or nil (closed ignored)
    deactivate HTTPSrv
    CSR-->>Mgr: 
    deactivate CSR
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~28 minutes

🚥 Pre-merge checks: ✅ 11 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (11 passed)
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tmshort

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Details
Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@internal/catalogd/serverutil/serverutil.go`:
- Around line 67-68: The comment for catalogServerRunnable is stale: it states
"leader-only Runnable" while NeedLeaderElection() returns false (so it runs on
all pods); update the comment above catalogServerRunnable to remove or correct
the "leader-only" wording and accurately describe that the runnable runs on all
pods and lazily binds the catalog HTTP port in Start(), ensuring the comment
matches the behavior of NeedLeaderElection() and Start().
- Around line 78-87: PopulateCache in
internal/operator-controller/catalogmetadata/client/client.go is caching non-200
responses (notably 404s from non-leader pods) which causes permanent failures
during rollouts; update PopulateCache to not cache error responses and to
implement retries for transient/non-leader 404s (use a small retry loop with
exponential backoff and/or attempt alternate endpoints), only writing to the
local cache when a 2xx response is received; keep existing behavior that
Start()/NeedLeaderElection() returns false but ensure PopulateCache retries
across requests before giving up and then surface a final error without
persisting the 404 into the cache.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: def80a5a-2570-4718-a4c6-821b6a09c784
📒 Files selected for processing (7)
- internal/catalogd/serverutil/serverutil.go
- openshift/catalogd/manifests-experimental.yaml
- openshift/catalogd/manifests.yaml
- openshift/helm/catalogd.yaml
- openshift/helm/operator-controller.yaml
- openshift/operator-controller/manifests-experimental.yaml
- openshift/operator-controller/manifests.yaml
```go
// catalogServerRunnable is a leader-only Runnable that binds the catalog HTTP port
// lazily inside Start(), so non-leader pods never hold the listen socket.
```
Stale comment contradicts implementation.
The comment says "leader-only Runnable" but NeedLeaderElection() returns false at line 87, meaning this runs on all pods. The comment should be updated to match the implementation.
📝 Proposed fix

```diff
-// catalogServerRunnable is a leader-only Runnable that binds the catalog HTTP port
-// lazily inside Start(), so non-leader pods never hold the listen socket.
+// catalogServerRunnable is a Runnable that binds the catalog HTTP port on all pods.
+// NeedLeaderElection returns false so all replicas serve catalog requests immediately.
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-// catalogServerRunnable is a leader-only Runnable that binds the catalog HTTP port
-// lazily inside Start(), so non-leader pods never hold the listen socket.
+// catalogServerRunnable is a Runnable that binds the catalog HTTP port on all pods.
+// NeedLeaderElection returns false so all replicas serve catalog requests immediately.
```
```go
// NeedLeaderElection returns false so the catalog server starts on every pod
// immediately, regardless of leadership. This is required for rolling updates:
// if Start() were gated on leadership, a new pod could not win the leader lease
// (held by the still-running old pod) and therefore could never pass the
// catalog-server readiness check, deadlocking the rollout.
//
// Non-leader pods serve the catalog HTTP port but have an empty local cache
// (only the leader's reconciler downloads catalog content), so requests to a
// non-leader return 404. Callers are expected to retry.
func (r *catalogServerRunnable) NeedLeaderElection() bool { return false }
```
🧩 Analysis chain

🏁 Script executed:

```shell
#!/bin/bash
# Search for catalog client code and retry/backoff patterns
rg -n -A5 'catalogd-service|/api/v1|ClusterCatalog' --type go | head -100
# Look for retry logic in catalog client
ast-grep --pattern $'func $FUNC($$$) {
  $$$
  retry($$$)
  $$$
}'
```

Repository: openshift/operator-framework-operator-controller
Length of output: 6971
🏁 Script executed:

```shell
#!/bin/bash
# Find files related to catalog client implementation
fd -e go | xargs rg -l 'catalogd' | grep -E '(client|catalog)' | head -20
# Search for HTTP client code making requests to catalogd
rg -n 'http.*Get|http\.Client' --type go | grep -i catalog | head -20
# Look for where catalog data is fetched
rg -n 'catalogd-service|/api/v1' --type go --no-heading | head -30
```

Repository: openshift/operator-framework-operator-controller
Length of output: 7441
🏁 Script executed:

```shell
#!/bin/bash
# Look for the actual catalog client implementation
find . -path ./test -prune -o -name '*catalog*.go' -type f -print | head -20
# Search for retry logic and backoff patterns
rg -n 'retry|backoff|exponential|MaxRetries' --type go | head -30
```

Repository: openshift/operator-framework-operator-controller
Length of output: 5753
🏁 Script executed:

```shell
#!/bin/bash
# Search in internal/operator-controller for catalog fetching logic
rg -n 'catalogd\|/api/v1/catalogs\|ClusterCatalog' internal/operator-controller --type go -B2 -A5 | head -100
```

Repository: openshift/operator-framework-operator-controller
Length of output: 74
🏁 Script executed:

```shell
#!/bin/bash
# Read the catalog client implementation
cat -n internal/operator-controller/catalogmetadata/client/client.go
# Also check the test file to understand expected behavior
cat -n internal/operator-controller/catalogmetadata/client/client_test.go | head -300
```

Repository: openshift/operator-framework-operator-controller
Length of output: 17978
Client lacks retry logic for 404 responses from non-leader pods.
The catalog client in internal/operator-controller/catalogmetadata/client/client.go does not implement retry logic. When PopulateCache() receives a non-200 status code (including 404 from non-leader pods), it caches the error without retrying. This means a request that hits a non-leader pod will cache the 404 error permanently rather than retrying against the leader pod, breaking the rolling update mechanism that depends on client retries.
@tmshort: The following tests failed, say

Full PR test history. Your PR dashboard.

Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Summary by CodeRabbit
New Features
Chores