Continuous reconcile loop with large bundles causes high CPU usage #2655

@vdanyliv

Bug Report

Description
The operator-controller enters a tight continuous reconcile loop when managing a ClusterExtension with a large bundle (e.g., knative-operator), consuming ~1 full CPU core indefinitely even when the bundle is fully installed and no changes are needed.

Environment

  • operator-controller version: v1.5.1 (also verified against main branch / v1.7.0 source)
  • Kubernetes: k3s v1.34.1
  • ClusterExtension: knative-operator v1.18.0 (large bundle with many CRDs, ClusterRoles, Deployments)

Steps to Reproduce

  1. Install operator-controller
  2. Create a ClusterExtension for a large bundle like knative-operator:
    apiVersion: olm.operatorframework.io/v1
    kind: ClusterExtension
    metadata:
      name: knative-operator
    spec:
      namespace: operators
      serviceAccount:
        name: knative-olm
      source:
        catalog:
          packageName: knative-operator
          version: "1.18.0"
        sourceType: Catalog
  3. Wait for the bundle to be fully installed (status shows Installed: True, Progressing: True, reason: Succeeded)
  4. Observe CPU usage of the operator-controller pod

Expected Behavior

Once the bundle is fully installed and no changes are pending, the controller should be mostly idle, only reconciling when watched resources change externally.

Actual Behavior

The controller reconciles continuously every ~1 second in a tight loop:

"reconcile starting" → "handling finalizers" → "getting installed bundle" → "resolving bundle" → 
"unpacking resolved bundle" → "applying bundle contents" → "watching managed objects" → 
"reconcile ending" → immediately "reconcile starting" again

The pod consumes ~960m CPU continuously. Logs show no errors — every reconcile succeeds.

Root Cause Analysis

The ApplyBundle step calls a.Apply() on every reconcile, which performs a server-side apply of all managed objects and re-establishes watches. For large bundles like knative-operator (which manages dozens of CRDs, ClusterRoles, Deployments, etc.), the act of applying/watching these objects generates watch events that immediately re-trigger the next reconcile.

Key observations from the source code (clusterextension_reconcile_steps.go on main):

  1. No short-circuit in ApplyBundle — there is no check to skip Apply() when the bundle is unchanged and already fully applied
  2. No requeue delay: ReconcileSteps.Reconcile() returns ctrl.Result{} (immediate requeue on watch events), never ctrl.Result{RequeueAfter: ...}
  3. UnpackBundle has a cache optimization (bundleUnchanged check) but ApplyBundle does not have an equivalent
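A short-circuit for ApplyBundle, analogous to the bundleUnchanged check that UnpackBundle already has, could look roughly like the sketch below. This is illustrative only: the names applyState, hashBundle, shouldApply, and markApplied are hypothetical, and the real step wiring lives in clusterextension_reconcile_steps.go.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// resolvedBundle stands in for the resolver output; the real type
// carries the bundle image reference and version (hypothetical).
type resolvedBundle struct {
	Ref     string
	Version string
}

// hashBundle derives a stable fingerprint of the resolved bundle
// (hypothetical helper; any stable identity would work).
func hashBundle(b resolvedBundle) string {
	sum := sha256.Sum256([]byte(b.Ref + "@" + b.Version))
	return hex.EncodeToString(sum[:])
}

// applyState caches what the last successful Apply() acted on,
// mirroring the cache optimization UnpackBundle already has.
type applyState struct {
	appliedBundleHash string
	rolloutSucceeded  bool
}

// shouldApply is the proposed short-circuit: re-run Apply() only when
// the resolved bundle changed or the previous rollout did not succeed.
func (s *applyState) shouldApply(b resolvedBundle) bool {
	return !s.rolloutSucceeded || s.appliedBundleHash != hashBundle(b)
}

// markApplied records a successful apply so the next reconcile can skip it.
func (s *applyState) markApplied(b resolvedBundle) {
	s.appliedBundleHash = hashBundle(b)
	s.rolloutSucceeded = true
}

func main() {
	st := &applyState{}
	b := resolvedBundle{Ref: "example.registry/knative-operator", Version: "1.18.0"}

	fmt.Println(st.shouldApply(b)) // first reconcile: apply
	st.markApplied(b)
	fmt.Println(st.shouldApply(b)) // unchanged bundle: skip
	b.Version = "1.18.1"
	fmt.Println(st.shouldApply(b)) // bundle changed: apply again
}
```

With this in place, the steady-state reconcile would stop at the cache check instead of re-applying every managed object and re-establishing watches.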

Suggested Fix

One or more of:

  • Skip Apply() when bundle is unchanged and fully installed — if rolloutSucceeded was already true in a previous reconcile and the resolved bundle hasn't changed, skip the apply step entirely
  • Add a RequeueAfter delay after a successful reconcile with no changes, to break the tight loop (e.g., ctrl.Result{RequeueAfter: 5 * time.Minute})
  • Deduplicate watch events — avoid re-establishing watches on every reconcile if the set of managed objects hasn't changed

Impact

  • ~1 CPU core wasted per ClusterExtension with a large bundle
  • Increased API server load from continuous server-side applies
  • Unnecessary network traffic and etcd writes
