🐛 fix: allow reconciliation of deadline-exceeded ClusterObjectSets #2643

Open
joelanford wants to merge 1 commit into operator-framework:main from joelanford:fix/cos-deadline-exceeded-archival

Conversation

joelanford commented Apr 11, 2026

Description

The skipProgressDeadlineExceededPredicate blocked all update events for COS objects
with ProgressDeadlineExceeded, which prevented archival of stuck revisions — the
lifecycle state patch was silently dropped.

This PR:

  • Removes the predicate so all COS events are fully reconciled
  • Updates markAsProgressing to set ProgressDeadlineExceeded instead of
    RollingOut/Retrying when the deadline has been exceeded, preventing the reconcile
    loop the predicate was masking. The Succeeded reason always applies, and
    unregistered reasons cause a panic
  • Continues reconciling after ProgressDeadlineExceeded rather than clearing the
    error and stopping requeue. This allows revisions to recover if a transient error
    resolves itself, even after the deadline was exceeded
  • Extracts durationUntilDeadline as a shared helper for deadline computation
  • Adds a deadlineAwareRateLimiter that caps exponential backoff at the deadline so
    ProgressDeadlineExceeded is set promptly even during error retries
  • Adds an e2e test that creates a COS with a never-ready deployment, waits for
    ProgressDeadlineExceeded, archives the COS, and verifies resource cleanup

Addresses feedback from:
#2610 (comment)

Reviewer Checklist

  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

Copilot AI review requested due to automatic review settings April 11, 2026 17:08
netlify bot commented Apr 11, 2026

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 46dcf54
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/69da876c06d48a00089c7da7
😎 Deploy Preview https://deploy-preview-2643--olmv1.netlify.app
openshift-ci bot commented Apr 11, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign joelanford for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot AI left a comment

Pull request overview

This PR fixes a reconciliation dead-end where ClusterObjectSet (COS) updates were being dropped once a revision hit ProgressDeadlineExceeded, preventing stuck revisions from being archived and cleaned up. It removes the update-blocking predicate, makes progress-deadline handling “sticky” in status updates, and adds a deadline-aware rate limiter plus an E2E scenario to validate archival cleanup.

Changes:

  • Remove the ProgressDeadlineExceeded-skipping watch predicate and introduce a controller RateLimiter that caps exponential backoff at the progress deadline.
  • Refactor progress-deadline computation into a shared durationUntilDeadline helper and adjust progressing/retrying status updates to prefer ProgressDeadlineExceeded once exceeded.
  • Add an E2E scenario (and step helper) that archives a deadline-exceeded COS and verifies resource cleanup.
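
The "sticky" reason selection summarized above can be illustrated with a small stand-alone function. This is a sketch under assumptions: effectiveReason is a hypothetical helper, and the set of registered reasons is inferred from the PR description rather than taken from the controller code.

```go
package main

import "fmt"

const (
	reasonRollingOut               = "RollingOut"
	reasonRetrying                 = "Retrying"
	reasonSucceeded                = "Succeeded"
	reasonProgressDeadlineExceeded = "ProgressDeadlineExceeded"
)

// effectiveReason (hypothetical) picks the reason to record on the Progressing
// condition. Succeeded always applies. RollingOut and Retrying are replaced by
// ProgressDeadlineExceeded once the deadline has passed. Any other reason is
// treated as unregistered and panics.
func effectiveReason(requested string, deadlineExceeded bool) string {
	switch requested {
	case reasonSucceeded:
		return requested
	case reasonRollingOut, reasonRetrying:
		if deadlineExceeded {
			return reasonProgressDeadlineExceeded
		}
		return requested
	default:
		panic(fmt.Sprintf("unregistered progressing reason %q", requested))
	}
}

func main() {
	fmt.Println(effectiveReason(reasonRetrying, false))  // Retrying
	fmt.Println(effectiveReason(reasonRetrying, true))   // ProgressDeadlineExceeded
	fmt.Println(effectiveReason(reasonSucceeded, true))  // Succeeded
}
```

Because the reason no longer flips back to RollingOut or Retrying after the deadline, repeated reconciles write the same status and do not trigger a fresh update event each time.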

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| internal/operator-controller/controllers/clusterobjectset_controller.go | Removes the skip predicate, adds deadline-aware rate limiting, refactors deadline computation, and changes progressing/retrying condition behavior when the deadline is exceeded. |
| test/e2e/steps/steps.go | Adds a new Godog step to patch a COS lifecycle state to Archived. |
| test/e2e/features/revision.feature | Adds an E2E scenario that forces ProgressDeadlineExceeded, archives the COS, and asserts resources are removed. |

drift := 2 * time.Second
requeueAfter := (remaining + drift).Round(time.Second)
l.Info(fmt.Sprintf("ProgressDeadline not exceeded, requeue after ~%v to check again.", requeueAfter))
res = ctrl.Result{RequeueAfter: requeueAfter}
Comment on lines 24 to 35
"k8s.io/utils/clock"
"pkg.package-operator.run/boxcutter"
"pkg.package-operator.run/boxcutter/machinery"
machinerytypes "pkg.package-operator.run/boxcutter/machinery/types"
"pkg.package-operator.run/boxcutter/ownerhandling"
"pkg.package-operator.run/boxcutter/probing"
"k8s.io/client-go/util/workqueue"
ctrl "sigs.k8s.io/controller-runtime"
"sigs.k8s.io/controller-runtime/pkg/builder"
"sigs.k8s.io/controller-runtime/pkg/client"
"sigs.k8s.io/controller-runtime/pkg/controller"
"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
backoff := r.delegate.When(item)

cos := &ocv1.ClusterObjectSet{}
if err := r.client.Get(context.Background(), item.NamespacedName, cos); err != nil {
When ClusterObjectSet "${COS_NAME}" lifecycle is set to "Archived"
Then ClusterObjectSet "${COS_NAME}" is archived
And resource "configmap/test-configmap" is eventually not found
And resource "deployment/test-deployment" is eventually not found
It does not seem to be testing the same scenario, @joelanford.
Could we ensure the same scenario is covered here?

  1. User installs a ClusterExtension. The CE controller creates COS-rev-1.
  2. COS-rev-1 gets stuck (e.g. a Deployment never becomes ready). After ProgressDeadlineMinutes, the reconciler sets Progressing=False/ProgressDeadlineExceeded.
  3. User updates the ClusterExtension. The CE controller creates COS-rev-2.
  4. COS-rev-2 rolls out successfully. It patches COS-rev-1 with lifecycleState: Archived so the old revision gets cleaned up.
  5. The watch predicate sees COS-rev-1 has ProgressDeadlineExceeded and drops the event.
  6. COS-rev-1 never reconciles, never processes the archival, and stays stuck forever.
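
Steps 4 to 6 can be reproduced in miniature. The sketch below uses simplified stand-in types rather than the real ClusterObjectSet API or controller-runtime predicates; the struct, its field names, and the "Active" lifecycle value are illustrative assumptions.

```go
package main

import "fmt"

// cos is a simplified, hypothetical stand-in for a ClusterObjectSet.
type cos struct {
	name              string
	lifecycleState    string
	progressingReason string
}

// skipDeadlineExceeded mimics the removed watch predicate: it returns false
// (drop the update event) whenever the object reports ProgressDeadlineExceeded,
// regardless of what else changed in the update.
func skipDeadlineExceeded(updated cos) bool {
	return updated.progressingReason != "ProgressDeadlineExceeded"
}

func main() {
	// rev-1 is stuck and has already exceeded its progress deadline.
	rev1 := cos{
		name:              "rev-1",
		lifecycleState:    "Active", // assumed placeholder value
		progressingReason: "ProgressDeadlineExceeded",
	}

	// rev-2 rolls out and patches rev-1 to be archived.
	rev1.lifecycleState = "Archived"

	// The predicate looks only at the condition reason, so the archival
	// update never reaches the reconciler.
	fmt.Println(skipDeadlineExceeded(rev1)) // false
}
```

The predicate cannot tell a status-only update from a lifecycle change, which is why the PR removes it instead of refining it.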

Remove the skipProgressDeadlineExceededPredicate that blocked all update
events for COS objects with ProgressDeadlineExceeded. This predicate
prevented archival of stuck revisions because the lifecycle state patch
was dropped as an update event.

To prevent the reconcile loop that the predicate was masking,
markAsProgressing now sets ProgressDeadlineExceeded instead of
RollingOut/Retrying when the deadline has been exceeded. Terminal
reasons (Succeeded) always apply. Unregistered reasons panic.

Continue reconciling after ProgressDeadlineExceeded rather than clearing
the error and stopping requeue. This allows revisions to recover if a
transient error resolves itself, even after the deadline was exceeded.

Extract durationUntilDeadline as a shared helper for deadline
computation. Add a deadlineAwareRateLimiter that caps exponential
backoff at the deadline so ProgressDeadlineExceeded is set promptly
even during error retries.

Add an e2e test that creates a COS with a never-ready deployment, waits
for ProgressDeadlineExceeded, archives the COS, and verifies cleanup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@joelanford joelanford force-pushed the fix/cos-deadline-exceeded-archival branch from a586063 to 46dcf54 Compare April 11, 2026 17:39