
Conversation

@tmshort (Contributor) commented Nov 12, 2025

Refactor all controllers to use Patch() instead of Update()
when adding or removing finalizers to improve performance, and to avoid
removing non-cached fields erroneously. Create shared finalizer utilities
to eliminate code duplication across controllers.

This is necessary because we no longer cache the `last-applied-configuration`
annotation, so when we add/remove the finalizers, we are removing that field
from the metadata. This causes issues with clients when they don't see that
annotation (e.g. applying the same ClusterExtension twice).

  • Add shared finalizer.EnsureFinalizer() utilities
  • Update ClusterCatalog, ClusterExtension, and ClusterExtensionRevision
    controllers to use Patch-based finalizer management
  • Maintain early return behavior after adding finalizers on create
  • Remove unused internal/operator-controller/finalizers package
  • Update all unit tests to match new behavior

🤖 Generated with Claude Code

Co-Authored-By: Claude noreply@anthropic.com
Signed-off-by: Todd Short tshort@redhat.com

This also restores the non-caching of "last-applied-config".
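
For orientation, here is a minimal sketch of the Patch-based approach described above. The helper name and shape are illustrative, not the literal contents of internal/shared/util/finalizer, but it follows the same merge-patch pattern that appears in the diff below:

```go
// Sketch only: add a finalizer by patching just metadata.finalizers, so fields
// that are stripped from the cache (e.g. last-applied-configuration) never
// appear in the request and therefore cannot be erased on the server.
package finalizer

import (
	"context"
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// EnsureFinalizer is a hypothetical signature; it returns true if a patch was sent.
func EnsureFinalizer(ctx context.Context, c client.Client, obj client.Object, finalizer string) (bool, error) {
	if controllerutil.ContainsFinalizer(obj, finalizer) {
		return false, nil
	}
	patch := map[string]any{
		"metadata": map[string]any{
			"finalizers": append(obj.GetFinalizers(), finalizer),
			// Optional optimistic-concurrency guard against a stale cached read.
			"resourceVersion": obj.GetResourceVersion(),
		},
	}
	patchJSON, err := json.Marshal(patch)
	if err != nil {
		return false, fmt.Errorf("marshalling patch to add finalizer: %w", err)
	}
	// Patch updates obj with the server response, including ResourceVersion.
	if err := c.Patch(ctx, obj, client.RawPatch(types.MergePatchType, patchJSON)); err != nil {
		return false, fmt.Errorf("adding finalizer: %w", err)
	}
	return true, nil
}
```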

Description

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@tmshort tmshort requested a review from a team as a code owner November 12, 2025 15:54
openshift-ci bot commented Nov 12, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kevinrizza for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

netlify bot commented Nov 12, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit 6e95c7b
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/691e317e58fdaf000875981b
😎 Deploy Preview https://deploy-preview-2328--olmv1.netlify.app

@tmshort (Contributor, Author) commented Nov 12, 2025

The ClusterExtension and ClusterCatalog finalizer code used Update, while the ClusterExtensionRevision used Apply, so the CE and CC finalizer code was rewritten to be like the CER code. There is now common finalizer code shared between them.

@tmshort (Contributor, Author) commented Nov 12, 2025

Ping @joelanford

@tmshort (Contributor, Author) commented Nov 12, 2025

The catalogd pod is panicking... the failure only shows up because of the generated summary; it's not "noticeable" during a regular run!

@pedjak (Contributor) commented Nov 12, 2025

This is necessary because we no longer cache the `last-applied-configuration` annotation, so when we add/remove the finalizers, we are removing that field from the metadata. This causes issues with clients when they don't see that annotation (e.g. applying the same ClusterExtension twice).

If we want to avoid conflicts, we should use patch, but when adding/removing a finalizer we do not touch removed fields (including the annotation). The controller-runtime cache is a write-through cache, so an update or patch operation does not perform any direct modification on the cache - all changes are made by the informer.

In what situations did we see conflicts, given that only a single controller is responsible for a resource type?

@tmshort (Contributor, Author) commented Nov 12, 2025

If we want to avoid conflicts, we should use patch, but when adding/removing a finalizer we do not touch removed fields (including the annotation). The controller-runtime cache is a write-through cache, so an update or patch operation does not perform any direct modification on the cache - all changes are made by the informer.

In what situations did we see conflicts, given that only a single controller is responsible for a resource type?

This is not a conflict issue. The problem is that we are using Update(), which uses the cached version of the resource to make the changes. The cached version no longer includes the last-applied-configuration annotation, so that annotation is removed on the Update().

This causes a problem with e.g. kubectl applying a ClusterExtension a second time. If you apply a CE twice without this code (i.e. current main branch), you will see the following:

tshort@cube:~/.../config/samples (main %=)$ kubectl apply -f ce.yaml
clusterextension.olm.operatorframework.io/argocd created
tshort@cube:~/.../config/samples (main %=)$ oc get clusterextension -o yaml
apiVersion: v1
items:
- apiVersion: olm.operatorframework.io/v1
  kind: ClusterExtension
  metadata:
    creationTimestamp: "2025-11-12T15:22:25Z"
    finalizers:
    - olm.operatorframework.io/cleanup-unpack-cache
    - olm.operatorframework.io/cleanup-contentmanager-cache
    generation: 1
    name: argocd
    resourceVersion: "1296"
    uid: b096377d-577e-4d25-8f3d-b0e0b12fe894
...
tshort@cube:~/.../config/samples (main %=)$ kubectl apply -f ce.yaml
Warning: resource clusterextensions/argocd is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
clusterextension.olm.operatorframework.io/argocd configured
tshort@cube:~/.../config/samples (main %=)$

The problem is that the last-applied-configuration annotation is not present, which causes the second kubectl apply command to complain. The expected result is that the second kubectl apply reports clusterextension.olm.operatorframework.io/argocd unchanged.
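
To make the mechanism concrete, here is a hedged sketch contrasting the two calls (the API import path and the helper name are assumptions): the cached object has already had the annotation stripped, so sending the whole object back via Update() drops it on the server, while a merge patch never mentions it.

```go
// Sketch only: why Update() erases the annotation while a merge patch does not.
package example

import (
	"context"

	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	olmv1 "github.com/operator-framework/operator-controller/api/v1" // assumed import path
)

func addFinalizerTwoWays(ctx context.Context, c client.Client, key client.ObjectKey) error {
	// The cached copy no longer carries kubectl.kubernetes.io/last-applied-configuration.
	var ce olmv1.ClusterExtension
	if err := c.Get(ctx, key, &ce); err != nil {
		return err
	}

	// Update() sends the whole cached object; the missing annotation is treated
	// as removed, which is what triggers the kubectl warning shown above.
	controllerutil.AddFinalizer(&ce, "olm.operatorframework.io/cleanup-unpack-cache")
	// err := c.Update(ctx, &ce)

	// A merge patch touches only metadata.finalizers; the annotation stored on
	// the server is left untouched. (In practice the patch would carry the full
	// desired finalizer list, as in the helper sketch above.)
	patch := []byte(`{"metadata":{"finalizers":["olm.operatorframework.io/cleanup-unpack-cache"]}}`)
	return c.Patch(ctx, &ce, client.RawPatch(types.MergePatchType, patch))
}
```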

codecov bot commented Nov 12, 2025

Codecov Report

❌ Patch coverage is 69.40299% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.51%. Comparing base (1355ff7) to head (6e95c7b).

Files with missing lines Patch % Lines
...troller/controllers/clusterextension_controller.go 58.62% 7 Missing and 5 partials ⚠️
...logd/controllers/core/clustercatalog_controller.go 52.38% 5 Missing and 5 partials ⚠️
...controllers/clusterextensionrevision_controller.go 25.00% 2 Missing and 4 partials ⚠️
internal/shared/util/cache/transform.go 76.00% 5 Missing and 1 partial ⚠️
cmd/operator-controller/main.go 81.25% 3 Missing ⚠️
internal/operator-controller/applier/boxcutter.go 33.33% 1 Missing and 1 partial ⚠️
internal/shared/util/finalizer/finalizer.go 93.54% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2328      +/-   ##
==========================================
- Coverage   74.23%   70.51%   -3.72%     
==========================================
  Files          91       92       +1     
  Lines        7239     7255      +16     
==========================================
- Hits         5374     5116     -258     
- Misses       1433     1700     +267     
- Partials      432      439       +7     
Flag Coverage Δ
e2e 44.60% <57.46%> (+0.25%) ⬆️
experimental-e2e 14.58% <35.07%> (-33.91%) ⬇️
unit 58.37% <41.04%> (-0.18%) ⬇️

return false, fmt.Errorf("marshalling patch to add finalizer: %w", err)
}
// Note: Patch will update obj with the server response, including ResourceVersion
if err := c.Patch(ctx, obj, client.RawPatch(types.MergePatchType, patchJSON)); err != nil {
Member commented:

Should we use SSA here?

Contributor commented:

That could be a good improvement; it would simplify the reconcile logic a lot, i.e. we would not need to care about appending/sorting finalizers at all.
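
For reference, a minimal sketch of the SSA variant being discussed, assuming an illustrative field-manager name; it mirrors the unstructured-based pattern visible in the diff further down:

```go
// Sketch only: server-side apply of just the finalizer list. The apply object
// carries only the fields this manager wants to own; because finalizers are a
// set-typed list, the server merges entries per field manager.
package finalizer

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func applyFinalizers(ctx context.Context, c client.Client, obj client.Object, gvk schema.GroupVersionKind, finalizers ...string) error {
	u := &unstructured.Unstructured{}
	u.SetGroupVersionKind(gvk)
	u.SetName(obj.GetName())
	u.SetNamespace(obj.GetNamespace())
	u.SetFinalizers(finalizers)

	return c.Patch(ctx, u, client.Apply,
		client.FieldOwner("operator-controller"), // illustrative manager name
		client.ForceOwnership,
	)
}
```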

@pedjak (Contributor) commented Nov 13, 2025

This is not a conflict issue. The problem is that we are using Update(), which uses the cached version of the resource to make the changes. The cached version no longer includes the last-applied-configuration annotation, so that annotation is removed on the Update().

Thanks for explaining it, that makes sense. Given that, I think we should update the PR description to reflect it; currently we have:

Refactor all controllers to use client.Patch() instead of Update() when adding or removing finalizers to reduce conflicts and improve performance.

Given the explanation above, we are not reducing conflicts or improving performance - we are ensuring correctness, i.e. not removing the annotation.

@tmshort tmshort force-pushed the apply-finalizers branch 2 times, most recently from cf27d6e to bfb7a16 Compare November 13, 2025 16:40
@tmshort tmshort changed the title from "🐛 Use Patch instead of Update for finalizer operations" to "🐛 Use SSA instead of Update for finalizer operations" Nov 13, 2025
@tmshort tmshort requested review from joelanford and pedjak November 13, 2025 16:41
Applier Applier
RevisionStatesGetter RevisionStatesGetter
Finalizers crfinalizer.Finalizers
FinalizerHandlers map[string]FinalizerHandler
Contributor commented:

A controller should set a single finalizer; more than one is strange. I would suggest reverting the change.

Member commented:

I think there is more than one for CEs right now because there are multiple finalizer concerns:

  1. Cleaning up the unpacked bundle image directory
  2. Shutting down the content manager informers (helm-only)

I'm not sure how the current helm-to-boxcutter migration deals with the content manager informer shutdown. Do those informers get cleaned up when boxcutter takes over, or when the CE is deleted (potentially much later)? Ideally we'd shut down the helm-only content manager informers as soon as boxcutter takes over.

Member commented:

(but that's getting a bit outside the scope of this PR)

@tmshort (Contributor, Author) commented Nov 14, 2025:

In this case, there ARE two pre-existing finalizers on the ClusterExtension

@tmshort (Contributor, Author) commented Nov 14, 2025:

@joelanford

I'm not sure how the current helm-to-boxcutter migration deals with the content manager informer shutdown. Do those informers get cleaned up when boxcutter takes over or when the CE is deleted (potentially much later). Ideally we'd shutdown the helm-only content manager informers as soon as boxcutter takes over.

In order to switch from the Helm applier to the Boxcutter applier, the pod has to restart, so everything is effectively shut down. The content-manager finalizer is a no-op under Boxcutter, but the finalizer still needs to be removed.

u.SetGroupVersionKind(gvk)
u.SetName(obj.GetName())
u.SetNamespace(obj.GetNamespace())
u.SetFinalizers(newFinalizers)
Contributor commented:

it has to be u.SetFinalizers(finalizers)

@tmshort (Contributor, Author) commented Nov 14, 2025:

No. This is based on the assumption that the arguments to EnsureFinalizers represent the only finalizers we want in the list. That is not the intent.

The EnsureFinalizers API ensures that the given finalizers are present. The intent is not to set the finalizers to only be those given, and is not intended to remove finalizers.

Because this is a list, we can't patch individual items into the list; we need to specify the whole list (at least that's how the SSA API works). I verified this in the code by calling EnsureFinalizers() separately for each finalizer string, rather than in bulk. Only the last finalizer showed up when it was u.SetFinalizers(finalizers). Both finalizers showed up when the code was as above.

Yes, I realize the name EnsureFinalizers is a bit confusing, and should probably be AddFinalizers, but I was following the naming scheme (and behavior) of the removed code, which was inconsistent.

Contributor commented:

Because this is a list, we can't patch individual items into the list, we need to specify the whole list (at least that's how the SSA API is working).

I need to disagree here, because Finalizers is defined as a set with merge patch strategy:

https://github.com/kubernetes/apimachinery/blob/master/pkg/apis/meta/v1/types.go#L256-L259

Therefore, we do not patch item by item, we set all finalizers we need to own and make SSA requests.

BTW, my understanding here is that finalizers argument contains all finalizers we would like to set, not just one.

@tmshort (Contributor, Author) commented Nov 14, 2025:

Therefore, we do not patch item by item, we set all finalizers we need to own and make SSA requests.

Agreed, that's how SSA works, but that's not how the API layer above it has to work.

The function as defined adds finalizers, that's the intent. The intent is not to make the finalizer list equal to the arguments.

I'm renaming the function to indicate the desired (existing) behavior. It is confusing to have the mechanism to remove finalizers as "ensuring nothing".

If we wanted the behavior as you suggest, then SetFinalizers() would be more appropriate. But it would be at the cost of flexibility, and potential issues if other controllers add their own finalizers.

}

// Update the passed object with the new finalizers
obj.SetFinalizers(newFinalizers)
Contributor commented:

it has to be `obj.SetFinalizers(u.GetFinalizers())`


// RemoveFinalizer removes one or more finalizers from the object using server-side apply.
// If none of the finalizers exist, this is a no-op.
func RemoveFinalizer(ctx context.Context, c client.Client, obj client.Object, finalizers ...string) error {
Contributor commented:

no need to have this function because EnsureFinalizer(ctx, c, obj) is going to remove previously added finalizers that we owned.

Contributor (Author) commented:

The intent of EnsureFinalizers is to only add finalizers to the list of finalizers, not to remove them. The RemoveFinalizers removes finalizers individually.

Contributor commented:

In the SSA approach, you remove things by omitting them from the request, if you owned them previously. Hence, if we had somewhere in the code:

EnsureFinalizers(ctx, client, ce, "foo", "bar")

to remove bar, we can perform:

EnsureFinalizers(ctx, client, ce, "foo")

Contributor (Author) commented:

But that is not the intent of the API; it's to add items individually, not to remove them. Just because the underlying mechanism is SSA doesn't mean the layered API has to behave that way.

IMHO, it's confusing to have "Ensure" remove items; "Ensure" means to check that something is true or correct. In this case "Ensure" is confusing, so I'm renaming the function.

@tmshort (Contributor, Author) commented Nov 14, 2025

After digging into the solutions (Add/Ensure/Remove/Update/Set), none of them are optimal. Specifically, if another controller adds a finalizer to our resources, we see it, and that messes with our finalizer management. This means we hit the API server with an SSA more than we should.

If a single-function mechanism (e.g. Set/Ensure) is used, then on every reconcile we'll see the extra finalizer and attempt to set the finalizers to the input list (which may be empty), which attempts to remove the extra finalizer; because we don't own it, it will never be removed. So any checks to ensure the finalizers are set as we expect will fail, forcing an API server call.

I prefer the explicit add and remove mechanism, as in this current PR. It explicitly checks the current list of finalizers and only cares about the ones that are input. It will use the existing list of finalizers to add/remove, and will only issue an SSA if changes are needed; it will ignore the extra finalizer. However, in the unlikely case where the cache is out of sync, we will hit the API server with an SSA until the cache is synced. That's better than every reconcile, as in the single-function case.

EDIT: All our finalizers use the same prefix, and other controllers' finalizers should use a different prefix. So, now that I've thought about it some more, I think a single Ensure function will work. However, sorting is still required to check whether the finalizers are in the right state, so that we reduce the number of API server calls.
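
A hedged sketch of the guard described in this comment (the prefix constant and helper name are illustrative): compare only our own, prefix-matched finalizers against the desired set, and skip the API call when they already match.

```go
// Sketch only: decide whether an SSA request is needed at all. Finalizers set
// by other controllers (different prefix) are ignored, so they never force an
// extra API call; sorting makes the comparison order-independent.
package finalizer

import (
	"slices"
	"strings"
)

const ourPrefix = "olm.operatorframework.io/" // illustrative

func needsApply(current, desired []string) bool {
	var ours []string
	for _, f := range current {
		if strings.HasPrefix(f, ourPrefix) {
			ours = append(ours, f)
		}
	}
	slices.Sort(ours)

	want := slices.Clone(desired)
	slices.Sort(want)

	return !slices.Equal(ours, want)
}
```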

@pedjak (Contributor) commented Nov 17, 2025

@tmshort apologies, I did not mention it earlier: moving to SSA also requires migrating the information stored under .metadata.managedFields - otherwise field removal would not be possible; see for example:

kubernetes/kubernetes#99003

In our case, it is easy to reproduce:

  • Deploy OLM from the main branch and create a ClusterExtension. Under .metadata.managedFields we now have:
- apiVersion: olm.operatorframework.io/v1
  fieldsType: FieldsV1
  fieldsV1:
    f:metadata:
      f:finalizers:
        .: {}
        v:"olm.operatorframework.io/cleanup-contentmanager-cache": {}
        v:"olm.operatorframework.io/cleanup-unpack-cache": {}
  manager: operator-controller
  operation: Update

This is expected, because we added the finalizers via an Update call.

  • Now, build and deploy OLM using this PR and try to remove the existing ClusterExtension. The operation is not going to be successful, even though there are no errors to be seen in the logs. The controller is not able to remove the finalizers, because we now use the Apply operation, and the combination of manager and operation determines the field owner. Clearly, given that we do not provide the finalizers, we are not able to remove them because we are not the owner.

To fix that we would need to migrate the ownership under .metadata.managedFields. However, that looks like much more work than the underlying cause warrants: removal of the last-applied-configuration annotation to save a bit of memory.

Instead of going through the SSA migration, I would suggest that we revert the removal of that annotation as the quick fix, and then introduce SSA later if needed, with detailed testing.
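
If the SSA route is ever revisited, client-go ships a csaupgrade helper aimed at exactly this Update-to-Apply ownership migration. The following is a rough sketch under the assumption that the helper's signature matches the vendored client-go version (worth verifying), not a tested migration:

```go
// Sketch only: promote managedFields entries recorded by manager
// "operator-controller" with operation Update to the same manager under
// operation Apply, so a later SSA request can remove the finalizers it owns.
package migration

import (
	"context"

	"k8s.io/apimachinery/pkg/util/sets"
	"k8s.io/client-go/util/csaupgrade"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func migrateFieldOwnership(ctx context.Context, c client.Client, obj client.Object) error {
	// Rewrites obj's .metadata.managedFields in memory.
	if err := csaupgrade.UpgradeManagedFields(
		obj,
		sets.New("operator-controller"), // previous Update-based manager(s)
		"operator-controller",           // SSA field manager that should own the fields
	); err != nil {
		return err
	}
	// Persist the rewritten managedFields.
	return c.Update(ctx, obj)
}
```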

@tmshort (Contributor, Author) commented Nov 17, 2025

/hold
Yes, this does not take into account migration of pre-existing resources. I could switch it back to using Patch(), as moving to SSA with existing fields will give us some trouble.

That being said, there is currently a failure in test-experimental-e2e which seems to be triggered by this change, since I'm not seeing the issue on main. I would need to fix that before this could be merged.

I will point out that managed fields is not a "bit of memory", but can double the size of the cache. The memory savings was significant.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 17, 2025
@pedjak (Contributor) commented Nov 17, 2025

I will point out that managed fields is not a "bit of memory", but can double the size of the cache. The memory savings was significant.

I agree, but managed fields are re-added to the cache with #2318

@tmshort (Contributor, Author) commented Nov 17, 2025

I will point out that managed fields is not a "bit of memory", but can double the size of the cache. The memory savings was significant.

I agree, but managed fields are readded in the cache with #2318

Whoops, I meant last-applied-config.

@tmshort tmshort changed the title from "🐛 Use SSA instead of Update for finalizer operations" to "WIP: Use SSA instead of Update for finalizer operations" Nov 18, 2025
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 18, 2025
@tmshort (Contributor, Author) commented Nov 18, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 18, 2025
@tmshort tmshort force-pushed the apply-finalizers branch 2 times, most recently from cb203de to 385295c Compare November 19, 2025 16:55
tmshort and others added 14 commits November 19, 2025 15:40
Refactor all controllers to use Patch() instead of Update()
when adding or removing finalizers to improve performance, and to avoid
removing non-cached fields erroneously. Create shared finalizer utilities
to eliminate code duplication across controllers.

This is necessary because we no longer cache the `last-applied-configuration`
annotation, so when we add/remove the finalizers, we are removing that field
from the metadata. This causes issues with clients when they don't see that
annotation (e.g. apply the same ClusterExtension twice).

- Add shared finalizer.EnsureFinalizer() utilities
- Update ClusterCatalog, ClusterExtension, and ClusterExtensionRevision
  controllers to use Patch-based finalizer management
- Maintain early return behavior after adding finalizers on create
- Remove unused internal/operator-controller/finalizers package
- Update all unit tests to match new behavior

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
Signed-off-by: Todd Short <tshort@redhat.com>
@tmshort tmshort changed the title from "WIP: Use SSA instead of Update for finalizer operations" to "🌱 Use Patch instead of Update for finalizer operations" Nov 19, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 19, 2025