pkg/controller: Fix panic when creating cluster-scoped RBAC in OG controller #2349

timflannagan · 2021-09-08T23:54:44Z

Description of the change:
Fixes #2091.

This is a follow-up to operator-framework#2309, which attempted to fix the original issue. When checking whether the ClusterRole/ClusterRoleBinding resources already exist, we also check whether the existing resource's labels mark it as owned by the CSV we're currently handling. Accessing the "cr" or "crb" variables returned by the Create calls panics, because we read the meta.Labels key from those resources while the resources themselves are nil. Update the check to verify that the cr/crb variables are not nil before accessing those objects' labels. The fake client used in testing may need to be updated in the future to return these resources properly.
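As a rough illustration of the guard described above, here is a minimal sketch in plain Go. The clusterRole type, the ownerLabel key, and both helper functions are hypothetical stand-ins for the real Kubernetes and ownerutil types, not the actual OLM code. The only point is that the nil check has to short-circuit before any label access.

```go
package main

import "fmt"

// clusterRole is a hypothetical stand-in for rbacv1.ClusterRole; only the
// labels matter for this sketch.
type clusterRole struct {
	Labels map[string]string
}

// ownerLabel is an assumed label key, not necessarily the one OLM uses.
const ownerLabel = "example.com/owner"

// isOwnedByLabel dereferences cr, so calling it with a nil pointer panics.
func isOwnedByLabel(cr *clusterRole, owner string) bool {
	return cr.Labels[ownerLabel] == owner
}

// safeIsOwned checks for nil first; && short-circuits, so isOwnedByLabel is
// never called with a nil pointer.
func safeIsOwned(cr *clusterRole, owner string) bool {
	return cr != nil && isOwnedByLabel(cr, owner)
}

func main() {
	var missing *clusterRole // what a failed Create may hand back
	owned := &clusterRole{Labels: map[string]string{ownerLabel: "my-csv"}}

	fmt.Println("nil object owned:", safeIsOwned(missing, "my-csv"))
	fmt.Println("labeled object owned:", safeIsOwned(owned, "my-csv"))
}
```

Reading a key from a nil map is safe in Go, so an object without labels simply reports as not owned; only the nil pointer itself needs the guard.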

I ran go test ./pkg/controller/operators/olm/... -v -run ^TestSyncOperatorGroups$ -count 250 a couple of times and wasn't able to reproduce the panic. Without these changes, I was able to reproduce it consistently with a much smaller count.

Motivation for the change:
Reduce testing flakes.

Reviewer Checklist

Implementation matches the proposed design, or proposal is updated to match implementation
Sufficient unit test coverage
Sufficient end-to-end test coverage
Docs updated or added to /doc
Commit messages sensible and descriptive

timflannagan · 2021-09-09T18:02:19Z

Evan pointed out in Slack that this implementation may introduce another regression relative to the fix for the original issue.

/hold

timflannagan · 2021-09-14T19:46:38Z

/hold cancel

timflannagan · 2021-09-14T20:01:01Z

/hold

estroz · 2021-09-14T20:32:05Z

pkg/controller/operators/olm/operatorgroup.go

+				if crb != nil && ownerutil.IsOwnedByLabel(crb, csv) {
+					continue
+				}
+				return err


Should this error be returned?

njhale · 2021-09-21T17:40:21Z

I think this needs to return an error if we want !IsOwnedByLabel to cause a resync (i.e. to retry this logic).
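To make the resync point concrete, here is a small sketch of the general pattern, assuming a controller that retries a sync function whenever it returns an error. The names processWithRetry, syncFunc, and errStale are hypothetical, and the loop is a simplification of a real rate-limited work queue.

```go
package main

import (
	"errors"
	"fmt"
)

// errStale is a hypothetical error for "the existing object is not owned by
// the CSV we are handling".
var errStale = errors.New("existing object is not owned by this CSV")

// syncFunc is a hypothetical reconcile step for a single key.
type syncFunc func(key string) error

// processWithRetry calls sync until it returns nil or maxAttempts is reached.
// It returns the number of attempts made and the last error seen.
func processWithRetry(key string, sync syncFunc, maxAttempts int) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = sync(key); err == nil {
			return attempt, nil
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	// The first call sees unexpected cluster state and reports it; the
	// second call succeeds.
	sync := func(string) error {
		calls++
		if calls < 2 {
			return errStale
		}
		return nil
	}

	attempts, err := processWithRetry("operatorgroup", sync, 3)
	fmt.Println("attempts:", attempts, "err:", err)
}
```

If the sync function swallowed errStale and returned nil instead, the loop would stop after the first attempt and the mismatch would never be retried.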

njhale · 2021-09-21T17:51:36Z

pkg/controller/operators/olm/operatorgroup.go

+				if !k8serrors.IsAlreadyExists(err) {
+					return err
+				}
 				// if the CR already exists, but the label is correct, the cache is just behind
-				if k8serrors.IsAlreadyExists(err) && ownerutil.IsOwnedByLabel(cr, csv) {
+				if cr != nil && ownerutil.IsOwnedByLabel(cr, csv) {
 					continue
-				} else {
-					return err
 				}


It looks like @ecordell's original patch effectively returns any error if !IsOwnedByLabel, which this code does not. Assuming errors returned from this function trigger a resync (i.e. a retry), we'll fail to retry the operation when we detect that the cluster state isn't as expected.

I suspect something like the following will produce the desired results:

				if k8serrors.IsAlreadyExists(err) && cr != nil && ownerutil.IsOwnedByLabel(cr, csv) {
					continue
				}
				return err
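The condition suggested above can be read as a small decision function. The sketch below is a standalone approximation: errAlreadyExists and errForbidden stand in for real API errors (the actual code would use k8serrors.IsAlreadyExists), and the object type, ownerLabel key, and helper names are all hypothetical. It assumes the check is only reached after Create has returned a non-nil error.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for API errors.
var (
	errAlreadyExists = errors.New("already exists")
	errForbidden     = errors.New("forbidden")
)

// object is a hypothetical stand-in for a ClusterRole or ClusterRoleBinding.
type object struct {
	Labels map[string]string
}

// ownerLabel is an assumed label key.
const ownerLabel = "example.com/owner"

func ownedBy(obj *object, csv string) bool {
	return obj.Labels[ownerLabel] == csv
}

// shouldContinue reports whether a failed Create can be skipped: the object
// already exists, was returned, and carries the expected owner label. In
// every other case the caller should return the error.
func shouldContinue(err error, obj *object, csv string) bool {
	return errors.Is(err, errAlreadyExists) && obj != nil && ownedBy(obj, csv)
}

func main() {
	owned := &object{Labels: map[string]string{ownerLabel: "my-csv"}}
	other := &object{Labels: map[string]string{ownerLabel: "other-csv"}}

	cases := []struct {
		name string
		err  error
		obj  *object
	}{
		{"exists, owned", errAlreadyExists, owned},
		{"exists, nil object", errAlreadyExists, nil},
		{"exists, other owner", errAlreadyExists, other},
		{"unrelated error", errForbidden, owned},
	}
	for _, tc := range cases {
		fmt.Printf("%-20s continue=%v\n", tc.name, shouldContinue(tc.err, tc.obj, "my-csv"))
	}
}
```

Only the first case skips the error; a nil object, a different owner, or any other error falls through to the return, which is what lets a resync happen.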

When I played around with this exact implementation, we reverted to the original test flake behavior (e.g. "crb-role already exists"), as there's likely a larger issue in the fake testing client where the resource returned from the Create call is always going to be nil. With that said, we likely don't need to nil check the variable returned by Create, as this is a testing client problem. Any opinions?

Moving some of this from slack:

I had tried adding a client-go testing reactor for returning an object for these situations:

+// WithOperatorGroupReactors returns a fakeK8sClientOption that configures a Clientset to return the ClusterRole
+// or ClusterRoleBinding object during a create call.
+// Note(tflannag): Fix for https://github.com/operator-framework/operator-lifecycle-manager/issues/2091.
+func WithOperatorGroupReactors(tb testing.TB) Option {
+	return func(c ClientsetDecorator) {
+		c.PrependReactor("create", "clusterrolebindings", func(action clitesting.Action) (bool, runtime.Object, error) {
+			createAction := action.(clitesting.CreateActionImpl)
+			return false, createAction.Object, nil
+		})
+		c.PrependReactor("create", "clusterroles", func(action clitesting.Action) (bool, runtime.Object, error) {
+			createAction := action.(clitesting.CreateActionImpl)
+			return false, createAction.Object, nil
+		})
+	}
+}

But it wasn't immediately clear how to best integrate those reactors into the current fake client setup such that we're not always appending these reactors and increasing the time it takes to run unit tests, which likely indicates it's more useful as a one-off functional type setup:

-	k8sClientFake := k8sfake.NewSimpleClientset(config.k8sObjs...)
+	k8sClientFake := fake.NewClientSetDecoratorWithReactors(config.k8sObjs, config.fakeClientOptions...)
 	k8sClientFake.Resources = apiResourcesForObjects(append(config.extObjs, config.regObjs...))
-	config.operatorClient = operatorclient.NewClient(k8sClientFake, apiextensionsfake.NewSimpleClientset(config.extObjs...), apiregistrationfake.NewSimpleClientset(config.regObjs...))
+	config.operatorClient = operatorclient.NewClient(k8sClientFake.Clientset, apiextensionsfake.NewSimpleClientset(config.extObjs...), apiregistrationfake.NewSimpleClientset(config.regObjs...))

And at the call site creating a new fake client:

 	...
+	withFakeClientOptions(clientfake.WithOperatorGroupReactors(t)),
 	...

I tried running down this route but got sidetracked with other work.
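For anyone picking this up later, here is a simplified model of how a prepended reactor interacts with the rest of a reaction chain. It is a sketch under assumptions, not the client-go implementation: the fakeClient, reactor, and tracker names are hypothetical, and the chain semantics (a reactor that reports handled=false is skipped and the next one runs) are modeled on how the client-go testing fake is understood to behave.

```go
package main

import (
	"errors"
	"fmt"
)

// errAlreadyExists is a stand-in for an AlreadyExists API error.
var errAlreadyExists = errors.New("already exists")

// object is a hypothetical stand-in for a runtime.Object.
type object struct{ Name string }

// reactor loosely mirrors the shape of a reaction function: it reports
// whether it handled the action, plus the object and error to return.
type reactor func(obj *object) (handled bool, ret *object, err error)

// fakeClient is a toy client that walks a chain of reactors on create.
type fakeClient struct {
	chain []reactor
}

// prepend puts r at the front of the chain.
func (c *fakeClient) prepend(r reactor) {
	c.chain = append([]reactor{r}, c.chain...)
}

// create returns the result of the first reactor that handles the action.
func (c *fakeClient) create(obj *object) (*object, error) {
	for _, r := range c.chain {
		handled, ret, err := r(obj)
		if !handled {
			continue // result is discarded; the next reactor decides
		}
		return ret, err
	}
	return obj, nil
}

// tracker rejects duplicate names and returns a nil object with the error.
func tracker(existing map[string]bool) reactor {
	return func(obj *object) (bool, *object, error) {
		if existing[obj.Name] {
			return true, nil, errAlreadyExists
		}
		existing[obj.Name] = true
		return true, obj, nil
	}
}

func main() {
	c := &fakeClient{chain: []reactor{tracker(map[string]bool{"crb-role": true})}}

	// A prepended reactor that returns handled=false does not change the
	// outcome: the tracker still answers with (nil, already exists).
	c.prepend(func(obj *object) (bool, *object, error) { return false, obj, nil })

	got, err := c.create(&object{Name: "crb-role"})
	fmt.Println("object is nil:", got == nil, "err:", err)
}
```

In this model, a reactor would have to report handled=true to change what Create returns, and at that point it also owns the decision about which error to return.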

> But it wasn't immediately clear how to best integrate those reactors into the current fake client setup such that we're not always appending these reactors and increasing the time it takes to run unit tests, which likely indicates it's more useful as a one-off functional type setup

imho this reactor is generally useful for all unit tests, but I'm not sure the error from the create action would be carried forward properly by the existing decorator even if the example you gave was added.

> When I had played around with this exact implementation, we had reverted back to the original testing flake behavior (e.g. "crb-role already exists")

This is expected in the implementation with a bad client fake, but it should no longer panic at test time, and it should do the correct thing at runtime, i.e. return an error (even if it's not the "right" error) when the cluster state differs from what we expect.

timflannagan · 2021-09-23T15:34:59Z

/hold cancel

kevinrizza · 2021-09-23T16:56:56Z

/approve

dinhxuanvu · 2021-09-23T17:15:26Z

pkg/controller/operators/olm/operator_test.go

 	clockFake := utilclock.NewFakeClock(time.Date(2006, time.January, 2, 15, 4, 5, 0, time.FixedZone("MST", -7*3600)))
 	now := metav1.NewTime(clockFake.Now().UTC())
+	const (


I thought we had this const somewhere in the code base already.

dinhxuanvu

/lgtm

openshift-ci · 2021-09-23T17:16:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dinhxuanvu, kevinrizza, timflannagan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dinhxuanvu,kevinrizza]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot requested review from gallettilance and kevinrizza September 8, 2021 23:54

timflannagan force-pushed the fix-panic-isownedbykindlabels-label-nil branch 3 times, most recently from 09d1e58 to bf99831 Compare September 8, 2021 23:57

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 9, 2021

timflannagan force-pushed the fix-panic-isownedbykindlabels-label-nil branch from bf99831 to f2adb2d Compare September 14, 2021 19:36

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2021

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2021

timflannagan commented Sep 14, 2021

timflannagan force-pushed the fix-panic-isownedbykindlabels-label-nil branch from f2adb2d to 579c7dd Compare September 14, 2021 20:03

estroz reviewed Sep 14, 2021

njhale previously requested changes Sep 21, 2021

timflannagan and others added 2 commits September 23, 2021 11:32

test(og): de-flake sync unit tests

ad27f33

timflannagan force-pushed the fix-panic-isownedbykindlabels-label-nil branch from 579c7dd to ad27f33 Compare September 23, 2021 15:33

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 23, 2021

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 23, 2021

dinhxuanvu reviewed Sep 23, 2021

dinhxuanvu approved these changes Sep 23, 2021

openshift-ci bot assigned dinhxuanvu Sep 23, 2021

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 23, 2021

openshift-merge-robot merged commit 188ee1a into operator-framework:master Sep 23, 2021
