Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/controller: Fix panic when creating cluster-scoped RBAC in OG controller #2349

Conversation

timflannagan
Copy link
Contributor

@timflannagan timflannagan commented Sep 8, 2021

Description of the change:
Fixes #2091.

This is a follow-up to operator-framework#2309 that attempted to fix the original issue. When checking whether the ClusterRole/ClusterRoleBinding resources already exist, we're also checking whether the existing labels are owned by the CSV we're currently handling. When accessing the "cr" or "crb" variables that the Create calls output, a panic is produced as we're attempting to access the meta.Labels key from those resources, except those resources themselves are nil. Update the check to verify that the cr/crb variables are not nil before attempting to access those object's labels. The testing fake client may need to be updated in the future to handle returning these resources properly.

I ran go test ./pkg/controller/operators/olm/... -v -run ^TestSyncOperatorGroups$ -count 250 a couple of times and wasn't able to reproduce. I was able to consistently reproduce without these changes using a much smaller count sizing before.

Motivation for the change:
Reduce testing flakes.

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /doc
  • Commit messages sensible and descriptive

@timflannagan timflannagan force-pushed the fix-panic-isownedbykindlabels-label-nil branch 3 times, most recently from 09d1e58 to bf99831 Compare September 8, 2021 23:57
@timflannagan
Copy link
Contributor Author

Evan had pointed out that this implementation may be another regression from fixing the original issue in slack.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 9, 2021
@timflannagan timflannagan force-pushed the fix-panic-isownedbykindlabels-label-nil branch from bf99831 to f2adb2d Compare September 14, 2021 19:36
@timflannagan
Copy link
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2021
@timflannagan
Copy link
Contributor Author

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 14, 2021
@timflannagan timflannagan force-pushed the fix-panic-isownedbykindlabels-label-nil branch from f2adb2d to 579c7dd Compare September 14, 2021 20:03
if crb != nil && ownerutil.IsOwnedByLabel(crb, csv) {
continue
}
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this error be returned?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to return an error if we want !IsOwnedByLabel to cause a resync (i.e. to retry this logic).

njhale
njhale previously requested changes Sep 21, 2021
if crb != nil && ownerutil.IsOwnedByLabel(crb, csv) {
continue
}
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to return an error if we want !IsOwnedByLabel to cause a resync (i.e. to retry this logic).

Comment on lines 535 to 538
if !k8serrors.IsAlreadyExists(err) {
return err
}
// if the CR already exists, but the label is correct, the cache is just behind
if k8serrors.IsAlreadyExists(err) && ownerutil.IsOwnedByLabel(cr, csv) {
if cr != nil && ownerutil.IsOwnedByLabel(cr, csv) {
continue
} else {
return err
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the @ecordell 's original patch, effectively returns any error if !IsOwnedByLabel, which this code does not. Assuming errors returned from this function trigger a resync -- i.e. a retry -- then we'll fail to retry the operation when we detect the cluster state isn't as expected.

I suspect something like the following will produce the desired results:

if k8serrors.IsAlreadyExists(err) && cr != nil && ownerutil.IsOwnedByLabel(cr, csv) {
    continue
}
return err

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When I had played around with this exact implementation, we had reverted back to the original testing flake behavior (e.g. "crb-role already exists") as there's likely a larger issue in the fake testing client where the resource returned from the Create call is always going to be nil. With that said, it's likely we don't need to nil check the variable returned by Create as this is a testing client problem. Any opinions?

Copy link
Contributor Author

@timflannagan timflannagan Sep 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving some of this from slack:

I had tried adding a client-go testing reactor for returning an object for these situations:

+// WithOperatorGroupReactors returns a fakeK8sClientOption that configures a Clientset to return the ClusterRole
+// or ClusterRoleBinding object during a create call.
+// Note(tflannag): Fix for https://github.com/operator-framework/operator-lifecycle-manager/issues/2091.
+func WithOperatorGroupReactors(tb testing.TB) Option {
+       return func(c ClientsetDecorator) {
+               c.PrependReactor("create", "clusterrolebindings", func(action clitesting.Action) (bool, runtime.Object, error) {
+                       createAction := action.(clitesting.CreateActionImpl)
+                       return false, createAction.Object, nil
+               })
+               c.PrependReactor("create", "clusterroles", func(action clitesting.Action) (bool, runtime.Object, error) {
+                       createAction := action.(clitesting.CreateActionImpl)
+                       return false, createAction.Object, nil
+               })
+       }
+}

But it wasn't immediately clear how to best integrate those reactors into the current fake client setup such that we're not always appending these reactors and increasing the time it takes to run unit tests, which likely indicates it's more useful as a one-off functional type setup:

-       k8sClientFake := k8sfake.NewSimpleClientset(config.k8sObjs...)
+       k8sClientFake := fake.NewClientSetDecoratorWithReactors(config.k8sObjs, config.fakeClientOptions...)
        k8sClientFake.Resources = apiResourcesForObjects(append(config.extObjs, config.regObjs...))
-       config.operatorClient = operatorclient.NewClient(k8sClientFake, apiextensionsfake.NewSimpleClientset(config.extObjs...), apiregistrationfake.NewSimpleClientset(config.regObjs...))
+       config.operatorClient = operatorclient.NewClient(k8sClientFake.Clientset, apiextensionsfake.NewSimpleClientset(config.extObjs...), apiregistrationfake.NewSimpleClientset(config.regObjs...))

And at the call site creating a new fake client:

...
+                               withFakeClientOptions(clientfake.WithOperatorGroupReactors(t)),
...

I tried running down this route but got sidetracked with other work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But it wasn't immediately clear how to best integrate those reactors into the current fake client setup such that we're not always appending these reactors and increasing the time it takes to run unit tests, which likely indicates it's more useful as a one-off functional type setup

imho this reactor is generally useful for all unit tests, but I'm not sure the error from the create action would be carried forward properly by the existing decorator even if the example you gave was added.

When I had played around with this exact implementation, we had reverted back to the original testing flake behavior (e.g. "crb-role already exists")

this is expected in the implementation with a bad client fake, but it should no longer panic at test time, and should do the correct thing at runtime; i.e. return an error -- even if it's not the "right" error -- when the cluster has a different view of the cluster.

timflannagan and others added 2 commits September 23, 2021 11:32
…troller

Fixes [operator-framework#2091](operator-framework#2091).

This is a follow-up to [operator-framework#2309](operator-framework#2309) that attempted to fix the original issue. When checking whether the ClusterRole/ClusterRoleBinding resources already exist, we're also checking whether the existing labels are owned by the CSV we're currently handling. When accessing the "cr" or "crb" variables that the Create calls output, a panic is produced as we're attempting to access the meta.Labels key from those resources, except those resources themselves are nil. Update the check to verify that the cr/crb variables are not nil before attempting to access those object's labels. The testing fake client may need to be updated in the future to handle returning these resources properly.

Signed-off-by: timflannagan <timflannagan@gmail.com>
@timflannagan timflannagan force-pushed the fix-panic-isownedbykindlabels-label-nil branch from 579c7dd to ad27f33 Compare September 23, 2021 15:33
@timflannagan
Copy link
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 23, 2021
@kevinrizza
Copy link
Member

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 23, 2021
clockFake := utilclock.NewFakeClock(time.Date(2006, time.January, 2, 15, 4, 5, 0, time.FixedZone("MST", -7*3600)))
now := metav1.NewTime(clockFake.Now().UTC())
const (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we had this const somewhere in the code base already.

Copy link
Member

@dinhxuanvu dinhxuanvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Sep 23, 2021
@openshift-ci
Copy link

openshift-ci bot commented Sep 23, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dinhxuanvu, kevinrizza, timflannagan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [dinhxuanvu,kevinrizza]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit 188ee1a into operator-framework:master Sep 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flake: TestSyncOperatorGroups/AllNamespaces/CSVPresent/InstallModeNotSupported
6 participants