Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug 1917537: Fix time comparison in CSV reconcile loop #1974

Conversation

hasbro17
Copy link
Contributor

Description of the change:
Fix the comparison check for csv.status.LastUpdateTime in the CSV reconcile from == to Time.Equal().

Motivation for the change:
See the following bug for background on the reconcile hotloop: https://bugzilla.redhat.com/show_bug.cgi?id=1917537

After installing an operator on an openshift 4.6.z cluster (e.g 4.6.13) the OLM operator can get into a loop where the CSV object is being continuously being reconciled and updated roughly every second.

time="2021-01-26T19:55:44Z" level=info msg="checking ocs-operator.v4.6.1"
2021-01-26T19:55:44.512Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2021-01-26T19:55:44.513Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2021-01-26T19:55:44.513Z	DEBUG	controllers.operator	Operator is already up-to-date	{"request": "/ocs-operator.openshift-storage"}
2021-01-26T19:55:44.513Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2021-01-26T19:55:44.514Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "clusterserviceversion", "request": "openshift-storage/ocs-operator.v4.6.1"}
2021-01-26T19:55:45.634Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2021-01-26T19:55:45.635Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "subscription", "request": "openshift-storage/ocs-operator"}
time="2021-01-26T19:55:45Z" level=info msg="csv in operatorgroup" csv=ocs-operator.v4.6.1 id=zuP7V namespace=openshift-storage opgroup=openshift-storage-dvlsm phase=Succeeded
time="2021-01-26T19:55:45Z" level=info msg="checking ocs-operator.v4.6.1"
2021-01-26T19:55:45.666Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "clusterserviceversion", "request": "openshift-storage/ocs-operator.v4.6.1"}
2021-01-26T19:55:45.678Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2021-01-26T19:55:45.680Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
2021-01-26T19:55:45.680Z	DEBUG	controllers.operator	Operator is already up-to-date	{"request": "/ocs-operator.openshift-storage"}
2021-01-26T19:55:45.680Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "operator", "request": "/ocs-operator.openshift-storage"}
2021-01-26T19:55:46.848Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "subscription", "request": "openshift-storage/ocs-operator"}
2021-01-26T19:55:46.848Z	DEBUG	controllers.operator	reconciling operator	{"request": "/ocs-operator.openshift-storage"}
time="2021-01-26T19:55:46Z" level=info msg="csv in operatorgroup" csv=ocs-operator.v4.6.1 id=zykOJ namespace=openshift-storage opgroup=openshift-storage-dvlsm phase=Succeeded
time="2021-01-26T19:55:46Z" level=info msg="checking ocs-operator.v4.6.1"
...

This results in increased CPU usage and requests to the APIserver.
Pre-fix-usage

While it's unclear how the operator gets into this reconcile hotloop when installing certain operators (OCS, ACM) and not others (etcd, cert-manager), the comparison check outCSV.Status.LastUpdateTime == clusterServiceVersion.Status.LastUpdateTime is incorrect as it always results false since the values being compared are *metav1.Time pointer addresses.

This results in the OLM operator always updating the CSV object with no real status change and triggering another reconcile event.

With the proposed fix the reconcile hotloop does go away since the CSV is no longer being updated when there are no status changes.
Screen Shot 2021-01-26 at 11 33 20 AM

TODO:

  • Unknown how OLM gets into the reconcile loop or why it doesn't occur on a 4.7.z cluster
  • Figure out a unit or e2e test to cover this case

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Jan 26, 2021
@openshift-ci-robot
Copy link
Collaborator

@hasbro17: This pull request references Bugzilla bug 1917537, which is invalid:

  • expected the bug to target the "4.7.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

WIP: Bug 1917537: Fix time comparison in CSV reconcile loop

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@hasbro17
Copy link
Contributor Author

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Jan 27, 2021
@openshift-ci-robot
Copy link
Collaborator

@hasbro17: This pull request references Bugzilla bug 1917537, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Jan 27, 2021
@hasbro17
Copy link
Contributor Author

/retest

@hasbro17 hasbro17 changed the title WIP: Bug 1917537: Fix time comparison in CSV reconcile loop Bug 1917537: Fix time comparison in CSV reconcile loop Jan 27, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 27, 2021
@@ -1097,7 +1097,7 @@ func (a *Operator) syncClusterServiceVersion(obj interface{}) (syncError error)
}

// status changed, update CSV
if !(outCSV.Status.LastUpdateTime == clusterServiceVersion.Status.LastUpdateTime &&
if !(outCSV.Status.LastUpdateTime.Equal(clusterServiceVersion.Status.LastUpdateTime) &&
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Forgot that this could be a nil dereference if outCSV.Status.LastUpdateTime == nil. Going to add a check for that first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, never mind. I keep forgetting a method is orthogonal to the underlying data and LastUpdateTime.Equal() is fine even if LastUpdateTime == nil.

@ecordell
Copy link
Member

ecordell commented Feb 2, 2021

/approve

@openshift-ci-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ecordell, hasbro17

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2021
@ecordell
Copy link
Member

ecordell commented Feb 2, 2021

/cherry-pick release-4.6

@openshift-cherrypick-robot

@ecordell: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kevinrizza
Copy link
Member

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 2, 2021
@hasbro17
Copy link
Contributor Author

hasbro17 commented Feb 2, 2021

/retest

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 0373c9b into operator-framework:master Feb 2, 2021
@openshift-ci-robot
Copy link
Collaborator

@hasbro17: All pull requests linked via external trackers have merged:

Bugzilla bug 1917537 has been moved to the MODIFIED state.

In response to this:

Bug 1917537: Fix time comparison in CSV reconcile loop

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot

@ecordell: new pull request created: #1987

In response to this:

/cherry-pick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants