[release-4.8] Bug 1998938: Use client-go's leader election implementation #421
Conversation
…figMaps Update the `marketplace-operator` ClusterRole RBAC and add an entry that allows the marketplace operator to update ConfigMaps, which is needed for client-go's lease-based ("leader-for-lease") leader election implementation to work correctly.

Update the cmd and pkg files that house the cleanup/migration logic for ensuring that deprecated APIs, like the OperatorSource resource, are transitioned to the CatalogSource resource. These APIs have been deprecated for multiple OCP releases now, and the marketplace project is largely a downstream project at this point since most of the heavy-lifting functionality has been migrated to upstream OLM, so these changes should help clean up the main package.

Update the cmd/manager/main.go package to use client-go's leader election implementation instead of operator-sdk's. client-go acquires leadership through a leasing mechanism, whereas operator-sdk's leader-for-life approach can produce a failed upgrade in some edge cases because the new operator version can never acquire leadership. A sketch of the lease-based approach follows this description. Alternatives considered:
- Use controller-runtime's implementation, but that would likely require a sizable refactor of the driver function when setting up the manager instance here, as that implementation only attempts to gain leadership when mgr.Start() is called. This is problematic because we have several prerequisites to satisfy before starting any controllers/informer caches/etc., like loading the default CatalogSource YAML manifests into a cache.
- Avoid needing a leader election mechanism entirely by spinning up a sidecar container that recycles Pods when a new rollout is detected.
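Not the actual marketplace wiring, but a minimal sketch of what client-go's lease-based leader election looks like, assuming a `configmapsleases` resource lock (which is why the ClusterRole needs permission to update ConfigMaps); the namespace, lock name, and timing values below are illustrative:

```go
package main

import (
	"context"
	"os"
	"time"

	"github.com/sirupsen/logrus"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		logrus.Fatal(err)
	}
	kubeClient := kubernetes.NewForConfigOrDie(cfg)

	// Use the pod hostname as the candidate identity so each replica is distinct.
	id, err := os.Hostname()
	if err != nil {
		logrus.Fatal(err)
	}

	// A configmapsleases lock writes to both ConfigMaps and Leases, hence the
	// extra ConfigMap update RBAC. The namespace and lock name are hypothetical.
	lock, err := resourcelock.New(
		resourcelock.ConfigMapsLeasesResourceLock,
		"openshift-marketplace",
		"marketplace-operator-lock",
		kubeClient.CoreV1(),
		kubeClient.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id},
	)
	if err != nil {
		logrus.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   90 * time.Second,
		RenewDeadline:   60 * time.Second,
		RetryPeriod:     30 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				logrus.Info("became leader; starting controllers")
				// Start controllers/informer caches here, then block until shutdown.
				<-ctx.Done()
			},
			OnStoppedLeading: func() {
				// Unlike leader-for-life, an expired lease lets a newer replica
				// take over; exit and let the Deployment restart this pod.
				logrus.Fatal("leader election lost")
			},
		},
	})
}
```

Because the lock is a renewable lease rather than a resource owned for the pod's lifetime, a replica that stops renewing (for example, the old version during an upgrade) loses leadership once the lease expires, which is what avoids the stuck-upgrade edge case described above.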
@timflannagan: This pull request references Bugzilla bug 1998938, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 6 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Bugzilla (xzha@redhat.com), skipping review request. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
clientGo, err := client.New(cfg, client.Options{Scheme: mgr.GetScheme()})
if err != nil && !k8sErrors.IsNotFound(err) {
	logrus.Fatal(err, "Failed to instantiate the client for migrator")
}
migrator := migrator.New(clientGo)
err = migrator.Migrate()
if err != nil {
	logrus.Error(err, "[migration] Error while migrating Marketplace away from OperatorSource API")
}
Quick note to reviewers: the automatic cherry-pick of the master/release-4.9 PR failed when attempting to apply these commit(s) on top of the existing cmd/manager/main.go implementation. As a result, I added the commit that removes the migration logic introduced during the 4.9 release timeframe: bcb9289. It should be a safe change to sneak in, given that the OperatorSource APIs were deprecated in 4.5 and removed in 4.6+.
@@ -28,7 +30,9 @@ func Context() context.Context {

	select {
	case <-signalCtx.Done():
		logrus.Info("received the done signal")
Hmm, these look like debug artifacts from the initial, merged PR that made their way through code review. I'm not entirely sure I see the value in logging the current state of signal handling, at least at the info level, but it's not the end of the world. Happy to modify the cherry-picked commit and remove this additional logging if anyone has a strong opinion.
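For reference, a minimal, hypothetical sketch of the signal-handling pattern that a `Context()` helper like this implements, using `signal.NotifyContext` (not the marketplace implementation), including the info-level log line discussed above:

```go
package signals

import (
	"context"
	"os"
	"os/signal"
	"syscall"

	"github.com/sirupsen/logrus"
)

// Context returns a context that is cancelled when SIGINT or SIGTERM arrives,
// giving the operator a chance to release its leader lease before exiting.
func Context() context.Context {
	signalCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)

	go func() {
		// Block until a signal cancels the context, log it, then unregister the
		// handlers so a second signal terminates the process immediately.
		<-signalCtx.Done()
		logrus.Info("received the done signal")
		stop()
	}()

	return signalCtx
}
```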
@timflannagan: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: dinhxuanvu, kevinrizza, timflannagan. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/retest-required Please review the full test history for this PR and help us cut down flakes.
5 similar comments
/uncc
/retest-required Please review the full test history for this PR and help us cut down flakes.
4 similar comments
/retest-required Please review the full test history for this PR and help us cut down flakes.
19 similar comments
[patch-manager] 🚀 Approved for z-stream by score: 0.50, adding cherry-pick approved
@timflannagan: All pull requests linked via external trackers have merged: Bugzilla bug 1998938 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/cherrypick release-4.7
@timflannagan: #421 failed to apply on top of branch "release-4.7":
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
…k to 4.8.12 [1] had been doing well for a while, but around September 17 [2] and 21 [3] it began perma-failing. Digging into the recent [4], the cluster-version operator blocked on marketplace:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported/1449854837989576704/artifacts/e2e-aws-upgrade-rollback-oldest-supported/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.conditions[] | select(.type == "Progressing") | .reason + ": " + .message'
  ClusterOperatorUpdating: Working towards 4.8.2: 527 of 676 done (77% complete), waiting on marketplace

The marketplace operator had hung on leader election:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported/1449854837989576704/artifacts/e2e-aws-upgrade-rollback-oldest-supported/gather-extra/artifacts/pods/openshift-marketplace_marketplace-operator-744b969b6-2v69d_marketplace-operator.log | tail -n1
  time="2021-10-17T23:46:17Z" level=info msg="Waiting to become leader."

Marketplace adjusted their leader lease handling on September 9th [5], shipping in 4.8.12 on the 21st [6], and I suspect this is related. I'm bumping the floor to avoid marketplace's previous deadlock-prone leader system.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported/1438981580411375616
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported/1440431301499817984
[4]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported/1449854837989576704
[5]: operator-framework/operator-marketplace#421 (comment)
[6]: https://bugzilla.redhat.com/show_bug.cgi?id=1998938#c5
[ART PR BUILD NOTIFIER] This PR has been included in build marketplace-operator-container-v4.8.0-202311261141.p0.g3f3d7d1.assembly.stream for distgit marketplace-operator.