Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[release-4.8] Bug 1998938: Use client-go's leader election implementation #421

Conversation

timflannagan
Copy link
Contributor

Description of the change:

Motivation for the change:

Reviewer Checklist

  • Implementation matches the proposed design, or proposal is updated to match implementation
  • Sufficient unit test coverage
  • Sufficient end-to-end test coverage
  • Docs updated or added to /docs
  • Commit messages sensible and descriptive

timflannagan and others added 4 commits August 30, 2021 00:52
…figMaps

Update the `marketplace-operator` ClusterRole RBAC and add an entry that
allows the marketplace operator to update ConfigMaps in order to get
client-go's leader election leader-for-lease implementation working
correctly.
Update the cmd,pkg files that house the cleanup/migration logic for
ensuring that any deprecated APIs, like the OperatorSource resource, are
transistioned to the CatalogSource resource. These APIs have been
deprecated for multiple OCP releases now, and the marketplace project is
largely a downstream project at this point as most of the heavy-lifting
functionality has been migrated over to upstream OLM, so these changes
should help cleanup the main package.
Update the cmd/manager/main.go package and use client-go's leader
election implementation instead of operator-sdk's implementation.

This type of leader election implementation is acquired through a
leasing mechanism vs. operator-sdk's leader for life, which can produce
a failed upgrade in some edge cases as the new operator version cannot
acquire leader election.

Alternatives:
- Using controller-runtime's implementation, but that
would likely include a sizable refactor in the driver function when
setting up the manager instance here, as that implementation only
attempts to gain leader election when mgr.Start() is called. This is
problematic in the case that we have several pre-requisites before
starting any controllers/informer caches/etc., like loading the default
CatalogSource YAML manifests into a cache.
- Avoid needing a leader election mechanism by spinning up a sidecar
container that recycles Pods when a new rollout is detected.
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 30, 2021

@timflannagan: This pull request references Bugzilla bug 1998938, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.z) matches configured target release for branch (4.8.z)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1958888 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), CLOSED (CURRENTRELEASE))
  • dependent Bugzilla bug 1958888 targets the "4.9.0" release, which is one of the valid target releases: 4.9.0
  • bug has dependents

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (xzha@redhat.com), skipping review request.

In response to this:

[release-4.8] Bug 1998938: Use client-go's leader election implementation

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 30, 2021
Comment on lines -178 to -186
clientGo, err := client.New(cfg, client.Options{Scheme: mgr.GetScheme()})
if err != nil && !k8sErrors.IsNotFound(err) {
logrus.Fatal(err, "Failed to instantiate the client for migrator")
}
migrator := migrator.New(clientGo)
err = migrator.Migrate()
if err != nil {
logrus.Error(err, "[migration] Error while migrating Marketplace away from OperatorSource API")
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick note to reviewers: the master/release-4.9 PR automatic cherrypick failed when attempting to apply these commit(s) to the existing cmd/manager/main.go implementation. As a result, I added the commit for removing the migration logic that was added during the 4.9 release timeframe: bcb9289. Should be a safe change to sneak in given that OperatorSource APIs were deprecated in 4.5 and removed in 4.6+.

@@ -28,7 +30,9 @@ func Context() context.Context {

select {
case <-signalCtx.Done():
logrus.Info("received the done signal")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, these look like debug artifacts that were added in the initial, merged PR that made it's way through code review. I'm not entirely sure I see the value in logging the current state of signal handling, at least at the info-level log level, but not the end of the world. Happy to modify that cherrypick(-ed) commit and remove this additional logging if anyone has a strong opinion.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 30, 2021

@timflannagan: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Rerun command
ci/prow/okd-e2e-aws c4f9672 link /test okd-e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@kevinrizza
Copy link
Member

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 30, 2021
Copy link
Member

@dinhxuanvu dinhxuanvu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 30, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Aug 30, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dinhxuanvu, kevinrizza, timflannagan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

5 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@timflannagan
Copy link
Contributor Author

/uncc

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

4 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

19 similar comments
@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@mfojtik mfojtik added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Sep 10, 2021
@mfojtik
Copy link
Collaborator

mfojtik commented Sep 10, 2021

[patch-manager] 🚀 Approved for z-stream by score: 0.50

adding cherry-pick approved

@openshift-merge-robot openshift-merge-robot merged commit 3f3d7d1 into operator-framework:release-4.8 Sep 10, 2021
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 10, 2021

@timflannagan: All pull requests linked via external trackers have merged:

Bugzilla bug 1998938 has been moved to the MODIFIED state.

In response to this:

[release-4.8] Bug 1998938: Use client-go's leader election implementation

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@timflannagan timflannagan deleted the release-4.8-client-go-leader-election-bz branch September 10, 2021 16:19
@timflannagan
Copy link
Contributor Author

/cherrypick release-4.7

@openshift-cherrypick-robot

@timflannagan: #421 failed to apply on top of branch "release-4.7":

Applying: manifests: Ensure the marketplace-operator ClusterRole can update ConfigMaps
Applying: cmd,pkg: Remove migration logic for depreacted APIs
Using index info to reconstruct a base tree...
M	cmd/manager/main.go
M	pkg/defaults/defaults.go
M	pkg/migrator/migrator.go
Falling back to patching base and 3-way merge...
CONFLICT (modify/delete): pkg/migrator/migrator.go deleted in cmd,pkg: Remove migration logic for depreacted APIs and modified in HEAD. Version HEAD of pkg/migrator/migrator.go left in tree.
Auto-merging pkg/defaults/defaults.go
CONFLICT (content): Merge conflict in pkg/defaults/defaults.go
Auto-merging cmd/manager/main.go
CONFLICT (content): Merge conflict in cmd/manager/main.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0002 cmd,pkg: Remove migration logic for depreacted APIs
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherrypick release-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

wking added a commit to wking/openshift-release that referenced this pull request Oct 19, 2021
…k to 4.8.12

[1] had been doing well for a while, but around September 17 [2] and
21 [3] it began perma-failing.  Digging into the recent [4], the
cluster-version operator blocked on marketplace:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported/1449854837989576704/artifacts/e2e-aws-upgrade-rollback-oldest-supported/gather-extra/artifacts/clusterversion.json | jq -r '.items[].status.conditions[] | select(.type == "Progressing") | .reason + ": " + .message'
  ClusterOperatorUpdating: Working towards 4.8.2: 527 of 676 done (77% complete), waiting on marketplace

The marketplace operator had hung on leader election:

  $ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported/1449854837989576704/artifacts/e2e-aws-upgrade-rollback-oldest-supported/gather-extra/artifacts/pods/openshift-marketplace_marketplace-operator-744b969b6-2v69d_marketplace-operator.log | tail -n1
  time="2021-10-17T23:46:17Z" level=info msg="Waiting to become leader."

Marketplace adjusted their leader lease handling on September 9th [5],
shipping in 4.8.12 on the 21st [6], and I suspect this is related.
I'm bumping the floor to avoid marketplace's previous deadlock-prone
leader system.

[1]: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.8-informing#periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported/1438981580411375616
[3]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported/1440431301499817984
[4]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-upgrade-rollback-oldest-supported/1449854837989576704
[5]: operator-framework/operator-marketplace#421 (comment)
[6]: https://bugzilla.redhat.com/show_bug.cgi?id=1998938#c5
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build marketplace-operator-container-v4.8.0-202311261141.p0.g3f3d7d1.assembly.stream for distgit marketplace-operator.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants