Bug 1872906: pkg/start: Release leader lease on graceful shutdown #446

Conversation

@wking wking commented Aug 27, 2020

Picking #349 and #424 back to 4.5 so we release the leader lock when we exit, which in turn allows the incoming CVO pod to pick up the leader lock and update the ClusterVersion status in a timely manner.

This required a number of mostly-minor manual tweaks to adjust to 4.5, which is missing #358's TLS metrics and the Context changes from #406 and #410.

We've had it since 2b81f47 (cvo: Release our leader lease when we
are gracefully terminated, 2019-01-16, openshift#87), but it's redundant
with "ListenAddr is not an empty string".

I'm also switching to:

  o.ListenAddr != ""

instead of:

  len(o.ListenAddr) > 0

because it seems slightly easier to understand, but obviously either
will work.

Cherry-picked from 07e5809 (openshift#349), around conflicts due to the lack
of TLS metrics in 4.5.

Somewhat like the example in [1].  This pushes the server management
down into a new RunMetrics method, which we then run in its own
goroutine.  This is initial groundwork; I expect we will port more of
our child goroutines to this framework in follow-up work.
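
For orientation, a minimal sketch of that shape (the RunMetrics
signature and the bare promhttp handler here are assumptions for the
example, not the exact 4.5 wiring):

  package main

  import (
      "context"
      "net/http"
      "time"

      "github.com/prometheus/client_golang/prometheus/promhttp"
  )

  // RunMetrics serves /metrics until ctx is canceled, then drains
  // in-flight requests before returning, per the Shutdown docs in [1].
  func RunMetrics(ctx context.Context, listenAddr string) error {
      mux := http.NewServeMux()
      mux.Handle("/metrics", promhttp.Handler())
      server := &http.Server{Addr: listenAddr, Handler: mux}

      errCh := make(chan error, 1)
      go func() {
          errCh <- server.ListenAndServe()
      }()

      select {
      case err := <-errCh:
          return err // the server died on its own
      case <-ctx.Done():
          // Stop accepting new connections and give in-flight
          // requests a bounded window to finish.
          shutdownCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
          defer cancel()
          return server.Shutdown(shutdownCtx)
      }
  }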

Cherry-picked from b30aa0e (openshift#349), around conflicts due to the lack
of TLS metrics in 4.5.

[1]: https://golang.org/pkg/net/http/#Server.Shutdown

Pulling this up out of cvo.New() while working to decouple metrics
handling from the core CVO goroutine.

These are not just for available updates; they're also for downloading
signatures.  Placing them in a separate file makes it easier to focus
on the code that is specific to available updates.

So the incoming cluster-version operator doesn't need to wait for the
outgoing operator's lease to expire, which can take a while [1]:

  I0802 10:06:01.056591       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cluster-version/version...
  ...
  I0802 10:07:42.632719       1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version

and time out the:

  Cluster did not acknowledge request to upgrade in a reasonable time

testcase [2].  Using ReleaseOnCancel has been the plan since
2b81f47 (cvo: Release our leader lease when we are gracefully
terminated, 2019-01-16, openshift#87).  I'm not clear on why it (sometimes?)
doesn't work today.

The discrepancy between the "exit after 2s no matter what" comment and
the 5s After dates back to dbedb7a (cvo: When the CVO restarts,
perform one final sync to write status, 2019-04-27, openshift#179), which
bumped the After from 2s to 5s, but forgot to bump the comment.  I'm
removing that code here in favor of the two-minute timeout from
b30aa0e (pkg/cvo/metrics: Graceful server shutdown, 2020-04-15, openshift#349).
We still exit immediately on a second TERM, for folks who get
impatient waiting for the graceful timeout.
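
For reference, ReleaseOnCancel is a client-go leader-election option;
a minimal sketch of how it hooks in (the lock construction is elided,
and the durations are illustrative, not the CVO's actual values):

  package main

  import (
      "context"
      "time"

      "k8s.io/client-go/tools/leaderelection"
      "k8s.io/client-go/tools/leaderelection/resourcelock"
  )

  // runLeaderElection blocks, running 'run' while we hold the lease.
  // With ReleaseOnCancel, canceling ctx empties HolderIdentity on the
  // lock as we step down, so the incoming pod can acquire the lease
  // immediately instead of waiting out the lease duration.
  func runLeaderElection(ctx context.Context, lock resourcelock.Interface, run func(context.Context)) {
      leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
          Lock:            lock,
          ReleaseOnCancel: true,
          LeaseDuration:   90 * time.Second, // illustrative, not the CVO's values
          RenewDeadline:   45 * time.Second,
          RetryPeriod:     30 * time.Second,
          Callbacks: leaderelection.LeaderCallbacks{
              OnStartedLeading: run,
              OnStoppedLeading: func() {
                  // We lost, or voluntarily released, the lease.
              },
          },
      })
  }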

Decouple shutdownContext from the context passed into Options.run, to
allow TestIntegrationCVO_gracefulStepDown to request a graceful
shutdown.  And remove Context.Start(), inlining the logic in
Options.run so we can count and reap the goroutines it used to launch.
This also allows us to be more targeted with the context for each
goroutine (a rough sketch of the wiring follows the list):

* Informers are now launched before the lease controller, so they're
  up and running by the time we acquire the lease.  They remain
  running until the main operator CVO.Run() exits, after which we shut
  them down.  Having informers running before we have a lease is
  somewhat expensive in terms of API traffic, but we should rarely
  have two CVO pods competing for leadership since we transitioned to
  the Recreate Deployment strategy in 078686d
  (install/0000_00_cluster-version-operator_03_deployment: Set
  'strategy: Recreate', 2019-03-20, openshift#140) and 5d8a527
  (install/0000_00_cluster-version-operator_03_deployment: Fix
  Recreate strategy, 2019-04-03, openshift#155).  I don't see a way to block on
  their internal goroutines' completion, but maybe informers will grow
  an API for that in the future.

* The metrics server also continues to run until CVO.Run() exits,
  where previously we began gracefully shutting it down at the same
  time we started shutting down CVO.Run().  This ensures we are around
  and publishing any last-minute CVO.Run() changes.

* Leader election also continues to run until CVO.Run() exits.  We
  don't want to release the lease while we're still controlling
  things.

* CVO.Run() and AutoUpdate.Run() both stop immediately when the
  passed-in context is canceled or we call runCancel internally
  (because of a TERM, error from a goroutine, or loss of leadership).
  These are the only two goroutines that are actually writing to the
  API servers, so we want to shut them down as quickly as possible.
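
A rough sketch of the two-layer context wiring described above, with
hypothetical names; the real Options.run launches and reaps more
goroutines than shown here:

  package main

  import (
      "context"
      "os"
      "os/signal"
      "syscall"
  )

  // run receives ctx from the command-line handler, where it is
  // initialized with context.Background() and scoped to the whole
  // 'start' invocation.
  func run(ctx context.Context) {
      // runCtx governs the API-writing workers (CVO.Run and
      // AutoUpdate.Run); runCancel fires on TERM, a goroutine
      // error, or loss of leadership.
      runCtx, runCancel := context.WithCancel(ctx)
      defer runCancel()

      // shutdownCtx governs the supporting goroutines (informers,
      // metrics server, leader election); it outlives runCtx so
      // they keep running until the main workers have exited.
      shutdownCtx, shutdownCancel := context.WithCancel(ctx)
      defer shutdownCancel()

      sigCh := make(chan os.Signal, 1)
      signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
      go func() {
          <-sigCh
          runCancel() // first TERM: begin graceful shutdown
          <-sigCh
          os.Exit(1) // second TERM: exit immediately
      }()

      _ = runCtx      // launch CVO.Run and AutoUpdate.Run on runCtx,
      _ = shutdownCtx // supporting goroutines on shutdownCtx, then reap
      // the runCtx workers and call shutdownCancel().
  }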

Drop an unnecessary runCancel() from the "shutting down" branch of the
error collector.  I'd added it in b30aa0e, but you can only ever
get into the "shutting down" branch if runCancel has already been
called.  And fix the scoping for the shutdownTimer variable so we
don't clear it on each for-loop iteration (oops :p, bug from
b30aa0e).

Add some logging to the error collector, so it's easier to see where
we are in the collection process from the operator logs.  Also start
logging collected goroutines by name, so we can figure out which may
still be outstanding.
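
The collector shape, roughly (hypothetical names; the real code logs
through klog and feeds the errors into the shutdown logic):

  package main

  import (
      "errors"
      "log"
  )

  // goroutineResult pairs a goroutine's name with its final error,
  // so the collector can log by name which goroutines it has reaped
  // and which are still outstanding.
  type goroutineResult struct {
      name string
      err  error
  }

  func main() {
      resultCh := make(chan goroutineResult)
      launched := 0
      launch := func(name string, fn func() error) {
          launched++
          go func() {
              resultCh <- goroutineResult{name: name, err: fn()}
          }()
      }

      launch("metrics", func() error { return nil })
      launch("auto-update", func() error { return errors.New("canceled") })

      for i := 0; i < launched; i++ {
          result := <-resultCh
          log.Printf("collected %s goroutine (%d of %d): %v", result.name, i+1, launched, result.err)
      }
  }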

Set terminationGracePeriodSeconds 130 to extend the default 30s [3],
to give the container the full two-minute graceful timeout window
before the kubelet steps in with a KILL.
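
The manifest side is a single field on the pod spec (trimmed sketch
of the deployment):

  # install/0000_00_cluster-version-operator_03_deployment.yaml (trimmed)
  spec:
    template:
      spec:
        # Default is 30s; 130s covers the two-minute graceful window
        # plus a little slack before the kubelet sends KILL.
        terminationGracePeriodSeconds: 130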

Push the Background() initialization all the way up to the
command-line handler, to make it more obvious that the context is
scoped to the whole 'start' invocation.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-5b6ff896c6-57ppb_cluster-version-operator.log
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1843505#c7
[3]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core

Cherry-picked from cc1921d (openshift#424), around conflicts due to the lack
of TLS metrics and the Context pivots in 4.5.
…ap lock release

From the godocs:

  $ grep -A5 '// HolderIdentity' vendor/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go
    // HolderIdentity is the ID that owns the lease. If empty, no one owns this lease and
    // all callers may acquire. Versions of this library prior to Kubernetes 1.14 will not
    // attempt to acquire leases with empty identities and will wait for the full lease
    // interval to expire before attempting to reacquire. This value is set to empty when
    // a client voluntarily steps down.
    HolderIdentity       string      `json:"holderIdentity"`

The previous assumption that the release would involve ConfigMap
deletion was born with the test in 2b81f47 (cvo: Release our leader
lease when we are gracefully terminated, 2019-01-16, openshift#87).
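
So the test should assert on an empty holderIdentity rather than on
ConfigMap deletion.  A hypothetical check (written against a recent
client-go; the exact Get signature varies with the vendored version):

  package start_test

  import (
      "context"
      "encoding/json"
      "testing"

      metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
      "k8s.io/client-go/kubernetes"
      "k8s.io/client-go/tools/leaderelection/resourcelock"
  )

  // expectReleasedLease asserts the lock ConfigMap still exists, but
  // that its leader-election record has an empty holderIdentity.
  func expectReleasedLease(ctx context.Context, t *testing.T, client kubernetes.Interface) {
      cm, err := client.CoreV1().ConfigMaps("openshift-cluster-version").Get(ctx, "version", metav1.GetOptions{})
      if err != nil {
          t.Fatal(err)
      }
      var record resourcelock.LeaderElectionRecord
      raw := cm.Annotations[resourcelock.LeaderElectionRecordAnnotationKey]
      if err := json.Unmarshal([]byte(raw), &record); err != nil {
          t.Fatal(err)
      }
      if record.HolderIdentity != "" {
          t.Fatalf("expected a released lease, but %q still holds it", record.HolderIdentity)
      }
  }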

Cherry-picked from dd09c3f (openshift#424), around conflicts due to the lack
of Context pivots in 4.5.
Clayton wants these in each goroutine we launch [1].  Obviously
there's no way to reach inside the informer Start()s and add it there.
I'm also adding this to the FIXME comment for rerolling the
auto-update worker goroutines; we'll get those straightened out in
future work.

Cherry-picked from 9c42a92 (openshift#424), around conflicts due to the lack
of Context pivots in 4.5.

[1]: openshift#424
Lala wanted the version included in the outgoing log line [1].  I'm
not sure why you'd be wondering which version of the CVO code was
running for that particular line, and not for other lines in the log,
but including the version there is easy enough.

While we're thinking about logging the CVO version, also remove the
useless %s formatting from the opening log line, because we don't need
to manipulate version.String at all.

[1]: openshift#424 (comment)
@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. label Aug 27, 2020
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1872906, which is invalid:

  • expected dependent Bugzilla bug 1843505 to be in one of the following states: VERIFIED, RELEASE_PENDING, CLOSED (ERRATA), but it is ON_QA instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1872906: pkg/start: Release leader lease on graceful shutdown

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Aug 27, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 27, 2020

Now that we have CI that cares about this (hooray!).  Generated with:

  $ go fmt ./...

using:

  $ go version
  go version go1.14.4 linux/amd64

@wking
Member Author

wking commented Sep 22, 2020

4.6 bug is VERIFIED, so we should be unblocked here:

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Sep 22, 2020
@openshift-ci-robot
Contributor

@wking: This pull request references Bugzilla bug 1872906, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.z) matches configured target release for branch (4.5.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1843505 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA))
  • dependent Bugzilla bug 1843505 targets the "4.6.0" release, which is one of the valid target releases: 4.6.0, 4.6.z
  • bug has dependents

In response to this:

4.6 bug is VERIFIED, so we should be unblocked here:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@LalatenduMohanty LalatenduMohanty (Member) left a comment

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Sep 22, 2020
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: LalatenduMohanty, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [LalatenduMohanty,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@markmc markmc added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Sep 25, 2020
@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

2 similar comments

@knobunc

knobunc commented Sep 26, 2020

/retest

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment

@wking
Member Author

wking commented Oct 1, 2020

e2e blew up on Kube API access, with lots of ":6443: i/o timeout" errors.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

13 similar comments

@wking
Member Author

wking commented Oct 2, 2020

Maybe openshift/release#12391 helped.

/retest

@wking
Member Author

wking commented Oct 2, 2020

All green again after openshift/release#12391. Ready for relabeling when the patch-manager is ready.

@mfojtik mfojtik added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Oct 7, 2020
@sdodson
Member

sdodson commented Oct 7, 2020

/bugzilla refresh

@openshift-ci-robot
Contributor

@sdodson: This pull request references Bugzilla bug 1872906, which is valid.

6 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.5.z) matches configured target release for branch (4.5.z)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
  • dependent bug Bugzilla bug 1843505 is in the state VERIFIED, which is one of the valid states (VERIFIED, RELEASE_PENDING, CLOSED (ERRATA))
  • dependent Bugzilla bug 1843505 targets the "4.6.0" release, which is one of the valid target releases: 4.6.0, 4.6.z
  • bug has dependents

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. and removed bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. labels Oct 7, 2020
@mfojtik
Member

mfojtik commented Oct 7, 2020

patch-manager notes: approved as this should fix a frequent 4.5->4.6 upgrade failure

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 2c849e5 into openshift:release-4.5 Oct 7, 2020
@openshift-ci-robot
Contributor

@wking: All pull requests linked via external trackers have merged:

Bugzilla bug 1872906 has been moved to the MODIFIED state.

In response to this:

Bug 1872906: pkg/start: Release leader lease on graceful shutdown

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@wking wking deleted the gracefully-release-leader-lease-4.5 branch October 7, 2020 19:15