
Bug 1843505: pkg/start: Release leader lease on graceful shutdown #424

Merged

Conversation


@wking wking commented Aug 4, 2020

So the incoming cluster-version operator doesn't need to wait for the outgoing operator's lease to expire, which can take a while:

I0802 10:06:01.056591       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cluster-version/version...
...
I0802 10:07:42.632719       1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version

and time out the "Cluster did not acknowledge request to upgrade in a reasonable time" testcase. Using ReleaseOnCancel has been the plan since 2b81f47 (#87). I'm not clear on why it (sometimes?) doesn't work today.

The discrepancy between the "exit after 2s no matter what" comment and the 5s After dates back to dbedb7a (#179), which bumped the After from 2s to 5s, but forgot to bump the comment. I'm removing that code here in favor of the two-minute timeout from b30aa0e (#349). We still exit immediately on a second TERM, for folks who get impatient waiting for the graceful timeout.

I'm also pushing the Background() initialization all the way up to the command-line handler, to make it more obvious that the context is scoped to the whole start invocation.
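
For reference, the client-go wiring in question looks roughly like the following sketch; this is illustrative, not this PR's exact code (the runWithLease name and the timing values are made up), but it shows the ReleaseOnCancel knob that makes the elector blank its record on a graceful cancel so the incoming pod doesn't have to wait out the lease:

package example

import (
    "context"
    "time"

    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/leaderelection"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
    "k8s.io/klog/v2"
)

// runWithLease runs the supplied function while holding the leader lease and
// releases the lease when ctx is canceled (ReleaseOnCancel), instead of
// letting it expire on its own.
func runWithLease(ctx context.Context, cfg *rest.Config, id string, run func(context.Context)) error {
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        return err
    }
    lock, err := resourcelock.New(
        resourcelock.ConfigMapsResourceLock, // the integration test below expects a ConfigMap lock record
        "openshift-cluster-version", "version",
        client.CoreV1(), client.CoordinationV1(),
        resourcelock.ResourceLockConfig{Identity: id},
    )
    if err != nil {
        return err
    }
    leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
        Lock:            lock,
        ReleaseOnCancel: true,             // clear HolderIdentity on graceful shutdown
        LeaseDuration:   90 * time.Second, // illustrative timings, not the CVO's real settings
        RenewDeadline:   45 * time.Second,
        RetryPeriod:     15 * time.Second,
        Callbacks: leaderelection.LeaderCallbacks{
            OnStartedLeading: run,
            OnStoppedLeading: func() { klog.Info("Stopped leading; shutting down.") },
        },
    })
    return nil
}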

@openshift-ci-robot openshift-ci-robot added the bugzilla/severity-low Referenced Bugzilla bug's severity is low for the branch this PR is targeting. label Aug 4, 2020
@openshift-ci-robot

@wking: This pull request references Bugzilla bug 1843505, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.6.0) matches configured target release for branch (4.6.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

In response to this:

Bug 1843505: pkg/start: Release leader lease on graceful shutdown

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Aug 4, 2020
@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 4, 2020
@wking wking force-pushed the gracefully-release-leader-lease branch 2 times, most recently from 6438ffc to 621d0cd Compare August 4, 2020 04:07
@wking
wking commented Aug 4, 2020

Tailing the CVO logs from the e2e-upgrade job (I grabbed the kubeconfig from the CI cluster console) gives:

I0804 05:00:54.430107       1 start.go:21] ClusterVersionOperator v1.0.0-337-g24893174-dirty
...
I0804 05:22:05.540840       1 start.go:172] Shutting down due to terminated
I0804 05:22:05.541090       1 task_graph.go:483] Canceled worker 3 while waiting for work
...
I0804 05:22:05.545382       1 task_graph.go:483] Canceled worker 0 while waiting for work
E0804 05:22:05.545604       1 metrics.go:160] Failed to gracefully shut down metrics server: accept tcp [::]:9099: use of closed network connection
...
I0804 05:22:05.558671       1 cvo.go:350] Shutting down ClusterVersionOperator
E0804 05:22:05.545078       1 task.go:81] error running apply for clusteroperator "machine-config" (510 of 608): Cluster operator machine-config is still updating
...
I0804 05:22:05.559098       1 task_graph.go:546] Result of work: [Cluster operator machine-config is still updating]
...
E0804 05:22:05.559401       1 sync_worker.go:339] unable to synchronize image (waiting 2m52.525702462s): Cluster operator machine-config is still updating
E0804 05:22:05.559623       1 start.go:221] close tcp [::]:9099: use of closed network connection
I0804 05:22:05.559722       1 start.go:238] Finished collecting operator goroutines.
I0804 05:22:05.559865       1 start.go:26] Graceful shutdown complete.

pkg/start/start.go: three outdated review threads (resolved)
@wking wking force-pushed the gracefully-release-leader-lease branch from 621d0cd to 235116d Compare August 4, 2020 21:27
@abhinavdahiya abhinavdahiya left a comment

LGTM

pkg/start/start.go: outdated review thread (resolved)
@wking wking force-pushed the gracefully-release-leader-lease branch from 235116d to a38f5e5 Compare August 4, 2020 23:17
@wking
wking commented Aug 4, 2020

integration fails with:

I0804 21:30:55.966164    9245 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 3.469557764s
I0804 21:30:55.969344    9245 start.go:238] Finished collecting operator goroutines.
I0804 21:30:55.969433    9245 leaderelection.go:243] attempting to acquire leader lease  e2e-cvo-d6v6nv/e2e-cvo-d6v6nv...
I0804 21:30:55.969456    9245 start.go:206] Stopped leading; shutting down.
I0804 21:30:55.969705    9245 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 3m37.636795318s
...
--- FAIL: TestIntegrationCVO_gracefulStepDown (60.04s)
    start_integration_test.go:542: the controller should create a lock record on a config map
    start_integration_test.go:563: timed out waiting for the condition
    start_integration_test.go:520: failed to delete cluster version e2e-cvo-d6v6nv: clusterversions.config.openshift.io "e2e-cvo-d6v6nv" not found

Because, when we were not launching a metrics server (which we don't for those integration tests), there was nothing blocking Options.run from an early exit. I've pushed 235116d -> a38f5e5 to get the leaderelection bit into the errorChannelCount collector, so now we will wait until it completes before exiting Options.run.
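
Roughly, the pattern described here looks like the following sketch (illustrative names, not the PR's code): every goroutine that run launches sends exactly one result on a shared error channel, and the collector counts how many sends to wait for, so once the leader-election goroutine is included in that count, run cannot return before it finishes.

package main

import (
    "context"
    "fmt"
)

// run launches each worker in its own goroutine and refuses to return until
// it has reaped one result per goroutine, so nothing (including the
// leader-election loop) is abandoned on exit.
func run(ctx context.Context, workers ...func(context.Context) error) error {
    errCh := make(chan error, len(workers))
    errorChannelCount := 0
    for _, w := range workers {
        errorChannelCount++
        w := w
        go func() { errCh <- w(ctx) }()
    }
    var firstErr error
    for i := 0; i < errorChannelCount; i++ {
        if err := <-errCh; err != nil && firstErr == nil {
            firstErr = err
        }
    }
    return firstErr
}

func main() {
    err := run(context.Background(),
        func(ctx context.Context) error { fmt.Println("metrics server done"); return nil },
        func(ctx context.Context) error { fmt.Println("leader election done"); return nil },
    )
    fmt.Println("run returned:", err)
}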

@wking wking force-pushed the gracefully-release-leader-lease branch from a38f5e5 to 6c4988e Compare August 4, 2020 23:45
@wking
wking commented Aug 5, 2020

Hrm:

I0804 23:49:24.651118    8781 start.go:222] Run context completed; beginning two-minute graceful shutdown period.
I0804 23:49:24.651174    8781 start.go:217] Waiting on 1 outstanding goroutines.
I0804 23:49:24.651184    8781 start.go:222] Run context completed; beginning two-minute graceful shutdown period.
I0804 23:49:24.651190    8781 start.go:217] Waiting on 1 outstanding goroutines.
...

/hold

until I figure out why we aren't breaking over to the shutting down branch.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2020
@wking wking force-pushed the gracefully-release-leader-lease branch from 6c4988e to d82e2bf Compare August 5, 2020 04:39
@wking
wking commented Aug 5, 2020

/hold cancel

I'd initialized the shutdownTimer variable inside the for loop, so it was getting cleared on each loop iteration 🤦. Moved it outside with 6c4988e -> d82e2bf.
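
For readers following along, the bug looks roughly like this sketch (illustrative names, not the PR's exact collector): the timer has to be declared outside the for loop; otherwise every iteration re-declares it as nil and the graceful-timeout case can never fire.

package example

import "time"

// collect reaps goroutine results until the graceful window after shutdown
// begins has expired (result counting elided here to keep the sketch short).
func collect(shuttingDown <-chan struct{}, results <-chan error) {
    // Declaring this inside the for loop (the bug) would reset it to nil on
    // every iteration, so the <-timeout case below could never fire.
    var shutdownTimer *time.Timer
    for {
        var timeout <-chan time.Time
        if shutdownTimer != nil {
            timeout = shutdownTimer.C
        }
        select {
        case <-shuttingDown:
            shuttingDown = nil // start the graceful window exactly once
            shutdownTimer = time.NewTimer(2 * time.Minute)
        case <-timeout:
            return // graceful window expired; give up on stragglers
        case <-results:
            // reap one goroutine result and keep looping
        }
    }
}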

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 5, 2020
@wking
wking commented Aug 5, 2020

Hrm, still not seeing a formal stopped leading release in there:

$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/424/pull-ci-openshift-cluster-version-operator-master-integration/1290870079050747904/build-log.txt |grep 'start\.go\|start_integration_test\.go\|leaderelection\|Starting ClusterVersionOperator with minimum reconcile period\|Shutting down ClusterVersionOperator'
I0805 04:41:53.258885    8994 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 2m18.782310567s
I0805 04:41:53.263473    8994 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 5.440919882s
I0805 04:41:53.263482    8994 start.go:218] Waiting on 1 outstanding goroutines.
I0805 04:41:53.263525    8994 leaderelection.go:243] attempting to acquire leader lease  e2e-cvo-wtq8nd/e2e-cvo-wtq8nd...
I0805 04:41:53.271143    8994 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 2m35.17222288s
I0805 04:41:53.274128    8994 leaderelection.go:253] successfully acquired lease e2e-cvo-wtq8nd/e2e-cvo-wtq8nd
I0805 04:41:53.274609    8994 cvo.go:316] Starting ClusterVersionOperator with minimum reconcile period 2m52.525702462s
I0805 04:41:53.471957    8994 start.go:222] Run context completed; beginning two-minute graceful shutdown period.
I0805 04:41:53.471990    8994 start.go:218] Waiting on 1 outstanding goroutines.
I0805 04:41:53.472174    8994 cvo.go:350] Shutting down ClusterVersionOperator
I0805 04:41:53.476123    8994 start.go:208] Stopped leading; shutting down.
I0805 04:41:53.476158    8994 start.go:244] Finished collecting operator goroutines.
I0805 04:41:53.479381    8994 cvo.go:350] Shutting down ClusterVersionOperator
    start_integration_test.go:542: the controller should create a lock record on a config map
    start_integration_test.go:566: verify the controller writes a leadership change event
    start_integration_test.go:575: after the context is closed, the lock record should be deleted quickly
    start_integration_test.go:592: timed out waiting for the condition
    start_integration_test.go:520: failed to delete cluster version e2e-cvo-wtq8nd: clusterversions.config.openshift.io "e2e-cvo-wtq8nd" not found
I0805 04:42:11.511270    8994 cvo.go:350] Shutting down ClusterVersionOperator
I0805 04:42:25.909230    8994 cvo.go:350] Shutting down ClusterVersionOperator
I0805 04:42:25.909230    8994 cvo.go:350] Shutting down ClusterVersionOperator

We see the "Stopped leading; shutting down" from our OnStoppedLeading. For some reason, cvo.go's Operator.Run seems to have four "Starting ClusterVersionOperator with..." starts and five "Shutting down ClusterVersionOperator" stops. Maybe logs from other, parallel tests are getting mixed in there... And it would be nice if LeaderElector.release generated any logging, but it doesn't seem to.

    },
    OnStoppedLeading: func() {
        klog.Info("Stopped leading; shutting down.")
        runCancel()
Contributor

I think you still need to exit, don't you? How confident are you that this truly resets everything?

wking (Member Author)

How confident are you that this truly resets everything?

If it doesn't, CI should turn it up, and we'll fix those bugs ;)

Contributor

This doesn't sound like what I described on slack. If we lost lease, we exit immediately, no graceful step down. When we have lost our lease we should not be running.

wking (Member Author) commented Aug 6, 2020

If we lost lease, we exit immediately...

OnStoppedLeading is not just "lost lease", it is also "gracefully released lease". We could have logic here about checking postMainContext.Err() to guess about lost vs. released. But runCancel() should immediately block all cluster-object-writing activity, so I think this is sufficient as it stands to keep from fighting the new leader. For comparison, master gives a full 5s grace period after the cancel before forcing a hard exit. If reducing my current 2m grace period to 5s would make you happy with this line, I'm happy to make that change.

@smarterclayton

From chat:

Top loop has to know about two different paths and treat them differently:

  1. cancelled by interrupt - process needs to wait for cvo.Run to exit, then release lock, then exit
  2. cancelled by losing lease - process should shutdown as fast as possible (by closing the context and then racing to os.Exit())
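
A rough sketch of those two paths (hypothetical helper names; as discussed above, the PR as written keeps a single graceful cancel rather than racing to os.Exit on lease loss):

package example

import (
    "context"
    "os"
    "os/signal"
    "syscall"

    "k8s.io/klog/v2"
)

// shutdownPaths treats an interrupt and a lost lease differently: the former
// waits for the main goroutine and releases the lock, the latter bails out as
// fast as possible because we no longer own the lease.
func shutdownPaths(runCancel context.CancelFunc, lostLease, mainDone <-chan struct{}, releaseLock func()) {
    sigCh := make(chan os.Signal, 1)
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

    select {
    case <-sigCh:
        // Path 1: interrupted. Stop the main work, wait for it to exit,
        // then voluntarily release the leader lock before returning.
        klog.Info("Interrupt received; beginning graceful shutdown.")
        runCancel()
        <-mainDone
        releaseLock()
    case <-lostLease:
        // Path 2: lost the lease. We must not keep writing to the cluster,
        // so cancel everything and race to exit.
        klog.Info("Lost leader lease; exiting immediately.")
        runCancel()
        os.Exit(1)
    }
}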

@wking wking force-pushed the gracefully-release-leader-lease branch 2 times, most recently from c5f88a1 to 122754f Compare August 5, 2020 19:42
@wking
wking commented Aug 5, 2020

cancelled by losing lease - process should shutdown as fast as possible (by closing the context and then racing to os.Exit())

I've pushed d82e2bf -> 122754f, removing Context.Start and inlining the goroutine launching logic in Options.run so we can count and reap the goroutines Context.Start used to launch, and launch them with more appropriate contexts. Now both TERM and leader-loss will lead to an as-soon-as-possible runCancel (like my previous version, but different from master, which had the immediate cancel() followed by 100ms sleep and all that). The canceled runContext should stop both CVO.Run() and, if enabled, AutoUpdate.Run() immediately, but there is nothing in start.go asserting that those two Run() functions respect the cancel and immediately stop applying cluster changes. Letting the remaining components (metrics server, informers, etc.) continue for a brief period of graceful shutdown after the TERM/leader-loss trigger seems reasonable to me, so I'm not yet sold on the race to os.Exit.

@wking
wking commented Aug 5, 2020

Bunch of "Entrypoint received interrupt: terminated".

/retest

@wking
wking commented Aug 5, 2020

integration still failing. I've pushed a WIP aabb0d5 to skip the other tests so we have clearer logs to analyze for the step-down failure.

wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 27, 2020
So the incoming cluster-version operator doesn't need to wait for the
outgoing operator's lease to expire, which can take a while [1]:

  I0802 10:06:01.056591       1 leaderelection.go:243] attempting to acquire leader lease  openshift-cluster-version/version...
  ...
  I0802 10:07:42.632719       1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version

and time out the:

  Cluster did not acknowledge request to upgrade in a reasonable time

testcase [2].  Using ReleaseOnCancel has been the plan since
2b81f47 (cvo: Release our leader lease when we are gracefully
terminated, 2019-01-16, openshift#87).  I'm not clear on why it (sometimes?)
doesn't work today.

The discrepancy between the "exit after 2s no matter what" comment and
the 5s After dates back to dbedb7a (cvo: When the CVO restarts,
perform one final sync to write status, 2019-04-27, openshift#179), which
bumped the After from 2s to 5s, but forgot to bump the comment.  I'm
removing that code here in favor of the two-minute timeout from
b30aa0e (pkg/cvo/metrics: Graceful server shutdown, 2020-04-15, openshift#349).
We still exit immediately on a second TERM, for folks who get
impatient waiting for the graceful timeout.

Decouple shutdownContext from the context passed into Options.run, to
allow TestIntegrationCVO_gracefulStepDown to request a graceful
shutdown.  And remove Context.Start(), inlining the logic in
Options.run so we can count and reap the goroutines it used to launch.
This also allows us to be more targeted with the context for each
goroutine:

* Informers are now launched before the lease controller, so they're
  up and running by the time we acquire the lease.  They remain
  running until the main operator CVO.Run() exits, after which we shut
  them down.  Having informers running before we have a lease is
  somewhat expensive in terms of API traffic, but we should rarely
  have two CVO pods competing for leadership since we transitioned to
  the Recreate Deployment strategy in 078686d
  (install/0000_00_cluster-version-operator_03_deployment: Set
  'strategy: Recreate', 2019-03-20, openshift#140) and 5d8a527
  (install/0000_00_cluster-version-operator_03_deployment: Fix
  Recreate strategy, 2019-04-03, openshift#155).  I don't see a way to block on
  their internal goroutine's completion, but maybe informers will grow
  an API for that in the future.

* The metrics server also continues to run until CVO.Run() exits,
  where previously we began gracefully shutting it down at the same
  time we started shutting down CVO.Run().  This ensures we are around
  and publishing any last-minute CVO.Run() changes.

* Leader election also continues to run until CVO.Run() exits.  We
  don't want to release the lease while we're still controlling
  things.

* CVO.Run() and AutoUpdate.Run() both stop immediately when the
  passed-in context is canceled or we call runCancel internally
  (because of a TERM, error from a goroutine, or loss of leadership).
  These are the only two goroutines that are actually writing to the
  API servers, so we want to shut them down as quickly as possible.

Drop an unnecessary runCancel() from the "shutting down" branch of the
error collector.  I'd added it in b30aa0e, but you can only ever
get into the "shutting down" branch if runCancel has already been
called.  And fix the scoping for the shutdownTimer variable so we
don't clear it on each for-loop iteration (oops :p, bug from
b30aa0e).

Add some logging to the error collector, so it's easier to see where
we are in the collection process from the operator logs.  Also start
logging collected goroutines by name, so we can figure out which may
still be outstanding.

Set terminationGracePeriodSeconds 130 to extend the default 30s [3],
to give the container the full two-minute graceful timeout window
before the kubelet steps in with a KILL.

Push the Background() initialization all the way up to the
command-line handler, to make it more obvious that the context is
scoped to the whole 'start' invocation.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-5b6ff896c6-57ppb_cluster-version-operator.log
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1843505#c7
[3]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core

Cherry picked from cc1921d (openshift#424), around conflicts due to the lack
of TLS metrics and the Context pivots in 4.5.
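
The context layering that commit message describes can be summarized in a small sketch (illustrative, with made-up helper names): only the API-writing work runs under runContext, while teardown gets a shutdownContext that stays live for a bounded graceful window after runContext is canceled.

package example

import (
    "context"
    "os"
    "time"
)

// run scopes the API-writing main work to runCtx, then gives teardown
// (metrics shutdown, informer stop, lease release) a still-live shutdownCtx
// for a bounded graceful window after the main work is canceled.
func run(ctx context.Context, interrupt <-chan os.Signal, mainWork, teardown func(context.Context)) {
    runCtx, runCancel := context.WithCancel(ctx)
    defer runCancel()
    shutdownCtx, shutdownCancel := context.WithCancel(ctx)
    defer shutdownCancel()

    done := make(chan struct{})
    go func() {
        defer close(done)
        mainWork(runCtx) // only the cluster-writing goroutines get runCtx
    }()

    select {
    case <-interrupt: // TERM: begin graceful shutdown
        runCancel()
    case <-done: // main work exited on its own
    }

    select {
    case <-done: // main work finished inside the graceful window
    case <-time.After(2 * time.Minute): // bounded window; terminationGracePeriodSeconds is 130
    }

    // runCtx is canceled by now, but shutdownCtx is still live for teardown.
    teardown(shutdownCtx)
}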
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 27, 2020
…ap lock release

From the godocs:

  $ grep -A5 '// HolderIdentity' vendor/k8s.io/client-go/tools/leaderelection/resourcelock/interface.go
    // HolderIdentity is the ID that owns the lease. If empty, no one owns this lease and
    // all callers may acquire. Versions of this library prior to Kubernetes 1.14 will not
    // attempt to acquire leases with empty identities and will wait for the full lease
    // interval to expire before attempting to reacquire. This value is set to empty when
    // a client voluntarily steps down.
    HolderIdentity       string      `json:"holderIdentity"`

The previous assumption that the release would involve ConfigMap
deletion was born with the test in 2b81f47 (cvo: Release our leader
lease when we are gracefully terminated, 2019-01-16, openshift#87).

Cherry picked from dd09c3f (openshift#424), around conflicts due to the lack
of Context pivots in 4.5.
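
So a check for a graceful release looks roughly like this hypothetical helper (not the test's actual code): the lock ConfigMap stays, but the holderIdentity in its leader-election annotation goes empty.

package example

import (
    "context"
    "encoding/json"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/leaderelection/resourcelock"
)

// leaseReleased reports whether the leader-election record on the lock
// ConfigMap has an empty holderIdentity, i.e. the previous leader stepped
// down voluntarily rather than letting the lease expire.
func leaseReleased(ctx context.Context, client kubernetes.Interface, namespace, name string) (bool, error) {
    cm, err := client.CoreV1().ConfigMaps(namespace).Get(ctx, name, metav1.GetOptions{})
    if err != nil {
        return false, err
    }
    raw, ok := cm.Annotations[resourcelock.LeaderElectionRecordAnnotationKey]
    if !ok {
        return false, fmt.Errorf("configmap %s/%s has no leader-election record", namespace, name)
    }
    var record resourcelock.LeaderElectionRecord
    if err := json.Unmarshal([]byte(raw), &record); err != nil {
        return false, err
    }
    return record.HolderIdentity == "", nil
}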
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 27, 2020
Clayton wants these in each goroutine we launch [1].  Obviously
there's no way to reach inside the informer Start()s and add it there.
I'm also adding this to the FIXME comment for rerolling the
auto-update worker goroutines; we'll get those straightened out in
future work.

Cherry picked from 9c42a92 (openshift#424), around conflicts due to the lack
of Context pivots in 4.5.

[1]: openshift#424
wking added a commit to wking/cluster-version-operator that referenced this pull request Aug 27, 2020
Lala wanted the version included in the outgoing log line [1].  I'm
not sure why you'd be wondering which version of the CVO code was
running for that particular line, and not for other lines in the log,
but including the version there is easy enough.

While we're thinking about logging the CVO version, also remove the
useless %s formatting from the opening log line, because we don't need
to manipulate version.String at all.

[1]: openshift#424 (comment)
wking added a commit to wking/cluster-version-operator that referenced this pull request Sep 9, 2020
Following cc1921d (pkg/start: Release leader lease on graceful
shutdown, 2020-08-03, openshift#424), which logs overall shutdown in
cmd/start.go, this commit will make it extremely clear in the CVO logs
when the metrics goroutine is wrapping up.
wking added a commit to wking/cluster-version-operator that referenced this pull request Oct 23, 2020
Apparently there's something in the HTTPS server goroutine that can
hang up even if we've called Shutdown() on the server [1].  Defend
against that with a safety valve to abandon stuck goroutines if
shutdownContext expires.

Also pivot to resultChannel and asyncResult, so we can get names for
the collected channels (and more easily identify the stuck channels by
elimination), following the pattern set by cc1921d (pkg/start:
Release leader lease on graceful shutdown, 2020-08-03, openshift#424).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1891143#c1
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Nov 11, 2020
Apparently there's something in the HTTPS server goroutine that can
hang up even if we've called Shutdown() on the server [1].  Defend
against that with a safety valve to abandon stuck goroutines if
shutdownContext expires.

Also pivot to resultChannel and asyncResult, so we can get names for
the collected channels (and more easily identify the stuck channels by
elimination), following the pattern set by cc1921d (pkg/start:
Release leader lease on graceful shutdown, 2020-08-03, openshift#424).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1891143#c1
wking added a commit to wking/cluster-version-operator that referenced this pull request Jan 20, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
wking added a commit to wking/cluster-version-operator that referenced this pull request Jan 20, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
openshift-cherrypick-robot pushed a commit to openshift-cherrypick-robot/cluster-version-operator that referenced this pull request Feb 10, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 11, 2021
a9e075a (pkg/cvo/cvo: Guard Operator.Run goroutine handling from
early cancels, 2021-01-28, openshift#508) made us more robust to situations
where we are canceled after acquiring the leader lock but before we
got into Operator.Run's UntilWithContext.  However, there was still a
bug from cc1921d (pkg/start: Release leader lease on graceful
shutdown, 2020-08-03, openshift#424) where we had not acquired the leader lock
[1].  postMainContext is used for metrics, informers, and the leader
election loop.  We used to only call postMainCancel after reaping the
main goroutine, and obviously that will only work if we've launched
the main goroutine.  This commit adds a new launchedMain to track
that.  If launchedMain is true, we get the old handling.  If
launchedMain is still false when runContext.Done, we now call
postMainCancel without waiting to reap a non-existent main goroutine.

There's also a new postMainCancel when the shutdown timer expires.  I
don't expect us to ever need that, but it protects us from future bugs
like this one.

I've added launchedMain without guarding it behind a lock, and it is
touched by both the main Options.run goroutine and the leader-election
callback.  So there's a racy chance of:

1. Options.run goroutine: runContext canceled, so runContext.Done() matches in Options.run
2. Leader-election goroutine: Leader lock acquired
3. Options.run goroutine: !launchedMain, so we call postMainCancel()
4. Leader-election goroutine: launchedMain set true
5. Leader-election goroutine: launches the main goroutine via CVO.Run(runContext, ...)

I'm trusting Operator.Run to respect runContext there and not do
anything significant, so the fact that we are already tearing down all
the post-main stuff won't cause problems.  Previous fixes like
a9e075a will help with that.  But there could still be bugs in
Operator.Run.  A lock around launchedMain that avoided calling
Operator.Run when runContext was already done would protect against
that, but seems like overkill in an already complicated goroutine
tangle.  Without the lock, we just have to field and fix any future
Operator.Run runContext issues as we find them.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927944
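
In outline, the launchedMain handling that commit describes looks something like this sketch (illustrative names; the real code keeps this inside the collector loop and, as noted above, tolerates the unguarded read of the flag):

package example

import (
    "context"
    "time"
)

// supervise waits for runCtx to end and then decides how to tear down the
// post-main machinery (metrics, informers, leader election).
func supervise(runCtx context.Context, postMainCancel context.CancelFunc, launchedMain *bool, mainDone <-chan struct{}, graceful time.Duration) {
    <-runCtx.Done()
    if !*launchedMain { // never acquired the lease, so there is no main goroutine to reap
        postMainCancel()
        return
    }
    select {
    case <-mainDone: // main goroutine reaped; now stop metrics/informers and release the lease
    case <-time.After(graceful): // safety valve if main never exits
    }
    postMainCancel()
}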
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 11, 2021
a9e075a (pkg/cvo/cvo: Guard Operator.Run goroutine handling from
early cancels, 2021-01-28, openshift#508) made us more robust to situations
where we are canceled after acquiring the leader lock but before we
got into Operator.Run's UntilWithContext.  However, there was still a
bug from cc1921d (pkg/start: Release leader lease on graceful
shutdown, 2020-08-03, openshift#424) where we had not acquired the leader lock
[1].  postMainContext is used for metrics, informers, and the leader
election loop.  We used to only call postMainCancel after reaping the
main goroutine, and obviously that will only work if we've launched
the main goroutine.  This commit adds a new launchedMain to track
that.  If launchedMain is true, we get the old handling.  If
launchedMain is still false when runContext.Done, we now call
postMainCancel without waiting to reap a non-existent main goroutine.

There's also a new postMainCancel when the shutdown timer expires.  I
don't expect us to ever need that, but it protects us from future bugs
like this one.

I've added launchedMain without guarding it behind a lock, and it is
touched by both the main Options.run goroutine and the leader-election
callback.  So there's a racy chance of:

1. Options.run goroutine: runContext canceled, so runContext.Done() matches
2. Leader-election goroutine: Leader lock acquired
3. Options.run goroutine: !launchedMain, so we call postMainCancel()
4. Leader-election goroutine: launchedMain set true
5. Leader-election goroutine: launches the main goroutine via CVO.Run(runContext, ...)

I'm trusting Operator.Run to respect runContext there and not do
anything significant, so the fact that we are already tearing down all
the post-main stuff won't cause problems.  Previous fixes like
a9e075a will help with that.  But there could still be bugs in
Operator.Run.  A lock around launchedMain that avoided calling
Operator.Run when runContext was already done would protect against
that, but seems like overkill in an already complicated goroutine
tangle.  Without the lock, we just have to field and fix any future
Operator.Run runContext issues as we find them.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927944
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 16, 2021
a9e075a (pkg/cvo/cvo: Guard Operator.Run goroutine handling from
early cancels, 2021-01-28, openshift#508) made us more robust to situations
where we are canceled after acquiring the leader lock but before we
got into Operator.Run's UntilWithContext.  However, there was still a
bug from cc1921d (pkg/start: Release leader lease on graceful
shutdown, 2020-08-03, openshift#424) where we had not acquired the leader lock
[1].  postMainContext is used for metrics, informers, and the leader
election loop.  We used to only call postMainCancel after reaping the
main goroutine, and obviously that will only work if we've launched
the main goroutine.  This commit adds a new launchedMain to track
that.  If launchedMain is true, we get the old handling.  If
launchedMain is still false when runContext.Done, we now call
postMainCancel without waiting to reap a non-existent main goroutine.

There's also a new postMainCancel when the shutdown timer expires.  I
don't expect us to ever need that, but it protects us from future bugs
like this one.

I've added launchedMain without guarding it behind a lock, and it is
touched by both the main Options.run goroutine and the leader-election
callback.  So there's a racy chance of:

1. Options.run goroutine: runContext canceled, so runContext.Done() matches
2. Leader-election goroutine: Leader lock acquired
3. Options.run goroutine: !launchedMain, so we call postMainCancel()
4. Leader-election goroutine: launchedMain set true
5. Leader-election goroutine: launches the main goroutine via CVO.Run(runContext, ...)

I'm trusting Operator.Run to respect runContext there and not do
anything significant, so the fact that we are already tearing down all
the post-main stuff won't cause problems.  Previous fixes like
a9e075a will help with that.  But there could still be bugs in
Operator.Run.  A lock around launchedMain that avoided calling
Operator.Run when runContext was already done would protect against
that, but seems like overkill in an already complicated goroutine
tangle.  Without the lock, we just have to field and fix any future
Operator.Run runContext issues as we find them.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1927944
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 20, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

---

This commit picks c4ddf03 (pkg/cvo: Use shutdownContext for final
status synchronization, 2021-01-19, openshift#517) back to 4.5.  It's not a
clean pick, because we're missing changes like:

* b72e843 (Bug 1822844: Block z level upgrades if
  ClusterVersionOverridesSet set, 2020-04-30, openshift#364).
* 1d1de3b (Use context to add timeout to cincinnati HTTP request,
  2019-01-15, openshift#410).

which also touched these lines.  But we've gotten this far without
backporting rhbz#1822844, and openshift#410 was never associated with a bug in
the first place, so instead of pulling back more of 4.6 to get a clean
pick, I've just manually reconciled the pick conflicts.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 23, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

---

This commit picks c4ddf03 (pkg/cvo: Use shutdownContext for final
status synchronization, 2021-01-19, openshift#517) back to 4.5.  It's not a
clean pick, because we're missing changes like:

* b72e843 (Bug 1822844: Block z level upgrades if
  ClusterVersionOverridesSet set, 2020-04-30, openshift#364).
* 1d1de3b (Use context to add timeout to cincinnati HTTP request,
  2019-01-15, openshift#410).

which also touched these lines.  But we've gotten this far without
backporting rhbz#1822844, and openshift#410 was never associated with a bug in
the first place, so instead of pulling back more of 4.6 to get a clean
pick, I've just manually reconciled the pick conflicts.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
wking added a commit to wking/cluster-version-operator that referenced this pull request Feb 23, 2021
Address a bug introduced by cc1921d (pkg/start: Release leader
lease on graceful shutdown, 2020-08-03, openshift#424), where canceling the
Operator.Run context would leave the operator with no time to attempt
the final sync [1]:

  E0119 22:24:15.924216       1 cvo.go:344] unable to perform final sync: context canceled

With this commit, I'm piping through shutdownContext, which gets a
two-minute grace period beyond runContext, to give the operator time
to push out that final status (which may include important information
like the fact that the incoming release image has completed
verification).

---

This commit picks c4ddf03 (pkg/cvo: Use shutdownContext for final
status synchronization, 2021-01-19, openshift#517) back to 4.5.  It's not a
clean pick, because we're missing changes like:

* b72e843 (Bug 1822844: Block z level upgrades if
  ClusterVersionOverridesSet set, 2020-04-30, openshift#364).
* 1d1de3b (Use context to add timeout to cincinnati HTTP request,
  2019-01-15, openshift#410).

which also touched these lines.  But we've gotten this far without
backporting rhbz#1822844, and openshift#410 was never associated with a bug in
the first place, so instead of pulling back more of 4.6 to get a clean
pick, I've just manually reconciled the pick conflicts.

Removing Start from pkg/start (again) fixes a buggy re-introduction in
the manually-backported 20421b6 (*: Add lots of Context and options
arguments, 2020-07-24, openshift#470).

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1916384#c10
jkyros added a commit to jkyros/machine-config-operator that referenced this pull request Jun 18, 2022
This:
- adds main/lease contexts to the controller
- sets up a counter and channels to track goroutine completion
- sets up a signal handler to catch when the controller is being
  terminated so we can cancel our contexts
- gracefully shuts down the controller upon receipt of a SIGINT/SIGTERM

The reason this does not use sync.WaitGroup instead is that
sync.WaitGroup has no awareness of 'what' it's waiting for, just 'how
many', so the channels are more useful.

Cribbed off of what the CVO did here:
openshift/cluster-version-operator#424
jkyros added a commit to jkyros/machine-config-operator that referenced this pull request Jun 18, 2022
This:
- adds main/lease contexts to the operator
- sets up a counter and channels to track goroutine completion
- sets up a signal handler to catch when the operator is being
  terminated so we can cancel our contexts
- gracefully shuts down the operator upon receipt of a SIGINT/SIGTERM

The reason this does not use sync.WaitGroup instead is that
sync.WaitGroup has no awareness of 'what' it's waiting for, just 'how
many', so the channels are more useful.

Cribbed off of what the CVO did here:
openshift/cluster-version-operator#424
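
The "channels instead of sync.WaitGroup" point is easy to see in a small sketch that borrows the resultChannel/asyncResult names mentioned earlier in this thread (the worker bodies here are illustrative): because each goroutine reports its name, a shutdown log can say which workers are still outstanding, which a bare counter cannot.

package main

import (
    "context"
    "fmt"
    "time"
)

type asyncResult struct {
    name  string
    error error
}

func main() {
    ctx, cancel := context.WithTimeout(context.Background(), time.Second)
    defer cancel()

    resultChannel := make(chan asyncResult, 2)
    launch := func(name string, work func(context.Context) error) {
        go func() { resultChannel <- asyncResult{name: name, error: work(ctx)} }()
    }

    launch("metrics-server", func(ctx context.Context) error { <-ctx.Done(); return nil })
    launch("lease-controller", func(ctx context.Context) error { <-ctx.Done(); return nil })

    outstanding := map[string]bool{"metrics-server": true, "lease-controller": true}
    for len(outstanding) > 0 {
        result := <-resultChannel
        delete(outstanding, result.name)
        fmt.Printf("collected %q (err=%v); still waiting on %d goroutine(s)\n", result.name, result.error, len(outstanding))
    }
}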
Labels
approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/severity-low: Referenced Bugzilla bug's severity is low for the branch this PR is targeting.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
lgtm: Indicates that a PR is ready to be merged.