Handle termination gracefully for controller manager and scheduler #76452
Conversation
(force-pushed 985ac74 to 9f59092)
/assign @sttts
/sig apimachinery
/cc @kubernetes/sig-api-machinery-api-reviews
```diff
 // If leader election is enabled, runCommand via LeaderElector until done and exit.
 if cc.LeaderElection != nil {
 	cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
-		OnStartedLeading: run,
+		OnStartedLeading: func(context.Context) {
+			sched.Run()
```
What does this mean when the context is closed? Will leaderElector.Run(ctx) below ever return if this call does not use the context?
It seems it uses the sched.config.StopEverything channel to synchronize with the provided context?
Storing a context or stop chan in a struct is generally not preferred. Passing the context into Scheduler.Run
would be more idiomatic.
```go
}

// Leader election is disabled, so runCommand inline until done.
run(ctx)
return fmt.Errorf("finished without leader elect")
```
To be clear here: we change the return value to be in line with kcm and ccm.
/assign @liggitt @stewart-yu @hzxuzhonghu
(force-pushed 7b4b9f1 to e11090f)
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
/assign @cheftako
(force-pushed e11090f to a7ad685)
/retest
/remove-lifecycle stale
```diff
 	"k8s.io/apimachinery/pkg/util/sets"
 	"k8s.io/apimachinery/pkg/util/uuid"
 	"k8s.io/apimachinery/pkg/util/wait"
 	"k8s.io/apiserver/pkg/server"
+	genericapiserver "k8s.io/apiserver/pkg/server"
```
This seems redundant with the prior line. Do we need an additional alias (genericapiserver) for a package we are already pulling in? We could just have server.SetupSignalHandler() below.
yes, thx.
```go
select {
case <-stopCh:
	cancel()
case <-ctx.Done():
}
```
What is the purpose of having both a stop channel and a done channel here? Especially as a context is usually associated with a request and our main run method does not seem related to a request.
The stop channel is controlled by the signals (SIGTERM and SIGINT), whereas the done channel is controlled by the Run method and allows for graceful termination. For example, the done channel will be closed when the component cannot create HTTP(S) sockets, when it loses leadership, or when it receives one of the signals. Note that closing either of the channels is equivalent to requesting that the application shut down.
```diff
 	Lock:          rl,
 	LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,
 	RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,
 	RetryPeriod:   c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,
 	Callbacks: leaderelection.LeaderCallbacks{
 		OnStartedLeading: run,
 		OnStoppedLeading: func() {
-			klog.Fatalf("leaderelection lost")
+			cancel()
+			utilruntime.HandleError(fmt.Errorf("leaderelection lost"))
```
This needs to be fatal, and this change appears to make it non-fatal.
No, this pull introduces a graceful shutdown for kcm, ccm and the scheduler. In this case, it means that when the component loses leadership it notifies and waits for all dependent controllers and listeners before shutting down. For example, for kcm it means that it will wait for all of its controllers as well as for the HTTPS and HTTP listeners.
The fundamental guarantee we have for controllers right now is that we will not run them concurrently (or at least we minimize the window where that might happen). This change violates that guarantee.
As soon as we are told we are not the leader anymore (OnStoppedLeading), we have to assume that another KCM has taken over the leadership role. We also know that we have other threads in this process which are continuing the role of active controllers. They must be stopped immediately to prevent them from making changes concurrently with the new KCM master. This needs to be fatal.
FYI I'm not saying that having the process kill itself on OnStoppedLeading is the ideal solution for controller concurrency; I think we can do better. However, I believe this kill-itself behavior is needed for HA clusters until we build a better solution for controller concurrency.
Should we just change this to log and os.Exit(0), since this is an expected exit? cc @smarterclayton
Yes, we probably should. I didn't know that KCM has such strong assumptions in this area, especially the scheduler. Although it seems like the leader election library doesn't guarantee anything - it may happen that two KCMs will be running at the same time. Can someone confirm this?
I suspect that KCM cares about efficiency - since correctness will be checked by the API server (resourceVersion). The scheduler, on the other hand, seems to care about correctness - #76452 (review)
Currently the KCMs in HA configurations use the fatal on leader-election loss to ensure that there are not two active KCMs running at the same time. While it would be nice to get additional efficiency by letting multiple KCMs process simultaneously, I do not think we have the necessary correctness guarantees in place for that to be safe. resourceVersion is not sufficient for all controllers to behave correctly. (E.g., fairly sure things like the cron/job controllers will schedule too much work.)
```diff
 	Lock:          rl,
 	LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,
 	RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,
 	RetryPeriod:   c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,
 	Callbacks: leaderelection.LeaderCallbacks{
 		OnStartedLeading: run,
 		OnStoppedLeading: func() {
-			klog.Fatalf("leaderelection lost")
+			cancel()
+			utilruntime.HandleError(fmt.Errorf("leaderelection lost"))
```
This needs to be fatal, and this change appears to make it non-fatal.
please see my previous comment #76452 (comment)
```diff
@@ -151,9 +163,13 @@ func Run(c *cloudcontrollerconfig.CompletedConfig, stopCh <-chan struct{}) error
 	if c.SecureServing != nil {
 		unsecuredMux := genericcontrollermanager.NewBaseHandler(&c.ComponentConfig.Generic.Debugging, checks...)
 		handler := genericcontrollermanager.BuildHandlerChain(unsecuredMux, &c.Authorization, &c.Authentication)
-		// TODO: handle stoppedCh returned by c.SecureServing.Serve
-		if _, err := c.SecureServing.Serve(handler, 0, stopCh); err != nil {
+		if serverStoppedCh, err := c.SecureServing.Serve(handler, 0, stopCh); err != nil {
```
Serving should use ctx.Done().
/retest
Alright, I think this pull is ready for review, PTAL.
@mfojtik: The following tests failed, say /retest to rerun them.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
This does some wiring of context/stopCh into scheduler, but only at the surface level. There are some goroutines in the scheduler which do not propagate cancellation because we operate on the assumption that the process will exit when leadership is lost.
With this change, those goroutines could cause bad behavior by competing with the new leader to do writes. e.g.
- old leader selects node A for pod
- new leader selects node B for pod
- new leader successfully binds pod
- old leader fails to bind pod to node A, and updates pod status with SchedulingFailed.
- mayhem
```diff
 // If leader election is enabled, runCommand via LeaderElector until done and exit.
 if cc.LeaderElection != nil {
 	cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
-		OnStartedLeading: run,
+		OnStartedLeading: func(context.Context) {
```
So... are we reusing the sched object each time this process acquires leadership? I don't see any code that would exit after graceful cleanup. I'm certain that attempting to re-use this object will fail: the leader election context is being ignored, and we still have a cancelled context in the sched struct. (This is why it would be preferable to pass the context into sched.Run().)
@ahg-g FYI
```diff
 	Lock:          rl,
 	LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,
 	RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,
 	RetryPeriod:   c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,
 	Callbacks: leaderelection.LeaderCallbacks{
 		OnStartedLeading: run,
 		OnStoppedLeading: func() {
-			klog.Fatalf("leaderelection lost")
```
I think that most of this pull is useful even without this change. @mfojtik can you keep this fatal and solve the 90% case first?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@mfojtik: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This change wires the stop channel, backed by the shutdown signal handler, down to the controller manager and scheduler. Doing this makes these two components properly close and release the ports they use for serving connections.
This is causing problems if you run these in containers with host ports, for example, where replacing the old container with a new container means you have to wait until the kernel frees up the TCP port for the next process.
Credits to @sttts for most of this wiring :-)
Does this PR introduce a user-facing change?: