Handle termination gracefully for controller manager and scheduler #76452
Conversation
(force-pushed 985ac74 to 9f59092)
/assign @sttts
/sig apimachinery
/cc @kubernetes/sig-api-machinery-api-reviews
```diff
 // If leader election is enabled, runCommand via LeaderElector until done and exit.
 if cc.LeaderElection != nil {
 	cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
-		OnStartedLeading: run,
+		OnStartedLeading: func(context.Context) {
+			sched.Run()
```
What does this mean when the context is closed? Will leaderElector.Run(ctx) below ever return if this call does not use the context?
It seems it uses the sched.config.StopEverything channel to synchronize with the provided context?
Storing a context or stop chan in a struct is generally not preferred. Passing the context into Scheduler.Run
would be more idiomatic.
```go
}

// Leader election is disabled, so runCommand inline until done.
run(ctx)
return fmt.Errorf("finished without leader elect")
```
To be clear here: we change the return value to be in line with kcm and ccm.
/assign @liggitt @stewart-yu @hzxuzhonghu
(force-pushed 7b4b9f1 to e11090f)
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
/assign @cheftako
(force-pushed e11090f to a7ad685)
/retest
/remove-lifecycle stale
```diff
 	"k8s.io/apimachinery/pkg/util/sets"
 	"k8s.io/apimachinery/pkg/util/uuid"
 	"k8s.io/apimachinery/pkg/util/wait"
 	"k8s.io/apiserver/pkg/server"
+	genericapiserver "k8s.io/apiserver/pkg/server"
```
This seems redundant with the prior line. Do we need an additional alias (genericapiserver) for a package we are already pulling in? We could just have server.SetupSignalHandler() below.
yes, thx.
```go
select {
case <-stopCh:
	cancel()
case <-ctx.Done():
}
```
What is the purpose of having both a stop channel and a done channel here? Especially as a context is usually associated with a request and our main run method does not seem related to a request.
The stop channel is controlled by the signals (SIGTERM and SIGINT), whereas the done channel is controlled by the Run method and allows for graceful termination. For example, the done channel will be closed when the component cannot create HTTP(S) sockets, when it loses leadership, or when it receives one of the signals. Note that closing either of the channels is equivalent to requesting that the application shut down.
```diff
 	Lock:          rl,
 	LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,
 	RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,
 	RetryPeriod:   c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,
 	Callbacks: leaderelection.LeaderCallbacks{
 		OnStartedLeading: run,
 		OnStoppedLeading: func() {
-			klog.Fatalf("leaderelection lost")
+			cancel()
+			utilruntime.HandleError(fmt.Errorf("leaderelection lost"))
```
This needs to be fatal, and this change appears to make it non-fatal.
No, this pull introduces a graceful shutdown for kcm, ccm and the scheduler. In this case, it means that when the component loses leadership it notifies and waits for all dependent controllers and listeners before shutting down. For example, for kcm it means that it will wait for all of its controllers as well as for the HTTPS and HTTP listeners.
The fundamental guarantee we have for controllers right now is that we will not run them concurrently (or at least we minimize the window where that might happen). This change violates that guarantee.
As soon as we are told we are not the leader anymore (OnStoppedLeading), we have to assume that another KCM has taken over the leadership role. We also know that we have other threads in this process which are continuing the role of active controllers. They must be stopped immediately to prevent them from making changes concurrently with the new KCM master. This needs to be fatal.
FYI I'm not saying that having the process kill itself on OnStoppedLeading is the ideal solution for controller concurrency; I think we can do better. However, I believe this kill-itself behavior is needed for HA clusters until we build a better solution for controller concurrency.
Should we just change this to log and os.Exit(0), since this is an expected exit? cc @smarterclayton
Yes, we probably should. I didn't know that KCM has such strong assumptions in this area, especially the scheduler. Although it seems like the leader election library doesn't guarantee anything - it may happen that two KCMs will be running at the same time. Can someone confirm this?
I suspect that KCM cares about efficiency - since correctness will be checked by the API server (resourceVersion). The scheduler, on the other hand, seems to care about correctness - #76452 (review)
Currently the KCMs in HA configurations use the fatal on leader-election loss to ensure that there are not two active KCMs running at the same time. While it would be nice to get additional efficiency by letting multiple KCMs process simultaneously, I do not think we have the necessary correctness guarantees in place for that to be safe. resourceVersion is not sufficient for all controllers to behave correctly. (E.g., fairly sure things like the cron/job controllers will schedule too much work.)
```diff
 	Lock:          rl,
 	LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,
 	RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,
 	RetryPeriod:   c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,
 	Callbacks: leaderelection.LeaderCallbacks{
 		OnStartedLeading: run,
 		OnStoppedLeading: func() {
-			klog.Fatalf("leaderelection lost")
+			cancel()
+			utilruntime.HandleError(fmt.Errorf("leaderelection lost"))
```
This needs to be fatal, and this change appears to make it non-fatal.
please see my previous comment #76452 (comment)
```diff
@@ -151,9 +163,13 @@ func Run(c *cloudcontrollerconfig.CompletedConfig, stopCh <-chan struct{}) error
 	if c.SecureServing != nil {
 		unsecuredMux := genericcontrollermanager.NewBaseHandler(&c.ComponentConfig.Generic.Debugging, checks...)
 		handler := genericcontrollermanager.BuildHandlerChain(unsecuredMux, &c.Authorization, &c.Authentication)
-		// TODO: handle stoppedCh returned by c.SecureServing.Serve
-		if _, err := c.SecureServing.Serve(handler, 0, stopCh); err != nil {
+		if serverStoppedCh, err := c.SecureServing.Serve(handler, 0, stopCh); err != nil {
```
Serving should use ctx.Done().
/retest
Alright, I think this pull is ready for review, PTAL.
@mfojtik: The following tests failed, say /retest to rerun them.
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
This does some wiring of context/stopCh into scheduler, but only at the surface level. There are some goroutines in the scheduler which do not propagate cancellation because we operate on the assumption that the process will exit when leadership is lost.
With this change, those goroutines could cause bad behavior by competing with the new leader to do writes. e.g.
- old leader selects node A for pod
- new leader selects node B for pod
- new leader successfully binds pod
- old leader fails to bind pod to node A, and updates pod status with SchedulingFailed.
- mayhem
```diff
 // If leader election is enabled, runCommand via LeaderElector until done and exit.
 if cc.LeaderElection != nil {
 	cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
-		OnStartedLeading: run,
+		OnStartedLeading: func(context.Context) {
```
So... are we reusing the sched object each time this process acquires leadership? I don't see any code that would exit after graceful cleanup. I'm certain that attempting to re-use this object will fail: the leader election context is being ignored, and we still have a cancelled context in the sched struct. (This is why it would be preferable to pass the context into sched.Run().)
@ahg-g FYI
```diff
 	Lock:          rl,
 	LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,
 	RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,
 	RetryPeriod:   c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,
 	Callbacks: leaderelection.LeaderCallbacks{
 		OnStartedLeading: run,
 		OnStoppedLeading: func() {
-			klog.Fatalf("leaderelection lost")
```
I think that most of this pull is useful even without this change. @mfojtik can you keep this fatal and solve the 90% case first?
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@mfojtik: PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Rotten issues close after 30d of inactivity. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
@fejta-bot: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
What type of PR is this?
/kind bug
What this PR does / why we need it:
This change wires the stop channel, backed by the shutdown signal handler, down to the controller manager and scheduler. Doing this makes these two components properly close and release the ports they use for serving connections.
This is causing problems if you run these in containers with host ports, for example, where replacing the old container with a new container means you have to wait until the kernel frees up the TCP port for the next process.
Credits to @sttts for most of this wiring :-)
Does this PR introduce a user-facing change?: