
Handle termination gracefully for controller manager and scheduler #76452

Closed
wants to merge 12 commits into from

Conversation

@mfojtik (Contributor) commented Apr 11, 2019

What type of PR is this?

/kind bug

What this PR does / why we need it:

This change wires the stop channel backed by the shutdown signal handler down to the controller manager and the scheduler. Doing this makes these two components properly close and release the ports they use for serving connections.

Not releasing the ports causes problems if you run these components in containers with host ports, for example: replacing the old container with a new one means you have to wait until the kernel frees up the TCP port for the next process.
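
For illustration, here is a minimal sketch of the pattern outside of Kubernetes (the port and server below are only examples, not the actual component code): a signal handler backs a stop channel, and on shutdown the server is closed so its listening socket is released before the process exits.

```go
package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	// stopCh is closed on the first SIGTERM/SIGINT, mirroring the
	// signal-handler-backed stop channel wired through in this PR.
	stopCh := make(chan struct{})
	go func() {
		c := make(chan os.Signal, 2)
		signal.Notify(c, syscall.SIGTERM, syscall.SIGINT)
		<-c
		close(stopCh)
	}()

	srv := &http.Server{Addr: ":10257"} // hypothetical serving port
	go srv.ListenAndServe()

	<-stopCh
	// Shutting the server down closes the listening socket, so a
	// replacement container can immediately re-bind the same host port.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	srv.Shutdown(ctx)
}
```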

Credits to @sttts for most of this wiring :-)

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/bug Categorizes issue or PR as related to a bug. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 11, 2019
@k8s-ci-robot k8s-ci-robot added the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Apr 11, 2019
@mfojtik (Contributor, author) commented Apr 11, 2019

/assign @sttts

@mfojtik (Contributor, author) commented Apr 11, 2019

/sig apimachinery

/cc @kubernetes/sig-api-machinery-api-reviews

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API area/apiserver sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 11, 2019
 // If leader election is enabled, runCommand via LeaderElector until done and exit.
 if cc.LeaderElection != nil {
 	cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
-		OnStartedLeading: run,
+		OnStartedLeading: func(context.Context) {
+			sched.Run()
Contributor:

What does this mean when the context is closed? Will leaderElector.Run(ctx) below ever return if this call does not use the context?

Contributor (author):

sched.config.StopEverything

It seems it uses this channel to synchronize with the provided context?

Reviewer:

Storing a context or stop chan in a struct is generally not preferred. Passing the context into Scheduler.Run would be more idiomatic.
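
A small sketch of the two shapes being compared (the types and method names here are illustrative, not the real scheduler API):

```go
package sketch

import "context"

// Less preferred: the stop channel lives in the struct and Run consumes it implicitly.
type Scheduler struct {
	StopEverything <-chan struct{}
}

func (s *Scheduler) Run() { <-s.StopEverything }

// More idiomatic: cancellation is passed in explicitly, so the leader-election
// callback can hand its context straight through, e.g.
//   OnStartedLeading: func(ctx context.Context) { sched.RunWithContext(ctx) }
func (s *Scheduler) RunWithContext(ctx context.Context) { <-ctx.Done() }
```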

}

// Leader election is disabled, so runCommand inline until done.
run(ctx)
return fmt.Errorf("finished without leader elect")
Contributor:

To be clear here: we change the return value to be in line with kcm and ccm.

@k8s-ci-robot k8s-ci-robot added sig/release Categorizes an issue or PR as relevant to SIG Release. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Apr 11, 2019
@sttts (Contributor) commented Apr 11, 2019

/assign @liggitt @stewart-yu @hzxuzhonghu

@mfojtik force-pushed the wire-term-signal branch 2 times, most recently from 7b4b9f1 to e11090f on April 11, 2019 at 17:04
@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.

@fedebongio (Contributor)

/assign @cheftako

@mfojtik (Contributor, author) commented Apr 12, 2019

/retest

@p0lyn0mial (Contributor)

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 3, 2019
"k8s.io/apimachinery/pkg/util/sets"
"k8s.io/apimachinery/pkg/util/uuid"
"k8s.io/apimachinery/pkg/util/wait"
"k8s.io/apiserver/pkg/server"
genericapiserver "k8s.io/apiserver/pkg/server"
Member:

This seems redundant with the prior line. Do we need an additional alias (genericapiserver) for a package we are already pulling in? We could just have server.SetupSignalHandler() below.

Contributor:

yes, thx.

select {
case <-stopCh:
cancel()
case <-ctx.Done():
Member:

What is the purpose of having both a stop channel and a done channel here? Especially as a context is usually associated with a request and our main run method does not seem related to a request.

Contributor:

The stop channel is controlled by the signals (SIGTERM and SIGINT), whereas the done channel is controlled by the Run method and allows for graceful termination. For example, the done channel will be closed when the component cannot create HTTP(S) sockets, when it loses leadership, or when it receives one of the signals. Note that closing either of the channels is equivalent to requesting that the application shut down.
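
As a rough sketch of that wiring (an assumed shape, not the exact code in this PR):

```go
package sketch

import "context"

// Run owns a context whose Done channel closes either when the signal-backed
// stopCh fires or when Run itself requests shutdown (lost leadership, failure
// to open listeners, ...). Cancelling is idempotent, so both paths are safe.
func Run(stopCh <-chan struct{}) error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go func() {
		select {
		case <-stopCh: // SIGTERM / SIGINT
			cancel()
		case <-ctx.Done(): // internal shutdown already requested
		}
	}()

	// ... start controllers and listeners, passing ctx or ctx.Done() ...

	<-ctx.Done() // either shutdown source ends up here
	return nil
}
```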

 	Lock:          rl,
 	LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,
 	RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,
 	RetryPeriod:   c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,
 	Callbacks: leaderelection.LeaderCallbacks{
 		OnStartedLeading: run,
 		OnStoppedLeading: func() {
-			klog.Fatalf("leaderelection lost")
+			cancel()
+			utilruntime.HandleError(fmt.Errorf("leaderelection lost"))
Member:

This needs to be fatal and this change appears to make it non fatal.

Contributor:

No, this pull introduces a graceful shutdown for kcm, ccm, and the scheduler. In this case it means that when the component loses leadership, it notifies and then waits for all dependent controllers and listeners before shutting down. For kcm, for example, that means waiting for all of its controllers as well as for the HTTPS and HTTP listeners.
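
Roughly, that "notify and wait" shape looks like the following (a sketch under assumed names; the real code tracks its controllers and listener stop channels differently):

```go
package sketch

import (
	"context"
	"sync"
)

// waitForShutdown blocks until shutdown is requested (lost leadership or a
// signal cancels ctx), then waits for everything that was started: controller
// goroutines tracked by a WaitGroup, and the HTTP/HTTPS listeners, whose
// stopped channels are closed once their sockets are fully released.
func waitForShutdown(ctx context.Context, controllers *sync.WaitGroup,
	httpStopped, httpsStopped <-chan struct{}) {

	<-ctx.Done()
	controllers.Wait()
	<-httpStopped
	<-httpsStopped
}
```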

Member:

The fundamental guarantee we have for controllers right now is that we will not run them concurrently (or at least we minimize the window where that might happen). This change violates that guarantee.

As soon as we are told we are no longer the leader (OnStoppedLeading), we have to assume that another KCM has taken over the leadership role. We also know that we have other threads in this process which are continuing the role of active controllers. They must be stopped immediately to prevent them from making changes concurrently with the new KCM master. This needs to be fatal.

@cheftako (Member), Sep 9, 2019:

FYI, I'm not saying that having the process kill itself on OnStoppedLeading is the ideal solution for controller concurrency; I think we can do better. However, I believe this kill-itself behavior is needed for HA clusters until we build a better solution for controller concurrency.

Member:

should we just change this to log and os.Exit(0), since this is an expected exit? cc @smarterclayton
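
That suggestion would look roughly like this (a sketch of the proposal, not code from this PR):

```go
package sketch

import (
	"os"

	"k8s.io/klog"
)

func onStoppedLeading() {
	// Losing the lease is an expected way to exit, so log at info level
	// instead of Fatalf and terminate with a zero exit code.
	klog.Info("leaderelection lost, exiting")
	os.Exit(0)
}
```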

Contributor:

Yes, we probably should. I didn't know that KCM has such strong assumptions in this area, especially the scheduler. Although it seems like the leader election library doesn't guarantee anything - it may happen that two KCMs will be running at the same time. Can someone confirm this?

I suspect that KCM cares about efficiency - since correctness will be checked by the API server (resourceVersion). The scheduler, on the other hand, seems to care about correctness - #76452 (review)

Member:

Currently the KCMs in HA configurations rely on the fatal exit on lost leader election to ensure that there are not two active KCMs running at the same time. While it would be nice to gain additional efficiency by letting multiple KCMs process simultaneously, I do not think we have the necessary correctness guarantees in place for that to be safe. resourceVersion is not sufficient for all controllers to behave correctly (e.g. I'm fairly sure things like the cron/job controllers would schedule too much work).

 	Lock:          rl,
 	LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,
 	RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,
 	RetryPeriod:   c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,
 	Callbacks: leaderelection.LeaderCallbacks{
 		OnStartedLeading: run,
 		OnStoppedLeading: func() {
-			klog.Fatalf("leaderelection lost")
+			cancel()
+			utilruntime.HandleError(fmt.Errorf("leaderelection lost"))
Member:

This needs to be fatal and this change appears to make it non fatal.

Contributor:

please see my previous comment #76452 (comment)

@@ -151,9 +163,13 @@ func Run(c *cloudcontrollerconfig.CompletedConfig, stopCh <-chan struct{}) error
 	if c.SecureServing != nil {
 		unsecuredMux := genericcontrollermanager.NewBaseHandler(&c.ComponentConfig.Generic.Debugging, checks...)
 		handler := genericcontrollermanager.BuildHandlerChain(unsecuredMux, &c.Authorization, &c.Authentication)
-		// TODO: handle stoppedCh returned by c.SecureServing.Serve
-		if _, err := c.SecureServing.Serve(handler, 0, stopCh); err != nil {
+		if serverStoppedCh, err := c.SecureServing.Serve(handler, 0, stopCh); err != nil {
Contributor:

Serving should use ctx.Done()
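
Since ctx.Done() already has the type <-chan struct{}, the context's cancellation can feed anything that takes a stop channel. A tiny sketch (serve below is a stand-in, not the real SecureServingInfo.Serve):

```go
package sketch

import "context"

// serve is a stand-in for a function that takes a stop channel, like the
// secure-serving Serve call in this diff.
func serve(stopCh <-chan struct{}) {}

func example(ctx context.Context) {
	// ctx.Done() satisfies <-chan struct{}, so the context's cancellation
	// can drive serving shutdown directly.
	serve(ctx.Done())
}
```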

@p0lyn0mial (Contributor)

/retest

@p0lyn0mial (Contributor)

Alright, I think this pull is ready for review, PTAL.

@k8s-ci-robot (Contributor) commented Sep 9, 2019

@mfojtik: The following tests failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| pull-kubernetes-local-e2e | 9f93300 | link | /test pull-kubernetes-local-e2e |
| pull-kubernetes-verify | 080fc8a | link | /test pull-kubernetes-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@misterikkit left a comment:

This does some wiring of context/stopCh into scheduler, but only at the surface level. There are some goroutines in the scheduler which do not propagate cancellation because we operate on the assumption that the process will exit when leadership is lost.

With this change, those goroutines could cause bad behavior by competing with the new leader to do writes, e.g. (see the sketch after this list):

  1. old leader selects node A for pod
  2. new leader selects node B for pod
  3. new leader successfully binds pod
  4. old leader fails to bind pod to node A, and updates pod status with SchedulingFailed.
  5. mayhem
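
For instance, a worker loop that respects cancellation could be started like this (a sketch; scheduleOne is just an assumed per-iteration function, not the scheduler's real signature):

```go
package sketch

import (
	"context"

	"k8s.io/apimachinery/pkg/util/wait"
)

// startWorker runs scheduleOne in a loop that stops as soon as ctx is
// cancelled (e.g. on lost leadership), instead of racing the new leader.
func startWorker(ctx context.Context, scheduleOne func(context.Context)) {
	go wait.UntilWithContext(ctx, scheduleOne, 0)
}
```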

 // If leader election is enabled, runCommand via LeaderElector until done and exit.
 if cc.LeaderElection != nil {
 	cc.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
-		OnStartedLeading: run,
+		OnStartedLeading: func(context.Context) {

So... are we reusing the sched object each time in this process? I don't see any code that would exit after graceful cleanup. I'm certain that attempting to re-use this object will fail. The leader election context is being ignored, and we still have a cancelled context in the sched struct. (This is why it would be preferable to pass the context into sched.Run().)

@misterikkit

@ahg-g FYI

 	Lock:          rl,
 	LeaseDuration: c.ComponentConfig.Generic.LeaderElection.LeaseDuration.Duration,
 	RenewDeadline: c.ComponentConfig.Generic.LeaderElection.RenewDeadline.Duration,
 	RetryPeriod:   c.ComponentConfig.Generic.LeaderElection.RetryPeriod.Duration,
 	Callbacks: leaderelection.LeaderCallbacks{
 		OnStartedLeading: run,
 		OnStoppedLeading: func() {
 			klog.Fatalf("leaderelection lost")
Contributor:

I think that most of this pull is useful even without this change. @mfojtik can you keep this fatal and solve the 90% case first?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 12, 2020
@k8s-ci-robot (Contributor)

@mfojtik: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@liggitt liggitt removed their assignment Feb 7, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Mar 8, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

SIG Release automation moved this from Backlog to Done (1.19) Apr 7, 2020
@k8s-ci-robot (Contributor)

@fejta-bot: Closed this PR.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
area/apiserver
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/api-change (Categorizes issue or PR as related to adding, removing, or otherwise changing an API)
kind/bug (Categorizes issue or PR as related to a bug.)
lifecycle/rotten (Denotes an issue or PR that has aged beyond stale and will be auto-closed.)
needs-priority (Indicates a PR lacks a `priority/foo` label and requires one.)
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
release-note-none (Denotes a PR that doesn't merit a release note.)
sig/api-machinery (Categorizes an issue or PR as relevant to SIG API Machinery.)
sig/cloud-provider (Categorizes an issue or PR as relevant to SIG Cloud Provider.)
sig/release (Categorizes an issue or PR as relevant to SIG Release.)
sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)