Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle termination gracefully for controller manager and scheduler #76452
Handle termination gracefully for controller manager and scheduler #76452
Changes from 1 commit
ebd8822
0768f31
768390d
ff005f5
d5de73f
e9ef7f7
a87b112
124aaf4
4e79bd6
28622c8
c6abdd2
080fc8a
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the purpose of having both a stop channel and a done channel here? Especially as a context is usually associated with a request and our main run method does not seem related to a request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stop
channel is controlled by the signals( SIGTERM and SIGINT) whereasdone
channel is controlled byRun
method and allows for graceful termination. For example,done
channel will be closed when the component cannot createHTTP{S}
sockets, when it loses leadership or when it receives one of the signals. Note that closing one of the channels is equivalent to requesting closing the application.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Serving
should usectx.Done()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that most of this pull is useful even without this change. @mfojtik can you keep this fatal and solve the 90% case first?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be fatal and this change appears to make it non fatal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please see my previous comment #76452 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be fatal and this change appears to make it non fatal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, this pull introduces a graceful shutdown for
kcm
,ccm
and scheduler, in this case, it means that when the component loses leadership it notifies and waits for all dependant controllers and listeners before shutting down. For example, forkcm
it means that it will wait for all its controllers as well as for HTTPS and HTTP listeners.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fundamental guarantee we have for controllers right now is that we will not run them concurrently. (Or at least minimize then window where that might happen). This change violates that guarantee.
As soon as we are told we are not the leader anymore (OnStopLeading). We have to assume that another KCM has taken over the leadership role. We also know that we have other threads in this process which are continuing the role of active controllers. They must be stopped immediately to prevent them from making concurrent changes with the new KCM master. This needs to be fatal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI I'm not saying that having the process kill itself on OnStoppedLeading is the ideal solution for controller concurrency. I think we can do better. However I believe this kill itself behavior is needed for HA clusters until we build a better solution of controller concurrency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we just change this to log and
os.Exit(0)
, since this is an expected exit? cc @smarterclaytonThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we probably should, I didn't know that KCM has such strong assumptions in this area especially the scheduler. Although it seems like the leader election library doesn't guarantee anything - it may happen that two KCMs will be running at the same time. Can someone confirm this?
I suspect that KCM cares about efficiency - since correctness will be checked by the API server (
resourceVersion
). The scheduler, on the other hand, seems to care about correctness - #76452 (review)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the KCMs in HA configurations use the fatal on leader election to ensure that there are not two active KCMs running at the same time. While it would be nice to get additional efficiency by letting multiple KCMs process simultaneously, I do not think we have the necessary correctness guarantees in place for that to be safe. resourceVersion is not sufficient for all controllers to behave correctly. (Eg. Fairly sure things like the cron/job controllers will schedule too much work)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So... Are we reusing the
sched
object each time this process. I don't see any code that would exit after graceful cleanup. I'm certain that attempting to re-use this object will fail. The leader election context is being ignored, and we still have a cancelled context in the sched struct. (This is why it would be preferable to pass context intosched.Run()
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this mean when the context is closed? Will
leaderElector.Run(ctx)
below return ever if this call does not use the context?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it use this chan to synchronize over the provided context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing a context or stop chan in a struct is generally not preferred. Passing the context into
Scheduler.Run
would be more idiomatic.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be clear here: we change the return value to be inline with kcm and ccm.