Enhance cooperation with scheduler binding cycle #81433
Comments
One option is to pass `sched.config.StopEverything` into `pluginContext`.
@tedyu Yes, that could be a variant of option 2. It depends on whether we think it's generic enough to be included in every plugin, or just the plugins in the "binding cycle".
It seems `StopEverything` is generic for all plugins.
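For illustration, a minimal sketch of what that could look like; `pluginContext` and its field names here are hypothetical, not the actual framework API:

```go
// Hypothetical sketch only: a pluginContext carrying the scheduler's stop
// channel so every plugin can observe shutdown.
type pluginContext struct {
	// StopEverything is closed when the scheduler shuts down; plugins can
	// select on it to cancel long-running work.
	StopEverything <-chan struct{}
}

func newPluginContext(stopEverything <-chan struct{}) *pluginContext {
	return &pluginContext{StopEverything: stopEverything}
}
```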
What do you think of making `Run()` accept a `context.Context`?
@draveness Incorporating a `context.Context` sounds reasonable to me.
Just took a look at the related code and found that there is a TODO "once Run() accepts a context, it should be used here": kubernetes/cmd/kube-scheduler/app/server.go Lines 243 to 258 in 0d579bf
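A minimal sketch of resolving that TODO, assuming the surrounding server code provides `sched` and `stopCh` (the wrapper name is illustrative):

```go
import "context"

// runWithStopCh bridges the legacy stop channel to a context.Context so that
// sched.Run can take a context, as the TODO suggests.
func runWithStopCh(sched *Scheduler, stopCh <-chan struct{}) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go func() {
		<-stopCh // existing shutdown signal
		cancel() // propagate it as context cancellation
	}()
	sched.Run(ctx)
}
```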
I'd like to pull out a PR to do the refactor with `context.Context`.
Please hold the PR. Let's do a thorough discussion first.
Thanks for it! I think we need to reach a consensus on which level this tries to solve the problem at.
I think the cleanest solution is to wait for the current schedule cycle to finish. That is, wait for the binding goroutines that are already running before exiting. Side question: where is `StopEverything` closed?
Actually we don't "wait" here. Once `StopEverything` is closed, the goroutines are signaled to stop, but nothing waits for them to finish.
I'm not sure I like it. The current scheduler adopts an "optimistic" strategy (the binding cycle runs asynchronously in its own goroutine).
It's spread across our codebase: kubernetes/pkg/scheduler/factory/factory.go Lines 273 to 278 in 32dc42e
kubernetes/test/integration/scheduler/util.go Lines 293 to 295 in 32dc42e
@alculquicondor I'm not aware of any background on this. I think it's some legacy issue on using `stopCh` instead of a `context.Context`:

```go
func Run(cc schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}, registryOptions ...Option) error {
	...
	// Prepare a reusable runCommand function.
	run := func(ctx context.Context) {
		sched.Run(ctx)
	}
	...
}

func (sched *Scheduler) Run(ctx context.Context) {
	if !sched.config.WaitForCacheSync() {
		return
	}
	wait.UntilWithContext(ctx, sched.scheduleOne, 0)
}

func (sched *Scheduler) scheduleOne(ctx context.Context) {
	...
}
```
@hex108 I'm leaning toward giving the control to each plugin - which means technically passing in the `stopCh` (or a context) to each plugin.
Reading through the conversations, it seems that there are 2 options at a high level:

1. Pass the stop signal (a stop channel or a context) into each plugin and leave the decision to the plugin.
2. Have the framework check the stop signal between plugins in the binding cycle and abort on its own.
I am also leaning towards Option 1. Basically we move the responsibility to each plugin. But I think this is probably the only way to get it right.
If we leave the decision to the plugins, then I still think waiting for the whole "binding" routine to finish is a good idea.
My recommendation is that we not only close the channel, but also wait for everything to finish processing. Closing the channel would start "stopping" the routines, but they are not guaranteed to finish.
I'm not saying we block on it forever.
The decision is still owned by plugins. If they want to finish the whole "binding" routine, they can return "success" as the status; otherwise cancel internally and return "non-success".
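As a rough sketch of that plugin-owned decision (the plugin type and string statuses here are hypothetical, not the real framework types):

```go
import "time"

type waitingPermitPlugin struct {
	conditionMet <-chan struct{} // whatever condition the plugin normally waits for
}

// Permit waits for its condition but cancels internally on shutdown.
func (p *waitingPermitPlugin) Permit(stopCh <-chan struct{}) string {
	select {
	case <-p.conditionMet:
		return "Success" // let the whole "binding" routine finish
	case <-time.After(30 * time.Second):
		return "Unschedulable" // the plugin's own timeout
	case <-stopCh:
		return "Error" // shutting down: cancel internally, return non-success
	}
}
```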
Yup, that's a debugger case. The main scheduler process isn't associated with signal handling yet.
@Huang-Wei +1, it makes sense.
Agreed that leaving the control to the plugin is a good idea. They might consider that any operation they have initiated needs to be completed. But again, that implies that we should have a wait group for all the routines already spawned, so that we can reliably eliminate the flakes. |
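A minimal sketch of that wait-group idea, with hypothetical names:

```go
import "sync"

// bindingTracker tracks every spawned binding goroutine so shutdown can wait
// for in-flight work instead of merely signaling it.
type bindingTracker struct {
	wg sync.WaitGroup
}

func (t *bindingTracker) goBind(bind func()) {
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		bind()
	}()
}

// shutdown closes the stop channel first so routines start exiting, then
// blocks until all of them have actually finished.
func (t *bindingTracker) shutdown(stopEverything chan struct{}) {
	close(stopEverything)
	t.wg.Wait()
}
```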
Does that mean StopEverything is currently only closed in the tests? |
Yes. Additionally, we can explicitly stop this when the master loses its lease (if that doesn't happen today).
I don't think #81238 should be used as a motivation for what is being proposed in this issue. Each integration test creates a new scheduler object, and the problem with that flake was sharing plugin objects between tests (hence effectively between scheduler instances), which shouldn't happen anyway, in tests or in practice. So, if the only use case we have for plumbing the stop channel to the plugins is tests, then I don't think we should do anything, because that use case can be solved by making sure tests don't share state, which is a reasonable thing to do anyway.

If we have a production use case, my preference is to plumb a parent Context to the plugins that they can use as a parent to any contexts they create. The simplest way is to have a reference to the Context created by the binary in the framework handler; the plugins can then use it to create their own contexts and to receive the stop signal. I also prefer to completely remove the StopEverything channel and only rely on that Context.
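A sketch of that preference, with a hypothetical handle method (not the real framework interface):

```go
import "context"

// Handle is a hypothetical slice of the framework handle, exposing the
// binary's root context, which is canceled on shutdown.
type Handle interface {
	Context() context.Context
}

// startPluginWork shows a plugin deriving its own context from the root one;
// canceling the root context cancels all derived contexts.
func startPluginWork(h Handle, work func(context.Context)) {
	ctx, cancel := context.WithCancel(h.Context())
	go func() {
		defer cancel()
		work(ctx)
	}()
}
```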
@ahg-g I agree with you that it may not benefit a regular scheduler instance much, esp. given that an HA-enabled scheduler setup would exit the whole process when it loses its lease: kubernetes/cmd/kube-scheduler/app/server.go Line 264 in e2b29cd
(Graceful shutdown of the scheduler may be a production use case, but it's hard to say how useful it is in practice.) So at this moment, consolidating the usage of context and channel should only benefit programmatic users (tests, or vendored applications).
Even for tests, shouldn't they be using separate scheduler instances and avoiding shared state? I am not sure I understand the vendored applications use case, any specific examples?
That definitely should be obeyed. However, even with that, each integration test is using the same apiserver instance, so if the escaping behavior isn't handled well, it may cause unexpected leftover state in the apiserver.
I meant applications consuming the scheduler codebase, or generally custom schedulers using the new scheduler framework.
I think we reached a consensus on using a stop channel and relying on it for cancellation.

/assign
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
What else do we want to do to consider this solved? |
This has been resolved. |
Motivation
As you may know, the scheduler internally has 2 primary cycles - one is the scheduling cycle, the other is the binding cycle. Per the latest scheduler framework, the binding cycle includes plugins like Permit, Prebind, and Postbind. Another fact is that the binding cycle runs as a separate goroutine:
kubernetes/pkg/scheduler/scheduler.go
Line 548 in d5bdb77
This "optimistic" strategy brings us great benefits like scheduling throughout, but on the other hand, can cause some tricky issues. For example: if the scheduler is being shutdown gracefully (via SIGTERM or programmatically), some goroutines may still be running, which may still send binding requests to apiserver. The recent flake #81238 is an example.
Solutions
There are several solutions:

1. In the binding goroutine, upon the completion of each plugin, instead of only checking its return status, also check whether `sched.config.StopEverything` is closed, and return appropriately (see the sketch after this list). The pro is that it keeps the current framework interface as is, and the logic is simple; the con is that the control is rather high-level: if a plugin internally is waiting for some condition (e.g. a PermitPlugin waiting 30 seconds for some case), we can't intervene and hence have to wait for its completion.

2. In the binding goroutine, pass `sched.config.StopEverything` to each plugin so that each plugin can cancel its internal processing and return a reasonable status. The pro is fine-grained control, and the returned status is always consistent. The con is a change to the current framework interface, and we have to add logic in each plugin (and future plugins) to deal with the `stopCh`.

More options are welcome. Feel free to share your thoughts.
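For concreteness, a sketch of option 1 with hypothetical names (`runBindingCycle` and the `[]func() bool` plugin list are illustrative; the real binding cycle is more involved):

```go
// The framework itself checks StopEverything after each plugin, so the
// framework interface stays unchanged but control is coarse-grained.
func runBindingCycle(stopEverything <-chan struct{}, plugins []func() bool) {
	for _, plugin := range plugins {
		ok := plugin()
		select {
		case <-stopEverything:
			return // scheduler is shutting down: abort the rest of the cycle
		default:
		}
		if !ok {
			return // non-success status; normal error handling would go here
		}
	}
}
```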
/sig scheduling