
Enhance cooperation with scheduler binding cycle #81433

Closed
Huang-Wei opened this issue Aug 14, 2019 · 28 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@Huang-Wei
Member

Motivation

As you may know, the scheduler internally has two primary cycles: the scheduling cycle and the binding cycle. Per the latest scheduler framework, the binding cycle includes plugins such as Permit, PreBind, and PostBind. Another important fact is that the binding cycle runs in a separate goroutine.
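Roughly, it looks like this (a simplified sketch with approximate names, not the exact code):

func (sched *Scheduler) scheduleOne() {
	pod := sched.NextPod()

	// Scheduling cycle: runs synchronously and picks a node.
	suggestedHost, err := sched.schedule(pod)
	if err != nil {
		return
	}

	// Optimistically assume the pod is placed so the next pod can be scheduled.
	sched.assume(pod, suggestedHost)

	// Binding cycle: runs asynchronously; Permit, PreBind, Bind and PostBind
	// plugins execute here, possibly after the scheduler was asked to stop.
	go func() {
		sched.bind(pod, suggestedHost)
	}()
}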

This "optimistic" strategy brings us great benefits like scheduling throughout, but on the other hand, can cause some tricky issues. For example: if the scheduler is being shutdown gracefully (via SIGTERM or programmatically), some goroutines may still be running, which may still send binding requests to apiserver. The recent flake #81238 is an example.

Solutions

There are several solutions:

  1. In the binding goroutine, upon the completion of each plugin, in addition to checking its return status, also check whether sched.config.StopEverything is closed, and return appropriately (a sketch follows after this list).

    The pro is that it keeps the current framework interface as is, and the logic is simple. The con is that the control is fairly coarse-grained: if a plugin is internally waiting for some condition (e.g. a Permit plugin that waits 30 seconds in some case), we cannot intervene and have to wait for it to complete.

  2. In the binding goroutine, pass in sched.config.StopEverything to each plugin so that each plugin can cancel its internal processing and return a reasonable status.

    The pro is fine-grained control, and the returned status is always consistent. The con is that it changes the current framework interface, and we have to add logic in each plugin (and future plugins) to deal with the stopCh.

  3. More options are welcome. Feel free to share your thoughts.
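To make option 1 concrete, here is a rough sketch of the binding goroutine checking the stop channel after each extension point (the helper names are hypothetical, not the actual framework code):

go func() {
	// Hypothetical helpers standing in for the Permit/PreBind/Bind/PostBind steps.
	runPermitPlugins(pod)
	select {
	case <-sched.config.StopEverything:
		return // the scheduler is shutting down; skip the remaining steps
	default:
	}

	runPreBindPlugins(pod)
	select {
	case <-sched.config.StopEverything:
		return
	default:
	}

	bind(pod) // the actual request to the apiserver
	runPostBindPlugins(pod)
}()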

/sig scheduling

@Huang-Wei Huang-Wei added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 14, 2019
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Aug 14, 2019

@tedyu
Contributor

tedyu commented Aug 15, 2019

One option is to pass sched.config.StopEverything into pluginContext.
This way, the plugins would have access to this channel.
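A minimal sketch of what that could look like (field and method names here are hypothetical, not the current API):

// Hypothetical: carry the stop channel in the plugin context so plugins can
// select on it while they wait for something.
type PluginContext struct {
	// ... existing per-cycle data ...
	stopCh <-chan struct{} // closed when the scheduler is shutting down
}

func (pc *PluginContext) StopCh() <-chan struct{} { return pc.stopCh }

// Inside a plugin that waits for an external condition:
func waitForCondition(pc *PluginContext, conditionMet <-chan struct{}) bool {
	select {
	case <-conditionMet:
		return true
	case <-pc.StopCh():
		return false // scheduler is stopping; give up and return a non-success status
	}
}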

@Huang-Wei
Member Author

@tedyu Yes, that could be a variant of option 2. It depends on whether we think it's generic enough to be included in every plugin, or only the plugins in the "binding cycle".

@tedyu
Contributor

tedyu commented Aug 15, 2019

It seems StopEverything is generic for all plugins.

@draveness
Contributor

What do you think of making pluginContext conform to the context.Context interface and using it to synchronize the stop signal?
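For concreteness, a hypothetical sketch of that idea (none of these types or methods reflect the current code):

// Hypothetical: let the plugin context satisfy context.Context so plugins can
// pass it to API calls and select on Done() directly.
type pluginContext struct {
	parent context.Context
	// ... existing per-cycle data ...
}

func (pc *pluginContext) Deadline() (time.Time, bool)       { return pc.parent.Deadline() }
func (pc *pluginContext) Done() <-chan struct{}             { return pc.parent.Done() }
func (pc *pluginContext) Err() error                        { return pc.parent.Err() }
func (pc *pluginContext) Value(key interface{}) interface{} { return pc.parent.Value(key) }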

@Huang-Wei
Member Author

@draveness Incorporating context.Context is possible. But the current codebase just uses a channel (stopCh) from the start of cmd/scheduler, so some refactoring needs to be done if it's appropriate.

@draveness
Contributor

@draveness Incorporating context.Context is possible. But the current codebase just uses a channel (stopCh) from the start of cmd/scheduler, so some refactoring needs to be done if it's appropriate.

Just took a look at the related code and found out there is a TODO "once Run() accepts a context, it should be used here":

// Prepare a reusable runCommand function.
run := func(ctx context.Context) {
	sched.Run()
	<-ctx.Done()
}

ctx, cancel := context.WithCancel(context.TODO()) // TODO once Run() accepts a context, it should be used here
defer cancel()

go func() {
	select {
	case <-stopCh:
		cancel()
	case <-ctx.Done():
	}
}()

I'd like to put out a PR to do the refactoring with context.Context so we can see whether it is appropriate. What do you think?

@Huang-Wei
Member Author

Please hold the PR. Let's have a thorough discussion first.

@hex108
Contributor

hex108 commented Aug 15, 2019

Thanks for it!

I think we need to reach a consensus on which level of the problem we are trying to solve:

  • If the scheduler receives a shutdown message and plugins need to exit immediately so that no goroutine escapes, then we need to pass a channel (or something similar) to plugins; whenever a plugin needs to wait for something, it also waits on that channel at the same time.

  • If the scheduler receives a shutdown message, plugins should not cause any more side effects on outside objects (e.g. sending write/delete/... requests to the apiserver). Then we need to pass a channel or something similar to plugins, and plugins need to check the channel at every step where they might cause side effects. For this case the code might be ugly, but I think this issue also wants to solve this problem: stopping plugins from causing more side effects.

    To some extent, the flake in TestPreemptWithPermitPlugin (#81238) is of this kind; however it only affects the scheduler's internal state, so it is only a problem for test cases.

  • If the scheduler receives a shutdown message, the scheduler could wait for the current plugin to finish and then exit. Option 1 covers this case.

@alculquicondor
Member

I think the cleanest solution is to wait for the current schedule cycle to finish. That is, wait for scheduleOne and its side effects to finish for the current pod.

I observed that sched.Run doesn't block. Does anyone know the motivation for this? This means that nothing is tracking scheduleOne to finish running. And then scheduleOne spawns the goroutine that @Huang-Wei mentioned, which is also not tracked. How do you guys feel about adding a wait group for all these routines? Then we can wait for this wait group when we trigger a programmatic stop.
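Something along these lines, as an illustrative sketch only (not actual scheduler code):

// Illustrative: track every binding goroutine in a WaitGroup so a
// programmatic stop can wait for in-flight work to drain.
var bindingWG sync.WaitGroup

// In scheduleOne, when spawning the binding cycle:
bindingWG.Add(1)
go func() {
	defer bindingWG.Done()
	// Permit / PreBind / Bind / PostBind ...
}()

// In the shutdown path:
close(sched.config.StopEverything)
bindingWG.Wait() // block until all in-flight binding cycles have finished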

Side question: where is stopCh aka sched.config.StopEverything being closed? I couldn't find any code location (at least in the binary code, I didn't check the tests).

@Huang-Wei
Member Author

I think the cleanest solution is to wait for the current schedule cycle to finish. That is, wait for scheduleOne and its side effects to finish for the current pod.

Actually we don't "wait" here. Once scheduleOne() finishes the "scheduling cycle", it spawns a goroutine to perform the "binding cycle" in the background and continues to the next cycle. In other words, it's possible that more than one goroutine is in flight when the scheduler is being shut down.

How do you guys feel about adding a wait group for all these routines?

I'm not sure I like it. The current scheduler adopts an "optimistic" strategy (assume()) to make sure the binding cycle doesn't block the main scheduling cycle, which IMO is a good practice.

Side question: where is stopCh aka sched.config.StopEverything being closed? I couldn't find any code location (at least in the binary code, I didn't check the tests).

It's spread across our codebase:

debugger.ListenForSignal(c.StopEverything)

go func() {
	<-c.StopEverything
	c.podQueue.Close()
}()

func cleanupTest(t *testing.T, context *testContext) {
	// Kill the scheduler.
	close(context.stopCh)

@Huang-Wei
Member Author

I observed that sched.Run doesn't block. Does anyone know the motivation for this?

@alculquicondor I'm not aware of any background on this. I think it's a legacy issue of mixing stopCh and context.Context there; we can use this chance to consolidate them, for example:

func Run(cc schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}, registryOptions ...Option) error {
	...
	// Prepare a reusable runCommand function.
	run := func(ctx context.Context) {
		sched.Run(ctx)
	}
	...
}

func (sched *Scheduler) Run(ctx context.Context) {
	if !sched.config.WaitForCacheSync() {
		return
	}

	wait.UntilWithContext(ctx, sched.scheduleOne, 0)
}

func (sched *Scheduler) scheduleOne(ctx context.Context) {
	...
}

@Huang-Wei
Member Author

Huang-Wei commented Aug 15, 2019

@hex108 I'm leaning toward giving the control to each plugin - which technically means passing a context.Context to each plugin, and each plugin decides how to deal with it: either check at the end of its own cycle, or return immediately upon receiving ctx.Done(), or just ignore the event, or handle complex internal cases by leveraging cancel() propagation.
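For example, a Permit-style plugin that waits for an external condition could handle cancellation roughly like this (illustrative only; the exact plugin signature may differ):

// Illustrative Permit-style plugin that honors cancellation while waiting.
func (p *examplePermitPlugin) Permit(ctx context.Context, pod *v1.Pod, nodeName string) *framework.Status {
	select {
	case <-p.conditionMet:
		return framework.NewStatus(framework.Success, "")
	case <-ctx.Done():
		// The scheduler is shutting down; cancel internal work and report non-success.
		return framework.NewStatus(framework.Error, "scheduler is shutting down")
	}
}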

@liu-cong
Contributor

Reading through the conversations, it seems that there are 2 options at a high level:

  1. Pass the stop signal (be it a context or a channel) to each plugin (via PluginContext or some other form). Each plugin decides how to handle it.
  2. The framework itself handles the stop signal and plugins don't.
    2a. Upon completion of each plugin, check the signal and return early. Some other plugins may still be running.
    2b. Once it receives a stop signal, wait for all plugins to finish and then gracefully exit.

I am also leaning towards Option 1. It basically moves the responsibility to each plugin, but I think this is probably the only way to get it right.
Option 2a can end up in an undesired state (some plugins have made changes while others haven't finished). Option 2b seems a bit dangerous: what if a plugin takes a very long time to finish? I guess we'd like to stop quickly once the stop signal is fired.

@alculquicondor
Member

If we leave the decision to the plugins, then I still think waiting for the whole "binding" routine to finish is a good idea.

ListenForSignal doesn't close the channel. I don't think we are ever closing it, except for tests. However, I assume that is what we are concerned about in the case of the flake.

My recommendation is that we not only close the channel, but also wait for everything to finish processing. Closing the channel starts "stopping" the routines, but they are not guaranteed to finish.

I'm not sure I like it. The current scheduler adopts an "optimistic" strategy (assume()) to make sure the binding cycle doesn't block the main scheduling cycle, which IMO is a good practice.

I'm not saying we block on it in scheduleOne, only once we have closed StopEverything.

@Huang-Wei
Member Author

If we leave the decision to the plugins, then I still think waiting for the whole "binding" routine to finish is a good idea.

The decision is still owned by plugins. If they want to finish the whole "binding" routine, they can return "success" as the status; otherwise cancel internally and return "non-success".

ListenForSignal doesn't close the channel.

Yup, that's a debugger case. The main scheduler process isn't associated with signal handling yet.

@hex108
Contributor

hex108 commented Aug 16, 2019

@hex108 I'm leaning toward giving the control to each plugin - which technically means passing a context.Context to each plugin, and each plugin decides how to deal with it: either check at the end of its own cycle, or return immediately upon receiving ctx.Done(), or just ignore the event, or handle complex internal cases by leveraging cancel() propagation.

@Huang-Wei +1, it makes sense.

@alculquicondor
Member

Agreed that leaving the control to the plugin is a good idea. They might consider that any operation they have initiated needs to be completed.

But again, that implies that we should have a wait group for all the routines already spawned, so that we can reliably eliminate the flakes.

@ahg-g
Member

ahg-g commented Aug 19, 2019

If we leave the decision to the plugins, then I still think waiting for the whole "binding" routine to finish is a good idea.

The decision is still owned by plugins. If they want to finish the whole "binding" routine, they can return "success" as the status; otherwise cancel internally and return "non-success".

ListenForSignal doesn't close the channel.

Yup, that's a debugger case. The main scheduler process isn't associated with signal handling yet.

Does that mean StopEverything is currently only closed in the tests?

@Huang-Wei
Member Author

Does that mean StopEverything is currently only closed in the tests?

Yes. Additionally, we could explicitly stop it when the master loses its lease (if that doesn't happen today).

@ahg-g
Member

ahg-g commented Aug 19, 2019

I don't think #81238 should be used as motivation for what this issue proposes to solve. Each integration test creates a new scheduler object, and the problem behind that flake was sharing plugin objects between tests (hence effectively between scheduler instances), which shouldn't happen anyway, either in tests or in practice.

So, if the only use case we have for plumbing the stop channel to the plugins is tests, then I don't think we should do anything, because that use case can be solved by making sure tests don't share state, which is a reasonable thing to do anyway.

If we have a production use case, my preference is to plumb a parent Context to the plugins that they can use as a parent for any contexts they create. The simplest way is to keep a reference to the Context created by the binary in the framework handle; the plugins can then use it to create their own contexts and to receive the stop signal.

I also prefer to completely remove the StopEverything channel and only rely on that Context.
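A rough sketch of that shape (the handle method name here is hypothetical):

// Hypothetical: the framework handle exposes the binary's root context so
// plugins can derive their own contexts from it.
type Handle interface {
	// ... existing methods ...
	ParentContext() context.Context
}

// Inside a plugin:
func doAsyncWork(h Handle) {
	// Work started here is cancelled when the scheduler's root context is cancelled.
	ctx, cancel := context.WithTimeout(h.ParentContext(), 10*time.Second)
	defer cancel()
	_ = ctx
}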

@Huang-Wei
Member Author

Huang-Wei commented Aug 20, 2019

@ahg-g I agree with you that it may not benefit a regular scheduler instance much, esp. given that an HA-enabled scheduler setup would exit the whole process when it loses the lease:

klog.Fatalf("leaderelection lost")

(Graceful shutdown of the scheduler may be a production use case, but it's hard to say how useful it is in practice.)

So at this moment, consolidating the usage of context and channel should only benefit programmatic users (tests, or vendored applications).

@ahg-g
Member

ahg-g commented Aug 20, 2019

Even for tests, shouldn't they be using separate scheduler instances and avoid shared state?

I am not sure I understand the vendored applications use case, any specific examples?

@Huang-Wei
Member Author

Even for tests, shouldn't they be using separate scheduler instances and avoid shared state?

That should definitely be obeyed. However, even with that, each integration test uses the same apiserver instance, so if the escaping behavior isn't handled well, it may cause unexpected leftovers in the apiserver.

I am not sure I understand the vendored applications use case, any specific examples?

I meant applications consuming the scheduler codebase, or generally custom schedulers built on the new scheduler framework.

@draveness
Contributor

I think we reached a consensus on dropping the stop channel and relying on context.Context instead. I sent a PR to implement the change: #82072

/assign

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 26, 2019
@alculquicondor
Member

What else do we want to do to consider this solved?

@Huang-Wei
Member Author

This has been resolved.
