
Enhance cooperation with scheduler binding cycle #81433

Closed
Huang-Wei opened this issue Aug 14, 2019 · 28 comments
Assignees
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments

@Huang-Wei
Member

Motivation

As you may know, the scheduler internally has two primary cycles: the scheduling cycle and the binding cycle. Per the latest scheduler framework, the binding cycle includes plugins such as Permit, PreBind, and PostBind. Another important fact is that the binding cycle runs in a separate goroutine.
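Roughly, it looks like this (a simplified sketch with approximate names, not the exact code):

func (sched *Scheduler) scheduleOne() {
	pod := sched.NextPod()

	// Scheduling cycle: runs synchronously and picks a node.
	suggestedHost, err := sched.schedule(pod)
	if err != nil {
		return
	}

	// Optimistically assume the pod is placed so the next pod can be scheduled.
	sched.assume(pod, suggestedHost)

	// Binding cycle: runs asynchronously; Permit, PreBind, Bind and PostBind
	// plugins execute here, possibly after the scheduler was asked to stop.
	go func() {
		sched.bind(pod, suggestedHost)
	}()
}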

This "optimistic" strategy brings us great benefits like scheduling throughout, but on the other hand, can cause some tricky issues. For example: if the scheduler is being shutdown gracefully (via SIGTERM or programmatically), some goroutines may still be running, which may still send binding requests to apiserver. The recent flake #81238 is an example.

Solutions

There are several solutions:

  1. In the binding goroutine, upon the completion of each plugin, in addition to checking its return status, also check whether sched.config.StopEverything is closed, and return appropriately (a sketch follows after this list).

    The pro is that it keeps the current framework interface as is, and the logic is simple. The con is that the control is fairly coarse-grained: if a plugin is internally waiting for some condition (e.g. a Permit plugin that waits 30 seconds in some case), we cannot intervene and have to wait for it to complete.

  2. In the binding goroutine, pass in sched.config.StopEverything to each plugin so that each plugin can cancel its internal processing and return a reasonable status.

    The pro is fine-grained control, and the returned status is always consistent. The con is that it changes the current framework interface, and we have to add logic in each plugin (and future plugins) to deal with the stopCh.

  3. More options are welcome. Feel free to share your thoughts.
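To make option 1 concrete, here is a rough sketch of the binding goroutine checking the stop channel after each extension point (the helper names are hypothetical, not the actual framework code):

go func() {
	// Hypothetical helpers standing in for the Permit/PreBind/Bind/PostBind steps.
	runPermitPlugins(pod)
	select {
	case <-sched.config.StopEverything:
		return // the scheduler is shutting down; skip the remaining steps
	default:
	}

	runPreBindPlugins(pod)
	select {
	case <-sched.config.StopEverything:
		return
	default:
	}

	bind(pod) // the actual request to the apiserver
	runPostBindPlugins(pod)
}()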

/sig scheduling

@Huang-Wei Huang-Wei added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 14, 2019
@k8s-ci-robot k8s-ci-robot added the sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. label Aug 14, 2019

@tedyu
Contributor

tedyu commented Aug 15, 2019

One option is to pass sched.config.StopEverything into pluginContext.
This way, the plugins would have access to this channel.
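A minimal sketch of what that could look like (field and method names here are hypothetical, not the current API):

// Hypothetical: carry the stop channel in the plugin context so plugins can
// select on it while they wait for something.
type PluginContext struct {
	// ... existing per-cycle data ...
	stopCh <-chan struct{} // closed when the scheduler is shutting down
}

func (pc *PluginContext) StopCh() <-chan struct{} { return pc.stopCh }

// Inside a plugin that waits for an external condition:
func waitForCondition(pc *PluginContext, conditionMet <-chan struct{}) bool {
	select {
	case <-conditionMet:
		return true
	case <-pc.StopCh():
		return false // scheduler is stopping; give up and return a non-success status
	}
}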

@Huang-Wei
Member Author

@tedyu Yes, that could be a variant of option 2. It depends on whether we think it's generic enough to be included in every plugin, or only the plugins in the "binding cycle".

@tedyu
Contributor

tedyu commented Aug 15, 2019

It seems StopEverything is generic for all plugins.

@draveness
Contributor

What do you think of making pluginContext conform to the context.Context interface and using it to synchronize the stop signal?
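For concreteness, a hypothetical sketch of that idea (none of these types or methods reflect the current code):

// Hypothetical: let the plugin context satisfy context.Context so plugins can
// pass it to API calls and select on Done() directly.
type pluginContext struct {
	parent context.Context
	// ... existing per-cycle data ...
}

func (pc *pluginContext) Deadline() (time.Time, bool)       { return pc.parent.Deadline() }
func (pc *pluginContext) Done() <-chan struct{}             { return pc.parent.Done() }
func (pc *pluginContext) Err() error                        { return pc.parent.Err() }
func (pc *pluginContext) Value(key interface{}) interface{} { return pc.parent.Value(key) }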

@Huang-Wei
Member Author

@draveness Incorporating context.Context is possible. But the current codebase just uses a channel (stopCh) from the start of cmd/scheduler, so some refactoring needs to be done if it's appropriate.

@draveness
Contributor

@draveness Incorporating context.Context is possible. But the current codebase just uses a channel (stopCh) from the start of cmd/scheduler, so some refactoring needs to be done if it's appropriate.

Just took a look at the related code and found out there is a TODO "once Run() accepts a context, it should be used here":

// Prepare a reusable runCommand function.
run := func(ctx context.Context) {
	sched.Run()
	<-ctx.Done()
}

ctx, cancel := context.WithCancel(context.TODO()) // TODO once Run() accepts a context, it should be used here
defer cancel()

go func() {
	select {
	case <-stopCh:
		cancel()
	case <-ctx.Done():
	}
}()

I'd like to put out a PR to do the refactoring with context.Context so we can see whether it is appropriate. What do you think?

@Huang-Wei
Member Author

Please hold the PR. Let's have a thorough discussion first.

@hex108
Contributor

hex108 commented Aug 15, 2019

Thanks for it!

I think we need to reach a consensus on which level of the problem we are trying to solve:

  • If the scheduler receives a shutdown message and plugins need to exit immediately so that no goroutine escapes, then we need to pass a channel (or something similar) to plugins; whenever a plugin needs to wait for something, it also waits on that channel at the same time.

  • If the scheduler receives a shutdown message, plugins should not cause any more side effects on outside objects (e.g. sending write/delete/... requests to the apiserver). Then we need to pass a channel or something similar to plugins, and plugins need to check the channel at every step where they might cause side effects. For this case the code might be ugly, but I think this issue also wants to solve this problem: stopping plugins from causing more side effects.

    To some extent, the flake in TestPreemptWithPermitPlugin (#81238) is of this kind; however it only affects the scheduler's internal state, so it is only a problem for test cases.

  • If the scheduler receives a shutdown message, the scheduler could wait for the current plugin to finish and then exit. Option 1 covers this case.

@alculquicondor
Member

I think the cleanest solution is to wait for the current schedule cycle to finish. That is, wait for scheduleOne and its side effects to finish for the current pod.

I observed that sched.Run doesn't block. Does anyone know the motivation for this? This means that nothing is tracking scheduleOne to finish running. And then scheduleOne spawns the goroutine that @Huang-Wei mentioned, which is also not tracked. How do you guys feel about adding a wait group for all these routines? Then we can wait for this wait group when we trigger a programmatic stop.
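Something along these lines, as an illustrative sketch only (not actual scheduler code):

// Illustrative: track every binding goroutine in a WaitGroup so a
// programmatic stop can wait for in-flight work to drain.
var bindingWG sync.WaitGroup

// In scheduleOne, when spawning the binding cycle:
bindingWG.Add(1)
go func() {
	defer bindingWG.Done()
	// Permit / PreBind / Bind / PostBind ...
}()

// In the shutdown path:
close(sched.config.StopEverything)
bindingWG.Wait() // block until all in-flight binding cycles have finished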

Side question: where is stopCh aka sched.config.StopEverything being closed? I couldn't find any code location (at least in the binary code, I didn't check the tests).

@Huang-Wei
Member Author

I think the cleanest solution is to wait for the current schedule cycle to finish. That is, wait for scheduleOne and its side effects to finish for the current pod.

Actually we don't "wait" here. Once scheduleOne() finishes the "scheduling cycle", it spawns a goroutine to perform the "binding cycle" in the background and continues to the next cycle. In other words, it's possible that more than one goroutine is in flight when the scheduler is being shut down.

How do you guys feel about adding a wait group for all these routines?

I'm not sure I like it. The current scheduler adopts an "optimistic" strategy (assume()) to make sure the binding cycle doesn't block the main scheduling cycle, which IMO is a good practice.

Side question: where is stopCh aka sched.config.StopEverything being closed? I couldn't find any code location (at least in the binary code, I didn't check the tests).

It's spread across our codebase:

debugger.ListenForSignal(c.StopEverything)

go func() {
	<-c.StopEverything
	c.podQueue.Close()
}()

func cleanupTest(t *testing.T, context *testContext) {
	// Kill the scheduler.
	close(context.stopCh)

@Huang-Wei
Member Author

I observed that sched.Run doesn't block. Does anyone know the motivation for this?

@alculquicondor I'm not aware of any background on this. I think it's a legacy issue of mixing stopCh and context.Context there; we can use this chance to consolidate them, for example:

func Run(cc schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}, registryOptions ...Option) error {
	...
	// Prepare a reusable runCommand function.
	run := func(ctx context.Context) {
		sched.Run(ctx)
	}
	...
}

func (sched *Scheduler) Run(ctx context.Context) {
	if !sched.config.WaitForCacheSync() {
		return
	}

	wait.UntilWithContext(ctx, sched.scheduleOne, 0)
}

func (sched *Scheduler) scheduleOne(ctx context.Context) {
	...
}

@Huang-Wei
Member Author

Huang-Wei commented Aug 15, 2019

@hex108 I'm leaning toward giving the control to each plugin - which technically means passing a context.Context to each plugin, and each plugin decides how to deal with it: either check at the end of its own cycle, or return immediately upon receiving ctx.Done(), or just ignore the event, or handle complex internal cases by leveraging cancel() propagation.
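For example, a Permit-style plugin that waits for an external condition could handle cancellation roughly like this (illustrative only; the exact plugin signature may differ):

// Illustrative Permit-style plugin that honors cancellation while waiting.
func (p *examplePermitPlugin) Permit(ctx context.Context, pod *v1.Pod, nodeName string) *framework.Status {
	select {
	case <-p.conditionMet:
		return framework.NewStatus(framework.Success, "")
	case <-ctx.Done():
		// The scheduler is shutting down; cancel internal work and report non-success.
		return framework.NewStatus(framework.Error, "scheduler is shutting down")
	}
}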

@liu-cong
Contributor

Reading through the conversations, it seems that there are 2 options at a high level:

  1. Pass the stop signal (be it a context or a channel) to each plugin (via PluginContext or some other form). Each plugin decides how to handle it.
  2. The framework itself handles the stop signal and plugins don't.
    2a. Upon completion of each plugin, check the signal and return early. Some other plugins may still be running.
    2b. Once it receives a stop signal, wait for all plugins to finish and then gracefully exit.

I am also leaning towards Option 1. It basically moves the responsibility to each plugin, but I think this is probably the only way to get it right.
Option 2a can end up in an undesired state (some plugins have made changes while others haven't finished). Option 2b seems a bit dangerous: what if a plugin takes a very long time to finish? I guess we'd like to stop quickly once the stop signal is fired.

@alculquicondor
Member

If we leave the decision to the plugins, then I still think waiting for the whole "binding" routine to finish is a good idea.

ListenForSignal doesn't close the channel. I don't think we are ever closing it, except for tests. However, I assume that is what we are concerned about in the case of the flake.

My recommendation is that we not only close the channel, but also wait for everything to finish processing. Closing the channel starts "stopping" the routines, but they are not guaranteed to finish.

I'm not sure I like it. The current scheduler adopts an "optimistic" strategy (assume()) to make sure the binding cycle doesn't block the main scheduling cycle, which IMO is a good practice.

I'm not saying we block on it in scheduleOne, only once we have closed StopEverything.

@Huang-Wei
Member Author

If we leave the decision to the plugins, then I still think waiting for the whole "binding" routine to finish is a good idea.

The decision is still owned by plugins. If they want to finish the whole "binding" routine, they can return "success" as the status; otherwise cancel internally and return "non-success".

ListenForSignal doesn't close the channel.

Yup, that's a debugger case. The main scheduler process isn't associated with signal handling yet.

@hex108
Contributor

hex108 commented Aug 16, 2019

@hex108 I'm leaning toward giving the control to each plugin - which technically means passing a context.Context to each plugin, and each plugin decides how to deal with it: either check at the end of its own cycle, or return immediately upon receiving ctx.Done(), or just ignore the event, or handle complex internal cases by leveraging cancel() propagation.

@Huang-Wei +1, it makes sense.

@alculquicondor
Member

Agreed that leaving the control to the plugin is a good idea. They might consider that any operation they have initiated needs to be completed.

But again, that implies that we should have a wait group for all the routines already spawned, so that we can reliably eliminate the flakes.

@ahg-g
Member

ahg-g commented Aug 19, 2019

If we leave the decision to the plugins, then I still think waiting for the whole "binding" routine to finish is a good idea.

The decision is still owned by plugins. If they want to finish the whole "binding" routine, they can return "success" as the status; otherwise cancel internally and return "non-success".

ListenForSignal doesn't close the channel.

Yup, that's a debugger case. The main scheduler process isn't associated with signal handling yet.

Does that mean StopEverything is currently only closed in the tests?

@Huang-Wei
Member Author

Does that mean StopEverything is currently only closed in the tests?

Yes. Additionally, we could explicitly stop it when the master loses its lease (if that doesn't happen today).

@ahg-g
Member

ahg-g commented Aug 19, 2019

I don't think #81238 should be used as motivation for what this issue proposes to solve. Each integration test creates a new scheduler object, and the problem behind that flake was sharing plugin objects between tests (hence effectively between scheduler instances), which shouldn't happen anyway, either in tests or in practice.

So, if the only use case we have for plumbing the stop channel to the plugins is tests, then I don't think we should do anything, because that use case can be solved by making sure tests don't share state, which is a reasonable thing to do anyway.

If we have a production use case, my preference is to plumb a parent Context to the plugins that they can use as a parent for any contexts they create. The simplest way is to keep a reference to the Context created by the binary in the framework handle; the plugins can then use it to create their own contexts and to receive the stop signal.

I also prefer to completely remove the StopEverything channel and only rely on that Context.
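A rough sketch of that shape (the handle method name here is hypothetical):

// Hypothetical: the framework handle exposes the binary's root context so
// plugins can derive their own contexts from it.
type Handle interface {
	// ... existing methods ...
	ParentContext() context.Context
}

// Inside a plugin:
func doAsyncWork(h Handle) {
	// Work started here is cancelled when the scheduler's root context is cancelled.
	ctx, cancel := context.WithTimeout(h.ParentContext(), 10*time.Second)
	defer cancel()
	_ = ctx
}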

@Huang-Wei
Member Author

Huang-Wei commented Aug 20, 2019

@ahg-g I agree with you that it may not benefit a regular scheduler instance much, esp. given that an HA-enabled scheduler setup would exit the whole process when it loses the lease:

klog.Fatalf("leaderelection lost")

(Graceful shutdown of the scheduler may be a production use case, but it's hard to say how useful it is in practice.)

So at this moment, consolidating the usage of context and channel should only benefit programmatic users (tests, or vendored applications).

@ahg-g
Member

ahg-g commented Aug 20, 2019

Even for tests, shouldn't they be using separate scheduler instances and avoid shared state?

I am not sure I understand the vendored applications use case, any specific examples?

@Huang-Wei
Member Author

Even for tests, shouldn't they be using separate scheduler instances and avoid shared state?

That should definitely be obeyed. However, even with that, each integration test uses the same apiserver instance, so if the escaping behavior isn't handled well, it may cause unexpected leftovers in the apiserver.

I am not sure I understand the vendored applications use case, any specific examples?

I meant applications consuming the scheduler codebase, or generally custom schedulers built on the new scheduler framework.

@draveness
Contributor

I think we reached a consensus on dropping the stop channel and relying on context.Context instead. I sent a PR to implement the change: #82072

/assign

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 26, 2019
@alculquicondor
Member

What else do we want to do to consider this solved?

@Huang-Wei
Member Author

This has been resolved.
