kube-scheduler crashes and restarts with panics in DefaultPreemption plugin #101548
Comments
@yuanchen8911: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig scheduling
/cc @Huang-Wei @ahg-g
/assign
@yuanchen8911 Are you running the vanilla scheduler, or a customized one that leverages the utilities in default_preemption.go?
@Huang-Wei It's a custom scheduler with some internal out-of-tree plugins (just like …). Which utility functions are you referring to? Thanks.
Are those custom plugins Filter plugins? Do they maintain state? If yes, one hypothesis is that those custom filters filter out the node in the Filter phase but not when executed in the preemption phase, which would result in a candidate node with no victims.
Yes, it includes filter plugins and uses cycle state.
You need to make sure that those filter plugins play nicely when executed again in the preemption phase in the same cycle, i.e., produce the same result.
@ahg-g Thanks. What do you mean by performing filtering in the preemption phase? Would you mind elaborating a little?
What additional logic is needed to handle the preemption case?
In the preemption phase we run the filters again to check whether removing lower-priority pods makes the node schedulable: kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go, line 543 in d9839a3.
We later add them back one by one to reduce the set of victim pods to the absolute minimum: kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go, line 558 in d9839a3.
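For illustration, here is a heavily simplified sketch of that two-pass victim selection (remove all lower-priority pods, verify the preemptor fits, then add pods back one by one and keep only those whose re-addition breaks the fit). The Pod type, fits helper, and selectVictims function below are stand-ins invented for this sketch, not the actual default_preemption.go code:

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified stand-in for the scheduler's pod representation.
type Pod struct {
	Name     string
	Priority int
}

// fits stands in for "run all Filter plugins against the node's simulated
// state"; in this toy model a node fits at most two pods.
func fits(simulated []Pod) bool { return len(simulated) <= 2 }

// selectVictims mimics the two passes described above:
//  1. remove all lower-priority pods and check that the preemptor now fits;
//  2. add them back one by one (highest priority first) and keep as victims
//     only the pods whose re-addition breaks the fit.
func selectVictims(preemptor Pod, nodePods []Pod) ([]Pod, bool) {
	var removable, remaining []Pod
	for _, p := range nodePods {
		if p.Priority < preemptor.Priority {
			removable = append(removable, p)
		} else {
			remaining = append(remaining, p)
		}
	}
	simulated := append(append([]Pod{}, remaining...), preemptor)
	if !fits(simulated) {
		return nil, false // not a candidate node even with all victims removed
	}
	// Add back one by one, highest priority first.
	sort.Slice(removable, func(i, j int) bool { return removable[i].Priority > removable[j].Priority })
	var victims []Pod
	for _, p := range removable {
		trial := append(append([]Pod{}, simulated...), p)
		if fits(trial) {
			simulated = trial // the pod can stay; it is not a victim
		} else {
			victims = append(victims, p) // must be evicted for the preemptor to fit
		}
	}
	return victims, true
}

func main() {
	node := []Pod{{"a", 1}, {"b", 2}, {"c", 10}}
	victims, ok := selectVictims(Pod{"preemptor", 5}, node)
	fmt.Println(ok, victims)
}
```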
My hypothesis is that somehow your filter returns success when adding all the pods back, and so you end up with zero victims on a candidate node. This shouldn't happen, because if the pod fits the node without removing any victims, then we shouldn't be running preemption in the first place. So the theory is that the custom filter returns false when run in the filter phase, but returns true when executed in the preemption phase, perhaps because it makes some assumptions about CycleState that make it behave this way (e.g., that the filter will be executed once per node in a scheduling cycle).
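To make that hypothesis concrete, here is a toy sketch of a filter that wrongly assumes it runs at most once per node per cycle. CycleState and badFilter are invented stand-ins for this sketch, not the framework's real types:

```go
package main

import "fmt"

// A toy CycleState: a map shared across plugin calls within one scheduling cycle.
type CycleState map[string]bool

// badFilter flips a "seen" flag in the cycle state on its first call for a
// node. The second invocation for the same node (which happens during the
// preemption dry run) takes the stale-state branch and passes, even though
// the node is actually infeasible.
func badFilter(state CycleState, node string) bool {
	if state["seen/"+node] {
		return true // wrong: stale state short-circuits the real check
	}
	state["seen/"+node] = true
	return false // the real check: the node is infeasible
}

func main() {
	state := CycleState{}
	fmt.Println("Filter phase:    ", badFilter(state, "node-1")) // false
	fmt.Println("Preemption phase:", badFilter(state, "node-1")) // true -> zero-victim candidate
}
```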
Disabled the filter plugin, but still seeing the same issue.
Are you able to change the scheduler code and run the test again? I can send a patch tomorrow to add some debugging messages to help us root-cause the issue.
Yes, thanks a lot! Really appreciate it!
@ahg-g You are right. We still use a deprecated Predicate extender, which causes the problem. After disabling it, it's working fine. It used to work with 1.18, though. We are retiring it. Thank you so much!!!
If the test cases can help debug the scheduler extender webhook (Predicate) too, I'd like to try it. Thanks again!
How can we improve the preemption logic so that bugs in custom Filters or Predicates don't crash the scheduler?
For starters, we should check the list length; the scheduler should not panic in any case. Other than that, I think we need to clearly document what the preemption logic does (that it executes the PreFilter extension points and the Filter plugins multiple times) and the expectations for filter plugins (that they may get executed more than once in the same cycle).
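As a rough illustration of the kind of length/nil checks being suggested here (the types and function names below are simplified stand-ins, not the scheduler's actual node-picking code):

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-in for the framework's victim bookkeeping.
type Victims struct{ Pods []string }

// earliestStartTime is a toy stand-in for util.GetEarliestPodStartTime; per
// the discussion above, the real helper can yield nil when the victim list
// is empty, and the caller must guard against that instead of dereferencing
// the result blindly.
func earliestStartTime(v *Victims) *time.Time {
	if v == nil || len(v.Pods) == 0 {
		return nil
	}
	t := time.Now()
	return &t
}

// pickNode sketches the defensive pattern: validate the candidate list and
// the looked-up victims before using them, and return an error instead of
// panicking.
func pickNode(nodesToVictims map[string]*Victims, candidates []string) (string, error) {
	if len(candidates) == 0 {
		return "", fmt.Errorf("no candidate nodes to pick from")
	}
	best := candidates[0]
	if earliestStartTime(nodesToVictims[best]) == nil {
		return "", fmt.Errorf("candidate %q has no victims; refusing to pick it", best)
	}
	return best, nil
}

func main() {
	_, err := pickNode(map[string]*Victims{"node-1": {}}, []string{"node-1"})
	fmt.Println(err) // an error instead of a nil-pointer panic
}
```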
Is it a predicate extender or a preemption extender? If it's a predicate extender, I don't quite think we've found the root cause, b/c predicate extenders are not invoked during preemption (#86942 (comment)). It makes more sense if it's a preemption extender, i.e., if the preemption extender mutates the … (kubernetes/pkg/scheduler/framework/plugins/defaultpreemption/default_preemption.go, line 157 in c6e6507).
That's always true :) |
Filed a PR to mitigate it: #101560
It's a predicate extender. I'll debug it more.
Added some debug info after …
To remove the panic, a simple fix is to check if …
Alternatively, a change to …
Here is the log with debug info.
This looks more promising. Back to digging into the root cause, it seems …
Yes, …
The following code in …
Also, if …
It can prevent … How about adding the following check to CallExtender?
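As a rough sketch of the kind of guard being discussed here (skip extenders that don't support preemption, and treat a returned candidate with no victims as an error), with the types and names below being simplified stand-ins rather than the scheduler's real extender API:

```go
package main

import "fmt"

// Simplified stand-ins; the real types live in the scheduler framework and
// extender packages.
type Victims struct{ Pods []string }

type Extender interface {
	SupportsPreemption() bool
	ProcessPreemption(map[string]*Victims) (map[string]*Victims, error)
}

// callExtenders sketches the proposed guard: extenders that don't participate
// in preemption are skipped, and a result containing a node with no victims
// is treated as fatal rather than being carried forward.
func callExtenders(extenders []Extender, victimsMap map[string]*Victims) (map[string]*Victims, error) {
	for _, e := range extenders {
		if !e.SupportsPreemption() {
			continue // predicate-only extenders are not consulted here
		}
		out, err := e.ProcessPreemption(victimsMap)
		if err != nil {
			return nil, err
		}
		for node, v := range out {
			if v == nil || len(v.Pods) == 0 {
				return nil, fmt.Errorf("extender returned candidate node %q with no victims", node)
			}
		}
		victimsMap = out
	}
	return victimsMap, nil
}

func main() {
	m, err := callExtenders(nil, map[string]*Victims{"node-1": {Pods: []string{"p"}}})
	fmt.Println(m["node-1"].Pods, err)
}
```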
Updated the PR based on the findings and discussion: #101560
Yes, we can use a flag to mark whether none of the extenders support preemption; if so, simply return the candidates.
Technically this can exist due to a faulty extender implementation. If we really want to guard against it, instead of logging the error and continuing, I'm more inclined to return the error immediately, as this is a fatal error: the victimsMap cannot be used either by a later extender or for further preemptor nominating.
@Huang-Wei Is my understanding correct? In …
The only case where victimsMap can be empty is when the candidates are empty, so either return doesn't quite matter, right?
You are right. |
As long as one extender returns an invalid victimsMap (with empty pods), … What about …?
I think so. Because the result it returned is faulty, we don't want to continue based on it.
The same. We don't want to continue the scheduling cycle based on a faulty result.
What happened:
Kubernetes 1.19 (confirmed with 1.19.7 and 1.19.10)
kube-scheduler crashes and restarts with panics in default_preemption.go. The problem code is line 389: when victims is nil or victims.Pods is empty, the error happens. If we skip GetPodPriority when it's nil or empty, the error is gone.

452 latestStartTime := util.GetEarliestPodStartTime(nodesToVictims[minNodes2[0]])

The problem is that nodesToVictims[minNodes2[0]] sometimes does not exist and returns nil. Simply skipping it won't solve the problem; the scheduler will reach either line 456 or line 341.

341 klog.Errorf("None candidate can be picked from %v.", candidates)
What you expected to happen:
The scheduler works without failures.
How to reproduce it (as minimally and precisely as possible):
When there are preemptions in a cluster.
Anything else we need to know?:
Environment:
- Kubernetes version (kubectl version): v1.19.10 or v1.19.7
- OS (cat /etc/os-release): CentOS 7
- Kernel (uname -a): Linux 5.4.77-7.el7pie #1 SMP Sat Nov 21 01:16:27 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux