-
Notifications
You must be signed in to change notification settings - Fork 39k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prevent scheduler crashing in default preemption plugin #101560
Conversation
@yuanchen8911: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cc @Huang-Wei @ahg-g |
/sig scheduling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned in #101548 (comment), I'd like to understand the root cause, and then come up with all defensive checking at once in this PR.
var ( | ||
victims *extenderv1.Victims | ||
found bool | ||
latestStartTime *metav1.Time | ||
) | ||
if victims, found = nodesToVictims[minNodes2[0]]; found { | ||
latestStartTime = util.GetEarliestPodStartTime(victims) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is neater:
var latestStartTime *metav1.Time
if victims, found := nodesToVictims[minNodes2[0]]; found {
latestStartTime = util.GetEarliestPodStartTime(victims)
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed it.
/retest |
Note that there are test cases that return candidates with empty victims. It makes the simple change
|
/retest |
@Huang-Wei updated the description and PR. PTAL! |
var victims extenderv1.Victims | ||
// do not create victims without pods | ||
if len(pods) != 0 { | ||
victims = extenderv1.Victims{ | ||
Pods: pods, | ||
NumPDBViolations: int64(numPDBViolations), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should treat len(pods) == 0
as an internal error, right? Why not just do if status.IsSuccess() && len(pods) != 0
in L338?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's my original version but the unit test failed as there are test cases that return empty victims with fit= true. Please see my comment. #101560 (comment) Shall I update the unit test file instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, please update those tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should status.IsSuccess() && len(pods) == 0
be considered an error or just ignored?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's craft it as an internal error.
@@ -404,8 +408,14 @@ func CallExtenders(extenders []framework.Extender, pod *v1.Pod, nodeLister frame | |||
|
|||
var newCandidates []Candidate | |||
for nodeName := range victimsMap { | |||
// check if victims.Pods is empty | |||
victims := victimsMap[nodeName] | |||
if len(victims.Pods) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to run the check around L391: if an extender returns illegal result and extender.IsIgnorable() == false
, return it as an error.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to run the check around L391: if an extender returns illegal result and
extender.IsIgnorable() == false
, return it as an error.
Check the validness of nodeNameToVictims
returned by PrecessPreemption
before L399?
for nodeName := range nodeNameToVictims {
// check if victims.Pods is empty
victims := victimsMap[nodeName]
if len(victims.Pods) == 0 {
delete(victims, nodeName)
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
before L399, like this:
for nodeName, victims := range nodeNameToVictims {
if victims != nil && len(victims.Pods) != 0 {
continue
}
if extender.IsIgnorable() {
delete(nodeNameToVictims, nodeName)
// klog....
continue
} else {
return nil, framework.AsStatus(/* build an internal err */)
}
}
Updated the PR as suggested, please review it. @Huang-Wei Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits. LGTM otherwise.
return | ||
} | ||
if status.IsSuccess() && len(pods) == 0 { | ||
status = framework.NewStatus(framework.Unschedulable, fmt.Sprintf("invalid victims on node %q returned by selectVictimsOnNode", nodeInfoCopy.Node().Name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
status = framework.NewStatus(framework.Unschedulable, fmt.Sprintf("invalid victims on node %q returned by selectVictimsOnNode", nodeInfoCopy.Node().Name)) | |
status = framework.AsStatus(fmt.Errorf("invalid victims on node %q returned by selectVictimsOnNode", nodeInfoCopy.Node().Name)) |
@@ -391,6 +394,18 @@ func CallExtenders(extenders []framework.Extender, pod *v1.Pod, nodeLister frame | |||
} | |||
return nil, framework.AsStatus(err) | |||
} | |||
// Check if the returned victims are valid. | |||
for nodeName, victims := range nodeNameToVictims { | |||
if victims != nil && len(victims.Pods) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfer if victims == nil || len(victims.Pods) == 0
so that we can catch nil victims.
if s == nil || len(s.Name()) == 0 { | ||
return | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove the first sub-test "No node needs preemption".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bump.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted the test case.
@ahg-g could you review it when you get a chance? Thanks. |
84086b9
to
9c22e00
Compare
return | ||
} | ||
if status.IsSuccess() && len(pods) == 0 { | ||
status = framework.AsStatus(fmt.Errorf("invalid victims on node %q returned by selectVictimsOnNode", nodeInfoCopy.Node().Name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the error msg be more explicit: should not happen: the pod fits with no victim pods? Same below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the error status (set in L358) indicate "it should not happen"?
/retest |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: Huang-Wei, yuanchen8911 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
if victims == nil || len(victims.Pods) == 0 { | ||
if extender.IsIgnorable() { | ||
delete(nodeNameToVictims, nodeName) | ||
klog.Warningf("Ignoring victims without pods on node %q returned by extender", nodeName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warnings aren't recommended anymore, either error or info. In this case info probably fits better.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
warnings aren't recommended anymore, either error or info. In this case info probably fits better.
Changed it to InfoS
.
return | ||
} | ||
if status.IsSuccess() && len(pods) == 0 { | ||
status = framework.AsStatus(fmt.Errorf("invalid victims on node %q returned by selectVictimsOnNode", nodeInfoCopy.Node().Name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, be more explicit, like "expected at least one victim on node $q"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, be more explicit, like "expected at least one victim on node $q"
Updated the error msg. as suggested.
/release-note-none |
/lgtm |
What type of PR is this?
/kind bug
What this PR does / why we need it:
FindCandidates
orCallExtenders
indefaultpreemption
plugin can return victim without pods (e.g., because of issues in Filter plugins or extenders). As a result, the scheduler will panic and crash when accessing an non-existing pod. This PR adds additional checks to prevent it from crashing the scheduler.Which issue(s) this PR fixes:
Fixes #101548
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: