Add support to take nominated pods into account during scheduling to avoid starvation of higher priority pods #55933
Conversation
cc @davidopp
One problem that I found during this work is that scheduler does not handle the following scenario well (even before this PR):
force-pushed from f5ded2a to 5427bf6
plugin/pkg/scheduler/util/utils.go
Outdated
// PodPriorityEnabled indicates whether pod priority feature is enabled.
func PodPriorityEnabled() bool {
	// bsalamat: FIX THIS BEFORE MERGING!!!
	return true || feature.DefaultFeatureGate.Enabled(features.PodPriority)
I have intentionally force-enabled the feature to make sure that our tests pass with PriorityQueue. I will disable it before merging.
force-pushed from fe35dc2 to 645d422
}
potentialNodes := nodesWherePreemptionMightHelp(pod, allNodes, fitError.FailedPredicates)
if len(potentialNodes) == 0 {
	glog.V(3).Infof("Preemption will not help schedule pod %v on any node.", pod.Name)
	return nil, nil, nil
	// In this case, we should clean-up any existing nominated node name of the pod.
Why would you be trying to schedule a pod that has a nominated node name? If it has a nominated node name, isn't it already essentially scheduled and just waiting?
Nominated pods may have to wait minutes before the preempted pods are actually cleaned up and leave the node. If in the meantime other pods terminate, or new nodes are added to the cluster, we would like to schedule the nominated pods in the opened space sooner.
I see. Would it make sense to remove the nominated node name when you move the pod from unschedulable to active, rather than here, since at that point you're trying to schedule it again?
No, we should keep the nominated node annotation. When the scheduler tries to schedule a nominated pod and fails, it shouldn't preempt more pods if the victims preempted by the pod are still in their grace period. The way the scheduler knows that the pod is not eligible for more preemption is by looking at the pod's nominated node annotation. So, we shouldn't remove it at the time of moving pods to the activeQ.
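A minimal sketch of this eligibility rule, with hypothetical types (`podInfo` and `podEligibleToPreemptOthers` are illustrative names, not the real scheduler API): a pod that already has a nominated node must not trigger another round of preemption while a lower-priority victim on that node is still terminating.

```go
package main

// Sketch, under assumed types, of the rule described above: a preemptor with
// a nominated node waits out its victims' grace periods instead of preempting
// more pods.

type podInfo struct {
	Priority      int32
	NominatedNode string
	Terminating   bool
}

func podEligibleToPreemptOthers(preemptor podInfo, podsOnNominatedNode []podInfo) bool {
	if preemptor.NominatedNode == "" {
		return true // never preempted before; eligible
	}
	for _, p := range podsOnNominatedNode {
		if p.Terminating && p.Priority < preemptor.Priority {
			// A victim of the previous preemption is still going away;
			// do not preempt more pods yet.
			return false
		}
	}
	return true
}
```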
}
passes, pErr := nodePassesExtendersForPreemption(pod, node.Name, nodeToPods[node], g.cachedNodeInfoMap, g.extenders)
if passes && pErr == nil {
	return node, nodeToPods[node], err
	nominatedPods := g.getLowerPriorityNominatedPods(pod, node.Name)
	return node, nodeToPods[node], nominatedPods, err
do you have to move nominatedPods to active queue?
It happens automatically. When scheduler removes nominated node names of the pods, the pods receive an update event which moves them to the active queue.
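The implicit requeue described here can be sketched with toy types (these are illustrative, not the real scheduler queue API): clearing the annotation produces an update event, and the update handler moves the pod out of the unschedulable queue.

```go
package main

// Sketch, under assumed types, of how removing a pod's nominated node name
// implicitly moves it to the active queue via the update event handler.

type Pod struct {
	Name        string
	Annotations map[string]string
}

type Queue struct {
	active        []*Pod
	unschedulable map[string]*Pod
}

// Update is what the informer's update handler would call; any update to a
// pod currently in the unschedulable queue moves it back to the active queue.
func (q *Queue) Update(p *Pod) {
	if _, ok := q.unschedulable[p.Name]; ok {
		delete(q.unschedulable, p.Name)
		q.active = append(q.active, p)
	}
}

// clearNominatedNodeName simulates the annotation removal that generates the
// update event.
func clearNominatedNodeName(p *Pod, q *Queue) {
	delete(p.Annotations, "NominatedNodeName")
	q.Update(p)
}
```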
Oh right, forgot about that. Maybe add a comment here?
Done
for predicateKey, predicate := range predicateFuncs {
	// If equivalenceCache is available
	if eCacheAvailable {
		// PredicateWithECache will returns it's cached predicate results
its (I realize this was in the original)
Done
if err != nil {
	return false, []algorithm.PredicateFailureReason{}, err
	podsAdded := false
	// We run predicates twice. Once we run them when pods nominated to run on
I would say "We run the predicates twice in some cases." It will prevent people panicking without reading the rest of the comment and the code first, like me. :)
s/when/with/
nodeInfoToUse := info
if i == 0 {
	podsAdded, metaToUse, nodeInfoToUse = addNominatedPods(util.GetPodPriority(pod), meta, info, queue)
} else if !podsAdded || len(failedPredicates) != 0 {
IIUC this "else if" clause is enforcing the following:
- If there are no nominated pods for this node, then use the feasibility result from the previous iteration of the loop, which tested the pod against the node with no nominated pods (of course)
- If there are nominated pods for this node, and the pod is infeasible with the nominated pods added (which was determined on the previous iteration of the loop), then declare the pod to be infeasible
Is that right?
If so, I am wondering about the second thing.
What is the high-level rule you are trying to enforce? It seems like a reasonable rule might be "The pod is feasible if it will be feasible once all of the victims are gone and the preemptors have started running." But IIUC the rule you are enforcing is "The pod is feasible if it is feasible with all preemptors and victims running simultaneously AND it is feasible in the current state (victims running, preemptors not running yet)." Can you explain the reason behind this rule? This seems to be testing a condition that will never happen (victims and preemptors running simultaneously). I understand you care about the preemptors because of pod affinity but IIUC you are testing against a configuration of the node that will never actually exist. Would testing just the pod affinity predicate help? (Though I'm not sure what configuration you'd use it with.)
> IIUC this "else if" clause is enforcing the following: […] Is that right?
Yes, your understanding is right.
> What is the high-level rule you are trying to enforce? […]
Please note that this function is called from two different places: Schedule and Preempt.
When it is called in Schedule, we want to test whether the pod is schedulable on the node with all the existing pods on the node plus higher (and equal) priority pods nominated to run on the node.
When it is called from Preempt, we should do what you said. We should remove the victims and add the nominated pods. Removal of the victims is done by SelectVictimsOnNode(). It removes victims from meta and NodeInfo before calling this function.
I was asking something a bit different, but it would be good to add a comment with that information anyway.
(The thing I was meaning to ask I've covered in my first new comment in this function.)
Did you add a comment with this (at the top of this function)? If not, please do
> this function is called from two different places: Schedule and Preempt.
> When it is called in Schedule, we want to test whether the pod is schedulable on the node with all the existing pods on the node plus higher (and equal) priority pods nominated to run on the node.
> When it is called from Preempt, we should do what you said. We should remove the victims and add the nominated pods. Removal of the victims is done by SelectVictimsOnNode(). It removes victims from meta and NodeInfo before calling this function.
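The two-pass scheme discussed in this thread can be sketched roughly as follows, with toy types (the real podFitsOnNode works on predicate metadata and NodeInfo, not strings): pass 0 assumes the nominated pods are on the node, pass 1 re-checks without them, and the second pass is skipped when nothing was added.

```go
package main

// Sketch, under assumed types, of the two-pass predicate evaluation: a pod
// fits only if the predicates pass both with and without the higher/equal
// priority nominated pods added to the node's state.

type nodeState struct {
	pods []string // names of pods assumed to be on the node
}

type predicate func(pod string, state nodeState) bool

func podFitsOnNode(pod string, base nodeState, nominated []string, preds []predicate) bool {
	for i := 0; i < 2; i++ {
		state := base
		if i == 0 {
			// First pass: pretend the nominated pods are already running.
			state.pods = append(append([]string{}, base.pods...), nominated...)
		} else if len(nominated) == 0 {
			// Nothing was added in pass 0, so the result would be identical.
			break
		}
		for _, p := range preds {
			if !p(pod, state) {
				return false
			}
		}
	}
	return true
}
```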
Done
Thanks, @davidopp! PTAL.
return false, []algorithm.PredicateFailureReason{}, err
podsAdded := false
// We run predicates twice in some cases. If the node has nominated pods, we
// run them when pods nominated to run on the node added to meta and nodeInfo.
To be a little more precise, I'd write the second sentence "If the node has greater or equal priority nominated pods, we run them when those pods are added to meta and nodeInfo."
Also, add a comment explaining why you do all already running pods, but only greater-or-equal-priority nominated pods. (One would think you might apply the same policy to both -- all of both, or only greater-or-equal-priority or both.)
Done
// update equivalence cache with newly computed fit & reasons
// TODO(resouer) should we do this in another thread? any race?
ecache.UpdateCachedPredicateItem(pod.GetName(), info.Node().GetName(), predicateKey, fit, reasons, equivalenceHash)
// TODO(bsalamat): When one predicate fails and fit is false, why do we continue
I think this is so we can return all of the fit failure reasons, rather than just the first one that fails. (Unless I'm misunderstanding your question.)
You are probably right, but it costs us a lot of cycles. For example, we run something as sophisticated as affinity/anti-affinity for a node that doesn't have enough memory, just to return failure reasons. I think it is fine to return that this node is not feasible because it didn't have enough memory and leave other predicates.
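A toy illustration of the trade-off being discussed: running every predicate collects all failure reasons, while returning on the first failure saves the cost of expensive predicates (such as inter-pod affinity) on nodes that already failed a cheap check. The `earlyExit` flag below is hypothetical, not an actual scheduler option.

```go
package main

// Sketch of the cost trade-off described above: collect all failure reasons
// versus short-circuiting on the first failed predicate.

type reason string

type pred struct {
	name  string
	check func() (bool, reason)
}

func runPredicates(preds []pred, earlyExit bool) (bool, []reason) {
	fit := true
	var reasons []reason
	for _, p := range preds {
		ok, r := p.check()
		if !ok {
			fit = false
			reasons = append(reasons, r)
			if earlyExit {
				break // skip remaining, possibly expensive, predicates
			}
		}
	}
	return fit, reasons
}
```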
}
// Get the pod again; it may have changed/been scheduled already.
getBackoff := initialGetBackoff
for {
	pod, err := factory.client.CoreV1().Pods(podID.Namespace).Get(podID.Name, metav1.GetOptions{})
	if err == nil {
		if len(pod.Spec.NodeName) == 0 {
			podQueue.AddIfNotPresent(pod)
			podQueue.AddUnschedulableIfNotPresent(pod)
Why do you add this to unschedulable instead of active q?
This function is called when scheduler fails to schedule the pod. So, we place it in unschedulable queue, but unschedulable queue has more logic to decide whether it should go there or should go to active queue. It checks pod's status and whether a move request is received to decide between the two queues.
OK, it seems probably safer to put it in the active queue, but if you're confident about this then it's fine with me.
This is fine. If we put pods in the active queue here, then no pod will go to unschedulable queue.
And in fact, putting the pod in active queue would cause a problem, because a very high priority pod which was not schedulable would block the scheduling queue (as it is now a priority queue).
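The routing described in this exchange might be sketched like this (simplified, hypothetical types; the real queue also consults the pod's status): a failed pod goes to the unschedulable queue unless a move request arrived while it was being scheduled, in which case an immediate retry is worthwhile.

```go
package main

// Sketch, under assumed types, of the unschedulable-vs-active routing: a pod
// that just failed scheduling normally waits in the unschedulable queue so a
// high-priority unschedulable pod cannot block the head of the priority queue.

type schedQueue struct {
	active              []string
	unschedulable       []string
	moveRequestReceived bool
}

func (q *schedQueue) AddUnschedulableIfNotPresent(pod string) {
	if q.moveRequestReceived {
		// Cluster state changed since the failed attempt (e.g. a node was
		// added); retry right away instead of parking the pod.
		q.active = append(q.active, pod)
		return
	}
	q.unschedulable = append(q.unschedulable, pod)
}
```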
return fmt.Errorf("pod %v/%v has no annotation", pod.Namespace, pod.Name)
}
if _, exists := podCopy.Annotations[core.NominatedNodeAnnotationKey]; !exists {
	return nil
It seems a little odd to return error if there are no annotations, but return nil if there are annotations but not the nominated node annotation. These seem like equivalent situations from the standpoint of any code that would call this function. I would suggest to pick whichever one makes more sense and do the same for both.
Makes sense. I changed it to return no error in both cases.
if _, exists := podCopy.Annotations[core.NominatedNodeAnnotationKey]; !exists {
	return nil
}
// Note: Deleting the entry from the annotations and passing it Patch() will
This seems weird. Can you ask the API machinery folks about this?
It looks like Patch() only makes the changes present in the patchData. When there is only one annotation (the nominated node name) and we delete it, the annotations become empty and the patchData becomes empty, so the patch doesn't make any changes. I will check with the API folks tomorrow.
I checked with the API folks and my guess was right. When data is not present in patchData, Patch() interprets it as "no change" to the fields.
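To make these semantics concrete, here is a tiny self-contained merge-patch-style apply over maps (illustrative only, not the apiserver's implementation): a key absent from the patch means "no change", while an explicit null deletes the field. This is why a patch that merely omits the annotation does nothing.

```go
package main

import "encoding/json"

// Toy JSON-merge-patch-style apply: absent keys are untouched, null deletes,
// nested maps are merged recursively, everything else is set.
func applyMergePatch(doc, patch map[string]interface{}) map[string]interface{} {
	for k, v := range patch {
		if v == nil {
			delete(doc, k) // null means "remove this field"
			continue
		}
		if pm, ok := v.(map[string]interface{}); ok {
			if dm, ok := doc[k].(map[string]interface{}); ok {
				doc[k] = applyMergePatch(dm, pm)
				continue
			}
		}
		doc[k] = v // present and non-null means "set this field"
	}
	return doc
}

func patchAnnotations(docJSON, patchJSON []byte) map[string]interface{} {
	var doc, patch map[string]interface{}
	json.Unmarshal(docJSON, &doc)
	json.Unmarshal(patchJSON, &patch)
	return applyMergePatch(doc, patch)
}
```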
// It returns the node where preemption happened, a list of preempted pods, and error if any.
Preempt(*v1.Pod, NodeLister, error) (selectedNode *v1.Node, preemptedPods []*v1.Pod, err error)
// It returns the node where preemption happened, a list of preempted pods, a
// list of pods whose nominated node name should be removed, and error if any.
Add to this comment (or somewhere else if it makes more sense) why you remove nominated node name from the lower-priority pods.
I added a comment to Preempt() in generic_scheduler.go where it collects and returns pods for removal of their nomination.
plugin/pkg/scheduler/scheduler.go
Outdated
sched.config.Recorder.Eventf(victim, v1.EventTypeNormal, "Preempted", "by %v/%v on node %v", preemptor.Namespace, preemptor.Name, nodeName)
	}
}
for _, p := range nominatedPodsToClear {
How can you end up with nominated pods to clear, if node == nil?
This can happen when a pod with nominated node name is eligible for preemption again, but preemption logic does not find any node for it. In that case Preempt() function of generic_scheduler.go returns the pod itself for removal of the annotation.
Please add a comment.
(BTW by "eligible for preemption" I assume you mean "eligible to preempt")
Done.
Yes, I mean "eligible to preempt".
force-pushed from bbe44ba to cccb8fe
I think I addressed all the comments except the eCache. I am working on that one.
if !entry.TryWait(backoff.MaxDuration()) {
	glog.Warningf("Request for pod %v already in flight, abandoning", podID)
	return
	if !util.PodPriorityEnabled() {
Done
// run them when pods nominated to run on the node added to meta and nodeInfo.
// If all predicates succeed in this pass, we run them again when these
// nominated pods are not added. This second pass is necessary because some
// predicates such as inter-pod affinity may not pass without the nominated pods.
Done
}
if !fit {
	// eCache is available and valid, and predicates result is unfit, record the fail reasons
	failedPredicates = append(failedPredicates, reasons...)
This is not addressed yet. I'm working on it...
// pods that are already tried and are determined to be unschedulable. The latter
// is called unschedulableQ.
// FIFO is here for flag-gating purposes and allows us to use the traditional
// scheduling queue when Pod Priority flag is false.
Done
force-pushed from cccb8fe to 6b44dab, from 693b562 to 66dfb09, from 66dfb09 to 7787346, from f9871e9 to a462fd3, from a462fd3 to 2f5ccdd, and from 2f5ccdd to 1c6b4dd
/kind feature
Just a few small remaining comments.
}
passes, pErr := nodePassesExtendersForPreemption(pod, node.Name, nodeToPods[node], g.cachedNodeInfoMap, g.extenders)
if passes && pErr == nil {
	return node, nodeToPods[node], err
	// Lower priority pods nominated to run on this node, may no longer fit on
I think it should be lower or equal priority pods? (In the comment and the implementation.)
It should be lower priority pods only. Equal priority pods are scheduled in the order received; a pod cannot preempt another pod with equal priority. Similarly, it cannot take the space emptied for another equal-priority nominated pod.
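The strictly-lower-priority filter could look like this sketch (hypothetical types; not the actual getLowerPriorityNominatedPods signature): only nominated pods whose priority is strictly below the preemptor's are returned for clearing.

```go
package main

// Sketch, under assumed types, of the filter described above: equal-priority
// nominated pods keep their nomination, because equal priorities can neither
// preempt nor displace one another.

type nominatedPod struct {
	Name     string
	Priority int32
}

func getLowerPriorityNominatedPods(preemptorPriority int32, nominated []nominatedPod) []nominatedPod {
	var lower []nominatedPod
	for _, p := range nominated {
		if p.Priority < preemptorPriority { // strictly lower; equal priority is kept
			lower = append(lower, p)
		}
	}
	return lower
}
```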
force-pushed from 1c6b4dd to eda3df8
Thanks, @davidopp! Addressed comments. PTAL.
/lgtm
BTW did you remember to remove the thing you had put in to enable the feature by default so it would run enabled for the tests? (You had a TODO to remove it before submitting.)
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: bsalamat, davidopp. Associated issue: 54501.
/retest
@davidopp Yes, I have already removed it. Thanks a lot for reviewing the PRs!
Automatic merge from submit-queue (batch tested with PRs 54316, 53400, 55933, 55786, 55794). If you want to cherry-pick this change to another branch, please follow the instructions here.
What this PR does / why we need it:
When a pod preempts lower priority pods, the preemptor gets a "nominated node name" annotation. We call such a pod a nominated pod. This PR adds the logic to take such nominated pods into account when scheduling other pods on the same node that the nominated pod is expected to run. This is needed to avoid starvation of preemptor pods. Otherwise, lower priority pods may fill up the space freed after preemption before the preemptor gets a chance to get scheduled.
Which issue(s) this PR fixes (optional, in "fixes #<issue number>" format, will close the issue(s) when PR gets merged): Fixes #54501
Special notes for your reviewer: This PR is built on top of #55109 and includes all the changes there as well.
Release note:
/sig scheduling
ref/ #47604