
Add support to take nominated pods into account during scheduling to avoid starvation of higher priority pods #55933

Merged
3 commits merged into kubernetes:master on Nov 21, 2017

Conversation

bsalamat
Member

What this PR does / why we need it:
When a pod preempts lower priority pods, the preemptor gets a "nominated node name" annotation; we call such a pod a nominated pod. This PR adds the logic to take such nominated pods into account when scheduling other pods on the same node that the nominated pod is expected to run on. This is needed to avoid starvation of preemptor pods: otherwise, lower priority pods could fill up the space freed by the preemption before the preemptor gets a chance to be scheduled.
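
As a rough, self-contained illustration of the idea (toy types and names, not the scheduler's actual data structures): before checking whether a pod fits on a node, pods nominated to that node with equal or higher priority are added to the node's assumed state, so the space reserved for them is not given away.

```go
// Simplified, hypothetical sketch of the idea (not the actual scheduler code).
package main

import "fmt"

type Pod struct {
	Name          string
	Priority      int
	NominatedNode string // set by the preemption logic
	MilliCPU      int64
}

type NodeInfo struct {
	Name           string
	AllocatableCPU int64
	RequestedCPU   int64
}

// fits is a stand-in for the real predicate set: here just a CPU check.
func fits(p *Pod, n *NodeInfo) bool {
	return n.RequestedCPU+p.MilliCPU <= n.AllocatableCPU
}

// podFitsConsideringNominated copies the node state, adds every nominated pod
// with priority >= the incoming pod's priority, and runs the predicate on the
// augmented state.
func podFitsConsideringNominated(p *Pod, n *NodeInfo, nominated []*Pod) bool {
	augmented := *n // shallow copy is enough for this toy NodeInfo
	for _, np := range nominated {
		if np.NominatedNode == n.Name && np.Priority >= p.Priority {
			augmented.RequestedCPU += np.MilliCPU
		}
	}
	return fits(p, &augmented)
}

func main() {
	node := &NodeInfo{Name: "node-1", AllocatableCPU: 1000, RequestedCPU: 400}
	preemptor := &Pod{Name: "high-prio", Priority: 100, NominatedNode: "node-1", MilliCPU: 500}
	lowPrio := &Pod{Name: "low-prio", Priority: 1, MilliCPU: 500}

	// Without the nominated pod, low-prio would fit and could steal the freed space.
	fmt.Println("ignoring nominated pods:   ", fits(lowPrio, node)) // true
	// With the nominated preemptor accounted for, low-prio no longer fits.
	fmt.Println("considering nominated pods:", podFitsConsideringNominated(lowPrio, node, []*Pod{preemptor})) // false
}
```

In the sketch, a low priority pod that would otherwise fit is rejected because the preemptor's nominated request is counted against the node.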

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #54501

Special notes for your reviewer: This PR is built on top of #55109 and includes all the changes there as well.

Release note:

Add support to take nominated pods into account during scheduling to avoid starvation of higher priority pods.

/sig scheduling
ref/ #47604

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Nov 17, 2017
@bsalamat
Member Author

cc/ @davidopp

@bsalamat
Member Author

One problem I found during this work is that the scheduler does not handle the following scenario well (even before this PR):

  1. Scheduler picks a pod to schedule.
  2. The pod is updated by another component or a user while the scheduler is trying to schedule it.
  3. When the scheduler is done, it tries to update the now-stale copy of the pod.

@bsalamat bsalamat force-pushed the starvation3 branch 2 times, most recently from f5ded2a to 5427bf6, November 17, 2017 18:17
// PodPriorityEnabled indicates whether pod priority feature is enabled.
func PodPriorityEnabled() bool {
// bsalamat: FIX THIS BEFORE MERGING!!!
return true || feature.DefaultFeatureGate.Enabled(features.PodPriority)
Member Author

I have intentionally force-enabled the feature to make sure that our tests pass with PriorityQueue. I will disable it before merging.

@bsalamat bsalamat force-pushed the starvation3 branch 2 times, most recently from fe35dc2 to 645d422, November 18, 2017 01:42
}
potentialNodes := nodesWherePreemptionMightHelp(pod, allNodes, fitError.FailedPredicates)
if len(potentialNodes) == 0 {
glog.V(3).Infof("Preemption will not help schedule pod %v on any node.", pod.Name)
return nil, nil, nil
// In this case, we should clean-up any existing nominated node name of the pod.
Member

Why would you be trying to schedule a pod that has a nominated node name? If it has a nominated node name, isn't it already essentially scheduled and just waiting?

Member Author

Nominated pods may have to wait minutes before the preempted pods are actually cleaned up and leave the node. If in the meantime other pods terminate, or new nodes are added to the cluster, we would like to schedule the nominated pods into that newly freed space sooner.

Member

I see. Would it make sense to remove the nominated node name when you move the pod from unschedulable to active, rather than here, since at that point you're trying to schedule it again?

Member Author

No, we should keep the nominated node annotation. When the scheduler tries to schedule a nominated pod and fails, it shouldn't preempt more pods if the victims it already preempted are still in their graceful termination period. The way the scheduler knows that the pod is not eligible for more preemption is by looking at the pod's nominated node annotation, so we shouldn't remove it at the time of moving pods to the activeQ.
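
For illustration, here is a hypothetical sketch of the check described above; the names and types are made up, it is not the PR's actual code:

```go
// Hypothetical sketch: a pod that already nominated a node is not eligible to
// preempt again while its victims on that node are still in their grace period.
package main

import "fmt"

type podInfo struct {
	Name          string
	Priority      int
	NominatedNode string
	Terminating   bool
}

func eligibleToPreemptAgain(p podInfo, podsOnNode map[string][]podInfo) bool {
	if p.NominatedNode == "" {
		return true // never preempted anything, free to try
	}
	for _, other := range podsOnNode[p.NominatedNode] {
		if other.Terminating && other.Priority < p.Priority {
			return false // a victim is still in its grace period; wait for it
		}
	}
	return true
}

func main() {
	pods := map[string][]podInfo{
		"node-1": {{Name: "victim", Priority: 0, Terminating: true}},
	}
	preemptor := podInfo{Name: "preemptor", Priority: 100, NominatedNode: "node-1"}
	fmt.Println(eligibleToPreemptAgain(preemptor, pods)) // false: victim still terminating
}
```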

}
passes, pErr := nodePassesExtendersForPreemption(pod, node.Name, nodeToPods[node], g.cachedNodeInfoMap, g.extenders)
if passes && pErr == nil {
return node, nodeToPods[node], err
nominatedPods := g.getLowerPriorityNominatedPods(pod, node.Name)
return node, nodeToPods[node], nominatedPods, err
Member

do you have to move nominatedPods to active queue?

Member Author

It happens automatically. When the scheduler removes a pod's nominated node name, the pod receives an update event, which moves it to the active queue.

Member

Oh right, forgot about that. Maybe add a comment here?

Member Author

Done

for predicateKey, predicate := range predicateFuncs {
// If equivalenceCache is available
if eCacheAvailable {
// PredicateWithECache will returns it's cached predicate results
Member

its (I realize this was in the original)

Member Author

Done

if err != nil {
return false, []algorithm.PredicateFailureReason{}, err
podsAdded := false
// We run predicates twice. Once we run them when pods nominated to run on
Member

I would say "We run the predicates twice in some cases." It will prevent people panicking without reading the rest of the comment and the code first, like me. :)

s/when/with/

nodeInfoToUse := info
if i == 0 {
podsAdded, metaToUse, nodeInfoToUse = addNominatedPods(util.GetPodPriority(pod), meta, info, queue)
} else if !podsAdded || len(failedPredicates) != 0 {
Member
@davidopp davidopp Nov 18, 2017

IIUC this "else if" clause is enforcing the following:

  • If there are no nominated pods for this node, then use the feasibility result from the previous iteration of the loop, which tested the pod against the node with no nominated pods (of course)
  • If there are nominated pods for this node, and the pod is infeasible with the nominated pods added (which was determined on the previous iteration of the loop), then declare the pod to be infeasible

Is that right?

If so, I am wondering about the second thing.

What is the high-level rule you are trying to enforce? It seems like a reasonable rule might be "The pod is feasible if it will be feasible once all of the victims are gone and the preemptors have started running." But IIUC the rule you are enforcing is "The pod is feasible if it is feasible with all preemptors and victims running simultaneously AND it is feasible in the current state (victims running, preemptors not running yet)." Can you explain the reason behind this rule? This seems to be testing a condition that will never happen (victims and preemptors running simultaneously). I understand you care about the preemptors because of pod affinity but IIUC you are testing against a configuration of the node that will never actually exist. Would testing just the pod affinity predicate help? (Though I'm not sure what configuration you'd use it with.)

Member Author

> IIUC this "else if" clause is enforcing the following:
> If there are no nominated pods for this node, then use the feasibility result from the previous iteration of the loop, which tested the pod against the node with no nominated pods (of course)
> If there are nominated pods for this node, and the pod is infeasible with the nominated pods added (which was determined on the previous iteration of the loop), then declare the pod to be infeasible
> Is that right?

Yes, your understanding is right.

> What is the high-level rule you are trying to enforce? It seems like a reasonable rule might be "The pod is feasible if it will be feasible once all of the victims are gone and the preemptors have started running." But IIUC the rule you are enforcing is "The pod is feasible if it is feasible with all preemptors and victims running simultaneously AND it is feasible in the current state (victims running, preemptors not running yet)." Can you explain the reason behind this rule? This seems to be testing a condition that will never happen (victims and preemptors running simultaneously). I understand you care about the preemptors because of pod affinity but IIUC you are testing against a configuration of the node that will never actually exist. Would testing just the pod affinity predicate help? (Though I'm not sure what configuration you'd use it with.)

Please note that this function is called from two different places: Schedule and Preempt.
When it is called in Schedule, we want to test whether the pod is schedulable on the node with all the existing pods on the node plus higher (and equal) priority pods nominated to run on the node.
When it is called from Preempt, we should do what you said. We should remove the victims and add the nominated pods. Removal of the victims is done by SelectVictimsOnNode(). It removes victims from meta and NodeInfo before calling this function.
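
To make the two-pass logic concrete, here is a minimal, self-contained sketch with toy types; fitsWithPods below stands in for the whole predicate set, and the example is illustrative rather than the scheduler's actual implementation:

```go
// Minimal sketch of the two-pass idea described above, using toy types.
package main

import "fmt"

type pod struct {
	name     string
	priority int
}

// fitsWithPods is a stand-in predicate: the incoming pod fits if the node holds
// fewer than two pods in the assumed state. Real predicates check resources,
// affinity, taints, etc.
func fitsWithPods(p pod, assumed []pod) bool {
	return len(assumed) < 2
}

// podFits runs the predicates up to twice:
//   - pass 0: with same-or-higher-priority nominated pods added to the node
//   - pass 1: without them (only if pass 0 passed and pods were actually added)
// The pod is feasible only if every executed pass succeeds.
func podFits(p pod, running, nominated []pod) bool {
	podsAdded := false
	for i := 0; i < 2; i++ {
		assumed := append([]pod{}, running...)
		if i == 0 {
			for _, np := range nominated {
				if np.priority >= p.priority {
					assumed = append(assumed, np)
					podsAdded = true
				}
			}
		} else if !podsAdded {
			// Nothing was added in pass 0, so a second pass would repeat the same check.
			break
		}
		if !fitsWithPods(p, assumed) {
			return false
		}
	}
	return true
}

func main() {
	running := []pod{{name: "existing", priority: 0}}
	nominated := []pod{{name: "preemptor", priority: 100}}
	fmt.Println(podFits(pod{name: "low", priority: 1}, running, nominated))    // false: node is "full" once the nominated pod is counted
	fmt.Println(podFits(pod{name: "high", priority: 200}, running, nominated)) // true: the nominated pod has lower priority, so it is not added
}
```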

Member

I was asking something a bit different, but it would be good to add a comment with that information anyway.

(The thing I was meaning to ask I've covered in my first new comment in this function.)

Member

Did you add a comment with this (at the top of this function)? If not, please do

> this function is called from two different places: Schedule and Preempt.
> When it is called in Schedule, we want to test whether the pod is schedulable on the node with all the existing pods on the node plus higher (and equal) priority pods nominated to run on the node.
> When it is called from Preempt, we should do what you said. We should remove the victims and add the nominated pods. Removal of the victims is done by SelectVictimsOnNode(). It removes victims from meta and NodeInfo before calling this function.

Member Author

Done

@bsalamat
Member Author

Thanks, @davidopp! PTAL.


return false, []algorithm.PredicateFailureReason{}, err
podsAdded := false
// We run predicates twice in some cases. If the node has nominated pods, we
// run them when pods nominated to run on the node added to meta and nodeInfo.
Member

To be a little more precise, I'd write the second sentence "If the node has greater or equal priority nominated pods, we run them when those pods are added to meta and nodeInfo."

Also, add a comment explaining why you do all already running pods, but only greater-or-equal-priority nominated pods. (One would think you might apply the same policy to both -- all of both, or only greater-or-equal-priority or both.)

Member Author

Done


// update equivalence cache with newly computed fit & reasons
// TODO(resouer) should we do this in another thread? any race?
ecache.UpdateCachedPredicateItem(pod.GetName(), info.Node().GetName(), predicateKey, fit, reasons, equivalenceHash)
// TODO(bsalamat): When one predicate fails and fit is false, why do we continue
Member

I think this is so we can return all of the fit failure reasons, rather than just the first one that fails. (Unless I'm misunderstanding your question.)

Member Author
@bsalamat bsalamat Nov 19, 2017

You are probably right, but it costs us a lot of cycles. For example, we run something as sophisticated as affinity/anti-affinity for a node that doesn't have enough memory, just to return more failure reasons. I think it would be fine to report that the node is not feasible because it doesn't have enough memory and skip the remaining predicates.
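
A tiny hypothetical sketch of the tradeoff under discussion: collecting every failure reason requires running all predicates, while breaking out after the first failure saves cycles but reports only one reason.

```go
package main

import "fmt"

// runPredicates is a hypothetical predicate loop illustrating the tradeoff
// discussed above: earlyExit=true saves work but reports only the first
// failure reason instead of all of them.
func runPredicates(preds []func() (bool, string), earlyExit bool) (bool, []string) {
	fit := true
	var reasons []string
	for _, p := range preds {
		if ok, reason := p(); !ok {
			fit = false
			reasons = append(reasons, reason)
			if earlyExit {
				break // skip the remaining (possibly expensive) predicates
			}
		}
	}
	return fit, reasons
}

func main() {
	preds := []func() (bool, string){
		func() (bool, string) { return false, "insufficient memory" },
		func() (bool, string) { return false, "pod affinity not satisfied" }, // expensive check
	}
	fmt.Println(runPredicates(preds, false)) // false [insufficient memory pod affinity not satisfied]
	fmt.Println(runPredicates(preds, true))  // false [insufficient memory]
}
```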

}
// Get the pod again; it may have changed/been scheduled already.
getBackoff := initialGetBackoff
for {
pod, err := factory.client.CoreV1().Pods(podID.Namespace).Get(podID.Name, metav1.GetOptions{})
if err == nil {
if len(pod.Spec.NodeName) == 0 {
podQueue.AddIfNotPresent(pod)
podQueue.AddUnschedulableIfNotPresent(pod)
Member

Why do you add this to unschedulable instead of active q?

Member Author

This function is called when the scheduler fails to schedule the pod, so we place it in the unschedulable queue. The unschedulable queue has additional logic, though, to decide whether the pod should really stay there or go to the active queue: it checks the pod's status and whether a move request has been received to decide between the two queues.
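
A hypothetical sketch of that decision (illustrative types, not the actual PriorityQueue code): a pod that failed scheduling is parked in the unschedulable queue unless a relevant cluster change has been observed in the meantime.

```go
// Hypothetical sketch of "unschedulable unless something has changed".
package main

import "fmt"

type queuedPod struct {
	Name                string
	MarkedUnschedulable bool // e.g. condition PodScheduled=False, reason=Unschedulable
}

type schedulingQueue struct {
	active              []queuedPod
	unschedulable       []queuedPod
	receivedMoveRequest bool // set when a cluster event may have made pods schedulable
}

func (q *schedulingQueue) addUnschedulableIfNotPresent(p queuedPod) {
	if q.receivedMoveRequest || !p.MarkedUnschedulable {
		// The cluster changed since the failed attempt (or the pod is not
		// actually marked unschedulable), so retry right away.
		q.active = append(q.active, p)
		return
	}
	q.unschedulable = append(q.unschedulable, p)
}

func main() {
	q := &schedulingQueue{}
	q.addUnschedulableIfNotPresent(queuedPod{Name: "a", MarkedUnschedulable: true})
	q.receivedMoveRequest = true
	q.addUnschedulableIfNotPresent(queuedPod{Name: "b", MarkedUnschedulable: true})
	fmt.Println(len(q.active), len(q.unschedulable)) // 1 1
}
```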

Member

OK, it seems probably safer to put it in the active queue, but if you're confident about this then it's fine with me.

Member Author

This is fine. If we put pods in the active queue here, then no pod would ever go to the unschedulable queue.

Member Author

And in fact, putting the pod in the active queue would cause a problem: a very high priority pod that is unschedulable would stay at the head of the scheduling queue (which is now a priority queue) and block the scheduling of other pods.

return fmt.Errorf("pod %v/%v has no annotation", pod.Namespace, pod.Name)
}
if _, exists := podCopy.Annotations[core.NominatedNodeAnnotationKey]; !exists {
return nil
Member

It seems a little odd to return error if there are no annotations, but return nil if there are annotations but not the nominated node annotation. These seem like equivalent situations from the standpoint of any code that would call this function. I would suggest to pick whichever one makes more sense and do the same for both.

Member Author

Makes sense. I changed it to return no error in both cases.

if _, exists := podCopy.Annotations[core.NominatedNodeAnnotationKey]; !exists {
return nil
}
// Note: Deleting the entry from the annotations and passing it Patch() will
Member

This seems weird. Can you ask the API machinery folks about this?

Member Author
@bsalamat bsalamat Nov 19, 2017

It looks like Patch() only makes the changes that are present in the patchData. When there is only one annotation (the nominated node name) and we delete it, the annotations map becomes empty and so does patchData, so the patch doesn't change anything. I will check with the API folks tomorrow.

Member Author

I checked with the API folks and my guess was right. When data is not present in patchData, Patch() interprets it as "no change" to the fields.
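
For reference, this is consistent with JSON merge patch semantics (RFC 7386): keys omitted from the patch are left unchanged, while keys explicitly set to null are deleted. A minimal sketch of the difference; the annotation key used below is a placeholder, not the scheduler's real key, and no client call is shown:

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// Patch that omits the annotation entirely: the server treats the missing
	// key as "no change", which is why deleting the key from a copied map and
	// diffing produced an empty patch that did nothing.
	noChange := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]interface{}{},
		},
	}

	// Patch that explicitly sets the annotation to null: per JSON merge patch
	// semantics this removes the key. "example.com/nominated-node-name" is a
	// placeholder for the real annotation key.
	deleteKey := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": map[string]interface{}{
				"example.com/nominated-node-name": nil,
			},
		},
	}

	a, _ := json.Marshal(noChange)
	b, _ := json.Marshal(deleteKey)
	fmt.Println(string(a)) // {"metadata":{"annotations":{}}}
	fmt.Println(string(b)) // {"metadata":{"annotations":{"example.com/nominated-node-name":null}}}
}
```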

// It returns the node where preemption happened, a list of preempted pods, and error if any.
Preempt(*v1.Pod, NodeLister, error) (selectedNode *v1.Node, preemptedPods []*v1.Pod, err error)
// It returns the node where preemption happened, a list of preempted pods, a
// list of pods whose nominated node name should be removed, and error if any.
Member
@davidopp davidopp Nov 19, 2017

Add to this comment (or somewhere else if it makes more sense) why you remove nominated node name from the lower-priority pods.

Member Author

I added a comment to Preempt() in generic_scheduler.go where it collects and returns pods for removal of their nomination.

sched.config.Recorder.Eventf(victim, v1.EventTypeNormal, "Preempted", "by %v/%v on node %v", preemptor.Namespace, preemptor.Name, nodeName)
}
}
for _, p := range nominatedPodsToClear {
Member

How can you end up with nominated pods to clear, if node == nil?

Member Author

This can happen when a pod with nominated node name is eligible for preemption again, but preemption logic does not find any node for it. In that case Preempt() function of generic_scheduler.go returns the pod itself for removal of the annotation.

Member

Please add a comment.

(BTW by "eligible for preemption" I assume you mean "eligible to preempt")

Member Author

Done.
Yes, I mean "eligible to preempt".

@k8s-github-robot k8s-github-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 19, 2017
Member Author
@bsalamat bsalamat left a comment

I think I addressed all the comments except the eCache. I am working on that one.

if !entry.TryWait(backoff.MaxDuration()) {
glog.Warningf("Request for pod %v already in flight, abandoning", podID)
return
if !util.PodPriorityEnabled() {
Member Author

Done

// run them when pods nominated to run on the node added to meta and nodeInfo.
// If all predicates succeed in this pass, we run them again when these
// nominated pods are not added. This second pass is necessary because some
// predicates such as inter-pod affinity may not pass without the nominated pods.
Member Author

Done

}
if !fit {
// eCache is available and valid, and predicates result is unfit, record the fail reasons
failedPredicates = append(failedPredicates, reasons...)
Member Author

This is not addressed yet. I'm working on it...

// pods that are already tried and are determined to be unschedulable. The latter
// is called unschedulableQ.
// FIFO is here for flag-gating purposes and allows us to use the traditional
// scheduling queue when Pod Priority flag is false.
Member Author

Done

@k8s-github-robot k8s-github-robot removed the needs-rebase label on Nov 20, 2017, and added and removed the approved label several times as the branch was updated, leaving it in the approved state
@bsalamat bsalamat force-pushed the starvation3 branch 2 times, most recently from 693b562 to 66dfb09, November 20, 2017 05:12
@bsalamat
Member Author

/kind feature
/priority important-soon

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Nov 21, 2017
Member
@davidopp davidopp left a comment

Just a few small remaining comments.

}
passes, pErr := nodePassesExtendersForPreemption(pod, node.Name, nodeToPods[node], g.cachedNodeInfoMap, g.extenders)
if passes && pErr == nil {
return node, nodeToPods[node], err
// Lower priority pods nominated to run on this node, may no longer fit on
Member

I think it should be lower or equal priority pods? (In the comment and the implementation.)

Member Author

It should be lower priority pods only. Equal priority pods are scheduled in the order received; a pod cannot preempt another pod of equal priority, and similarly it cannot take the space vacated for another equal priority nominated pod.
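
A hypothetical sketch of that filter (illustrative names and types, not the actual getLowerPriorityNominatedPods implementation): only strictly lower priority pods nominated to the chosen node are returned for clearing.

```go
// Hypothetical sketch of the strictly-lower-priority filter discussed above.
package main

import "fmt"

type nominatedPod struct {
	Name          string
	Priority      int
	NominatedNode string
}

// lowerPriorityNominatedPods returns nominated pods on nodeName whose priority
// is strictly lower than the preemptor's. Equal priority pods keep their
// nomination: they are scheduled in the order received and cannot be displaced
// by an equal priority preemptor.
func lowerPriorityNominatedPods(preemptorPriority int, nodeName string, nominated []nominatedPod) []nominatedPod {
	var result []nominatedPod
	for _, p := range nominated {
		if p.NominatedNode == nodeName && p.Priority < preemptorPriority {
			result = append(result, p)
		}
	}
	return result
}

func main() {
	nominated := []nominatedPod{
		{Name: "equal", Priority: 100, NominatedNode: "node-1"},
		{Name: "lower", Priority: 10, NominatedNode: "node-1"},
	}
	fmt.Println(lowerPriorityNominatedPods(100, "node-1", nominated)) // only "lower"
}
```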


Member Author
@bsalamat bsalamat left a comment

Thanks, @davidopp! Addressed comments. PTAL.


@davidopp
Member

/lgtm

BTW, did you remember to remove the change you had put in to force-enable the feature so the tests would run with it enabled? (the one you had a TODO to remove before merging)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 21, 2017
@k8s-github-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, davidopp

Associated issue: 54501

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot

[MILESTONENOTIFIER] Milestone Pull Request Current

@bsalamat @davidopp @jayunit100 @timothysc

Note: If this pull request is not resolved or labeled as priority/critical-urgent by Wed, Nov 22 it will be moved out of the v1.9 milestone.

Pull Request Labels
  • sig/scheduling: Pull Request will be escalated to these SIGs if needed.
  • priority/important-soon: Escalate to the pull request owners and SIG owner; move out of milestone after several unsuccessful escalation attempts.
  • kind/feature: New functionality.
Help

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@bsalamat
Member Author

@davidopp Yes, I have already removed it. Thanks a lot for reviewing the PRs!

@k8s-github-robot

Automatic merge from submit-queue (batch tested with PRs 54316, 53400, 55933, 55786, 55794). If you want to cherry-pick this change to another branch, please follow the instructions here.

@k8s-github-robot k8s-github-robot merged commit 34b258c into kubernetes:master Nov 21, 2017
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Successfully merging this pull request may close these issues.

Fix starvation problem in pod preemption
7 participants