
Make descheduler modify pod spec to make better scheduling decisions #261

Closed
ravisantoshgudimetla opened this issue Apr 8, 2020 · 19 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ravisantoshgudimetla
Contributor

ravisantoshgudimetla commented Apr 8, 2020

Problem:
One of the primary goals of the descheduler has been to make sure that replacement pods for evicted pods are rescheduled properly.

As of now, while the descheduler identifies and evicts pods, there is no guarantee that the replacement pods will land on a different node.

Use-cases:

The topology manager in the kubelet is rejecting pods, and since the topology predicates are not available in the scheduler code, the pod keeps landing on the same node.

Solutions:

So, I want to propose some ideas so that the descheduler not only evicts pods but also ensures their replacements land on different nodes.

  • The descheduler actually runs the entire set of predicates and priorities from kube-scheduler, updates pod.Spec.NodeName, and performs the apiserver binding itself. The downside to this approach is that there could be race conditions between the kube-scheduler and the descheduler (in the small window when the pod spec has been modified).
  • The descheduler does not do the actual node assignment and binding, but instead influences the scheduling decision by adding nodeAffinity or podAffinity/anti-affinity to the pod spec (see the sketch after this list). This way we can ensure that the pod lands on a different node. One of the hard problems here is how to determine that the failed node is ready to be used again so that we can remove the nodeAffinity or podAffinity. One possible solution I can think of is an event or a Prometheus metric that signals the node is ready to be used again.
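
For the second idea, here is a minimal sketch (in Go, using the k8s.io/api/core/v1 types) of the kind of node-affinity term the descheduler could build so that the replacement pod avoids the node it was evicted from. The package and helper names are illustrative, and how the term would actually be applied to the replacement pod is left open:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// avoidNodeAffinity returns a required node-affinity term that excludes the
// given node via its kubernetes.io/hostname label. Sketch only: attaching
// this to the replacement pod is the part the proposal still has to solve.
func avoidNodeAffinity(nodeName string) *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "kubernetes.io/hostname",
						Operator: corev1.NodeSelectorOpNotIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}
```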

In any case, I would like to know what others feel.

@ravisantoshgudimetla
Contributor Author

/cc @damemi @seanmalloy @kubernetes-sigs/kubernetes-sig-scheduling

@ravisantoshgudimetla
Contributor Author

/cc @soltysh

@seanmalloy
Member

The descheduler actually runs the entire set of predicates and priorities from kube-scheduler, updates pod.Spec.NodeName, and performs the apiserver binding itself. The downside to this approach is that there could be race conditions between the kube-scheduler and the descheduler (in the small window when the pod spec has been modified).

In my opinion this is the better solution. Although I'm not convinced that having descheduler update pod.Spec.NodeName is the best approach. Possibly all that is needed is to look at the scheduling score to more precisely choose which pods to evict. See discussion in #238.

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 8, 2020
@ravisantoshgudimetla
Contributor Author

ravisantoshgudimetla commented Apr 8, 2020

In my opinion this is the better solution. Although I'm not convinced that having descheduler update pod.Spec.NodeName is the best approach. Possibly all that is needed is to look at the scheduling score to more precisely choose which pods to evict. See discussion in #238.

Even if we see that there is a better node (based on the score), how can we ensure that the pod lands on a different node when the node scores are normalized to 10? Consider another scenario where the node scores are the same because we did not express our constraints as priority functions in kube-scheduler. In that scenario, we will end up with the same score for both nodes and the chance of the pod landing on the same node is high. So, to me these are complementary solutions.

@damemi
Contributor

damemi commented Apr 8, 2020

I agree that we should work toward incorporating the default scheduler plugins, but I have a couple of questions about the proposals here.

For the first proposal, if we run the scheduler plugins to re-score the available nodes, we don't actually want to run all of them, but just those that are configured for the scheduler, right? Otherwise we could get variations between the kube-scheduler's scores and ours (this gets more complicated with out-of-tree plugins and custom schedulers, but we can ignore that for now).

I'm also not sure what you mean by the race condition between the scheduler and the descheduler in the case where we're directly updating spec.nodeName. We would only be evicting pods that are already scheduled, and the scheduler only looks at pods with no node assigned, right?

For the second proposed solution, I'm wondering if modifying affinity without user input could lead to confusing effects as the pod is rescheduled, and whether it would interact with future pods that are scheduled on top of that. Perhaps an annotation that encodes the information from the descheduler (why the pod was evicted, what node it was previously on, etc.) and that could be incorporated into the scheduler would be clearer. It would also help with the second part of this, which is determining when a node has become ready again (which varies for each descheduler strategy).
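
As a rough illustration of the annotation idea, the descheduler could record something like the following when it evicts a pod; the annotation keys and helper below are hypothetical and do not exist in the descheduler today:

```go
package sketch

import "time"

// Hypothetical annotation keys a scheduler-side component could read back
// later. These are made up for illustration only.
const (
	annotationEvictedFrom   = "descheduler.alpha.kubernetes.io/evicted-from"   // node the pod was evicted from
	annotationEvictedReason = "descheduler.alpha.kubernetes.io/evicted-reason" // strategy that triggered the eviction
	annotationEvictedAt     = "descheduler.alpha.kubernetes.io/evicted-at"     // timestamp, so the hint can expire
)

// evictionAnnotations builds the metadata for a pod that is about to be evicted.
func evictionAnnotations(nodeName, strategy string, now time.Time) map[string]string {
	return map[string]string{
		annotationEvictedFrom:   nodeName,
		annotationEvictedReason: strategy,
		annotationEvictedAt:     now.Format(time.RFC3339),
	}
}
```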

I'd like to hear from some more of the sig-scheduling people, because I think this functionality would essentially make the descheduler act like a second scheduler in the cluster. So it should be clear to users how that works and how it interacts with the default scheduler.

@MarSik

MarSik commented Apr 9, 2020

Hi, we were told about this effort as it seems to be solving something similar to what we need. Our goal is to temporarily prevent scheduling a Pod to a node where it was rejected by the TopologyManager.

And after reading the discussion, I agree with @damemi:

  • Neither the descheduler (nor the TopologyManager in our case) should act as a second scheduler, I think. You will always have issues staying in sync. And do not forget that the main scheduler supports extensions, and you can even replace the stock scheduler for a specific workload.

  • Modifying the affinity rules in the spec directly will affect the Pod for all future scheduling (maybe another descheduling later on) unless you remove the extra rules again after a successful placement.

To me the most obvious solution at this moment (no unnecessary API changes) would be to encode the "tabu" [1] rules as annotations or labels (to allow searching) and write a scheduler extension (in-tree or out-of-tree) that tells the main scheduler what those mean. The annotation can then be cleaned up, either by a "cleanup controller" (that scans all successfully placed pods with such a label) or by whoever performs the next eviction.

[1] similar to what https://en.wikipedia.org/wiki/Tabu_search does
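
To make the scheduler-side half of this concrete, here is a minimal sketch of what an out-of-tree scheduling-framework Filter plugin could look like, assuming a hypothetical "tabu nodes" annotation written by the descheduler (similar in spirit to the eviction metadata sketched above). The plugin name and annotation key are made up, and registration with the scheduler plus cleanup of the annotation are not shown:

```go
package tabu

import (
	"context"
	"strings"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Hypothetical annotation listing nodes this pod should not be placed on,
// e.g. written by the descheduler at eviction time.
const tabuNodesAnnotation = "descheduler.alpha.kubernetes.io/tabu-nodes"

// TabuNodes rejects nodes listed in the pod's tabu annotation.
type TabuNodes struct{}

var _ framework.FilterPlugin = &TabuNodes{}

func (pl *TabuNodes) Name() string { return "TabuNodes" }

func (pl *TabuNodes) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	val, ok := pod.Annotations[tabuNodesAnnotation]
	if !ok || val == "" {
		return nil // no hint on this pod, nothing to filter
	}
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	for _, tabu := range strings.Split(val, ",") {
		if strings.TrimSpace(tabu) == node.Name {
			return framework.NewStatus(framework.Unschedulable, "node was marked tabu by the descheduler")
		}
	}
	return nil
}
```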

@damemi
Contributor

damemi commented Apr 9, 2020

To me the most obvious solution at this moment (no unnecessary API changes) would be to encode the "tabu" [1] rules as annotations or labels (to allow searching) and write a scheduler extension (in-tree or out-of-tree) that tells the main scheduler what those mean.

That's an interesting idea: we could annotate the pods however we want and then write a "descheduler" plugin for the kube-scheduler that knows how to handle them. Properly weighted in the scheduler config as a plugin, this could achieve the desired behavior of minimizing the chance that the node is chosen again for rescheduling. It also provides good user configurability.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 8, 2020
@seanmalloy
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 8, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 11, 2021
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 11, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

