Make descheduler modify pod spec to make better scheduling decisions #261
/cc @damemi @seanmalloy @kubernetes-sigs/kubernetes-sig-scheduling
/cc @soltysh
In my opinion this is the better solution, although I'm not convinced that having the descheduler update the pod spec itself is the right approach.
/kind feature
Even if we see that there is a better node (depending on the score), how can we ensure the pod lands on a different node when the node scores are normalized to 10? Consider another scenario where the node scores are the same because we did not express our constraints as priority functions in kube-scheduler. In that case we end up with the same score for both nodes, and the chance of the pod landing on the same node is high. So, to me these are complementary solutions.
I agree that we should work toward incorporating the default scheduler plugins, but I have a couple of questions about the proposals here.

For the first proposal, if we run the scheduler plugins to re-score the available nodes, we don't actually want to run all of them, just those that are configured for the scheduler, right? Otherwise we can get variations between the kube-scheduler's scores and ours (this gets more complicated with out-of-tree plugins and custom schedulers, but we can ignore that for now). I'm also not sure what you mean by the race condition between the scheduler and the descheduler in the case where we're directly updating the pod spec.

For the second proposed solution, I'm wondering if modifying affinity without user input could lead to confusing effects as the pod is rescheduled, and whether it would interact with future pods that are scheduled on top of that. Perhaps just an annotation that encodes the information from the descheduler (why it was evicted, what node it was previously on, etc.) that could be incorporated into the scheduler would be clearer. That would also help us with the second part of this, which is denoting when a node has become ready again (and that varies for each descheduler strategy).

I'd like to hear from some more of the sig-scheduling people, because I think this functionality would essentially make the descheduler act like a second scheduler in the cluster. So it should be clear to users how that works and how it interacts with the default scheduler.
Hi, we were told about this effort as it seems to be solving something similar to what we need. Our goal is to temporarily prevent scheduling a Pod to a node where it was rejected by the TopologyManager. Having read the discussion, I agree with @damemi.
To me the most obvious solution at this moment (no unnecessary API changes) would be to encode the "tabu" [1] rules as annotations or labels (to allow searching) and write a scheduler extension (in or out of tree) that tells the main scheduler what those mean. The annotation can then be cleaned up, either by a "cleanup controller" (that scans all successfully placed pods with such a label) or by whoever does the next eviction.
[1] Similar to what https://en.wikipedia.org/wiki/Tabu_search does
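To make that concrete, here is a rough sketch of what recording the tabu node before eviction could look like. The annotation key is made up for illustration (it is not an existing descheduler API); the client-go calls are the standard Get/Update and Eviction-subresource calls:

```go
// Sketch only: annotate a pod with the node it should avoid (the "tabu" node),
// then evict it. The annotation key below is hypothetical, not a real API.
package tabu

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// tabuNodeAnnotation is an invented key used only for this illustration.
const tabuNodeAnnotation = "descheduler.alpha.kubernetes.io/tabu-node"

func evictWithTabuAnnotation(ctx context.Context, cs kubernetes.Interface, ns, podName string) error {
	pod, err := cs.CoreV1().Pods(ns).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	// Record the node this workload should avoid on the next placement.
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations[tabuNodeAnnotation] = pod.Spec.NodeName
	if _, err := cs.CoreV1().Pods(ns).Update(ctx, pod, metav1.UpdateOptions{}); err != nil {
		return err
	}

	// Evict through the Eviction subresource so PodDisruptionBudgets are respected.
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: podName, Namespace: ns},
	}
	return cs.PolicyV1().Evictions(ns).Evict(ctx, eviction)
}
```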
That's an interesting idea: we could annotate the pods however we want and then write a "descheduler" plugin for the kube-scheduler that knows how to handle them. Properly weighted in the scheduler config as a plugin, this could achieve the desired behavior of minimizing the chance that the node is chosen again for rescheduling. It also provides good user configurability.
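As a sketch of what that scheduler-side piece could look like, here is a minimal out-of-tree Score plugin that reads the same (hypothetical) annotation and gives the tabu node the lowest score. The framework package path and plugin interfaces differ between Kubernetes versions, so treat this as illustrative only:

```go
// Sketch of a kube-scheduler Score plugin that penalizes the node a pod was
// marked to avoid. The annotation key is hypothetical and must match whatever
// the descheduler side writes.
package tabuscore

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

const (
	// Name is how the plugin would be referenced in the scheduler configuration.
	Name = "TabuNode"
	// tabuNodeAnnotation is an invented key used only for this illustration.
	tabuNodeAnnotation = "descheduler.alpha.kubernetes.io/tabu-node"
)

// TabuNode scores the previously rejected node as low as possible.
type TabuNode struct{}

var _ framework.ScorePlugin = &TabuNode{}

func (pl *TabuNode) Name() string { return Name }

// Score returns the minimum score for the tabu node and the maximum score for
// every other node; the final ranking still depends on how this plugin is
// weighted against the other configured score plugins.
func (pl *TabuNode) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	if pod.Annotations[tabuNodeAnnotation] == nodeName {
		return framework.MinNodeScore, nil
	}
	return framework.MaxNodeScore, nil
}

// ScoreExtensions is nil because the scores are already within the valid range.
func (pl *TabuNode) ScoreExtensions() framework.ScoreExtensions { return nil }
```

Note that scoring only biases placement away from the tabu node; a Filter plugin would be needed to exclude it outright.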
Issues go stale after 90d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale
/remove-lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules: after 90d of inactivity, lifecycle/stale is applied; after a further 30d, lifecycle/rotten is applied; after another 30d, the issue is closed. You can mark this issue as fresh with /remove-lifecycle rotten, close it with /close, or offer to help out with issue triage. Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules: after 90d of inactivity, lifecycle/stale is applied; after a further 30d, lifecycle/rotten is applied; after another 30d, the issue is closed. You can reopen this issue with /reopen, mark it as fresh with /remove-lifecycle rotten, or offer to help out with issue triage. Please send feedback to sig-contributor-experience at kubernetes/community.
/close
@k8s-triage-robot: Closing this issue.
Problem:
One of the primary goals of the descheduler has been to enable proper rescheduling of replacement pods.
As of now, while the descheduler identifies and evicts pods, there is no guarantee that the evicted pods will land on a different node.
Use-cases:
The topology manager in the kubelet is actually rejecting pods, and since the topology predicates are not available in the scheduler, the pod keeps landing on the same node.
Solutions:
So, I want to propose some ideas to ensure that the descheduler evicts pods in a way that lets them land on different nodes. The two directions raised in the discussion above are: (1) have the descheduler run the scheduler's scoring plugins to check whether a better node exists before evicting, and (2) have the descheduler modify the pod spec (for example, its affinity) so the replacement pod avoids the original node; a sketch of the second idea follows below.
In any case, I would like to know what others think.
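For illustration of the second idea, here is a minimal sketch of the node anti-affinity the descheduler would have to inject, assuming nodes carry the standard kubernetes.io/hostname label. The helper name is made up, and since a pod's affinity is immutable after creation, this would realistically have to be applied to the owning workload's pod template rather than to the evicted pod itself:

```go
// Sketch only: build a required node-affinity term that keeps a replacement pod
// off the node it was evicted from. Where to apply it (pod template vs. pod) is
// exactly the open question discussed in this issue.
package affinitysketch

import v1 "k8s.io/api/core/v1"

// avoidNodeAffinity excludes nodeName via the standard kubernetes.io/hostname label.
func avoidNodeAffinity(nodeName string) *v1.Affinity {
	return &v1.Affinity{
		NodeAffinity: &v1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchExpressions: []v1.NodeSelectorRequirement{{
						Key:      "kubernetes.io/hostname",
						Operator: v1.NodeSelectorOpNotIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}
```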