
Make descheduler modify pod spec to make better scheduling decisions #261

Closed
ravisantoshgudimetla opened this issue Apr 8, 2020 · 19 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@ravisantoshgudimetla
Contributor

ravisantoshgudimetla commented Apr 8, 2020

Problem:
One of the primary goals of the descheduler has been to make sure that replacement pods for evicted pods are rescheduled properly.

As of now, while the descheduler identifies and evicts pods, there is no guarantee that the replacement pods will land on a different node.

Use-cases:

The topology manager in the kubelet is rejecting pods, and since the topology predicates are not available in the scheduler code, the pod keeps landing on the same node.

Solutions:

So, I want to propose some ideas so that the descheduler not only evicts pods but also ensures their replacements land on different nodes.

  • The descheduler actually runs the entire set of predicates and priorities from kube-scheduler, updates pod.Spec.NodeName, and performs the apiserver binding itself. The downside to this approach is that there could be race conditions between the kube-scheduler and the descheduler (in the small window when the pod spec has been modified).
  • The descheduler does not do the actual node assignment and binding, but instead influences the scheduling decision by adding nodeAffinity or podAffinity/anti-affinity to the pod spec (see the sketch after this list). This way we can ensure that the pod lands on a different node. One of the hard problems here is how to determine that the failed node is ready to be used again so that we can remove the nodeAffinity or podAffinity. One possible solution I can think of is an event or a Prometheus metric that signals the node is ready to be used again.
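
For the second idea, here is a minimal sketch (in Go, using the k8s.io/api/core/v1 types) of the kind of node-affinity term the descheduler could build so that the replacement pod avoids the node it was evicted from. The package and helper names are illustrative, and how the term would actually be applied to the replacement pod is left open:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// avoidNodeAffinity returns a required node-affinity term that excludes the
// given node via its kubernetes.io/hostname label. Sketch only: attaching
// this to the replacement pod is the part the proposal still has to solve.
func avoidNodeAffinity(nodeName string) *corev1.Affinity {
	return &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "kubernetes.io/hostname",
						Operator: corev1.NodeSelectorOpNotIn,
						Values:   []string{nodeName},
					}},
				}},
			},
		},
	}
}
```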

In any case, I would like to know what others feel.

@ravisantoshgudimetla
Contributor Author

/cc @damemi @seanmalloy @kubernetes-sigs/kubernetes-sig-scheduling

@ravisantoshgudimetla
Contributor Author

/cc @soltysh

@seanmalloy
Member

The descheduler actually runs the entire set of predicates and priorities from kube-scheduler, updates pod.Spec.NodeName, and performs the apiserver binding itself. The downside to this approach is that there could be race conditions between the kube-scheduler and the descheduler (in the small window when the pod spec has been modified).

In my opinion this is the better solution. Although I'm not convinced that having descheduler update pod.Spec.NodeName is the best approach. Possibly all that is needed is to look at the scheduling score to more precisely choose which pods to evict. See discussion in #238.

/kind feature

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 8, 2020
@ravisantoshgudimetla
Contributor Author

ravisantoshgudimetla commented Apr 8, 2020

In my opinion this is the better solution. Although I'm not convinced that having descheduler update pod.Spec.NodeName is the best approach. Possibly all that is needed is to look at the scheduling score to more precisely choose which pods to evict. See discussion in #238.

Even if we see that there is a better node (based on the score), how can we ensure that the pod lands on a different node when the node scores are normalized to 10? Consider another scenario where the node scores are the same because we did not express our constraints as priority functions in kube-scheduler. In that scenario, we will end up with the same score for both nodes and the chance of the pod landing on the same node is high. So, to me these are complementary solutions.

@damemi
Contributor

damemi commented Apr 8, 2020

I agree that we should work toward incorporating the default scheduler plugins, but I have a couple of questions about the proposals here.

For the first proposal, if we run the scheduler plugins to re-score the available nodes, we don't actually want to run all of them, but just those that are configured for the scheduler, right? Otherwise we could get variations between the kube-scheduler's scores and ours (this gets more complicated with out-of-tree plugins and custom schedulers, but we can ignore that for now).

I'm also not sure what you mean by the race condition between the scheduler and the descheduler in the case where we're directly updating spec.nodeName. We would only be evicting pods that are already scheduled, and the scheduler only looks at pods with no node assigned, right?

For the second proposed solution, I'm wondering if modifying affinity without user input could lead to confusing effects as the pod is rescheduled, and whether it would interact with future pods that are scheduled on top of that. Perhaps an annotation that encodes the information from the descheduler (why the pod was evicted, what node it was previously on, etc.) and that could be incorporated into the scheduler would be clearer. It would also help with the second part of this, which is determining when a node has become ready again (which varies for each descheduler strategy).
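
As a rough illustration of the annotation idea, the descheduler could record something like the following when it evicts a pod; the annotation keys and helper below are hypothetical and do not exist in the descheduler today:

```go
package sketch

import "time"

// Hypothetical annotation keys a scheduler-side component could read back
// later. These are made up for illustration only.
const (
	annotationEvictedFrom   = "descheduler.alpha.kubernetes.io/evicted-from"   // node the pod was evicted from
	annotationEvictedReason = "descheduler.alpha.kubernetes.io/evicted-reason" // strategy that triggered the eviction
	annotationEvictedAt     = "descheduler.alpha.kubernetes.io/evicted-at"     // timestamp, so the hint can expire
)

// evictionAnnotations builds the metadata for a pod that is about to be evicted.
func evictionAnnotations(nodeName, strategy string, now time.Time) map[string]string {
	return map[string]string{
		annotationEvictedFrom:   nodeName,
		annotationEvictedReason: strategy,
		annotationEvictedAt:     now.Format(time.RFC3339),
	}
}
```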

I'd like to hear from some more of the sig-scheduling people, because I think this functionality would essentially make the descheduler act like a second scheduler in the cluster. So it should be clear to users how that works and how it interacts with the default scheduler.

@MarSik

MarSik commented Apr 9, 2020

Hi, we were told about this effort as it seems to be solving something similar to what we need. Our goal is to temporarily prevent scheduling a Pod to a node where it was rejected by the TopologyManager.

And after reading the discussion, I agree with @damemi:

  • Neither the descheduler (nor the TopologyManager in our case) should act as a second scheduler, I think. You will always have issues staying in sync. And do not forget that the main scheduler supports extensions, and you can even replace the stock scheduler for a specific workload.

  • Modifying the affinity rules in the spec directly will affect the Pod for all future scheduling (maybe another descheduling later on) unless you remove the extra rules again after a successful placement.

To me the most obvious solution at this moment (no unnecessary API changes) would be to encode the "tabu" [1] rules as annotations or labels (to allow searching) and write a scheduler extension (in-tree or out-of-tree) that tells the main scheduler what those mean. The annotation can then be cleaned up, either by a "cleanup controller" (that scans all successfully placed pods with such a label) or by whoever performs the next eviction.

[1] similar to what https://en.wikipedia.org/wiki/Tabu_search does
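
To make the scheduler-side half of this concrete, here is a minimal sketch of what an out-of-tree scheduling-framework Filter plugin could look like, assuming a hypothetical "tabu nodes" annotation written by the descheduler (similar in spirit to the eviction metadata sketched above). The plugin name and annotation key are made up, and registration with the scheduler plus cleanup of the annotation are not shown:

```go
package tabu

import (
	"context"
	"strings"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Hypothetical annotation listing nodes this pod should not be placed on,
// e.g. written by the descheduler at eviction time.
const tabuNodesAnnotation = "descheduler.alpha.kubernetes.io/tabu-nodes"

// TabuNodes rejects nodes listed in the pod's tabu annotation.
type TabuNodes struct{}

var _ framework.FilterPlugin = &TabuNodes{}

func (pl *TabuNodes) Name() string { return "TabuNodes" }

func (pl *TabuNodes) Filter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeInfo *framework.NodeInfo) *framework.Status {
	val, ok := pod.Annotations[tabuNodesAnnotation]
	if !ok || val == "" {
		return nil // no hint on this pod, nothing to filter
	}
	node := nodeInfo.Node()
	if node == nil {
		return framework.NewStatus(framework.Error, "node not found")
	}
	for _, tabu := range strings.Split(val, ",") {
		if strings.TrimSpace(tabu) == node.Name {
			return framework.NewStatus(framework.Unschedulable, "node was marked tabu by the descheduler")
		}
	}
	return nil
}
```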

@damemi
Contributor

damemi commented Apr 9, 2020

To me the most obvious solution at this moment (no unnecessary API changes) would be to encode the "tabu" [1] rules as annotations or labels (to allow searching) and write a scheduler extension (in-tree or out-of-tree) that tells the main scheduler what those mean.

That's an interesting idea: we could annotate the pods however we want and then write a "descheduler" plugin for the kube-scheduler that knows how to handle them. Properly weighted in the scheduler config as a plugin, this could achieve the desired behavior of minimizing the chance that the node is chosen again for rescheduling. It also provides good user configurability.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 8, 2020
@seanmalloy
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 8, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 12, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 11, 2021
@k8s-ci-robot k8s-ci-robot added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 11, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

