
Throttle pod eviction when past evictions led to the same node being chosen by the scheduler #424

Closed
codablock opened this issue Oct 7, 2020 · 8 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@codablock

Is your feature request related to a problem? Please describe.
Currently, several missing features in the descheduler cause pods to be re-scheduled onto the same nodes again and again. Two such examples are #335 and #335 (comment). The descheduler currently never gives up.

Describe the solution you'd like
Pod eviction should be throttled when a configurable number of past evictions has led to re-scheduling onto the same node. The threshold should be configurable via options (default values) and via per-pod annotations.

I assume there are more cases where pods are re-scheduled onto the same node; I already encountered two such cases on the first run of the descheduler. I also assume it will take some time until all of these cases are fixed, so I'd suggest implementing this feature first so that there is a sensible default/fallback behavior for such cases.
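To make the proposal concrete, here is a minimal sketch of the requested behavior in Go. All names (`evictionRecord`, `shouldThrottle`, `observeReschedule`) are hypothetical; no such API exists in the descheduler today:

```go
package main

import "fmt"

// evictionRecord tracks, per pod/workload, how often the scheduler placed
// a replacement pod back onto the node it was just evicted from.
type evictionRecord struct {
	evictedFrom     string // node the pod was last evicted from
	sameNodeRecount int    // consecutive same-node reschedules
}

// shouldThrottle reports whether further evictions should be skipped,
// given the configurable limit proposed above (option or annotation).
func shouldThrottle(r evictionRecord, limit int) bool {
	return r.sameNodeRecount >= limit
}

// observeReschedule updates the record after the scheduler placed the
// replacement pod on newNode; the counter resets on a different node.
func observeReschedule(r evictionRecord, newNode string) evictionRecord {
	if newNode == r.evictedFrom {
		r.sameNodeRecount++
	} else {
		r.sameNodeRecount = 0
	}
	r.evictedFrom = newNode
	return r
}

func main() {
	r := evictionRecord{evictedFrom: "node-a"}
	for i := 0; i < 3; i++ {
		r = observeReschedule(r, "node-a") // scheduler keeps choosing node-a
	}
	fmt.Println(shouldThrottle(r, 3)) // true: stop evicting this pod for now
}
```

The point of the sketch is only the control flow: keep a small history per pod, and once the configured limit of same-node reschedules is hit, back off instead of evicting forever.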

@codablock codablock added the kind/feature Categorizes issue or PR as related to a new feature. label Oct 7, 2020
@damemi
Contributor

damemi commented Oct 7, 2020

I think this would be related to our long-term goal of incorporating the actual scheduler framework into the descheduler to better inform eviction decisions (see related issues like #261, #238).

That work is making some progress, but it is currently at the stage of breaking the scheduler framework out of core Kubernetes (see upstream issue kubernetes/kubernetes#89930).

As you mention, there are a lot of cases where an evicted pod will end up on the same node. The descheduler is pretty naïve and optimistic, so trying to cover all of these now would amount to rewriting existing scheduler logic.

We could potentially add some sort of "throttling" (maybe an annotation like last-evicted-time/count?) that could de-prioritize pods for eviction, but ultimately I think the ask here folds into the known issue of incorporating more scheduler logic.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2021
@damemi
Contributor

damemi commented Jan 5, 2021

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 5, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2021
@seanmalloy
Member

remove-lifecycle stale

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels May 6, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
