
Extending taint-based-eviction schedule by lengthening tolerationSeconds is not possible. #102993

Closed
dbenque opened this issue Jun 18, 2021 · 6 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling.

Comments


dbenque commented Jun 18, 2021

What happened:

Working with taint-based eviction, the tolerationSeconds parameter is taken into account when the eviction schedule is created.
Once an eviction schedule has been created, updating tolerationSeconds only takes effect if it results in an earlier schedule.

This makes for a poor user experience because the behavior appears inconsistent: if tolerationSeconds is updated for a group of pods associated with N nodes, the outcome differs depending on whether eviction schedules are already set on some of those nodes.

Note that modifying tolerationSeconds in the pod's tolerations is always accepted by the APIServer.
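For reference, the kind of toleration discussed here looks like the following (a minimal Go sketch using the k8s.io/api/core/v1 types; the 1h value matches the example that follows and is only illustrative):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Tolerate the not-ready NoExecute taint for one hour before eviction.
	seconds := int64(3600)
	tol := corev1.Toleration{
		Key:               "node.kubernetes.io/not-ready",
		Operator:          corev1.TolerationOpExists,
		Effect:            corev1.TaintEffectNoExecute,
		TolerationSeconds: &seconds, // the field whose later update is ignored once an eviction is scheduled
	}
	fmt.Printf("%+v\n", tol)
}
```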

Let's take an example:

Initial condition:
A fleet of 100 pods, each running on a different node, so we have 100 nodes. All pods have a tolerationSeconds of 1 hour for the taint node.kubernetes.io/not-ready:NoExecute.

State change (T0):
12 of the 100 nodes become not-ready and are tainted. The controller manager schedules 12 evictions to be triggered in 1h.

User decision:
The user decides that tolerationSeconds should be updated to 1 day on the whole fleet of pods. Imagine this extension of the toleration is decided to give the team more time to do remediations and fixes on the unhealthy cluster (potentially the reason for the not-ready nodes).
All tolerationSeconds values are updated to 1 day. The APIServer accepts all the modifications. Behind the scenes, the controller manager observes the change but does not reschedule the 12 pending evictions.

State change (T1):
Another 13 nodes become not-ready, so we now have 25 not-ready nodes. 13 new eviction schedules are created, and they will trigger in 24h.

What happens later:
at T0+1h: the 12 pods are evicted! (not really what a user would expect)
at T1+24h: the 13 pods are evicted (expected)
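To make the timeline concrete, here is a small self-contained Go sketch (triggerTime is a hypothetical helper, not the controller manager code) of how a trigger time is derived from the tolerationSeconds value in effect when the taint is observed; since the 12 schedules created at T0 are never recomputed, they still fire at T0+1h:

```go
package main

import (
	"fmt"
	"time"
)

// triggerTime is a hypothetical helper: the eviction fires at the moment the
// taint was observed plus the toleration in effect at scheduling time.
func triggerTime(taintObservedAt time.Time, tolerationSeconds int64) time.Time {
	return taintObservedAt.Add(time.Duration(tolerationSeconds) * time.Second)
}

func main() {
	t0 := time.Date(2021, 6, 18, 10, 0, 0, 0, time.UTC)
	t1 := t0.Add(30 * time.Minute)

	// Scheduled at T0 with the old 1h toleration; updating the pods later does not move it.
	fmt.Println("12 pods evicted at:", triggerTime(t0, 3600)) // T0+1h
	// Scheduled at T1, after the update to 1 day.
	fmt.Println("13 pods evicted at:", triggerTime(t1, 86400)) // T1+24h
}
```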

What you expected to happen:

If an eviction schedule is pending and tolerationSeconds is updated, the new value should be taken into account in all cases: the schedule should be moved later or earlier accordingly.

How to reproduce it (as minimally and precisely as possible):

1- Taint a node with a NoExecute taint.
2- Extend the tolerationSeconds value of an impacted pod by 1 day.

You will notice that the pod is evicted after 5 minutes (which is the default tolerationSeconds value).

Anything else we need to know?:

The code is written to prevent such an extension, but there is no clear explanation for that in the documentation or in the code. The only thing that reflects this intention is this unit test:

description: "lengthening toleration shouldn't work",

I will propose a simple PR that allows moving the schedule earlier or later depending on the modification made to tolerationSeconds, but of course that would break the unit test above.
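For illustration, the decision that produces the current behavior can be sketched as follows (simplified Go pseudologic with hypothetical names, not the actual controller manager code): an update is only honored when the recomputed trigger time is earlier than the one already scheduled, whereas the proposed change would reschedule in both directions:

```go
package main

import (
	"fmt"
	"time"
)

// rescheduleEviction is a hypothetical sketch of the scheduling decision.
// Current behavior: keep the existing schedule unless the new trigger time is earlier.
// Proposed behavior: always adopt the recomputed trigger time, earlier or later.
func rescheduleEviction(existing, recomputed time.Time, allowExtension bool) time.Time {
	if recomputed.Before(existing) {
		return recomputed // moving the eviction earlier has always been allowed
	}
	if allowExtension {
		return recomputed // proposed: honor a lengthened tolerationSeconds as well
	}
	return existing // current: the later schedule is ignored, the pod is still evicted at the old time
}

func main() {
	existing := time.Now().Add(1 * time.Hour)
	recomputed := time.Now().Add(24 * time.Hour)
	fmt.Println("current :", rescheduleEviction(existing, recomputed, false))
	fmt.Println("proposed:", rescheduleEviction(existing, recomputed, true))
}
```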

Is there any reason for blocking the extension of an established eviction schedule rather than respecting the tolerationSeconds update?

For more context: we are trying to extend eviction schedules because, under some catastrophic scenarios (many/all kubelets losing connection to the apiserver), eviction schedules get created. Of course we could play with the eviction rate, but this is clearly not satisfying, for three main reasons:
1- we cannot change it while the controller manager is running
2- thresholds are defined at the cluster level, while we would like to work per group of nodes, per application, per namespace, or along whatever dimension
3- extending/pushing the eviction schedule is the only way to give operations people time to deal with remediation without impacting workloads

Environment:

  • Kubernetes version (use kubectl version): 1.21
  • Cloud provider or hardware configuration: aws
  • OS (e.g: cat /etc/os-release): ubuntu
  • Kernel (e.g. uname -a): 5.8.0
  • Install tools: -
  • Network plugin and version (if this is a network-related bug): -
  • Others:
@dbenque dbenque added the kind/bug Categorizes issue or PR as related to a bug. label Jun 18, 2021
@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Jun 18, 2021
@k8s-ci-robot
Contributor

@dbenque: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@neolit123
Member

/sig scheduling

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Jun 21, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 19, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Oct 19, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
