Extending taint-based-eviction schedule by lengthening tolerationSeconds is not possible.
#102993
Comments
@dbenque: This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/sig scheduling |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
Please send feedback to sig-contributor-experience at kubernetes/community. /close |
@k8s-triage-robot: Closing this issue. In response to this:
What happened:
With taint-based eviction, the tolerationSeconds parameter is taken into account when the eviction schedule is created. Once an eviction schedule has been created, updating tolerationSeconds only takes effect if it results in an earlier schedule.
This does not give a good user experience because the behavior appears inconsistent. If tolerationSeconds is updated for a group of pods spread across N nodes, the outcome differs depending on whether eviction schedules already exist for some of those nodes.
Note that the modification of tolerationSeconds in the pod toleration is always accepted by the APIServer.
Let's take an example:
Initial condition:
Fleet of 100 pods, each running on a different node, so we have 100 nodes. All pods have a tolerationSeconds of 1 hour against the taint node.kubernetes.io/not-ready:NoExecute.
State change (T0):
12 of the 100 nodes become not-ready and are tainted. The controller manager schedules 12 evictions to be triggered in 1h.
User decision:
The user decides that tolerationSeconds should be updated to 1 day on the fleet of pods. Let's imagine that this extension is meant to give the team more time for remediation and fixes on the unhealthy cluster (potentially the reason for the not-ready nodes).
All tolerationSeconds values are updated to 1 day. The APIServer accepts all the modifications. Behind the scenes, the controller manager observes the change but does not reschedule the 12 pending evictions.
State change (T1):
13 more nodes become not-ready. We now have 25 not-ready nodes. 13 new eviction schedules are created, and they will trigger in 24h.
What happens later:
at T0+1h: 12 pods are evicted (not what a user would expect)
at T1+24h: 13 pods are evicted (expected)
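The timeline above can be worked through as a small simulation. This is only a sketch of the behavior as described in this report, not the controller manager's code: the function name trigger_after_update, the node names, the concrete timestamps, and the 30-minute gap between T0 and T1 are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

ONE_HOUR = 3600
ONE_DAY = 24 * 3600

T0 = datetime(2021, 6, 1, 12, 0, 0)   # arbitrary example timestamp
T1 = T0 + timedelta(minutes=30)       # assumed: the report does not say when T1 occurs


def trigger_after_update(created_at, current_trigger, new_toleration_seconds):
    """Hypothetical model of the reported rule: a pending eviction can only move earlier."""
    candidate = created_at + timedelta(seconds=new_toleration_seconds)
    return min(current_trigger, candidate)


# T0: 12 nodes become not-ready while tolerationSeconds is still 1 hour.
first_wave = {f"node-{i}": T0 + timedelta(seconds=ONE_HOUR) for i in range(12)}

# User decision: tolerationSeconds is raised to 1 day; pending schedules are not pushed back.
first_wave = {
    node: trigger_after_update(T0, trigger, ONE_DAY)
    for node, trigger in first_wave.items()
}

# T1: 13 more nodes become not-ready; their schedules use the new 1-day value.
second_wave = {f"node-{i}": T1 + timedelta(seconds=ONE_DAY) for i in range(12, 25)}

first_delay = first_wave["node-0"] - T0
second_delay = second_wave["node-12"] - T1
print(f"{len(first_wave)} pods evicted {first_delay} after T0")
print(f"{len(second_wave)} pods evicted {second_delay} after T1")
```

Both groups of pods carry the same 1-day toleration at eviction time, yet the first group is evicted after 1 hour and the second after 24 hours.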
What you expected to happen:
If an eviction schedule is pending and tolerationSeconds is updated, the new value is taken into account in all cases, whether the schedule moves later or earlier.
How to reproduce it (as minimally and precisely as possible):
1- Taint a node with a NoExecute taint.
2- Extend the tolerationSeconds value of an impacted pod by 1 day.
You will notice that the pod is still evicted after 5 minutes (the default tolerationSeconds value).
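The outcome of these two steps can be modelled in a few lines. This is a hedged sketch of the rule as reported here (a pending schedule only moves earlier), not an excerpt from the taint manager; pending_trigger_after_update and the example timestamp are hypothetical, and the 300-second constant simply encodes the 5-minute default mentioned above.

```python
from datetime import datetime, timedelta

DEFAULT_TOLERATION_SECONDS = 300  # the 5-minute default mentioned in the steps above


def pending_trigger_after_update(created_at, current_trigger, new_toleration_seconds):
    """Hypothetical model: an update is ignored if it would delay a pending eviction."""
    candidate = created_at + timedelta(seconds=new_toleration_seconds)
    return min(current_trigger, candidate)


tainted_at = datetime(2021, 6, 1, 12, 0, 0)  # arbitrary example timestamp (step 1)
trigger = tainted_at + timedelta(seconds=DEFAULT_TOLERATION_SECONDS)

# Step 2: extend tolerationSeconds by 1 day while the eviction is pending.
extended_seconds = DEFAULT_TOLERATION_SECONDS + 24 * 3600
trigger_after_update = pending_trigger_after_update(tainted_at, trigger, extended_seconds)

print(trigger_after_update - tainted_at)  # prints 0:05:00
```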
Anything else we need to know?:
The code is written to prevent such an extension, but there is no clear explanation for that in the documentation or in the code. The only thing that reflects that intention is this unit test:
kubernetes/pkg/controller/nodelifecycle/scheduler/taint_manager_test.go
Line 295 in fddb3ad
I will propose a simple PR that allows moving the schedule earlier or later depending on the modification made to tolerationSeconds, but of course that would break the unit test above.
Is there any reason for blocking the extension of an established eviction schedule and not respecting a tolerationSeconds update?
For more context: we are trying to extend eviction schedules because under some catastrophic scenarios (many or all kubelets losing connection to the apiserver) eviction schedules are created. Of course we could play with the eviction rate, but this is clearly not satisfactory. The main reasons are:
1- we cannot change it while the CM is running
2- thresholds are defined at the cluster level, while we would like to work per group of nodes, per application, per namespace, or along any other dimension
3- extending/pushing the eviction schedule is the only way to give operations people time to deal with remediation without impacting workloads
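For comparison, the behavior requested in this issue (honoring a tolerationSeconds update in both directions) could be sketched as follows. This is an illustration only: trigger_honoring_update is a hypothetical name, and measuring the new deadline from the creation time of the original schedule is an assumption, since the issue does not state which reference time the PR would use.

```python
from datetime import datetime, timedelta


def trigger_honoring_update(created_at, new_toleration_seconds):
    """Hypothetical model: always recompute the deadline from the new tolerationSeconds."""
    return created_at + timedelta(seconds=new_toleration_seconds)


created_at = datetime(2021, 6, 1, 12, 0, 0)  # arbitrary example timestamp

# Schedule created with a 1-hour toleration, then updated in either direction.
moved_later = trigger_honoring_update(created_at, 24 * 3600)   # extended to 1 day
moved_earlier = trigger_honoring_update(created_at, 10 * 60)   # shortened to 10 minutes

print(moved_later - created_at)
print(moved_earlier - created_at)
```

Under this model, the 12 pods from the example would be evicted at T0+24h instead of T0+1h, which leaves operators the time they asked for.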
Environment:
Kubernetes version (kubectl version): 1.21
OS (cat /etc/os-release): ubuntu
Kernel (uname -a): 5.8.0