We'd like to graduate taints, tolerations, and dedicated nodes to Beta in 1.4. (Assuming we get them into 1.3 as alpha.)
Taints/tolerations to Beta
Taints/tolerations to v1
Dedicated nodes to Beta
cc/ @kevin-wangzefeng @bgrant0607
ref/ #24134 #17190 #18263
Working on the node side to support kubelet-enforced taints/tolerations. This will require a pod admitter in the future, which is on the sig-node TODO list.
@davidopp Could you explain why we should add forgiveness to Toleration? If we don't want to tolerate some taints, isn't removing the toleration enough?
The canonical example for forgiveness is: let's say we represent node problems using taints (which we actually do plan to do). Then a pod can use a toleration to represent how long it wants to stay bound to a node before it should be evicted. You want to add the taint immediately when the node starts having the problem, because some pods might want to be evicted right away. But the ones that don't want to be evicted right away can use a toleration to specify how long they want to stay bound. (Of course this assumes the TaintEffect is TaintEffectNoScheduleNoAdmitNoExecute.)
BTW @kevin-wangzefeng is going to work on adding forgiveness to Toleration.
Thanks, it's much clearer now.
I would like to help implement some tasks for v1.4. Could you mark the tasks that are already taken by somebody? Or could the folks who are taking them mark themselves here?
BTW, what's "proper defaulting"?
Yes, sorry, to clarify, we haven't done any planning for 1.4, and we intend to do it jointly with the community when it happens. But @kevin-wangzefeng implemented taints and tolerations (#24134) and had already told me he was planning to add forgiveness since it was part of the original design.
@hongchaodeng Except for forgiveness and what I am working on (e.g. NoScheduleNoAdmit, NoScheduleNoAdmitNoExecute), it seems the other tasks are all available. Would be excited to see you pick some up.
Is anyone working on "Move to GeneralPredicates"? If not, I can take it.
Move to GeneralPredicates
@lukaszo I don't think anyone is working on it.
Move PodToleratesNodeTaints to GeneralPredicates
Related issue: #25320
@davidopp is this list still correct(ish) for getting to beta? Is there any better place to look?
I think this list is more or less accurate. @kevin-wangzefeng is working on some of the items for 1.5. I don't think we will have time to make it Beta in 1.5, though. We could maybe split this up into taints/tolerations and dedicated nodes, and consider the latter a separate feature; that would allow us to remove the security-related items from the list (like ensuring Kubelet comes up with the right taints when it joins, and preventing unprivileged users from modifying taints/tolerations). Also, we definitely don't need NoScheduleNoAdmit for Beta, as it can be added in a backward-compatible way.
/cc @ConnorDoyle @jayunit100
Figure out what to do with the NodeSpec "unschedulable" field
@kevin-wangzefeng @davidopp Could you please share what specific items kevin is working on for 1.5, any links or anything for reference? Also I am happy to help on other items not being worked on by kevin, perhaps to help it move closer to beta.
@aveshagarwal please coordinate via @kubernetes/sig-scheduling to ensure folks aren't overlapping. xref: http://goo.gl/CsXJLu
I am trying to summarize the tasks here to capture the current state, as it's not clear what is being worked on, what is already done, and what is yet to be done. As @davidopp suggested in one of his previous comments, I am separating the taints/tolerations and dedicated nodes features, and updating the list for graduating taints/tolerations to beta here (please feel free to correct me, and also update as needed):
Taints/Toleration to Beta:
Being worked on / already done:
Taints/tolerations to v1:
We can discuss more about it in today's sig-scheduling meeting.
@davidopp @kevin-wangzefeng As discussed in yesterday's sig-scheduling meeting, I would like to have feedback on my last comment, or in general.
@aveshagarwal things to keep in mind re(alpha-beta) : #30819 (comment)
Sorry for the delay. I think my list was way too long. Those are all features we want eventually but I don't think they're all needed for 1.5.
The remaining things I think we need for 1.5 are
@kevin-wangzefeng is working on these and said he should have a PR out soon.
As for graduating it to Beta, we should do that in a followup PR. I hope we can get it done for 1.5. I think we need to implement "make toleration key wildcardable" before (or simultaneous with) graduating to Beta because I'm not sure we can add it later in a backward compatible way. (imagine someone creates an alpha toleration with key X and then we decide that in beta key X means wildcard; then when they upgrade from alpha to beta their toleration suddenly means something new)
There is discussion ongoing in #30819 about how to graduate things to Beta in general.
We can ignore dedicated nodes and just consider it a separate feature to be implemented in the future.
I can take a stab at the "make toleration key wildcardable" implementation if no one is working on it. Here is my understanding.
When a toleration key is a wildcard:
Case 1: Operator=Equal, Value=v, Effect=one of NoSchedule, PreferNoSchedule, NoExecute, NoScheduleNoAdmit
It means that a pod with the above toleration could be scheduled to any node that has a taint with value v (regardless of key), according to the effects.
Case 2: Operator=Exists, Value=v, Effect=one of NoSchedule, PreferNoSchedule, NoExecute, NoScheduleNoAdmit
In this case, the value is a wildcard too (as Operator=Exists). Since both key and value are wildcards, it seems equivalent to tolerating every taint.
I think @kevin-wangzefeng may have started working on it, so you should check with him.
What I proposed was that empty toleration key (empty string) means "match all keys"
I think it would only make sense to use it with Exists operator (so it means "match all values and all keys"), so we would check that in validation.
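A minimal Go sketch of those two rules (empty key means "match all keys", and validation rejects a wildcard key with any operator other than Exists). The struct shapes here are simplified assumptions, not the actual API types:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the API types; field shapes are assumptions.
type Taint struct{ Key, Value, Effect string }

type Toleration struct {
	Key      string // "" is the wildcard: match all taint keys
	Operator string // "Equal" or "Exists"
	Value    string
	Effect   string
}

// validate enforces the rule discussed above: a wildcard (empty) key only
// makes sense with the Exists operator.
func validate(t Toleration) error {
	if t.Key == "" && t.Operator != "Exists" {
		return errors.New("wildcard (empty) toleration key requires operator Exists")
	}
	return nil
}

// tolerates reports whether the toleration matches the taint.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Key != "" && tol.Key != taint.Key {
		return false // non-wildcard key must match exactly
	}
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Operator == "Exists" {
		return true // Exists matches any value
	}
	return tol.Value == taint.Value // Equal compares values
}

func main() {
	all := Toleration{Operator: "Exists"} // empty key + Exists: tolerate everything
	fmt.Println(validate(all) == nil, tolerates(all, Taint{Key: "dedicated", Value: "db", Effect: "NoSchedule"}))
	// prints: true true
}
```

With this shape, "match all values and all keys" falls out naturally: the empty key skips the key comparison and Exists skips the value comparison.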
@davidopp thanks, it's OK with me if @kevin-wangzefeng is working on it, so I will leave it to him.
Is someone working on the "NoScheduleNoAdmit" implementation too? I haven't seen any PR for it. If not, I would be happy to help with it.
@aveshagarwal I think it's #34400 (previously #26501), but I didn't get time to take a look yet.
@kevin-wangzefeng Yes #34400 implements NoScheduleNoAdmit.
Is this the expectation with NoScheduleNoAdmit?
a) NoSchedule part of NoScheduleNoAdmit is exactly similar to existing NoSchedule and is for pods going via scheduler.
b) The NoAdmit part of NoScheduleNoAdmit is for static pods (created directly by the kubelet, i.e. pods not going through the scheduler).
Unfortunately we're not going to have time to get #34400 into 1.5.
But as for what you described @aveshagarwal, it is correct except I'm not sure the term "static pods" is correct (or maybe I just don't understand the definition you're using). But as you said, it is for pods that are submitted directly to Kubelet, e.g. by creating a pod with NodeName already filled in.
@davidopp I just picked the word static pod from here: http://kubernetes.io/docs/admin/static-pods/
Yeah I'm not sure whether NoAdmit would affect static pods. The intended use case was a regular pod that sets NodeName directly. Static pods are a bit different as they are created from a manifest file (and in theory by HTTP but nobody does that).
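As a rough illustration of the NoAdmit behavior discussed above, here is a hedged Go sketch of a Kubelet-style admission check: only taints with the NoScheduleNoAdmit effect are enforced at admission time, so a pod handed directly to the Kubelet (e.g. with NodeName pre-filled) is rejected unless it carries a matching toleration. The types and the `admit` function are assumptions for illustration, not the implementation in #34400:

```go
package main

import "fmt"

// Simplified stand-ins for the API types; shapes are assumptions.
type Taint struct{ Key, Value, Effect string }

type Toleration struct{ Key, Operator, Value string }

// admit sketches the Kubelet-side check: a pod that bypassed the scheduler
// is rejected if the node carries a NoScheduleNoAdmit taint the pod does
// not tolerate. Other effects (e.g. plain NoSchedule) are deliberately
// ignored here, since they are only enforced by the scheduler.
func admit(nodeTaints []Taint, podTolerations []Toleration) bool {
	for _, taint := range nodeTaints {
		if taint.Effect != "NoScheduleNoAdmit" {
			continue
		}
		tolerated := false
		for _, tol := range podTolerations {
			if tol.Key == taint.Key && (tol.Operator == "Exists" || tol.Value == taint.Value) {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

func main() {
	taints := []Taint{{Key: "dedicated", Value: "db", Effect: "NoScheduleNoAdmit"}}
	fmt.Println(admit(taints, nil), admit(taints, []Toleration{{Key: "dedicated", Operator: "Exists"}}))
	// prints: false true
}
```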
I have copied the list from #35518 here for taints/tolerations too:
As we discussed in yesterday's sig-sched meeting, I will start looking into it once the existing PRs are in better/working shape, probably in the 1.6 time frame.
Once functionality has reached parity with the annotation implementation, promote from alpha->beta.
@aveshagarwal I'd like to sync sometime on your test plan(s) around this. Right now everything is light at best.
but +1 to the plan.
We have a use case where we'd like a pod managed by a DaemonSet to detect when the underlying node is being put into maintenance mode. Currently, this means kubectl cordon or kubectl drain is run, setting the node's NodeSpec.Unschedulable field. But because the pod is part of a DaemonSet, it is not destroyed, and there is no way to differentiate between the two operations using the API. This actually matters quite a bit to us: when the node is drained, it shows the operator's intent to shut down/restart the node (or perform some other heavy-duty task), in which case we want to gracefully migrate our workload/tasks/resources to another working pod/node. On the other hand, if the node is simply cordoned, the risk is limited (pods restarting is fine) and it is not worth migrating.
Unfortunately, as mentioned above, we currently can't tell and therefore, we are forced to migrate everything as soon as the node is set as unschedulable, even if it's a very short maintenance/outage.
It seems that the NoScheduleNoAdmit and NoExecute taints fit the use case. However, I didn't clearly see any mention of how kubectl cordon and kubectl drain will be reworked to adopt these. Intuitively, I'd say that kubectl cordon would set NoScheduleNoAdmit while kubectl drain would set NoScheduleNoAdmit + NoExecute, without actively deleting any pods but instead letting the scheduler do it. Tolerations and forgiveness could then be used to make my DaemonSet's pods stop and execute the necessary migration steps using the preStop hook, while at the same time DaemonSets such as kube-proxy would tolerate these taints and keep running indefinitely.
Does it make sense? Am I missing anything? I believe these are planned for 1.6?
@Quentin-M yes the current plan for NoScheduleNoAdmit and NoExecute implementation is 1.6. Here are related PRs: #34825 , #34400 .
Each pod in a DaemonSet is supposed to be tied to the node it is running on. There shouldn't be any need to migrate anything when the machine goes down. So it sounds like you are using DaemonSet in an unexpected way.
But anyway, what you and @aveshagarwal described is indeed the plan; we will switch cordon/uncordon/drain to use taints, and then DaemonSet pods can decide for themselves whether they want to be deleted when these actions happen. We were hoping to get this into 1.5 but it didn't make it; it should be in 1.6. #29178 covers the "represent unschedulable as a taint" part; I don't think we have an issue filed for making kubectl drain use taints for eviction, but we can just leave your comment here as a reminder.
Actually I don't think we can use taints for kubectl drain, because we want to respect PodDisruptionBudget when doing the evictions for a drain, and eviction due to NoExecute taint presumably won't. I could imagine some scheme where the NoExecute taint could specify whether the eviction should respect PDB (go through pod /eviction subresource) or not, but this seems complicated. So I don't think eviction for taint will respect PDB. If we find some important use cases where we do need it, then we could do something like what I described, but it's complication we should avoid unless it's truly necessary.
We should consider whether to add taint/toleration check to GeneralPredicates (so Kubelet checks them) when moving to Beta.
We need to write documentation. The documentation should emphasize that this is not really intended to be used by users directly, but by higher-level tools that would apply the appropriate taints and configure an admission controller to add the appropriate tolerations. We should also mention that you need to pair a toleration with a node affinity in some scenarios.
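To illustrate why a toleration needs to be paired with a node affinity in the dedicated-nodes scenario: the taint only keeps other pods off the dedicated node, while the affinity keeps the dedicated pods off every other node. A toy Go sketch, where taints are reduced to bare keys and node affinity to required labels (all names here are illustrative, not a real convention):

```go
package main

import "fmt"

// Toy model: taints are bare keys and node affinity is a set of required
// labels. All names here are illustrative, not a real convention.
type Node struct {
	Labels map[string]string
	Taints []string
}

type Pod struct {
	Tolerations        []string          // taint keys this pod tolerates
	RequiredNodeLabels map[string]string // stands in for node affinity
}

func schedulable(p Pod, n Node) bool {
	// Taint side: every node taint must be tolerated (keeps OTHER
	// pods off the dedicated node).
	for _, taint := range n.Taints {
		ok := false
		for _, tol := range p.Tolerations {
			if tol == taint {
				ok = true
				break
			}
		}
		if !ok {
			return false
		}
	}
	// Affinity side: required labels must be present (keeps the
	// dedicated pod off every OTHER node).
	for k, v := range p.RequiredNodeLabels {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

func main() {
	dedicated := Node{Labels: map[string]string{"dedicated": "db"}, Taints: []string{"dedicated"}}
	general := Node{Labels: map[string]string{}}
	p := Pod{Tolerations: []string{"dedicated"}, RequiredNodeLabels: map[string]string{"dedicated": "db"}}
	fmt.Println(schedulable(p, dedicated), schedulable(p, general)) // prints: true false
}
```

Note that if the pod had only the toleration and no RequiredNodeLabels, it would fit the general node too, which is exactly the gap the documentation should call out.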
Once we think the work is done, we need to verify that no instances remain in the codebase of the strings "TolerationsAnnotationKey", "api.TolerationsAnnotationKey", "scheduler.alpha.kubernetes.io/tolerations", "TaintsAnnotationKey", "api.TaintsAnnotationKey", "scheduler.alpha.kubernetes.io/taints"
@davidopp Yeah sure. Though this PR #38957 has already removed them. So I will update this PR after forgiveness related work is merged.