Requirements for taints, tolerations, dedicated nodes to graduate to Beta and then v1 #25320

Open
davidopp opened this Issue May 8, 2016 · 42 comments

Projects

None yet

10 participants

@davidopp
Member
davidopp commented May 8, 2016 edited

We'd like to graduate taints, tolerations, and dedicated nodes to Beta in 1.4. (Assuming we get them into 1.3 as alpha.)

Taints/tolerations to Beta

  • Add forgiveness to Toleration
  • make taints unique by <key, effect> on a node #29362
  • Change annotation key from alpha to beta
  • Maybe make Toleration key wildcard-able
  • Move to GeneralPredicates
  • Proper defaulting (when it's a field)
  • Implement NoScheduleNoAdmit and NoExecute
    (previously NoScheduleNoAdmitNoExecute see #28082)
  • Prevent users from modifying taints/tolerations (see e.g. #17549)
  • Ensure that a new node does not become "Ready" until it has been configured with its taints. One way to do this is to have an admission controller that adds the taint whenever a Node object is created.
  • Before allowing users to use it, be sure we won't need to roll back to binary version that doesn't support it

Taints/tolerations to v1

  • Move from annotations on Pod and Node to fields of PodSpec and NodeSpec

Dedicated nodes to Beta

  • More sophisticated policy for adding tolerations, and maybe easier way for adding taints (special-cased to dedicated nodes use case)

cc/ @kevin-wangzefeng @bgrant0607

ref/ #24134 #17190 #18263

@davidopp davidopp added this to the next-candidate milestone May 8, 2016
@resouer
Member
resouer commented May 27, 2016 edited

Working on node side to support kubelet enforced feature of taints/toleration. This requires a pod admitter in the future which is in sig-node TODO list.

@resouer
Member
resouer commented Jun 2, 2016

@davidopp Cloud you explain why add forgiveness to Toleration? If we don't want to tolerate some taints, just remove toleration is not enough?

@davidopp
Member
davidopp commented Jun 2, 2016

The canonical example for forgiveness is: let's say we represent node problems using taints (which we actually do plan to do). So for example Then a pod can use toleration to represent how long it wants to stay bound to a nod before it should be evicted. You want to add the taint immediately when the node starts having the problem, because some pods might want to be evicted right away. But the ones that don't want to be evicted right away, can use toleration to specify how long they want to stay bound. (Of course this assumes the TaintEffect is TaintEffectNoScheduleNoAdmitNoExecute.)

BTW @kevin-wangzefeng is going to work on adding forgiveness to Toleration.

@resouer
Member
resouer commented Jun 2, 2016

Thanks, its much clearer.

@hongchaodeng
Member

@davidopp

I would to help implement some tasks for v1.4. Can your mark tasks being taken by somebody already? Or can folks who is taking over mark themselves here?

BTW, what's "proper defaulting"?

@davidopp
Member
davidopp commented Jun 2, 2016

Yes, sorry, to clarify, we haven't done any planning for 1.4, and we intend to do it jointly with the community when it happens. But @kevin-wangzefeng implemented taints and tolerations (#24134) and had already told me he was planning to add forgiveness since it was part of the original design.

@resouer
Member
resouer commented Jun 2, 2016 edited

@hongchaodeng Except for forgiveness and what I am working on, e.g. NoScheduleNoAdmit NoScheduleNoAdmitNoExecute, it seems other tasks are all available. Would be excited to see if you can pick some up.

@lukaszo
Member
lukaszo commented Jul 14, 2016

Is anyone working on Move to GeneralPredicates ? If no, I can take it.

@davidopp
Member

@lukaszo I don't think anyone is working on it.

@lukaszo lukaszo added a commit to lukaszo/kubernetes that referenced this issue Jul 18, 2016
@lukaszo lukaszo Move PodToleratesNodeTaints to GeneralPredicates
Related issue: #25320
c52c817
@kevin-wangzefeng
Member

Some updates:

  • added an new item make taints unique by <key, effect> on a node #29362, to clarify what make a taint unique on a node.
  • updated item Implement NoScheduleNoAdmit and NoExecute (previously NoScheduleNoAdmitNoExecute, see #28082) NoScheduleNoAdmitNoExecute is now to use NoScheduleNoAdmit + NoExecute instead
@lukaszo lukaszo added a commit to lukaszo/kubernetes that referenced this issue Aug 9, 2016
@lukaszo lukaszo Move PodToleratesNodeTaints to GeneralPredicates
Related issue: #25320
3ef55ea
@lukaszo lukaszo added a commit to lukaszo/kubernetes that referenced this issue Aug 10, 2016
@lukaszo lukaszo Move PodToleratesNodeTaints to GeneralPredicates
Related issue: #25320
55d32b2
@eparis
Member
eparis commented Sep 7, 2016

@davidopp is this list still correct(ish) for getting to beta? Is there any better place to look?

@davidopp
Member
davidopp commented Sep 7, 2016

I think this list is more or less accurate. @kevin-wangzefeng is working on some of the items for 1.5. I don't think we will have time to make it Beta in 1.5 though. We could maybe split this up into taints/tolerations and dedicated nodes, and consider the latter a separate feature; that would allow us to remove the security-related items from the list (like ensuring Kubelet comes up with the right taints when it joins, and preventing unprivileged users from modifying taints/toleations). Also we definitely don't need NoScheduleNoAdmit for Beta, as it can be added in a backward-compatible way.

@davidopp
Member

Figure out what to do with NodeStatus.NodeSpec "unschedulable" field

@aveshagarwal
Member

@kevin-wangzefeng @davidopp Could you please share what specific items kevin is working on for 1.5, any links or anything for reference? Also I am happy to help on other items not being worked on by kevin, perhaps to help it move closer to beta.

@derekwaynecarr

@timothysc
Member

@aveshagarwal please coordinate via the @kubernetes/sig-scheduling to ensure folks aren't overlapping. xref: http://goo.gl/CsXJLu

@aveshagarwal
Member

I am trying to summarize the tasks here to capture the current state based on what I have found as its not clear what is being worked on, is already done or is yet to do. As @davidopp suggested to separate taints/tolerations and dedicated nodes features in one of his previous comments, I am updating the list for taints/tolerations and dedicated nodes features to graduate to beta here (please feel free to correct me, and also update as needed ):

Taints/Toleration to Beta:

To-do:

  • Change annotation key from alpha to beta
  • Maybe make Toleration key wildcard-able
  • Before allowing users to use it, be sure we won't need to roll back to binary version that doesn't support it

Being worked on/or done:

Taints/tolerations to v1:

  • Implement NoScheduleNoAdmit
  • Move from annotations on Pod and Node to fields of PodSpec and NodeSpec
  • Proper defaulting (when it's a field)

Dedicated nodes to Beta

  • Prevent users from modifying taints/tolerations (see e.g. #17549) (Ref #27330)
  • Ensure that a new node does not become "Ready" until it has been configured with its taints. One way to do this is to have an admission controller that adds the taint whenever a Node object is created.
  • More sophisticated policy for adding tolerations, and maybe easier way for adding taints (special-cased to dedicated nodes use case)

We can discuss more about it in today's sig-scheduling meeting.

@aveshagarwal
Member

@davidopp @kevin-wangzefeng As discussed in the yesterday's sig-scheduling meeting, would like to have feedback on my last comment or in general.

@davidopp davidopp referenced this issue in kubernetes/features Oct 1, 2016
Open

Taints/tolerations #108

0 of 23 tasks complete
@timothysc
Member

@aveshagarwal things to keep in mind re(alpha-beta) : #30819 (comment)

@davidopp
Member

Sorry for the delay. I think my list was way too long. Those are all features we want eventually but I don't think they're all needed for 1.5.

The remaining things I think we need for 1.5 are

  • add forgiveness to Toleration
  • Implement NoScheduleNoAdmit and NoExecute
    (previously NoScheduleNoAdmitNoExecute see #28082)

@kevin-wangzefeng is working on these and said he should have a PR out soon.

As for graduating it to Beta, we should do that in a followup PR. I hope we can get it done for 1.5. I think we need to implement "make toleration key wildcardable" before (or simultaneous with) graduating to Beta because I'm not sure we can add it later in a backward compatible way. (imagine someone creates an alpha toleration with key X and then we decide that in beta key X means wildcard; then when they upgrade from alpha to beta their toleration suddenly means something new)

There is discussion ongoing in #30819 about how to graduate things to Beta in general.

We can ignore dedicated notes and just consider it a separate feature to be implemented in the future.

@aveshagarwal
Member

I can take a stab at "make toleration key wildcardable" implementation if no one is working on it. Here is my understanding.

When a toleration key is wild card:
Case 1: Operator=Equal, Value=v, Effect=one of NoSchedule, PreferNoSchedule, NoExecute, NoScheduleNoAdmit

It means that a pod with the above toleration could be scheduled to any node that has a taint with value v (irregardless of key) according to the effects.

Case 2: Operator=Exits, Value=v, Effect=one of NoSchedule, PreferNoSchedule, NoExecute, NoScheduleNoAdmit

In this case, the value is wildcard (As Operator=Exits) too. As both key and value are wild cards, it seems equivalent to no toleration.

@aveshagarwal aveshagarwal referenced this issue Oct 14, 2016
Open

WIP: implement forgiveness phase 1 #34825

8 of 8 tasks complete
@davidopp
Member

I think @kevin-wangzefeng may have started working on it, so you should check with him.

What I proposed was that empty toleration key (empty string) means "match all keys"
I think it would only make sense to use it with Exists operator (so it means "match all values and all keys"), so we would check that in validation.

@aveshagarwal
Member

@davidopp thanks, its ok with me if @kevin-wangzefeng is working on it, so I will leave it to him.

Is someone working on "NoScheduleNoAdmit" implementation too, as I haven't seen it any PR? In case no, I would be happy to help with it.

@kevin-wangzefeng
Member

@aveshagarwal I think it's #34400 (previously #26501), but I didn't get time to take a look yet.

@aveshagarwal
Member

@kevin-wangzefeng Yes #34400 implements NoScheduleNoAdmit.

@aveshagarwal
Member

Is this the expectation with NoScheduleNoAdmit?

a) NoSchedule part of NoScheduleNoAdmit is exactly similar to existing NoSchedule and is for pods going via scheduler.

b) NoAdmit part of NoScheduleNoAdmit is for static pods (created directly by kubelet, i..e the pods not going through the scheduler).

@davidopp
Member

Unfortunately we're not going to have time to get #34400 into 1.5.

But as for what you described @aveshagarwal, it is correct except I'm not sure the term "static pods" is correct (or maybe I just don't understand the definition you're using). But as you said, it is for pods that are submitted directly to Kubelet, e.g. by creating a pod with NodeName already filled in.

@aveshagarwal
Member

@davidopp I just picked the word static pod from here: http://kubernetes.io/docs/admin/static-pods/

@davidopp
Member

Yeah I'm not sure whether NoAdmit would affect static pods. The intended use case was a regular pod that sets NodeName directly. Static pods are a bit different as they are created from a manifest file (and in theory by HTTP but nobody does that).

@aveshagarwal
Member

I have copied the list from #35518 here for taints/tolerations too :

  1. Remove the existing implementation in annotations.
  2. Implement existing stuff in api fields (xref: #34508).
  3. Port tests.

As we discussed in yesterday's sig-sched meeting, I will start looking into it once the existing PRs are in better/working shape probably for 1.6 time frame.

Once functionality has reached parity with the annotation implement, promote from alpha->beta.

@kubernetes/sig-scheduling thoughts?

@timothysc
Member

@aveshagarwal I'd like to sync sometime on your test plan(s) around this. Right now everything is lite at best.

but +1 to the plan.

@Quentin-M
Contributor

We have an use-case where, we'd like a pod managed by a DaemonSet to detect when the underlying node is being put in maintenance mode. Currently, this means kubectl cordon or kubectl drain are ran and set the NodeStatus.NodeSpec.Unschedulable field. But because the pod is part of a DaemonSet, it is not destroyed and there is no way to differentiate between the two operations using the API. However, it actually has quite some importance to us because in the case where the node is drained, it shows the operator's intent to shutdown/restart the node (or realize some other heavy duty task), in which case we want to gracefully migrate our workload/tasks/resources to another working pod/node. In the other hand, if the node is simply cordoned, the risk is limited (pods restarting is fine) and it is not worth migrating.

Unfortunately, as mentioned above, we currently can't tell and therefore, we are forced to migrate everything as soon as the node is set as unschedulable, even if it's a very short maintenance/outage.

It seems that the NoScheduleNoAdmit and NoExecute taints fit the use-case. However, I didn't clearly see any mention about how kubectl cordon and kubectl drain will be reworked to adopt these. Intuitively, I'd say that kubectl cordon would set NoScheduleNoAdmit while kubectl drain would set NoScheduleNoAdmit + NoExecute, without actively deleting any pods but instead letting the schedule do it. Tolerations and Forgiveness could then be used to make my DaemonSet's pods stop and execute the necessary migration steps using the preStop hook - While at the same time, DaemonSets such as kube-proxy would tolerate these taints and keep running indefinitely.

Does it make sense? Am I missing anything? I believe these are planned for 1.6?

@aveshagarwal
Member

@Quentin-M yes the current plan for NoScheduleNoAdmit and NoExecute implementation is 1.6. Here are related PRs: #34825 , #34400 .

@davidopp
Member
davidopp commented Nov 20, 2016 edited

Each pod in a DaemonSet is supposed to be tied to the node it is running on. There shouldn't be any need to migrate anything when the machine goes down. So it sounds like you are using DaemonSet in an unexpected way.

But anyway, what you and @aveshagarwal is indeed the plan; we will switch cordon/uncordon/drain to use taints, and then DaemonSet pods can decide for themselves whether they want to be deleted when these actions happen. We were hoping to get this into 1.5 but it didn't make it, but it should be in 1.6. #29178 covers the "represent unschedulable as a taint" part; i don't think we have an issue filed for making kubectl drain use taints for eviection but we can just leave your comment here about it as a reminder.

@davidopp
Member

Actually I don't think we can use taints for kubectl drain, because we want to respect PodDisruptionBudget when doing the evictions for a drain, and eviction due to NoExecute taint presumably won't. I could imagine some scheme where the NoExecute taint could specify whether the eviction should respect PDB (go through pod /eviction subresource) or not, but this seems complicated. So I don't think eviction for taint will respect PDB. If we find some important use cases where we do need it, then we could do something like what I described, but it's complication we should avoid unless it's truly necessary.

@davidopp
Member

We should consider whether to add taint/toleration check to GeneralPredicates (so Kubelet checks them) when moving to Beta.

@sdminonne
Contributor

/cc

@davidopp
Member

We need to write documentation. The documentation should emphasize that this is not really intended to be used by users directly, but by higher-level tools that would apply the appropriate taints and configure an admission controller to add the appropriate tolerations. We should also mention that you need to pair a toleration with a node affinity in some scenarios.

@davidopp
Member

Once we think the work is done, we need to verify that no instances remain in the codebase of the strings "TolerationsAnnotationKey", "api.TolerationsAnnotationKey", "scheduler.alpha.kubernetes.io/tolerations", "TaintsAnnotationKey", "api.TaintsAnnotationKey", "scheduler.alpha.kubernetes.io/taints"

@aveshagarwal
Member

@davidopp Yeah sure. Though this PR #38957 has already removed them. So I will update this PR after forgiveness related work is merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment