
[RuntimeClass scheduling] native scheduler support, ready to implement #909

Merged
merged 2 commits into kubernetes:master from tallclair:runtimeclass-scheduling on Apr 4, 2019

Conversation

@tallclair (Member) commented Mar 20, 2019

Updated the RuntimeClass scheduling KEP to use native scheduler support rather than a mixin approach.

This approach uses a cleaner API (it doesn't need to mix a bunch of rules into the pod's node affinity), lets us use the more expressive NodeSelector API, and provides a nicer user experience in the error case.

I am opening this PR to continue the discussion @bsalamat started here

/sig node scheduling
/assign @bsalamat @derekwaynecarr @dchen1107
/cc @yastij
/milestone v1.15
/priority important-longterm

scheduling is to merge the topology constraints from the RuntimeClass into the
PodSpec. Eventually, the controller's responsibilities may grow, such as to
merge in [pod overhead][] or validate feature compatibility.
A new scheduler predicate will manage the RuntimeClass topology. It will lookup

@yastij (Member), Mar 20, 2019

Maybe we should discuss predicate composition vs. adding runtimeClass support to the existing predicates.

@tallclair (Author, Member), Mar 21, 2019

If we add it to existing predicates, we don't get a useful error message when the pod can't be scheduled, right?
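To make the error-message argument concrete, here is a self-contained sketch of a dedicated RuntimeClass predicate. The types, the predicate signature, and the "gvisor" class name are simplified illustrations, not the scheduler's real interfaces:

```go
package main

import "fmt"

// Simplified stand-ins for the real API types.
type NodeSelectorRequirement struct {
	Key      string
	Operator string // "Exists", "DoesNotExist", "In", ...
	Values   []string
}

type RuntimeClass struct {
	Name         string
	NodeSelector []NodeSelectorRequirement
}

type Node struct {
	Name   string
	Labels map[string]string
}

// RuntimeClassPredicate returns whether the node fits, plus a
// RuntimeClass-specific reason the scheduler could surface on failure.
func RuntimeClassPredicate(rc *RuntimeClass, node *Node) (bool, string) {
	if rc == nil {
		return true, "" // pod does not use a RuntimeClass
	}
	for _, req := range rc.NodeSelector {
		val, ok := node.Labels[req.Key]
		matched := false
		switch req.Operator {
		case "Exists":
			matched = ok
		case "DoesNotExist":
			matched = !ok
		case "In":
			for _, v := range req.Values {
				if ok && v == val {
					matched = true
				}
			}
		}
		if !matched {
			return false, fmt.Sprintf("node %s doesn't match RuntimeClass %s (requirement %s %s %v)",
				node.Name, rc.Name, req.Key, req.Operator, req.Values)
		}
	}
	return true, ""
}

func main() {
	rc := &RuntimeClass{Name: "gvisor", NodeSelector: []NodeSelectorRequirement{
		{Key: "sandboxed.kubernetes.io", Operator: "Exists"},
	}}
	node := &Node{Name: "node-1", Labels: map[string]string{}}
	fmt.Println(RuntimeClassPredicate(rc, node)) // false, node-1 doesn't match RuntimeClass gvisor ...
}
```

Because the check lives in its own predicate, the failure reason can name the RuntimeClass, which a generic node-affinity mismatch would not.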

@yastij (Member), Mar 28, 2019

@tallclair - if we merge the tolerations into the podSpec, how can we correctly inform the user (UX) when no fitting node is found?

@tallclair (Author, Member), Mar 28, 2019

I don't think that changes anything, as the tolerations would be resolved by the same PodToleratesNodeTaints predicate? But I might be misunderstanding how the debug messages are created.

@yastij (Member), Mar 28, 2019

Do you mean that we'd still check the runtimeClass when computing PodToleratesNodeTaints?

@yastij (Member), Apr 1, 2019

I think we can move forward and discuss this while implementing.

#### Mix-in

The RuntimeClass topology is merged with the pod's NodeSelector & Tolerations.
The `PodToleratesNodeTaints` predicate will also be extended to be

@yastij (Member), Mar 20, 2019

Can't we just call PodToleratesNodeTaints from the RuntimeClassPredicate with the right parameters? Composing seems easier than modifying each predicate to include runtimeClass awareness.
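To sketch what that composition might look like (toy stand-in types only; the real PodToleratesNodeTaints predicate and toleration matching rules are richer than this):

```go
package predicates

// Toy types: the real scheduler uses corev1.Toleration/corev1.Taint and a
// much richer matching rule; this only illustrates the shape of the idea.
type Toleration struct{ Key, Effect string }
type Taint struct{ Key, Effect string }

// podToleratesNodeTaints: every taint on the node must be tolerated.
func podToleratesNodeTaints(tolerations []Toleration, taints []Taint) bool {
	for _, taint := range taints {
		tolerated := false
		for _, tol := range tolerations {
			if tol.Key == taint.Key && tol.Effect == taint.Effect {
				tolerated = true
				break
			}
		}
		if !tolerated {
			return false
		}
	}
	return true
}

// runtimeClassToleratesNodeTaints composes: it merges the RuntimeClass
// tolerations with the pod's own, then delegates to the existing taint
// check "with the right parameters".
func runtimeClassToleratesNodeTaints(podTols, rcTols []Toleration, taints []Taint) bool {
	merged := append(append([]Toleration{}, podTols...), rcTols...)
	return podToleratesNodeTaints(merged, taints)
}
```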

@tallclair (Author, Member), Mar 21, 2019

I assume predicates are AND'd together, in which case the PodToleratesNodeTaints predicate would also be called without the runtimeclass tolerations, and in that instance fail? I'm not very familiar with the scheduler code though, so I could be totally wrong.

@yastij (Member), Mar 26, 2019

This raises a broader issue at the predicate level, when a predicate is basically a composition of others:

  • do we add domain awareness to the existing predicates?
  • do we compose these predicates into a domain predicate?

This might overcomplicate our predicate definitions and create performance overhead in some cases.

@tallclair (Member, Author) commented Mar 26, 2019

Summarizing feedback from 2019-03-26 sig-node:

  • Add a section about node label restrictions, and how those should play into the RuntimeClass topology (security considerations)
  • Additional tradeoff on native scheduling vs. mixin: with native scheduling, it's harder to apply scheduling policy. This seems like more of a concern with tolerations than with the node selector, so I think we should actually take a hybrid approach, where tolerations are mixed in but the node selector is applied at scheduling time (see the sketch after this comment).

/assign @egernst
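A minimal sketch of the hybrid approach mentioned above, assuming an admission-time mutation; the function name is illustrative, and a real plugin would also resolve the RuntimeClass and handle conflicting tolerations:

```go
package admission

import corev1 "k8s.io/api/core/v1"

// mixinRuntimeClassTolerations appends the RuntimeClass's tolerations to the
// pod spec at admission time. The node selector is deliberately NOT merged
// here; it is left to the scheduler predicate so scheduling failures carry a
// RuntimeClass-specific message.
func mixinRuntimeClassTolerations(pod *corev1.Pod, rcTolerations []corev1.Toleration) {
	// Naive merge: a real implementation would de-duplicate and resolve
	// conflicting tolerations.
	pod.Spec.Tolerations = append(pod.Spec.Tolerations, rcTolerations...)
}
```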

@yastij (Member) commented Mar 26, 2019

@tallclair - I think this has two main benefits:

  • This tradeoff aligns with the scheduling policy, especially since enforcement of affinities will be done at scheduling time.

  • This avoids the issue of predicate composition (if we still want to go down that path, I think we need a KEP for predicate composition).

My only concern is on the UX side, as @bsalamat stated.

// +optional
NodeSelectorTerm []corev1.NodeSelectorRequirement
NodeSelector corev1.NodeSelector
// tolerations adds tolerations to pods running with this RuntimeClass.
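For context, a sketch of how the fields excerpted above might hang together; the struct name Topology and the exact field set are assumptions based on this revision of the discussion, not the final API:

```go
package runtimeclass

import corev1 "k8s.io/api/core/v1"

// Topology collects the scheduling constraints discussed in this thread.
type Topology struct {
	// NodeSelector selects the nodes that support this RuntimeClass.
	// Pods using the class can only be scheduled to matching nodes.
	// +optional
	NodeSelector corev1.NodeSelector

	// Tolerations adds tolerations to pods running with this RuntimeClass.
	// +optional
	Tolerations []corev1.Toleration
}
```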

@bsalamat (Member), Mar 27, 2019

I am a bit unsure about having tolerations as a part of "topology" in RuntimeClass. This is useful if one doesn't want to provide RuntimeClass on all their pods. If they provide RuntimeClass for all their pods they can just use NodeSelector. If they insist on tainting some nodes, they can use regular Toleration (which is not a part of RuntimeClass). So, I guess my question is "what is the major benefit of having tolerations" here?

@tallclair (Author, Member), Mar 27, 2019

I think this is covered under the Tolerations discussion on line 170:

Tolerations

While NodeSelectors and labels are used for steering pods towards nodes,
[taints and tolerations][taint-and-toleration] are used for steering pods away
from nodes. If every pod had a RuntimeClass and every RuntimeClass had a strict
NodeSelector, then RuntimeClasses could use non-overlapping selectors in place
of taints & tolerations. However, the same could be said of regular pod
selectors, yet taints & tolerations are still a useful addition. Examples of
use cases for including tolerations in the RuntimeClass topology
include:

  • Tainting Windows nodes with windows.microsoft.com to keep default linux pods
    away from the nodes. Windows RuntimeClasses would then need a corresponding
    toleration.
  • Tainting "sandbox" nodes with sandboxed.kubernetes.io to keep services
    providing privileged node features away from sandboxed workloads. Sandboxed
    RuntimeClasses would need a toleration to enable them to be run on those
    nodes.

I agree that it's not strictly necessary, but I think it will be useful in practice. I guess both of the use cases listed could be accomplished with a default RuntimeClass that looked like this:

kind: RuntimeClass
metadata:
  name: default
handler: runc
nodeSelector:
  nodeSelectorTerms:
    - matchExpressions:
        - key: "windows.microsoft.com"
          operator: DoesNotExist
        - key: "sandboxed.kubernetes.io"
          operator: DoesNotExist

@yastij (Member), Mar 27, 2019

Does this assume that the toleration key is also set as a label? I think that isn't always the case.

@tallclair (Author, Member), Mar 27, 2019

This implies using labels rather than taints.

@yastij (Member), Mar 27, 2019

It is more straightforward for users to implement dedicated nodes using taints and tolerations rather than nodeSelectors.

@bsalamat - The benefit of having tolerations on the runtimeClass is that it frees users who want to consume a specific runtimeClass on a dedicated node from having to figure out which tolerations to set.

@bsalamat (Member), Mar 27, 2019

Your points are valid. Thanks for clarifying.

@tallclair (Member, Author) commented Mar 27, 2019

Integrated feedback from sig-node. In particular, the RuntimeController was added back, and tolerations are back to being a mixin.

@@ -203,6 +202,36 @@ will be a clear explanation of why. For example:

```
0/10 Nodes are available: 5 nodes do not have enough memory, 5 nodes don't match RuntimeClass
```

### RuntimeController

@bsalamat (Member), Mar 27, 2019

nit: I would call this Runtime Admission Controller to clarify that this is not a controller with a control loop that runs all the time and monitors things in the cluster.

@yastij (Member), Mar 27, 2019

I agree

@tallclair (Author, Member), Mar 29, 2019

Looks like the convention for admission plugins is to omit the "controller" piece altogether, so this would just be "Runtime". I'll clarify the name in the KEP though.

@yastij (Member), Apr 1, 2019

or rename to RuntimeTopology ?

@tallclair (Author, Member), Apr 1, 2019

The plan is to reuse the controller for other RuntimeClass-related admission controls. In particular PodOverhead. So I don't think it makes sense to include topology in the name.

@derekwaynecarr (Member), Apr 1, 2019

agree with a single admission plugin for a common feature.

@egernst (Contributor), Apr 3, 2019

@tallclair - have a suggested controller name in mind? ContainerRuntime? Runtime is pretty overloaded...

@tallclair (Author, Member), Apr 4, 2019

I went with RuntimeClass for the controller name. Pretty self-explanatory.

@yastij (Member) commented Apr 1, 2019

/lgtm
/hold

Just a rename or clarification of the admission controller's name is missing.

@bsalamat - any thoughts on merging the tolerations into the podSpec? This would help with enforcing schedulingPolicy.

@bsalamat (Member) commented Apr 1, 2019

> @bsalamat - any thoughts on merging the tolerations into the podSpec? This would help with enforcing schedulingPolicy.

Are you saying that instead of having tolerations in the RuntimeClass, we use the PodSpec's own tolerations? If so, we discussed it in this KEP and decided that having tolerations in the RuntimeClass is useful.

@tallclair (Member, Author) commented Apr 1, 2019

@bsalamat I think @yastij was asking about your thoughts on merging the tolerations FROM the RuntimeClass into the PodSpec during admission (what this KEP currently proposes), as opposed to just merging the tolerations at scheduling time (what it originally proposed).

@bsalamat (Member) commented Apr 1, 2019

@tallclair I'd prefer them to be separate so that we can provide a better "unschedulable" message to users.

@yastij (Member) commented Apr 1, 2019

@bsalamat - I agree

> @tallclair - if we merge the tolerations into the podSpec, how can we correctly inform the user (UX) when no fitting node is found?

I think we can keep them separate for now. If we see that we need to add tolerations to the podSpec we can do it afterwards.

@derekwaynecarr (Member) commented Apr 1, 2019

@tallclair reached out and asked that I review. I am fine with what is presented.

@tallclair (Member, Author) commented Apr 1, 2019

@yastij WDYT? Does that alleviate your concerns?

@yastij (Member) commented Apr 1, 2019

@tallclair - sgtm

@yastij (Member) commented Apr 1, 2019

/hold cancel

@egernst (Contributor) approved these changes Apr 1, 2019 and left a comment:

lgtm

@bsalamat (Member) commented Apr 2, 2019

@tallclair
@egernst has a good point. What will happen if a RuntimeClass is deleted after pod creation? I would like to clarify the behavior of the scheduler for pods that are created but not scheduled yet. I am not concerned about already running pods.
When the scheduler sees a pod with a non-existing RuntimeClass, what should it do? I think it is reasonable to assume that no RuntimeClass is specified and always pass the predicate. Another obvious option is to always fail the RuntimeClass predicate and never schedule the pod. Anyway, I think this should be specified in the KEP.

@tallclair (Member, Author) commented Apr 2, 2019

Hmm, good point. I wasn't thinking about the case that the RuntimeClass was missing before the pod is scheduled...

> Another obvious option is to always fail the RuntimeClass predicate and never schedule the pod.

I think this makes the most sense, otherwise we're just delaying the failure until it lands on a node. If the predicate fails, the scheduler will retry, right?
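A minimal sketch of that behavior, assuming a simplified RuntimeClass lister and predicate signature rather than the real scheduler interfaces:

```go
package predicates

import "fmt"

// RuntimeClass is a minimal stand-in for the real API object.
type RuntimeClass struct{ Name string }

// Lister is a toy stand-in for a RuntimeClass lister/informer cache.
type Lister map[string]*RuntimeClass

// CheckRuntimeClass returns (schedulable, reason). If the referenced
// RuntimeClass does not exist, the predicate fails with a clear reason so the
// pod stays pending (and the scheduler retries with backoff), rather than
// deferring the failure until the pod lands on a node.
func CheckRuntimeClass(classes Lister, runtimeClassName string) (bool, string) {
	if runtimeClassName == "" {
		return true, "" // pod does not request a RuntimeClass
	}
	if _, ok := classes[runtimeClassName]; !ok {
		return false, fmt.Sprintf("RuntimeClass %q not found", runtimeClassName)
	}
	// On success, the node-selector and toleration checks would follow here.
	return true, ""
}
```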

@yastij (Member) commented Apr 2, 2019

> I think this makes the most sense, otherwise we're just delaying the failure until it lands on a node. If the predicate fails, the scheduler will retry, right?

I agree; this would just delay the failure. The predicate should fail and provide a meaningful log message/reason.

@bsalamat (Member) commented Apr 2, 2019

> I think this makes the most sense, otherwise we're just delaying the failure until it lands on a node. If the predicate fails, the scheduler will retry, right?

That's right. The scheduler will keep retrying (with some backoff time).

@k8s-ci-robot removed the lgtm label Apr 4, 2019

@bsalamat (Member) left a comment:

/lgtm
/approve

Thanks, @tallclair!

@k8s-ci-robot added and removed the lgtm label Apr 4, 2019

@tallclair changed the title from "RFC: [RuntimeClass scheduling] native scheduler support" to "[RuntimeClass scheduling] native scheduler support, ready to implement" Apr 4, 2019

@tallclair (Member, Author) commented Apr 4, 2019

  • Added failure modes for the scheduler & admission controller
  • Renamed RuntimeController to RuntimeClass (admission controller)
  • Marked the KEP as implementable - I am happy with its current state, and think we should commence coding.

@egernst approved these changes Apr 4, 2019

@tallclair force-pushed the tallclair:runtimeclass-scheduling branch from 7d9e6d2 to e99bc99 Apr 4, 2019

@tallclair (Member, Author) commented Apr 4, 2019

Squashed commits.

@derekwaynecarr (Member) commented Apr 4, 2019

/approve

we can clarify the admission requirement in a follow-on or during documentation.

@tallclair force-pushed the tallclair:runtimeclass-scheduling branch from e99bc99 to 152d8a8 Apr 4, 2019

@yastij (Member) commented Apr 4, 2019

/lgtm

@k8s-ci-robot added the lgtm label Apr 4, 2019

@k8s-ci-robot (Contributor) commented Apr 4, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bsalamat, derekwaynecarr, tallclair, yastij

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot merged commit 246c106 into kubernetes:master Apr 4, 2019

2 of 3 checks passed:

  • tide: Not mergeable. Needs lgtm label.
  • cla/linuxfoundation: tallclair authorized
  • pull-enhancements-verify: Job succeeded.