node label for policy and condition for CPUPressure #970

Closed
24 changes: 18 additions & 6 deletions contributors/design-proposals/cpu-manager.md
@@ -145,6 +145,10 @@ configuration. The three policies are **none**, **static** and **dynamic**.

The active CPU manager policy is set through a new Kubelet
configuration value `--cpu-manager-policy`. The default value is `none`.
+This policy will be reported in the Node resource as a label with key
+`alpha.kubernetes.io/cpu-policy`. This allows users with a policy
+preference to use a node selector to ensure the pod lands on a node with a
+particular cpu manager policy enabled.
Member:
there are efforts in progress (#911) to remove or restrict the ability of nodes to label themselves, since that allows them to steer workloads to themselves.

@sjenning (Contributor Author), Aug 23, 2017:
Wow ok. Didn't know about that.

I can either 1) remove this section completely or 2) s/will/can/ if a trusted node-label mechanism configures it independently rather than the kubelet knowing its own config and reporting it through a label.

What will happen to all the node labels in kubernetes.io that the node self-labels now?

Member:
we'll have to whitelist them (which we may do for things like os/arch/instance-type) or silently drop them until we can remove them from the kubelet and get past the version skew window for kubelets that try to set them

Member:
using labels for this could still work, I would just expect them to be applied to nodes, rather than having nodes self-report them

Member:
I understand the spirit of the design, but it does mean that some labels the kubelet sets today based on its own introspection of the machine need to move to fields. Are we ok with the node controller then converting those fields to labels? Can cpu manager policy be included in that initial whitelist?

Member:
@vishh - have you been tracking removing the ability for a node to self label? did you have other ideas for converting them to fields and then node controller assigning them as labels/taints?

Contributor:
We rely on Node labels heavily to address scheduling constraints. Fields in node object cannot be used for scheduling purposes as of now.
Having users create labels that map kubelet configuration is not a great user experience. I can see the security aspect of having nodes introduce arbitrary labels.
Perhaps a whitelist of label keys is a way to get around the security issue? Ideally such a whitelist would be configurable (ConfigMaps maybe?), which would give cluster admins the ability to choose what labels they'd like to expose and also extend Kubernetes easily.

Contributor:
@derekwaynecarr #911 is news to me. I wonder why that proposal wasn't discussed in SIG-node yet.


The number of CPUs that pods may run on can be implicitly controlled using the
existing node-allocatable configuration settings. See the [node allocatable
@@ -266,14 +270,22 @@ func (p *staticPolicy) UnregisterContainer(s State, containerID string) error {
cpuset for all containers running in the shared pool.

1. _The shared pool becomes empty._
-1. The CPU manager adds a node condition with effect NoSchedule,
-   NoExecute that prevents BestEffort and Burstable QoS class pods from
-   running on the node. BestEffort and Burstable QoS class pods are
-   evicted from the node.
+1. The CPU manager sets a CPUPressure node condition to true that prevents
+   BestEffort and Burstable QoS class pods with no CPU resource request
+   from being scheduled on the node. Already running BestEffort and Burstable QoS
+   class pods without a CPU resource request are evicted from the node.

NOTE: The decision to allow the shared pool to completely empty is
driven by a desire to keep the scheduler accounting simple. If a
--cpu-policy-static-min-shared-cpus flag were to exist, there is no simple way
to convey that information to the scheduler. The scheduler would then need to
know that some portion of the CPU Allocatable is "special", for use only by
non-exclusive containers.

Contributor:

nit: add backticks to flag name to avoid soft-wrapping.

Contributor:

Possibly obvious, but consider also mentioning what would happen with a shared pool low-water-mark but without scheduler enhancements -- Kubelet would need to reject some G pods after scheduling, with high potential to confuse users.


1. _The shared pool becomes nonempty._
-1. The CPU manager removes the node condition with effect NoSchedule,
-   NoExecute for BestEffort and Burstable QoS class pods.
+1. The CPU manager sets a CPUPressure node condition to false that allows
+   BestEffort and Burstable QoS class pods with no CPU resource request to
+   be scheduled on the node.

Contributor:

As I mentioned in the PR, this node condition isn't necessary if we reserve a core by default which I feel will be user's expectation.

#### Policy 3: "dynamic" cpuset control
