Document node allocatable enhancements #4532

Merged
merged 1 commit into from Jun 19, 2017
108 changes: 107 additions & 1 deletion admin_guide/allocating_node_resources.adoc
@@ -73,9 +73,18 @@ introduction of allocatable resources.
An allocated amount of a resource is computed based on the following formula:

----
[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved]
[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved] - [Hard-Eviction-Thresholds]
----

[NOTE]
====
The withholding of `Hard-Eviction-Thresholds` from allocatable is a change in behavior to improve
system reliability now that allocatable is enforced for end-user pods at the node level.
The `*experimental-allocatable-ignore-eviction*` setting is available to preserve legacy behavior,
but it will be deprecated in a future release.
====


If `[Allocatable]` is negative, it is set to *0*.
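
The reservations in this formula are set through kubelet arguments in the node
configuration. The following is a minimal sketch; the resource amounts are
placeholders rather than recommendations and must be sized for your environment:

[source,yaml]
----
kubeletArguments:
  kube-reserved:
    - "cpu=200m,memory=512Mi" # placeholder [kube-reserved] values
  system-reserved:
    - "cpu=200m,memory=512Mi" # placeholder [system-reserved] values
  eviction-hard:
    - "memory.available<100Mi" # supplies the [Hard-Eviction-Thresholds] term
----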

[[viewing-node-allocatable-resources-and-capacity]]
@@ -155,6 +164,103 @@ $ curl <certificate details> https://<master>/api/v1/nodes/cluster.node22/proxy/

See xref:../rest_api/index.adoc[REST API Overview] for more information about certificate details.

[[node-enforcement]]
== Node enforcement

The node can limit the total amount of resources that pods
consume based on the configured allocatable value. This feature significantly
improves node reliability by preventing pods from starving
system services (for example, the container runtime or node agent) of resources.
Administrators are strongly encouraged to reserve
resources based on the desired node utilization target
to improve node reliability.

The node enforces resource constraints using a new cgroup hierarchy
that enforces quality of service. All pods are launched in a
dedicated cgroup hierarchy separate from system daemons.

To configure this ability, the following kubelet arguments are provided.

.Node Cgroup Settings
====
[source,yaml]
----
kubeletArguments:
  cgroups-per-qos:
    - "true" <1>
  cgroup-driver:
    - "systemd" <2>
  enforce-node-allocatable:
    - "pods" <3>
----
<1> Enable or disable the new cgroup hierarchy managed by the node. Any change
of this setting requires a full drain of the node. This flag must be true to allow the node to
enforce node allocatable. We do not recommend users change this value.
<2> The cgroup driver used by the node when managing cgroup hierarchies. This
value must match the driver associated with the container runtime. Valid values
are `systemd` and `cgroupfs`. The default is `systemd`.
<3> A comma-delimited list of scopes for which the node enforces node
resource constraints. Valid values are `pods`, `system-reserved`, and `kube-reserved`.
The default is `pods`. We do not recommend users change this value.
====

Optionally, the node can enforce `kube-reserved` and `system-reserved` by
specifying those tokens in the `enforce-node-allocatable` flag. If specified, the
corresponding `--kube-reserved-cgroup` or `--system-reserved-cgroup` must also be provided.
In future releases, the node and container runtime will be packaged in a common cgroup
separate from `system.slice`. Until that time, we do not recommend users
change the default value of the `enforce-node-allocatable` flag.
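
If you do enforce those additional scopes, the configuration would resemble the
following sketch. The cgroup names are placeholders and assume that the system
daemons and node runtime already run in those cgroups; this is not a recommended default:

[source,yaml]
----
kubeletArguments:
  enforce-node-allocatable:
    - "pods,system-reserved,kube-reserved"
  system-reserved-cgroup:
    - "/system.slice" # placeholder: cgroup containing the system daemons
  kube-reserved-cgroup:
    - "/runtime.slice" # placeholder: cgroup containing the node and container runtime
----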

Administrators should treat system daemons similarly to Guaranteed pods. System daemons
can burst within their bounding control groups, and this behavior must be managed
as part of cluster deployments. Enforcing system-reserved limits
can lead to critical system services being CPU starved or OOM killed on the node. The
recommendation is to enforce system-reserved only if operators have profiled their nodes
exhaustively to determine precise estimates and are confident in their ability to
recover if any process in that group is OOM killed.

As a result, we strongly recommend that users only enforce node allocatable for
`pods` by default, and set aside appropriate reservations for system daemons to maintain
overall node reliability.

@qwang1 qwang1 Jun 13, 2017

Does this mean it's recommended like this?

kubeletArguments:
  cgroups-per-qos:
    - "true" 
  cgroup-driver:
    - "systemd"
  enforce-node-allocatable:
    - "pods" 
  kube-reserved:
    - "cpu=200m,memory=30G"
  system-reserved:
    - "cpu=200m,memory=30G"
  eviction-hard:
    - "memory.available<1Gi"

Member Author

@qwang1 - for the described scenario, this would be the example configuration.

We explicitly avoid defining a recommendation for the reservation at this time, as it's a factor of pod density.

kubeletArguments:
  cgroups-per-qos:
    - "true" 
  cgroup-driver:
    - "systemd"
  enforce-node-allocatable:
    - "pods" 
  kube-reserved:
    - "memory=2GI"
  system-reserved:
    - "cpu=1,memory=1Gi"
  eviction-hard:
    - "memory.available<100Mi"


[[eviction-thresholds]]
== Eviction Thresholds

If a node is under memory pressure, it can impact all pods running on
it. If a system daemon uses more than its reserved amount of memory, an OOM
event may occur that can impact the entire node and all pods running on it. To avoid
(or reduce the probability of) system OOMs, the node
provides xref:../admin_guide/out_of_resource_handling.adoc[Out Of Resource Handling].

By reserving some memory via the `--eviction-hard` flag, the node attempts to evict
pods whenever memory availability on the node drops below the configured absolute value or percentage.
If system daemons did not exist on a node, pods would be limited to the memory
`capacity - eviction-hard`. For this reason, resources set aside as a buffer for eviction
before reaching out-of-memory conditions are not available for pods.
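
Expressed in the same form as the allocatable formula above, the memory usable by
pods on a hypothetical node with no system daemons would be:

----
[Memory available for pods] = [Node Capacity] - [Hard-Eviction-Thresholds]
----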

Here is an example to illustrate the impact of node allocatable on memory
(a matching configuration sketch follows the list):

* Node capacity is `32Gi`
* `--kube-reserved` is `2Gi`
* `--system-reserved` is `1Gi`
* `--eviction-hard` is set to `memory.available<100Mi`
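
The following node configuration sketch expresses these example values; it is only
an illustration of the example above, not a recommended reservation:

[source,yaml]
----
kubeletArguments:
  kube-reserved:
    - "memory=2Gi"
  system-reserved:
    - "memory=1Gi"
  eviction-hard:
    - "memory.available<100Mi"
----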

For this node, the effective node allocatable value is `28.9Gi`. If the kube and system
components use up all their reservation, the memory available for pods is `28.9Gi`,
and the kubelet evicts pods when the overall usage of all pods exceeds this value.

If we enforce node allocatable (`28.9Gi`) via top-level cgroups, then pods can never exceed `28.9Gi`.
Evictions are not performed unless system daemons consume more than `3.1Gi` of memory.

With the above example, if system daemons do not use up all their reservation,
pods would face memcg OOM kills from their bounding cgroup before node evictions kick in.
To better enforce QoS in this situation, the node sets the memory limit on the
top-level cgroup for all pods to `Node Allocatable + Eviction Hard Thresholds`.

In that case, the node evicts pods whenever they consume more than `28.9Gi` of memory.
If eviction does not occur in time, pods are OOM killed once they consume `29Gi` of memory.
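
Restating the limit described above in the same notation as the allocatable formula,
the memory limit applied to the top-level cgroup for all pods in this example is:

----
[Pods cgroup memory limit] = [Allocatable] + [Hard-Eviction-Thresholds]
                           = 28.9Gi + 100Mi = 29Gi
----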

[[allocating-node-scheduler]]
== Scheduler
