Document node allocatable enhancements #4532

Merged
merged 1 commit into from Jun 19, 2017
108 changes: 107 additions & 1 deletion admin_guide/allocating_node_resources.adoc
@@ -73,9 +73,18 @@ introduction of allocatable resources.
An allocated amount of a resource is computed based on the following formula:

----
[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved]
[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved] - [Hard-Eviction-Thresholds]
----

[NOTE]
====
The withholding of `Hard-Eviction-Thresholds` from allocatable is a change in behavior to improve
system reliability now that allocatable is enforced for end-user pods at the node level.
The `*experimental-allocatable-ignore-eviction*` setting is available to preserve legacy behavior,
but it will be deprecated in a future release.
====


If `[Allocatable]` is negative, it is set to *0*.
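
The reservations in this formula are set through kubelet arguments in the node
configuration. The following is a minimal sketch; the resource amounts are
placeholders rather than recommendations and must be sized for your environment:

[source,yaml]
----
kubeletArguments:
  kube-reserved:
    - "cpu=200m,memory=512Mi" # placeholder [kube-reserved] values
  system-reserved:
    - "cpu=200m,memory=512Mi" # placeholder [system-reserved] values
  eviction-hard:
    - "memory.available<100Mi" # supplies the [Hard-Eviction-Thresholds] term
----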

[[viewing-node-allocatable-resources-and-capacity]]
@@ -155,6 +164,103 @@ $ curl <certificate details> https://<master>/api/v1/nodes/cluster.node22/proxy/

See xref:../rest_api/index.adoc[REST API Overview] for more information about certificate details.

[[node-enforcement]]
== Node enforcement

The node can limit the total amount of resources that pods
consume based on the configured allocatable value. This feature significantly
improves node reliability by preventing pods from starving
system services (for example, the container runtime or node agent) of resources.
Administrators are strongly encouraged to reserve
resources based on the desired node utilization target
to improve node reliability.

The node enforces resource constraints using a new cgroup hierarchy
that enforces quality of service. All pods are launched in a
dedicated cgroup hierarchy separate from system daemons.

To configure this ability, the following kubelet arguments are provided.

.Node Cgroup Settings
====
[source,yaml]
----
kubeletArguments:
  cgroups-per-qos:
    - "true" <1>
  cgroup-driver:
    - "systemd" <2>
  enforce-node-allocatable:
    - "pods" <3>
----
<1> Enable or disable the new cgroup hierarchy managed by the node. Any change
of this setting requires a full drain of the node. This flag must be true to allow the node to
enforce node allocatable. We do not recommend users change this value.
<2> The cgroup driver used by the node when managing cgroup hierarchies. This
value must match the driver associated with the container runtime. Valid values
are `systemd` and `cgroupfs`. The default is `systemd`.
<3> A comma-delimited list of scopes for which the node enforces node
resource constraints. Valid values are `pods`, `system-reserved`, and `kube-reserved`.
The default is `pods`. We do not recommend users change this value.
====

Optionally, the node can enforce `kube-reserved` and `system-reserved` by
specifying those tokens in the `enforce-node-allocatable` flag. If specified, the
corresponding `--kube-reserved-cgroup` or `--system-reserved-cgroup` must also be provided.
In future releases, the node and container runtime will be packaged in a common cgroup
separate from `system.slice`. Until that time, we do not recommend users
change the default value of the `enforce-node-allocatable` flag.
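
If you do enforce those additional scopes, the configuration would resemble the
following sketch. The cgroup names are placeholders and assume that the system
daemons and node runtime already run in those cgroups; this is not a recommended default:

[source,yaml]
----
kubeletArguments:
  enforce-node-allocatable:
    - "pods,system-reserved,kube-reserved"
  system-reserved-cgroup:
    - "/system.slice" # placeholder: cgroup containing the system daemons
  kube-reserved-cgroup:
    - "/runtime.slice" # placeholder: cgroup containing the node and container runtime
----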

Administrators should treat system daemons similarly to Guaranteed pods. System daemons
can burst within their bounding control groups, and this behavior must be managed
as part of cluster deployments. Enforcing system-reserved limits
can lead to critical system services being CPU starved or OOM killed on the node. The
recommendation is to enforce system-reserved only if operators have profiled their nodes
exhaustively to determine precise estimates and are confident in their ability to
recover if any process in that group is OOM killed.

As a result, we strongly recommend that users only enforce node allocatable for
`pods` by default, and set aside appropriate reservations for system daemons to maintain
overall node reliability.

@qwang1 qwang1 Jun 13, 2017

Does this mean it's recommended like this?

kubeletArguments:
  cgroups-per-qos:
    - "true" 
  cgroup-driver:
    - "systemd"
  enforce-node-allocatable:
    - "pods" 
  kube-reserved:
    - "cpu=200m,memory=30G"
  system-reserved:
    - "cpu=200m,memory=30G"
  eviction-hard:
    - "memory.available<1Gi"

Member Author

@qwang1 - for the described scenario, this would be the example configuration.

We explicitly avoid defining a recommendation for the reservation at this time, as it's a factor of pod density.

kubeletArguments:
  cgroups-per-qos:
    - "true" 
  cgroup-driver:
    - "systemd"
  enforce-node-allocatable:
    - "pods" 
  kube-reserved:
    - "memory=2GI"
  system-reserved:
    - "cpu=1,memory=1Gi"
  eviction-hard:
    - "memory.available<100Mi"


[[eviction-thresholds]]
== Eviction Thresholds

If a node is under memory pressure, it can impact all pods running on
it. If a system daemon uses more than its reserved amount of memory, an OOM
event may occur that can impact the entire node and all pods running on it. To avoid
(or reduce the probability of) system OOMs, the node
provides xref:../admin_guide/out_of_resource_handling.adoc[Out Of Resource Handling].

By reserving some memory via the `--eviction-hard` flag, the node attempts to evict
pods whenever memory availability on the node drops below the configured absolute value or percentage.
If system daemons did not exist on a node, pods would be limited to the memory
`capacity - eviction-hard`. For this reason, resources set aside as a buffer for eviction
before reaching out-of-memory conditions are not available for pods.
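
Expressed in the same form as the allocatable formula above, the memory usable by
pods on a hypothetical node with no system daemons would be:

----
[Memory available for pods] = [Node Capacity] - [Hard-Eviction-Thresholds]
----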

Here is an example to illustrate the impact of node allocatable on memory
(a matching configuration sketch follows the list):

* Node capacity is `32Gi`
* `--kube-reserved` is `2Gi`
* `--system-reserved` is `1Gi`
* `--eviction-hard` is set to `memory.available<100Mi`
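
The following node configuration sketch expresses these example values; it is only
an illustration of the example above, not a recommended reservation:

[source,yaml]
----
kubeletArguments:
  kube-reserved:
    - "memory=2Gi"
  system-reserved:
    - "memory=1Gi"
  eviction-hard:
    - "memory.available<100Mi"
----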

For this node, the effective node allocatable value is `28.9Gi`. If the kube and system
components use up all their reservation, the memory available for pods is `28.9Gi`,
and the kubelet evicts pods when the overall usage of all pods exceeds this value.

If we enforce node allocatable (`28.9Gi`) via top-level cgroups, then pods can never exceed `28.9Gi`.
Evictions are not performed unless system daemons consume more than `3.1Gi` of memory.

With the above example, if system daemons do not use up all their reservation,
pods would face memcg OOM kills from their bounding cgroup before node evictions kick in.
To better enforce QoS in this situation, the node sets the memory limit on the
top-level cgroup for all pods to `Node Allocatable + Eviction Hard Thresholds`.

In that case, the node evicts pods whenever they consume more than `28.9Gi` of memory.
If eviction does not occur in time, pods are OOM killed once they consume `29Gi` of memory.
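
Restating the limit described above in the same notation as the allocatable formula,
the memory limit applied to the top-level cgroup for all pods in this example is:

----
[Pods cgroup memory limit] = [Allocatable] + [Hard-Eviction-Thresholds]
                           = 28.9Gi + 100Mi = 29Gi
----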

[[allocating-node-scheduler]]
== Scheduler
