Table of Contents
- Pid Limiting
- Graduation Criteria
- Implementation History
This proposal enables isolation of pid resources. It introduces mechanisms for pod-to-pod PID isolation and for node-to-pod PID isolation.
Pids are a fundamental resource on Linux hosts. It is trivial to hit the task limit without hitting any other resource limits and cause instability to a host machine.
Administrators require mechanisms to ensure that user pods cannot induce pid exhaustion that prevents host daemons (runtime, kubelet, etc) from running. In addition, pids must be limited among pods so that each pod has limited impact on other workloads on the node.
This proposal aims to do the following:
- enable administrator control to provide pod-to-pod pid isolation
- enable administrator control to provide node-to-pod pid isolation
This proposal defers the following:
- ability for a user to request an additional number of pid resources per pod
It is anticipated we will support that via a policy knob that could be
restricted and/or defaulted via PodSecurityPolicy or LimitRange. We anticipate
tracking this work under a separate feature gate. The defaulting applied to
pods today would then be used only if the pod had no local pod pid limiting
policy.
User Stories [optional]
- Administrator can default the number of pids per pod to provide pod-to-pod isolation.
- Administrator can reserve a number of allocatable pids to user pods via node allocatable.
Implementation Details/Notes/Constraints [optional]
Pod to Pod Isolation
To enable pid isolation among pods, the SupportPodPidsLimit feature gate is
proposed. If enabled, the kubelet writes the pid limit configured through its
pod-max-pids argument to the pod level cgroup on Linux hosts. If the value is
-1, the kubelet will default to the node allocatable pid capacity.
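As a rough illustration of what writing the limit to the pod level cgroup involves, the sketch below converts a pod-max-pids setting into the text stored in the `pids.max` file of the Linux pids controller, which accepts either a number or the literal `max`. The function names, the mapping of -1 to `max`, and the rejection of other non-positive values are assumptions made for this example, not the kubelet's actual code.

```python
import os


def pids_max_value(pod_max_pids):
    """Translate a pod-max-pids setting into the text stored in pids.max."""
    if pod_max_pids == -1:
        # Assumption: -1 means no per-pod cap, so the pod is bounded only
        # by the limits of its ancestor cgroups.
        return "max"
    if pod_max_pids <= 0:
        # Assumption: any other non-positive value is a configuration error.
        raise ValueError("pod-max-pids must be -1 or a positive integer")
    return str(pod_max_pids)


def write_pod_pids_limit(pod_cgroup_dir, pod_max_pids):
    """Write the limit to <pod_cgroup_dir>/pids.max and return the path.

    pod_cgroup_dir is a hypothetical path to the pod level cgroup directory.
    """
    path = os.path.join(pod_cgroup_dir, "pids.max")
    with open(path, "w") as f:
        f.write(pids_max_value(pod_max_pids))
    return path
```

On a real host the directory would sit under the mounted pids cgroup hierarchy; the sketch takes it as a parameter so the logic can be exercised against any directory.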
Node to Pod Isolation
To enable pid isolation from node to pods, the SupportNodePidsLimit feature
gate is proposed. If enabled, pid reservations may be configured in the node
allocatable and eviction manager subsystems.
Node allocatable is a well-established feature concept in the kubelet that
allows isolation of user pod resources from host daemons at the kubepods
cgroup level, which parents all end-user pods.
The kubelet will be updated to support reservation of pids, so that the pid limit available to user pods is computed as follows:
[Allocatable] = [Node Capacity] - [Kube-Reserved] - [System-Reserved] - [Hard-Eviction-Threshold]
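The formula can be checked with a small worked example. The function name, the sample numbers, and the choice to reject over-reservation are illustrative assumptions, not values or behavior taken from the kubelet.

```python
def allocatable_pids(node_capacity, kube_reserved=0, system_reserved=0,
                     hard_eviction_threshold=0):
    """Return the pid budget left for user pods on a node."""
    allocatable = (node_capacity - kube_reserved - system_reserved
                   - hard_eviction_threshold)
    if allocatable < 0:
        # Assumption: reserving more pids than the node has is treated
        # as a configuration error in this sketch.
        raise ValueError("pid reservations exceed node pid capacity")
    return allocatable


# Hypothetical node: 32768 pids, 1000 reserved for Kubernetes daemons,
# 1000 reserved for system daemons, hard eviction threshold of 500.
print(allocatable_pids(32768, 1000, 1000, 500))  # 30268
```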
To use this feature, the kubelet's
--cgroups-per-qos option must be enabled. In addition, the
pids cgroup must be mounted.
The kubepods cgroup is bounded by the Allocatable value.
The QoS level cgroups are left unbounded across all pid pool sizes.
The pod level cgroup sandbox is configured as follows:
- the pod-max-pids value, if it is positive and specified in the kubelet config
- the local pod pid limiting policy (future)
- unbounded (so it is restricted by the limit on the kubepods cgroup)
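The choice among these three sources can be sketched as a small resolution function. The list does not state a precedence. This sketch assumes that a future per-pod policy would override the kubelet-wide pod-max-pids default, following the earlier remark that the default applies only when a pod has no local policy. The function and parameter names are hypothetical, and `None` stands for an unbounded pod level cgroup.

```python
def effective_pod_pids_limit(pod_max_pids=-1, local_policy_limit=None):
    """Pick the pid limit for a pod level cgroup, or None if unbounded."""
    if local_policy_limit is not None:
        # Assumption: a future local pod policy wins over the kubelet default.
        return local_policy_limit
    if pod_max_pids > 0:
        # Kubelet-wide default from pod-max-pids.
        return pod_max_pids
    # No pod level limit: the pod is restricted only by its parent cgroups.
    return None


print(effective_pod_pids_limit(pod_max_pids=1024))                          # 1024
print(effective_pod_pids_limit(pod_max_pids=1024, local_policy_limit=256))  # 256
print(effective_pod_pids_limit())                                           # None
```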
Risks and Mitigations
Pod to Pod pid isolation
The following criteria apply to the
SupportPodPidsLimit feature gate:
- basic support integrated in kubelet
- ensure proper node e2e test coverage is integrated verifying cgroup settings
- see testing: https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/pids_test.go and https://k8s-testgrid.appspot.com/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=Feature%3ASupportPodPidsLimit
- assuming no negative user feedback based on production experience, promote after 2 releases in beta.
Node to Pod pid isolation
Adding support for pid limiting at the Node Allocatable level
The following criteria apply to the SupportNodePidsLimit feature gate:
- basic support integrated via eviction manager and/or node allocatable level
- ensure proper node e2e testing coverage to ensure a pod is unable to fork-bomb
a node even when
- see testing: https://github.com/kubernetes/kubernetes/pull/73651/files#diff-7681b587a8fd514b312fa29c3acc669e
- assuming no negative user feedback, promote after 1 release in beta.
SupportPodPidsLimit implemented at Alpha.
Promote SupportPodPidsLimit to Beta by adding node e2e test coverage for pid cgroup isolation and ensuring PidPressure works as intended.
Promote SupportNodePidsLimit to Beta by adding node e2e test coverage for node cgroup isolation.