This proposal enables isolation of pid resources: a mechanism for pod-to-pod PID isolation as well as node-to-pod PID isolation.
Pids are a fundamental resource on Linux hosts. It is trivial to hit the task limit without hitting any other resource limits, destabilizing the host machine.
Administrators require mechanisms to ensure that user pods cannot induce pid exhaustion that prevents host daemons (runtime, kubelet, etc.) from running. In addition, it is important to limit pids among pods so that one pod's pid consumption has limited impact on other workloads on the node.
This proposal aims to do the following:
- enable administrator control to provide pod-to-pod pid isolation
- enable administrator control to provide node-to-pod pid isolation
This proposal defers the following:
- ability for a user to request an additional number of pid resources per pod

We anticipate supporting that via a policy knob that could be restricted and/or
defaulted via PodSecurityPolicy or LimitRange, and tracking that work under a
separate feature gate, `GranularPidLimitsPerPod`. Any defaulting applied to
pods today would be used only if the pod had no local pod pid limiting policy
in the future.
- Administrator can default the number of pids per pod to provide pod-to-pod isolation.
- Administrator can reserve a number of allocatable pids to user pods via node allocatable.
To enable pid isolation among pods, the `SupportPodPidsLimit` feature gate is
defined. If enabled, the kubelet writes the pid limit configured via the
`pod-max-pids` argument to the pod-level cgroup on Linux hosts. If the value is
-1, the kubelet defaults to the node allocatable pid capacity.
To enable pid isolation from node to pods, the `SupportNodePidsLimit` feature
gate is proposed. If enabled, pid reservations may be supported at the node
allocatable and eviction manager subsystem configurations.
Node allocatable is a well-established feature concept in the kubelet that
allows isolation of user pod resources from host daemons at the `kubepods`
cgroup level that parents all end-user pods.
The kubelet will be updated to support reservation of pids so the effective pid limit is computed as follows:

```
[Allocatable] = [Node Capacity] -
                [Kube-Reserved] -
                [System-Reserved] -
                [Hard-Eviction-Threshold]
```
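The formula above can be sketched as follows; the function and example values here are illustrative, not the kubelet's actual internals:

```go
package main

import "fmt"

// allocatablePids computes the pid pool available to user pods, mirroring
// the node allocatable formula above. All names and values are illustrative.
func allocatablePids(nodeCapacity, kubeReserved, systemReserved, hardEvictionThreshold int64) int64 {
	return nodeCapacity - kubeReserved - systemReserved - hardEvictionThreshold
}

func main() {
	// e.g. a host whose kernel pid_max is 32768, with hypothetical reservations
	fmt.Println(allocatablePids(32768, 1000, 1000, 768)) // 30000
}
```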
To use this feature, `--cgroups-per-qos` must be enabled. In addition, the
`pids` cgroup must be mounted. The `kubepods` cgroup is bounded by the
`Allocatable` value.
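As an illustrative check (not the kubelet's actual mount validation), the availability of the `pids` controller can be detected by parsing `/proc/cgroups`, whose rows are `subsys_name hierarchy num_cgroups enabled`:

```go
package main

import (
	"fmt"
	"strings"
)

// pidsControllerEnabled reports whether the pids cgroup controller appears
// enabled in the given /proc/cgroups content. Hypothetical helper for
// illustration only.
func pidsControllerEnabled(procCgroups string) bool {
	for _, line := range strings.Split(procCgroups, "\n") {
		fields := strings.Fields(line)
		// match rows like "pids\t9\t80\t1" where the last column is "enabled"
		if len(fields) == 4 && fields[0] == "pids" && fields[3] == "1" {
			return true
		}
	}
	return false
}

func main() {
	sample := "#subsys_name\thierarchy\tnum_cgroups\tenabled\ncpu\t3\t120\t1\npids\t9\t80\t1\n"
	fmt.Println(pidsControllerEnabled(sample)) // true
}
```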
The QoS level cgroups are left unbounded across all pid pool sizes.
The pod level cgroup sandbox is configured as follows:
- the `pod-max-pids` value, if positive and specified in the kubelet config
- the local pod pid limiting policy (future)
- unbounded (so it is restricted by the `Allocatable` value at the `kubepods` cgroup)
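A minimal sketch of that selection, assuming the future local pod policy is not yet present (function and constant names are hypothetical):

```go
package main

import "fmt"

// unbounded means no pod-level pids limit is written; the pod is then
// restricted only by the Allocatable bound at the kubepods cgroup.
const unbounded = -1

// podPidsLimit picks the pod-level pids limit: a positive pod-max-pids from
// the kubelet config wins; otherwise the pod cgroup is left unbounded.
// The future local pod pid limiting policy is omitted from this sketch.
func podPidsLimit(podMaxPids int64) int64 {
	if podMaxPids > 0 {
		return podMaxPids
	}
	return unbounded
}

func main() {
	fmt.Println(podPidsLimit(1024)) // 1024
	fmt.Println(podPidsLimit(-1))   // -1
}
```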
None
The following criteria apply to the `SupportPodPidsLimit` feature gate:
Alpha
- basic support integrated in kubelet
Beta
- ensure proper node e2e test coverage is integrated verifying cgroup settings
- see testing: https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/pids_test.go and https://testgrid.k8s.io/sig-node-kubelet#node-kubelet-serial&include-filter-by-regex=Feature%3ASupportPodPidsLimit
GA
- assuming no negative user feedback based on production experience, promote after 2 releases in beta.
Adding support for pid limiting at the Node Allocatable level
Eviction will rank pods based on priority, followed by the number of processes used. To integrate this into the eviction manager's control loops, we will add pod-level ProcessStats:
```go
// ProcessStats are stats pertaining to processes.
type ProcessStats struct {
	// Number of processes
	// +optional
	ProcessCount *uint64 `json:"process_count,omitempty"`
}

// PodStats holds pod-level unprocessed sample stats.
type PodStats struct {
	...
	// ProcessStats pertaining to processes.
	// +optional
	ProcessStats *ProcessStats `json:"process_stats,omitempty"`
}
```
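The ranking described above (priority first, then process count) can be sketched as follows; `podInfo` and `rankForPidEviction` are hypothetical names, not the eviction manager's actual types:

```go
package main

import (
	"fmt"
	"sort"
)

// podInfo is a hypothetical flattening of the fields the eviction manager
// would consult: pod priority and the ProcessCount sample described above.
type podInfo struct {
	Name         string
	Priority     int32
	ProcessCount uint64
}

// rankForPidEviction orders pods for pid-pressure eviction: lowest priority
// first, and among equal priorities, the highest process count first.
func rankForPidEviction(pods []podInfo) {
	sort.SliceStable(pods, func(i, j int) bool {
		if pods[i].Priority != pods[j].Priority {
			return pods[i].Priority < pods[j].Priority
		}
		return pods[i].ProcessCount > pods[j].ProcessCount
	})
}

func main() {
	pods := []podInfo{
		{Name: "a", Priority: 100, ProcessCount: 50},
		{Name: "b", Priority: 0, ProcessCount: 10},
		{Name: "c", Priority: 0, ProcessCount: 900},
	}
	rankForPidEviction(pods)
	fmt.Println(pods[0].Name) // c
}
```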
The following criteria apply to `SupportNodePidsLimit`:
Alpha
- basic support integrated via eviction manager and/or node allocatable level
Beta
- ensure proper node e2e testing coverage to ensure a pod is unable to fork-bomb a node even when `pod-max-pids` is unbounded
- see testing: https://github.com/kubernetes/kubernetes/pull/73651/files#diff-7681b587a8fd514b312fa29c3acc669e
GA
- assuming no negative user feedback, promote after 1 release at beta.
- `SupportPodPidsLimit` implemented at Alpha.
- Implement `SupportNodePidsLimit` as Alpha.
- Graduate `SupportPodPidsLimit` to Beta by adding node e2e test coverage for pid cgroup isolation; ensure PidPressure works as intended.
- Graduate `SupportNodePidsLimit` to Beta by adding node e2e test coverage for node cgroup isolation.
- Graduate `SupportNodePidsLimit` to GA after a year of production usage.
- Graduate `SupportPodPidsLimit` to GA after a year of production usage.