kep-number title authors owning-sig participating-sigs reviewers approvers editor creation-date last-updated status see-also replaces superseded-by
34
Pid Limiting
@derekwaynecarr
@dims
sig-node
@dashpole
@dashpole
@dchen1107
Derek Carr
2019-01-29
2019-03-05
implemented

Pid Limiting

Summary

A proposal to enable isolation of pid resources. It proposes a mechanism to enable pod-to-pod PID isolation as well as node-to-pod PID isolation.

Motivation

Pids are a fundamental resource on Linux hosts. A workload can easily hit the task limit without hitting any other resource limit, which destabilizes the host machine.

Administrators require mechanisms to ensure that user pods cannot induce pid exhaustion that prevents host daemons (runtime, kubelet, etc.) from running. In addition, pids must be limited per pod so that one pod has limited impact on other workloads on the node.

Goals

This proposal aims to do the following:

  • enable administrator control to provide pod-to-pod pid isolation
  • enable administrator control to provide node-to-pod pid isolation

Non-Goals

This proposal defers the following:

  • the ability for a user to request additional pids per pod

We anticipate supporting that via a policy knob that could be restricted and/or defaulted via PodSecurityPolicy or LimitRange, and tracking the work under a separate feature gate, GranularPidLimitsPerPod. Once that exists, any default applied to pods today would be used only if the pod has no local pod pid limiting policy.

Proposal

User Stories [optional]

  1. Administrator can default the number of pids per pod to provide pod-to-pod isolation.
  2. Administrator can reserve a number of allocatable pids for user pods via node allocatable.

Implementation Details/Notes/Constraints [optional]

Pod to Pod Isolation

To enable pid isolation among pods, the SupportPodPidsLimit feature gate is defined.

If enabled, on Linux hosts the kubelet writes the pid limit configured via the pod-max-pids argument to the pod level cgroup. If the value is -1, the kubelet defaults to the node allocatable pid capacity.

Node to Pod Isolation

To enable pid isolation from node to pods, the SupportNodePidsLimit feature gate is proposed. If enabled, pid reservations can be configured through the node allocatable and eviction manager subsystems.

Node allocatable is a well-established feature concept in the kubelet that allows isolation of user pod resources from host daemons at the kubepods cgroup level that parents all end-user pods.

The kubelet will be updated to support reservation of pids so the effective pid limit is computed as follows:

[Allocatable] = [Node Capacity] - 
 [Kube-Reserved] - 
 [System-Reserved] - 
 [Hard-Eviction-Threshold]
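
As a worked illustration of the formula above, here is a minimal Python sketch. The class and function names are hypothetical and are not kubelet APIs. Rejecting reservations that exceed capacity is an assumption of this sketch, not behavior described in this proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PidReservations:
    """Pids set aside on a node. Field names are illustrative only."""
    kube_reserved: int = 0
    system_reserved: int = 0
    hard_eviction_threshold: int = 0


def allocatable_pids(node_capacity: int, reservations: PidReservations) -> int:
    """Allocatable = Capacity - Kube-Reserved - System-Reserved - Hard-Eviction-Threshold."""
    reserved = (
        reservations.kube_reserved
        + reservations.system_reserved
        + reservations.hard_eviction_threshold
    )
    # Assumption of this sketch: an over-committed reservation is an error.
    if reserved > node_capacity:
        raise ValueError("pid reservations exceed node capacity")
    return node_capacity - reserved


if __name__ == "__main__":
    # 32768 - (1000 + 2000 + 500) = 29268
    print(allocatable_pids(32768, PidReservations(1000, 2000, 500)))
```

With a node capacity of 32768 pids and the example reservations, 29268 pids remain for user pods.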

Cgroup Enforcement

To use this feature, the --cgroups-per-qos flag must be enabled. In addition, the pids cgroup must be mounted.
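
One way to check the mount requirement on a cgroup v1 host is to look for a cgroup mount whose options include the pids controller. The sketch below parses text in the `/proc/mounts` format. The function name is hypothetical, and the approach assumes cgroup v1; it is not how the kubelet itself performs the check.

```python
def pids_cgroup_mounted(mounts_text: str) -> bool:
    """Return True if a cgroup v1 hierarchy with the pids controller is mounted.

    mounts_text is expected to be in /proc/mounts format:
    device, mount point, filesystem type, options, dump, pass.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        fs_type = fields[2]
        options = fields[3].split(",")
        if fs_type == "cgroup" and "pids" in options:
            return True
    return False


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        print(pids_cgroup_mounted(f.read()))
```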

The kubepods cgroup is bounded by the Allocatable value.

The QoS level cgroups are left unbounded across all pid pool sizes.

The pod level cgroup sandbox is configured as follows:

  1. the pod-max-pids value, if it is positive and specified in the kubelet config
  2. the local pod pid limiting policy (future)
  3. unbounded (so it is restricted by the Allocatable value at kubepods)
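
The resulting limits can be sketched in Python as follows. This covers only options 1 and 3 above and omits the future local pod policy. The function names are hypothetical. The string `max` is the value the kernel's `pids.max` file uses for "no limit". Because cgroup limits are hierarchical, a pod can never use more pids than its kubepods ancestor allows.

```python
UNBOUNDED = "max"  # pids.max value meaning "no limit"


def pod_pids_max(pod_max_pids: int) -> str:
    """Value to write to the pod level cgroup's pids.max."""
    if pod_max_pids > 0:
        return str(pod_max_pids)
    return UNBOUNDED


def cgroup_pid_limits(pod_max_pids: int, allocatable: int) -> dict:
    """pids.max per cgroup level, following the rules in this section."""
    return {
        "kubepods": str(allocatable),       # bounded by Allocatable
        "qos": UNBOUNDED,                   # QoS level cgroups are unbounded
        "pod": pod_pids_max(pod_max_pids),  # pod-max-pids, else unbounded
    }


def effective_pod_limit(pod_max_pids: int, allocatable: int) -> int:
    """Pids a single pod can actually use given its ancestors' limits."""
    if pod_max_pids > 0:
        return min(pod_max_pids, allocatable)
    return allocatable


if __name__ == "__main__":
    print(cgroup_pid_limits(1024, 29268))
    print(effective_pod_limit(-1, 29268))
```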

Risks and Mitigations

None

Graduation Criteria

Pod to Pod pid isolation

The following criteria apply to the SupportPodPidsLimit feature gate:

Alpha

  • basic support integrated in kubelet

Beta

GA

  • assuming no negative user feedback based on production experience, promote after 2 releases in beta.

Node to Pod pid isolation

Adding support for pid limiting at the Node Allocatable level

The following criteria apply to the SupportNodePidsLimit feature gate:

Alpha

  • basic support integrated via eviction manager and/or node allocatable level

Beta

GA

  • assuming no negative user feedback, promote after 1 release at beta.

Implementation History

Version 1.10

SupportPodPidsLimit implemented at Alpha.

Version 1.14

  • Implement SupportNodePidsLimit as Alpha.
  • Graduate SupportPodPidsLimit to Beta by adding node e2e test coverage for pid cgroup isolation and ensuring PidPressure works as intended.

Version 1.15

  • Graduate SupportNodePidsLimit to Beta by adding node e2e test coverage for node cgroup isolation.