Update Docker Shared PID proposal for per-pod configuration #1048
Merged
345 changes: 296 additions & 49 deletions
contributors/design-proposals/node/pod-pid-namespace.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,72 +1,319 @@ | ||
# Shared PID Namespace | ||
|
||
Pods share namespaces where possible, but a requirement for sharing the PID | ||
namespace has not been defined due to lack of support in Docker. Docker began | ||
supporting a shared PID namespace in 1.12, and other Kubernetes runtimes (rkt, | ||
cri-o, hyper) have already implemented a shared PID namespace. | ||
|
||
This proposal defines a shared PID namespace as a requirement of the Container | ||
Runtime Interface and links its rollout in Docker to that of the CRI. | ||
* Status: Pending | ||
* Version: Alpha | ||
* Implementation Owner: [@verb](https://github.com/verb) | ||
|
||
## Motivation | ||
|
||
Sharing a PID namespace between containers in a pod is discussed in | ||
[#1615](https://issues.k8s.io/1615), and enables: | ||
Pods share namespaces where possible, but support for sharing the PID namespace | ||
had not been defined due to lack of support in Docker. This created an implicit | ||
API on which certain container images now rely. This document proposes adding | ||
support for sharing a process namespace between containers in a pod while | ||
maintaining backwards compatibility with the existing implicit API. | ||
|
||
1. signaling between containers, which is useful for sidecars (e.g. for | |
signaling a daemon process after rotating logs). | ||
2. easier troubleshooting of pods. | ||
3. addressing [Docker's zombie problem][1] by reaping orphaned zombies in the | ||
infra container. | ||
## Proposal | ||
|
||
## Goals and Non-Goals | ||
### Goals and Non-Goals | ||
|
||
Goals include: | ||
- Changing default behavior in the Docker runtime as implemented by the CRI | ||
- Making Docker behavior compatible with the other Kubernetes runtimes | ||
|
||
* Backwards compatibility with container images expecting `pid == 1` semantics | ||
* Per-pod configuration of PID namespace sharing | ||
* Ability to change default sharing behavior in `v2.Pod` | ||
|
||
Non-goals include: | ||
- Creating an init solution that works for all runtimes | ||
- Supporting isolated PID namespace indefinitely | ||
|
||
## Modification to the Docker Runtime | ||
* Creating a general purpose container init solution | ||
* Multiple shared PID namespaces per pod | ||
* Per-container configuration of PID namespace sharing | ||
|
||
### Summary | ||
|
||
We will add support for configuring pod-shared process namespaces by adding a | ||
new boolean field `ShareProcessNamespace` to the pod spec. The default of false | |
means that each container will have a separate process namespace. When set to | ||
true, all containers in the pod will share a single process namespace. | ||
|
||
The Container Runtime Interface (CRI) will be updated to support three namespace | ||
modes: Container, Pod & Node. The Runtime Manager will translate the pod spec | ||
into one of these modes as follows: | ||
|
||
Pod `shareProcessNamespace` | Pod `hostPID` | CRI PID Mode | ||
--------------------------- | ------------- | ------------ | ||
false | false | Container | ||
false | true | Node | ||
true | false | Pod | ||
true | true | *Error* | ||
|
||
If a runtime does not implement a particular PID mode, it must return an error. | ||
For reference, Docker will support all three modes when using version >= 1.13.1. | ||
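
The translation table above can be sketched as a small Go function. This is a minimal illustration, not the Runtime Manager's actual code: the names `NamespaceMode`, `pidModeForPod`, and `ErrConflictingPIDOptions` are hypothetical, and the sketch assumes a nil `ShareProcessNamespace` pointer is treated as false.

```go
package main

import (
	"errors"
	"fmt"
)

// NamespaceMode mirrors the three CRI PID modes named in the proposal.
// The type and constant names are assumptions made for this sketch.
type NamespaceMode int

const (
	Pod NamespaceMode = iota
	Container
	Node
)

func (m NamespaceMode) String() string {
	return [...]string{"POD", "CONTAINER", "NODE"}[m]
}

// ErrConflictingPIDOptions is a hypothetical error for the one invalid row.
var ErrConflictingPIDOptions = errors.New("hostPID and shareProcessNamespace cannot both be set")

// pidModeForPod maps the two pod-level settings to a single CRI PID mode,
// following the table row by row. A nil pointer is treated as false.
func pidModeForPod(shareProcessNamespace *bool, hostPID bool) (NamespaceMode, error) {
	share := shareProcessNamespace != nil && *shareProcessNamespace
	switch {
	case share && hostPID:
		return Container, ErrConflictingPIDOptions
	case hostPID:
		return Node, nil
	case share:
		return Pod, nil
	default:
		return Container, nil
	}
}

func main() {
	yes, no := true, false
	cases := []struct {
		share   *bool
		hostPID bool
	}{{nil, false}, {&no, true}, {&yes, false}, {&yes, true}}
	for _, c := range cases {
		mode, err := pidModeForPod(c.share, c.hostPID)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(mode)
	}
}
```

Returning an error for the conflicting row, rather than picking one mode, matches the table's *Error* entry and leaves the decision with API validation.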
|
||
The shared PID functionality will be hidden behind a new feature gate in both | ||
the API server and the kubelet, and the existing `--docker-disable-shared-pid` | ||
flag will be removed from the kubelet, subject to [deprecation | ||
policy](https://kubernetes.io/docs/reference/deprecation-policy/). | ||
|
||
## User Experience | ||
|
||
### Use Cases | ||
|
||
Sharing a PID namespace between containers in a pod is discussed in | ||
[#1615](https://issues.k8s.io/1615) and enables: | ||
|
||
1. signaling between containers, which is useful for sidecars (e.g. for | |
signaling a daemon process after rotating logs). | ||
1. easier troubleshooting of pods. | ||
1. addressing [Docker's zombie | ||
problem](https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/) | ||
by reaping orphaned zombies in the infra container. | ||
|
||
### Behavioral Changes | ||
|
||
Sharing a process namespace fits well with Kubernetes' pod abstraction, but it's | ||
a significant departure from the traditional behavior of Docker. This may break | ||
container images and development patterns that have come to rely on process | ||
isolation. Notably: | ||
|
||
1. **The main container process no longer has PID 1**. It cannot be signalled | ||
using `kill 1`, and attempting to do so will instead signal the | ||
infrastructure container and potentially restart the pod. Containers | ||
shipping an init system like systemd may [require additional | ||
flags](https://github.com/kubernetes/kubernetes/issues/48937#issuecomment-321243669). | ||
1. **Processes are visible to other containers in the pod**. This includes all | ||
information visible in `/proc`, such as passwords as arguments or | ||
environment variables, and process signalling. This can be somewhat | ||
mitigated by running processes as separate, non-root users. | ||
1. **Container filesystems are visible to other containers in the pod through | ||
the <code>/proc/$pid/root</code> magic symlink**. This makes debugging | ||
easier, but it also means that secrets are protected only by standard | ||
filesystem permissions. | ||
|
||
## Implementation | ||
|
||
### Kubernetes API Changes | ||
|
||
`v1.PodSpec` gains a new field named `ShareProcessNamespace`: | ||
|
||
``` | ||
// PodSpec is a description of a pod. | ||
type PodSpec struct { | ||
... | ||
// Use the host's pid namespace. | ||
// Note that HostPID and ShareProcessNamespace cannot both be set. | ||
// Optional: Default to false. | ||
// +k8s:conversion-gen=false | ||
// +optional | ||
HostPID bool `json:"hostPID,omitempty" protobuf:"varint,12,opt,name=hostPID"` | ||
// Share a single process namespace between all of the containers in a pod. | ||
// Note that HostPID and ShareProcessNamespace cannot both be set. | ||
// Optional: Default to false. | ||
// +k8s:conversion-gen=false | ||
// +optional | ||
ShareProcessNamespace *bool `json:"shareProcessNamespace,omitempty" protobuf:"varint,XX,opt,name=shareProcessNamespace"` | ||
... | ||
``` | ||
|
||
The field name deviates from that of HostPID in an attempt to [better signal the | ||
consequences](https://github.com/kubernetes/community/pull/1048/files#r159146536) | ||
of setting the option. Setting both `ShareProcessNamespace` and `HostPID` will | ||
cause a validation error. | ||
|
||
### Container Runtime Interface Changes | ||
|
||
Namespace options in the CRI are currently specified for both `PodSandbox` and | ||
`Container` creation requests via booleans in `NamespaceOption`: | ||
|
||
``` | ||
message NamespaceOption { | ||
// If set, use the host's network namespace. | ||
bool host_network = 1; | ||
// If set, use the host's PID namespace. | ||
bool host_pid = 2; | ||
// If set, use the host's IPC namespace. | ||
bool host_ipc = 3; | ||
} | ||
``` | ||
|
||
We will change `NamespaceOption` to use a `NamespaceMode` enumeration for the | ||
existing namespace options: | ||
|
||
``` | ||
enum NamespaceMode { | ||
POD = 0; | ||
CONTAINER = 1; | ||
NODE = 2; | ||
} | ||
|
||
// NamespaceOption provides options for Linux namespaces. | ||
message NamespaceOption { | ||
// Network namespace for this container/sandbox. | ||
// Runtimes must support: POD, NODE | ||
NamespaceMode network = 1; | ||
// PID namespace for this container/sandbox. | ||
// Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER. | ||
// The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods. | ||
// Runtimes must support: POD, CONTAINER, NODE | ||
NamespaceMode pid = 2; | ||
// IPC namespace for this container/sandbox. | ||
// Runtimes must support: POD, NODE | ||
NamespaceMode ipc = 3; | ||
} | ||
``` | ||
|
||
Note that this breaks backwards compatibility in the CRI, which is still in | ||
alpha. | ||
|
||
The protocol default for a namespace is `POD` because that's the default for | ||
network and IPC, and we will consider making it the default for PID in `v2.Pod`. | ||
The kubelet will explicitly set `pid` to `CONTAINER` for `v1.Pod` by default so | ||
that the default behavior of `v1.Pod` does not change. | ||
|
||
This CRI design allows different namespace configuration for each of the | ||
containers in the pod and the sandbox, but currently we have no plans to support | ||
this in the Kubernetes API. The kubelet will translate namespace booleans from | ||
v1.PodSpec into a single `NamespaceMode` to be used for the sandbox and all | ||
regular and init containers in a pod. | ||
|
||
#### Targeting a Specific Container's Namespace | ||
|
||
Though we don't intend to support this in general pod configuration, there is a | ||
use case for mixed process namespaces within a single pod. [Troubleshooting | ||
Running Pods](troubleshooting-running-pods.md) allows inserting an ephemeral | ||
Debug Container in an existing, running pod. In order for this to be useful we | ||
want to share, within the pod, a process namespace between the new container | ||
performing the debugging and its existing target container. | ||
|
||
This is done with the additional `NamespaceMode` `TARGET` and field `target_id`: | ||
|
||
``` | ||
enum NamespaceMode { | ||
POD = 0; | ||
CONTAINER = 1; | ||
NODE = 2; | ||
TARGET = 3; | ||
} | ||
|
||
// NamespaceOption provides options for Linux namespaces. | ||
message NamespaceOption { | ||
// Network namespace for this container/sandbox. | ||
// Runtimes must support: POD, NODE | ||
NamespaceMode network = 1; | ||
// PID namespace for this container/sandbox. | ||
// Note: The CRI default is POD, but the v1.PodSpec default is CONTAINER. | ||
// The kubelet's runtime manager will set this to CONTAINER explicitly for v1 pods. | ||
// Runtimes must support: POD, CONTAINER, NODE, TARGET | ||
NamespaceMode pid = 2; | ||
// IPC namespace for this container/sandbox. | ||
// Runtimes must support: POD, NODE | ||
NamespaceMode ipc = 3; | ||
// Target Container ID for NamespaceMode of TARGET. This container must be in the | ||
// same pod as the target container. | ||
string target_id = 4; | ||
} | ||
``` | ||
|
||
When `NamespaceOption.pid` is set to `TARGET`, a runtime must create the new | ||
container in the namespace used by the container identified by `target_id`. If the | |
target container has `NamespaceOption.pid` set to `POD`, then the new container | ||
should also use the pod namespace. If the target container has an isolated | ||
process namespace, then the new container will join only that container's | ||
namespace. Examples are provided for dockershim below. | ||
|
||
There is no mechanism in the Kubernetes API for an end-user to set `TARGET`. It | ||
exists for the kubelet to run automation or debugging from a container image in | ||
the namespace of an existing pod and container. Additionally, we choose to | ||
explicitly not support sharing namespaces between different pods. The kubelet | ||
must not generate such a reference, and the runtime should not accept it. That | ||
is, for pod{Container `A`, Container `B`, Sandbox `S`} and any other unrelated | |
Container `C`: | ||
|
||
valid `target_id` | invalid `target_id` | ||
----------------- | ------------------- | ||
containerID(A) | sandboxID(S) | ||
containerID(B) | containerID(C) | ||
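
The valid and invalid cases in the table above could be checked along these lines. This is a sketch under assumptions: `podMembers` and `validateTargetID` are hypothetical names, and rejecting an empty `target_id` is an assumption the proposal does not state.

```go
package main

import "fmt"

// podMembers is a hypothetical view of the IDs that belong to one pod.
type podMembers struct {
	sandboxID    string
	containerIDs map[string]bool
}

// validateTargetID accepts only IDs of containers in the same pod.
// The sandbox ID and IDs from unrelated pods are rejected.
func validateTargetID(p podMembers, targetID string) error {
	switch {
	case targetID == "":
		// Assumption: TARGET mode without a target is an error.
		return fmt.Errorf("target_id must be set when the PID mode is TARGET")
	case targetID == p.sandboxID:
		return fmt.Errorf("target_id %q is the pod sandbox, not a container", targetID)
	case !p.containerIDs[targetID]:
		return fmt.Errorf("target_id %q is not a container in this pod", targetID)
	}
	return nil
}

func main() {
	p := podMembers{sandboxID: "S", containerIDs: map[string]bool{"A": true, "B": true}}
	for _, id := range []string{"A", "B", "S", "C"} {
		fmt.Printf("%s: %v\n", id, validateTargetID(p, id))
	}
}
```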
|
||
### dockershim Changes | ||
|
||
The Docker runtime implements the pod sandbox as a container running the pause | ||
container image. When configured for `POD` namespace sharing, the PID namespace | ||
of the sandbox will become the single PID namespace for the pod. This means the | |
`POD` and `CONTAINER` modes are equivalent for the sandbox. The mapping | |
of the _sandbox's_ PID mode to docker's `HostConfig.PidMode` is (`v1.Pod` | ||
settings provided as reference): | ||
|
||
ShareProcessNamespace | HostPID | Sandbox PID Mode | HostConfig.PidMode | ||
--------------------- | ------- | ---------------- | ------------------ | ||
false | false | CONTAINER | *unset* | ||
true | false | POD | *unset* | ||
false | true | NODE | "host" | ||
\- | \- | TARGET | *Error* | ||
|
||
For _containers_, `HostConfig.PidMode` will be set as follows: | ||
|
||
ShareProcessNamespace | HostPID | Container PID Mode | HostConfig.PidMode | ||
--------------------- | ------- | ------------------ | ------------------ | ||
false | false | CONTAINER | *unset* | ||
true | false | POD | "container:[sandbox-container-id]" | ||
false | true | NODE | "host" | ||
false | false | TARGET | "container:[target-container-id]" | ||
true | false | TARGET | "container:[sandbox-container-id]" | ||
false | true | TARGET | "host" | ||
|
||
If the Docker runtime version does not support sharing pid namespaces, a | ||
`CreateContainerRequest` with `namespace_options.pid` set to `POD` will return | ||
an error. | ||
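
As a rough illustration of the two dockershim tables above, the following Go sketch computes the `HostConfig.PidMode` string for a sandbox and for a container. It is not dockershim code: `sandboxPidMode` and `containerPidMode` are hypothetical helpers, `PidMode` is modeled as a plain string to avoid depending on the Docker client library, and an empty string stands in for *unset*.

```go
package main

import "fmt"

// NamespaceMode mirrors the CRI enumeration, including TARGET.
type NamespaceMode int

const (
	Pod NamespaceMode = iota
	Container
	Node
	Target
)

// sandboxPidMode follows the sandbox table: POD and CONTAINER are
// equivalent (unset), NODE is "host", and TARGET is an error.
func sandboxPidMode(mode NamespaceMode) (string, error) {
	switch mode {
	case Pod, Container:
		return "", nil
	case Node:
		return "host", nil
	default:
		return "", fmt.Errorf("PID mode %d is not valid for a sandbox", mode)
	}
}

// containerPidMode follows the container table. targetMode is the PID mode
// the target container was created with; it is only read for TARGET.
func containerPidMode(mode, targetMode NamespaceMode, sandboxID, targetID string) (string, error) {
	if mode == Target {
		switch targetMode {
		case Container:
			// Isolated target: join that container's namespace only.
			return "container:" + targetID, nil
		case Pod, Node:
			// Shared or host target: resolve as the target itself would.
			mode = targetMode
		default:
			return "", fmt.Errorf("unsupported PID mode %d for target container", targetMode)
		}
	}
	switch mode {
	case Container:
		return "", nil
	case Pod:
		return "container:" + sandboxID, nil
	case Node:
		return "host", nil
	default:
		return "", fmt.Errorf("unsupported PID mode %d", mode)
	}
}

func main() {
	m, _ := containerPidMode(Pod, Pod, "sandbox-1", "")
	fmt.Println(m)
	m, _ = containerPidMode(Target, Container, "sandbox-1", "app-1")
	fmt.Println(m)
}
```

Resolving `TARGET` against the target container's own mode keeps the debug container in whichever namespace its target actually uses, which is the behavior the last three table rows describe.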
|
||
### Deprecation of existing kubelet flag | ||
|
||
SIG Node did not anticipate the strong objections to migrating from isolated to | ||
shared process namespaces for Docker. The previous (now abandoned) migration | ||
plan introduced a kubelet flag to toggle the shared namespace behavior, but | ||
objections did not materialize until the flag had moved from experimental to GA. | ||
|
||
We will modify the Docker implementation of the CRI to use a shared PID | ||
namespace when running with a version of Docker >= 1.12. The legacy | ||
`dockertools` implementation will not be changed. | ||
The `--docker-disable-shared-pid` (default: true) kubelet flag disables the use | ||
of shared process namespaces for the Docker runtime. We will immediately mark it | ||
as deprecated, but according to the [deprecation | ||
policy](https://kubernetes.io/docs/reference/deprecation-policy/) we must | ||
support it for 6 months. | ||
|
||
Linking this change to the CRI means that Kubernetes users who care to test such | ||
changes can test the combined changes at once. Users who do not care to test | ||
such changes will be insulated by Kubernetes not recommending Docker >= 1.12 | ||
until after switching to the CRI. | ||
We must provide a transition path for users setting this kubelet flag to false. | ||
Setting this flag asserts a desire to override the default Kubernetes behavior | ||
for all pods. Until the flag is removed, the kubelet will honor this assertion | ||
by ignoring the value of `ShareProcessNamespace` and logging a warning to the | ||
event log. | ||
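
One possible reading of this transition behavior is sketched below. It is an interpretation, not the kubelet's implementation: `effectiveShare` is a hypothetical helper, and the sketch assumes that setting `--docker-disable-shared-pid=false` means every pod gets a shared process namespace regardless of its `ShareProcessNamespace` field.

```go
package main

import "fmt"

// effectiveShare decides whether a pod's containers share a process
// namespace. disableSharedPID is the value of the deprecated kubelet flag
// (default true). overridden reports that the pod's own setting was ignored,
// so the caller can log a warning.
func effectiveShare(disableSharedPID bool, shareProcessNamespace *bool) (share, overridden bool) {
	if !disableSharedPID {
		// Assumption: the flag asserts shared PID for all pods.
		return true, true
	}
	return shareProcessNamespace != nil && *shareProcessNamespace, false
}

func main() {
	no := false
	share, overridden := effectiveShare(false, &no)
	if overridden {
		fmt.Println("warning: ShareProcessNamespace ignored because --docker-disable-shared-pid=false")
	}
	fmt.Println("share:", share)
}
```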
|
||
Other changes that must be made to support this change: | ||
## Alternatives Considered | ||
|
||
1. Add a test to verify all containers restart if the infra container | ||
responsible for the PodSandbox dies. (Note: With Docker 1.12 if the source | ||
of the PID namespace dies all containers sharing that namespace are killed | ||
as well.) | ||
2. Modify the Infra container used by the Docker runtime to reap orphaned | ||
zombies ([#36853](https://pr.k8s.io/36853)). | ||
### Explicit Container/Sandbox ID Targeting | ||
|
||
## Rollout Plan | ||
Rather than using a `NamespaceMode`, `NamespaceOption.pid` could be a string | ||
that explicitly targets a container or sandbox ID: | ||
|
||
SIG Node is planning to switch to the CRI as a default in 1.6, at which point | ||
users with Docker >= 1.12 will receive a shared PID namespace by default. | ||
Cluster administrators will be able to disable this behavior by providing a flag | ||
to the kubelet which will cause the dockershim to revert to previous behavior. | ||
``` | ||
// NamespaceOption provides options for Linux namespaces. | ||
message NamespaceOption { | ||
... | ||
// ID of Sandbox or Container to use for PID namespace, or "host" | ||
string pid = 2; | ||
... | ||
} | ||
``` | ||
|
||
The ability to disable shared PID namespaces is intended as a way to roll back | ||
to prior behavior in the event of unforeseen problems. It won't be possible to | ||
configure the behavior per-pod. We believe this is acceptable because: | ||
This removes the need for a separate `TARGET` mode, but a mode enumeration | ||
better captures the intent of the option. | ||
|
||
* We have not identified a concrete use case requiring isolated PID namespaces. | ||
* Making PID namespace configurable requires changing the CRI, which we would | ||
like to avoid since there are no use cases. | ||
### Defaulting to PID Namespace Sharing | ||
|
||
In a future release, SIG Node will recommend docker >= 1.12. Unless a compelling | ||
use case for isolated PID namespaces is discovered, we will remove the ability | ||
to disable the shared PID namespace in the subsequent release. | ||
Other Kubernetes runtimes already share a single PID namespace between | ||
containers in a pod. We could easily change the Docker runtime to always share a | ||
PID namespace when supported by the installed Docker version, but this would | ||
cause problems for container images that assume they will always be PID 1. | ||
|
||
### Migration to Shared-only Namespaces | ||
|
||
[1]: https://blog.phusion.nl/2015/01/20/docker-and-the-pid-1-zombie-reaping-problem/ | ||
Rather than adding support to the API for configuring namespaces, we could allow | |
changing the default behavior with pod annotations, with the intention of | |
removing support for isolated PID namespaces in v2.Pod. However, many members of | |
the community want to use isolated namespaces as a security boundary between | |
containers in a pod. |
Please note the downsides of enabling this shared mode - sidecar containers that were previously isolated are no longer so, environment variables are now visible to all other processes, any "kill all" semantics used within the process are now broken, exec processes from other containers will now show up, etc. This doc should clarify tradeoffs.
Added a section here. Let me know if I missed anything.