No New Privileges
- Current Implementations
- Existing SecurityContext objects
- Changes of SecurityContext objects
- Pod Security Policy changes
In Linux, the execve system call can grant a newly-created process more
privileges than its parent process has. To address the security issues this
raises, Linux kernel v3.5 added a flag named no_new_privs that prevents those
new privileges from being granted.
no_new_privs is inherited across fork, clone, and execve and cannot be unset.
With no_new_privs set, execve promises not to grant the privilege to do anything
that could not have been done without the execve call.
For more details about
no_new_privs, please check the
Linux kernel documentation.
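As a rough illustration of how the flag can be observed from user space, the sketch below reads the NoNewPrivs line that Linux 4.10 and later expose in /proc/<pid>/status. The helper name parseNoNewPrivs is made up for this example, and the program only inspects the flag; it does not set it.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// parseNoNewPrivs (hypothetical helper) extracts the NoNewPrivs field from the
// text of a /proc/<pid>/status file. found is false when the field is absent,
// which is the case on kernels older than 4.10.
// Note: strings.Cut needs Go 1.18 or newer.
func parseNoNewPrivs(status string) (set bool, found bool) {
	scanner := bufio.NewScanner(strings.NewReader(status))
	for scanner.Scan() {
		key, value, ok := strings.Cut(scanner.Text(), ":")
		if ok && key == "NoNewPrivs" {
			return strings.TrimSpace(value) == "1", true
		}
	}
	return false, false
}

func main() {
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	set, found := parseNoNewPrivs(string(data))
	if !found {
		fmt.Println("kernel does not report NoNewPrivs")
		return
	}
	fmt.Println("no_new_privs set:", set)
}
```

Because the flag is inherited and cannot be unset, a process that sees the flag set here can expect every child it starts to see the same value.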
This is different from NOSUID in that no_new_privs can give permission to
the container process to further restrict child processes with seccomp. This
permission goes only one way: the container process cannot grant more
permissions, only restrict further.
Interactions with other Linux primitives
- suid binaries: will break when no_new_privs is enabled
- seccomp2 as a non root user: requires no_new_privs
- seccomp2 with dropped CAP_SYS_ADMIN: requires no_new_privs
- ambient capabilities: requires no_new_privs
- selinux transitions: bugs that were fixed documented here
Support in Docker
Since Docker 1.11, a user can specify --security-opt to enable no_new_privs
while creating containers, for example
docker run --security-opt=no-new-privileges busybox.
Docker provides, via its Go API, an object to configure container creation
parameters. In this object, there is a string array named
HostConfig.SecurityOpt that specifies the security options. Clients can
use this field to pass security option arguments while creating new
containers.
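To make the shape of that field concrete, here is a minimal sketch that uses a local stand-in type rather than Docker's real package. HostConfig below models only the SecurityOpt field, and wantsNoNewPrivileges is a hypothetical helper; the option spellings it accepts (no-new-privileges, optionally followed by =true or =false) are an assumption based on the docker run flag.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// HostConfig is a local stand-in for the relevant part of Docker's HostConfig;
// only the SecurityOpt field discussed above is modeled.
type HostConfig struct {
	SecurityOpt []string
}

// wantsNoNewPrivileges (hypothetical helper) scans the free-form option
// strings and reports whether no_new_privs was requested. When the option
// appears more than once, the last parseable occurrence wins.
func wantsNoNewPrivileges(hc HostConfig) bool {
	enabled := false
	for _, opt := range hc.SecurityOpt {
		key, value, hasValue := strings.Cut(opt, "=")
		if key != "no-new-privileges" {
			continue // some other security option, e.g. "label=disable"
		}
		if !hasValue {
			enabled = true // bare flag form
			continue
		}
		if b, err := strconv.ParseBool(value); err == nil {
			enabled = b
		}
	}
	return enabled
}

func main() {
	hc := HostConfig{SecurityOpt: []string{"label=disable", "no-new-privileges"}}
	fmt.Println(wantsNoNewPrivileges(hc))
}
```

Every consumer of such a list has to repeat this kind of string parsing, which is one way to see why a typed field is easier to validate than a list of option strings.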
This field did not scale well for the Docker client, so it's suggested that Kubernetes does not follow that design.
The no_new_privs option is not on by default in Docker.
Support in rkt
Since rkt v1.26.0, the
NoNewPrivileges option has been enabled in rkt.
More details of the rkt implementation can be read here.
Support in OCI runtimes
Since version 0.3.0 of the OCI runtime specification, a user can specify the
noNewPrivileges boolean flag in the configuration file.
More details of the OCI implementation can be read here.
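As a small illustration, the sketch below renders the fragment of a runtime configuration that carries the flag. It assumes the key is spelled noNewPrivileges and lives under the process object, as in current releases of the runtime specification; the Spec and Process types here are trimmed-down local stand-ins, not the specification's own Go types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Process and Spec model only the fragment of an OCI runtime configuration
// needed for this example; the real configuration has many more fields.
type Process struct {
	NoNewPrivileges bool `json:"noNewPrivileges"`
}

type Spec struct {
	Process Process `json:"process"`
}

// renderSpec (hypothetical helper) returns the JSON for a configuration
// fragment with the flag set to the given value.
func renderSpec(noNewPrivs bool) (string, error) {
	out, err := json.Marshal(Spec{Process: Process{NoNewPrivileges: noNewPrivs}})
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	s, err := renderSpec(true)
	if err != nil {
		fmt.Println("marshal failed:", err)
		return
	}
	fmt.Println(s) // {"process":{"noNewPrivileges":true}}
}
```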
Existing SecurityContext objects
SecurityContext objects define the related security options
for Kubernetes containers, e.g. selinux options.
To support "no new privileges" options in Kubernetes, it is proposed to make the following changes:
Changes of SecurityContext objects
Add a new *bool type field named allowPrivilegeEscalation to the SecurityContext
definition.
By default, i.e. when allowPrivilegeEscalation=nil, we will set no_new_privs=true,
with the following exceptions:
- when a container is privileged
- when CAP_SYS_ADMIN is added to a container
- when a container is not run as root, uid 0 (to prevent breaking suid binaries)
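The API will reject as invalid a privileged container with
allowPrivilegeEscalation=false, as well as a container that adds CAP_SYS_ADMIN
with allowPrivilegeEscalation=false.
When allowPrivilegeEscalation is set to false it will enable no_new_privs
for that container.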
The allowPrivilegeEscalation field in SecurityContext provides container-level
control of the no_new_privs flag and can override the default in both directions.
This requires changes to the Docker, rkt, and CRI runtime integrations so that
kubelet will add the specific no_new_privs option.
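The defaulting and validation rules in this section can be sketched as follows. This is an illustration of the proposal's logic, not kubelet code: the Container struct and the validate and noNewPrivs functions are hypothetical names, and an unset uid is treated the same as uid 0.

```go
package main

import (
	"errors"
	"fmt"
)

// Container is a hypothetical, trimmed-down view of the settings that
// influence no_new_privs in this proposal.
type Container struct {
	Privileged               bool
	AddsSysAdmin             bool   // CAP_SYS_ADMIN is among the added capabilities
	RunAsUser                *int64 // nil means unset, treated like uid 0
	AllowPrivilegeEscalation *bool  // nil means "use the default"
}

var errInvalid = errors.New(
	"allowPrivilegeEscalation=false conflicts with privileged or CAP_SYS_ADMIN")

// validate rejects the two combinations the API is meant to refuse.
func validate(c Container) error {
	explicitFalse := c.AllowPrivilegeEscalation != nil && !*c.AllowPrivilegeEscalation
	if explicitFalse && (c.Privileged || c.AddsSysAdmin) {
		return errInvalid
	}
	return nil
}

// noNewPrivs returns the value of the flag for a container that passed validate.
func noNewPrivs(c Container) bool {
	if c.AllowPrivilegeEscalation != nil {
		// An explicit setting overrides the default in either direction.
		return !*c.AllowPrivilegeEscalation
	}
	if c.Privileged || c.AddsSysAdmin {
		return false
	}
	if c.RunAsUser != nil && *c.RunAsUser != 0 {
		return false // non-root: keep suid binaries working
	}
	return true
}

func main() {
	uid := int64(1000)
	deny := false
	fmt.Println(noNewPrivs(Container{}))                // true
	fmt.Println(noNewPrivs(Container{RunAsUser: &uid})) // false
	fmt.Println(noNewPrivs(Container{RunAsUser: &uid, AllowPrivilegeEscalation: &deny})) // true
	fmt.Println(validate(Container{Privileged: true, AllowPrivilegeEscalation: &deny}) != nil) // true
}
```

Keeping validation separate from resolution mirrors the split in the text: the API server refuses contradictory requests, and the runtime integration only has to translate an already valid container into a flag value.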
Pod Security Policy changes
The default can be set via a new *bool type field named
defaultAllowPrivilegeEscalation in a Pod Security Policy.
This would allow users to set defaultAllowPrivilegeEscalation=false, overriding
the default nil behavior of no_new_privs=false for containers
whose uids are not 0.
This would also keep the behavior of setting the security context as
allowPrivilegeEscalation=true for privileged containers and those with
CAP_SYS_ADMIN added.
To recap, below is a table defining the default behavior at the pod security policy level and what can be set as a default with a pod security policy.
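|allowPrivilegeEscalation setting|uid = 0 or unset|uid != 0|privileged/CAP_SYS_ADMIN|
|---|---|---|---|
|nil|no_new_privs=true|no_new_privs=false|no_new_privs=false|
|false|no_new_privs=true|no_new_privs=true|no_new_privs=false|
|true|no_new_privs=false|no_new_privs=false|no_new_privs=false|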
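One possible reading of how a policy default would be applied to a container is sketched below. The function name effectiveAllowPrivilegeEscalation and its parameters are hypothetical; the sketch assumes that an explicit container setting is left untouched and that privileged or CAP_SYS_ADMIN containers are always given allowPrivilegeEscalation=true.

```go
package main

import "fmt"

// effectiveAllowPrivilegeEscalation (hypothetical helper) returns the setting
// a container ends up with once a Pod Security Policy default is considered.
func effectiveAllowPrivilegeEscalation(containerSetting *bool, privilegedOrSysAdmin bool, policyDefault *bool) *bool {
	if containerSetting != nil {
		return containerSetting // an explicit container setting is left alone
	}
	if privilegedOrSysAdmin {
		allow := true
		return &allow // these containers keep allowPrivilegeEscalation=true
	}
	return policyDefault // may be nil, leaving the nil defaults in force
}

// describe renders an optional bool for printing.
func describe(v *bool) string {
	if v == nil {
		return "nil"
	}
	return fmt.Sprint(*v)
}

func main() {
	deny := false
	fmt.Println(describe(effectiveAllowPrivilegeEscalation(nil, false, &deny))) // false
	fmt.Println(describe(effectiveAllowPrivilegeEscalation(nil, true, &deny)))  // true
	fmt.Println(describe(effectiveAllowPrivilegeEscalation(nil, false, nil)))   // nil
}
```

With defaultAllowPrivilegeEscalation=false, a non-root container that leaves the field unset therefore ends up with no_new_privs enabled, while a privileged container does not.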
A new bool field named
allowPrivilegeEscalation will be added to the Pod
Security Policy as well to gate whether or not a user is allowed to set the
security context to
allowPrivilegeEscalation=true. This field will default to