[Proposal] Security Contexts #3910

Merged Feb 17, 2015 (2 commits)

`docs/design/security_context.md` (190 additions, 0 deletions)
# Security Contexts
## Abstract
A security context is a set of constraints that are applied to a container in order to achieve the following goals (from [security design](security.md)):

1. Ensure clear isolation between the container and the underlying host it runs on
2. Limit the ability of the container to negatively impact the infrastructure or other containers

## Background

The problem of securing containers in Kubernetes has come up [before](https://github.com/GoogleCloudPlatform/kubernetes/issues/398) and the potential problems with container security are [well known](http://opensource.com/business/14/7/docker-security-selinux). Although it is not possible to completely isolate Docker containers from their hosts, new features like [user namespaces](https://github.com/docker/libcontainer/pull/304) make it possible to greatly reduce the attack surface.

## Motivation

### Container isolation

In order to improve container isolation from the host and other containers running on the host, containers should only be
granted the access they need to perform their work. To this end it should be possible to take advantage of Docker
features such as the ability to [add or remove capabilities](https://docs.docker.com/reference/run/#runtime-privilege-linux-capabilities-and-lxc-configuration) and [assign MCS labels](https://docs.docker.com/reference/run/#security-configuration)
**Member:**

What types of volumes would the MCS labels be used with? Presumably there aren't files that are sensitive for the container process in the emptydir. Is this for files in the hostDir, or some other type of volume?

**Contributor:**

Everything - the container would be relabeled, the process would have those labels, and any volumes would either be labelled or potentially left as is (in a few cases maybe this is reasonable?). Common case though is "you get these labels". I believe we have all but the volume support upstream and we carry the relabeling support on RHEL docker.


**Member:**

Okay, reading further I see you are talking about NFS, and stuff like that.

**Contributor:**

Yeah - actually relabel is bad if arbitrary (I shouldn't be able to relabel existing content because I tricked the master). It would be better to relabel only new content.


**Contributor:**

We should define a kubelet default security context as well - i.e., if nothing is specified, this is the context. The kubelet can just auto-assign uids locally for user namespaces and do similar for labels. At least some defense in depth.


to the container process.

Support for user namespaces has recently been [merged](https://github.com/docker/libcontainer/pull/304) into Docker's libcontainer project and should soon surface in Docker itself. It will make it possible to assign a range of unprivileged uids and gids from the host to each container, improving the isolation between host and container and between containers.

### External integration with shared storage
In order to support external integration with shared storage, processes running in a Kubernetes cluster
should be able to be uniquely identified by their Unix UID, such that a chain of ownership can be established.
Processes in pods will need to have consistent UID/GID/SELinux category labels in order to access shared disks.
**Member:**

Does this mean the in-namespace UID or the root-namespace UID?

**Contributor:**

For disk, the outside (root-namespace) UID. However, the user to run as inside the namespace may be something a user wants to change. If namespaces are not present, a mismatch between the two should reject the pod, maybe.



## Constraints and Assumptions
* It is out of the scope of this document to prescribe a specific set
of constraints to isolate containers from their host. Different use cases need different
settings.
* The concept of a security context should not be tied to a particular security mechanism or platform
(e.g. SELinux, AppArmor)
* Applying a different security context to a scope (namespace or pod) requires a solution such as the one proposed for
[service accounts](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297).

## Use Cases

In order of increasing complexity, the following are example use cases that would
be addressed with security contexts:

1. Kubernetes is used to run a single cloud application. In order to protect
   nodes from containers:
   * All containers run as a single non-root user
   * Privileged containers are disabled
   * All containers run with a particular MCS label
   * Kernel capabilities like CHOWN and MKNOD are removed from containers

2. Just like case #1, except that I have more than one application running on
   the Kubernetes cluster.
   * Each application is run in its own namespace to avoid name collisions
   * For each application a different uid and MCS label is used

3. Kubernetes is used as the base for a PaaS with
   multiple projects, each project represented by a namespace.
   * Each namespace is associated with a range of uids/gids on the node that
     are mapped to uids/gids on containers using linux user namespaces.
   * Certain pods in each namespace have special privileges to perform system
     actions such as talking back to the server for deployment, running docker
     builds, etc.
   * External NFS storage is assigned to each namespace and permissions set
     using the range of uids/gids assigned to that namespace.
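
To make the first use case concrete, a cluster-wide default could be expressed roughly as follows. This is an illustrative sketch only: it uses the `SecurityContext` and `ContainerIsolationSpec` types proposed later in this document, the uid is arbitrary, and the MCS label from the use case would be carried by the SELinux support sketched in the same section.

```go
// Illustrative only: a restrictive default for a single-application cluster.
// Every container runs as one non-root uid, privileged mode is disabled, and
// the CHOWN and MKNOD capabilities are removed.
var defaultContext = SecurityContext{
	User:               1001,
	AllowPrivileged:    false,
	RemoveCapabilities: []string{"CHOWN", "MKNOD"},
	Isolation:          ContainerIsolationSpec{Type: ContainerIsolationPrivate},
}
```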

## Proposed Design

### Overview
A *security context* consists of a set of constraints that determine how a container
is secured before getting created and run. It has a 1:1 correspondence to a
[service account](https://github.com/GoogleCloudPlatform/kubernetes/pull/2297). A *security context provider* is passed to the Kubelet so it can have a chance
to mutate Docker API calls in order to apply the security context.

It is recommended that this design be implemented in two phases:

1. Implement the security context provider extension point in the Kubelet
so that a default security context can be applied on container run and creation.
2. Implement a security context structure that is part of a service account. The
default context provider can then be used to apply a security context based
on the service account associated with the pod.
**Member:**

If pods on different nodes are accessing shared storage, their UIDs need to be unique across nodes. So, their uids need to be allocated either statically to nodes at node join time, or dynamically to pods at bind time, by some cluster level thing. Thoughts?

**Contributor:**

Correct - I believe that a reasonable default would be that each security context on the master (a service account that is the "default" for the namespace?) would get a UID allocated to it that no other security context would get. An administrator would then later be able to assign complementary UIDs across namespaces if needed. In the future, there could be additional security contexts that grant access to shared resources.


**Member:**

Okay. So, this could be done manually, or by a namespace creation helper client, or perhaps by a control loop. SGTM.


### Security Context Provider

The Kubelet will have an interface that points to a `SecurityContextProvider`. The `SecurityContextProvider` is invoked before creating and running a given container:

```go
type SecurityContextProvider interface {
	// ModifyContainerConfig is called before the Docker createContainer call.
	// The security context provider can make changes to the Config with which
	// the container is created.
	// An error is returned if it's not possible to secure the container as
	// requested with a security context.
	ModifyContainerConfig(pod *api.BoundPod, container *api.Container, config *docker.Config) error
```
**Member:**

Would it work to have the SecurityContextProvider just modify the api.BoundPod, and not take a docker.Config as an argument? Reasons:

* follow the pattern already in the code where we modify objects as they are passed along, including how we add env vars to the pod object in the kubelet.
* we probably will want a kubelet debug API that lets you see the "actual Pod started", including env vars and security context modifications
* we want to be able to bootstrap using http or file sources. So, this ensures that any security context information is expressible in the pod.
* It makes it a tad easier to put in a docker alternative eventually if docker.Config is not used in as many places in the code.

**Contributor:**

True - I think cesar (correct me if I'm wrong here) had started here because the options to docker may be complex - setting up user namespaces, labels, and default behavior. However, a two-step abstraction - making the docker interface we have from the kubelet support the additional options on BoundPods, and then adding the BoundPod options - seems reasonable.

Some security context stuff might be a finalizer at the master level. Security context, if applied on the kubelet for final defaults, and on the master for cluster-level isolation, seems similar to other finalizer-style patterns.

**Contributor:**

One thing I did think of - you may need to know (from the image) what user the image is going to run as, and like ENTRYPOINT I think it's frustrating to an end user to have to specify that up front in the pod definition. Some level of "map user X inside the container to Y outside" happening by default seemed potentially valuable. However, the two-step process (set up the security context, then pass to the docker interface) could also handle that.


**Member:**

There exist both "SCRATCH" container images (single process, single uid, not sensitive to the choice of uid) and "traditional" images (which have many entries in their /etc/passwd and which may have multiple processes running as multiple uids in them).

Should the default default [sic] security context support the rich base image style? If so, how? You'd need a range of UIDs, right, and you don't know how many until you examine the contents of the image. On the other hand, should we make it easy to also run the scratch style, and encourage it?

**Contributor (author):**

@erictune - Back to the first comment about only modifying the BoundPod ... If we want to only express intent in the pod definition, then we couldn't just mutate it to apply the security context. At some point the intent needs to become implementation. The security context provider which knows how to implement the pod's intent needs to make specific changes to the actual docker calls. Or am I missing something?

**Member:**

If the idea is to support other container formats in the future, seems like we would need a Security Context Provider associated with the underlying container runtime, but agree that ultimately, for the container runtime == docker, you need a docker.Config.


**Member:**

Good point. I withdraw my comments about ModifyContainerConfig. We can change the code pretty easily later to support things other than docker. The important thing is to get the API right.


```go
	// ModifyHostConfig is called before the Docker runContainer call.
	// The security context provider can make changes to the HostConfig, affecting
	// security options, whether the container is privileged, volume binds, etc.
	// An error is returned if it's not possible to secure the container as requested
	// with a security context.
	ModifyHostConfig(pod *api.BoundPod, container *api.Container, hostConfig *docker.HostConfig)
}
```

If the value of the SecurityContextProvider field on the Kubelet is nil, the kubelet will create and run the container as it does today.
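
For illustration, a minimal default provider along the lines discussed in the comments above might look roughly like the following. This is a sketch, not part of the proposal: it assumes the `api` and `docker` packages used by the interface above, go-dockerclient-style `Config`/`HostConfig` fields (`User`, `Privileged`, `CapDrop`, `SecurityOpt`), and invented names for the provider type and its fields.

```go
// defaultProvider is a hypothetical SecurityContextProvider that applies a
// conservative default when no explicit security context is available.
type defaultProvider struct {
	runAsUser string   // non-root uid assigned by the kubelet, e.g. "1001"
	mcsLabel  string   // SELinux level/MCS label, e.g. "s0:c1,c2" (empty if SELinux is unused)
	dropCaps  []string // capabilities to remove, e.g. "CHOWN", "MKNOD"
}

// ModifyContainerConfig pins the container to a non-root user unless the
// pod or image already requested one.
func (p *defaultProvider) ModifyContainerConfig(pod *api.BoundPod, container *api.Container, config *docker.Config) error {
	if config.User == "" {
		config.User = p.runAsUser
	}
	return nil
}

// ModifyHostConfig disables privileged mode, drops the configured
// capabilities, and applies the MCS label via Docker's security options.
func (p *defaultProvider) ModifyHostConfig(pod *api.BoundPod, container *api.Container, hostConfig *docker.HostConfig) {
	hostConfig.Privileged = false
	hostConfig.CapDrop = append(hostConfig.CapDrop, p.dropCaps...)
	if p.mcsLabel != "" {
		hostConfig.SecurityOpt = append(hostConfig.SecurityOpt, "label:level:"+p.mcsLabel)
	}
}
```

A provider along these lines would give the kubelet the "default security context" defense in depth mentioned in the comments earlier, while richer providers could look at the pod's service account to decide what to apply.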

### Security Context

A security context has a 1:1 correspondence to a service account and it can be included as
part of the service account resource. Following is an example of an initial implementation:

```go
// SecurityContext specifies the security constraints associated with a service account
type SecurityContext struct {
	// user is the uid to use when running the container
	User int
```
**Member:**

Containers can have multiple processes running as multiple uids. This may not be the recommended style of container, but my impression is that there are many of them out there.

**Member:**

One option is to declare that we don't support this style of container, but that excludes a lot of existing images, I think.

Another option is to just set the lead process to have this uid, but allow other processes to have other virtual uids which map back to useless physical uids.

Another option, which I'm not sure if it works at all, is to map multiple virtual uids to 1 physical uid.

(I'm using virtual to mean "in-namespace" and physical to mean "in the root linux namespace").

Another option is to use a per-volume strategy for virtual-to-physical mapping.
For NFS, you could use the NFS mount options to map a single local uid to a true remote uid using anonuid and anongid. Then you would map all the container uids to the local anongid. Like this:
container uid 0 maps to rootns uid 12345 using usernamespaces. rootns uid 12345 maps to NFS server uid 5432 using mount options.
Note that this does require an nfs mount for every container versus one mount per node, but I think that is the way y'all were planning anyhow.

I think the higher level question is: If there are two uids in a container, should their filesystem writes, at the canonical view of the filesystem, appear as one or two different uids/gids? I think "one" is simpler for users but harder for implementers.

**Contributor:**

> One option is to declare that we don't support this style of container, but that excludes a lot of existing images, I think.

Some, but that could also be the old school image (pre-pods) vs the new school (one process / user per container). We can probably get pretty far on that for Kube users.

> Another option, which I'm not sure if it works at all, is to map multiple virtual uids to 1 physical uid.

I don't think it works. In docker upstream we had a long discussion on this about ranges - we think we can allocate ranges and have this work, but you have to have large ranges. If people had to predeclare how many uids they need and had a map or something, we could maybe allocate a set for a namespace (10k was mooted per container before). I've also wondered whether we could just do two ranges - shared, and unshared. Shared is allocated by the master and cluster wide. Unshared is node scoped and each started container gets a set. I think you can then pass two ranges into the container. @mrunalp to keep me honest here.

> If there are two uids in a container, should their filesystem writes, at the canonical view of the filesystem, appear as one or two different uids/gids?

Yeah, although in practice I suspect 60-80% of containers that people should run will be single uid. So we can make single uid work well, and have multi uid be not quite as nice.

**Member:**

I agree with "make single uid work well, and have multi uid be not quite as nice".

**Member:**

Does "Not quite as nice" mean:

  1. "you have to run that old-school container/pod, and maybe your whole namespace or whole cluster, in a more permissive mode than otherwise".
  2. or "you have to write a verbose pod spec that includes arcane SecurityContext magic to get it to work at all"

I hope it means the first one. First priority is ability to move your existing dockerized workloads onto kubernetes with minimal pod spec writing. Second priority is to lock down your cluster.


```go
	// AllowPrivileged indicates whether this context allows privileged mode containers
	AllowPrivileged bool

	// AllowedVolumeTypes lists the types of volumes that a container can bind
	AllowedVolumeTypes []string

	// AddCapabilities is the list of Linux kernel capabilities to add
	AddCapabilities []string

	// RemoveCapabilities is the list of Linux kernel capabilities to remove
	RemoveCapabilities []string

	// Isolation specifies the type of isolation required for containers
	// in this security context
	Isolation ContainerIsolationSpec
}

// ContainerIsolationSpec indicates intent for container isolation
type ContainerIsolationSpec struct {
	// Type is the container isolation type (None, Private)
	Type ContainerIsolationType

	// FUTURE: IDMapping specifies how users and groups from the host will be mapped
	IDMapping *IDMapping
}

// ContainerIsolationType is the type of container isolation for a security context
type ContainerIsolationType string

const (
	// ContainerIsolationNone means that no additional constraints are added to
	// containers to isolate them from their host
	ContainerIsolationNone ContainerIsolationType = "None"

	// ContainerIsolationPrivate means that containers are isolated in process
	// and storage from their host and other containers.
	ContainerIsolationPrivate ContainerIsolationType = "Private"
)

// IDMapping specifies the requested user and group mappings for containers
// associated with a specific security context
type IDMapping struct {
	// SharedUsers is the set of user ranges that must be unique to the entire cluster
	SharedUsers []IDMappingRange

	// SharedGroups is the set of group ranges that must be unique to the entire cluster
	SharedGroups []IDMappingRange

	// PrivateUsers are mapped to users on the host node, but are not necessarily
	// unique to the entire cluster
	PrivateUsers []IDMappingRange

	// PrivateGroups are mapped to groups on the host node, but are not necessarily
	// unique to the entire cluster
	PrivateGroups []IDMappingRange
}

// IDMappingRange specifies a mapping between container IDs and node IDs
type IDMappingRange struct {
	// ContainerID is the starting container ID
	ContainerID int

	// HostID is the starting host ID
	HostID int

	// Length is the length of the ID range
	Length int
}

// SELinuxContext specifies the SELinux settings for container processes
type SELinuxContext struct {
	// MCS label/SELinux level to run the container under
	Level string
	// SELinux type label for container processes
	Type string
	// FUTURE:
	// LabelVolumeMountsExclusive []Volume
	// LabelVolumeMountsShared []Volume
}

// AppArmorContext specifies the AppArmor profile for container processes
type AppArmorContext struct {
	// AppArmor profile
	Profile string
}
```
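
As an aside on the shared/private split discussed in the comments on the `User` field above, a hypothetical mapping for one namespace might look like the following; the specific uids and range sizes are invented for illustration.

```go
// Hypothetical mapping for one namespace: container uids 0-9 map to a
// cluster-unique host range (so files on shared storage have a stable owner),
// while container uids 10-10009 map to a node-local range that does not need
// to be unique across the cluster.
var exampleMapping = IDMapping{
	SharedUsers: []IDMappingRange{
		{ContainerID: 0, HostID: 100000, Length: 10},
	},
	PrivateUsers: []IDMappingRange{
		{ContainerID: 10, HostID: 200000, Length: 10000},
	},
}
```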
**Member:**

I agree that we would want to use all the mechanisms mentioned above (capabilities, MCS labels, apparmor profiles) if available. And that the initial implementation should use these, since RedHat has so much expertise with these.

At the same time, it is very much tied to a specific implementation. This makes it harder for users to understand so much detail, and harder to drop in alternative implementations should we ever want to do that. Examples of different implementations: some company might use grsecurity and PaX already (I don't, but someone might); some hosting provider might write a different implementation that has a similar effect (I can see us doing that).

So, can you think of a way to divide this up into two layers: one that is a core API object that expresses intent, and another, which implements the intent?

**Member:**

For example, if my intent is: "This container should not have the same identity as any other container, both for node-local resources and for shared (storage) resources", then I am pretty sure the system could automatically come up with a User, SELinux.Level, SELinux.Type, and AppArmor.Profile settings. The question that needs some thought is whether most other intents can similarly be expressed at an abstract level.

**Contributor:**

At a minimum, anything that is not 100% all Linuxes should be an extension plugin (or a default extension). No disagreement from me.



**Contributor:**

This is an interesting question - volumes are very low level (give me EXACTLY this thing). Security context as modeled is a bit more like volumes. It means a finalizer goes and turns a generic intent (maybe expressed by the admin or the namespace) into a specific context on the pod (pod should run as this UID). You're suggesting the opposite, something that the user can set ("hey, I want this kind of security context"), and then something has to go finalize and specialize it.

**Member:**

I see developers as:

* writing pod specs.
* making some of their own images, and using some "community" docker images
* knowing which pods need to talk to which other pods
* deciding which pods should share files with other pods, and which should not.
* reasoning about application level security, such as containing the effects of a local-file-read exploit in a webserver pod.
* not often reasoning about operating system or organizational security.
* not necessarily comfortable reasoning about the security properties of a system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc.

I see project admins as:

* allocating identities and roles and namespaces.
* reasoning about organizational security:
  * don't give a developer permissions that are not needed for their role.
  * protect files on shared storage from unnecessary cross-team access
* less focused on application security

I see cluster admins as:

* less focused on application security; focused on operating system security.
* protecting the node from bad actors in containers, and properly-configured innocent containers from bad actors in other containers.
* comfortable reasoning about the security properties of a system at the level of detail of Linux Capabilities, SELinux, AppArmor, etc.
* deciding who can use which Linux Capabilities, run privileged containers, use hostDir, etc.
  * e.g. a team that manages Ceph or a mysql server might be trusted to have raw access to storage devices in some organizations, but teams that develop the applications at higher layers would not.

Do you agree that those are reasonable separations of responsibilities for a Kubernetes cluster?

If so, do you think the current design allows those three groups to work independently of each other and to focus on the information they need? I'm not sure; I'm trying to think that through.

**Contributor:**

> don't give a developer permissions that are not needed for their role.

Also: concerned that some things running at higher trust (builds can push images to docker repo X) don't get abused by ordinary developers. Your phrasing is fine, just adding a scenario we think about.

> Do you agree that those are reasonable separations of responsibilities for a Kubernetes cluster?

Yes, nailed it. Those distinctions should be in this proposal and in service account (or whatever context it takes).

> If so, do you think the current design allows those three groups to work independently of each other and to focus on the information they need?

I think the proposal doesn't describe the higher level pieces that are in service account and secrets, but assumes they exist. I would feel like service account is a concept for the developer end of the spectrum and security context is much more about the other end. The cluster and project admins must allow developers to have capabilities, the developers must understand how they use those capabilities, and in general higher level developer concepts get boiled down into security contexts and execution details. So this proposal is definitely talking about a part of the overall story.

The outcome of these proposals / prototypes should at minimum include a document that describes the above and how the pieces provide that spectrum.

I think at this point that I could argue a convincing story about:

* how namespaces get a service account by default (configured by cluster admins)
* how project admins can tweak both those and authorization policies (what we are working through here: https://github.com/openshift/origin/blob/master/docs/proposals/policy.md) to properly isolate those
* how some secrets get added by default or via developers directly adding them to service accounts or their pods
* how service accounts are converted to a security context down to the kubelet via finalizers / admission controller
* how secrets from a service account (or other mechanism) could be bind mounted into containers, either a la the docker vault / secrets proposals, or via a more specific volume type
* how the kubelet could take the info on the security context and turn it into a user namespace / labels / volumes
* how developers could adapt their images and applications to work within the limitations of the above items

There is of course a lot of handwaving in between those bits.

**Member:**

It would be great to see your above story plus your comments in #2297 in a single overview document. If you or @csrwng have time to do that, it would be great, since you seem to have the big picture. Otherwise, I'm willing to make an attempt.

**Contributor:**

We'll try for Monday, and then we can collaboratively edit. Agreed, an overarching story is a gap - we're designing the bits, but not articulating how they flow from a central point.


**Member:**

Okay, I read the https://github.com/openshift/origin/blob/master/docs/proposals/policy.md. That looks pretty cool and well thought out. I see now how you separate cluster admin and project admin responsibilities with the master namespace versus other namespaces.




#### Security Context Lifecycle

The lifecycle of a security context will be tied to that of a service account. It is expected that a service account with a default security context will be created for every Kubernetes namespace (without administrator intervention). If resources need to be allocated when creating a security context (for example, assign a range of host uids/gids), a pattern such as [finalizers](https://github.com/GoogleCloudPlatform/kubernetes/issues/3585) can be used before declaring the security context / service account / namespace ready for use.
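
As a rough sketch of how such a finalizer-style flow could allocate identities (this is not part of the proposal; the allocator type and function below are hypothetical), a cluster-level control loop might reserve a host uid range for each namespace's default security context before declaring it ready:

```go
// uidRangeAllocator is a hypothetical cluster-level allocator that hands out
// non-overlapping host uid ranges, one per default security context.
type uidRangeAllocator struct {
	nextBase int // next unallocated host uid
	size     int // number of uids granted per security context
}

func (a *uidRangeAllocator) allocate() IDMappingRange {
	r := IDMappingRange{ContainerID: 0, HostID: a.nextBase, Length: a.size}
	a.nextBase += a.size
	return r
}

// finalizeSecurityContext reserves a cluster-unique uid range for a newly
// created default security context before it is declared ready for use.
// Persistence, conflict handling, and reclamation of ranges are omitted.
func finalizeSecurityContext(a *uidRangeAllocator, sc *SecurityContext) {
	sc.Isolation.IDMapping = &IDMapping{
		SharedUsers: []IDMappingRange{a.allocate()},
	}
}
```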