user namespace proposal #34569

Closed
wants to merge 5 commits into from

Conversation

pweil-
Contributor

@pweil- pweil- commented Oct 11, 2016

Continuing the discussion on options for user namespace support started in #31169 and #30684

@derekwaynecarr @smarterclayton @vishh @pmorie @mrunalp

I've outlined the options a bit more and included details on some of the issues I ran into when testing user namespaces locally. PTAL at the options and provide feedback. Depending on which way we go I can probably poach a lot of the existing work.



@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Oct 11, 2016
@k8s-ci-robot
Contributor

Jenkins verification failed for commit f790977. Full PR test history.

The magic incantation to run this job again is @k8s-bot verify test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

Contributor

@smarterclayton smarterclayton left a comment

Add rkt specific section if there are any implications.

@euank
Contributor

euank commented Oct 13, 2016

cc @kubernetes/sig-node. I also would have appreciated a ping since I recall asking for this proposal 😃

@pweil-
Contributor Author

pweil- commented Oct 13, 2016

sorry about that @euank 👍

@euank
Contributor

euank commented Oct 13, 2016

Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, some commit checks broke.


docs/proposals/user-namespace.md, line 80 at r1 (raw file):

should we validate in the API server for misconfiguration or allow
this to fail in the container runtime?
1.  Is there a valid use case for running some nodes with remapping

At the very least, we should expect a mixed environment while updating an existing cluster to this feature.


docs/proposals/user-namespace.md, line 106 at r1 (raw file):

1. requires extra flags to make the system aware of the remapped
environment (is this required anyway due to the mount issue?)
1. the user may unexpectedly run a container with the host user namespace.

I really don't like this con.

It seems really dangerous that I might have a pod that runs with userns, and then I make some change like add a hostpath volume and it now doesn't run with userns.

Even if we do defaulting, I'd rather be able to have a field in SecurityContext I can toggle to avoid defaulting and say "userns always" so it errors out instead of runs with worse security.


docs/proposals/user-namespace.md, line 135 at r1 (raw file):

1. requires pod authors to have linux and cluster environment specific 
knowledge to know that the pod spec must contain a flag (unless defaulting

So if we implement it with defaulting there's no cons at all, sounds like an easy decision :)


Comments from Reviewable

@euank
Contributor

euank commented Oct 13, 2016

Mention of how this will play with rkt, would be nice. This page has some good info. (cc @lucab)

We could probably transition from option 1 to option 2. Perhaps "1 then later 2" could be called out as an explicit option?

@euank
Contributor

euank commented Oct 13, 2016

Review status: 0 of 1 files reviewed at latest revision, 4 unresolved discussions, some commit checks broke.


docs/proposals/user-namespace.md, line 146 at r1 (raw file):

This may be accomplished by ensuring group access to the correct
directories.  It implies that the Kubelet will be given (or read)
the value of `/etc/subgid` to perform a `chmod`.  

Will kubelet also manage this file?



@pweil-
Contributor Author

pweil- commented Oct 13, 2016

Review status: 0 of 1 files reviewed at latest revision, 4 unresolved discussions, some commit checks broke.


docs/proposals/user-namespace.md, line 106 at r1 (raw file):

Previously, euank (Euan Kemp) wrote…

I really don't like this con.

It seems really dangerous that I might have a pod that runs with userns, and then I make some change like add a hostpath volume and it now doesn't run with userns.

Even if we do defaulting, I'd rather be able to have a field in SecurityContext I can toggle to avoid defaulting and say "userns always" so it errors out instead of runs with worse security.

In addition to something on the SC we'd probably also want something on policy that can force this setting. Otherwise someone will forget to mark it and whoops, we're in the same situation again. That said, there is precedent for this in the non-root UID strategy of PSP which sets a flag on the SC and is subsequently assured in the Kubelet prior to running. I can add details around this.

docs/proposals/user-namespace.md, line 135 at r1 (raw file):

Previously, euank (Euan Kemp) wrote…

So if we implement it with defaulting there's no cons at all, sounds like an easy decision :)

well, except the cons listed above for the defaulting options which I agree can probably be severe without something to govern the defaulting 😄

docs/proposals/user-namespace.md, line 146 at r1 (raw file):

Previously, euank (Euan Kemp) wrote…

Will kubelet also manage this file?

No, it should not be managed by the Kubelet.


@pweil- pweil- force-pushed the userns-feature branch 2 times, most recently from 6c1c51c to 1a3d164 Compare October 17, 2016 18:26
@pweil-
Contributor Author

pweil- commented Oct 17, 2016

@smarterclayton @euank I added a small section about rkt. It is interesting that the implementation will pick random UID pieces. I didn't find mention of GID mappings so maybe someone can weigh in on that piece.

The main issue, I think, is ensuring that the container has access to the /var/lib/kubelet/pods directories so it may affect how that implementation proceeds.

Other feedback about the options here?

@pweil-
Contributor Author

pweil- commented Oct 25, 2016

bump

@vishh based on the feedback so far I think the only new action item is ensuring we know when we're using remapping and resolving permissions on the mounts

* MKNOD
* SYS_MODULE
* SYS_TIME
1. explicitly granted permissions to use a host volume and escape the private user namespace
Contributor

Does this mean that if a pod is mounting a host volume, it will run in host namespace automatically rather than pod's user namespace. Wonder if there is a possibility that an admin can map the UID/GID of the hostpath with correct userns UID/GID mapping for running pod so that access is granted.

Contributor Author

yes, if the pod has a host mount it would run in the host user namespace. I think any remapping that could be done would need to be done by the Kubelet or the storage infrastructure. That starts to get pretty tricky though, especially if the remapped UIDs are allocated and not static.

I'm going to add a section about mounts so we can discuss

Contributor

The use case here is that I have a distributed file system which does not have native support for Kubernetes, and I plan to mount it on the underlying hosts and expose it to containers running on the cluster. So with the current proposal I can't use user namespaces at all, since the kubelet will run the pod in the host user namespace if I use a host volume. I think there will be more users in my situation, so having a way to override the defaults would help. I do know that in this case I have to maintain the subuid/subgid mapping to get access working in the container, but that can be managed given that I know my users and can expose /etc/{passwd,group} into containers with the corresponding UIDs/GIDs.

So if you have any other way of solving this then please comment.

Contributor Author

This may be a use case for exposing user namespace on the security context. If that were exposed via the API and you did not turn on defaulting, you would still be able to escape user namespaces when required and then set your --userns-remap={your managed name} so you can manually set up the UIDs and groups that provide the correct DAC.

@mrunalp @php-coder @yujuhong thoughts here?

Contributor

Yeah, that is true. As you mentioned, if I do not enable defaulting I currently have no way to run pods with host namespaces (e.g. net) or privileged pods, which I need in certain namespaces, e.g. an ingress controller.

Contributor Author

@yujuhong - thoughts on api exposure? I think it could probably be done as a follow-up to an initial non-alpha implementation if we want to just focus on the bare essentials here for the 80% case.


### Constraints

* If containers share any namespace they must share the user namespace.
Member

This is inconsistent wording. It is not clear what you mean.

  1. If containers share any namespace, they must share the user namespace.

  2. If any namespace is set to host, the pod must use the host user namespace.

Assume we have 3 namespaces - net, IPC, user. Default is to share all of them. If I specify hostNetwork: true, but leave IPC as default, clause 1 says they must share the userns. But clause 2 says they must use the host userns.

???

Contributor

Each non-user ns has an owning user ns. If net is set to host, then the owning user ns for net will be host user ns and by extension for ipc as well. So, statement 2 is valid but could be reworded better.

Contributor

@mrunalp mrunalp Jan 4, 2017

To clarify further, if net=host, then ipc could still be container's private ipc namespace, but user ns will have to be host. Basically, when we set any ns to host, then we are disabling user namespaces.


### Option 1: System Defaulting

In this option, there is no `HostUserNamespace` flag exposed in the API.
Contributor

Does SecurityContext.RunAsUser imply using the host namespace, or entirely depend on what user namespace the container is using?

It is debatable whether or not the kubelet should provide defaulting based on
this knowledge since it may mean that a container is run with a
configuration that is not what the pod author expected. However, if
policy can check that the defaulted configuration is valid for the
Contributor

I assume kubelet will be performing the check and report an event or set the pod status?


Cons:

1. requires extra flags to make the system aware of the remapped
Contributor

It'd be nice to be able to detect this, instead of adding yet another flag to kubelet.

Contributor Author

agree, I'd like to not have to tell the Kubelet and have it detect it


This may be accomplished by ensuring group access to the correct
directories. It implies that the Kubelet will be given (or read)
the value of `/etc/subgid` to perform a `chmod`.
Contributor

The implementation seems too docker-specific. Can we abstract it a little bit to work with CRI?

Contributor Author

Yes, and after reading the rkt implementation I think this needs to come from the CRI anyway which would help us with auto detection mentioned above.

Contributor

Agreed. Let's make sure the abstraction is more runtime-agnostic.

* MKNOD
* SYS_MODULE
* SYS_TIME
1. explicitly granted permissions to use a host volume and escape the private user namespace
Contributor

What about persistent disks?

Member

This is an interesting point. For block devices it is possible to do a client-side chown; for things like NFS it generally is not. So, it seems like whether a particular FS will work in a userns-remapped container depends on the filesystem and/or our support for mounting or chown/chmoding that FS in the right way to be usable from the container's user namespace.

@k8s-github-robot k8s-github-robot added the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Nov 10, 2016
@apelisse apelisse removed the do-not-merge DEPRECATED. Indicates that a PR should not merge. Label can only be manually applied/removed. label Nov 11, 2016
@thockin
Member

thockin commented Nov 22, 2016

@yujuhong do you want to own this one?

@smarterclayton
Contributor

RunAsUser cares only about the inside of the container. We knew we would
eventually want range remapping but the interplay with volumes left that up
in the air.


@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 4, 2017
unknown if private user namespaces will be used until the container launch
command is created.

As a consistent solution to both driving any behavior to escape
Contributor Author

@yujuhong - I was thinking about the feedback we discussed where the kubelet could be told by the CRI that it was in a remap mode but I don't think that is compatible with both docker and rkt since rkt uses a flag on the run command vs something like a daemon flag. Maybe I'm missing something, the feature gate seemed like it would support both - WDYT?

Contributor Author

@yujuhong more info added to the proposal here but calling it out directly since we've discussed it - in docker 1.13+ you can discover the userns environment via the /info endpoint. For backwards compatibility I think we'd still need a gate.

is never defaulted to use the host user namespace. This falls under
the responsibility of cluster policy with `PodSecurityPolicy`.

`PodSecurityPolicy` has the ability to control using the host network,
Contributor Author

@euank - after thinking about this a bit more I realized that we already have the ability to control what would be defaulted and what wouldn't via the PSP structures already in place. This has the disadvantage of not being obvious (i.e. an AlwaysUserPrivateUserNs on the SecurityContext) but should already be working.

How do you feel about that path?


### Constraints

* All non-user namespaces have an owning user namespace. Actions in
Contributor Author

@mrunalp @thockin - tried to reword this a bit better based on feedback.


TODO - are GIDs being remapped in the rkt implementation?

TODO - specific information on mount types - some support chmod and
Contributor Author

@php-coder - FYI

with a systemd-nspawn based implementation. Each pod started will
be allocated a random UID range.

To user private user namespaces in rkt you specify the `--private-users`
Contributor

s/To user/To use/

@k8s-ci-robot
Contributor

Jenkins Bazel Build failed for commit 19eaae2. Full PR test history.

The magic incantation to run this job again is @k8s-bot bazel test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@pweil-
Contributor Author

pweil- commented Jan 9, 2017

added more info on proposed changes to directory permissions and discovery of the remap environment

@yujuhong PTAL when you get a moment, thanks!

to items like the termination logs.

How this GID is retrieved should be a container runtime detail. This
leaves the container runtimes free to manage the permissions of the
Contributor Author

thoughts on allowing each runtime to manage the GID ownership of the pod root directory and below?

@k8s-ci-robot
Contributor

Jenkins unit/integration failed for commit 832d335. Full PR test history.

The magic incantation to run this job again is @k8s-bot unit test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@pweil-
Contributor Author

pweil- commented Jan 10, 2017

bump 😄 @mrunalp @euank @yujuhong @smarterclayton @thockin

Looking for thoughts on the following topics:

  1. Since rkt does not depend on a daemon level setting (and rather uses a flag when launching the container) I don't think we have a nice solution to tell whether we're in remap mode.
    1. docker will tell you (as of 1.13) that it is doing remapping via the info endpoint
  2. Since docker uses static UID/GIDs and rkt uses runtime-allocated UID/GIDs, I don't think we can have a nice way (right now) to know what the GID is ahead of time and allow the kubelet to manage access. This leads me to believe the right solution for managing access to things like the termination log should live in the individual runtime implementations. Prototyped here for docker
  3. 👍 or 👎 on whether folks think host namespace should be exposed via the security context in the initial implementation

@php-coder is looking into volume types. Essentially, if we can define a nice way to get the GID from the runtime (don't think it's possible right now) we could use fsgroup to chmod on volume types that support it. Otherwise the permissions need to be managed by an administrator.

@pweil-
Contributor Author

pweil- commented Jan 10, 2017

@pmorie care to review?

@pweil-
Contributor Author

pweil- commented Jan 10, 2017

I should also note that a chmod type solution like the one I'm suggesting for termination log (and similar to the fsgroup chmod) will already be performed by the docker runtime for us in some cases. For example:

  1. any "networky" type files that are mounted in, i.e. the /var/lib/kubelet/pods/{uid}/etc-hosts file
  2. host mount directories that do not already exist


TODO

* are GIDs being remapped in the rkt implementation? If not then
Contributor Author

they are remapped to the same value as the allocated UID. Will update the doc

* are GIDs being remapped in the rkt implementation? If not then
root group may be enough.
* more specific information on mount types
* can the remap GID be injected as an `fsGroup` for mount types that
Contributor Author

leaning toward this being a later TODO - for now FSGroup can be controlled by PSP

@pweil-
Contributor Author

pweil- commented Jan 18, 2017

@mrunalp @thockin @yujuhong @euank

Since I haven't received much feedback here I want to try and break this down to what the most minimal changes are that I believe will support using userns in a continued experimental alpha/beta form.

  1. ensure that there is at least read access to /var/lib/kubelet/pods - required due to the fact that the pod no longer runs with the root group.
  2. when in remap mode (via config or docker info) ensure that /var/lib/kubelet/pods/{uid} and below is owned by the remap group (done in the runtime impl so that each can handle it differently)
  3. assume that, in remap mode, the FSGroup is the remap group - ie the admin can tell the system what the remap group is without some new flag.
  4. don't include the FSGroup in the GroupAdd HostConfig (including the remap group in the --group-add argument is invalid so we can ignore it)

The thought here is that FSGroup already works to chown some types of volumes which would then be available in the pod without further intervention (tested with emptydir to confirm it was writable). It also means that the admin can control the remap group via PSP and we would not have to discover it by reading the /etc/subgid file after first trying to discover what the --userns-remap setting was for the daemon (1.13+ tells us we are remapping with /info but does not provide anything additional).

This is prototyped with pweil-@16f29eb.

Thoughts about introducing this behavior? It can all be gated by the current experimental flag IMO.

As upstream runtime implementations mature we may reduce our need to manage the file system (shiftFS, runtime management of ownership such as the network mounts in Docker, providing runtime mappings rather than hard coded or black box allocation). I think at this point it is about the best we can do as these other items work themselves out.

@vishh
Contributor

vishh commented Jan 18, 2017

@pweil- It is kind of annoying to have kubelet adapt to docker's remapping strategy. Is it not possible to inject one to one mapped gids anymore with userns?

@pweil-
Contributor Author

pweil- commented Jan 18, 2017

It is kind of annoying to have kubelet adapt to docker's remapping strategy. Is it not possible to inject one to one mapped gids anymore with userns?

Can you elaborate? Yes, an admin can have more than 1 set of GIDs that end up in the container but there is still the issue of directory access which is the kubelet (not docker_manager) change I'm proposing.

My suggestion is that we enable the minimum amount of access in the kubelet and then allow the runtime to worry about the pod directory and allow FSGroup to work for other volume types.

@pweil-
Contributor Author

pweil- commented Jan 19, 2017

@mrunalp @thockin @yujuhong @euank @vishh @pmorie added info about my suggestion above of assuming the FSGroup is the correct GID as a phase 2 approach of experimental support. I will submit a PR detailing the change so it is easier to visualize the extent.

Feedback welcome, willing to provide bribes. Supporting userns with volumes that support ownership management seems like a huge leap forward for folks while the underlying implementation of something like shiftFS and the ability to pass specific remap IDs progresses in the upstreams.

@k8s-ci-robot
Contributor

Jenkins GCE e2e failed for commit 0296d5a. Full PR test history. cc @pweil-

The magic incantation to run this job again is @k8s-bot cvm gce e2e test this. Please help us cut down flakes by linking to an open flake issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

PID, and IPC namespaces, which capabilities can be added, as well as
which types of volumes can be used. This effectively controls who
will be allowed to trigger the Kubelet defaulting logic that would
escape private user namespaces.
Contributor

@pweil- I am not sure even using PSP will help me in the hostpath case. As mentioned earlier, in the distributed file system case the DFS is mounted on each underlying host, and I need to allow users to use a PVC backed by hostpath. The current design will then escape the user namespace. I would really like an option in PSP or in the pod spec security context that lets me choose to honor the defaulting logic or override it. Since the defaulting logic is already implemented in 1.5 and we are talking about phase 2, I think such an option is quite necessary for users like me who have a DFS and want to use user namespaces.

Contributor Author

@gurvindersingh what we discussed earlier is the ability to turn off defaulting and expose host user namespace on the api which hasn't received buy-in yet. Do you think that will not support your use case anymore? Thanks for the feedback.

Contributor

@pweil- Referring to this comment about AlwaysUserPrivateUserNs on the SecurityContext: if I understand it correctly, it means I can define a PSP on Kubernetes namespaces where all pods will always run in user namespaces, while on others, where I allow host namespaces, pods can escape. If this understanding is correct then I think it will solve the case, and 👍 from me to get this into the current phase 2. With this in, I will not need to enable defaulting at all as long as I have correct PSPs.

Contributor

@pweil- wondering if there is any update on exposing the AlwaysUserPrivateUserNs on the SecurityContext. Thanks.

Contributor Author

@gurvindersingh after a discussion with @yujuhong it was decided that this work will wait until the shim work is complete to avoid unnecessary complications. But I do think that exposing the host user namespace via the PSP is reasonable and would support your use case. I'll update the proposal for discussion.

Contributor

@pweil- so the current proposal implementation is targeted for 1.6 or 1.7 ?

Contributor Author

@gurvindersingh 1.7 - nothing will go into 1.6 from this document.

Contributor

@pweil- got it.

Member

@pmorie pmorie left a comment

Raised some questions but this generally makes sense as outlined.

the `/etc/subgid` files to determine [how to map UIDs and GIDs
in the container](https://docs.docker.com/engine/reference/commandline/dockerd/#detailed-information-on-subuidsubgid-ranges).

Additionally, the `HostConfig` for a container now contains a parameter
Member

Might want to add a note about your prior host userns work here


## Kubernetes Support

As of 1.5 [experimental support](https://github.com/kubernetes/kubernetes/pull/31169)
Member

Oops, there it is.

1. Negative oomScoreAdj values fail. In this case the pause container
cannot be started and shows an error `write /proc/self/oom_score_adj: permission denied`
in the daemon log. Failing in docker 1.10, working in 1.11+
1. Cannot share the host or a container's network namespace when user
Member

I get the issue with the host network namespace here but not another container's network namespace. Is this saying that sharing a network namespace among containers in a pod will break? Or that in order to share network namespaces, two processes have to be in the same user namespace?

in the daemon log. Failing in docker 1.10, working in 1.11+
1. Cannot share the host or a container's network namespace when user
namespaces are enabled. [Fixed in 1.11](https://github.com/docker/docker/pull/21383).
1. Permission denied mounting. Occurs due to ownership permissions
Member

If we just use the FSGroup for this, does it work? I actually think this represents a bug...

* All non-user namespaces have an owning user namespace. Actions in
the non-user namespace require [capabilities that are checked against
the user namespace](http://man7.org/linux/man-pages/man7/user_namespaces.7.html).
This means that when containers share a non-user namespace they must also

Question answered... you might want to move the constraints stuff up, since it informs some of the text in the prior sections.

supplemental groups) the pod has access to read these directories. In
a remap environment the group itself may change so the pod will no longer
be able to access these directories. In order to fix this the root
directory should be given 0755 access in the Kubelet's `setupDataDirs`.

and what ownership?


ps: I have had a lot of trouble with constraints on attributes of the root directory, so you might consider whether the kubelet should self-regulate this based on a flag.


Beneath the Kubelet root directory each pod is given it's own directory

its

Rather than increasing the access on directories under the root directory,
the pod directory (and subdirectories) may be owned by the remapped
GID. This is very similar to `fsGroup`; however, `fsGroup` is not applied
to items like the termination logs.

Yeah, seems like a bug, huh?

makes sense that the ability to allocate a range should belong to the
orchestration system. This would allow Kubernetes to know exactly
what GID requires access to the directories and ensure the correct
access is granted.
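An orchestrator-side allocator for remap ranges might look roughly like the sketch below. `rangeAllocator` and its field values are hypothetical, not an existing Kubernetes API; the start of 100000 and blocks of 65536 simply mirror common subordinate-ID defaults:

```go
package main

import "fmt"

// rangeAllocator hands out non-overlapping blocks of host IDs that the
// orchestration system could assign to per-pod user namespaces, so it
// always knows which GID needs access to a pod's directories.
type rangeAllocator struct {
	next int // first unallocated host ID
	size int // IDs handed out per allocation
	max  int // first ID past the allocatable space
}

// Allocate returns the start of the next block, or ok=false when the
// space is exhausted.
func (a *rangeAllocator) Allocate() (start int, ok bool) {
	if a.next+a.size > a.max {
		return 0, false
	}
	start = a.next
	a.next += a.size
	return start, true
}

func main() {
	// Room for exactly four blocks of 65536 IDs.
	a := &rangeAllocator{next: 100000, size: 65536, max: 100000 + 4*65536}
	for {
		start, ok := a.Allocate()
		if !ok {
			break
		}
		fmt.Println(start)
	}
}
```

Each allocated start value (and the range length) would then drive both the container runtime's remap configuration and the ownership applied to the pod's directories.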

Does this imply some future integration between the runtime and PSP allocators for these IDs?

* MKNOD
* SYS_MODULE
* SYS_TIME
1. explicitly granted permissions to use a host volume and escape the private user namespace

This is an interesting point. For block devices it is possible to do a client-side chown; for things like NFS it generally is not. So, it seems like whether a particular FS will work in a userns-remapped container depends on the filesystem and/or our support for mounting or chown/chmoding that FS in the right to be usable from the container's user namespace.

@k8s-github-robot

[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS Files:

We suggest the following people:
cc @smarterclayton
You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot

This PR hasn't been active in 90 days. Closing this PR. Please reopen if you would like to work towards merging this change, if/when the PR is ready for the next round of review.

cc @pweil- @smarterclayton @thockin @yujuhong

You can add 'keep-open' label to prevent this from happening again, or add a comment to keep it open another 90 days
