user namespace proposal #34569
Conversation
Jenkins verification failed for commit f790977. Full PR test history.
Add an rkt-specific section if there are any implications.
cc @kubernetes/sig-node. I also would have appreciated a ping since I recall asking for this proposal 😃

sorry about that @euank 👍
Review status: 0 of 1 files reviewed at latest revision, 3 unresolved discussions, some commit checks broke. docs/proposals/user-namespace.md, line 80 at r1 (raw file):
At the very least, we should expect a mixed environment while updating an existing cluster to this feature. docs/proposals/user-namespace.md, line 106 at r1 (raw file):
I really don't like this con. It seems really dangerous that I might have a pod that runs with Even if we do defaulting, I'd rather be able to have a field in docs/proposals/user-namespace.md, line 135 at r1 (raw file):
So if we implement it with defaulting there's no cons at all, sounds like an easy decision :) Comments from Reviewable
Review status: 0 of 1 files reviewed at latest revision, 4 unresolved discussions, some commit checks broke. docs/proposals/user-namespace.md, line 146 at r1 (raw file):
Will kubelet also manage this file?
Review status: 0 of 1 files reviewed at latest revision, 4 unresolved discussions, some commit checks broke. docs/proposals/user-namespace.md, line 106 at r1 (raw file):
Force-pushed from 6c1c51c to 1a3d164 (compare).
@smarterclayton @euank I added a small section about rkt. It is interesting that the implementation will pick random UID pieces. I didn't find mention of GID mappings so maybe someone can weigh in on that piece. The main issue, I think, is ensuring that the container has access to the

Other feedback about the options here?
bump @vishh. Based on the feedback so far I think the only new action item is ensuring we know when we're using remapping and resolving permissions on the mounts.
* MKNOD
* SYS_MODULE
* SYS_TIME
1. explicitly granted permissions to use a host volume and escape the private user namespace
Does this mean that if a pod is mounting a host volume, it will run in the host user namespace automatically rather than the pod's user namespace? I wonder if there is a possibility that an admin can map the UID/GID of the hostPath with the correct userns UID/GID mapping for the running pod so that access is granted.
yes, if the pod has a host mount it would run in the host user namespace. I think any remapping that could be done would need to be done by the Kubelet or the storage infrastructure. That starts to get pretty tricky though, especially if the remapped UIDs are allocated and not static.
I'm going to add a section about mounts so we can discuss
The use case here is that I have a distributed file system which does not have native support for Kubernetes, and I plan to mount it on the underlying hosts and expose it to containers running on the cluster. So with the current proposal I can't use user namespaces at all, as the kubelet will run the pod in the host user namespace if I use a host volume. I think there will be more users in my situation, so having a possibility to override the defaults would help. I do know that in this case I have to maintain the subuid/subgid mapping to get access working in the container, but that can be managed given that I know my users and can expose /etc/{passwd,group} into containers with corresponding UID/GIDs.

So if you have any other way of solving this then please comment.
This may be a use case for exposing user namespace on the security context. If that was exposed via the API and you did not turn on defaulting, you would still be able to escape user namespaces when required and then set your `--userns-remap={your managed name}` so you can manually set up the UIDs and groups that provide the correct DAC.

@mrunalp @php-coder @yujuhong thoughts here?
yeah, that is true. As you mentioned, if I do not enable defaulting I currently have no way to run pods with host namespaces (e.g. net) or privileged pods, which I need to run in certain namespaces, e.g. an ingress controller.
@yujuhong - thoughts on api exposure? I think it could probably be done as a follow up to an initial non-alpha implementation if we want to just focus on the bare essentials here for the 80% case.
### Constraints
* If containers share any namespace they must share the user namespace.
This is inconsistent wording. It is not clear what you mean.

- If containers share any namespace, they must share the user namespace.
- If any namespace is set to `host`, the pod must use the host user namespace.

Assume we have 3 namespaces - net, IPC, user. Default is to share all of them. If I specify `hostNetwork: true`, but leave IPC as default, clause 1 says they must share the userns. But clause 2 says they must use the host userns. ???
Each non-user ns has an owning user ns. If net is set to host, then the owning user ns for net will be host user ns and by extension for ipc as well. So, statement 2 is valid but could be reworded better.
To clarify further, if net=host, then ipc could still be container's private ipc namespace, but user ns will have to be host. Basically, when we set any ns to host, then we are disabling user namespaces.
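The rule described above can be captured in a few lines. This is only an illustrative sketch (not kubelet code) — the function name and flag names are hypothetical:

```python
# Sketch of the constraint: every non-user namespace has an owning user
# namespace, so sharing ANY namespace with the host forces the pod into
# the host user namespace (i.e. disables private user namespaces).

def must_use_host_userns(host_network: bool, host_ipc: bool, host_pid: bool) -> bool:
    """Return True when the pod cannot use a private user namespace."""
    return host_network or host_ipc or host_pid

# net=host but IPC left private: IPC can stay private,
# yet the user namespace must still be the host's.
assert must_use_host_userns(host_network=True, host_ipc=False, host_pid=False)
assert not must_use_host_userns(host_network=False, host_ipc=False, host_pid=False)
```

This matches the example in the thread: `hostNetwork: true` with default IPC still means the host user namespace.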
### Option 1: System Defaulting
In this option, there is no `HostUserNamespace` flag exposed in the API.
Does `SecurityContext.RunAsUser` imply using the host namespace, or does it entirely depend on what user namespace the container is using?
It is debatable whether or not the kubelet should provide defaulting based on
this knowledge since it may mean that a container is run with a
configuration that is not what the pod author expected. However, if
policy can check that the defaulted configuration is valid for the
I assume kubelet will be performing the check and report an event or set the pod status?
Cons:
1. requires extra flags to make the system aware of the remapped
It'd be nice to be able to detect this, instead of adding yet another flag to kubelet.
agree, I'd like to not have to tell the Kubelet and instead have it detect it
This may be accomplished by ensuring group access to the correct
directories. It implies that the Kubelet will be given (or read)
the value of `/etc/subgid` to perform a `chmod`.
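As a rough illustration of that idea (not the actual kubelet implementation — `parse_subid_line` and `grant_remapped_group_access` are hypothetical helpers), reading the remap user's `/etc/subgid` entry and granting group access might look like:

```python
import os

def parse_subid_line(line: str):
    # /etc/subgid entries look like "dockremap:100000:65536":
    # (name, first host ID in the subordinate range, range length)
    name, start, count = line.strip().split(":")
    return name, int(start), int(count)

def grant_remapped_group_access(path: str, subgid_line: str, remap_user: str = "dockremap"):
    """Hypothetical sketch: set the directory's group to the remapped
    root GID (the first ID in the range) and allow group traversal."""
    name, start, _count = parse_subid_line(subgid_line)
    if name != remap_user:
        return
    os.chown(path, -1, start)   # keep owner, change group only (needs privilege)
    os.chmod(path, 0o750)       # group can read and traverse
```

The `chown` requires privilege, which is why the thread debates whether the kubelet should do this itself or leave it to the runtime.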
The implementation seems too docker-specific. Can we abstract it a little bit to work with CRI?
Yes, and after reading the rkt implementation I think this needs to come from the CRI anyway which would help us with auto detection mentioned above.
Agreed. Let's make sure the abstraction is more runtime-agnostic.
* MKNOD
* SYS_MODULE
* SYS_TIME
1. explicitly granted permissions to use a host volume and escape the private user namespace
What about persistent disks?
This is an interesting point. For block devices it is possible to do a client-side chown; for things like NFS it generally is not. So, it seems like whether a particular FS will work in a userns-remapped container depends on the filesystem and/or our support for mounting or chown/chmoding that FS in the right way to be usable from the container's user namespace.
@yujuhong do you want to own this one?
RunAsUser cares only about the inside of the container. We knew we would
unknown if private user namespaces will be used until the container launch
command is created.
As a consistent solution to both driving any behavior to escape
@yujuhong - I was thinking about the feedback we discussed where the kubelet could be told by the CRI that it was in a remap mode but I don't think that is compatible with both docker and rkt since rkt uses a flag on the run command vs something like a daemon flag. Maybe I'm missing something, the feature gate seemed like it would support both - WDYT?
@yujuhong more info added to the proposal here but calling it out directly since we've discussed it - in docker 1.13+ you can discover the userns environment via the /info endpoint. For backwards compatibility I think we'd still need a gate.
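For example, a hedged sketch of that discovery path (docker 1.13+ reports a userns entry in the `SecurityOptions` field of the `/info` response; the helper name here is made up):

```python
def userns_enabled(info: dict) -> bool:
    """Sketch: detect a remap environment from the daemon's /info
    response instead of requiring an extra kubelet flag. Entries in
    SecurityOptions look like "name=seccomp,profile=default" or
    "name=userns"."""
    opts = info.get("SecurityOptions") or []
    return any(opt.split(",")[0] == "name=userns" for opt in opts)

assert userns_enabled({"SecurityOptions": ["name=seccomp,profile=default", "name=userns"]})
assert not userns_enabled({"SecurityOptions": ["name=seccomp,profile=default"]})
```

For daemons older than 1.13 this field is absent, which is why a feature gate would still be needed for backwards compatibility.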
is never defaulted to use the host user namespace. This falls under
the responsibility of cluster policy with `PodSecurityPolicy`.
`PodSecurityPolicy` has the ability to control using the host network,
@euank - after thinking about this a bit more I realized that we already have the ability to control what would be defaulted and what wouldn't via the PSP structures already in place. This has the disadvantage of not being obvious (i.e. an `AlwaysUserPrivateUserNs` on the `SecurityContext`) but should already be working.

How do you feel about that path?
### Constraints
* All non-user namespaces have an owning user namespace. Actions in
TODO - are GIDs being remapped in the rkt implementation?
TODO - specific information on mount types - some support chmod and
@php-coder - FYI
with a systemd-nspawn based implementation. Each pod started will
be allocated a random UID range.
To user private user namespaces in rkt you specify the `--private-users`
s/To user/To use/
Jenkins Bazel Build failed for commit 19eaae2. Full PR test history. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
added more info on proposed changes to directory permissions and discovery of the remap environment. @yujuhong PTAL when you get a moment, thanks!
to items like the termination logs.
How this GID is retrieved should be a container runtime detail. This
leaves the container runtimes free to manage the permissions of the
thoughts on allowing each runtime to manage the GID ownership of the pod root directory and below?
Jenkins unit/integration failed for commit 832d335. Full PR test history. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
bump 😄 @mrunalp @euank @yujuhong @smarterclayton @thockin Looking for thoughts on the following topics:
@php-coder is looking into volume types. Essentially, if we can define a nice way to get the GID from the runtime (don't think it's possible right now) we could use `fsGroup` to chmod on volume types that support it. Otherwise the permissions need to be managed by an administrator.
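A minimal sketch of fsGroup-style ownership management, assuming the remap GID is supplied as the `fsGroup` (a hypothetical helper, loosely modeled on what Kubernetes does for volume plugins that support ownership management — not the real implementation):

```python
import os
import stat

def apply_fs_group(volume_path: str, fs_group: int) -> None:
    """Recursively set the group and group rw bits on a volume, adding
    the setgid bit on directories so files created later inherit the
    group. This is the style of chown that only works for volume types
    that support ownership management (e.g. emptyDir), not NFS-like FSes."""
    for root, dirs, files in os.walk(volume_path):
        for name in dirs + files:
            p = os.path.join(root, name)
            os.lchown(p, -1, fs_group)          # keep owner, change group
            if not os.path.islink(p):
                mode = os.stat(p).st_mode | stat.S_IRGRP | stat.S_IWGRP
                if stat.S_ISDIR(mode):
                    mode |= stat.S_ISGID | stat.S_IXGRP
                os.chmod(p, mode)
    os.chown(volume_path, -1, fs_group)
    os.chmod(volume_path, os.stat(volume_path).st_mode | stat.S_ISGID | stat.S_IRWXG)
```

If the remap GID were injected this way, the pod directory would be writable from inside the remapped user namespace without the kubelet reading `/etc/subgid` at all.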
@pmorie care to review?
I should also note that a
TODO
* are GIDs being remapped in the rkt implementation? If not then
they are remapped to the same value as the allocated UID. Will update the doc
* are GIDs being remapped in the rkt implementation? If not then
root group may be enough.
* more specific information on mount types
* can the remap GID be injected as an `fsGroup` for mount types that
leaning toward this being a later TODO - for now FSGroup can be controlled by PSP
@mrunalp @thockin @yujuhong @euank Since I haven't received much feedback here I want to try and break this down to what the most minimal changes are that I believe will support using userns in a continued experimental alpha/beta form.
The thought here is that FSGroup already works to chown some types of volumes, which would then be available in the pod without further intervention (tested with emptydir to confirm it was writable). It also means that the admin can control the remap group via PSP and we would not have to discover it by reading the

This is prototyped with pweil-@16f29eb. Thoughts about introducing this behavior? It can all be gated by the current experimental flag IMO. As upstream runtime implementations mature we may reduce our need to manage the file system (shiftFS, runtime management of ownership such as the network mounts in Docker, providing runtime mappings rather than hard coded or black box allocation). I think at this point it is about the best we can do as these other items work themselves out.
@pweil- It is kind of annoying to have the kubelet adapt to docker's remapping strategy. Is it not possible to inject one-to-one mapped GIDs anymore with userns?
Can you elaborate? Yes, an admin can have more than one set of GIDs that end up in the container, but there is still the issue of directory access, which is the kubelet (not docker_manager) change I'm proposing. My suggestion is that we enable the minimum amount of access in the kubelet and then allow the runtime to worry about the pod directory and allow FSGroup to work for other volume types.
@mrunalp @thockin @yujuhong @euank @vishh @pmorie added info about my suggestion above of assuming the FSGroup is the correct GID as a phase 2 approach of experimental support. I will submit a PR detailing the change so it is easier to visualize the extent. Feedback welcome, willing to provide bribes. Supporting userns with volumes that support ownership management seems like a huge leap forward for folks while the underlying implementations of things like shiftFS and the ability to pass specific remap IDs progress upstream.
Jenkins GCE e2e failed for commit 0296d5a. Full PR test history. cc @pweil- If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
PID, and IPC namespaces, which capabilities can be added, as well as
which types of volumes can be used. This effectively controls who
will be allowed to trigger the Kubelet defaulting logic that would
escape private user namespaces.
@pweil- I am not sure even using PSP will help me in the case of hostPath. As mentioned earlier, in the distributed file system case where the DFS is mounted on each underlying host, I need to allow users to use PVCs which are backed by hostPath. The current design will then escape the user namespace. I would really like to have an option in PSP or in the pod spec security context which enables me to choose to honor the defaulting logic or override it. Now that the defaulting logic is already implemented in 1.5 and we are talking about phase 2, I think it is quite necessary to have such an option for users like me who have a DFS and want to use user namespaces.
@gurvindersingh what we discussed earlier is the ability to turn off defaulting and expose host user namespace on the api which hasn't received buy-in yet. Do you think that will not support your use case anymore? Thanks for the feedback.
@pweil- Referring to this comment: `AlwaysUserPrivateUserNs` on the `SecurityContext`. If I understand it correctly, it means that I can define a PSP on Kubernetes namespaces where all the pods will always run in user namespaces, and on others where I allow host namespaces they can escape. If this understanding is correct then I think it will solve the case, and 👍 from me to get this in the current phase 2. With this in, I will not need to enable the defaulting at all as long as I have correct PSPs.
@pweil- wondering if there is any update on exposing the `AlwaysUserPrivateUserNs` on the `SecurityContext`. Thanks.
@gurvindersingh after a discussion with @yujuhong it was decided that this work will wait until the shim work is complete to avoid unnecessary complications. But I do think that exposing the host user namespace via the PSP is reasonable and would support your use case. I'll update the proposal for discussion.
@pweil- so the current proposal implementation is targeted for 1.6 or 1.7?
@gurvindersingh 1.7 - nothing will go into 1.6 from this document.
@pweil- got it.
Raised some questions but this generally makes sense as outlined.
the `/etc/subgid` files to determine [how to map UIDs and GIDs
in the container](https://docs.docker.com/engine/reference/commandline/dockerd/#detailed-information-on-subuidsubgid-ranges).
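To illustrate how a single subordinate range maps IDs (a sketch; the entry format `dockremap:100000:65536` follows the subuid/subgid convention described in the linked docs):

```python
def container_to_host_id(container_id: int, range_start: int, range_len: int) -> int:
    """With one subordinate range (e.g. "dockremap:100000:65536" in
    /etc/subuid), container ID N maps to host ID range_start + N."""
    if not 0 <= container_id < range_len:
        raise ValueError("container ID outside the mapped range")
    return range_start + container_id

# Container root (UID 0) becomes an unprivileged host UID:
assert container_to_host_id(0, 100000, 65536) == 100000
assert container_to_host_id(1000, 100000, 65536) == 101000
```

This is why files owned by the remapped container root show up on the host as an ordinary high UID, and why host-volume permissions need explicit management.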
Additionally, the `HostConfig` for a container now contains a parameter
Might want to add a note about your prior host userns work here
## Kubernetes Support
As of 1.5 [experimental support](https://github.com/kubernetes/kubernetes/pull/31169)
Oops, there it is.
1. Negative oomScoreAdj values fail. In this case the pause container
cannot be started and shows an error `write /proc/self/oom_score_adj: permission denied`
in the daemon log. Failing in docker 1.10, working in 1.11+
1. Cannot share the host or a container's network namespace when user
I get the issue with the host network namespace here but not another container's network namespace. Is this saying that sharing a network namespace among containers in a pod will break? Or that in order to share network namespaces, two processes have to be in the same user namespace?
in the daemon log. Failing in docker 1.10, working in 1.11+
1. Cannot share the host or a container's network namespace when user
namespaces are enabled. [Fixed in 1.11](https://github.com/docker/docker/pull/21383).
1. Permission denied mounting. Occurs due to ownership permissions
If we just use the FSGroup for this, does it work? I actually think this represents a bug...
* All non-user namespaces have an owning user namespace. Actions in
the non-user namespace require [capabilities that are checked against
the user namespace](http://man7.org/linux/man-pages/man7/user_namespaces.7.html).
This means that when containers share a non-user namespace they must also
Question answered... you might want to move the constraints stuff up, since it informs some of the text in the prior sections.
supplemental groups) the pod has access to read these directories. In
a remap environment the group itself may change so the pod will no longer
be able to access these directories. In order to fix this the root
directory should be given 0755 access in the Kubelet's `setupDataDirs`.
and what ownership?
ps: I have had a lot of trouble with constraints on attributes of the root directory, so you might consider whether the kubelet should self-regulate this based on a flag.
be able to access these directories. In order to fix this the root
directory should be given 0755 access in the Kubelet's `setupDataDirs`.
Beneath the Kubelet root directory each pod is given it's own directory
its
Rather than increasing the access on directories under the root directory,
the pod directory (and sub directories) may be owned by the remapped
GID. This is very similar to the `fsGroup`, however `fsGroup` is not applied
to items like the termination logs.
Yeah, seems like a bug, huh?
makes sense that the ability to allocate a range should belong to the
orchestration system. This would allow Kubernetes to know exactly
what GID requires access to the directories and ensure the correct
access is granted.
Does this imply some future integration between the runtime and PSP allocators for these IDs?
[APPROVALNOTIFIER] Needs approval from an approver in each of these OWNERS files.
This PR hasn't been active in 90 days. Closing this PR. Please reopen if you would like to work towards merging this change, if/when the PR is ready for the next round of review. cc @pweil- @smarterclayton @thockin @yujuhong You can add the 'keep-open' label to prevent this from happening again, or add a comment to keep it open another 90 days.
Continuing the discussion on options for user namespace support started in #31169 and #30684
@derekwaynecarr @smarterclayton @vishh @pmorie @mrunalp
I've outlined the options a bit more and included details on some of the issues I ran into when testing user namespaces locally. PTAL at the options and provide feedback. Depending on which way we go I can probably poach a lot of the existing work.