Proposal: User namespace support for re-mapped root per daemon setting #11253
@@ -101,6 +101,7 @@ expect an integer, and they can only be specified once.
--mtu=0 Set the containers network MTU
-p, --pidfile="/var/run/docker.pid" Path to use for daemon PID file
--registry-mirror=[] Preferred Docker registry mirror
-r, --root="" Set root user/uid remap option
I would rather we only have `--root` and not `-r`.
Sounds good to me; I'm fine with simply `--root`. @tibor: What do you think of @moxiegirl's suggestion on a "self-documenting" option like `--remap-root`?
Maybe something that is closer to the term "user"? `--user-root`? I really don't know.
How do we handle pre-existing volumes which are owned by host uid 0?
@cpuguy83 see this exchange between @tiborvass and me on that subject. I have ideas, but there is no easy solution for the extreme edge cases (e.g., starting and stopping the daemon repeatedly with different options): https://botbot.me/freenode/docker-dev/2015-03-04/?msg=33342876&page=4
Example relying on default Docker username management:

    $ sudo docker -d -r default
This is ambiguous. It could mean use the pre-existing `default` user/group. This would mean that `default` is a reserved value and cannot be used for specifying a user/group. If we're okay with this, it is worth mentioning in the docs.
An alternative could be to have `dockroot` be the default user, and `dockroot` would be special in the sense that if it's a non-existing user/group it would be created.
Yes, I thought about the potential confusion given `default` might exist as a real username/groupname. I could go either way. If we use `default`, you are correct that the docs will have to be clear that this means `dockroot` as user and group, created if necessary.
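To make the `default` versus explicit-value discussion above concrete, here is a minimal Python sketch of how a `--root` value might be resolved to a `uid:gid` pair. Everything in it is an assumption for illustration: the function name `resolve_remap_root`, the injected lookup callables (stand-ins for the host's user and group databases), and the choice to use a bare name for both user and group, which this thread has not settled. Creating `dockroot` when it does not exist is not modeled.

```python
from typing import Callable, Optional, Tuple

# Reserved user/group name from the discussion above.
DEFAULT_NAME = "dockroot"


def resolve_remap_root(
    value: str,
    lookup_user: Callable[[str], Optional[int]],
    lookup_group: Callable[[str], Optional[int]],
) -> Tuple[int, int]:
    """Resolve a hypothetical --root value into a (uid, gid) pair.

    Accepts "default", a bare name or number, or "user:group" / "uid:gid".
    The lookup callables return None for unknown names.
    """
    if not value:
        raise ValueError("empty value: root remapping is not enabled")
    if value == "default":
        value = DEFAULT_NAME
    user, sep, group = value.partition(":")
    if not sep:
        # Assumption: a bare value names both the user and the group.
        group = user
    uid = int(user) if user.isdigit() else lookup_user(user)
    gid = int(group) if group.isdigit() else lookup_group(group)
    if uid is None or gid is None:
        raise LookupError(f"unknown user or group in {value!r}")
    return uid, gid
```

With this shape, `default` is simply an alias for `dockroot`, which is why a real account named `default` could never be selected by name; that is the reserved-value concern raised above.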
@@ -349,6 +350,31 @@ https://linuxcontainers.org/) via the `lxc` execution driver, however, this is
not where the primary development of new functionality is taking place.
Add `-e lxc` to the daemon flags to use the `lxc` execution driver.

### Daemon user namespace support
Love the table Phil! I took a closer look at this content from the reader perspective. I came up with some more questions and a bit of an edit. If it is too early in the PR to do this, just ignore me.
Remap the container's root user

You can remap a container's root user to an unprivileged user on the Docker
host. To allow remapping, you enable Linux user namespace support on the Docker
daemon. The daemon's namespace configuration applies to all containers the
daemon runs; you cannot remap on a per-container basis.

To enable user namespace support, provide a `username` or Linux `uid:gid` to the
`--root` flag. If you want the daemon to create and use Docker's default user
management, specify `default` instead. When you specify `default`, the daemon
creates a user and group named `dockroot` (if they don't already exist). Then,
the daemon uses `dockroot` as the remapped root ID inside all containers
for that daemon instance.
I removed this clause:

"Due to the requirement that the filesystem layers need to have user/group
ownership modification to make this remapping useful,"

A very complex clause. I think you are trying to explain to the reader why you
are remapping at the daemon level, not per container. Do they need the
explanation to use the feature? I tend to think this explanation is unnecessary.
Also, if a user supplies just a `username`, what happens to the `gid`/group
ownership values? Is the user both the "user" and the "group", or is the group
owner still root? Maybe this goes back to your clause, but the answer to my
question is not immediately clear from the text.
Finally, just as a nice to have, why would the user want to use the default user
management versus enabling remapping? Is it so obvious or could we do a little
throw away line that would clue newbs into the feature?
I think it's a good idea to clue the reader in as to the reason for the technical limitation. Doing the remapping on a per container basis is probably something a lot of people would want (I know I do) and they're going to end up digging to figure out what's stopping it anyway.
The question "why can't I set this per container" is easy to understand. If we provide an answer, it should be equally easy to understand.
For example, here the first three sentences provide a pretty good explanation of filesystem layering. These sentences have a readability Grade Level of 16. Most of our documentation is written at 8 or lower. Effectively, most of our readers would find this very elegantly written explanation hard to parse.
Moreover, in designing the information in this section, do I really want to distract the reader from their actual focus task only to provide three long sentences to explain part of what is meant by this single phrase: "Due to the requirement that the filesystem layers need to have user/group ownership modification to make this remapping useful,"?
If we need to say anything, I'd rather write "Due to file system limitations, you cannot map on a per container basis" and move on. Yes, I agree with you, there are readers at the contributor level who may want the implementation detail but we have other avenues such as IRC or forums for those questions.
I completely disagree.
Documentation should never be dumbed down, or have details removed, simply because there are other avenues. Have you ever sat on IRC and answered the same question day in and day out because the docs didn't say something?
I think detail > readability in any sense (with technical things like Docker, most people using it are developers or devops). If you don't understand something, learn it. If you don't need to read something, skim or skip it, but there's no excuse to leave ANYTHING undocumented or subvert detail.
Now, if the documentation is a tutorial for a specific task, maybe, but if it applies, it should be included, even if under a separate header for easy skipping or under a tl;dr.
I don't disagree that, overall, the documentation should be easy to read. However, honestly, this is a technical subject, and aiming for grade 8 readability (assuming the Flesch-Kincaid formula) seems a little low. Perhaps as a target, but every part falling under this would seriously deprive the documentation of important detail.
P.S. IMHO
@moxiegirl I'm not even saying to keep the sentence the same way, I'm saying that there should be some sort of breadcrumb in there to help people get on the track of understanding the limitation. Just leaving it as, "You cannot do what you want to do", is frustrating. Maybe just referring to another part of the documentation is fine?
Documentation-first proposal for user namespace UX Docker-DCO-1.1-Signed-off-by: Phil Estes <estesp@linux.vnet.ibm.com> (github: estesp)
@moxiegirl I'm not ignoring all your comments; at this point, some basic implementation details are still being figured out, but my initial response is that maybe user namespaces in general will have some hairy technical topics to dig into that will require something deeper like |
Proposal text updated with discussion on |
@estesp it's me who asked about --privileged, thanks for clarifying. One use case I have been interested in is using Docker containers to build other Docker containers with user namespacing enabled. jpetazzo/dind requires --privileged today, so I wonder if there are fundamental limitations in making this work with a user-namespaced Docker container, or if it's a matter of the right massaging.
--privileged should mean NO SECURITY. If I run a container with --privileged, root=root. Anything else would be confusing.
I take it from your proposal that if I have user 3267 and I was configured to use docker I will be able to create any container/image under the /var/lib/docker/3267:3267 directory. Whenever I run a container the image will magically be mounted and a chown -R 3267:3267 / will happen? (Sadly this will eliminate any non-root uid separation.) Seems like this would be a little slow. You would also need to eliminate the ability of user 3267 to execute any "privileged" calls when talking to the docker daemon.
@rhatdan actually, the reason for separate roots is that they will not need to What that does mean is there is potential "bloat" of duplicate data underneath various "subroots" under |
@estesp I think it makes sense to design it in such a way that full user namespace support could be enabled for users in the future. So, IMO, instead of just using the root mapping, the full idmap should be passed around and the code should be written to understand the mappings. What is exposed to the user in stages could be controlled by the CLI. |
I don't think we have fully talked about the different use cases for user namespaces. I see at least three potential use cases.

1. Improve general container separation, in such a way that we can turn off all Linux capabilities for a container. This would tighten up security of the system from containers, but would not necessarily improve separation between containers. In this mode, I would envision we would pick one UID for "DOCKERROOT", then set up all containers to use it. Say DOCKERROOT was UID=2. I would set up a mapping for UID0=2, GID0=2, and then map all UIDs > 2 to themselves: 3-MAX_UID=3-MAX_UID. Similarly for GID. By doing this we have eliminated the ability for a container to attack root. It is also a lot simpler for volume mounts.

2. The OpenShift method. In this method all files within a container get mapped to a single UID/GID pair. Every user on the system gets a different UID. The only real reason for this would be if the user's container required processes to run with a kernel capability. Otherwise user namespaces buy you nothing.

3. Each container gets a separate UID range mapping from every other container. This gives you the ability to run lots of containers and use UID separation to keep containers apart, but it skyrockets the complexity. Volume mounts become a huge headache. In order to make this work, I would recommend we add -v /SRC/DEST:U, which would chown UID:GID /SRC during the mount to the default UID for the container.

I am not sure these three use cases can be used together...
@rhatdan: If my goal is multi-tenancy it seems like an even better version of your first option would be to map root from each container to a different user, but map all UIDs > 2 to themselves. |
@jheiss Option 1 is no good for Multi-tenancy since I could have data owned by UID=60 (Apache) be attackable by all containers. |
Could this have label project/security added? |
@rhatdan Can you explain that more? The scenario I'm imagining where that attack works is that you volume mount a shared directory that has files owned by Apache into all containers, but that would seem like a pretty weird thing to do in the first place. What's the situation you were thinking about? |
@nugend I always look at the worst possible scenario: the container process somehow breaks out of the container and is able to see content on the whole system. User namespaces would not protect a container's data that is owned by UID 60 from a process running in another container as UID 60.
I think @rhatdan's #3 option is definitely the ideal: the host is protected against container breakouts due to root mapping to a non-root host UID; and the containers are protected against each other due to different host UID's for the same container UID. This said, I agree that the solution is complex. @crosbymichael I think we know the behavior that we want out of this, we just need to figure out a good way of dealing with volume mounts. Exposing this to the user will probably lead to this never being used and/or being used the wrong way. |
I propose the following terrible idea: UNLR - User Namespace Layout Randomization 👎 /cc @NathanMcCauley |
@diogomonica It might be done if Docker can reserve a block of N UID/GIDs in the high range, for example a region significantly after UID 2000 or so. Personally I would be very happy with option 3 because then it would solve all the host volume permission problems. Hooray! I'm just a bit concerned about whether there is some performance penalty that comes as a cost of doing all such remapping (on multiple containers simultaneously).
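As a rough sketch of the "reserve a block in the high range" idea for option 3, the Python below hands each container its own non-overlapping block of host IDs. The names `allocate_ranges` and `host_id`, the base of 100000, and the block size of 65536 are all assumptions made for illustration; nothing in this thread fixes those values.

```python
from typing import Dict, Iterable, Tuple

# (first host ID of the block, number of IDs in the block)
Block = Tuple[int, int]


def allocate_ranges(
    containers: Iterable[str], base: int = 100000, size: int = 65536
) -> Dict[str, Block]:
    """Assign each container a disjoint block of host IDs, in order."""
    return {name: (base + i * size, size) for i, name in enumerate(containers)}


def host_id(container_id: int, block: Block) -> int:
    """Translate an ID inside the container to the host ID in its block."""
    start, size = block
    if not 0 <= container_id < size:
        raise ValueError(f"ID {container_id} is outside the block of {size} IDs")
    return start + container_id
```

Under these assumptions the same in-container user (say, uid 48) lands on a different host uid for every container, which is the separation property option 3 is after. It is also why volume ownership gets harder: a file on a shared volume can only be owned by one of those host IDs.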
since the PR is open do we need this open or is everything here included there |
Hey @jfrazelle .. I need to copy over the "docs" part of this PR into the other, and then we probably need some place to discuss "phase 2" requirements, which are encapsulated in some of the discussion above. I guess an issue is going to be the better spot for that as we won't be able to "PR" the more complex use cases for awhile, most likely. |
just cleaning up house haha, up to you; this was the least recently updated
@estesp let me know how we can get this over the finish line. Would love to get root remapping early in the release cycle so we have more time to test for potential issues. |
@diogomonica hey there..please look at PR #12648 as the initial implementation of the remapped root. The branch there builds and runs with remapped root and I've tested a handful of common images and container operations. Status is basically this: (a) fix a few known bugs and (b) get the PR reviewed and hopefully merged after cycle of response/updates based on review input. As far as "(a)", I only have two known bugs, and one is somewhat new to me based on the integration client test runs. (oops!) Two test failures seem to show that I un-fixed a volume ownership issue (and the ownership fails even when I'm not remapping root, but using my set of patches). The other known bug is the ownership of pipes for stdin/stdout when not running in interactive/tty mode. I am working on solving both bugs now. |
Does this proposal mean that if the process running as root in the container gets compromised, that process will only have the remapped uid:gid rights within the host system (supposing it breaks out of the container)? Wondering about safety, this helps but it seems to push people to run services as root within the container. That would mean that for example a compromised nginx/PHP/... process would have root access within the container; giving it a lot more power than |
@wernight correct, that is the promise (and hopefully big security improvement) of user namespacing in general. If you make that uid:gid something unused on the host (e.g. uid/gid = 999999) it will basically have nearly zero rights on the host system. I don't know what you mean about "pushing people to run services as root"? What services don't run as root today? Only if (a) the service itself drops privileges (setuid/setgid like apache as www-data) or (b) the container image creator uses " |
@estesp is absolutely right. The idea is that, while people shouldn't be running their applications inside of the container as root, they do. If we ship this feature (eventually turned on by default), we can help Docker be safer by default. There are obviously other things that we can do to help us on this "Safer by default" path, but this is a great step in that direction. Thank you @estesp! |
Users should still run their processes as NON-"ContainerRoot" inside the container. This is why user namespaces have the ability to use ranges of UIDs. Having all processes in the container run as a single UID that is able to manage every file within the container leaves those processes with no separation of access to their data. While it is true that user namespaces provide better separation between containers, and between containers and the host, I would still like to be able to have the apache process within my container running as the container Apache user 48 rather than as container root. Then I can set up content that is read-only to the apache process, and data that is read/write. If I run multi-service containers, I want separation between my processes. I don't believe anyone would suggest taking a single-service VM and running all of the processes as root and chown -R root:root /. Just because some people run their services within a container as root does not mean that we should encourage this.
@rhatdan I've obviously done a poor job of documenting the implementation, or somewhere there is some very poor information on how the proposed implementation works :) Maybe the simplest response is as follows:
In summary, this implementation will not preclude using specific "service" IDs to run processes. |
@estesp Yes, using USER or privilege drop. So supposing one follows good practices and runs inside the container as non-root (I usually use some random UID/GID to avoid for example
@estesp I agree all you have proposed is remapping the root user, which is similar with what I have proposed in the past. But I am really responding to other comments in this pull request, that I think are short sighted. There is really no easy solution to using user namespace currently that would work for everyone, as I stated above on March 16th. |
@rhatdan agreed that "there is no easy solution" to a supports-all-usecases implementation, which is why the actual implementation PR (#12648) is titled "Phase 1". The assumption is that having (a) code that understands and operates with full maps, but (b) a UI that is restricted to remapping root per-daemon, is a good-enough first step to greatly increase security, such that it immediately drops "real host root" usage from the equation for all containers. A "phase 2" design needs to be thought through and proposed that helps with the various multi-tenant scenarios cataloged in this thread and elsewhere. It might be as simple as letting full uid/gid maps be provided, and it is "buyer beware" on how you handle the filesystem ownership, given it will be impossible to share those layers with other runtime containers. Maybe the daemon needs to have a local repo cache for each map that is currently running, so at least tenants using the same mapping can share layers; or multiple "graphs" in a daemon based on a hash of the uid/gid maps. Anyway, hopefully we can agree that the more complex use case has potential solutions, but they need to be proposed, thought through for pros/cons, and decided on for future implementation past "phase 1".

@wernight I think I understand what you are saying, but other than root no longer being "real host root" inside the container, this proposal (and the proposed phase 1 implementation) does nothing to change the use of IDs in or out of the container. Uid 30 inside the container is uid 30 outside, and so on. If you are utilizing random high UIDs inside the container for some form of separation, that will continue to work. What is slightly funny, and that no one has mentioned here, is that if you are using names, you probably are already getting mismatches you didn't even realize, because rarely do distros use the same ID for the same "standard user"--
@estesp Thanks. Yes, I use names inside and numbers outside (which I found out the hard way). That's why I tend to keep numbers random in [200, 800], as those are system accounts but usually not used. Still, this PR is a really good improvement to increase safety overall.
I am fine with this suggestion. I like the idea that you would set up a mapping with root=100, which would end up adding mappings like the following, if I get my user namespace mappings correct: '0 100 1', '1 1 99', '101 101 MAXUID'. I would also like to see options like UIDMAPPING and GIDMAPPING, which would allow people to experiment more with user namespaces. Why not add options such as --UIDMAPPING='0 3000 1' --GIDMAPPING='0 3000 1'?
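The root=100 mapping above can be sketched in a few lines of Python. This assumes the kernel's `uid_map` entry layout of container start, host start, and length, so the third field of the last entry is a count of IDs rather than a maximum UID. The function names and the `id_count` default of 65536 are hypothetical and not part of the proposal.

```python
from typing import List, Optional, Tuple

# (first ID inside the container, first ID on the host, number of IDs)
Entry = Tuple[int, int, int]


def remapped_root_map(root_id: int, id_count: int = 65536) -> List[Entry]:
    """Map container root to root_id and every other ID to itself."""
    if not 0 < root_id < id_count:
        raise ValueError("root_id must be between 1 and id_count - 1")
    entries: List[Entry] = [(0, root_id, 1)]
    if root_id > 1:
        # IDs 1 .. root_id - 1 map to themselves.
        entries.append((1, 1, root_id - 1))
    if root_id + 1 < id_count:
        # IDs root_id + 1 .. id_count - 1 map to themselves.
        entries.append((root_id + 1, root_id + 1, id_count - root_id - 1))
    return entries


def to_host(container_id: int, entries: List[Entry]) -> Optional[int]:
    """Return the host ID for a container ID, or None if it is unmapped."""
    for inside, outside, length in entries:
        if inside <= container_id < inside + length:
            return outside + (container_id - inside)
    return None
```

One consequence worth noting: with root mapped to host ID 100, container ID 100 has no mapping at all, because its identity slot on the host is already taken by container root.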
Fine with me. |
Fine with me. |
Proposal for User Namespace Daemon Support
This is a docs-first proposal to get review/feedback on the UX for specifying a per-daemon-instance remapping of container root to an unprivileged user.
Depends on libcontainer API/userns support
The support for user namespaces already exists in libcontainer. The PR for bringing that new API and functionality is open for review: #11208. This present proposal cannot be implemented until that PR is merged and the libcontainer vendor is updated in Docker itself.
A per-daemon setting for root

The documentation change in this PR notes a new flag `-r, --root=""` which would be used to specify the requested user and group name (or `uid:gid`) that the daemon would use as the remapped root when instantiating new containers. Specifying the special value `default` would be the user's request to have Docker create (or use, if existing) a special user/group named `dockroot` (the `docker` group is already taken and used by Docker itself).

This new flag, when specified, would cause a new template (see https://github.com/docker/docker/blob/master/daemon/execdriver/native/template/default_template.go for the current template) to be used when containers are created with the native execdriver, which will cause the added creation of a user namespace, using the specified `uid:gid` as the remapped root within the container.

Other required modifications to support user namespaces
Locally hosted layer content

Because the filesystem layers of any image have root:root ownership of most of the files, a re-mapping operation will also need to occur on untar and tar of image layers. This is one of the reasons to start with user namespaces as a daemon-level option: to avoid potentially significant churn of `chown` activity, all image layers for a specific daemon can be untarred with root:root ownership mapped to the re-mapped unprivileged container root uid:gid, allowing those images to be used successfully by any containers within the daemon. When images are pushed, since a `tar` action happens anyway, these image layers can be remapped back to root:root at that point. This work is underway, but has no UX component for review.

Daemon root
Currently the daemon, by default, creates a directory `/var/lib/docker` owned by `root:root` with permissions `0700`. Given the actual container root filesystems live underneath this hierarchy, a user namespaced container will not even start today, as the early `chdir()` call will fail due to lack of access to the root-accessible-only directory hierarchy under `/var/lib/docker`.

My proposal is that `/var/lib/docker` will become a super-root of any number of daemon roots, each one owned by the `uid:gid` of the daemon's remapped root, if provided. If root is not remapped (user namespaces are "off"), then a new daemon root under `/var/lib/docker` (or the user-specified location) simply named "0.0" will be used. For migration purposes, the first time the daemon is run with this feature, current data from `/var/lib/docker` (or the user-specified root) will be migrated to the subdir `./0.0` (for the user namespaces-off case). The "super-root" directory perms would change to `0755`, but subdirectories would still use `0700`, with ownership matching the remapped root, or real root depending on the case. An example of three subdirectories under a `/var/lib/docker` with `0755` permissions is shown below:

Questions
Open questions/concerns from early review/discussion

Is this a bad starting point if we support full specification of user/group maps in the future?

I believe that if we support (in the future) more complete user control over the namespace capabilities that exist at the libcontainer/Linux kernel level, it will not deprecate this "mode" of user namespace support. I believe instead it will be a "Conflicts:" scenario between `--root` and `{future map config option(s)}`. Because we expect `--root` can be supported by the daemon and image/graph subsystem, it will probably be the more likely path, and the custom uid/gid maps will need to have a set of restrictions around using Hub images, or other "migration" scenarios.

How does this impact the use of `--privileged`?

Given user namespaces are about restricting privileges inside the container, you can guess that the general answer is "`--privileged` is incompatible with user namespaces". However, given that using `--privileged` maps to a varied set of actions at container setup (from making `/sys` read-write instead of read-only to allowing more `CAPS_` to remain), it will require a deeper look at which of those may be compatible or reasonable with user namespacing restrictions.