Support for user namespaces #4572

Closed. Wants to merge 8 commits into base: master.

Contributor

dineshs-altiscale commented Mar 11, 2014

This exposes UID namespace support. A new command line option (--uidmap) maps a set of virtual UIDs to which the application within the container is confined. The application could potentially be the root in the container but unprivileged on the host.

Tests are still missing, but I wanted to push this anyway to get feedback. Testing requires the latest kernel (kernel.org 3.13 or Fedora 20).

Addresses issue #2918
Docker-DCO-1.1-Signed-off-by: Dinesh Subhraveti <dineshs@altiscale.com> (github: dineshs-altiscale)

crosbymichael Mar 11, 2014

Contributor

Can you show us an example of how a user would interact with this on the cli?

dineshs-altiscale Mar 11, 2014

Contributor

Yes, there is one example in cli.rst:

$ touch /tmp/uid100000
$ sudo chown 100000:100000 /tmp/uid100000
$ sudo docker run --uidmap="0:100000:10000" -v="/tmp:/mnt:rw" -i -t ubuntu ls -lh /mnt/uid100000
-rw-r--r--. 1 root root 0 Mar 10 19:16 /mnt/uid100000

The file has a UID of 100000 on the host but appears as root-owned in the container. Similarly, the ls -lh process runs as root in the container but appears as UID 100000 in ps au on the host.

Note that I am using a standard Ubuntu image. Before starting the container, the UIDs and GIDs of its files are translated to their real values on the host (100000 range). The time taken for the translation (a second or two for the above image) can be avoided by using a pre-translated image.

timthelion Mar 12, 2014

Contributor

With this interface, how would I map multiple UIDs? Can I do:

$ docker run --uidmap="*:100000:10000" -v="/tmp:/mnt:rw" -i -t ubuntu ls -lh /mnt/uid100000

To map all of the container's UIDs to a given UID? Can I do:

$ docker run --uidmap="0:100000:10000" --uidmap="100:100100:10100" -v="/tmp:/mnt:rw" -i -t ubuntu ls -lh /mnt/uid100000

To map 0 to 100000 and 100 to 100100?

I'm really looking forward to this feature! I'm glad to see progress!

timthelion Mar 12, 2014

Contributor

You say:

""""
Note that I am using a standard Ubuntu image. Before starting the container, the UIDs and GIDs of its files are translated to their real values on the host (100000 range). The time taken for the translation (a second or two for the above image) can be avoided by using a pre-translated image.
""""

Is there a way to apply this user mapping at build time? Would it be possible to have a docker remap-uids ubuntu 0:1000:1000 100:1100:1100 command that is run once (rather than every time you run the container), so that there is no wait at startup?

dineshs-altiscale Mar 12, 2014

Contributor

The semantics of --uidmap are as follows:

--uidmap="containerUID:hostUID:range" maps range virtual UIDs in the container starting from containerUID to their respective real UIDs starting from hostUID on the host. Any UID reference passed into a container is translated by the kernel from its real to virtual value according to the specified mapping at the container boundary (and vice versa).

The mappings can be sparse. Multiple ranges of UIDs can be mapped with multiple --uidmap options. If a real to virtual UID mapping doesn't exist, it would show up as nobody or nogroup. (By the way, this patch doesn't yet chown volumes and other artifacts like /proc to container root, so they currently appear as nobody.)

The mappings cannot overlap. A one-to-many mapping obviously doesn't make sense (one virtual UID cannot be translated into multiple real UIDs at the same time). Many-to-one mappings may be allowed in principle, but they are disallowed in the current LXC implementation. For example, if real UID 100000 on the host is mapped to virtual UID 0 in the container, real UID 0 cannot also be mapped to virtual UID 0 in the container. This makes any directories in the container image owned by root on the host appear as owned by nobody in the container.
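
To make these semantics concrete, here is a minimal Python sketch of the translation that a "containerUID:hostUID:range" mapping describes. It is an illustration only, not the patch's code: the helper names (parse_uidmap, check_no_overlap, to_host, to_container) are hypothetical, and it assumes that an unmapped host UID is reported as the usual Linux overflow UID 65534 ("nobody").

```python
from typing import List, NamedTuple, Optional

# Assumption: 65534 is the usual Linux overflow UID, displayed as "nobody".
OVERFLOW_UID = 65534


class UidRange(NamedTuple):
    container_start: int
    host_start: int
    size: int


def parse_uidmap(spec: str) -> UidRange:
    """Parse one "containerUID:hostUID:range" value."""
    container_start, host_start, size = (int(field) for field in spec.split(":"))
    if size <= 0:
        raise ValueError(f"range must be positive: {spec!r}")
    return UidRange(container_start, host_start, size)


def _overlaps(start_a: int, start_b: int, size_a: int, size_b: int) -> bool:
    """True if the half-open intervals [start, start + size) intersect."""
    return start_a < start_b + size_b and start_b < start_a + size_a


def check_no_overlap(ranges: List[UidRange]) -> None:
    """Reject mappings that overlap on either the container or the host side."""
    for i, a in enumerate(ranges):
        for b in ranges[i + 1:]:
            if _overlaps(a.container_start, b.container_start, a.size, b.size):
                raise ValueError("container UID ranges overlap")
            if _overlaps(a.host_start, b.host_start, a.size, b.size):
                raise ValueError("host UID ranges overlap")


def to_host(ranges: List[UidRange], container_uid: int) -> Optional[int]:
    """Virtual -> real UID, or None if the virtual UID is not mapped."""
    for r in ranges:
        offset = container_uid - r.container_start
        if 0 <= offset < r.size:
            return r.host_start + offset
    return None


def to_container(ranges: List[UidRange], host_uid: int) -> int:
    """Real -> virtual UID; unmapped real UIDs show up as the overflow UID."""
    for r in ranges:
        offset = host_uid - r.host_start
        if 0 <= offset < r.size:
            return r.container_start + offset
    return OVERFLOW_UID


ranges = [parse_uidmap("0:100000:10000")]
check_no_overlap(ranges)
print(to_host(ranges, 0))       # container root as seen on the host
print(to_container(ranges, 0))  # host root as seen from the container
```

With the mapping from the earlier example, container root translates to host UID 100000, while host root has no virtual counterpart and is therefore reported as nobody.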

dineshs-altiscale Mar 12, 2014

Contributor

The UID translation time can be avoided by using a 'pre-translated' image, which is produced by committing the container into the same image or a new one and using that image the next time.

timthelion Mar 12, 2014

Contributor

"Many-to-one mappings may be allowed in principle but they are disallowed in current LXC implementation." - That's a pity, as there goes my use case... Mapping "any/every user in the container including root to a given user."

Is the --uidmap="containerUID:hostUID:range" syntax taken from LXC itself or did you come up with it? Is it possible for range to be infinite? I personally find this notation a bit confusing, though that may just be because I've never come across it before. So if I pass --uidmap="0:1:2" and then, in the container, I create a user with UID 1, that user will have UID 2 on the host? Basically, what this notation says is: "For any UID cuid in the container: if 0 <= cuid - containerUID < range, cuid is mapped to cuid + (hostUID - containerUID) on the host."

Would it make sense to mimic list comprehension with a syntax like: --uidmap=[+1 if 0 <= uid <= 1] I know that is a pain in the arse to parse, and might look a little bit too scary and complex as well.

I might also suggest --uidshift=0-2:3. This would "shift" a range of UIDs, such that UID 0 becomes UID 3, UID 1 becomes UID 4 and UID 2 becomes UID 5.
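
As a rough illustration, the proposed shift notation carries the same information as a "containerUID:hostUID:range" triple. The Python sketch below converts one into the other. Note that --uidshift is only a suggestion in this thread, the function name is hypothetical, and the sketch assumes the number after the colon is the first target UID on the host (for a range starting at 0 this coincides with the shift amount).

```python
def uidshift_to_uidmap(spec: str) -> str:
    """Convert a hypothetical "first-last:target" value to "containerUID:hostUID:range"."""
    uid_range, target = spec.split(":")
    first, last = (int(part) for part in uid_range.split("-"))
    if last < first:
        raise ValueError(f"empty UID range in {spec!r}")
    # The range is inclusive on both ends, so it covers last - first + 1 UIDs.
    return f"{first}:{int(target)}:{last - first + 1}"


print(uidshift_to_uidmap("0-2:3"))
```

For the example above this yields the triple 0:3:3, that is, three UIDs starting at 0 in the container mapped to UIDs starting at 3 on the host.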

It seems really important to me that we be able to map ALL UIDs, as the security flaw with volumes in the current non-mapped model comes from potential UID overlap.

Imagine if there is a PaaS which gives its users the ability to run docker containers, and also the ability to ssh into a special non-privileged shell as a special non-privileged user. Within the ssh session, perhaps that user only has the right to run the docker client, in order to check the status of their container... Now if the docker container was able to create an executable file owned by that non-privileged user in a volume somewhere, you could end up with a breakout from the non-privileged ssh shell.

My use case is a bit different. I'm writing a program called subuser. I want each user on the system to be able to run docker containers which have volumes mounted, in order to access user files. Currently, I create a user in the "subuser container" which just happens to have the same UID as the user that is running that subuser container, which makes permissions match up. But it is terribly ugly.

dineshs-altiscale Mar 12, 2014

Contributor

Mappings that are not one-to-one are tricky in practice. Say virtual UIDs 100 and 200 map to the same real UID, 1000. Chowning a file to either 100 or 200 would cause its UID on the host to be set to 1000. When a container process subsequently calls stat on the file, the kernel would not know which virtual UID to return.
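
The ambiguity can be shown with a toy model in Python. This is only a sketch of the reasoning, not kernel behavior: the table and the chown and stat_candidates helpers are hypothetical stand-ins for the forward (virtual to real) and reverse (real to virtual) translations.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical many-to-one table: virtual UID -> real UID.
virtual_to_real: Dict[int, int] = {100: 1000, 200: 1000}


def chown(virtual_uid: int) -> int:
    """Forward translation: the real UID stored on disk for a virtual owner."""
    return virtual_to_real[virtual_uid]


def stat_candidates(real_uid: int) -> List[int]:
    """Reverse translation: every virtual UID that could own a file with this real UID."""
    inverse = defaultdict(list)
    for virtual, real in virtual_to_real.items():
        inverse[real].append(virtual)
    return sorted(inverse.get(real_uid, []))


# Both chowns store the same real UID, so the original owner cannot be recovered.
print(chown(100), chown(200))
print(stat_candidates(1000))
```

The forward direction is well defined, but the reverse lookup returns two candidates, which is why a one-to-one mapping is needed for stat to give a single answer.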

The solution adopted by the kernel is to retain one-to-one mapping between host and container UIDs but assign a range of real UIDs to individual users which can in turn be mapped to virtual UIDs within the containers created by them. Even though virtual UIDs map to different real UIDs, they can be potentially owned by a single user on the host.

Can you elaborate on what you mean by the following:

It seems really important to me that we be able to map ALL UIDs, as the security flaw with volumes in the current non-mapped model comes from potential UID overlap.

Imagine if there is a PaaS which gives its users the ability to run docker containers, and also the ability to ssh into a special non-privileged shell as a special non-privileged user. Within the ssh session, perhaps that user only has the right to run the docker client, in order to check the status of their container... Now if the docker container was able to create an executable file owned by that non-privileged user in a volume somewhere, you could end up with a breakout from the non-privileged ssh shell.

timthelion Mar 12, 2014

Contributor

@dineshs-altiscale I see the problem with the many-to-one mapping now. Perhaps it would be possible to make it so that only a single container-side UID is allowed to access a volume at all.

The security problem is that you don't want an untrusted docker container which has a volume mounted to be able to create a file owned by some arbitrary user on the host. You want to be able to dictate the host-side owner of the files in the volume. If you set a range, then the docker container could still create files owned by users outside that range...

dineshs-altiscale Mar 13, 2014

Contributor

Even the root in a container can create files only with UIDs explicitly mapped into the container with --uidmap -- no security issue there.

timthelion Mar 13, 2014

Contributor

@dineshs-altiscale can root create users with unmapped UIDs?

dineshs-altiscale Mar 13, 2014

Contributor

No, even root cannot create any artifacts (users, files, processes etc.) with unmapped UIDs. Note that root is always relative in this model where UIDs are hierarchically delegated. In the global host namespace, with the entire UID space mapped in, root can create and administer all users.

timthelion Mar 14, 2014

Contributor

So if I do:

docker run --uidmap="1:100000:10" -i -t ubuntu /bin/echo hi

Will I get an error message because the root user cannot exist (it is not mapped in that example, if I understand correctly)?

What if I do:

docker run --uidmap="1:100000:1" -i -t ubuntu /usr/sbin/useradd foo

Will useradd return some sort of permission denied error (since in this case it is not allowed to create any new users, as the only mapped UID is 0)?

dineshs-altiscale Mar 14, 2014

Contributor

Yes, the patch checks for UID 0 and any UID passed through -u.
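
A check of that kind could look roughly like the following Python sketch. It is not the patch's Go code: the function name and error text are hypothetical, and each mapping is assumed to be a (containerUID, hostUID, range) tuple in the field order discussed above.

```python
from typing import List, Optional, Tuple

UidMap = Tuple[int, int, int]  # (containerUID, hostUID, range)


def validate_container_uids(uidmaps: List[UidMap], user_uid: Optional[int] = None) -> None:
    """Require container UID 0, and the -u UID if given, to fall inside some mapping."""
    required = [0] if user_uid is None else [0, user_uid]
    for uid in required:
        mapped = any(start <= uid < start + size for start, _host, size in uidmaps)
        if not mapped:
            raise ValueError(f"container UID {uid} is not covered by any UID mapping")


# Root and UID 500 are both inside 0..9999, so this passes silently.
validate_container_uids([(0, 100000, 10000)], user_uid=500)
```

Under this sketch, a mapping such as (1, 100000, 10) would be rejected up front because container UID 0 is not covered.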

dineshs-altiscale added some commits Mar 9, 2014

Support for user namespaces
This exposes UID namespace support.  A new command line option (--uidmap)
maps a set of virtual UIDs to which the application within the container
is confined.  The application could potentially be the root in the
container but unprivileged on the host.

Addresses issue #2918

Docker-DCO-1.1-Signed-off-by: Dinesh Subhraveti <dineshs@altiscale.com> (github: dineshs-altiscale)

Add -x flag to optimize away UID translation
If -x flag is not set, UIDs of the files in the image are assumed to match the specified UID
mappings and no UID translation is performed.

Images with UIDs already translated can be produced by simply committing a container created
with -x flag:

    $ docker commit $(docker run -d -x --uidmap="100000:0:10000" centos true) centos_uid100000
    $ docker run --uidmap="100000:0:10000" -i -t centos_uid100000 bash

Docker-DCO-1.1-Signed-off-by: Dinesh Subhraveti <dineshs@altiscale.com> (github: dineshs-altiscale)

Add --private-uids option to simplify default use
--private-uids option is introduced to simplify the use of virtual UID space.
A default host UID range is chosen to create the container rather than the user
having to specify a mapping.  If user specifies mappings using --uidmap, they
take precedence.  In either case, the semantics of -x remain the same:

    $ docker commit $(docker run -d -x --private-uids centos true) centos_private_uids
    $ docker run --private-uids -i -t centos_private_uids bash
    # cat /proc/self/uid_map
             0     100000      10000

dineshs-altiscale Mar 23, 2014

Contributor

Here is some draft text outlining the usage scenarios. Will update as we go.

Creating an image with translated UIDs is simple.

root@userns ~ # docker commit $(docker run -d -x --private-uids ubuntu true) ubuntu-private-uids

It may take several seconds to translate the UIDs of all files in the image to a default UID range on the host and commit the product as a new image. The image can then be used without -x to avoid the translation latency.

root@userns ~ # docker run --private-uids -i -t ubuntu-private-uids bash
root ~ /# id
uid=0(root) gid=0(root) groups=0(root)
root ~ /# ls -lhd /
drwxr-xr-x 49 root root 4.0K Mar 20 14:22 /

Even though the UID appears to be 0, the process really runs as UID 100000 on the host. Similarly, the real UID of / on the host would be 100000.

The -u option works with private UIDs. The semantics are the same, but for simplicity it requires the numeric virtual UID of a user within the container rather than a name. The mapping between user names and UIDs is private to the container and unavailable on the host.

root@userns ~ # docker run --private-uids -u dineshs -i -t ubuntu-private-uids bash
2014/03/23 11:17:01 Invalid user: dineshs (-u has to be specified as a valid container UID rather than username when private UID space is used)

Also, the virtual UID provided has to be within the virtual UID range mapped into the container.

root@userns ~ # docker run --private-uids -u 20000 -i -t utest bash
2014/03/23 11:29:27 User '20000' must be a part of the UID map
root@userns ~ # docker run --private-uids -u 500 -i -t ubuntu-private-uids bash
dineshs ~ $ cat /proc/self/uid_map
                 0     100000      10000

/proc/self/uid_map shows the UIDs available to the container. In this case, virtual UIDs 0 to 9999 are mapped to real UIDs 100000 to 109999 on the host.
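
Each line of uid_map holds three whitespace-separated numbers: the first UID inside the namespace, the first UID outside it, and the length of the range. The Python sketch below parses that text and spells out the inclusive ranges; the function names are hypothetical, and the sample string simply repeats the output shown above.

```python
from typing import List, Tuple


def parse_uid_map(text: str) -> List[Tuple[int, int, int]]:
    """Parse uid_map contents into (inside_start, outside_start, length) tuples."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        inside, outside, length = (int(field) for field in line.split())
        entries.append((inside, outside, length))
    return entries


def describe(entries: List[Tuple[int, int, int]]) -> List[str]:
    """Render each entry as inclusive virtual and real UID ranges."""
    return [
        f"virtual {inside}-{inside + length - 1} -> real {outside}-{outside + length - 1}"
        for inside, outside, length in entries
    ]


sample = "         0     100000      10000\n"
for line in describe(parse_uid_map(sample)):
    print(line)
```

On a live system the same parser could be fed the contents of /proc/self/uid_map read from inside the container.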

An alternate custom UID mapping, rather than the default mapping used by --private-uids, can be supplied through the --uidmap option:

root@userns ~ # docker run -d -x --private-uids --uidmap "200000:0:10000" ubuntu true

Since the mapping is different in this case, the base ubuntu image is used with the -x option to translate its UIDs to the new mapping before running the container. The container can be committed to a different image.

The general syntax of --uidmap is as follows:

--uidmap="hostUID:containerUID:size" maps size virtual UIDs in the container starting from containerUID to their respective real UIDs starting from hostUID on the host.

The mappings can be sparse. Multiple ranges of UIDs can be mapped with multiple --uidmap options. If no real-to-virtual mapping exists for a UID, it shows up as nobody.

The mappings are one-to-one and cannot overlap. For example, if real UID 100000 on the host is mapped to virtual UID 0 in the container, real UID 0 cannot also be mapped to virtual UID 0 in the container. This makes any directories in the container image owned by root on the host appear as owned by nobody in the container.

If --uidmap is used, --private-uids is implicit. If some of the fields of --uidmap are dropped, defaults are used.

root@userns ~ # docker run -x --uidmap "200000" -i -t ubuntu bash   # 10000 virtual UIDs mapped from 0 in container to 200000 on host
root@userns ~ # docker run -x --uidmap "::3000" -i -t ubuntu bash   # 3000 virtual UIDs mapped from 0 in container to 100000 on host
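
The defaulting rule implied by these two examples can be sketched in Python as follows. The function name is hypothetical, and the default values (host start 100000, container start 0, size 10000) are inferred from the comments on the commands above and the default uid_map output, not taken from the patch itself.

```python
from typing import Tuple

# Defaults inferred from the examples in this thread (assumption).
DEFAULT_HOST_START = 100000
DEFAULT_CONTAINER_START = 0
DEFAULT_SIZE = 10000


def parse_uidmap_with_defaults(spec: str) -> Tuple[int, int, int]:
    """Parse "hostUID:containerUID:size", filling dropped fields with defaults."""
    fields = spec.split(":")
    if len(fields) > 3:
        raise ValueError(f"too many fields in {spec!r}")
    fields += [""] * (3 - len(fields))  # trailing fields may be omitted entirely
    defaults = (DEFAULT_HOST_START, DEFAULT_CONTAINER_START, DEFAULT_SIZE)
    host, container, size = (
        int(field) if field else default for field, default in zip(fields, defaults)
    )
    return host, container, size


print(parse_uidmap_with_defaults("200000"))
print(parse_uidmap_with_defaults("::3000"))
```

The first call fills in the container start and size, the second fills in both start values, matching the comments on the two commands.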

Any new empty volumes or volumes populated with contents from the image acquire the UID mappings as well.

root@userns ~ # docker run --name exporter --private-uids -v /var/volume -v /var/log -i -t ubuntu-private-uids bash
root ~ /# ls -lhd /var/{volume,log}
drwxr-xr-x 2 root root 4.0K Mar 16 23:57 /var/log
drwx--x--x 2 root root 4.0K Mar 20 16:28 /var/volume

However, the UIDs of any volumes imported from other containers or from the host are not translated, and such volumes remain inaccessible unless the UIDs of their files belong to the range of UIDs mapped into the container (or the permissions allow access).

root@userns ~ # docker run --name importer --private-uids -v /var/log:/mnt:rw --volumes-from exporter -i -t ubuntu-private-uids bash
root ~ /# ls -lhd /mnt /var/volume
drwxr-xr-x 5 nobody nogroup 4.0K Mar 19 17:00 /mnt
drwx--x--x 2 root root 4.0K Mar 20 16:28 /var/volume

In this case, since there is no virtual UID mapping for real UID 0, the volume owned by root on the host appears to belong to nobody within the container. However, the volume imported from exporter shows up correctly because it was created with the same default UID mappings as importer. In general, to be able to import volumes from a container, the UID range mapped into the importer must be a superset of the UIDs mapped into the exporter (or at least the UIDs used by its volumes). For simplicity, containers sharing volumes should use the same UID mappings.
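
The superset condition can be checked mechanically. The Python sketch below is one way to do it under simplifying assumptions: it looks only at the host-side ranges, given as hypothetical (hostUID, size) pairs, and ignores which virtual UIDs they map to.

```python
from typing import Iterable, List, Tuple

HostRange = Tuple[int, int]  # (hostUID, size)


def merge(intervals: Iterable[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge half-open [start, end) intervals that touch or overlap."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def can_import(importer: List[HostRange], exporter: List[HostRange]) -> bool:
    """True if every host UID mapped into the exporter is also mapped into the importer."""
    covered = merge((host, host + size) for host, size in importer)
    return all(
        any(start <= host and host + size <= end for start, end in covered)
        for host, size in exporter
    )


print(can_import([(100000, 10000)], [(100000, 10000)]))  # identical default mappings
print(can_import([(100000, 10000)], [(200000, 10000)]))  # disjoint host ranges
```

Identical mappings trivially satisfy the check, which is why using the same UID mappings for containers that share volumes is the simplest policy.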

SvenDowideit Mar 23, 2014

Contributor

I wonder, can you add Dockerfile equivalents to the commandline params?

something like

UIDMAP      default
USER          500

I think the text you've written above should go into the examples in cli.rst (and we'll move them around later)

@dineshs-altiscale

dineshs-altiscale Mar 24, 2014

Contributor

@SvenDowideit, yes, the Dockerfile commands are still to be done. I have updated cli.rst with the examples above.

@alexlarsson

alexlarsson Mar 25, 2014

Contributor

This seems a bit low-level to me. I mean, we should probably allow manual specification of UID maps for specialized needs, but in general UID mapping is complex because the UID space is a global resource on the host system that needs to be allocated and maintained. I guess it depends on exactly what kind of use cases one sees for user namespaces, but I think we could end up with a more useful system if the docker daemon did the allocation of UID ranges, remapping, etc.

There are many complexities involved here: Persistent allocation of host uid ranges, remapping of uids for images, volumes shared between containers need the same uid mappings, how to share images between hosts where the uid ranges are allocated differently (remap image when pushing to repository?). Etc, etc.

@dineshs-altiscale

dineshs-altiscale Mar 25, 2014

Contributor

UIDs are a rather complex resource. This PR attempts to provide a simple model for common use by using implicit defaults, while retaining flexibility for more advanced scenarios through an additional option. Virtual UIDs are a new feature for Docker and for Linux in general -- I think patterns around more complex use cases will evolve over time. In the meantime, the following simple model could cover at least the most common use cases:

Each host has a range of available UIDs, and Docker uses a particular subrange as the default when the --private-uids option is used. Implicitly using this subrange meets the common use case of securely isolating a container to a disjoint set of UIDs without exposing any underlying complexity. Even though multiple containers map to the same default subrange, they are isolated from the host and from each other. A process can be root in such a container but without privilege on the host. Because the containers share the same UID mapping specified by the default subrange, sharing data among them via volumes would be seamless. When a volume is imported from another container created with --private-uids, the UIDs are already aligned between the containers without having to expose additional detail to the user.

Container portability is preserved by expressing UIDs as host-independent relative offsets. Dockerfile uses a new UIDMAP instruction which would either specify "default" for the simple case or a mapping of form "host UID relative to default : container UID : size". The host UID field is expressed as a relative offset to the default subrange on that host. When the resulting container is run on a different host, container UIDs are mapped to the available subrange on the target host.
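To make the relative-offset idea concrete, here is a rough sketch of how a UIDMAP value of the form described above might be resolved on a given host. The parsing function, the tuple it returns, and the second host's base of 500000 are all hypothetical; only the 100000 default base comes from this thread.

```python
def resolve_uidmap(spec, host_default_base):
    """Resolve a hypothetical 'host-relative:container:size' UIDMAP value
    into an absolute (container_start, host_start, size) tuple."""
    relative_host, container_start, size = (int(field) for field in spec.split(":"))
    return (container_start, host_default_base + relative_host, size)


# The same instruction resolves to different real UIDs on different hosts.
print(resolve_uidmap("0:0:10000", 100000))   # host whose default subrange starts at 100000
print(resolve_uidmap("0:0:10000", 500000))   # assumed host with a different default subrange
```

The image config would only ever carry the relative form, so nothing host-specific leaks into the Dockerfile.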

Images in the repository are always stored in their "identity mapping" (UID x maps to UID x). The UID space to which they need to be translated is host dependent, and the translation is performed on the target before running the container. Once translated with the -x option, the image can be committed to speed up subsequent runs. Should whether an image is in its identity mapping or already translated be an image attribute on the host?

But I think managing host UID space by giving out arbitrary UID ranges to individual containers and keeping track of the mappings is hard and introduces unnecessary complexity -- UIDs are not a consumable resource, and their availability on a host depends on factors such as how the UID space is administered and what ranges are assigned to individual users. In the future, the daemon could potentially maintain multiple subranges and assign cohesive groups of containers to each. For now I think one default subrange gives adequate mileage.

Add Dockerfile instruction for private UIDs
This adds a Dockerfile instruction called PRIVATEUIDS which indicates
the range of host UIDs to use rather than just one default range of UIDs
(100k-110k) for all containers. It allows the user to specify which
contiguous set of 10k UIDs (referred to as a "bank" of UIDs) are mapped
into the container.  A cohesive group of containers that need to share
data through volumes must use the same bank.

--private-uids option is modified to take a parameter to specify the bank
as well.  The behavior of --uidmap and -x options remains the same.
@dineshs-altiscale

dineshs-altiscale Mar 30, 2014

Contributor

The last commit adds an integer parameter to --private-uids that selects the range of host UIDs to use, rather than using one default range of UIDs (100k-110k) for all containers. It lets the user specify which contiguous set of 10k UIDs (referred to as a "bank" in the code) is mapped into the container. A cohesive group of containers that need to share data through volumes must use the same bank. The behavior of the --uidmap and -x options remains the same.

The sysadmin of a host dedicates an unused portion of the host UID space to Docker, which is organized into a series of UID banks. The range(s) of host UIDs dedicated to Docker and the number of UIDs per bank could be made configurable (both are hardcoded in the current implementation).

For example, docker run -x --private-uids 1 -i -t ubuntu bash maps bank 1 into the container. Any other container created with --private-uids 1 will be able to share volumes with it seamlessly. For more UIDs, the --private-uids option can be specified up to 5 times (kernel limit). It also accepts a range of banks (1-3).

The following picture shows host UID range 100K - 500K dedicated to Docker, which is divided into 40 banks of 10K UIDs each. 20K UIDs from banks 1 and 3 are mapped at the default virtual UID 0 in the container.

                         Host
                       UID space
     0 (root)      +--------------+
     1 (bin)       |              |
     2 (daemon)    |              |
         .         |              |
         .         |              |
                   |              |
   500 (user1)     |              |
   501 (user2)     |              |
         .         |              |
         .         |              |
                   ~              ~
                   ~              ~
                   |              |
             100K  +--------------+  Docker UID Base
                   |    Bank 0    |                              Real      Container      Virtual
             110K  +--------------+                              UIDs      UID space      UIDs
                   |    Bank 1    |------------------------->   110K    +--------------+  0
             120K  +--------------+                                     |              |
                   |    Bank 2    |            +------------>   130K    +--------------+  10K
             130K  +--------------+            |                        |              |
                   |    Bank 3    |------------+                140K    +--------------+  20K
             140K  +--------------+                                     |   Unmapped   |
              .    |    Bank 4    |                                     |              |
              .    +--------------+                                             .
                   |    Bank 5    |                                             .
                   +--------------+                                             
                   |              |
                   ~              ~
                   ~              ~
                   |              |
             500K  +--------------+ Docker UID Limit
                   |              |
                   ~              ~
                   ~              ~
                   |              |
              4G   +--------------+

The Dockerfile for the above example would be something like:

FROM ubuntu
PRIVATEUIDS 1 3
USER 2000
CMD ["cat", "/proc/self/uid_map"]

Note that there is no Dockerfile equivalent for --uidmap to preserve portability.
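As a rough illustration of the bank arithmetic in the picture above (not the code in this PR), the following sketch turns a list of bank numbers into (virtual start, real start, size) entries, listed in the same column order as /proc/self/uid_map. The constants mirror the hardcoded values described in this comment; the function name and the merging of adjacent banks into one entry are assumptions.

```python
DOCKER_UID_BASE = 100_000   # start of the host range dedicated to Docker
UIDS_PER_BANK = 10_000      # size of one bank
MAX_MAPPINGS = 5            # per-namespace limit mentioned in this thread


def banks_to_uid_map(banks):
    """Map the given banks, in order, to consecutive virtual UIDs starting at 0."""
    entries = []
    next_virtual = 0
    for bank in banks:
        host_start = DOCKER_UID_BASE + bank * UIDS_PER_BANK
        if entries and entries[-1][1] + entries[-1][2] == host_start:
            # This bank directly follows the previous one on the host: extend that entry.
            virtual, real, size = entries[-1]
            entries[-1] = (virtual, real, size + UIDS_PER_BANK)
        else:
            entries.append((next_virtual, host_start, UIDS_PER_BANK))
        next_virtual += UIDS_PER_BANK
    if len(entries) > MAX_MAPPINGS:
        raise ValueError("too many separate UID mappings for one namespace")
    return entries


for virtual, real, size in banks_to_uid_map([1, 3]):   # PRIVATEUIDS 1 3
    print(virtual, real, size)
```

Under these assumptions a contiguous range such as banks 1-3 collapses into a single 30K entry, which is why ranges help stay under the mapping limit.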

I'll collect feedback on this and update the documentation.

@SvenDowideit

SvenDowideit Mar 31, 2014

Contributor

interesting - and then I wonder - can we name the banks - and perhaps get some auto-linkage when using shared volumes (sorry, this is not a request, I'm just thinking aloud)

@dineshs-altiscale

dineshs-altiscale Mar 31, 2014

Contributor

@SvenDowideit, I like the idea in general; names add color to otherwise boring integers.

One way is to define an ordered list of unique names per host, each representing an available UID bank. The list has to be ordered to be able to specify ranges. Otherwise, each container would be limited to 5 banks, given that a namespace can have at most 5 UID mappings.

Since names are host-dependent in this case, image config must store bank IDs for image portability. But then Dockerfiles themselves won't be portable if they reference host-dependent bank names.

Another possibility is to define a well-known universal mapping between integers and names. (Something like atomic numbers in the periodic table of elements, but there are only ~100 of them.) Then we'd be able to specify ranges, reference them in Dockerfiles, etc.

I am also debating the value of configurable UIDs-per-bank. Hardcoding it to a small enough value like 1000 would make banks portable (a container that needs 1 bank of 10K UIDs on one host may need 10 banks of 1K UIDs on another) and would also reduce configuration burden.
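The bank-size trade-off can be sketched as a simple renumbering. This assumes both hosts use the same Docker UID base and that the larger bank size is a multiple of the smaller one; the function name is made up for illustration.

```python
def rebank(banks, old_size, new_size):
    """Translate bank numbers for banks of old_size UIDs into the equivalent
    bank numbers for banks of new_size UIDs (same UID base assumed)."""
    if old_size % new_size != 0:
        raise ValueError("old_size must be a multiple of new_size")
    factor = old_size // new_size
    result = []
    for bank in banks:
        result.extend(range(bank * factor, (bank + 1) * factor))
    return result


# One 10K bank becomes ten 1K banks covering the same real UIDs.
print(rebank([1], 10_000, 1_000))
```

Because the ten resulting banks are consecutive, they could still be requested as a single range.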

Names aside, automatically pulling in UID banks from the container sharing its volumes is simple enough to implement.

@dineshs-altiscale

dineshs-altiscale Apr 1, 2014

Contributor

@SvenDowideit, just pushed the change to inherit banks from containers sharing volumes.

@dineshs-altiscale

dineshs-altiscale May 22, 2014

Contributor

As of the latest commit (5975178), all user flags are removed in the interest of simplicity. Containers are always created with default UID mappings that map container root to docker-root on the host and map all other host users, except root, one-to-one. So your code would get the mappings to enforce from the generic code rather than from a user flag.

Local images are stored with UIDs remapped so that no UID translation is required on container start. Images are reverse translated on push.

@EvanKrall

EvanKrall May 23, 2014

Contributor

Rather than (or in addition to) a --uidmap parameter for docker run, I'd like an option to the docker daemon that prevents it from starting non-uidmapped containers. This would be useful on machines where I know that I'm only going to run stateless services inside containers, and don't care much about volume interoperability.

@dineshs-altiscale

dineshs-altiscale Jun 13, 2014

Contributor

This PR has outgrown itself. As discussed at plumbers conference, I am going to split it up into simpler, bite-size PRs as follows and add references to them here:

  1. Fix up cache directories with the right permissions for running containers with remapped root.
  2. Update UIDs of the images on pull and push under a disabled flag.
  3. Enable the flag and call the driver to enforce UID mappings.

@crosbymichael let me know if this looks okay to you and I'll go ahead with the push.

@crosbymichael
Contributor

crosbymichael commented Jun 13, 2014

@dineshs-altiscale sounds good

@tianon

tianon Jun 14, 2014

Member

If we remap on disk at pull time, won't that break non-namespaced
privileged containers?

@dineshs-altiscale

dineshs-altiscale Jun 14, 2014

Contributor

I liked my earlier -x option (docker commit $(docker run -d -x ubuntu true) ubuntu-remapped) but then @alexlarsson convinced me otherwise.

I am happy to bring it back : )

@cyphar

cyphar Jun 14, 2014

Contributor

Why don't we make the remapping occur during container creation, not on image creation (or pulling)? IMO, it seems that the [UG]ID remapping should be a property of each container -- not the entire image.

@michaelneale

michaelneale Jun 14, 2014

Contributor

@cyphar perhaps there is concern that it will slow down launch times for standard containers (maybe that suggests the reverse, i.e. when running privileged, the image can be mapped back at launch time?)

@dineshs-altiscale - can you link here to the PRs you open so we can test/try them individually?
I think the #4572 will be forever etched in my mind ;)

@cyphar

cyphar Jun 14, 2014

Contributor

@michaelneale The problem with that is that some users may want different mappings for the users for each container (such as sharing a volume, where you don't want every container to read the shared data of every other container). That's why I said that it should be a property of the container, not just done at container creation.

@dineshs-altiscale

dineshs-altiscale Jun 14, 2014

Contributor

@cyphar to address the core issue of privilege isolation while keeping the usage model simple, we agreed to map just the root user initially. If only the root user is mapped, and is always mapped to the same host user, the mappings become static. If the mappings are static, they could be applied directly to the image itself.

That's kind of how we arrived here.

But then, as folks are pointing out, this scheme won't work if the user namespace is not used, or if root in different containers is mapped to different host users (isolating containers from each other is as necessary as isolating containers from the host), or if custom UID mappings are supported in the future.

Evidently this is a complex problem and some trade-offs seem to be unavoidable. With that in mind, let me make the following proposal for the initial patch (splitting it up into multiple simple PRs is no problem):

  • Only support mapping container root.
  • Always map it to the docker-root host user.
  • The user creates the remapped image with the -x option.

It's a trade-off between transparency and the burden on the user of creating remapped images before first use. The user should know what images are created and in use, how much storage they are consuming, etc. Most backends are poor at efficiently tracking changes to the file metadata containing the UIDs.

Then it becomes:

        $ docker commit $(docker run -d -x ubuntu true) ubuntu-remapped
        $ docker run -i -t ubuntu-remapped bash

@tianon

tianon Jun 14, 2014

Member

And what do we do when there is no "docker-root" user on the host? Also,
what does this mean for then pushing and sharing that image? Does it get
"unmapped" when I go to share it? Also, what happens if I want/need to
have a user in the container with the same UID as my host's "docker-root"
user?

@cyphar

cyphar Jun 14, 2014

Contributor

@tianon I already asked the first question, and it boiled down to "it should be part of the install". You can't just use a random uid (since any uid might be used somewhere, and doesn't need to be defined in /etc/passwd), and you shouldn't create users on demand if they are as important as docker-root.

@dineshs-altiscale

dineshs-altiscale Jun 14, 2014

Contributor

@tianon basically all host users, except root, are mapped into the container. That's 2^32 - 1 users. The missing UID in the container is the UID of the host docker-root user. The installer should assign it a UID that is not commonly used; luckily the range is large.

To keep this simple and transparent, the user is in charge of images -- remap before using and unmap before sharing. More sophisticated automation could be done in the next iteration in another PR.

@dineshs-altiscale

dineshs-altiscale Jun 15, 2014

Contributor

@crosbymichael could you please share your thoughts? Could something like the following be acceptable?

  1. Fix up cache directories with right permissions for running containers with remapped root.
  2. Map and unmap images with a new option under a disabled flag.
  3. Enable the flag and call driver to enforce UID mappings.
Contributor

dineshs-altiscale commented Jun 15, 2014

@crosbymichael could you please share your thoughts? Would something like the following be acceptable?

  1. Fix up cache directories with the right permissions for running containers with a remapped root.
  2. Map and unmap images with a new option, behind a flag that is disabled.
  3. Enable the flag and call the driver to enforce UID mappings.

Contributor

cyphar commented Jun 15, 2014

@dineshs-altiscale Would all UIDs other than root map to the value in /proc/sys/kernel/overflowuid? I ask because that is what the Linux kernel docs on user namespaces seem to suggest.

Contributor

dineshs-altiscale commented Jun 15, 2014

@cyphar Only the UID of the host's docker-root user would be unavailable in the container; it ends up appearing as nobody (or whatever overflowuid is set to). The remaining 2^32 - 1 UIDs are mapped. The specific mappings used in the code are:

Host                          | Container
0                             | Unmapped
1 to docker-root - 1          | 1 to docker-root - 1
docker-root                   | 0
docker-root + 1 to 2^32 - 1   | docker-root + 1 to 2^32 - 1

Mapping only root makes the container nearly unusable: any attempt to use a UID other than 0 fails with EINVAL, so no adduser, no su, no chown...
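
For readers who want to see the table above in the form the kernel consumes, here is a rough sketch that turns it into `/proc/<pid>/uid_map` lines (each line is `ID-inside-namespace ID-outside-namespace length`). It is not the code from this PR. The function names and the docker-root UID of 1000 are hypothetical, and the sketch stops at UID 4294967294 on the assumption that the kernel reserves 4294967295 as the invalid ID, so the top of the range differs slightly from the table.

```python
# The kernel accepts IDs 0..4294967294 in a map; 4294967295 is the invalid ID.
ID_COUNT = 2**32 - 1


def root_swap_uid_map(docker_root):
    """Return (container_id, host_id, length) extents for the table above."""
    if not 0 < docker_root < ID_COUNT:
        raise ValueError("docker_root must be between 1 and %d" % (ID_COUNT - 1))

    # Host docker-root becomes root inside the container.
    extents = [(0, docker_root, 1)]
    # Host 1 .. docker-root - 1 keep their values.
    if docker_root > 1:
        extents.append((1, 1, docker_root - 1))
    # Host docker-root + 1 and above keep their values.
    if docker_root + 1 < ID_COUNT:
        extents.append((docker_root + 1, docker_root + 1, ID_COUNT - docker_root - 1))
    # Host 0 appears in no extent, so it stays unmapped.
    return extents


def format_uid_map(extents):
    """Render extents as the text written to /proc/<pid>/uid_map."""
    return "".join("%d %d %d\n" % extent for extent in extents)


if __name__ == "__main__":
    print(format_uid_map(root_swap_uid_map(1000)), end="")
    # 0 1000 1
    # 1 1 999
    # 1001 1001 4294966294
```

The scheme needs at most three extents, which matters because the kernel limits how many lines a uid_map may contain.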

Contributor

dineshs-altiscale commented Jun 23, 2014

This has been refactored into three PRs:

  • The first required piece, which fixes permissions of cache directories and can be merged independently: #6600
  • Subsequent commits based on user-driven image mapping: #6602
  • Subsequent commits based on graph driver-driven image mapping: #6603

Contributor

crosbymichael commented Jun 23, 2014

@dineshs-altiscale do you think we can close this PR in favor of the others?

Contributor

dineshs-altiscale commented Jun 23, 2014

Yes, we could, but I was going to give it a few days to capture any other comments and discussion on the high-level approach.

tphyahoo commented Mar 17, 2015

Could this have the project/security label added?
