Rootless Containers #774

Merged
merged 8 commits into from Mar 27, 2017

Member

cyphar commented Apr 23, 2016

This enables support for "rootless container mode". There are
certain restrictions on what non-root users can do, resulting in several
runC features not being available. I've implemented config validation
to make this clear to users.

  • All cgroup operations require having write access to your current
    cgroup directory. By default, the directories are owned by root and have
    the mode 0755. This means that we cannot set up any cgroups, or join
    cgroups. The new cgroup namespace doesn't fix this for us either, but
    hopefully we can get a patch upstream to fix this. But we should still
    improve cgroup handling so that we apply any cgroups we can if we have
    write access to the directory.
  • setgroups(2) cannot be used in a non-privileged user namespace setup.
    We also have to set /proc/self/setgroups to "deny".
  • We cannot map any user other than ourselves in a rootless container,
    which means that any user-related directives won't work. You can only be
    "root".

If you want to use this, you have to make sure you remove the gid=5 entry from the /dev/pts mount, and only map your own user in the namespace.

Here's runc start working in both root and rootless setup:
[image: it works]

And here's runc exec working in both root and rootless setup:
[image: it also works]

TODO

  • Provide a meaningful error message if the user specifies a configuration which maps more than one user (or doesn't map the effective user).
  • Provide a meaningful error message (don't just ignore it) if the user tries to use the user directive to run as a different user.
  • Provide a meaningful error message (don't just ignore it) if the user tries to specify any cgroup settings. Currently we can't provide cgroup settings (though it should be possible if the user just so happens to own the cgroup they currently reside in).
  • runc exec doesn't work, and we should be able to implement it. This actually complicates the code in nsenter.c which checks whether the container is unprivileged. setgroup needs special treatment.
  • runc exec doesn't work with root running the exec and a rootless container (ironically). This is because we autodetect the rootless parameter on run, which isn't accurate. This can be fixed by storing the rootless flag in the state.json.
  • rootless should be passed to the init through netlink. Currently we are doing the rootless check in two places and it doesn't make sense to do the check in nsenter.c -- we might actually have to do it with capability checks in the future.
  • runc events doesn't work because the rootless.Manager doesn't appear to manage the paths properly, so it can't get any data from the cgroups.
  • Add runc spec --rootless.
  • loadContainer doesn't properly load the cgroup manager for the container, because the API forces that to happen (libcontainer.New() takes the cgroup manager as an argument). We can probably fix this by making it load the cgroup manager from the container state (but it might be ugly). Without this, we can't even hope to have runc pause and runc resume working.
  • Currently the cgroup setup is binary (either we use cgroups or we don't). But since cgroupv1 lets you have different permissions on different hierarchies, we should check for each subsystem that we have access to create a subtree in that cgroup. Then we can support several setups (as well as the kernel patch I've proposed). This would require many changes in config validation. In addition, we would have to store what cgroups we are using in a bitmask (like in the kernel).
    • This would require adding a bunch of tests to libcontainer/cgroup/rootless. Luckily we already have all of the mock stuff we need in cgroupfs.
  • Add unit tests to:
    • libcontainer/config/validate/rootless
    • libcontainer/specconv/spec_linux.go with Rootless == true.
    • libcontainer/cgroup/rootless (not necessary at the moment)
  • Detaching doesn't work due to a bug in runC with --console and user namespaces. #814 and #883
  • Add a setup for testing the rootless containers so we can make sure there are no regressions in this setup. We can try running the whole test suite (minus cgroups and a few other things). The sniff tests that were expected to work no longer exist. This was blocked on fixing the --console bug, which has since been fixed as part of #1018.
    • We have to refactor all of the tests to use a wrapper around runc (so we can mess around with arguments in rootless mode).
    • Also, we need to skip certain tests in rootless mode, since they are not supported.
  • We currently cannot use network namespaces in a rootless container setup (and still connect to the network). A fix for this is to use the host network, but that currently doesn't work in runC (#799). It should also be noted that things like ping don't work in rootless containers (user namespace issues). #807 fixes the validation issue.
    • Figure out why we still can't talk to the internet.
  • Switch container.Processes() to join the container PID namespace, enumerate the list of PIDs and then send them over a UNIX socket (this causes the PIDs to be translated). The end result is to not require cgroups to enumerate PIDs (which actually isn't a good idea since processes may join sub-cgroups). This is necessary for runc ps and similar things to work properly.
    • There are some unanswered questions about sending the PIDs though, because you need CAP_SYS_ADMIN in order to send a different PID. And there's also a valid question about atomicity (enumeration is not atomic, reading from cgroup.procs is).

Open Questions

What works?

  • As unprivileged user:
    • runc checkpoint (while potentially possible, not implemented)
    • runc create (#814)
    • runc create --console (#814)
    • runc delete
    • runc events (not really useful -- cgroups)
    • runc exec
    • runc exec --console (#814)
    • runc kill
    • runc list
      • with containers not readable by us.
    • runc pause (cgroups)
    • runc ps (cgroups)
    • runc restore (while potentially possible, not implemented)
    • runc resume (cgroups)
    • runc run
    • runc run -d --console (#814)
    • runc spec
    • runc start (create doesn't work -- #814)
    • runc state
    • runc update (cgroups)
  • As root:
    • runc checkpoint (while potentially possible, not implemented)
    • runc delete
    • runc events (not really useful -- cgroups)
    • runc exec
    • runc exec --console (#814)
    • runc kill
    • runc list
    • runc pause (cgroups)
    • runc ps (cgroups)
    • runc restore (while potentially possible, not implemented)
    • runc resume (cgroups)
    • runc run
    • runc run -d --console (#814)
    • runc spec
    • runc start (create doesn't work -- #814)
    • runc state
    • runc update (cgroups)

Kernel Patches

Implements #38.

Signed-off-by: Aleksa Sarai <asarai@suse.de>

Contributor

mrunalp commented Apr 23, 2016

For cgroups, we can skip doing any setup if cgroupsPath == "" and Resources == nil in the config.
We can either skip or introduce a NoOp cgroups manager.
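
A minimal sketch of the no-op manager idea, assuming a trimmed-down, hypothetical `Manager` interface and `cgroupConfig` type (libcontainer's real interface has more methods and different names):

```go
package main

import "fmt"

// Manager is a hypothetical stand-in for a cgroup manager interface.
type Manager interface {
	Apply(pid int) error
	Destroy() error
}

// noopManager satisfies Manager without touching any cgroup files.
type noopManager struct{}

func (noopManager) Apply(pid int) error { return nil }
func (noopManager) Destroy() error      { return nil }

// resources and cgroupConfig are hypothetical; they only model the two
// fields named in the comment above (the cgroup path and Resources).
type resources struct {
	MemoryLimit int64
}

type cgroupConfig struct {
	Path      string
	Resources *resources
}

// newManager picks the no-op manager when nothing cgroup-related was
// requested, and otherwise returns the caller-supplied real manager.
func newManager(cfg cgroupConfig, real Manager) Manager {
	if cfg.Path == "" && cfg.Resources == nil {
		return noopManager{}
	}
	return real
}

func main() {
	m := newManager(cgroupConfig{}, nil)
	fmt.Printf("manager=%T applyErr=%v\n", m, m.Apply(os_pid()))
}

// os_pid is a placeholder PID source so the sketch stays self-contained.
func os_pid() int { return 1 }
```

Keeping the decision in one constructor means callers never need to special-case the rootless path.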

Member

cyphar commented Apr 24, 2016

@mrunalp I'm going to go with a rootless cgroup manager so we might expand it later (there are some upcoming kernel features that might make cgroups in rootless containers usable).

Contributor

mrunalp commented Apr 24, 2016

Sounds good.

cyphar changed the title from "[WIP] Rootless Containers" to "Rootless Containers" on Apr 24, 2016

libcontainer/configs/validate/rootless.go (outdated)
return nil
}
// Used for comparison.

@crosbymichael

crosbymichael Apr 25, 2016

Member

This is some pretty complex validation. So some cgroups are ok but others are not?

@mrunalp

mrunalp Apr 25, 2016

Contributor

We could just fail the check if any values are set at all. No need to go check for defaults. User can create the config without setting resources or path which is simple to check.

@cyphar

cyphar Apr 25, 2016

Member

The /actual/ check is that all cgroups have no non-default settings. Unfortunately, specconv adds device cgroup settings that get merged with the config the user specified -- so a simple "is this equal to the zero value" check doesn't cut it. I haven't figured out a nice way of dealing with that (specconv runs long before we get to this part, and we need it to run before we can do anything with the config (like figuring out if we're rootless)).

Maybe we can do our rootless check before specconv, then specconv doesn't modify the cgroup settings if we're rootless, and then we do the config checks for rootless (do we have mapping rights).

@crosbymichael

crosbymichael Apr 25, 2016

Member

Ya, that sounds better. We should be looking for this in runc not in libcontainer.

@cyphar

cyphar Apr 25, 2016

Member

I'm not really convinced that we shouldn't be doing any checking in libcontainer. The same question can be asked about libcontainer/configs/validate -- why do we do any config verification inside libcontainer? There's also a question of whether or not libcontainer should autodetect rootless mode or whether it should be passed as an option (you can't use rootless containers with root as far as I can tell -- and it's definitely a bad idea).

@crosbymichael

crosbymichael Apr 25, 2016

Member

runc should populate the correct config that libcontainer gets and it should not be modified inside libcontainer. All the changes that need to be made should happen while we generate the config not after it is made.

@cyphar

cyphar Apr 26, 2016

Member

The current state of this patchset doesn't modify the config inside libcontainer. The issue is that specconv adds device options to []Device even if the user doesn't specify anything. So we can either do the verification of the cgroups in runC (which means that if someone uses libcontainer directly they probably won't immediately realise they can't set cgroup settings) or we do the verification in libcontainer and make specconv not generate any cgroup options in rootless mode.

I'll also move the isRootless checks to RootlessValidator.

EDIT: I've fixed this.

Contributor

jessfraz commented Apr 26, 2016

i think the rootless cgroups manager is good, then it can be expanded when the cgroups ns is added :) just my opinion but ianam, thanks for this

Contributor

jessfraz commented Apr 26, 2016

also wrt the features that might not be possible or are hard for the time being, they could be disabled, and then slowly turned back on as implementations evolve, kinda like how we did userns in docker, and then slowly added more features back in wrt sharing namespaces
it's easier to make a smaller change and then iterate on it than to make one huge one

Member

cyphar commented Apr 26, 2016

@jfrazelle AFAICS all of the core features work. But some of them either just require root (criu IIRC) or currently can't be done under user namespaces (cgroups -- which I'm working on a patch for). But all of the others should still work, and in principle I want to try to get all of the features working for root operating on a rootless container.

Contributor

jessfraz commented Apr 26, 2016

Nice :)

Member

cyphar commented Apr 26, 2016

Note: as far as I can see, the only thing left to do before we can clear most of the checkpoints is for me to fix up the rootless cgroup manager so that it stores the process's actual cgroup path. Then runc events should work, as well as all of the freezer code (for root). Checkpoint and restore might also work just by doing that.

cyphar closed this on Apr 26, 2016

cyphar reopened this on Apr 26, 2016

Member

cyphar commented Apr 26, 2016

Whoops wrong button. ;)

Member

cyphar commented Apr 29, 2016

@avagin Do you know if it's possible to run criu as an unprivileged user (specifically from the kernel side when interacting with unprivileged user namespaces)? If not, is there any intention from upstream to get that to work? And if not, should we disable all checkpoint/restore functionality for rootless containers (even if the person doing the checkpoint/restore is root -- since the restore setup might break in weird ways)?

davidlt commented May 2, 2016

Looks like this is moving forward, very interesting, so I started testing it. Built on Fedora 24 (updated on May 2nd). Looks like bind mounts work fine without root permissions. I wasn't lucky with internet connectivity without sudo. Later on I need to check if there is a way not to be the root user within the container (our software will not be using the root account). Nice progress!

[davidlt@pccms205 magic_dir]$ pwd
/home/davidlt/magic_dir
[davidlt@pccms205 magic_dir]$ ls
[davidlt@pccms205 magic_dir]$ touch HOST_FILE
[davidlt@pccms205 magic_dir]$ ls
HOST_FILE
[davidlt@pccms205 magic_dir]$ cat /etc/os-release | grep PRETTY_NAME >> HOST_FILE
[davidlt@pccms205 magic_dir]$ cat HOST_FILE
PRETTY_NAME="Fedora 24 (Workstation Edition)"
[davidlt@pccms205 test]$ runc --root $PWD start test_container
sh-4.2# cat /etc/os-release | grep PRETTY_NAME
PRETTY_NAME="CentOS Linux 7 (Core)"
sh-4.2# cd /some
sh-4.2# cat /etc/os-release | grep PRETTY_NAME >> HOST_FILE
sh-4.2# touch NEW_FILE
sh-4.2# exit
[davidlt@pccms205 test]$ cat ~/magic_dir/HOST_FILE 
PRETTY_NAME="Fedora 24 (Workstation Edition)"
PRETTY_NAME="CentOS Linux 7 (Core)"
[davidlt@pccms205 test]$ file ~/magic_dir/NEW_FILE 
/home/davidlt/magic_dir/NEW_FILE: empty

For anyone else who wants to test this I am also sharing diff between original and modified config.json:

--- config.json 2016-05-02 13:25:24.468181348 +0200
+++ config.json.correct 2016-05-02 13:25:14.661496040 +0200
@@ -6,7 +6,7 @@
    },
    "process": {
        "terminal": true,
-       "user": {},
+       "user": { "uid": 0, "gid": 0, "additionalGids": null },
        "args": [
            "sh"
        ],
@@ -35,6 +35,12 @@
    },
    "hostname": "runc",
    "mounts": [
+                {
+                        "destination": "/some",
+                        "type": "bind",
+                        "source": "/home/davidlt/magic_dir",
+                        "options": [ "rbind" ]
+                },
        {
            "destination": "/proc",
            "type": "proc",
@@ -60,8 +66,7 @@
                "noexec",
                "newinstance",
                "ptmxmode=0666",
-               "mode=0620",
-               "gid=5"
+               "mode=0620"
            ]
        },
        {
@@ -112,14 +117,20 @@
    ],
    "hooks": {},
    "linux": {
-       "resources": {
-           "devices": [
-               {
-                   "allow": false,
-                   "access": "rwm"
-               }
-           ]
-       },
+                "uidMappings": [
+                        {
+                                "hostID": 1000,
+                                "containerID": 0,
+                                "size": 1
+                        }
+                ],
+                "gidMappings": [
+                        {
+                                "hostID": 1000,
+                                "containerID": 0,
+                                "size": 1
+                        }
+                ],
        "namespaces": [
            {
                "type": "pid"
@@ -135,7 +146,10 @@
            },
            {
                "type": "mount"
-           }
+           },
+                        {
+                                "type": "user"
+                        }
        ],
        "maskedPaths": [
            "/proc/kcore",
@@ -152,4 +166,4 @@
            "/proc/sysrq-trigger"
        ]
    }
-}
\ No newline at end of file
+}

Member

cyphar commented May 2, 2016

@davidlt

> Later on I need to check if there is a way not to be the root user within the container (our software will not be using the root account). Nice progress!

Unfortunately this is not possible, due to restrictions within the kernel. Essentially this is a logical result of these two restrictions on user namespaces:

  1. All user namespaces must provide a mapping for root inside a container.
  2. Unprivileged user namespaces can only provide a mapping for one user, the user which created the namespace.

As a result, the only user that is mapped inside the container is your user (as root). You can discuss with the kernel upstream about restriction number 1, because it's the only restriction which it might be possible to fix. The second restriction is just a security issue. But at the moment, there isn't a way to do what you want.
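
The single-user restriction described above is also what the TODO item about "a meaningful error message" asks to validate. A minimal sketch of such a check, assuming a hypothetical `IDMap` type and `validateRootlessMapping` helper (not the validator in this patchset):

```go
package main

import (
	"fmt"
	"os"
)

// IDMap is a hypothetical stand-in for one uid/gid mapping entry.
type IDMap struct {
	ContainerID int
	HostID      int
	Size        int
}

// validateRootlessMapping rejects any mapping set that an unprivileged user
// could not write: it must be exactly one entry, of size 1, for hostID.
func validateRootlessMapping(kind string, maps []IDMap, hostID int) error {
	if len(maps) != 1 {
		return fmt.Errorf("rootless containers need exactly one %s mapping, got %d", kind, len(maps))
	}
	m := maps[0]
	if m.Size != 1 {
		return fmt.Errorf("rootless %s mapping must have size 1, got %d", kind, m.Size)
	}
	if m.HostID != hostID {
		return fmt.Errorf("rootless %s mapping must map host ID %d, got %d", kind, hostID, m.HostID)
	}
	return nil
}

func main() {
	// Mirrors the config.json diff earlier in the thread, but with the
	// caller's own uid instead of a hard-coded 1000.
	uidMaps := []IDMap{{ContainerID: 0, HostID: os.Getuid(), Size: 1}}
	if err := validateRootlessMapping("uid", uidMaps, os.Getuid()); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("uid mapping is usable in rootless mode")
}
```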

davidlt commented May 2, 2016

Yeah, I tried a few things. I even added a user within a container using Docker, which I can see while running with runC, but I cannot launch anything under that user.

What about internet connectivity?

Member

cyphar commented May 2, 2016

Unfortunately, creating bridges between a container's network namespace and the host's network namespace requires creating a virtual interface in the host's network namespace. AFAIK that requires root (but I may be wrong). One potential solution would be to not create a network namespace (this currently doesn't work due to some bindmount issues, but that can be fixed). Obviously, this means you don't get the benefits of network namespacing (such as iptables rules without root).

This could be something else we could push the kernel about. Unfortunately, I'm not familiar enough with networking to be able to help with writing a kernel patch. My only kernel experience thus far has been with cgroups.

@cyphar

cyphar May 2, 2016

Member

To be honest, I just quickly read through the kernel code and I'm not sure this is a restriction imposed by the kernel. It's possible it's just how we've set up user namespaced containers to work. Currently my runC build is failing with the error:

process_linux.go:247: getting pipe fds for pid 16189 caused "readlink /proc/16189/fd/0: permission denied"

This tells me there are some permission issues with our /proc setup (we read the pipe file descriptors over stdin, but for some reason it looks like the process can't open its own stdin?).

@mrunalp

mrunalp May 2, 2016

Contributor

@davidlt Your best bet for networking would be to use the host network stack (i.e. don't add a network namespace to the config).
The way networking is typically set up requires moving network devices from the host network namespace to the container's network namespace. With privileged user namespaces, the runc hooks can do that work, but that isn't possible with unprivileged containers. It would need discussion with upstream, as @cyphar suggested.
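One way to picture the "use the host network stack" suggestion is as a filter over the namespace list in the container config. The sketch below is only illustrative: the `Namespace` type and the `withoutNetwork` helper are hypothetical stand-ins, not runC's or libcontainer's actual types.

```go
package main

import "fmt"

// Namespace is a hypothetical stand-in for a namespace entry in a
// container config; it is not the real runC/libcontainer type.
type Namespace struct {
	Type string // e.g. "pid", "mount", "network", "user"
	Path string // optional path of an existing namespace to join
}

// withoutNetwork returns a copy of nss with every "network" entry removed,
// so that the container would stay in the host's network namespace.
func withoutNetwork(nss []Namespace) []Namespace {
	out := make([]Namespace, 0, len(nss))
	for _, ns := range nss {
		if ns.Type != "network" {
			out = append(out, ns)
		}
	}
	return out
}

func main() {
	nss := []Namespace{{Type: "pid"}, {Type: "network"}, {Type: "user"}}
	for _, ns := range withoutNetwork(nss) {
		fmt.Println(ns.Type)
	}
}
```

The trade-off is the one already mentioned in the thread: without a network namespace the container shares the host's interfaces, so there is no network isolation.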

@avagin

avagin May 3, 2016

Contributor

All cgroup operations require having CAP_SYS_ADMIN in the root user
namespace. This means that we cannot set up any cgroups, or join
cgroups. The new cgroup namespace doesn't fix this for us either, but
hopefully we can get a patch upstream to fix this.

I don't understand this passage. cgroups work for unprivileged users in the same way as other file systems. I've read the kernel code and haven't found any places in the cgroup code that are protected by CAP_SYS_ADMIN.

[avagin@laptop ~]$ whoami 
avagin
[avagin@laptop ~]$ sudo mkdir /sys/fs/cgroup/cpu/test
[avagin@laptop ~]$ sudo chown avagin:avagin /sys/fs/cgroup/cpu/test
[avagin@laptop ~]$ echo $$ > /sys/fs/cgroup/cpu/test/tasks 
bash: /sys/fs/cgroup/cpu/test/tasks: Permission denied
[avagin@laptop ~]$ mkdir /sys/fs/cgroup/cpu/test/sub
[avagin@laptop ~]$ echo $$ > /sys/fs/cgroup/cpu/test/sub/tasks 
[avagin@laptop ~]$ cat /sys/fs/cgroup/cpu/test/sub/cpu.shares 
1024
[avagin@laptop ~]$ echo 2014 > /sys/fs/cgroup/cpu/test/sub/cpu.shares
[avagin@laptop ~]$ echo 512 > /sys/fs/cgroup/cpu/test/sub/cpu.shares
[avagin@laptop ~]$ ls -l /sys/fs/cgroup/cpu/test/
total 0
-rw-r--r--. 1 root   root   0 May  3 14:26 cgroup.clone_children
-rw-r--r--. 1 root   root   0 May  3 14:26 cgroup.procs
-r--r--r--. 1 root   root   0 May  3 14:26 cpuacct.stat
-rw-r--r--. 1 root   root   0 May  3 14:26 cpuacct.usage
-r--r--r--. 1 root   root   0 May  3 14:26 cpuacct.usage_percpu
-rw-r--r--. 1 root   root   0 May  3 14:26 cpu.cfs_period_us
-rw-r--r--. 1 root   root   0 May  3 14:26 cpu.cfs_quota_us
-rw-r--r--. 1 root   root   0 May  3 14:26 cpu.shares
-r--r--r--. 1 root   root   0 May  3 14:26 cpu.stat
-rw-r--r--. 1 root   root   0 May  3 14:26 notify_on_release
drwxrwxr-x. 2 avagin avagin 0 May  3 14:25 sub
-rw-r--r--. 1 root   root   0 May  3 14:25 tasks
[avagin@laptop ~]$ ls -l /sys/fs/cgroup/cpu/test/sub/
total 0
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 cgroup.clone_children
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 cgroup.procs
-r--r--r--. 1 avagin avagin 0 May  3 14:25 cpuacct.stat
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 cpuacct.usage
-r--r--r--. 1 avagin avagin 0 May  3 14:25 cpuacct.usage_percpu
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 cpu.cfs_period_us
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 cpu.cfs_quota_us
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 cpu.shares
-r--r--r--. 1 avagin avagin 0 May  3 14:25 cpu.stat
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 notify_on_release
-rw-r--r--. 1 avagin avagin 0 May  3 14:25 tasks
@jessfraz

jessfraz May 3, 2016

Contributor

Just device cgroups will fail

@cyphar

cyphar May 3, 2016

Member

@avagin Sorry, I need to update the first paragraph. But if you look at your session log:

$ sudo chown avagin:avagin /sys/fs/cgroup/cpu/test

If you have to use root to enable using cgroups, it's not useful for some use cases of rootless containers. I've been working upstream to allow an unprivileged cgroup namespace to create its own subtrees, which is what is necessary to make rootless containers mostly feature-complete.

But yeah, CAP_SYS_ADMIN is definitely the wrong thing and I'll fix up what I wrote.

Do you know anything about whether you can use criu as an unprivileged user? My guess is that you probably can't, since it messes around with saving and restoring kernel state.

@avagin

avagin May 3, 2016

Contributor

@cyphar

If you have to use root to enable using cgroups, it's not useful for some usecases of rootless containers.

Are you sure that this should be fixed in kernel space? Maybe we need to fix this in systemd? How does LXC handle this problem?

Do you know anything about whether you can use criu as an unprivileged user? My guess is that you probably can't, since it messes around with saving and restoring kernel state.

We announced "Unprivileged dump" in CRIU 2.0 and are now working on "Unprivileged restore". I don't know how well it will work for rootless containers, but I don't think it's an unsolvable task.

@davidlt

davidlt May 4, 2016

I have been testing DMTCP (which transparently checkpoints a single-host or distributed computation in user space), and it worked in userland for checkpointing and restoring complex applications. Thus CRIU in userland should be technologically possible.

@avagin

avagin May 4, 2016

Contributor

On Tue, May 3, 2016 at 2:31 PM, Jess Frazelle notifications@github.com
wrote:

Just device cgroups will fail

Yes, you are right. And that isn't the only problem with devices: mknod is
protected by CAP_MKNOD too.

@cyphar

cyphar May 4, 2016

Member

@avagin

If you have to use root to enable using cgroups, it's not useful for some usecases of rootless containers.

Are you sure that this should be fixed in a kernel space? Maybe we need to fix this in systemd? How does LXC handles this problem.

Reliance on systemd is a bad idea, since not all systems use systemd, systemd doesn't support all of the cgroups we want to use, systemd is a daemon running as root, and there's definitely a more general solution. LXC requires a daemon (cgmanager) that provides an authenticated way of requesting cgroup setup, which requires installing software that runs as root.

The reason I think this should be solved in the kernel is actually not because of containers, it's because cgroups are meant to be a general resource limiting system -- why can't an unprivileged process set up resource limiting for its own subprocesses? And it looks like upstream agrees with me on this point (the details are still a bit murky, but cgroup namespaces will probably be part of the solution).

Yes, you are right. It isn't only one problem with devices. mknod is
protected by CAP_MKNOD too.

note: CAP_MKNOD in the root user namespace.

@avagin

avagin May 4, 2016

Contributor

The reason I think this should be solved in the kernel is actually not because
of containers, it's because cgroups are meant to be a general resource limiting
system -- why can't an unprivileged process set up resource limiting for its
own subprocesses? And it looks like upstream agrees with me on this point (the
details are still a bit murky, but cgroup namespaces will probably be part of
the solution).

This looks reasonable. Thank you for the explanation.

@cyphar cyphar added this to the 0.2.0 milestone May 9, 2016

@cyphar cyphar added enhancement and removed enhancement labels May 13, 2016

@cyphar

cyphar May 22, 2016

Member

I've opened #836 and #837 to take some of the more general cleanup patches and apply them to runC while this feature is being worked on.

@avagin

avagin Mar 22, 2017

Contributor

LGTM

Approved with PullApprove

@mrunalp

mrunalp Mar 22, 2017

Contributor

I will also review this. Please wait before merge :)

@discordianfish

Awesome :)

if (config.namespaces) {
	if (prctl(PR_SET_DUMPABLE, 1, 0, 0, 0) < 0)
		bail("failed to set process as dumpable");
}

@hqhq

hqhq Mar 23, 2017

Contributor

Why do we need this?

@cyphar

cyphar Mar 23, 2017

Member

/proc/self/uid_map and /proc/self/gid_map become root-owned if the process is not dumpable, so the process doing the mapping doesn't have the privileges to write them.
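For context, the mapping step mentioned here amounts to writing short text records to those /proc files. The following is a minimal sketch of formatting such a record for the rootless case (a single host ID mapped to root inside the namespace), assuming the usual `inside-id outside-id length` record format of uid_map and gid_map. `singleIDMap` is a hypothetical helper, and actually writing the files is left out.

```go
package main

import (
	"fmt"
	"os"
)

// singleIDMap formats one uid_map/gid_map record that maps hostID to
// ID 0 (root) inside the new user namespace, with a range of length 1.
func singleIDMap(hostID int) string {
	return fmt.Sprintf("0 %d 1\n", hostID)
}

func main() {
	// Records an unprivileged process would write for its own IDs.
	fmt.Print("uid_map: ", singleIDMap(os.Getuid()))
	fmt.Print("gid_map: ", singleIDMap(os.Getgid()))
}
```

If the target files are root-owned because the process is not dumpable, the unprivileged writer cannot open them for writing, which is why the snippet under review makes the process dumpable first.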

@hqhq

hqhq Mar 23, 2017

Contributor

I mean, shouldn't this unsetting and resetting of dumpable happen in the parent process?

@hqhq

hqhq Mar 23, 2017

Contributor

OK, I see, misunderstood this.

	return getCgroupPathHelper(subsystem, cgroup)
}

func getCgroupPathHelper(subsystem, cgroup string) (string, error) {

@hqhq

hqhq Mar 23, 2017

Contributor

This function is kind of subtle, can you keep the comments?

@cyphar

cyphar Mar 23, 2017

Member

The only comment that still makes sense is the one on filepath.Rel. Is that the one you want me to keep?

@hqhq

hqhq Mar 23, 2017

Contributor

In this function, yes. And maybe keep the other comment at raw.path() to explain why we use GetOwnCgroupPath instead of GetInitCgroupPath. A lot of people used to ask me about these cgroup path functions...

@cyphar

cyphar Mar 23, 2017

Member

Fair enough. Not sure why I removed them in the first place.

Makefile
@@ -91,11 +91,18 @@ localunittest: all
 	go test -timeout 3m -tags "$(BUILDTAGS)" ${TESTFLAGS} -v $(allpackages)
 integration: runcimage
-	docker run -e TESTFLAGS -t --privileged --rm -v $(CURDIR):/go/src/$(PROJECT) $(RUNC_IMAGE) make localintegration
+	docker run -e TESTFLAGS -ti --privileged --rm -v $(CURDIR):/go/src/$(PROJECT) $(RUNC_IMAGE) make localintegration

@hqhq

hqhq Mar 23, 2017

Contributor

You don't really need -i here, do you? We removed this to fix CI in non-interactive mode, see #1252 .

@cyphar

cyphar Mar 23, 2017

Member

No I don't think we need it.

@@ -223,6 +231,29 @@ func GetInitCgroupDir(subsystem string) (string, error) {
return getControllerPath(subsystem, cgroups)
}
func GetInitCgroupPath(subsystem string) (string, error) {

@hqhq

hqhq Mar 23, 2017

Contributor

Do we need this, given that it's not used? I doubt it will get any use.

@cyphar

cyphar Mar 23, 2017

Member

We don't need it now, but the systemd cgroup manager does use GetInitCgroup. I can drop it if you prefer; this is more for the benefit of users of libcontainer.

@hqhq

hqhq Mar 23, 2017

Contributor

That's probably erroneous; maybe it's just that nobody is using the systemd cgroup manager inside a container. I'm OK with keeping it, so we'll say no to any subsequent PR that tries to remove this unused function :)

@hqhq

hqhq Mar 23, 2017

Contributor

LGTM

Approved with PullApprove

@cyphar

cyphar Mar 23, 2017

Member

Hang on, lemme squash+rebase it first. 😉

cyphar added some commits Jan 17, 2017

*: handle unprivileged operations and !dumpable
Effectively, !dumpable makes implementing rootless containers quite
hard, due to a bunch of different operations on /proc/self no longer
being possible without reordering everything.

!dumpable only really makes sense when you are switching between
different security contexts, which is only the case when we are joining
namespaces. Unfortunately this means that !dumpable will still have
issues in this instance, and it should only be necessary to set
!dumpable if we are not joining USER namespaces (new kernels have
protections that make !dumpable no longer necessary). But that's a topic
for another time.

This also includes code to unset and then re-set dumpable when doing the
USER namespace mappings. This should also be safe because in principle
processes in a container can't see us until after we fork into the PID
namespace (which happens after the user mapping).

In rootless containers, it is not possible to set a non-dumpable
process's /proc/self/oom_score_adj (it's owned by root and thus not
writeable). Thus, it needs to be set inside nsexec before we set
ourselves as non-dumpable.

Signed-off-by: Aleksa Sarai <asarai@suse.de>

runc: add support for rootless containers
This enables the support for the rootless container mode. There are many
restrictions on what rootless containers can do, so many different runC
commands have been disabled:

* runc checkpoint
* runc events
* runc pause
* runc ps
* runc restore
* runc resume
* runc update

The following commands work:

* runc create
* runc delete
* runc exec
* runc kill
* runc list
* runc run
* runc spec
* runc state

In addition, any specification options that imply joining cgroups have
also been disabled. This is due to support for unprivileged subtree
management not being available from Linux upstream.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
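As a rough illustration of the command split described in the commit message above, a guard of the following shape could reject unsupported commands up front. This is a hedged sketch, not the code from the commit: `disabledInRootless` and `checkCommand` are hypothetical names, and only the command list itself is taken from the message.

```go
package main

import "fmt"

// disabledInRootless lists the commands the commit message names as
// unavailable for rootless containers.
var disabledInRootless = map[string]bool{
	"checkpoint": true, "events": true, "pause": true, "ps": true,
	"restore": true, "resume": true, "update": true,
}

// checkCommand returns an error if cmd may not be used in rootless mode.
func checkCommand(cmd string, rootless bool) error {
	if rootless && disabledInRootless[cmd] {
		return fmt.Errorf("runc %s is not supported for rootless containers", cmd)
	}
	return nil
}

func main() {
	for _, cmd := range []string{"run", "pause"} {
		fmt.Printf("%s: %v\n", cmd, checkCommand(cmd, true))
	}
}
```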
rootless: add rootless cgroup manager
The rootless cgroup manager acts as a noop for all set and apply
operations. It is just used for rootless setups. Currently this is far
too simple (we need to add opportunistic cgroup management), but is good
enough as a first-pass at a noop cgroup manager.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
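The idea of a no-op manager can be sketched as follows. The `Manager` interface here is a deliberately cut-down, hypothetical one invented for this example; libcontainer's real cgroup manager interface has more methods and different signatures.

```go
package main

import "fmt"

// Manager is a hypothetical, cut-down cgroup manager interface used only
// for this sketch.
type Manager interface {
	Apply(pid int) error                   // place pid into the cgroups
	Set(resources map[string]string) error // write resource limits
}

// noopManager accepts every request and does nothing, which is the
// behaviour the commit message describes for rootless setups.
type noopManager struct{}

func (noopManager) Apply(pid int) error                   { return nil }
func (noopManager) Set(resources map[string]string) error { return nil }

func main() {
	var m Manager = noopManager{}
	fmt.Println(m.Apply(1234), m.Set(map[string]string{"cpu.shares": "512"}))
}
```

Because callers only see the interface, an opportunistic implementation could later replace the no-op one without touching the call sites.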
libcontainer: configs: add proper HostUID and HostGID
Previously Host{U,G}ID only gave you the root mapping, which isn't very
useful if you are trying to do other things with the IDMaps.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
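To show why a root-only lookup is limiting, here is a sketch of translating an arbitrary container ID to its host ID through a list of mappings. The `IDMap` type and `hostID` function are hypothetical and written for this example only; they are not the signatures introduced by the commit.

```go
package main

import "fmt"

// IDMap is a local, hypothetical type describing one user-namespace
// mapping range: Size consecutive IDs starting at ContainerID map to
// consecutive host IDs starting at HostID.
type IDMap struct {
	ContainerID, HostID, Size int
}

// hostID translates containerID to the corresponding host ID. The second
// result is false if no mapping covers containerID.
func hostID(containerID int, maps []IDMap) (int, bool) {
	for _, m := range maps {
		if containerID >= m.ContainerID && containerID < m.ContainerID+m.Size {
			return m.HostID + (containerID - m.ContainerID), true
		}
	}
	return 0, false
}

func main() {
	maps := []IDMap{
		{ContainerID: 0, HostID: 1000, Size: 1},
		{ContainerID: 1, HostID: 100000, Size: 65536},
	}
	for _, id := range []int{0, 5, 70000} {
		h, ok := hostID(id, maps)
		fmt.Printf("container %d -> host %d (mapped: %v)\n", id, h, ok)
	}
}
```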
rootless: add autogenerated rootless config from `runc spec`
Since this is a runC-specific feature, it belongs here rather than in
opencontainers/ocitools (which is for generic OCI runtimes).

In addition, we don't create a new network namespace. This is because
currently if you want to set up a veth bridge you need CAP_NET_ADMIN in
both network namespaces' pinned user namespace to create the necessary
interfaces in each network namespace.

Signed-off-by: Aleksa Sarai <asarai@suse.de>

integration: added root requires
This is in preparation for allowing us to run the integration test suite
on rootless containers.

Signed-off-by: Aleksa Sarai <asarai@suse.de>

libcontainer: init: fix unmapped console fchown
If the stdio of the container is owned by a group which is not mapped in
the user namespace, attempting to fchown the file descriptor will result
in EINVAL. Counteract this by simply not doing an fchown if the group
owner of the file descriptor has no host mapping according to the
configured GIDMappings.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
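The check described in the commit message above can be sketched as a lookup of the file's group owner in the host side of the GID mappings. This is an illustrative sketch only: the `IDMap` type and `hostIDMapped` helper are hypothetical names chosen for the example, and the real fchown call is replaced by a printed message.

```go
package main

import "fmt"

// IDMap is a local, hypothetical type describing one user-namespace
// mapping range (Size host IDs starting at HostID).
type IDMap struct {
	ContainerID, HostID, Size int
}

// hostIDMapped reports whether hostID lies inside the host range of any
// mapping entry, i.e. whether it has a counterpart inside the namespace.
func hostIDMapped(hostID int, maps []IDMap) bool {
	for _, m := range maps {
		if hostID >= m.HostID && hostID < m.HostID+m.Size {
			return true
		}
	}
	return false
}

func main() {
	// A rootless setup that maps only the user's own group.
	gidMappings := []IDMap{{ContainerID: 0, HostID: 1000, Size: 1}}
	for _, gid := range []int{1000, 5} {
		if hostIDMapped(gid, gidMappings) {
			fmt.Printf("gid %d is mapped: fchown can proceed\n", gid)
		} else {
			fmt.Printf("gid %d is unmapped: skip fchown to avoid EINVAL\n", gid)
		}
	}
}
```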
tests: add rootless integration tests
This adds targets for rootless integration tests, as well as all of the
required setup in order to get the tests to run. This includes quite a
few changes, because of a lot of assumptions about things running as
root within the bats scripts (which is not true when setting up rootless
containers).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
@cyphar

cyphar Mar 23, 2017

Member

@hqhq Squashed and rebased.

Member

cyphar commented Mar 23, 2017

@hqhq Squashed and rebased.

Contributor

hqhq commented Mar 23, 2017

LGTM

Approved with PullApprove

Contributor

mrunalp commented Mar 23, 2017

Should we drop groups that are unmapped?

[mrunal@acme busybox]$ ./runc --root ~/runc/state run 1234
/ # id
uid=0(root) gid=0(root) groups=65534,0(root)

Member

cyphar commented Mar 24, 2017

@mrunalp We don't have privileges to do that. In fact, it's a security feature of the kernel to not allow unprivileged users to drop supplementary groups because of paths with modes such as 0707. Such ACLs make it easy to blacklist a group from accessing something.
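
To illustrate the 0707 point, here is a simplified Go model of the classic Unix permission check, in which only the first matching class (owner, then group, then other) is consulted. The `canRead` helper and the IDs used are assumptions made up for this example. It is not kernel code, and it ignores capabilities and POSIX ACLs.

```go
package main

import "fmt"

// inGroups reports whether gid is one of the caller's groups.
func inGroups(gid int, groups []int) bool {
	for _, g := range groups {
		if g == gid {
			return true
		}
	}
	return false
}

// canRead models the classic check order: owner class first, then group
// class, then other. Only the first matching class decides the result.
func canRead(mode uint32, fileUID, fileGID, uid int, groups []int) bool {
	switch {
	case uid == fileUID:
		return mode&0400 != 0
	case inGroups(fileGID, groups):
		return mode&0040 != 0
	default:
		return mode&0004 != 0
	}
}

func main() {
	const mode = 0707 // owner rwx, group ---, other rwx
	// The file is owned by uid 0 and gid 50; the caller is uid 1000.
	fmt.Println("member of gid 50:", canRead(mode, 0, 50, 1000, []int{1000, 50}))
	fmt.Println("after dropping gid 50:", canRead(mode, 0, 50, 1000, []int{1000}))
}
```

In this model a member of the denied group is refused access, while the same user without that group falls through to the permissive "other" bits. That is why dropping a supplementary group can amount to gaining access.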

Member

crosbymichael commented Mar 27, 2017

ping @mrunalp

Contributor

mrunalp commented Mar 27, 2017

LGTM

Approved with PullApprove

@mrunalp mrunalp merged commit 653207b into opencontainers:master Mar 27, 2017

3 checks passed

code-review/pullapprove Approved by hqhq, mrunalp
continuous-integration/travis-ci/pr The Travis CI build passed
janky Jenkins build runc-PRs 2956 has succeeded
Member

cyphar commented Mar 27, 2017

🎉

davidlt commented Mar 27, 2017

Looks like it's party time! 11 months in development. Someone should post this on Hacker News.

Contributor

marcosnils commented Mar 27, 2017

image

:D

muayyad-alsadi commented Mar 27, 2017

Any link to updated docs. Blog post?

Member

cyphar commented Mar 27, 2017

@muayyad-alsadi No doc updates, I'll follow up with those. Here's a blog post from last year and my talk at Linux.conf.au from earlier this year.
