Rootless Containers #774

Merged
merged 8 commits into from Mar 27, 2017

Conversation

@cyphar
Member

cyphar commented Apr 23, 2016

This enables support for "rootless container mode". There are
certain restrictions on what non-root users can do, resulting in several
runC features not being available. Originally there were no checks in
place to make this clear to users; I've since implemented the config
validation.

  • All cgroup operations require having write access to your current
    cgroup directory. By default, the directories are owned by root and have
    the mode 0755. This means that we cannot set up any cgroups, or join
    cgroups. The new cgroup namespace doesn't fix this for us either, but
    hopefully we can get a patch upstream to fix this. But we should still
    improve cgroup handling so that we apply any cgroups we can if we have
    write access to the directory.
  • setgroups(2) cannot be used in a non-privileged user namespace setup.
    We also have to set /proc/self/setgroups to "deny".
  • We cannot map any user other than ourselves in a rootless container,
    which means that any user-related directives won't work. You can only be
    "root".

If you want to use this, you have to make sure you remove the gid=5 entry from the /dev/pts mount, and only map your own user in the namespace.
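
As a rough illustration of the setgroups and single-user mapping restrictions listed above, here is a minimal Go sketch of the writes an unprivileged process would make under /proc/<pid>/ when setting up the user namespace. The procWrite type and rootlessIDMapPlan function are hypothetical names for this example and are not taken from runC; the sketch only builds the plan and does not touch /proc.

```go
package main

import "fmt"

// procWrite is one write to a file under /proc/<pid>/ (hypothetical helper type).
type procWrite struct {
	Path    string
	Content string
}

// rootlessIDMapPlan returns, in order, the writes that map a single host
// user and group to ID 0 inside a new user namespace. "deny" is written to
// setgroups first, because an unprivileged process may only write gid_map
// after setgroups(2) has been disabled for the namespace.
func rootlessIDMapPlan(pid, hostUID, hostGID int) []procWrite {
	base := fmt.Sprintf("/proc/%d", pid)
	return []procWrite{
		{base + "/setgroups", "deny"},
		// Map format: <ID inside namespace> <ID outside namespace> <length>.
		{base + "/uid_map", fmt.Sprintf("0 %d 1\n", hostUID)},
		{base + "/gid_map", fmt.Sprintf("0 %d 1\n", hostGID)},
	}
}

func main() {
	for _, w := range rootlessIDMapPlan(4242, 1000, 1000) {
		fmt.Printf("%s <- %q\n", w.Path, w.Content)
	}
}
```

The PID and IDs in main are made-up values; a real setup would use the child's PID and the caller's own effective UID and GID.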

Here's runc start working in both root and rootless setup:
[image: it works]

And here's runc exec working in both root and rootless setup:
[image: it also works]

TODO

  • Provide a meaningful error message if the user specifies a configuration which maps more than one user (or doesn't map the effective user).
  • Provide a meaningful error message (don't just ignore it) if the user tries to use the user directive to run as a different user.
  • Provide a meaningful error message (don't just ignore it) if the user tries to specify any cgroup settings. Currently we can't provide cgroup settings (though it should be possible if the user just so happens to own the cgroup they currently reside in).
  • runc exec doesn't work, and we should be able to implement it. This actually complicates the code in nsenter.c which checks whether the container is unprivileged. setgroups(2) needs special treatment.
  • runc exec doesn't work with root running the exec and a rootless container (ironically). This is because we autodetect the rootless parameter on run, which isn't accurate. This can be fixed by storing the rootless flag in the state.json.
  • rootless should be passed to the init through netlink. Currently we are doing the rootless check in two places and it doesn't make sense to do the check in nsenter.c -- we might actually have to do it with capability checks in the future.
  • runc events doesn't work because the rootless.Manager doesn't appear to manage the paths properly, so it can't get any data from the cgroups.
  • Add runc spec --rootless.
  • loadContainer doesn't properly load the cgroup manager for the container, because the API forces that to happen (libcontainer.New() takes the cgroup manager as an argument). We can probably fix this by making it load the cgroup manager from the container state (but it might be ugly). Without this, we can't even hope to have runc pause and runc resume working.
  • Currently the cgroup setup is binary (either we use cgroups or we don't). But since cgroupv1 lets you have different permissions on different hierarchies, we should check for each subsystem that we have access to create a subtree in that cgroup. Then we can support several setups (as well as the kernel patch I've proposed). This would require many changes in config validation. In addition, we would have to store what cgroups we are using in a bitmask (like in the kernel).
    • This would require adding a bunch of tests to libcontainer/cgroup/rootless. Luckily we already have all of the mock stuff we need in cgroupfs.
  • Add unit tests to:
    • libcontainer/config/validate/rootless
    • libcontainer/specconv/spec_linux.go with Rootless == true.
    • libcontainer/cgroup/rootless (not necessary at the moment)
  • Detaching doesn't work due to a bug in runC with --console and user namespaces. #814 and #883
  • Add a setup for testing the rootless containers so we can make sure there are no regressions in this setup. We can try running the whole test suite (minus cgroups and a few other things). (Earlier plan: all of the sniff tests should work. The sniff tests are no more.) This was blocked on fixing the --console bug, which has since been fixed as part of #1018.
    • We have to refactor all of the tests to use a wrapper around runc (so we can mess around with arguments in rootless mode).
    • Also, we need to skip certain tests in rootless mode, since they are not supported.
  • We currently cannot use network namespaces in a rootless container setup (and still connect to the network). A fix for this is to use the host network, but that currently doesn't work in runC (#799). It should also be noted that things like ping don't work in rootless containers (user namespace issues). #807 fixes the validation issue.
    • Figure out why we still can't talk to the internet.
  • Switch container.Processes() to join the container PID namespace, enumerate the list of PIDs and then send them over a UNIX socket (this causes the PIDs to be translated). The end result is to not require cgroups to enumerate PIDs (which actually isn't a good idea since processes may join sub-cgroups). This is necessary for runc ps and similar things to work properly.
    • There are some unanswered questions about sending the PIDs though, because you need CAP_SYS_ADMIN in order to send a different PID. And there's also a valid question about atomicity (enumeration is not atomic, reading from cgroup.procs is).
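
The per-subsystem cgroup idea from the TODO list (check each cgroupv1 hierarchy separately and remember the usable ones in a bitmask) could be sketched roughly as follows. Everything here is an assumption made for illustration: the flag names, the subsystem list, and the injected canWrite predicate, which stands in for a real write-access check on the directory.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// Hypothetical bit flags, one per cgroupv1 subsystem of interest.
const (
	cgMemory uint32 = 1 << iota
	cgCPU
	cgPids
	cgFreezer
)

type subsystem struct {
	Name string
	Bit  uint32
}

var subsystems = []subsystem{
	{"memory", cgMemory},
	{"cpu", cgCPU},
	{"pids", cgPids},
	{"freezer", cgFreezer},
}

// usableCgroups returns a bitmask of the subsystems whose current cgroup
// directory passes canWrite. root is the cgroupv1 mount point and current
// maps a subsystem name to the process's cgroup path in that hierarchy.
func usableCgroups(root string, current map[string]string, canWrite func(dir string) bool) uint32 {
	var mask uint32
	for _, s := range subsystems {
		rel, ok := current[s.Name]
		if !ok {
			continue // the process is not in this hierarchy
		}
		if canWrite(filepath.Join(root, s.Name, rel)) {
			mask |= s.Bit
		}
	}
	return mask
}

func main() {
	current := map[string]string{"memory": "/user.slice", "pids": "/user.slice", "freezer": "/"}
	// Stand-in predicate: pretend only the pids directory is writable by us.
	canWrite := func(dir string) bool { return dir == "/sys/fs/cgroup/pids/user.slice" }
	mask := usableCgroups("/sys/fs/cgroup", current, canWrite)
	fmt.Printf("pids usable: %v, memory usable: %v\n", mask&cgPids != 0, mask&cgMemory != 0)
}
```

Config validation could then reject only the settings that belong to subsystems missing from the mask, instead of rejecting all cgroup settings.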

Open Questions

What works?

  • As unprivileged user:
    • runc checkpoint (while potentially possible, not implemented)
    • runc create (#814)
    • runc create --console (#814)
    • runc delete
    • runc events (not really useful -- cgroups)
    • runc exec
    • runc exec --console (#814)
    • runc kill
    • runc list
      • with containers not readable by us.
    • runc pause (cgroups)
    • runc ps (cgroups)
    • runc restore (while potentially possible, not implemented)
    • runc resume (cgroups)
    • runc run
    • runc run -d --console (#814)
    • runc spec
    • runc start (create doesn't work -- #814)
    • runc state
    • runc update (cgroups)
  • As root:
    • runc checkpoint (while potentially possible, not implemented)
    • runc delete
    • runc events (not really useful -- cgroups)
    • runc exec
    • runc exec --console (#814)
    • runc kill
    • runc list
    • runc pause (cgroups)
    • runc ps (cgroups)
    • runc restore (while potentially possible, not implemented)
    • runc resume (cgroups)
    • runc run
    • runc run -d --console (#814)
    • runc spec
    • runc start (create doesn't work -- #814)
    • runc state
    • runc update (cgroups)

Kernel Patches

Implements #38.

Signed-off-by: Aleksa Sarai <asarai@suse.de>

@cyphar cyphar force-pushed the cyphar:rootless-containers branch 4 times, most recently from 97d38f4 to cef7834 Apr 23, 2016
@mrunalp
Contributor

mrunalp commented Apr 23, 2016

For cgroups, we can skip doing any setup if cgroupsPath == "" and Resources == nil in the config.
We can either skip or introduce a NoOp cgroups manager.
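
A minimal sketch of that suggestion, assuming trimmed-down stand-in types rather than the real libcontainer ones (Cgroup, Resources, Manager, noopManager and fsManager are all hypothetical names here):

```go
package main

import "fmt"

// Resources and Cgroup are hypothetical, trimmed-down config types.
type Resources struct {
	MemoryLimit int64
}

type Cgroup struct {
	Path      string
	Resources *Resources
}

// Manager is a hypothetical minimal cgroup manager interface.
type Manager interface {
	Apply(pid int) error
}

// noopManager performs no cgroup setup at all.
type noopManager struct{}

func (noopManager) Apply(pid int) error { return nil }

// fsManager stands in for a manager that would write to the cgroup filesystem.
type fsManager struct{ cfg *Cgroup }

func (m fsManager) Apply(pid int) error {
	return fmt.Errorf("cgroup setup for pid %d at %q is not implemented in this sketch", pid, m.cfg.Path)
}

// needsCgroupSetup reports whether the config asks for any cgroup work.
func needsCgroupSetup(c *Cgroup) bool {
	return c != nil && (c.Path != "" || c.Resources != nil)
}

// newManager picks the no-op manager when there is nothing to set up.
func newManager(c *Cgroup) Manager {
	if !needsCgroupSetup(c) {
		return noopManager{}
	}
	return fsManager{cfg: c}
}

func main() {
	empty := &Cgroup{}
	limited := &Cgroup{Resources: &Resources{MemoryLimit: 64 << 20}}
	fmt.Printf("%T\n", newManager(empty))
	fmt.Printf("%T\n", newManager(limited))
}
```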

@cyphar
Member Author

cyphar commented Apr 24, 2016

@mrunalp I'm going to go with a rootless cgroup manager so that we can expand it later (there are some upcoming kernel features that might make cgroups in rootless containers usable).

@mrunalp
Contributor

mrunalp commented Apr 24, 2016

Sounds good.


@cyphar cyphar changed the title [WIP] Rootless Containers Rootless Containers Apr 24, 2016
@cyphar cyphar force-pushed the cyphar:rootless-containers branch 2 times, most recently from 74a802d to 43e810a Apr 25, 2016
return nil
}

// Used for comparison.

@crosbymichael

crosbymichael Apr 25, 2016

Member

This is some pretty complex validation. So some cgroups are ok but others are not?

@mrunalp

mrunalp Apr 25, 2016

Contributor

We could just fail the check if any values are set at all. No need to go check for defaults. User can create the config without setting resources or path which is simple to check.

@cyphar

cyphar Apr 25, 2016

Author Member

The /actual/ check is that all cgroups have no non-default settings. Unfortunately, specconv adds device cgroup settings that get merged with the config the user specified -- so a simple "is this equal to the zero value" check doesn't cut it. I haven't figured out a nice way of dealing with that (specconv runs long before we get to this part, and we need it to run before we can do anything with the config (like figuring out if we're rootless)).

Maybe we can do our rootless check before specconv, then specconv doesn't modify the cgroup settings if we're rootless, and then we do the config checks for rootless (do we have mapping rights).

@crosbymichael

crosbymichael Apr 25, 2016

Member

Ya, that sounds better. We should be looking for this in runc not in libcontainer.

@cyphar

cyphar Apr 25, 2016

Author Member

I'm not really convinced that we shouldn't be doing any checking in libcontainer. The same question can be asked about libcontainer/configs/validate -- why do we do any config verification inside libcontainer? There's also a question of whether or not libcontainer should autodetect rootless mode or whether it should be passed as an option (you can't use rootless containers with root as far as I can tell -- and it's definitely a bad idea).

@crosbymichael

crosbymichael Apr 25, 2016

Member

runc should populate the correct config that libcontainer gets and it should not be modified inside libcontainer. All the changes that need to be made should happen while we generate the config not after it is made.

@cyphar

cyphar Apr 26, 2016

Author Member

The current state of this patchset doesn't modify the config inside libcontainer. The issue is that specconv adds device options to []Device even if the user doesn't specify anything. So we can either do the verification of the cgroups in runC (which means that if someone uses libcontainer directly they probably won't immediately realise they can't set cgroup settings) or we do the verification in libcontainer and make specconv not generate any cgroup options in rootless mode.

I'll also move the isRootless checks to RootlessValidator.

EDIT: I've fixed this.

@cyphar cyphar force-pushed the cyphar:rootless-containers branch 5 times, most recently from 2f04027 to 7fbd302 Apr 26, 2016
@jessfraz
Contributor

jessfraz commented Apr 26, 2016

i think the rootless cgroups manager is good, then it can be expanded when the cgroups ns is added :) just my opinion but ianam, thanks for this

@jessfraz
Contributor

jessfraz commented Apr 26, 2016

also wrt the features that might not be possible or are hard for the time being, they could be disabled, and then slowly turned back on as implementations evolve, kinda like how we did userns in docker, and then slowly added more features back in wrt sharing namespaces
it's easier to make a smaller change and then iterate on it than to make one huge one

@cyphar
Member Author

cyphar commented Apr 26, 2016

@jfrazelle AFAICS all of the core features work. But some of them either just require root (criu IIRC) or currently can't be done under user namespaces (cgroups -- which I'm working on a patch for). But all of the others should still work, and in principle I want to try to get all of the features working for root operating on a rootless container.

@jessfraz
Contributor

jessfraz commented Apr 26, 2016

Nice :)


@cyphar
Member Author

cyphar commented Apr 26, 2016

Note: as far as I can see, the only thing left to do before we can clear most of the checkpoints is for me to fix up the rootless cgroup manager so that it stores the process's actual cgroup path. Then runc events should work, as well as all of the freezer code (for root). Checkpoint and restore might also work just by doing that.

@cyphar cyphar closed this Apr 26, 2016
@cyphar cyphar reopened this Apr 26, 2016
@cyphar
Member Author

cyphar commented Apr 26, 2016

Whoops wrong button. ;)

@cyphar
Member Author

cyphar commented Apr 29, 2016

@avagin Do you know if it's possible to run criu as an unprivileged user (specifically from the kernel side when interacting with unprivileged user namespaces)? If not, is there any intention from upstream to get that to work? And if not, should we disable all checkpoint/restore functionality for rootless containers (even if the person doing the checkpoint/restore is root -- since the restore setup might break in weird ways)?

@cyphar cyphar force-pushed the cyphar:rootless-containers branch from 999e505 to fc23a67 May 1, 2016
@davidlt

davidlt commented May 2, 2016

Looks like this is moving forward, very interesting, so I started testing it. Built on Fedora 24 (updated on May 2nd). Looks like bind mounts work fine without root permissions. I had no luck with internet connectivity without sudo. Later on I need to check if there is a way not to be the root user within the container (our software will not be using the root account). Nice progress!

[davidlt@pccms205 magic_dir]$ pwd
/home/davidlt/magic_dir
[davidlt@pccms205 magic_dir]$ ls
[davidlt@pccms205 magic_dir]$ touch HOST_FILE
[davidlt@pccms205 magic_dir]$ ls
HOST_FILE
[davidlt@pccms205 magic_dir]$ cat /etc/os-release | grep PRETTY_NAME >> HOST_FILE
[davidlt@pccms205 magic_dir]$ cat HOST_FILE
PRETTY_NAME="Fedora 24 (Workstation Edition)"
[davidlt@pccms205 test]$ runc --root $PWD start test_container
sh-4.2# cat /etc/os-release | grep PRETTY_NAME
PRETTY_NAME="CentOS Linux 7 (Core)"
sh-4.2# cd /some
sh-4.2# cat /etc/os-release | grep PRETTY_NAME >> HOST_FILE
sh-4.2# touch NEW_FILE
sh-4.2# exit
[davidlt@pccms205 test]$ cat ~/magic_dir/HOST_FILE 
PRETTY_NAME="Fedora 24 (Workstation Edition)"
PRETTY_NAME="CentOS Linux 7 (Core)"
[davidlt@pccms205 test]$ file ~/magic_dir/NEW_FILE 
/home/davidlt/magic_dir/NEW_FILE: empty

For anyone else who wants to test this I am also sharing diff between original and modified config.json:

--- config.json 2016-05-02 13:25:24.468181348 +0200
+++ config.json.correct 2016-05-02 13:25:14.661496040 +0200
@@ -6,7 +6,7 @@
    },
    "process": {
        "terminal": true,
-       "user": {},
+       "user": { "uid": 0, "gid": 0, "additionalGids": null },
        "args": [
            "sh"
        ],
@@ -35,6 +35,12 @@
    },
    "hostname": "runc",
    "mounts": [
+                {
+                        "destination": "/some",
+                        "type": "bind",
+                        "source": "/home/davidlt/magic_dir",
+                        "options": [ "rbind" ]
+                },
        {
            "destination": "/proc",
            "type": "proc",
@@ -60,8 +66,7 @@
                "noexec",
                "newinstance",
                "ptmxmode=0666",
-               "mode=0620",
-               "gid=5"
+               "mode=0620"
            ]
        },
        {
@@ -112,14 +117,20 @@
    ],
    "hooks": {},
    "linux": {
-       "resources": {
-           "devices": [
-               {
-                   "allow": false,
-                   "access": "rwm"
-               }
-           ]
-       },
+                "uidMappings": [
+                        {
+                                "hostID": 1000,
+                                "containerID": 0,
+                                "size": 1
+                        }
+                ],
+                "gidMappings": [
+                        {
+                                "hostID": 1000,
+                                "containerID": 0,
+                                "size": 1
+                        }
+                ],
        "namespaces": [
            {
                "type": "pid"
@@ -135,7 +146,10 @@
            },
            {
                "type": "mount"
-           }
+           },
+                        {
+                                "type": "user"
+                        }
        ],
        "maskedPaths": [
            "/proc/kcore",
@@ -152,4 +166,4 @@
            "/proc/sysrq-trigger"
        ]
    }
-}
\ No newline at end of file
+}
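
For anyone who prefers to script the edits rather than apply the diff by hand, two of the changes above (dropping gid=5 from the /dev/pts options and adding a single uid/gid mapping) might look like this in Go. The Mount and IDMapping types are hypothetical stand-ins for this sketch, not the runtime-spec Go types, and the "devpts" mount type is assumed since the diff does not show it. The other changes in the diff (removing linux.resources and adding the user namespace) are not covered.

```go
package main

import (
	"fmt"
	"strings"
)

// Mount and IDMapping are hypothetical stand-ins for the config.json entries.
type Mount struct {
	Destination string
	Type        string
	Options     []string
}

type IDMapping struct {
	HostID      uint32
	ContainerID uint32
	Size        uint32
}

// dropGIDOption returns opts without any "gid=..." entries.
func dropGIDOption(opts []string) []string {
	out := make([]string, 0, len(opts))
	for _, o := range opts {
		if strings.HasPrefix(o, "gid=") {
			continue
		}
		out = append(out, o)
	}
	return out
}

// selfMapping maps exactly one host ID to ID 0 inside the container.
func selfMapping(hostID uint32) []IDMapping {
	return []IDMapping{{HostID: hostID, ContainerID: 0, Size: 1}}
}

func main() {
	devpts := Mount{
		Destination: "/dev/pts",
		Type:        "devpts", // assumed; not visible in the diff
		Options:     []string{"noexec", "newinstance", "ptmxmode=0666", "mode=0620", "gid=5"},
	}
	devpts.Options = dropGIDOption(devpts.Options)
	fmt.Println(devpts.Options)
	fmt.Printf("%+v\n", selfMapping(1000))
}
```

The host ID 1000 matches the diff; substitute your own UID and GID.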

@cyphar
Member Author

cyphar commented May 2, 2016

@davidlt

Later on I need to check if there is a way not to be the root user within the container (our software will not be using the root account). Nice progress!

Unfortunately this is not possible, due to restrictions within the kernel. Essentially this is a logical result of these two restrictions on user namespaces:

  1. All user namespaces must provide a mapping for root inside a container.
  2. Unprivileged user namespaces can only provide a mapping for one user, the user which created the namespace.

As a result, the only user that is mapped inside the container is your user (as root). You can discuss with the kernel upstream about restriction number 1, because it's the only restriction which it might be possible to fix. The second restriction is just a security issue. But at the moment, there isn't a way to do what you want.
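
The two restrictions as stated above, together with the TODO item about rejecting configurations that map more than one user or do not map the effective user, could be expressed as a validation step along these lines. This is a sketch under those assumptions only; IDMapping and validateRootlessMapping are hypothetical names, not the validator that was merged.

```go
package main

import "fmt"

// IDMapping is a hypothetical stand-in for one uid or gid mapping entry.
type IDMapping struct {
	ContainerID int
	HostID      int
	Size        int
}

// validateRootlessMapping checks that mappings consists of exactly one entry
// that maps effectiveID on the host to ID 0 in the container.
// kind is only used in error messages ("uid" or "gid").
func validateRootlessMapping(kind string, mappings []IDMapping, effectiveID int) error {
	if len(mappings) != 1 {
		return fmt.Errorf("rootless: expected exactly one %s mapping, got %d", kind, len(mappings))
	}
	m := mappings[0]
	if m.Size != 1 {
		return fmt.Errorf("rootless: %s mapping must have size 1, got %d", kind, m.Size)
	}
	if m.ContainerID != 0 {
		return fmt.Errorf("rootless: %s mapping must map to container ID 0, got %d", kind, m.ContainerID)
	}
	if m.HostID != effectiveID {
		return fmt.Errorf("rootless: %s mapping must map host ID %d, got %d", kind, effectiveID, m.HostID)
	}
	return nil
}

func main() {
	good := []IDMapping{{ContainerID: 0, HostID: 1000, Size: 1}}
	bad := []IDMapping{{ContainerID: 0, HostID: 1000, Size: 65536}}
	fmt.Println(validateRootlessMapping("uid", good, 1000))
	fmt.Println(validateRootlessMapping("uid", bad, 1000))
}
```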

cyphar added 5 commits Mar 17, 2017
Previously Host{U,G}ID only gave you the root mapping, which isn't very
useful if you are trying to do other things with the IDMaps.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
Since this is a runC-specific feature, this belongs here rather than in
opencontainers/ocitools (which is for generic OCI runtimes).

In addition, we don't create a new network namespace. This is because
currently if you want to set up a veth bridge you need CAP_NET_ADMIN in
both network namespaces' pinned user namespace to create the necessary
interfaces in each network namespace.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
This is in preparation of allowing us to run the integration test suite
on rootless containers.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
If the stdio of the container is owned by a group which is not mapped in
the user namespace, attempting to fchown the file descriptor will result
in EINVAL. Counteract this by simply not doing an fchown if the group
owner of the file descriptor has no host mapping according to the
configured GIDMappings.

Signed-off-by: Aleksa Sarai <asarai@suse.de>
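
The fchown workaround described in the commit message above boils down to a range lookup over the configured GID mappings. A minimal sketch, assuming a hypothetical IDMap type and helper name rather than the actual libcontainer code:

```go
package main

import "fmt"

// IDMap is a hypothetical stand-in for one entry of the GID mappings.
type IDMap struct {
	ContainerID int
	HostID      int
	Size        int
}

// hostIDIsMapped reports whether hostID falls inside the host range of any
// mapping. If it does not, an fchown to that group inside the user namespace
// would fail, so the caller can skip the fchown instead.
func hostIDIsMapped(maps []IDMap, hostID int) bool {
	for _, m := range maps {
		if hostID >= m.HostID && hostID < m.HostID+m.Size {
			return true
		}
	}
	return false
}

func main() {
	gidMaps := []IDMap{{ContainerID: 0, HostID: 1000, Size: 1}}
	// Example values: stdio owned by host group 5 versus host group 1000.
	for _, gid := range []int{5, 1000} {
		if hostIDIsMapped(gidMaps, gid) {
			fmt.Printf("gid %d is mapped: fchown is attempted\n", gid)
		} else {
			fmt.Printf("gid %d is unmapped: fchown is skipped\n", gid)
		}
	}
}
```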
This adds targets for rootless integration tests, as well as all of the
required setup in order to get the tests to run. This includes quite a
few changes, because of a lot of assumptions about things running as
root within the bats scripts (which is not true when setting up rootless
containers).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
@cyphar cyphar force-pushed the cyphar:rootless-containers branch from 51e88c4 to ba38383 Mar 23, 2017
@cyphar
Member Author

cyphar commented Mar 23, 2017

@hqhq Squashed and rebased.

@hqhq
Contributor

hqhq commented Mar 23, 2017

LGTM

Approved with PullApprove

@mrunalp
Contributor

mrunalp commented Mar 23, 2017

Should we drop groups that are unmapped?

[mrunal@acme busybox]$ ./runc --root ~/runc/state run 1234
/ # id
uid=0(root) gid=0(root) groups=65534,0(root)
@cyphar
Member Author

cyphar commented Mar 24, 2017

@mrunalp We don't have privileges to do that. In fact, it's a security feature of the kernel to not allow unprivileged users to drop supplementary groups because of paths with modes such as 0707. Such ACLs make it easy to blacklist a group from accessing something.
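
To make the 0707 point concrete, here is a small Go model of the classic owner/group/other permission check. It ignores capabilities and POSIX ACLs, and the function name and the IDs used are made up for illustration. The check consults exactly one permission class, so a process that is in the file's group gets the (empty) group bits, while the same process without that group would fall through to the "other" bits.

```go
package main

import "fmt"

// accessBits returns the rwx bits (0-7) consulted for a process under the
// classic Unix permission check: owner class if the UID matches, else group
// class if any of the process's groups matches, else the "other" class.
func accessBits(mode uint32, fileUID, fileGID, procUID int, procGroups []int) uint32 {
	if procUID == fileUID {
		return (mode >> 6) & 7
	}
	for _, g := range procGroups {
		if g == fileGID {
			return (mode >> 3) & 7
		}
	}
	return mode & 7
}

func main() {
	const mode = 0707
	// File owned by uid 0 and group 2000; the process runs as uid 1000.
	fmt.Println(accessBits(mode, 0, 2000, 1000, []int{1000, 2000})) // prints 0: member of the blacklisted group
	fmt.Println(accessBits(mode, 0, 2000, 1000, []int{1000}))       // prints 7: group 2000 dropped
}
```

This is why letting an unprivileged user drop supplementary groups would let them bypass such a blacklist.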

@crosbymichael
Member

crosbymichael commented Mar 27, 2017

ping @mrunalp

@mrunalp
Contributor

mrunalp commented Mar 27, 2017

LGTM

Approved with PullApprove

@mrunalp mrunalp merged commit 653207b into opencontainers:master Mar 27, 2017
3 checks passed
code-review/pullapprove: Approved by hqhq, mrunalp
continuous-integration/travis-ci/pr: The Travis CI build passed
janky: Jenkins build runc-PRs 2956 has succeeded
@cyphar
Member Author

cyphar commented Mar 27, 2017

🎉

@davidlt

davidlt commented Mar 27, 2017

Looks like it's party time! 11 months in development. Someone should post this on Hacker News.

@marcosnils
Contributor

marcosnils commented Mar 27, 2017

image

:D

@muayyad-alsadi

muayyad-alsadi commented Mar 27, 2017

Any link to updated docs? Blog post?

@cyphar
Member Author

cyphar commented Mar 27, 2017

@muayyad-alsadi No doc updates, I'll follow up with those. Here's a blog post from last year and my talk at Linux.conf.au from earlier this year.
