Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support cgroups with limits as rootless #1540

Merged

Conversation

williammartin
Copy link

Following on from the rootless cgroups discussion in #1457, this PR provides an initial implementation for rootless cgroups that fulfils Garden's needs. This does not enable all cgroup related functiionality, but we think it's a good starting point for other currently disabled features to work.

In more detail, this PR removes the current rootless cgroup implementation, and using cgroups in rootless mode should now work broadly the same as rootful, providing that runc has permissions on the cgroup path.

Differences with Rootful

Typically, without permissions on the cgroup path, we would expect an error when applying (entering a pid into a controller), however we now support a kind of "opportunistic" cgroup usage when no limits have been set and no cgroup path has been provided. Since child processes are entered into the same cgroup as their parent, this is effectively the same (regarding cgroup enforcement) as not providing a cgroup path and not setting limits. Attempting to set (change a resource limit) will still result in a permission denied error either during creation or via runc update but hopefully with a slightly more informative error.

The devices cgroup doesn't have all functionality available for setting limits. This is because there is a requirement on CAP_SYS_ADMIN for the devices.allow and devices.deny files. We're not sure of a way to provide different device white/blacklisting per container, but a possible solution for some use cases is to set a static list in a parent cgroup that is inherited. This works because applying does not require CAP_SYS_ADMIN.

Disabled features

We haven't enabled OOM notification or Memory pressure notificiation but this is hopefully as simple as removing the rootless conditional:

if c.config.Rootless {

We haven't enabled CRIU features because we aren't familiar with what is required.

Notes

The BATS added seem to cover the right features, however, we aren't super happy with the changes to the Makefile. Thoughts on how to solve the requirement that a cgroup exists, chowned to rootless in a nicer way would be much appreciated.

Signed-off-by: Ed King eking@pivotal.io

Makefile Outdated
rootlessintegration: runcimage
docker run -e TESTFLAGS -t --privileged --rm -v $(CURDIR):/go/src/$(PROJECT) --cap-drop=ALL -u rootless $(RUNC_IMAGE) make localintegration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why have you removed --cap-drop=all? That was necessary to ensure that rootless didn't depend on extra capabilities.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems --cap-drop=ALL doesn't do anything useful when also using --privileged. With or without it, once we switch to the rootless user our caps are:

CapInh: 0000003fffffffff
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000

So, as the tests are right now in master, and in this PR, we don't have any effective caps. If we want to guard against picking up caps somehow from files, we could perhaps figure out a way to drop from the bounding/inheritable set as well but I'm not sure that really makes sense within the test context.

Makefile Outdated
@@ -96,10 +96,14 @@ integration: runcimage
localintegration: all
bats -t tests/integration${TESTFLAGS}

rootlessintegrationstub:
mkdir -p /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/rootless_cgroup
chown rootless:rootless -R /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/rootless_cgroup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is crucial that when we add new tests for opportunistic functionality in rootless that we make sure that the base functionality works. So if we want to add tests for this (and also newuidmap in the future) we need to add them as effectively re-runs of the rootless tests under different conditions (with some extra tests enabled).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite sure we understand this, could you elaborate please?

Copy link
Member

@cyphar cyphar Aug 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we have the require rootless_cgroup class of tests, and the rest of the tests that don't have that requirement, I'm worried about not explicitly testing the case where this setup code is not run (in other words, the case were runc has no permissions to touch any cgroups). I appreciate that code is still tested, but I just want to be cautious about testing "opportunistic" functionality like rootless cgroup handling or (in the future) newuidmap handling.

I think the best solution to this would be to have something like the set of additional rootless features: (cgroups newuidmap etc) and then test base + <the power set of the other features>. So in the above case we would test base, base+cgroups, base+newuidmap, base+cgroups+newuidmap. But that's just an idea.

return err
}
m.Paths[sys.Name()] = p
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC we moved this before Apply to handle some race conditions, maybe @hqhq remembers this better than me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if some subsystems Apply failed, we are not able to cleanup cgroup directories if we haven't recorded it in m.Paths, see #1196

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@williammartin What if you keep the m.Paths[sys.Name()] = p where it is, but do delete(m.Paths, sys.Name()) if there is a permission error that is ignore-able?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems reasonable, we can try it out

@@ -198,6 +206,10 @@ func (m *Manager) Set(container *configs.Config) error {
for _, sys := range subsystems {
path := paths[sys.Name()]
if err := sys.Set(path, container.Cgroups); err != nil {
if path == "" {
// cgroup never applied
return fmt.Errorf("cannot set limits on the %s cgroup, as the container has not joined it. The cgroup controller may not be mounted or you may not have previously had permissions to create the cgroup subdirectory", sys.Name())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error is way too long.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1
cannot set limits on the %s cgroups due to container not joining it or something like that seems enough

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, no problem.

@@ -83,8 +83,7 @@ func (p *setnsProcess) start() (err error) {
if err = p.execSetns(); err != nil {
return newSystemErrorWithCause(err, "executing setns process")
}
// We can't join cgroups if we're in a rootless container.
if !p.config.Rootless && len(p.cgroupPaths) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is still correct in some cases, but I guess erroring out is acceptable if someone explicitly asked for an impossible cgroup configuration (now that we could in principle nest things). I would like to see a test for this though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not clear under what circumstances the rootless check still makes sense? Can you give an example please?

p.cgroupPaths is loaded from the state.json which best we can tell has to be the same as the cgroup paths the init process is in, so barring people doing weird things, if you succeeded on create, you should succeed now?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually not sure what I meant either, sorry about that. This change is fine, but I would like to see a runc exec test with rootless cgroups in use, to make sure this works fine.

@cyphar
Copy link
Member

cyphar commented Aug 3, 2017

Just to be clear, I'm very impressed that there were only two changes necessary to make the cgroups/fs driver opportunistic and this looks like it's on the right track. My main gripes are just with how we're testing stuff.

I'll look over the limitations you've listed when I get a chance.

@cyphar cyphar added this to the 1.1.0 milestone Aug 11, 2017
@yastij
Copy link

yastij commented Aug 11, 2017

Thanks so much guys for the PR !


if err := sys.Apply(d); err != nil {
if os.IsPermission(err) && m.Cgroups.Path == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing I just thought of is that we should only be permissive if we're running in rootless mode (though I imagine that such cases are unlikely, and passing config.Rootless down here might be a bit ugly).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We hadn't done this because we couldn't think of a way in which root could get a permission error. Maybe from a Mandatory Access Control?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think you're right. We can fix it later if we hit it some other time.

@teddyking
Copy link
Contributor

Thanks very much for the review. Hope we responded to all the comments.

--
@williammartin and Me.

@teddyking
Copy link
Contributor

teddyking commented Sep 1, 2017

We've rebased and pushed a few changes, specifically:

  • Separate out rootless and rootless+cgroups integration tests
  • Added a rootless+cgroups exec test
  • Shortened the very long error message
  • Moved the m.Paths[sys.Name()] = p back to where it was originally

Hopefully that addresses most of the comments. We can also squash the additional commits if that'd be preferred.

@cyphar
Copy link
Member

cyphar commented Sep 2, 2017

First-pass this looks good, squashing into relevant commits would be preferred. I'm going to test this out over the next few days, and then LGTM if it all looks good. Thanks so much for working on this @williammartin, @jszroberto, and @teddyking! ❤️

@williammartin
Copy link
Author

@cyphar No problem, thanks for keeping on top of this and providing useful pointers!

@cyphar
Copy link
Member

cyphar commented Sep 7, 2017

The issues with test extensibility and making sure opportunistic features are tested correctly are solved in #1529. I can help you rebase this PR's tests on the new tests/rootless.sh setup once it's merged.

@teddyking
Copy link
Contributor

Awesome, sounds good to me. I've squashed the commits here into one so hopefully it shouldn't be too difficult to rebase once #1529 is merged in.

@teddyking teddyking force-pushed the rootless-cgroups branch 2 times, most recently from b0c4933 to 484a915 Compare September 13, 2017 11:18
@teddyking
Copy link
Contributor

We've rebased on master and taken a stab at updating this PR's tests to fit the new tests/rootless.sh way of donig things.
We had a few issues with the update.bats tests (and I've just seen the jenkins build has failed on these ... will take a look at that now).

Here's a brief overview of the changes since last push:

  • Add $TESTFLAGS to tests/rootless.sh
  • Add "cgroups" to ALL_FEATURES with corresponding enable/disable_ funcs
  • Remove requires root from cgroup integration tests that previously had it
  • In helpers.bash we updated the runc_spec function to ensure that the config.json included a Linux.Resources object, as this is expected by the sed command in setup() func in updates.bash
  • This does raise the question as to whether runc spec --rootless should generate this by default?

@cyphar
Copy link
Member

cyphar commented Sep 13, 2017

@teddyking Cool, thanks. I'm currently at OSS but I'll take a look at this next week.

@teddyking
Copy link
Contributor

@cyphar great! I pushed a fix for the failing travis build, but it's now failing on the go 1.8.x job only for a totally unrelated reason (error downloading busybox.tar.xz), so it probably just needs to be kicked off again.

Copy link
Member

@cyphar cyphar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's my comments. They are very minor, and I will do my final review when I get back from OSS (and I have a chance to play with it some more). Overall it looks pretty great, thanks!

function enable_cgroup() {
# Set up cgroups for use in rootless containers.
mkdir -p /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/runc-cgroups-integration-test
chown rootless:rootless -R /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/runc-cgroups-integration-test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer if we did chown rootless:rootless -R /sys/fs/cgroup/*/runc-cgroups-integration-test here.


function disable_cgroup() {
# Remove cgroups used in rootless containers.
[ -d /sys/fs/cgroup/devices/runc-cgroups-integration-test ] && rmdir /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/runc-cgroups-integration-test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And rmdir /sys/fs/cgroup/*/runc-cgroups-integration-test here.

}

@test "runc create (rootless + no limits + cgrouppath + no permission) fails with permission error" {
requires rootless
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will think about this a little more when I get back from OSS, but I am not sure that we'd want the "fails with permission error" tests.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why wouldn't you want these? Test bloat?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it ends up working in the future (which may happen if systemd decides to enable the cgroupv2 delegation code), then the test will break. On the other hand, we could just include these and deal with that later.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can buy that. As long as getting a permission error isn't the desired behaviour right now (rather than say, some other error that might bubble up) and we don't want this test for regression, then I say kill the test and save test bloat/potential confusion later.

Would you mind pointing me to some info on the delegation code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a delegation section in the cgroupv2 documentation. Effectively you can set -o nsdelegate on the root mount of cgroupv2 and then you get permissions to write to a sub-cgroup when you're in a cgroup namespace. I did a quick search for the actual patch but couldn't find it.

# XXX: Also, this test should be split into separate sections so that we
# can skip kmem without skipping update tests overall.
requires cgroups_kmem
[[ "$ROOTLESS" -ne 0 ]] && requires rootless_cgroup
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: swap the order of the checks to match the rest of the tests.


function enable_cgroup() {
# Set up cgroups for use in rootless containers.
mkdir -p /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/runc-cgroups-integration-test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to source this list from /proc/self/cgroups, but I can do that in a follow-up if you prefer.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will leave that up to you to decide.

@cyphar
Copy link
Member

cyphar commented Sep 14, 2017

Note to myself: We should add some documentation on how to use the lxcfs PAM module to allow a "normal" user to set up delegated cgroups on their host.

@cyphar
Copy link
Member

cyphar commented Sep 29, 2017

Okay, the one thing left is the notification scheme for OOM and memory pressure. Effectively I think that we should not allow someone to register for notifications if we did not join a cgroup (and inherited the original cgroups) -- so we'd have to save somewhere whether the cgroup manager (silently) failed to setup cgroups.

I'll work on this after we merge this, since it's not a blocker IMO.

Signed-off-by: Ed King <eking@pivotal.io>
Signed-off-by: Gabriel Rosenhouse <grosenhouse@pivotal.io>
Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
@cyphar cyphar force-pushed the rootless-cgroups branch 3 times, most recently from 2b72c96 to 2fba94d Compare October 15, 2017 17:05
This ensures that we don't hard-code the set of cgroups on the host, as
well as making the permissions granted by rootless.sh much more
restrictive (to improve the scope of testing).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
@cyphar
Copy link
Member

cyphar commented Oct 16, 2017

LGTM. The other issues can be fixed in future patches (that I'm working on).

/cc @opencontainers/runc-maintainers

Approved with PullApprove

@crosbymichael
Copy link
Member

crosbymichael commented Oct 16, 2017

LGTM

Approved with PullApprove

@crosbymichael crosbymichael merged commit ff4481d into opencontainers:master Oct 16, 2017
@williammartin
Copy link
Author

Great stuff. Thanks @cyphar and all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants