Support cgroups with limits as rootless #1540

Merged

Conversation

@williammartin
Contributor

@williammartin williammartin commented Aug 1, 2017

Following on from the rootless cgroups discussion in #1457, this PR provides an initial implementation for rootless cgroups that fulfils Garden's needs. This does not enable all cgroup-related functionality, but we think it's a good starting point for other currently disabled features to work.

In more detail, this PR removes the current rootless cgroup implementation, and using cgroups in rootless mode should now work broadly the same as rootful, providing that runc has permissions on the cgroup path.

Differences with Rootful

Typically, without permissions on the cgroup path, we would expect an error when applying (entering a pid into a controller). However, we now support a kind of "opportunistic" cgroup usage when no limits have been set and no cgroup path has been provided. Since child processes are entered into the same cgroup as their parent, this is effectively the same (regarding cgroup enforcement) as not providing a cgroup path and not setting limits. Attempting to set (change a resource limit) will still result in a permission denied error, either during creation or via runc update, but hopefully with a slightly more informative error.

The devices cgroup doesn't have all functionality available for setting limits. This is because there is a requirement on CAP_SYS_ADMIN for the devices.allow and devices.deny files. We're not sure of a way to provide different device white/blacklisting per container, but a possible solution for some use cases is to set a static list in a parent cgroup that is inherited. This works because applying does not require CAP_SYS_ADMIN.

Disabled features

We haven't enabled OOM notification or memory pressure notification, but this is hopefully as simple as removing the rootless conditional:

if c.config.Rootless {

We haven't enabled CRIU features because we aren't familiar with what is required.

Notes

The BATS tests we added seem to cover the right features; however, we aren't super happy with the changes to the Makefile. Thoughts on a nicer way to satisfy the requirement that a cgroup exists and is chowned to rootless would be much appreciated.

Signed-off-by: Ed King <eking@pivotal.io>

Makefile Outdated
rootlessintegration: runcimage
docker run -e TESTFLAGS -t --privileged --rm -v $(CURDIR):/go/src/$(PROJECT) --cap-drop=ALL -u rootless $(RUNC_IMAGE) make localintegration

@cyphar

cyphar Aug 3, 2017
Member

Why have you removed --cap-drop=all? That was necessary to ensure that rootless didn't depend on extra capabilities.

@teddyking

teddyking Aug 16, 2017
Contributor

It seems --cap-drop=ALL doesn't do anything useful when also using --privileged. With or without it, once we switch to the rootless user our caps are:

CapInh: 0000003fffffffff
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000

So, as the tests are right now in master, and in this PR, we don't have any effective caps. If we want to guard against picking up caps somehow from files, we could perhaps figure out a way to drop from the bounding/inheritable set as well but I'm not sure that really makes sense within the test context.

Makefile Outdated
@@ -96,10 +96,14 @@ integration: runcimage
localintegration: all
bats -t tests/integration${TESTFLAGS}

rootlessintegrationstub:
mkdir -p /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/rootless_cgroup
chown rootless:rootless -R /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/rootless_cgroup

@cyphar

cyphar Aug 3, 2017
Member

It is crucial that when we add new tests for opportunistic functionality in rootless that we make sure that the base functionality works. So if we want to add tests for this (and also newuidmap in the future) we need to add them as effectively re-runs of the rootless tests under different conditions (with some extra tests enabled).

@teddyking

teddyking Aug 16, 2017
Contributor

Not quite sure we understand this, could you elaborate please?

@cyphar

cyphar Aug 21, 2017
Member

While we have the require rootless_cgroup class of tests, and the rest of the tests that don't have that requirement, I'm worried about not explicitly testing the case where this setup code is not run (in other words, the case where runc has no permissions to touch any cgroups). I appreciate that code is still tested, but I just want to be cautious about testing "opportunistic" functionality like rootless cgroup handling or (in the future) newuidmap handling.

I think the best solution to this would be to have something like the set of additional rootless features: (cgroups newuidmap etc) and then test base + <the power set of the other features>. So in the above case we would test base, base+cgroups, base+newuidmap, base+cgroups+newuidmap. But that's just an idea.

return err
}
m.Paths[sys.Name()] = p

@cyphar

cyphar Aug 3, 2017
Member

IIRC we moved this before Apply to handle some race conditions, maybe @hqhq remembers this better than me.

@hqhq

hqhq Aug 14, 2017
Contributor

Yeah, if Apply fails for some subsystems, we are not able to clean up the cgroup directories unless we have already recorded them in m.Paths, see #1196

@cyphar

cyphar Aug 14, 2017
Member

@williammartin What if you keep the m.Paths[sys.Name()] = p where it is, but do delete(m.Paths, sys.Name()) if there is a permission error that is ignore-able?

@teddyking

teddyking Aug 16, 2017
Contributor

That seems reasonable, we can try it out

@@ -198,6 +206,10 @@ func (m *Manager) Set(container *configs.Config) error {
	for _, sys := range subsystems {
		path := paths[sys.Name()]
		if err := sys.Set(path, container.Cgroups); err != nil {
			if path == "" {
				// cgroup never applied
				return fmt.Errorf("cannot set limits on the %s cgroup, as the container has not joined it. The cgroup controller may not be mounted or you may not have previously had permissions to create the cgroup subdirectory", sys.Name())

@cyphar

cyphar Aug 3, 2017
Member

This error is way too long.

@yastij

yastij Aug 11, 2017

+1
cannot set limits on the %s cgroups due to container not joining it or something like that seems enough

@teddyking

teddyking Aug 16, 2017
Contributor

Yep, no problem.

@@ -83,8 +83,7 @@ func (p *setnsProcess) start() (err error) {
	if err = p.execSetns(); err != nil {
		return newSystemErrorWithCause(err, "executing setns process")
	}
	// We can't join cgroups if we're in a rootless container.
	if !p.config.Rootless && len(p.cgroupPaths) > 0 {

@cyphar

cyphar Aug 3, 2017
Member

This check is still correct in some cases, but I guess erroring out is acceptable if someone explicitly asked for an impossible cgroup configuration (now that we could in principle nest things). I would like to see a test for this though.

@teddyking

teddyking Aug 16, 2017
Contributor

We're not clear under what circumstances the rootless check still makes sense? Can you give an example please?

p.cgroupPaths is loaded from the state.json which, as best we can tell, has to be the same as the cgroup paths the init process is in. So, barring people doing weird things, if you succeeded on create, you should succeed now?

@cyphar

cyphar Aug 21, 2017
Member

I'm actually not sure what I meant either, sorry about that. This change is fine, but I would like to see a runc exec test with rootless cgroups in use, to make sure this works fine.

@cyphar
Member

@cyphar cyphar commented Aug 3, 2017

Just to be clear, I'm very impressed that there were only two changes necessary to make the cgroups/fs driver opportunistic and this looks like it's on the right track. My main gripes are just with how we're testing stuff.

I'll look over the limitations you've listed when I get a chance.

@cyphar cyphar added this to the 1.1.0 milestone Aug 11, 2017
@yastij

@yastij yastij commented Aug 11, 2017

Thanks so much guys for the PR !


	if err := sys.Apply(d); err != nil {
		if os.IsPermission(err) && m.Cgroups.Path == "" {

@cyphar

cyphar Aug 14, 2017
Member

Another thing I just thought of is that we should only be permissive if we're running in rootless mode (though I imagine that such cases are unlikely, and passing config.Rootless down here might be a bit ugly).

@teddyking

teddyking Aug 16, 2017
Contributor

We hadn't done this because we couldn't think of a way in which root could get a permission error. Maybe from a Mandatory Access Control?

@cyphar

cyphar Aug 21, 2017
Member

Yeah, I think you're right. We can fix it later if we hit it some other time.

@teddyking
Contributor

@teddyking teddyking commented Aug 16, 2017

Thanks very much for the review. Hope we responded to all the comments.

--
@williammartin and Me.

@teddyking teddyking force-pushed the cloudfoundry-attic:rootless-cgroups branch from c99adee to 107d26d Sep 1, 2017
@teddyking
Contributor

@teddyking teddyking commented Sep 1, 2017

We've rebased and pushed a few changes, specifically:

  • Separate out rootless and rootless+cgroups integration tests
  • Added a rootless+cgroups exec test
  • Shortened the very long error message
  • Moved the m.Paths[sys.Name()] = p back to where it was originally

Hopefully that addresses most of the comments. We can also squash the additional commits if that'd be preferred.

@cyphar
Member

@cyphar cyphar commented Sep 2, 2017

First-pass this looks good, squashing into relevant commits would be preferred. I'm going to test this out over the next few days, and then LGTM if it all looks good. Thanks so much for working on this @williammartin, @jszroberto, and @teddyking! ❤️

@williammartin
Contributor Author

@williammartin williammartin commented Sep 3, 2017

@cyphar No problem, thanks for keeping on top of this and providing useful pointers!

@cyphar
Member

@cyphar cyphar commented Sep 7, 2017

The issues with test extensibility and making sure opportunistic features are tested correctly are solved in #1529. I can help you rebase this PR's tests on the new tests/rootless.sh setup once it's merged.

@teddyking teddyking force-pushed the cloudfoundry-attic:rootless-cgroups branch from b72c42b to 7b24009 Sep 7, 2017
@teddyking
Contributor

@teddyking teddyking commented Sep 7, 2017

Awesome, sounds good to me. I've squashed the commits here into one so hopefully it shouldn't be too difficult to rebase once #1529 is merged in.

@teddyking teddyking force-pushed the cloudfoundry-attic:rootless-cgroups branch 2 times, most recently from b0c4933 to 484a915 Sep 13, 2017
@teddyking
Contributor

@teddyking teddyking commented Sep 13, 2017

We've rebased on master and taken a stab at updating this PR's tests to fit the new tests/rootless.sh way of doing things.
We had a few issues with the update.bats tests (and I've just seen the jenkins build has failed on these ... will take a look at that now).

Here's a brief overview of the changes since last push:

  • Add $TESTFLAGS to tests/rootless.sh
  • Add "cgroups" to ALL_FEATURES with corresponding enable/disable_ funcs
  • Remove requires root from cgroup integration tests that previously had it
  • In helpers.bash we updated the runc_spec function to ensure that the config.json included a Linux.Resources object, as this is expected by the sed command in setup() func in updates.bash
  • This does raise the question as to whether runc spec --rootless should generate this by default?
@teddyking teddyking force-pushed the cloudfoundry-attic:rootless-cgroups branch from 484a915 to c13bd2f Sep 13, 2017
@cyphar
Member

@cyphar cyphar commented Sep 13, 2017

@teddyking Cool, thanks. I'm currently at OSS but I'll take a look at this next week.

@teddyking teddyking force-pushed the cloudfoundry-attic:rootless-cgroups branch from c13bd2f to 7feeaab Sep 14, 2017
@teddyking
Contributor

@teddyking teddyking commented Sep 14, 2017

@cyphar great! I pushed a fix for the failing travis build, but it's now failing on the go 1.8.x job only for a totally unrelated reason (error downloading busybox.tar.xz), so it probably just needs to be kicked off again.

Member

@cyphar cyphar left a comment

Here are my comments. They are very minor, and I will do my final review when I get back from OSS (and I have a chance to play with it some more). Overall it looks pretty great, thanks!

function enable_cgroup() {
# Set up cgroups for use in rootless containers.
mkdir -p /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/runc-cgroups-integration-test
chown rootless:rootless -R /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/runc-cgroups-integration-test

@cyphar

cyphar Sep 14, 2017
Member

I would prefer if we did chown rootless:rootless -R /sys/fs/cgroup/*/runc-cgroups-integration-test here.


function disable_cgroup() {
# Remove cgroups used in rootless containers.
[ -d /sys/fs/cgroup/devices/runc-cgroups-integration-test ] && rmdir /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/runc-cgroups-integration-test

@cyphar

cyphar Sep 14, 2017
Member

And rmdir /sys/fs/cgroup/*/runc-cgroups-integration-test here.

}

@test "runc create (rootless + no limits + cgrouppath + no permission) fails with permission error" {
requires rootless

@cyphar

cyphar Sep 14, 2017
Member

I will think about this a little more when I get back from OSS, but I am not sure that we'd want the "fails with permission error" tests.

@williammartin

williammartin Sep 25, 2017
Author Contributor

Why wouldn't you want these? Test bloat?

@cyphar

cyphar Sep 25, 2017
Member

If it ends up working in the future (which may happen if systemd decides to enable the cgroupv2 delegation code), then the test will break. On the other hand, we could just include these and deal with that later.

@williammartin

williammartin Sep 25, 2017
Author Contributor

I can buy that. As long as getting a permission error isn't the desired behaviour right now (rather than say, some other error that might bubble up) and we don't want this test for regression, then I say kill the test and save test bloat/potential confusion later.

Would you mind pointing me to some info on the delegation code?

@cyphar

cyphar Sep 25, 2017
Member

There's a delegation section in the cgroupv2 documentation. Effectively you can set -o nsdelegate on the root mount of cgroupv2 and then you get permissions to write to a sub-cgroup when you're in a cgroup namespace. I did a quick search for the actual patch but couldn't find it.

# XXX: Also, this test should be split into separate sections so that we
# can skip kmem without skipping update tests overall.
requires cgroups_kmem
[[ "$ROOTLESS" -ne 0 ]] && requires rootless_cgroup

@cyphar

cyphar Sep 14, 2017
Member

Minor nit: swap the order of the checks to match the rest of the tests.


function enable_cgroup() {
# Set up cgroups for use in rootless containers.
mkdir -p /sys/fs/cgroup/{blkio,cpu,cpuacct,cpuset,devices,freezer,hugetlb,memory,net_cls,net_prio,openrc,perf_event,pids,systemd}/runc-cgroups-integration-test

@cyphar

cyphar Sep 14, 2017
Member

We might want to source this list from /proc/self/cgroup, but I can do that in a follow-up if you prefer.

@teddyking

teddyking Sep 15, 2017
Contributor

Sure, will leave that up to you to decide.

@cyphar
Member

@cyphar cyphar commented Sep 14, 2017

Note to myself: We should add some documentation on how to use the lxcfs PAM module to allow a "normal" user to set up delegated cgroups on their host.

@teddyking teddyking force-pushed the cloudfoundry-attic:rootless-cgroups branch from 7feeaab to dc12997 Sep 15, 2017
@jszroberto jszroberto force-pushed the cloudfoundry-attic:rootless-cgroups branch from dc12997 to 3c10a6e Sep 27, 2017
@cyphar
Member

@cyphar cyphar commented Sep 29, 2017

Okay, the one thing left is the notification scheme for OOM and memory pressure. Effectively I think that we should not allow someone to register for notifications if we did not join a cgroup (and inherited the original cgroups) -- so we'd have to save somewhere whether the cgroup manager (silently) failed to setup cgroups.

I'll work on this after we merge this, since it's not a blocker IMO.

Will Martin CF Garden
Signed-off-by: Ed King <eking@pivotal.io>
Signed-off-by: Gabriel Rosenhouse <grosenhouse@pivotal.io>
Signed-off-by: Konstantinos Karampogias <konstantinos.karampogias@swisscom.com>
@yulianedyalkova yulianedyalkova force-pushed the cloudfoundry-attic:rootless-cgroups branch from 3c10a6e to ca4f427 Oct 5, 2017
@cyphar cyphar force-pushed the cloudfoundry-attic:rootless-cgroups branch 3 times, most recently from 2b72c96 to 2fba94d Oct 15, 2017
This ensures that we don't hard-code the set of cgroups on the host, as
well as making the permissions granted by rootless.sh much more
restrictive (to improve the scope of testing).

Signed-off-by: Aleksa Sarai <asarai@suse.de>
@cyphar cyphar force-pushed the cloudfoundry-attic:rootless-cgroups branch from 2fba94d to 23f4d31 Oct 16, 2017
@cyphar
Member

@cyphar cyphar commented Oct 16, 2017

LGTM. The other issues can be fixed in future patches (that I'm working on).

/cc @opencontainers/runc-maintainers

Approved with PullApprove

@crosbymichael
Member

@crosbymichael crosbymichael commented Oct 16, 2017

LGTM

Approved with PullApprove

@crosbymichael crosbymichael merged commit ff4481d into opencontainers:master Oct 16, 2017
2 checks passed
code-review/pullapprove Approved by crosbymichael, cyphar
continuous-integration/travis-ci/pr The Travis CI build passed
@williammartin
Contributor Author

@williammartin williammartin commented Oct 16, 2017

Great stuff. Thanks @cyphar and all!
