support cgroup v2 (unified hierarchy) #654

Open
sols1 opened this Issue Mar 17, 2016 · 24 comments

@sols1

sols1 commented Mar 17, 2016

@cyphar

Member

cyphar commented Mar 22, 2016

cgroupv2 still doesn't support many of the cgroup controllers we need for runc. The most important one is the device "cgroup", which is a hard requirement for security. As far as I can see, CPU still hasn't been implemented either. Also, many of the other cgroups provide us with protections against other resource exhaustion attacks.

@sols1

sols1 commented Apr 27, 2016

It is possible to use cgroup v2 for some controllers and cgroup v1 for the others that are not yet available in cgroup v2.

Memory is the most difficult resource to manage, and that is what cgroup v2 fixes.

The device cgroup seems fairly straightforward to convert to cgroup v2: add device permissions to the existing single hierarchy.

@cyphar

Member

cyphar commented Apr 27, 2016

The other issue is that we would need to be running on a distribution that supports cgroupv2 as the default setup with systemd (which is essentially none of them). We can't really use cgroupv2 otherwise, because it would require either:

  • Moving all of the processes in the system to the v2 equivalent. But because of the internal node (and threadgroup) constraints this won't be pretty and we'd be changing distro policy.
  • Moving just the subtree to the v2 equivalent. While this is technically allowed, the documentation makes it clear that it's a development tool and shouldn't be used for production purposes.

For me, one of the biggest benefits of cgroupv2 is that cgroup namespaces make more sense on v2. Unfortunately, cgroup namespaces don't implement features that would make them useful at the moment (see #774 and #781). So there's that.

And yes, we can use both v2 and v1 at the same time, but that doesn't make the implementation any nicer (now we'd have to use two managers with two different "cgroup paths").
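
For illustration, a minimal sketch of the kind of per-controller dispatch a hybrid setup would force on a manager (this is not runc's actual code; the mountpoint constant and the controller list are assumptions): read cgroup.controllers on the unified hierarchy and fall back to the v1 hierarchy for anything it does not expose.

```go
// Hypothetical sketch, not runc's cgroup manager: decide per controller
// whether it can be driven on the unified (v2) hierarchy or must fall back
// to the legacy (v1) hierarchy.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// In systemd's "hybrid" layout the v2 hierarchy is typically mounted here;
// on a pure unified system it would be /sys/fs/cgroup instead (assumption).
const unifiedMountpoint = "/sys/fs/cgroup/unified"

// v2Controllers reads cgroup.controllers and returns the controllers that
// the unified hierarchy actually exposes.
func v2Controllers() (map[string]bool, error) {
	data, err := os.ReadFile(filepath.Join(unifiedMountpoint, "cgroup.controllers"))
	if err != nil {
		return nil, err // no unified hierarchy mounted
	}
	set := make(map[string]bool)
	for _, c := range strings.Fields(string(data)) {
		set[c] = true
	}
	return set, nil
}

func main() {
	wanted := []string{"memory", "cpu", "io", "pids", "devices", "freezer"}
	v2, err := v2Controllers()
	if err != nil {
		fmt.Println("no cgroupv2 hierarchy; use the v1 manager for everything")
		return
	}
	for _, c := range wanted {
		if v2[c] {
			fmt.Printf("%-8s -> unified (v2) manager\n", c)
		} else {
			fmt.Printf("%-8s -> legacy (v1) manager\n", c)
		}
	}
}
```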

@rodionos

rodionos commented May 1, 2016

For context, Ubuntu 16.04 LTS ships kernel 4.4:
https://wiki.ubuntu.com/XenialXerus/ReleaseNotes#Linux_kernel_4.4

@sols1

sols1 commented May 24, 2016

Not sure I understand all the issues related to cgroup namespaces. It would be nice to resolve all the conceptual issues before doing this, but for practical production use of containers, resource management is a big issue, and memory is the most difficult resource to manage because of its "non-renewable" nature, so to speak.

For example, Parallels/Virtuozzo used containers in production for 10+ years, and they ended up backporting the cgroup v2 memory controller to the old kernel they used (RHEL6, if I'm not mistaken).

Also, as far as I understand, Google has used containers in production for a long time and carried kernel patches to deal with memory accounting and management.

@cyphar

Member

cyphar commented May 24, 2016

@sols1

Not sure I understand all the issues related to cgroup namespaces. It would be nice to resolve all conceptual issues before doing this but ...

cgroup namespaces were a benefit of cgroupv2 😉. The general issue with cgroupv2 is that there just aren't enough controllers enabled for us to be able to use it properly (at a minimum, we'd need the freezer and device cgroups), and using both cgroupv2 and cgroupv1 together would make the implementation more complicated than it needs to be. On the plus side, we don't need the net_* controllers in cgroupv2 (they won't ever be added to cgroupv2), because you can now specify iptables rules by cgroup path (which, AFAIK, is namespaced by cgroups).

I'd be happy to work on kernel patches to add support for the controllers, but I'd recommend pushing upstream to get more controllers enabled for cgroupv2 -- they just aren't feature-complete for us right now, and I don't feel good about adding hacks to our cgroup management implementation to deal with cgroupv2's shortcomings.

but for practical production use of containers resource management is a big issue and memory is the most difficult resource to manage because of its "non-renewable" nature so to speak.

I understand, but there's also the problem that I'm not sure how we could test our use of cgroupv2, because systemd uses the cgroupv1 hierarchy on almost every distribution (I tried to switch to cgroupv2 on my laptop while my system was running -- it did not end well).
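
For what it's worth, a harness could at least detect which layout a host is actually running before exercising v2-only code paths. A minimal sketch (an illustration, not existing runc code), assuming the standard /sys/fs/cgroup mountpoint:

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	var st unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
		fmt.Println("statfs /sys/fs/cgroup:", err)
		return
	}
	switch st.Type {
	case unix.CGROUP2_SUPER_MAGIC:
		// cgroup2 is mounted directly at /sys/fs/cgroup: pure unified mode.
		fmt.Println("unified (cgroupv2) hierarchy")
	case unix.TMPFS_MAGIC:
		// A tmpfs full of per-controller mounts: legacy or hybrid mode.
		fmt.Println("legacy or hybrid (cgroupv1) layout")
	default:
		fmt.Printf("unexpected filesystem type 0x%x\n", st.Type)
	}
}
```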

@justincormack

Contributor

justincormack commented Oct 19, 2016

@cyphar we are in the merge window for 4.9, which will be the next LTS, so it is getting quite late to land support that most distros will carry for the next few years -- any chance of looking at the kernel patches?

I am happy to help with testing; it should be fairly easy on Alpine Linux, as it does not use systemd and so can change more easily.

@sols1

sols1 commented Oct 19, 2016

RancherOS (https://github.com/rancher/os) is another option. It does not use systemd, and even the systemd emulation was removed, AFAIK.

@cyphar

Member

cyphar commented Oct 20, 2016

I haven't really had a chance to work on kernel patches recently. However, I did try a few months ago to implement freezer so that it worked with cgroupv2 -- as far as I can tell, it's not really that trivial to do; namely, there are some edge cases that made the handling unclear. I also looked at the devices code, but it's quite a bit more complicated than the freezer code.

I might take a look sometime next month, but I can't really guarantee anything (I've been swamped recently).

@hustcat

Contributor

hustcat commented Nov 29, 2016

Buffered I/O throttling is another big benefit of cgroupv2.

@rhatdan

Contributor

rhatdan commented Jan 9, 2017

Rawhide just moved to cgroup v2, causing docker/runc to blow up.

https://bugzilla.redhat.com/show_bug.cgi?id=1411286

docker run -ti fedora bash
/usr/bin/docker-current: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:359: container init caused \\\"rootfs_linux.go:54: mounting \\\\\\\"cgroup\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay/e1432a26e33bebbc27619c9802d9218f3da8938b7f1696ca9be0890a2e75ac65/merged\\\\\\\" at \\\\\\\"/sys/fs/cgroup\\\\\\\" caused \\\\\\\"no subsystem for mount\\\\\\\"\\\"\"\n".

uname -r

4.10.0-0.rc2.git4.1.fc26.x86_64

@webczat

webczat commented Oct 6, 2017

Isn't the CPU controller merged for 4.14 already?

@cyphar

Member

cyphar commented Oct 7, 2017

4.14 isn't out yet 😉. CPU and memory have been merged, but there are still some disagreements over some bits (I still have to read through some patches I saw on the ML).

@brauner (from the LXC project) gave a nice talk about the more generic issues about cgroupv2: https://www.youtube.com/watch?v=P6Xnm0IhiSo .

@webczat

webczat commented Oct 7, 2017

@sargun

sargun commented Nov 13, 2017

4.14 is out now.

@cyphar

Member

cyphar commented Nov 14, 2017

My reservations about cgroupv2's shortcomings (and the issues with the "hybrid" mode of operation) still hold. Not to mention that (last I tried) I wasn't able to get a system to boot with cgroupv2 enabled -- which doesn't bode well for testing any of that code.

@redbaron

redbaron commented May 23, 2018

Is there any news/development regarding cgroups v2?

@cyphar

Member

cyphar commented May 25, 2018

Not really. freezer and devices are still not enabled on cgroupv2, and there are still arguments about the threaded mode of operation that was merged in 4.14.

@sargun

sargun commented May 25, 2018

@cyphar

Member

cyphar commented Jun 3, 2018

You don't need it, but you do want it. The main problem is that we'd still need to have a hybrid mode (an idea I've always felt uncomfortable with).

@sargun

sargun commented Sep 24, 2018

@cyphar For users who do not use the freezer (because they have PID namespaces) and who aren't trying to take live snapshots, do you think it's reasonable to have cgroupv2 support, and to be able to have runc use the cgroupv2 "alternate" mode?

@cyphar

Member

cyphar commented Sep 25, 2018

I don't mind having a pure-cgroupv2 implementation, but I don't think it would ultimately be useful. As far as I know, no distribution actually uses cgroupv2 controllers "for real" (to be fair, we are also probably the reason that hasn't happened yet). I unfortunately think that we must have a hybrid implementation, otherwise we won't be able to implement the cgroup parts of the OCI spec fully on ordinary systems (we could error out, and that's compliant, but it's not correct). Maybe for a first step pure-cgroupv2 would be fine, but I'm not 100% on that.

But my main concern is that this is actually going to be harder to implement than you might think. @brauner gave a talk about this last year, specifically in the context of LXC and container runtimes in general. The no-internal-process constraint in particular means that container runtimes will have to do a very large number of dodgy things in order to be able to run containers inside a new cgroup (you have to move the processes from any parent cgroups into a new leaf node). In addition, subtree_control gives you quite a few headaches, because some parent cgroup could limit your ability to create new cgroups with the controllers you need enabled.

In the Docker case this won't be as awful (though it will still be bad), because you can just create a new cgroup at /docker/FOO, which avoids some of the internal-process constraint issues (it's very unlikely that the root cgroup is completely unused, and so / will not be a leaf node). But I have a feeling systemd will cause many headaches if we start doing things like that in cgroupv2 -- especially since, in cgroupv2, they have the same problem as us with the internal-process constraint.
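
To make the no-internal-process and subtree_control points concrete, here is a minimal sketch of the ordering a runtime has to respect (illustrative only; the /sys/fs/cgroup/containers base path, the cgroup name, and the controller set are assumptions): the process has to be moved into a leaf cgroup before controllers can be delegated in the parent's cgroup.subtree_control.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// base is a hypothetical delegated subtree the runtime is allowed to manage.
const base = "/sys/fs/cgroup/containers"

// setupLeaf creates a leaf cgroup, moves pid into it, and only then enables
// controllers for the children of base. Doing the subtree_control write first
// would fail with EBUSY while base still has member processes (the
// no-internal-process constraint).
func setupLeaf(name string, pid int) error {
	leaf := filepath.Join(base, name)
	if err := os.MkdirAll(leaf, 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(leaf, "cgroup.procs"),
		[]byte(fmt.Sprintf("%d", pid)), 0o644); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(base, "cgroup.subtree_control"),
		[]byte("+memory +io +pids"), 0o644)
}

func main() {
	if err := setupLeaf("container-1", os.Getpid()); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```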

@alban

Contributor

alban commented Sep 25, 2018

we won't be able to implement the cgroup parts OCI spec fully on ordinary systems

I agree, the current OCI spec has been written with cgroup-v1 in mind... the device cgroup and the network classID are tied to cgroup-v1.

In cgroup-v2, the same features can be achieved with equivalents for the device cgroup and net_cls, but that's a different API.

So, in my opinion, the OCI spec would need an update for cgroup-v2: either include some cgroup-v2 concepts or be abstracted.

But I have a feeling systemd will cause many headaches if we start doing things like that in cgroupv2

Do you refer to systemd in the container, systemd on the host, or to using the container runtime systemd-nspawn?

For reference, systemd (on the host) supports 3 options for container runtimes with cgroup-v2.

@sargun

sargun commented Sep 25, 2018

Yeah, I think there are two threads here:

  1. We need to change the OCI spec to accommodate cgroupv2, and not be as "prescriptive" about how cgroups are implemented.
  2. We need a cgroupv2 engine.

I think that the engine should ideally have pluggable backends. The first one should probably just make RPCs to systemd to create slices and scopes. For example, in our system today, we run all containers under /containers.slice. I can imagine something like this:

/containers.slice/..
        (The following scopes are created by systemd with Delegate=true)
        /container-1.scope (Resource constraints exist here)
        /container-2.scope

It might make sense for us to do our own cgroup control eventually, but given how poorly systemd plays with others, and how much investment goes into it, I see no reason to reinvent the wheel.
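
For illustration, a rough sketch of that systemd-backed approach using the go-systemd D-Bus bindings (the slice name, scope name, and placing the current process are assumptions; this is not runc's actual systemd driver): ask systemd for a transient scope under containers.slice with Delegate=true, then manage the delegated subtree directly.

```go
package main

import (
	"fmt"
	"os"

	systemd "github.com/coreos/go-systemd/dbus"
	godbus "github.com/godbus/dbus"
)

func main() {
	// Connect to the systemd manager on the system bus.
	conn, err := systemd.New()
	if err != nil {
		fmt.Fprintln(os.Stderr, "connect to systemd:", err)
		return
	}
	defer conn.Close()

	props := []systemd.Property{
		systemd.PropSlice("containers.slice"),               // hypothetical parent slice
		systemd.PropPids(uint32(os.Getpid())),               // process to place in the scope
		{Name: "Delegate", Value: godbus.MakeVariant(true)}, // hand the cgroup subtree to us
	}

	// Ask systemd to create a transient scope; the channel reports job completion.
	done := make(chan string, 1)
	if _, err := conn.StartTransientUnit("container-1.scope", "replace", props, done); err != nil {
		fmt.Fprintln(os.Stderr, "start transient scope:", err)
		return
	}
	fmt.Println("scope job result:", <-done)
}
```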
