Proposal: Support custom cgroups #8551

Closed
ibuildthecloud opened this issue Oct 14, 2014 · 38 comments
Labels
area/runtime kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny

Comments

@ibuildthecloud
Contributor

Containers are mostly the combination of capabilities, namespaces, and cgroups. Docker already has custom capabilities support with --cap-add and --cap-drop. Custom namespace support is halfway there already with --net=*, and --ipc=* is being worked on. The last piece of the puzzle is the ability to control cgroups.

I propose that the cgroup paths be added to the HostConfig so that, on start, custom cgroup paths can optionally be used instead of the cgroups that Docker would set up. This would allow a component outside of Docker to create, control, and manage the cgroups, while the Docker container would simply join them.
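As a rough illustration of the shape this could take (a sketch only, not an existing Docker API), the request body for creating a container might carry an optional cgroup parent inside HostConfig. The field name `CgroupParent` and the helper `build_create_request` are hypothetical names chosen for this example.

```python
import json


def build_create_request(image, cgroup_parent=None):
    """Build a container-create request body as a plain dict.

    "CgroupParent" is a hypothetical HostConfig field used for illustration.
    When it is omitted, Docker would set up its own cgroups as it does today.
    """
    host_config = {}
    if cgroup_parent is not None:
        # The externally managed cgroup the container should join.
        host_config["CgroupParent"] = cgroup_parent
    return {"Image": image, "HostConfig": host_config}


if __name__ == "__main__":
    body = build_create_request("busybox", cgroup_parent="/my_service")
    print(json.dumps(body, indent=2))
```

Keeping the field optional preserves the current behavior for everyone who does not manage cgroups externally.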

I assume there are many use cases for this, but initially the feature can be used to better integrate with cgroups managed by systemd. The background is a hack (https://github.com/ibuildthecloud/systemd-docker) that I've put together to better manage Docker under systemd. systemd-docker does various useful things that better integrate Docker with systemd, and most of it should probably stay a project outside of Docker. One critical piece that makes systemd-docker work today is that it moves the running processes from one cgroup to the service's cgroup. This is what makes it a hack, and also not 100% reliable. If Docker supported using a custom cgroup, systemd-docker could become a production-worthy stopgap until a superior native integration between systemd and Docker exists.

@Ulexus

Ulexus commented Oct 14, 2014

+1

@jonboulle
Contributor

👍

@j0hnsmith

+1

@j0hnsmith

To elaborate on my use case, I have multiple containers working together to provide a service and I want to limit the resources the service can consume.

I'd like to be able to create a cgroup (manually), then tell Docker to run the containers for my service in that cgroup (i.e., multiple containers sharing the same cgroup).

Something like

docker run --cgroup my_cgroup ...

@hustcat

hustcat commented Dec 2, 2014

+1

@bgrant0607

More use cases:

One use case is the pod-like scenario (#8781) -- multiple co-deployed containers.

Another is differentiated quality of service. Some workloads need a high degree of predictability, while others just want to use whatever resources are available. We'd like to protect the former from the latter by putting all of the latter into a bucket that is constrained such that it can't interfere with the predictable workload. This same approach can be applied hierarchically in order to support more than 2 QoS tiers. More discussion of this can be found in presentations and documentation about lmctfy:

We'd like to similarly protect Docker and other system agents/daemons from user containers. We've received a number of reports from users who bricked nodes due to using up all the memory.

Not all 3 of these cases necessarily need to be expressed in the same way in the API.

For the pod case, one might be tempted to apply the current pattern of referring to other containers, such as with VolumesFrom and NetworkMode=container:id, using something like CgroupParent=container-id. However, this approach is problematic for a number of reasons. One is due to the coupling of the container lifetime and process lifetime. In the case of a system OOM, for example, such processes can die, even if they use minimal resources, which creates complicated failure modes. Another is the lack of reasonable mechanisms for managing and introspecting groups of related containers.

For differentiated quality of service, I'd like to specify higher-level semantic intent rather than concrete slices or cgroup paths, but there needs to be a general way to pass extra options down to the exec driver, and such mechanisms keep getting shot down, or even removed after being added (e.g., #4833). Alternatively, we'd be happy to make a proposal for first-class support in the API.

Configuration to protect Docker and other system agents could be specified with flags when starting the daemon.

/cc @vishh @rjnagal @thockin @vmarmol @dchen1107

@thockin
Contributor

thockin commented Jan 8, 2015

+1

@rjnagal
Contributor

rjnagal commented Jan 8, 2015

Having a way to specify the parent cgroup under which new cgroups are created would go a long way toward solving the issues @bgrant0607 pointed out. If the container cgroups are not tied directly to where the Docker daemon runs, it would help a lot in better protecting critical system daemons. libcontainer already accepts a parent as a parameter, so I think the actual work required to make this happen would be minimal.

@vishh
Contributor

vishh commented Jan 15, 2015

+1. Ping @crosbymichael

@crosbymichael
Contributor

What do you think the user facing API should look like? API and flags for solving your issues?

@vishh
Contributor

vishh commented Jan 19, 2015

@crosbymichael: For many of the use cases mentioned above, adding a --parent_cgroup=/<cgroup_path> flag to docker run is what is needed. With this option set, docker would create container cgroups under the hierarchy mentioned in --parent_cgroup.
To provide differentiated quality of service, this option would let users group low priority containers into a bucket and cap the total amount of resources the low priority containers can consume. This option can be used to place resource restrictions across a Pod. In the case of systemd, the service cgroup hierarchy can be used to group docker containers into a logical systemd cgroup.
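To make the intended behavior concrete, here is a minimal sketch of how a container's cgroup path could be derived from the parent, and how a cap on the parent bucket bounds every container beneath it. It assumes the container cgroup is named after the container ID, that the default parent is /docker, and that hierarchical accounting is enabled so ancestor limits apply to descendants. The function names and the `limits` mapping are made up for this example.

```python
import posixpath

# Assumption: containers land under /docker when no parent is given.
DEFAULT_PARENT = "/docker"


def container_cgroup_path(container_id, parent_cgroup=None):
    """Return the cgroup path for a container under the chosen parent."""
    parent = parent_cgroup or DEFAULT_PARENT
    return posixpath.join(parent, container_id)


def effective_memory_limit(path, limits):
    """Return the smallest limit set on `path` or any of its ancestors.

    `limits` maps cgroup paths to limits in bytes. A result of None means
    no limit is set anywhere along the path.
    """
    effective = None
    while True:
        limit = limits.get(path)
        if limit is not None and (effective is None or limit < effective):
            effective = limit
        if path in ("/", ""):
            break
        path = posixpath.dirname(path)
    return effective


if __name__ == "__main__":
    # A low-priority bucket capped at 512 MiB, holding one container
    # whose own limit is 1 GiB.
    limits = {"/batch": 512 * 2**20, "/batch/abc123": 2**30}
    path = container_cgroup_path("abc123", parent_cgroup="/batch")
    print(path, effective_memory_limit(path, limits))
```

The same walk up the tree is what allows more than two QoS tiers: each extra level of nesting adds another cap that applies to everything below it.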

@crosbymichael
Contributor

And that's it? Nothing else required?

@thockin
Contributor

thockin commented Jan 20, 2015

I think that is pretty much right. What else were you expecting to see?

@chakri-nelluri

+1

@timothysc

+1, this simplifies process tracking for cluster managers.

@vishh
Contributor

vishh commented Jan 20, 2015

@crosbymichael: We will need another flag to alter the oom_score_adj on containers to offer differentiated QOS.
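For reference, on Linux the adjustment is a per-process value in /proc/<pid>/oom_score_adj, ranging from -1000 (never kill) to 1000 (kill first). The sketch below shows roughly what such a flag would end up doing. The helper name and the `proc_root` parameter are illustrative only, and lowering the value normally requires extra privileges (CAP_SYS_RESOURCE).

```python
import os

# Valid range for oom_score_adj on Linux.
OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def set_oom_score_adj(pid, value, proc_root="/proc"):
    """Write `value` to <proc_root>/<pid>/oom_score_adj and return the path.

    `proc_root` is overridable so the function can be exercised against a
    scratch directory instead of the real /proc.
    """
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError(
            "oom_score_adj must be between %d and %d, got %d"
            % (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX, value)
        )
    path = os.path.join(proc_root, str(pid), "oom_score_adj")
    with open(path, "w") as f:
        f.write(str(value))
    return path
```

A low-priority container's init process would get a high positive value, so that under system memory pressure it is chosen before the predictable workloads.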

@bgrant0607

Let's keep the oom_score_adj issue separate -- please file a separate issue, since I don't see one. The cgroup parent alone will solve several problems for us.

@tnachen

tnachen commented Jan 20, 2015

+1 as well.
I think one nice-to-have for all systems integrating with Docker is that the cgroup path for a container is also available via docker inspect.
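Without that, an integrating system can recover the path from /proc/<pid>/cgroup on Linux, where each line has the form hierarchy-ID:controller-list:cgroup-path. A minimal parsing sketch follows; the function name and the sample paths are made up for illustration.

```python
def parse_proc_cgroup(text):
    """Map each controller name to its cgroup path.

    `text` is the content of /proc/<pid>/cgroup. Each line looks like
    "hierarchy-ID:controller-list:cgroup-path", and the controller list is
    comma-separated. On a unified (cgroup v2) hierarchy the controller list
    is empty, so the path is stored under the empty-string key.
    """
    paths = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two colons only, so the path is kept whole.
        _hierarchy_id, controllers, path = line.split(":", 2)
        for controller in controllers.split(","):
            paths[controller] = path
    return paths


if __name__ == "__main__":
    sample = "4:memory:/batch/abc123\n3:cpu,cpuacct:/batch/abc123\n"
    print(parse_proc_cgroup(sample))
```

This works, but every tool then needs to know the container's PID and repeat the parsing, which is why a single field in docker inspect would be easier to consume.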

@thockin
Contributor

thockin commented Jan 20, 2015

FWIW, we should mention that this will remove the ability to examine the
/docker cgroup and see all containers. I'm OK with that.

On Tue, Jan 20, 2015 at 11:01 AM, Timothy Chen notifications@github.com
wrote:

+1 as well.
I think one nice to have for all systems integrating with docker, is that
the cgroup path for a container is also available via docker inspect,
therefore it's at a set place to look up.


@vishh
Contributor

vishh commented Feb 2, 2015

@crosbymichael: Can we get a +1 for this feature? I can send out a PR soon. This is an important feature that will help improve system reliability a lot for kubernetes.

@crosbymichael
Contributor

@vishh yes, I will bring it up with the other maintainers today

@bytesandwich

+1

@vishh
Contributor

vishh commented Feb 20, 2015

Ping @crosbymichael!


@jessfraz jessfraz removed the kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny label Feb 26, 2015
@jessfraz jessfraz added the kind/feature Functionality or other elements that the project doesn't currently have. Features are new and shiny label Feb 26, 2015
@vishh
Contributor

vishh commented Feb 27, 2015

I plan to post a PR soon since no concerns have been expressed for this feature.

@jessfraz
Contributor

I think that is awesome @vishh I am +1 :)

@jessfraz
Contributor

anything to get rid of systemd cgroups :P

@bgrant0607

I discussed this with @crosbymichael at the last DGAB meeting, and my understanding is that we have the go-ahead for this.

@crosbymichael
Contributor

+1 for --cgroup-parent

@ConnorDoyle

+1, this will strengthen the multi-tenancy story when running Kubernetes as a Mesos framework, and --cgroup-parent should be sufficient for our use case.

@mohitsoni
Contributor

+1 for --cgroup-parent

@jdef
Contributor

jdef commented Mar 14, 2015

+1 for --cgroup_parent

@crosbymichael
Contributor

merged in #11428

@vishh
Contributor

vishh commented Mar 19, 2015

Thanks for the quick review everyone :)

@thockin
Contributor

thockin commented Mar 19, 2015

w00t! This is a good one. Thanks everyone.

On Thu, Mar 19, 2015 at 3:12 PM, Vish Kannan notifications@github.com
wrote:

Thanks for the quick review everyone :)
If the integration tests happen to be flaky, ping me and I can fix them. There
should be enough logs now to identify any issue.


@vmarmol
Contributor

vmarmol commented Mar 19, 2015

Yay cgroups! :D That was one of the fastest feature merges I've seen.

@ConnorDoyle

👍

@bgrant0607

Awesome, thanks a lot.

@jdef
Contributor

jdef commented Mar 19, 2015

+1

