Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

systemd node spec proposal #17688

Merged
merged 1 commit into from
May 20, 2016

Conversation

derekwaynecarr
Copy link
Member

@derekwaynecarr derekwaynecarr commented Nov 23, 2015

The following outlines changes that I want to make to the kubelet in order to better integrate with systemd systems, and to better isolate containers in their own cgroup based on the qos tier.

I think this is a precursor to getting more intelligent low compute resource eviction.

/cc @smarterclayton @ncdc @pmorie @dchen1107 @vishh @bgrant0607


This change is Reviewable

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 23, 2015
@dchen1107
Copy link
Member

cc/ @jonboulle to make sure this is align with rkt

@yifan-gu
Copy link
Contributor

cc @yifan-gu @alban

* ```service``` - a file that describes how to manage a service or application on the server; how to start/stop a service; when it should
be started; under what circumstances it should be restarted; and any resource controls that should be applied to the service.
* ```slice``` - a file that describes the inner-leaves of the cgroup hierarchy. the name aligns with its placement in the hierarchy tree;
for example, ```kubelet-besteffort.slice``` denotes a cgroup node ```/sys/fs/cgroup/<controller>/kubelet.slice/besteffort.slice```. A
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It actually denotes a cgroup node /sys/fs/cgroup/<controller>/kubelet.slice/kubelet-besteffort.slice. The kubelet part is repeated in the cgroup path.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example: foo-bar.slice is a slice that is located within foo.slice, which in turn is located in the root slice -.slice.

http://www.freedesktop.org/software/systemd/man/systemd.slice.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alban - thanks for the typo catch! will update.

@vishh
Copy link
Contributor

vishh commented Nov 24, 2015

I'd prefer kubelet having complete access to croups rather than be bound to systemd cgroups APIs, which is not complete yet.

@vishh
Copy link
Contributor

vishh commented Nov 24, 2015

I'd suggest splitting this proposal into a separate bootstrapping/setup phase and a runtime phase.
The bootstrapping suggestions seems fine to me, but the runtime cgroups setup has some issues.
Let's get the first part of of the way, since the runtime is anyways still controlled by docker as of now.

@jonboulle
Copy link
Contributor

I'd prefer kubelet having complete access to croups rather than be bound to systemd cgroups APIs, which is not complete yet.

This is a pretty significant decision either way. On the rkt side we'd definitely have a preference for sticking to the systemd cgroups API for now since it's a lot simpler and provides a much cleaner abstraction/integration with the way rkt pods are structured. I think as was discussed in the systemd integration meeting the other week, now that unified hierarchy is landing we're at a point where we could start to push on upstream to add the different things we need.

But if it's a dealbreaker we can make something work the other way...

@derekwaynecarr
Copy link
Member Author

@vishh - I know we discussed setting up memory soft limits (which is in systemd, was not in docker last I looked, but looking at master it looks like some work went into adding --reservation field; i need to verify the release that is actually in). Are there specific controllers or properties not yet exposed that you want to exploit in the near term?

Either way, I agree splitting the proposal into the phases proposed.

Thanks!


**Status**: Proposed

*This document presents a specification for how the ```kubelet``` interfaces with ```systemd``` with a focus on quality of service concerns.*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: triple backticks unnecessary, single is fine.

@pmorie
Copy link
Member

pmorie commented Nov 25, 2015

Nit: wrapping lines at 100 chars helps make proposals easier to comment on.

@vishh
Copy link
Contributor

vishh commented Nov 30, 2015

@derekwaynecarr: Acknowledged.

@bgrant0607 bgrant0607 assigned dchen1107 and unassigned bgrant0607 Nov 30, 2015
@derekwaynecarr derekwaynecarr force-pushed the systemd_nodespec branch 2 times, most recently from a6d0d9a to 47c56b5 Compare December 4, 2015 23:46
@derekwaynecarr derekwaynecarr changed the title WIP: systemd node spec proposal systemd node spec proposal Dec 4, 2015
@derekwaynecarr
Copy link
Member Author

@vishh - I hope this clarifies what I am looking to enable.

@mrunalp - your review is appreciated as well.

Anyone else please take a look and provide input.


## kubelet cgroups per quality of service (qos) tier

It is desired to parent containers in a cgroup based on their qos tier to facilitate local node accounting scenarios.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@derekwaynecarr: Kubelet doesn't do any qos specific cgroup management as of now. Until the overall plan for cgroup management is finalized, can we exclude this section from this proposal?
Let's get the node initial bootstrapping finalized while the cgroups part gets finalized.
WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine , hope idea was clear for future discussion topics. I just wanted to relate with cgroup-root that we have today.

@derekwaynecarr
Copy link
Member Author

I am going to take another pass at this to update based on the state of Kubernetes 1.2.

Aspects of this document are stale.

@derekwaynecarr
Copy link
Member Author

add a point on volume storage driver daemons:
#23491 (comment)

@derekwaynecarr derekwaynecarr added this to the v1.3 milestone May 5, 2016
@derekwaynecarr
Copy link
Member Author

one more item to note:

when systemd manages docker, and resource accounting is off for that unit, the container runtime stats defaults to the cpu cgroup container, which will be /, which means its the same as node level stats. We need to require that the unit file that manages docker has the following:

[Unit]
Description=Docker Application Container Engine
Documentation=http://docs.docker.com
After=network.target

[Service]
CPUAccounting=true
MemoryAccounting=true

Otherwise, runtime accounting will be wrong.

@dchen1107
Copy link
Member

cc/ @adityakali @andyzheng0831 this is the proposal I mentioned to you earlier related to NodeAllocatable configuration for GCI image.

@andyzheng0831
Copy link

@dchen1107 not sure if this proposal covers cAdvisor. So far I feel cAdvisor does not work well with systemd in some cases.

@dchen1107
Copy link
Member

@andyzheng0831 The proposal doesn't cover cAdvisor since it is part of kubelet today. This one is trying to standardize systemd node configuration, especially on resource management side.


### Docker runtime support for --cgroup-parent

Docker versions <= 1.0.9 did not have proper support for `-cgroup-parent` flag on `systemd`. This
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker 1.7 is minimal version Kubelet support since 1.1 release. Kubelet will mark the node not_ready today if docker version is below that minimal version.

@dchen1107
Copy link
Member

LGTM

@dchen1107 dchen1107 added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2016
@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2016
@vishh
Copy link
Contributor

vishh commented May 19, 2016

@andyzheng0831 @dchen1107 It would help enumerate cAdvisor issues to see if any of them falls within the scope of this PR. WDYT?

@dchen1107 dchen1107 added release-note Denotes a PR that will be considered when it comes time to generate release notes. lgtm "Looks good to me", indicates that a PR is ready to be merged. and removed release-note-label-needed labels May 19, 2016
@k8s-bot
Copy link

k8s-bot commented May 19, 2016

GCE e2e build/test passed for commit eee9a58.

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 53b5b41 into kubernetes:master May 20, 2016
@derekwaynecarr
Copy link
Member Author

I am happy to look at cAdvisor issues if enumerated. I had added some stuff this release to ignore .mount cgroups. Happy to do others if there is a known list.

xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016
Automatic merge from submit-queue

systemd node spec proposal

The following outlines changes that I want to make to the ```kubelet``` in order to better integrate with ```systemd``` systems, and to better isolate containers in their own ```cgroup``` based on the qos tier.

I think this is a precursor to getting more intelligent low compute resource eviction.

/cc @smarterclayton @ncdc @pmorie @dchen1107 @vishh @bgrant0607

<!-- Reviewable:start -->
---
This change is [<img src="http://reviewable.k8s.io/review_button.svg" height="35" align="absmiddle" alt="Reviewable"/>](http://reviewable.k8s.io/reviews/kubernetes/kubernetes/17688)
<!-- Reviewable:end -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/design Categorizes issue or PR as related to design. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.