
add KEP for cgroups v2 support #1370

Open: giuseppe wants to merge 1 commit into master from giuseppe:cgroupsv2

Conversation

@giuseppe commented Nov 18, 2019

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

@k8s-ci-robot (Contributor) commented Nov 18, 2019

Welcome @giuseppe!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot (Contributor) commented Nov 18, 2019

Hi @giuseppe. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot (Contributor) commented Nov 18, 2019

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: giuseppe
To complete the pull request process, please assign derekwaynecarr
You can assign the PR to them by writing /assign @derekwaynecarr in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@AkihiroSuda left a comment

Let's unshare cgroupns by default
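
For context, unsharing the cgroup namespace (available since kernel 4.6) means the container sees its own cgroup as the root of the hierarchy. A minimal Go sketch of starting a process in a new cgroupns; this is illustrative only, not how any particular runtime wires it up:

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

// Run a command in a new cgroup namespace, so it sees its own cgroup as
// the root of the hierarchy. Requires privileges (CAP_SYS_ADMIN).
func main() {
	cmd := exec.Command("cat", "/proc/self/cgroup")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWCGROUP, // new cgroupns for the child
	}
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```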

* Add support for cgroups v2 to cAdvisor ([https://github.com/google/cadvisor/pull/2309](https://github.com/google/cadvisor/pull/2309))
* Add support for cgroups v2 to the Kubelet

## User Stories


@derekwaynecarr (Member) Nov 18, 2019

I would keep that separate

- GA: Assuming no negative user feedback based on production
experience, promote after 2 releases in beta.
## Testing

@AkihiroSuda Nov 18, 2019

Minimum kernel version?
I think 5.2 (introduced freezer) is fine.
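
For reference, the kernel 5.2 freezer on cgroup v2 is a per-cgroup `cgroup.freeze` interface file rather than the separate freezer hierarchy of v1. A minimal Go sketch, with an illustrative path and helper name:

```go
package cgroupv2

import (
	"os"
	"path/filepath"
)

// setFrozen freezes (true) or thaws (false) all processes in the cgroup at
// cgroupPath, e.g. "/sys/fs/cgroup/mypod" (illustrative path), by writing
// "1" or "0" to its cgroup.freeze file. Requires kernel >= 5.2.
func setFrozen(cgroupPath string, freeze bool) error {
	val := "0"
	if freeze {
		val = "1"
	}
	return os.WriteFile(filepath.Join(cgroupPath, "cgroup.freeze"), []byte(val), 0o644)
}
```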

@AkihiroSuda commented Nov 18, 2019

Do you think we should open a separate KEP for rootless?

@giuseppe (Author) commented Nov 18, 2019

> Do you think we should open a separate KEP for rootless?

Yes, let's keep it separate from cgroups v2, which is already touching enough components :-)

@derekwaynecarr (Member) left a comment

We need to add more detail on the kubelet changes where it interacts with cgroups directly.

- sig-architecture
reviewers:
- TBD
approvers:

@derekwaynecarr (Member) Nov 18, 2019

Add yuju, Dawn, and myself.



## User Stories

* The Kubelet can run on a cgroups v2 host

@derekwaynecarr (Member) Nov 18, 2019

Clarify that kubelet will support v1 or v2


|Kubernetes cgroups v1|Kubernetes cgroups v2 behavior|
|---|---|
|Killing processes using runc uses the freezer|There will be freezer for cgroups v2.|

@derekwaynecarr (Member) Nov 18, 2019

Kube doesn’t use freezer yet

@giuseppe (Author) Nov 19, 2019

I think I can drop this line at this point. When I started working on the KEP, there was no freezer for cgroups v2 yet, so the OCI runtime could not use it.

|Killing processes using runc uses the freezer|There will be freezer for cgroups v2.|
|CPU stats for Horizontal Pod Autoscaling|No .percpu cpuacct stats. Check cadvisor.|
|CPU pinning based on integral cores|Cpuset controller available|
|Memory limits|Not changed, different naming|

@derekwaynecarr (Member) Nov 18, 2019

Can you add detail on hugetlb and pids (both used today in kube)

@giuseppe (Author) Nov 21, 2019

Sadly, hugetlb is still missing in the kernel for cgroups v2.

I've sent a patch to add support for hugetlb on cgroups v2: https://marc.info/?l=linux-cgroups&m=157437087608728&w=2

Let's see how that goes...

|CPU pinning based on integral cores|Cpuset controller available|
|Memory limits|Not changed, different naming|

## Proposal

@derekwaynecarr (Member) Nov 18, 2019

Can you add some detail on how kubelet startup may change? For example, if the kubelet detects it's on a cgroup v2 host, it needs to update how it manages QoS and pod-level cgroups.
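
One common way to answer the detection question, used for example by runc's libcontainer, is to check whether /sys/fs/cgroup itself is a cgroup2 mount. A minimal Go sketch of that check (not the KEP's prescribed mechanism):

```go
package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// isCgroup2UnifiedMode reports whether the host mounts the unified cgroup v2
// hierarchy at /sys/fs/cgroup, i.e. the filesystem there is cgroup2fs.
func isCgroup2UnifiedMode() (bool, error) {
	var st unix.Statfs_t
	if err := unix.Statfs("/sys/fs/cgroup", &st); err != nil {
		return false, err
	}
	return st.Type == unix.CGROUP2_SUPER_MAGIC, nil
}

func main() {
	unified, err := isCgroup2UnifiedMode()
	if err != nil {
		panic(err)
	}
	fmt.Println("cgroup v2 unified hierarchy:", unified)
}
```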

@AkihiroSuda
Kubelet PR: [https://github.com/kubernetes/kubernetes/pull/85218](https://github.com/kubernetes/kubernetes/pull/85218)

### Option 2: Use cgroups v2 throughout the stack

@derekwaynecarr (Member) Nov 18, 2019

For option 1, the kubelet still has to write cgroups v2 for its own interactions with cgroups. Please add some detail around the kubelet cm package and eviction manager.
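
As background for the cm/eviction discussion: on a v2 host the read-back comes from per-cgroup files such as memory.current and cpu.stat instead of the v1 memory.usage_in_bytes and cpuacct files. A hedged Go sketch (the file layout is the kernel's; the helper names are illustrative, not the kubelet's actual code):

```go
package cgroupv2

import (
	"bufio"
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readMemoryCurrent returns current memory usage in bytes for the cgroup at
// cgroupPath (the v2 equivalent of the v1 memory.usage_in_bytes file).
func readMemoryCurrent(cgroupPath string) (uint64, error) {
	data, err := os.ReadFile(filepath.Join(cgroupPath, "memory.current"))
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

// readCPUUsageUsec returns cumulative CPU time in microseconds from the
// cgroup's cpu.stat file (the v2 counterpart of the v1 cpuacct.usage value;
// note there is no per-CPU breakdown like cpuacct.usage_percpu).
func readCPUUsageUsec(cgroupPath string) (uint64, error) {
	f, err := os.Open(filepath.Join(cgroupPath, "cpu.stat"))
	if err != nil {
		return 0, err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		fields := strings.Fields(s.Text())
		if len(fields) == 2 && fields[0] == "usage_usec" {
			return strconv.ParseUint(fields[1], 10, 64)
		}
	}
	if err := s.Err(); err != nil {
		return 0, err
	}
	return 0, fmt.Errorf("usage_usec not found in cpu.stat")
}
```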

## Graduation Criteria
- Alpha: Basic support for running Kubernetes on a cgroups v2 host.

@derekwaynecarr (Member) Nov 18, 2019

Worth discussing whether we should add test coverage on Fedora as a good upstream candidate.

@derekwaynecarr (Member) commented Nov 18, 2019

/assign

@giuseppe giuseppe force-pushed the giuseppe:cgroupsv2 branch 2 times, most recently from 83faa61 to c0c871d Nov 19, 2019
The network controllers _net_cls_ and _net_prio_ are not available with the new version. The alternative to these controllers is to use eBPF.

The lack of the network controllers probably affects some CNI plugins.

@squeed Nov 22, 2019

I don't currently know of any CNI plugins that use cgroups for network management. (My list is not exhaustive, of course...). Bandwidth limiting, if implemented, is done via a tc filter / qdisc for all of the plugins I'm aware of.


|Kubernetes cgroups v1|Kubernetes cgroups v2 behavior|
|---|---|
|CPU stats for Horizontal Pod Autoscaling|No .percpu cpuacct stats. Check cadvisor.|

@PatrickLang (Contributor) Dec 3, 2019

Does this have to depend on cAdvisor? I thought we were trying to depend on the stats API instead of requiring cAdvisor.

@giuseppe (Author) Dec 4, 2019

Is dropping cAdvisor tracked somewhere? I've added it only because I saw that it is still used.

Also, the changes required for cAdvisor have already been merged upstream. Another iteration might be required once hugetlb is supported, though.

* Add support for cgroups v2 to cAdvisor ([https://github.com/google/cadvisor/pull/2309](https://github.com/google/cadvisor/pull/2309))
* Add support for cgroups v2 to the Kubelet

## User Stories

@yujuhong (Member) Dec 3, 2019

The user stories don't seem too user-focused. How are users going to use cgroup v2 and why? Is there additional benefit, or is it just that users have nodes using v2 and they don't want to be aware of this?

### Dependencies on OCI and container runtimes

Cgroups v2 requires changes to the OCI runtime specs. Changes to the
OCI specs are out of the scope for this proposal.

@yujuhong (Member) Dec 3, 2019

Does this mean OCI runtime spec already supports v2? If not, why don't we start by changing the OCI spec?

@giuseppe (Author) Dec 4, 2019

That will be done in the second phase, where we will also expose/use cgroup v2-only features.

At this point I am not sure the KEP should account for that, as it has a much broader scope, and each functionality requires its own discussion.

In the first phase we need feature parity with what we have now, so that Kubernetes can run on a cgroup v2 host.

## User Stories

* The Kubelet can run on a host using either cgroups v1 or v2
* Scheduler should be able to use cgroups v2

@yujuhong (Member) Dec 3, 2019

Why calling out scheduler? Does cgroup v2 require changing the resource representation in pod specs?

@giuseppe (Author) Dec 9, 2019

From the discussion we had, it seems there is no need to change the scheduler. I am going to drop it.


## Risks and Mitigations

Some cgroups v1 features are not available with cgroups v2,

@yujuhong (Member) Dec 3, 2019

It would be good to have a list of unsupported features (that are used by Kubernetes) in cgroups v2.


## Graduation Criteria

- Alpha: Basic support for running Kubernetes on a cgroups v2 host.

@yujuhong (Member) Dec 3, 2019

There should be some test coverage even for alpha, whether it's CRI validation suite, or node e2e tests.

@giuseppe (Author) Dec 9, 2019

I'll move the test coverage to the Alpha phase.

experience, promote after 2 releases in beta.

## Testing

@yujuhong (Member) Dec 3, 2019

Suggestion: add testing for performance



With this approach cAdvisor would have to read back values from
cgroups v2 files (already done).

@yujuhong (Member) Dec 3, 2019

So, with this option, CRI implementations will need to map the cgroup v1 fields to cgroup v2 values, right?

Could you list what each layer needs to do for each option to make it clear?

  • user (pod spec)
  • kubelet
  • CRI implementation
  • OCI runtime

@giuseppe (Author) Dec 9, 2019

The changes in the CRI implementation and the OCI runtime might be implemented in different ways, depending on where we draw the line.
For CRI-O+crun, most of the logic is in the OCI runtime itself, but that is not the only way to achieve it.
For the kubelet, the changes in this patch should be enough (at least until hugetlb becomes available): kubernetes/kubernetes#85218

In the second phase, though, when cgroup v2 is fully supported throughout the stack, both the pod spec and the CRI will need to change.

@yujuhong (Member) commented Dec 3, 2019

/cc @Random-Liu for containerd

@k8s-ci-robot k8s-ci-robot requested a review from Random-Liu Dec 3, 2019
@k8s-ci-robot (Contributor) commented Dec 3, 2019

@yujuhong: GitHub didn't allow me to request PR reviews from the following users: for, containerd.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @Random-Liu for containerd

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alban alban mentioned this pull request Dec 4, 2019
@AkihiroSuda AkihiroSuda mentioned this pull request Dec 4, 2019
@giuseppe giuseppe force-pushed the giuseppe:cgroupsv2 branch from c0c871d to 23ddf34 Dec 9, 2019
@giuseppe giuseppe force-pushed the giuseppe:cgroupsv2 branch 2 times, most recently from a87c972 to 7fa00eb Dec 9, 2019
| OCI (x) | cgroup 2 value (y) | conversion | comment |
|---|---|---|---|
| limit | memory.max | y = x ||
| swap | memory.swap_max | y = x ||

@AkihiroSuda Dec 10, 2019

memory.swap.max
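
With that spelling fix, a minimal Go sketch of applying the table's conversion as written (the helper is hypothetical; note also that the OCI swap value is conventionally memory+swap, while cgroup v2's memory.swap.max counts swap alone, so a real runtime may need to subtract the memory limit):

```go
package cgroupv2

import (
	"os"
	"path/filepath"
	"strconv"
)

// applyOCIMemory writes the OCI memory settings into a cgroup v2 directory,
// following the proposal's table: limit -> memory.max and swap ->
// memory.swap.max, both y = x. Zero is treated as "not set".
func applyOCIMemory(cgroupPath string, limit, swap int64) error {
	write := func(name string, v int64) error {
		return os.WriteFile(filepath.Join(cgroupPath, name),
			[]byte(strconv.FormatInt(v, 10)), 0o644)
	}
	if limit > 0 {
		if err := write("memory.max", limit); err != nil {
			return err
		}
	}
	if swap > 0 {
		if err := write("memory.swap.max", swap); err != nil {
			return err
		}
	}
	return nil
}
```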

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
@giuseppe giuseppe force-pushed the giuseppe:cgroupsv2 branch from 7fa00eb to 77b401f Dec 10, 2019