Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix reserved cgroup systemd #78793

Conversation

@mattjmcnaughton
Copy link
Contributor

commented Jun 7, 2019

What type of PR is this?
/kind bug

What this PR does / why we need it:
Fix an issue in which, when trying to specify the --kube-reserved-cgroup
(or --system-reserved-cgroup) with --cgroup-driver=systemd, we will
not properly convert the systemd cgroup name into the internal cgroup
name that k8s expects. Without this change, specifying
--kube-reserved-cgroup=/test.slice --cgroup-driver=systemd will fail,
and only --kube-reserved-cgroup=/test --crgroup-driver=systemd will succeed,
even if the actual cgroup existing on the host is /test.slice.

Additionally, add light unit testing of our process from converting to a
systemd cgroup name to kubernetes internal cgroup name.

Which issue(s) this PR fixes:
Fixes #78629

Special notes for your reviewer:
cc @xiongmaodada @jaypipes

Does this PR introduce a user-facing change?:

When specifying `--(kube|system)-reserved-cgroup`, with `--cgroup-driver=systemd`, it is now possible to use the fully qualified cgroupfs name (i.e. `/test-cgroup.slice`).
Fix reserved cgroup systemd
Fix an issue in which, when trying to specify the `--kube-reserved-cgroup`
(or `--system-reserved-cgroup`) with `--cgroup-driver=systemd`, we will
not properly convert the `systemd` cgroup name into the internal cgroup
name that k8s expects. Without this change, specifying
`--kube-reserved-cgroup=/test.slice --cgroup-driver=systemd` will fail,
and only `--kube-reserved-cgroup=/test --crgroup-driver=systemd` will succeed,
even if the actual cgroup existing on the host is `/test.slice`.

Additionally, add light unit testing of our process from converting to a
systemd cgroup name to kubernetes internal cgroup name.
@mattjmcnaughton

This comment has been minimized.

Copy link
Contributor Author

commented Jun 7, 2019

/assign @ConnorDoyle

thanks :)

@dashpole

This comment has been minimized.

Copy link
Contributor

commented Jun 7, 2019

While it might not have been the optimal way to implement it, I believe the intended way to set the flag is to specify --kube-reserved-cgroup=/test to use the /test.slice cgroup.

cc @derekwaynecarr who might remember why we did this, or if it was a mistake we just decided to stick with for backwards compatibility.

@dashpole

This comment has been minimized.

Copy link
Contributor

commented Jun 7, 2019

I took a look at this PR, and I don't think this meets backwards compatibility requirements for kubelet flags. For example, if someone is currently specifying --kube-reserved-cgroup=/test, their kubelet wouldn't work after upgrading unless they switched to --kube-reserved-cgroup=/test.slice.

@jaypipes

This comment has been minimized.

Copy link
Contributor

commented Jun 7, 2019

Hmm, I was thinking a good solution to the original bug (from a user experience perspective) would be to simply have the configuration option handler for --kube-reserved-group simply strip off any trailing ".slice" notation that the user may have erroneously supplied.

That would be backwards-compatible and give a better user experience than the error message that was returned in #78629

@mattjmcnaughton

This comment has been minimized.

Copy link
Contributor Author

commented Jun 8, 2019

@ConnorDoyle @jaypipes thanks for the feedback!

The CgroupName function calls ParseSystemdToCgroupName, which strips the .slice only if it exists. So I think this pr may already be doing what @jaypipes is describing. In addition, I believe this diff will not harm backwards compatibility, because both /test and /test.slice can be passed.

Very possible I'm misreading the code though :)

@dashpole

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2019

Ah, you are correct. In that case, this change
/lgtm

@jaypipes
Copy link
Contributor

left a comment

Suffice to say it's a bit confusing to have a method called CgroupName() on the cgroupManager object in addition to a type named CgroupName :)

But yes, this patch works. Might be useful to mention in the commit message that the CgroupName method of the cgroupManager object calls ParseSystemdToCgroupName(), which is the part that fixes the bug/issue in question.

@mattjmcnaughton

This comment has been minimized.

Copy link
Contributor Author

commented Jun 11, 2019

Thanks all :)

cc @ConnorDoyle friendly ping :)

@mattjmcnaughton

This comment has been minimized.

Copy link
Contributor Author

commented Jun 19, 2019

cc @Random-Liu mind taking a look and approving if you get the chance?

@mattjmcnaughton

This comment has been minimized.

Copy link
Contributor Author

commented Jun 23, 2019

/assign @vishh

mind taking a look if you get the chance? would love to get this bug fixed and the pr closed :)

@mattjmcnaughton

This comment has been minimized.

Copy link
Contributor Author

commented Jul 23, 2019

cc @tallclair mind approving? would love to get this merged and the issue closed.

@mattjmcnaughton

This comment has been minimized.

Copy link
Contributor Author

commented Aug 2, 2019

/assign @derekwaynecarr

Could you please take a look? Thanks! I want to rule this out as a cause of #80807.

@derekwaynecarr

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

I will take a look.

@dashpole I need to verify the behavior intended. Internally, we had the kubelet describe cgroups abstractly into the cgroupfs format, and converted it to a concrete name. Externally, I had thought we settled on using cgroupfs name, but yeah, this will be a fun walk down memory lane to remember what was intended or not ;-)

@derekwaynecarr

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

@mattjmcnaughton its also possible global accounting was not enabled on the CoreOS host.

see:
https://www.freedesktop.org/software/systemd/man/systemd-system.conf.html#DefaultCPUAccounting=

@derekwaynecarr

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

/hold

@derekwaynecarr

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

The original design had stated that only the cgroupfs fully qualified name was to be used.

See:
https://github.com/kubernetes/community/blob/master/contributors/design-proposals/node/node-allocatable.md#phase-2---enforce-allocatable-on-pods

@derekwaynecarr

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

/hold cancel

This fixes a regression I suspect that was introduced when we converted how we managed internal cgroup names. I do not think it will address the CoreOS issue that was noted as I think that is due to not having DefaultAccounting enabled on systemd for the unit that launched the kubelet.

Either way, this is a net win.

/approve
/lgtm

@k8s-ci-robot

This comment has been minimized.

Copy link
Contributor

commented Aug 2, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, mattjmcnaughton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@derekwaynecarr

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

Updated the release note for clarity that the user is intended to use the fully qualified cgroupfs name.

@derekwaynecarr

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

/hold

actually, this confused me for a second.

@derekwaynecarr

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

/hold cancel

the desired behavior was to always specify the fully qualified name and never the internal name, so always /system.slice and never /system even when using systemd cgroup driver. i think this regression was introduced when @filbranden tried to clean up our internal cgroup naming logic. In practice, it is extremely rare and not really recommended to enforce reservations on system or kube-reserved groups (which is why its not the default), which explains why this was probably not found previously. Its actually a good idea for us to log a warning if people do enforce it that tehy really really konw what they are doing!

@fejta-bot

This comment has been minimized.

Copy link

commented Aug 3, 2019

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit c63000e into kubernetes:master Aug 3, 2019

23 checks passed

cla/linuxfoundation mattjmcnaughton authorized
Details
pull-kubernetes-bazel-build Job succeeded.
Details
pull-kubernetes-bazel-test Job succeeded.
Details
pull-kubernetes-conformance-image-test Skipped.
pull-kubernetes-cross Skipped.
pull-kubernetes-dependencies Job succeeded.
Details
pull-kubernetes-e2e-gce Job succeeded.
Details
pull-kubernetes-e2e-gce-100-performance Job succeeded.
Details
pull-kubernetes-e2e-gce-csi-serial Skipped.
pull-kubernetes-e2e-gce-device-plugin-gpu Job succeeded.
Details
pull-kubernetes-e2e-gce-iscsi Skipped.
pull-kubernetes-e2e-gce-iscsi-serial Skipped.
pull-kubernetes-e2e-gce-storage-slow Skipped.
pull-kubernetes-godeps Skipped.
pull-kubernetes-integration Job succeeded.
Details
pull-kubernetes-kubemark-e2e-gce-big Job succeeded.
Details
pull-kubernetes-local-e2e Skipped.
pull-kubernetes-node-e2e Job succeeded.
Details
pull-kubernetes-node-e2e-containerd Job succeeded.
Details
pull-kubernetes-typecheck Job succeeded.
Details
pull-kubernetes-verify Job succeeded.
Details
pull-publishing-bot-validate Skipped.
tide In merge pool.
Details

@k8s-ci-robot k8s-ci-robot added this to the v1.16 milestone Aug 3, 2019

@mattjmcnaughton mattjmcnaughton deleted the mattjmcnaughton:mattjmcnaughton/78629-fix-reserved-cgroup-systemd branch Aug 3, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.