
processes from /system.slice using cores assigned to guaranteed Pods when --cpu-manager-policy=static is set #85764

Closed
subrnath opened this issue Nov 30, 2019 · 14 comments
Labels
area/kubelet kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@subrnath

subrnath commented Nov 30, 2019

What happened:

Following is the configuration for kubelet to enable the --cpu-manager-policy=static feature.

--cpu-manager-policy=static --system-reserved=cpu=10,memory=10Gi --system-reserved-cgroup=/system.slice.

There is no separate cgroup for the kubelet and hence --kube-reserved is not used.

./user.slice
./system.slice
./system.slice/system-getty.slice

The PSR field of the "ps -eLF" command on the host shows that processes from /system.slice are using cores assigned to application Pods that are Guaranteed and have integer cores assigned.

Here is /var/lib/kubelet/cpu_manager_state, which shows that this Pod is assigned cores 8-11 and 32-35:

{"policyName":"static","defaultCpuSet":"0-7,12-31,36-47","entries":{"f7cc858318f3a84d9cbd340e058126379292f09578ca362680e3a350c8edad63":"8-11,32-35"},"checksum":462128154}

On the host:
ps -eLF | awk '{print $9 " " $13 " " $14 " " $2 " " $3}' | sort | grep -i glusterfs
8 /usr/sbin/glusterfs --log-level=ERROR 8870 1

Here 8 is the core, 8870 is the PID, and 1 is the PPID. The following command shows that the cgroup of PID 8870 is /system.slice:

$ cat /proc/8870/cgroup
11:hugetlb:/
10:freezer:/
9:blkio:/system.slice/run-rf63a79fe920945f497cdb8ed37874e36.scope
8:memory:/system.slice/run-rf63a79fe920945f497cdb8ed37874e36.scope
7:pids:/system.slice/run-rf63a79fe920945f497cdb8ed37874e36.scope
6:cpuset:/
5:net_cls,net_prio:/
4:perf_event:/
3:cpu,cpuacct:/system.slice/run-rf63a79fe920945f497cdb8ed37874e36.scope
2:devices:/system.slice/run-rf63a79fe920945f497cdb8ed37874e36.scope
1:name=systemd:/system.slice/run-rf63a79fe920945f497cdb8ed37874e36.scope

Similarly, other processes from /system.slice, such as the docker daemon, are also using the cores assigned to this Pod.

What you expected to happen:

No system process from /system.slice should be scheduled on the cores assigned to this Guaranteed Pod with integer CPU requests.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.15.3
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

  • Kernel (e.g. uname -a):

4.10.0-27-generic

  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/area kubelet
/sig node

@subrnath subrnath added the kind/bug Categorizes issue or PR as related to a bug. label Nov 30, 2019
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 30, 2019
@ipuustin
Contributor

It seems to me that the cpuset cgroup of the pid 8870 you mention is just / instead of /system.slice:

6:cpuset:/

The cpuset cgroup controls the "integer" cpus on which the process executes.
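For what it's worth, you can compare the effective CPU affinity of the process with whatever cpuset /system.slice has (a rough sketch, assuming a cgroup v1 hierarchy with the cpuset controller mounted under /sys/fs/cgroup/cpuset; the slice's cpuset.cpus file only exists if such a cgroup was actually created):

$ grep Cpus_allowed_list /proc/8870/status               # CPUs the glusterfs process may run on
$ cat /sys/fs/cgroup/cpuset/system.slice/cpuset.cpus     # cpuset of the slice, if the cgroup exists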

@ipuustin
Contributor

I looked a bit further and it seems that the 10-core request is converted into cpu shares:

Jan 15 15:27:24 test kubelet[46360]: I0115 15:27:24.337944   46360 node_container_manager_linux.go:105] Enforcing System reserved on cgroup "/system.slice" with limits: map[cpu:{i:{value:10 scale:0} d:{Dec:<nil>} s:10 Format:DecimalSI}]
Jan 15 15:27:24 test kubelet[46360]: I0115 15:27:24.337977   46360 node_container_manager_linux.go:134] Enforcing limits on cgroup ["system"] with 824649638272 cpu shares, 0 bytes of memory, and 0 processes

In particular, https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/types.go#L25, which is used to set the cgroup limit, doesn't have a field for integer CPU requests. Also, the container manager only knows that "10" CPUs are reserved, but it's not told which ones those are.
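So the reservation ends up as a relative cpu weight rather than a core mask. You can see this on the node (a hedged sketch, assuming cgroup v1 with the cpu controller mounted under /sys/fs/cgroup/cpu):

$ cat /sys/fs/cgroup/cpu/system.slice/cpu.shares     # a relative weight only; it does not pin which cores are used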

@subrnath
Author

Thanks.

but it's not told which ones those are.

Can you please let me know how to specify this? According to https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/, --system-reserved=cpu takes a total number of cores, not specific CPU numbers like 1-10. Do you see any issue with the "--system-reserved=cpu=10" param we have given? I can see another param, --reserved-cpus, but that option was added only in k8s 1.17. Also, from the documentation it looks like it is an alternative to --system-reserved=cpu.

Regarding your previous response - 6:cpuset:/

I can check why it shows "/" and not "/system.slice".

@ipuustin
Contributor

I think --reserved-cpus might help you here, because it actually takes a cpuset (such as 0-2). You could then manually assign that same cpuset to /system.slice.

@subrnath
Author

subrnath commented Feb 1, 2020

I tested with the following configuration:
--reserved-cpus=0-3 --cpu-manager-policy=static --system-reserved=cpu=2,memory=512Mi --kube-reserved=cpu=2,memory=512Mi

I am seeing the same issue. I deployed a Pod with 4 cores. /var/lib/kubelet/cpu_manager_state shows:

{"policyName":"static","defaultCpuSet":"0-3,6-15,18-23","entries":{"80135d993f44b38528069c26a050c402807f10f89aee354ef04af75671b04e83":"4-5,16-17"},"checksum":1025617988}

The above result shows that the Pod got 4 cores: 4, 5, 16, and 17.

But the same issue remains: ceph-osd is scheduled on core 5, which is assigned to the app Pod. The app Pod is not using any external volume.

ps -eLF | grep -i ceph
167 24853 24825 26654 0 59 219386 62552 5 Jan31 ? 00:00:00 ceph-osd --foreground --id 5 --fsid a02fa118-f384-4c00-acb0-342ba63190f5 --setuser ceph --setgroup ceph --crush-location=root=default host=worker-cmm62-1 --default-log-to-file false --ms-learn-addr-from-peer=false
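As a side check, the allowed CPU list of that process (not just the core it happened to be running on at sampling time) can be inspected like this; the PID below is taken from the ps output above:

$ taskset -cp 24853                          # prints the process's CPU affinity list
$ grep Cpus_allowed_list /proc/24853/status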

You could then manually assign that same cpuset to /system.slice

I didn't get that. Could you please let me know how to do it?

@ipuustin
Contributor

ipuustin commented Feb 3, 2020

I have a PR which should help with your problem here by automating the cpuset resource setting on the system-reserved cgroup: #87452

You can set a cpuset on a cgroup by writing to the cpuset.cpus file. The exact way to do it depends on your cgroup setup, but something like this might work:

# echo "1-4" > /sys/fs/cgroup/cpuset/system.slice/cpuset.cpus

Note that if you are reducing the cpuset, you need to first set it in all subdirectories (the parent cgroup can't have a smaller cpuset than a child cgroup); otherwise you will get "permission denied" errors. The docs are here: https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
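For example, a rough sketch of shrinking the /system.slice cpuset to 1-4, one level of child cgroups deep (paths assume cgroup v1 with the cpuset controller mounted at /sys/fs/cgroup/cpuset, and that the slice cgroup exists there):

for child in /sys/fs/cgroup/cpuset/system.slice/*/; do          # shrink the children first
  [ -f "${child}cpuset.cpus" ] && echo "1-4" > "${child}cpuset.cpus"
done
echo "1-4" > /sys/fs/cgroup/cpuset/system.slice/cpuset.cpus     # then the parent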

@jianzzha
Contributor

@subrnath @ipuustin The --reserved-cpus option is meant to be used together with either the isolcpus kernel arg or systemd's CPUAffinity control. Using systemd CPUAffinity as an example: in /etc/systemd/system.conf, specify CPUAffinity=0 1 2 3; reboot the machine to make sure systemd picks it up; then specify --reserved-cpus=0-3 for kubelet. See if this works for you, at least as a workaround.
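Spelled out, the two pieces of that workaround look roughly like this (the core list 0-3 is just an example and must match on both sides):

# /etc/systemd/system.conf -- pin all systemd-managed services to cores 0-3, then reboot
[Manager]
CPUAffinity=0 1 2 3

# kubelet flags -- reserve the same cores so the static CPU manager won't assign them to Pods
--cpu-manager-policy=static --reserved-cpus=0-3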

As @ipuustin mentioned, enabling the cpuset cgroup setting might be another way to solve this; we need to check whether it is realistic, and we will discuss that in the other PR.

@harper1011

@jianzzha @ipuustin I read the discussion above and am trying to understand your statement/conclusion:
If we just need CPU isolation in K8s, is it enough to set "--reserved-cpus" in kubelet and configure "CPUAffinity" in "/etc/systemd/system.conf"?
Is no other parameter needed?
Can "--system-reserved", "--system-reserved-cgroup", "--kube-reserved" and "--kube-reserved-cgroup" be skipped, as they do nothing for CPU isolation?

"--reserved-cpus" keeps Guaranteed Pods from landing on the given cpuset, and "CPUAffinity" keeps all systemd-managed processes on that cpuset.

@jianzzha
Contributor

@harper1011 yes that's right.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 14, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@2rs2ts
Contributor

2rs2ts commented Nov 17, 2020

This is still very much a problem. Any chance this can get reopened? Otherwise I will probably make a duplicate.
