
processes from /system.slice using cores assigned to guaranteed Pods when --cpu-manager-policy=static is set #85764

Closed
subrnath opened this issue Nov 30, 2019 · 14 comments
Labels
area/kubelet kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. sig/node Categorizes an issue or PR as relevant to SIG Node.

Comments

@subrnath

subrnath commented Nov 30, 2019

What happened:

Following is the configuration for kubelet to enable the --cpu-manager-policy=static feature.

--cpu-manager-policy=static --system-reserved=cpu=10,memory=10Gi --system-reserved-cgroup=/system.slice.

There is no separate cgroup for the kubelet and hence --kube-reserved is not used.

./user.slice
./system.slice
./system.slice/system-getty.slice

The PSR field of the "ps -eLF" command on the host shows that processes from /system.slice are using cores assigned to application Pods that are Guaranteed and have integer cores assigned.

Here is /var/lib/kubelet/cpu_manager_state, which shows that this Pod is assigned cores 8-11 and 32-35:

{"policyName":"static","defaultCpuSet":"0-7,12-31,36-47","entries":{"f7cc858318f3a84d9cbd340e058126379292f09578ca362680e3a350c8edad63":"8-11,32-35"},"checksum":462128154}

On the host:
ps -eLF | awk '{print $9 " " $13 " " $14 " " $2 " " $3}' | sort | grep -i glusterfs
8 /usr/sbin/glusterfs --log-level=ERROR 8870 1

Here 8 is the core, 8870 is the PID, and 1 is the PPID. The following command shows that the cgroup of PID 8870 is /system.slice:

$ cat /proc/8870/cgroup
11:hugetlb:/
10:freezer:/
9:blkio:/system.slice/run-rf63a79fe920945f497cdb8ed37874e36.scope
8:memory:/system.slice/run-rf63a79fe920945f497cdb8ed37874e36.scope
7:pids:/system.slice/run-rf63a79fe920945f497cdb8ed37874e36.scope
6:cpuset:/
5:net_cls,net_prio:/
4:perf_event:/
3:cpu,cpuacct:/system.slice/run-rf63a79fe920945f497cdb8ed37874e36.scope
2:devices:/system.slice/run-rf63a79fe920945f497cdb8ed37874e36.scope
1:name=systemd:/system.slice/run-rf63a79fe920945f497cdb8ed37874e36.scope

Similarly, other processes from /system.slice, such as the docker daemon, are also using the cores assigned to this Pod.

What you expected to happen:

No system process from /system.slice should be scheduled on the cores assigned to this Guaranteed Pod with integer CPU requests.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.15.3
  • Cloud provider or hardware configuration:
  • OS (e.g: cat /etc/os-release):

$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial

  • Kernel (e.g. uname -a):

4.10.0-27-generic

  • Install tools:
  • Network plugin and version (if this is a network-related bug):
  • Others:

/area kubelet
/sig node

@subrnath subrnath added the kind/bug Categorizes issue or PR as related to a bug. label Nov 30, 2019
@k8s-ci-robot k8s-ci-robot added needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 30, 2019
@ipuustin
Contributor

It seems to me that the cpuset cgroup of the pid 8870 you mention is just / instead of /system.slice:

6:cpuset:/

The cpuset cgroup controls the "integer" cpus on which the process executes.
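For what it's worth, you can compare the effective CPU affinity of the process with whatever cpuset /system.slice has (a rough sketch, assuming a cgroup v1 hierarchy with the cpuset controller mounted under /sys/fs/cgroup/cpuset; the slice's cpuset.cpus file only exists if such a cgroup was actually created):

$ grep Cpus_allowed_list /proc/8870/status               # CPUs the glusterfs process may run on
$ cat /sys/fs/cgroup/cpuset/system.slice/cpuset.cpus     # cpuset of the slice, if the cgroup exists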

@ipuustin
Contributor

I looked a bit further and it seems that the 10-core request is converted into cpu shares:

Jan 15 15:27:24 test kubelet[46360]: I0115 15:27:24.337944   46360 node_container_manager_linux.go:105] Enforcing System reserved on cgroup "/system.slice" with limits: map[cpu:{i:{value:10 scale:0} d:{Dec:<nil>} s:10 Format:DecimalSI}]
Jan 15 15:27:24 test kubelet[46360]: I0115 15:27:24.337977   46360 node_container_manager_linux.go:134] Enforcing limits on cgroup ["system"] with 824649638272 cpu shares, 0 bytes of memory, and 0 processes

In particular, https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/types.go#L25, which is used to set the cgroup limit, doesn't have a field for integer CPU requests. Also, the container manager only knows that "10" CPUs are reserved, but it's not told which ones those are.
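So the reservation ends up as a relative cpu weight rather than a core mask. You can see this on the node (a hedged sketch, assuming cgroup v1 with the cpu controller mounted under /sys/fs/cgroup/cpu):

$ cat /sys/fs/cgroup/cpu/system.slice/cpu.shares     # a relative weight only; it does not pin which cores are used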

@subrnath
Author

Thanks.

but it's not told which ones those are.

Can you please let me know how to specify this? According to https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/, --system-reserved=cpu takes a total number of cores, not specific CPU numbers like 1-10. Do you see any issue with the "--system-reserved=cpu=10" param we have given? I can see another param, --reserved-cpus, but that option was added only in k8s 1.17. Also, from the documentation it looks like it is an alternative to --system-reserved=cpu.

Regarding your previous response - 6:cpuset:/

I can check why it shows "/" and not "/system.slice".

@ipuustin
Contributor

I think --reserved-cpus might help you here, because it actually takes a cpuset (such as 0-2). You could then manually assign that same cpuset to /system.slice.

@subrnath
Author

subrnath commented Feb 1, 2020

I tested with the following configuration:
--reserved-cpus=0-3 --cpu-manager-policy=static --system-reserved=cpu=2,memory=512Mi --kube-reserved=cpu=2,memory=512Mi

I am seeing the same issue. I deployed a Pod with 4 cores. /var/lib/kubelet/cpu_manager_state shows:

{"policyName":"static","defaultCpuSet":"0-3,6-15,18-23","entries":{"80135d993f44b38528069c26a050c402807f10f89aee354ef04af75671b04e83":"4-5,16-17"},"checksum":1025617988}

The above result shows that the Pod got 4 cores: 4, 5, 16, and 17.

But the same issue remains: ceph-osd is scheduled on core 5, which is assigned to the app Pod. The app Pod is not using any external volume.

ps -eLF | grep -i ceph
167 24853 24825 26654 0 59 219386 62552 5 Jan31 ? 00:00:00 ceph-osd --foreground --id 5 --fsid a02fa118-f384-4c00-acb0-342ba63190f5 --setuser ceph --setgroup ceph --crush-location=root=default host=worker-cmm62-1 --default-log-to-file false --ms-learn-addr-from-peer=false
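As a side check, the allowed CPU list of that process (not just the core it happened to be running on at sampling time) can be inspected like this; the PID below is taken from the ps output above:

$ taskset -cp 24853                          # prints the process's CPU affinity list
$ grep Cpus_allowed_list /proc/24853/status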

You could then manually assign that same cpuset to /system.slice

I didn't get that. Could you please let me know how to do it?

@ipuustin
Contributor

ipuustin commented Feb 3, 2020

I have a PR which should help with your problem here by automating the cpuset resource setting on the system-reserved cgroup: #87452

You can set a cpuset on a cgroup by writing to the cpuset.cpus file. The exact way to do it depends on your cgroup setup, but something like this might work:

# echo "1-4" > /sys/fs/cgroup/cpuset/system.slice/cpuset.cpus

Note that if you are reducing the cpuset, you need to first set it in all subdirectories (the parent cgroup can't have a smaller cpuset than a child cgroup); otherwise you will get "permission denied" errors. The docs are here: https://www.kernel.org/doc/Documentation/cgroup-v1/cpusets.txt
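For example, a rough sketch of shrinking the /system.slice cpuset to 1-4, one level of child cgroups deep (paths assume cgroup v1 with the cpuset controller mounted at /sys/fs/cgroup/cpuset, and that the slice cgroup exists there):

for child in /sys/fs/cgroup/cpuset/system.slice/*/; do          # shrink the children first
  [ -f "${child}cpuset.cpus" ] && echo "1-4" > "${child}cpuset.cpus"
done
echo "1-4" > /sys/fs/cgroup/cpuset/system.slice/cpuset.cpus     # then the parent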

@jianzzha
Contributor

@subrnath @ipuustin The --reserved-cpus option is meant to be used together with either the isolcpus kernel arg or systemd's CPUAffinity control. Using systemd CPUAffinity as an example: in /etc/systemd/system.conf, specify CPUAffinity=0 1 2 3; reboot the machine to make sure systemd picks it up; then specify --reserved-cpus=0-3 for kubelet. See if this works for you, at least as a workaround.
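Spelled out, the two pieces of that workaround look roughly like this (the core list 0-3 is just an example and must match on both sides):

# /etc/systemd/system.conf -- pin all systemd-managed services to cores 0-3, then reboot
[Manager]
CPUAffinity=0 1 2 3

# kubelet flags -- reserve the same cores so the static CPU manager won't assign them to Pods
--cpu-manager-policy=static --reserved-cpus=0-3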

As @ipuustin mentioned, enabling the cpuset cgroup setting might be another way to solve this; we need to check whether it is realistic, and we will discuss that in the other PR.

@harper1011

@jianzzha @ipuustin I read the discussion above and am trying to understand your statement/conclusion:
If we just need CPU isolation in K8s, is it enough to set "--reserved-cpus" in kubelet and configure "CPUAffinity" in "/etc/systemd/system.conf"?
Is no other parameter needed?
Can "--system-reserved", "--system-reserved-cgroup", "--kube-reserved" and "--kube-reserved-cgroup" be skipped, as they do nothing for CPU isolation?

"--reserved-cpus" keeps Guaranteed Pods from landing on the given cpuset, and "CPUAffinity" keeps all systemd-managed processes on that cpuset.

@jianzzha
Contributor

@harper1011 yes that's right.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 15, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 14, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@2rs2ts
Contributor

2rs2ts commented Nov 17, 2020

This is still very much a problem. Any chance this can get reopened? Otherwise I will probably make a duplicate.
