
/sys/devices/system/cpu/online not container aware #301

Closed
ilhaan opened this issue Aug 21, 2019 · 10 comments

Comments

@ilhaan

ilhaan commented Aug 21, 2019

I have noticed that containers see all cores when I inspect /sys/devices/system/cpu/online, even though they have been restricted to a single core. /proc/cpuinfo seems to be reporting correctly.

For example, I have a container test:

root@server:~# lxc exec test -- grep -c "processor" /proc/cpuinfo 
1

This is the expected output. However:

root@server:~# lxc exec test -- cat /sys/devices/system/cpu/online
0-55

I tested this using the following snap-installed versions of LXD:

  • 3.0.4 (from the 3.0/stable channel)
  • 3.16 (from the 3.16/stable channel)

More info from my server:

root@server:~# lsb_release -a 
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 18.04.3 LTS
Release:	18.04
Codename:	bionic
root@server:~# uname -a 
Linux v45 4.15.0-58-generic #64-Ubuntu SMP Tue Aug 6 11:12:41 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
root@server:~# snap --version 
snap    2.40
snapd   2.40
series  16
ubuntu  18.04
kernel  4.15.0-58-generic

It looks like a pull request addressing this was merged in May 2019, and the comments on that merge suggest it should be enabled by default.

I discovered this while setting up a Kubernetes cluster with Rancher, using LXD containers as nodes, and posted an issue in the Rancher repo. Rancher appears to use /sys/ to enumerate CPU and memory resources on its nodes, which is why I need LXCFS to virtualize this file.

I also posted about this on the LXD forum.

Please let me know if you need me to provide additional information.

@stgraber
Member

Yeah, there is a bug in the generation of the online value...

root@shell01:~# grep lxcfs /proc/mounts
lxcfs /proc/cpuinfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/diskstats fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/loadavg fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/meminfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/stat fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/swaps fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/uptime fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /sys/devices/system/cpu/online fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /var/lib/lxcfs fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
root@shell01:~# cat /sys/devices/system/cpu/online
0-7
root@shell01:~# grep ^Cpus_allowed_list /proc/self/status
Cpus_allowed_list:	5-6
root@shell01:~# grep ^processor /proc/cpuinfo 
processor	: 0
processor	: 1
root@shell01:~# stat -f /sys/devices/system/cpu/online
  File: "/sys/devices/system/cpu/online"
    ID: 0        Namelen: 255     Type: fuseblk
Block size: 512        Fundamental block size: 512
Blocks: Total: 0          Free: 0          Available: 0
Inodes: Total: 0          Free: 0
root@shell01:~# 

@brauner can you take a look? For some reason we appear to be hitting some fallback case where we get the host value rather than something which matches our task's cpuset configuration.

@ilhaan
Author

ilhaan commented Aug 26, 2019

@stgraber Thanks for looking into this. I just noticed the following:

root@test:~# grep lxcfs /proc/mounts
lxcfs /proc/cpuinfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/diskstats fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/loadavg fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/meminfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/stat fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/swaps fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/uptime fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /var/lib/lxcfs fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0

I don't see /sys/devices/system/cpu/online listed above. I tested this with both LXD 3.0.4 and 3.16 and got the same result.

@time-river
Contributor

I have also run into this problem; the following software uses /sys/devices/system/cpu/online:

  1. # lscpu
  2. nginx, when the pid /run/nginx.pid; directive is used

Note: nginx uses sysconf(_SC_NPROCESSORS_ONLN); to get the CPU count, and strace shows that this reads /sys/devices/system/cpu/online.

So, is there a plan to virtualize /sys/devices/system/cpu/online?
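
To make the difference concrete, here is a minimal C sketch (not from this thread) comparing the value of sysconf(_SC_NPROCESSORS_ONLN), which glibc derives from /sys/devices/system/cpu/online, with the container-aware count from sched_getaffinity(). Inside a cpuset-restricted container where the online file is not virtualized, the two will typically disagree:

/* Minimal sketch (not from this issue): compare the two CPU counts
 * that diverge inside a cpuset-restricted container. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Backed by /sys/devices/system/cpu/online, so it reports the host
     * value unless LXCFS virtualizes that file for the container. */
    long onln = sysconf(_SC_NPROCESSORS_ONLN);

    /* Container-aware: reflects the task's cpuset. */
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_getaffinity");
        return 1;
    }

    printf("sysconf(_SC_NPROCESSORS_ONLN): %ld\n", onln);
    printf("sched_getaffinity() CPU count: %d\n", CPU_COUNT(&set));
    return 0;
}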

@stgraber
Member

There is support in LXCFS to virtualize /sys/devices/system/cpu/online, as can be seen in my listing above with a suitably recent LXCFS; however, there is an issue with the way the value is rendered.

@yinhongbo

@ilhaan @stgraber Please try the code on the latest master branch; I just submitted a PR that may solve your problem.

@ilhaan
Author

ilhaan commented Sep 13, 2019

@yinhongbo thanks for submitting the PR. I'm not sure how to test this in a snap-installed version of LXD. I'll try to figure it out, but I'd appreciate some pointers on how to do it. Until then, I'll try poking around the LXD 3.17 snap.

@adamszen

adamszen commented Sep 25, 2019

@ilhaan @stgraber Please try the code on the latest master branch; I just submitted a PR that may solve your problem.

Still not working. I start lxcfs (latest master branch):

# lxcfs -l /var/lib/lxcfs/
mount namespace: 5
hierarchies:
  0: fd:   6: cpuset,cpu,cpuacct,blkio,memory,devices,freezer,perf_event,hugetlb,pids,rdma

All mounts are ok:

# mount | egrep "cgroup|lxcfs"
none on /cgroup type cgroup (rw)
lxcfs on /var/lib/lxcfs type fuse.lxcfs (rw,nosuid,nodev,allow_other)

On host:

# cat /proc/cpuinfo  | grep processor
processor       : 0
processor       : 1
processor       : 2
processor       : 3
processor       : 4
processor       : 5
processor       : 6
processor       : 7
processor       : 8
processor       : 9
processor       : 10
processor       : 11
# cat /sys/devices/system/cpu/online
0-11
# cat /cgroup/cpuset.cpus
0-11

I do:

# echo "1-8" >/cgroup/lxc/cpuset.cpus
# cat /cgroup/lxc/cpuset.cpus
1-8

Then I start the container (name: test):

# lxc-ls 
test

And on the host:

# echo "2" >/cgroup/lxc/test/cpuset.cpus
# cat /cgroup/lxc/test/cpuset.cpus
2

SSH into test and:

root@test:~# grep lxcfs /proc/mounts 
lxcfs /proc/cpuinfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/diskstats fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/loadavg fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/meminfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/stat fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/swaps fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/uptime fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /sys/devices/system/cpu/online fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
root@test:~# cat /proc/cpuinfo | grep processor
processor       : 0

OK, but:

root@test:~# cat /sys/devices/system/cpu/online
0-11

Not ok :(

I think it's something in the cgfs_get_value function in bindings.c - is this function reading the wrong file? (See the sketch after this comment for the kind of rendering I would expect.)
Regards!
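
For illustration only, here is a rough C sketch of how the virtualized value could be derived from the container's cpuset.cpus instead of the host file: count the CPUs in the cpuset list and render them as 0..N-1, matching how /proc/cpuinfo is renumbered above. This is not the actual LXCFS code; the helper names are hypothetical.

/* Rough sketch, not the actual LXCFS implementation.
 * count_cpus_in_list() and render_online() are hypothetical helpers. */
#include <stdio.h>
#include <string.h>

/* Count the CPUs in a cpuset list such as "2" or "1-8,11". */
static int count_cpus_in_list(const char *list)
{
    int count = 0;
    char buf[256];
    strncpy(buf, list, sizeof(buf) - 1);
    buf[sizeof(buf) - 1] = '\0';

    for (char *tok = strtok(buf, ","); tok; tok = strtok(NULL, ",")) {
        int lo, hi;
        if (sscanf(tok, "%d-%d", &lo, &hi) == 2)
            count += hi - lo + 1;
        else if (sscanf(tok, "%d", &lo) == 1)
            count += 1;
    }
    return count;
}

/* Render the container view: N usable CPUs appear as "0-(N-1)",
 * or just "0" when a single CPU is assigned. */
static void render_online(const char *cpuset_cpus, char *out, size_t len)
{
    int n = count_cpus_in_list(cpuset_cpus);
    if (n <= 1)
        snprintf(out, len, "0\n");
    else
        snprintf(out, len, "0-%d\n", n - 1);
}

int main(void)
{
    char out[64];
    render_online("2", out, sizeof(out));    /* container pinned to CPU 2 */
    printf("%s", out);                       /* -> "0" */
    render_online("1-8", out, sizeof(out));
    printf("%s", out);                       /* -> "0-7" */
    return 0;
}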

@yinhongbo


I think I know the reason for this problem. I am submitting a PR and will notify you once it is merged.

@adamszen

It works very well!

Sorry for the gif, but it's easier to show it this way ;)

screen-capture.gif

Thanks for this patch!

@stgraber
Member

stgraber commented Mar 3, 2020

Sounds like this got resolved with #307.

stgraber closed this as completed Mar 3, 2020
azat added a commit to azat-archive/jemalloc that referenced this issue Dec 17, 2021
A deterministic number of CPUs is important for the percpu arena to work correctly, since it uses the CPU index from sched_getcpu(); if that index is greater than the number of CPUs, bad things will happen, or an assertion will fail in a debug build:

    <jemalloc>: ../contrib/jemalloc/src/jemalloc.c:321: Failed assertion: "ind <= narenas_total_get()"
    Aborted (core dumped)

Number of CPUs can be obtained from the following places:
- sched_getaffinity()
- sysconf(_SC_NPROCESSORS_ONLN)
- sysconf(_SC_NPROCESSORS_CONF)

For sched_getaffinity() you may simply use taskset(1) to run the program on a different CPU; if that CPU is not the first one, percpu will work incorrectly, e.g.:

    $ taskset --cpu-list $(( $(getconf _NPROCESSORS_ONLN)-1 )) <your_program>

_SC_NPROCESSORS_ONLN uses /sys/devices/system/cpu/online; LXD/LXC virtualize the /sys/devices/system/cpu/online file [1], so when you run a container with a limited limits.cpus it will bind a randomly selected CPU to it.

  [1]: lxc/lxcfs#301

_SC_NPROCESSORS_CONF uses /sys/devices/system/cpu/cpu*, and AFAIK nobody plays with the dentries there.

So if all three of these are equal, percpu arenas should work correctly.

And a small note regarding _SC_NPROCESSORS_ONLN/_SC_NPROCESSORS_CONF: musl uses sched_getaffinity() for both, so this will also increase the entropy.

Also note that you can check whether the percpu arena is really applied using abort_conf:true.

Refs: jemalloc#1939
Refs: ClickHouse/ClickHouse#32806

v2: move malloc_cpu_count_is_deterministic() into
malloc_init_hard_recursible() since _SC_NPROCESSORS_CONF does
allocations for readdir()
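
For reference, here is a rough sketch of the kind of check the commit message describes: treat the CPU count as deterministic only when sched_getaffinity(), _SC_NPROCESSORS_ONLN and _SC_NPROCESSORS_CONF all agree. This is an illustration of the idea, not jemalloc's actual implementation.

/* Rough sketch, not the actual jemalloc code: the CPU count is
 * "deterministic" only when all three sources agree. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static bool cpu_count_is_deterministic(void)
{
    long conf = sysconf(_SC_NPROCESSORS_CONF);  /* from /sys/devices/system/cpu/cpu* */
    long onln = sysconf(_SC_NPROCESSORS_ONLN);  /* from /sys/devices/system/cpu/online */

    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return false;

    /* Percpu arenas index by sched_getcpu(), so all three views of the
     * CPU count must match or an index can exceed narenas. */
    return conf == onln && onln == (long)CPU_COUNT(&set);
}

int main(void)
{
    printf("deterministic: %s\n", cpu_count_is_deterministic() ? "yes" : "no");
    return 0;
}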
azat added a commit to ClickHouse/jemalloc that referenced this issue Dec 17, 2021
azat added a commit to azat-archive/jemalloc that referenced this issue Dec 18, 2021
azat added a commit to azat-archive/jemalloc that referenced this issue Dec 21, 2021
Lapenkov pushed a commit to jemalloc/jemalloc that referenced this issue Dec 21, 2021
azat added a commit to ClickHouse/jemalloc that referenced this issue Dec 22, 2021
azat added a commit to ClickHouse/jemalloc that referenced this issue Dec 22, 2021
minaripenguin pushed a commit to minaripenguin/android_external_jemalloc-new that referenced this issue Jan 28, 2023