LXC 5 load averages are wrong in Debian 12 container #4372

Closed
kobuki opened this issue Dec 3, 2023 · 7 comments

kobuki commented Dec 3, 2023

Required information

  • Distribution: Proxmox VE (based on Debian 12.2)
  • Distribution version: 8.1.3
  • The output of
    • lxc-start --version: 5.0.2
    • lxc-checkconfig:
LXC version 5.0.2
Kernel configuration not found at /proc/config.gz; searching...
Kernel configuration found at /boot/config-6.5.11-6-pve

--- Namespaces ---
Namespaces: enabled
Utsname namespace: enabled
Ipc namespace: enabled
Pid namespace: enabled
User namespace: enabled
Network namespace: enabled

--- Control groups ---
Cgroups: enabled
Cgroup namespace: enabled
Cgroup v1 mount points:
Cgroup v2 mount points:
 - /sys/fs/cgroup
Cgroup device: enabled
Cgroup sched: enabled
Cgroup cpu account: enabled
Cgroup memory controller: enabled
Cgroup cpuset: enabled

--- Misc ---
Veth pair device: enabled, loaded
Macvlan: enabled, not loaded
Vlan: enabled, loaded
Bridges: enabled, not loaded
Advanced netfilter: enabled, loaded
CONFIG_IP_NF_TARGET_MASQUERADE: enabled, not loaded
CONFIG_IP6_NF_TARGET_MASQUERADE: enabled, not loaded
CONFIG_NETFILTER_XT_TARGET_CHECKSUM: enabled, not loaded
CONFIG_NETFILTER_XT_MATCH_COMMENT: enabled, not loaded
FUSE (for use with lxcfs): enabled, not loaded

--- Checkpoint/Restore ---
checkpoint restore: enabled
CONFIG_FHANDLE: enabled
CONFIG_EVENTFD: enabled
CONFIG_EPOLL: enabled
CONFIG_UNIX_DIAG: enabled
CONFIG_INET_DIAG: enabled
CONFIG_PACKET_DIAG: enabled
CONFIG_NETLINK_DIAG: enabled
File capabilities: enabled

Note : Before booting a new kernel, you can check its configuration
usage : CONFIG=/path/to/config /usr/bin/lxc-checkconfig
  • uname -a: Linux pve1 6.5.11-6-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-6 (2023-11-29T08:32Z) x86_64 GNU/Linux
  • cat /proc/self/cgroup: 0::/user.slice/user-0.slice/session-91.scope
  • cat /proc/1/mounts:
sysfs /sys sysfs rw,nosuid,nodev,noexec,relatime 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
udev /dev devtmpfs rw,nosuid,relatime,size=3985868k,nr_inodes=996467,mode=755,inode64 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,noexec,relatime,size=803040k,mode=755,inode64 0 0
/dev/mapper/pve-root / ext4 rw,relatime,errors=remount-ro,stripe=256 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,inode64 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,inode64 0 0
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
bpf /sys/fs/bpf bpf rw,nosuid,nodev,noexec,relatime,mode=700 0 0
systemd-1 /proc/sys/fs/binfmt_misc autofs rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=17850 0 0
tracefs /sys/kernel/tracing tracefs rw,nosuid,nodev,noexec,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,nosuid,nodev,noexec,relatime 0 0
configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0
hugetlbfs /dev/hugepages hugetlbfs rw,relatime,pagesize=2M 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0
ramfs /run/credentials/systemd-sysusers.service ramfs ro,nosuid,nodev,noexec,relatime,mode=700 0 0
ramfs /run/credentials/systemd-tmpfiles-setup-dev.service ramfs ro,nosuid,nodev,noexec,relatime,mode=700 0 0
ramfs /run/credentials/systemd-sysctl.service ramfs ro,nosuid,nodev,noexec,relatime,mode=700 0 0
ramfs /run/credentials/systemd-tmpfiles-setup.service ramfs ro,nosuid,nodev,noexec,relatime,mode=700 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0
sunrpc /run/rpc_pipefs rpc_pipefs rw,relatime 0 0
/dev/fuse /etc/pve fuse rw,nosuid,nodev,relatime,user_id=0,group_id=0,default_permissions,allow_other 0 0
tmpfs /run/user/0 tmpfs rw,nosuid,nodev,relatime,size=803040k,nr_inodes=200760,mode=700,inode64 0 0
lxcfs /var/lib/lxcfs fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0

Issue description

I originally discovered the issue with Zabbix while monitoring Debian 12 containers. Digging into it a bit more, I found a deeper problem with some basic system calls: calling the getloadavg() libc function produces values that are directly related to the load of the host, not the container. The values are exactly the host loadavg values divided by the number of cores allotted to the container.

I'm not seeing the same issue on earlier Debian versions (tried 10 and 11). I'm not entirely sure whether it's related to LXC, Debian, the kernel, or something else.

Steps to reproduce

I'm posting a simple POC below; the commands were executed inside the container (CT).

$ cat loadavgtest.c
#include <stdio.h>
#include <stdlib.h>

int main() {
    double loadavg[3];

    if (getloadavg(loadavg, 3) == -1) {
        perror("Error getting load average");
        return 1;
    }

    printf("%.2f %.2f %.2f\n", loadavg[0], loadavg[1], loadavg[2]);
}
$ gcc loadavgtest.c -o loadavgtest
$ ./loadavgtest; cat /proc/loadavg
0.88 0.76 0.81
0.03 0.04 0.00 0/88 3348634

Information to attach

I'll provide more info if required.

anooprac commented Apr 2, 2024

Hello, we're UT Austin students doing open-source contributions for a class final project. We'd like to learn more about this issue so we can approach it. Do you have any idea what could be causing it?

@daNutzzzzz

Bumping this, as I and many others are affected by this issue.


stgraber commented Apr 6, 2024

This isn't an LXC or even an LXCFS bug.

What's going on is that your C library in this case is calling the sysinfo system call rather than reading the proc file.

LXC integrates with LXCFS to mask some proc files and provide the container specific values instead of the original host values.

This mechanism can only apply to files, so if your software gets the values without reading from proc, you'll get the host-wide values.
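
To see the difference directly, here's a minimal sketch (not part of the original report) that prints the load averages returned by the sysinfo system call next to the ones read from /proc/loadavg; inside a container served by LXCFS, only the second set is container-specific:

/* Minimal sketch: compare the sysinfo() path with the /proc/loadavg path. */
#include <stdio.h>
#include <sys/sysinfo.h>

int main(void) {
    struct sysinfo si;
    char buf[128];
    FILE *f;

    /* Path 1: the sysinfo system call; loads[] is fixed-point,
     * scaled by 2^SI_LOAD_SHIFT (from <linux/sysinfo.h>). */
    if (sysinfo(&si) == 0)
        printf("sysinfo():     %.2f %.2f %.2f\n",
               si.loads[0] / (double)(1 << SI_LOAD_SHIFT),
               si.loads[1] / (double)(1 << SI_LOAD_SHIFT),
               si.loads[2] / (double)(1 << SI_LOAD_SHIFT));

    /* Path 2: the proc file, which LXCFS can mask with container values. */
    if ((f = fopen("/proc/loadavg", "r")) != NULL) {
        if (fgets(buf, sizeof(buf), f))
            printf("/proc/loadavg: %s", buf);
        fclose(f);
    }
    return 0;
}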

There are ways to handle sysinfo and LXC actually provides the basics to do so. That's usually achieved through userspace system call interception and emulation. LXC allows you to set the seccomp policy needed for the interception but you then need to use your own code to actually emulate the system call.

Or you can use a higher-level container manager like Incus, which does use that LXC functionality and implements that emulation logic.

On Incus, setting security.syscalls.intercept.sysinfo=true will do what you want in this case.
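
For example (c1 being a placeholder instance name; interception settings are applied at container startup, so a restart is needed):

incus config set c1 security.syscalls.intercept.sysinfo=true
incus restart c1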


kobuki commented Apr 6, 2024

I'm not sure I completely understand everything you say about capturing system calls, as I'm not familiar with LXC internals, but here are some additional points.

  • It is not "my C library"; the POC I made for this issue just uses the default C library shipped with Debian 12.
  • It works properly in previous Debian releases, e.g. Debian 10 or 11, and everything (including Debian 12) works in LXC 4.
  • Debian 12 load monitoring is broken because of this bug, at least in Zabbix and possibly in others.
  • Zabbix has always used the method presented in my POC above, and it worked consistently until LXC 5.

I would like to get to the bottom of this, and currently I don't clearly see that it has nothing to do with LXC. It's not a custom library, and not even my application; it's the one shipped with Debian, and it used to work before LXC 5. As I mentioned in my OP, I noticed this in Zabbix, looked at its source, and made the POC based on the call it uses.

As rewriting each and every affected application, or specializing them for running in LXC (including the ones people use for load monitoring), is obviously out of the question, what do you suggest for going forward with this issue?


stgraber commented Apr 6, 2024

By "your C library" I meant the C library in your container.

The bottom of it is that glibc used to parse /proc/loadavg; now it uses the sysinfo system call. LXCFS, which often comes installed alongside LXC, provides filesystem emulation which includes /proc/loadavg, but now that glibc isn't reading this file, it's not getting that value anymore.

Your options basically are:

  • Try to block the sysinfo system call using an LXC seccomp policy (a sketch follows this list). If glibc and other applications calling sysinfo have a fallback code path, that should work for the time being.
  • Implement your own seccomp notifier target which you can then instruct liblxc to use for things like that. This needs to be a privileged daemon (running as real root), and handling all the security aspects of inspecting and modifying another process's memory can be rather tricky.
  • Use a container solution like Incus, which does have a central daemon, uses that liblxc feature, and did go through the effort of implementing system call interception for things like sysinfo.
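
For the first option, a sketch of a minimal version-2 LXC seccomp policy that denies only sysinfo, returning ENOSYS (38) in the hope that callers fall back to reading /proc (whether glibc and your applications actually have such a fallback is an assumption you would need to test on your setup):

2
blacklist
[all]
sysinfo errno 38

You would then point the container at that file with lxc.seccomp.profile in its configuration (how that gets wired up on Proxmox may differ from plain LXC).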

Background on potential future plans

LXCFS has been a hack from the beginning; it's been a needed hack, as the kernel community has shown no interest in having those proc files reflect cgroup limits or the container they're being read from. The fact that glibc and others are moving away from reading those files, either by using system calls like sysinfo or by parsing sys files instead, is making LXCFS less and less effective over time.

We can keep patching things with approaches like system call interception, but that's very, very tricky to get right and adds a bunch of latency and CPU usage to the mix, which isn't exactly ideal.

Our current thinking (@tych0, @brauner and myself at least) is that we need to spend time reviving another project of ours, libresource, which will provide reliable functions to fetch all kinds of resource consumption and limit values that affect the running process.

This will then do the actual dance of fetching the system-wide value, either by reading files or by hitting some system calls, but it will also be aware of cgroups and other process limits and apply those to the returned values.

It's going to be a very long game though: even once that library exists and is easy to consume from just about every language, we'll need to see adoption from the various language runtimes, things like C libraries, and individual pieces of software.

It will likely take at least a decade before it's widespread enough that hacks like LXCFS and the system call interception logic can fully go away, but we've got to start at some point, since we're more than 10 years into the current hacks and haven't seen much improvement in either the kernel community or userspace.

Rant about loadavg

Separately, there's also the whole issue of loadavg being a terrible thing to monitor in the first place. The only time loadavg was correct in the way most monitoring systems use it was back when you had systems without cgroups, without network filesystems, and with only a single CPU. On such systems, your load average would correctly reflect how loaded your system was, with 0 being no load, 1 being fully loaded, and any value above that being overloaded.

The loadavg value indicates the number of processes that want to be scheduled at any one time. On a system where processes can't be artificially prevented from being scheduled, that is, a system without cgroups or network filesystems, your ideal load is equal to the number of CPU threads, as that would be a system running at peak usage (a loadavg of 48 on a 24-core, 48-thread system).

Once you start considering cgroups and network filesystems, the value becomes downright useless. Take for example a system where you have placed a process inside of a cgroup with a 1% CPU thread consumption limit on it (a CFS quota of 1ms per 100ms period). If that cgroup has one process asking for CPU all the time, your TOTAL system load is now increased by 1, despite that process, on a 48-thread system, only being allowed 1/4800 of the total system CPU time. Now say that process starts 2000 child processes: you'll see your system loadavg going up to 2000 even if nothing else is running on the system. Your system will still be perfectly responsive, there will be nothing wrong with it at all; it's just that one cgroup has a bunch of tasks waiting for scheduling and no CPU time to schedule them on.
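
As a rough illustration of that scenario, here's a hypothetical cgroup v2 reproduction (the cgroup name and the count of 20 busy loops are arbitrary, and the cpu controller must be available at that level):

# Illustrative only: cap a cgroup at ~1% of one CPU, then fill it with busy loops.
echo "+cpu" > /sys/fs/cgroup/cgroup.subtree_control    # usually already enabled
mkdir /sys/fs/cgroup/throttled
echo "1000 100000" > /sys/fs/cgroup/throttled/cpu.max  # 1ms of CPU time per 100ms period

for i in $(seq 20); do
    (while :; do :; done) &
    echo $! > /sys/fs/cgroup/throttled/cgroup.procs
done

# Total CPU usage stays around 1%, yet the system-wide loadavg slowly climbs
# towards ~20 because those tasks are runnable and waiting for CPU time.
watch cat /proc/loadavg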

Network filesystems can cause a similar mess: if they're slow to respond, the calling process gets stuck in I/O wait, which similarly counts towards the load average. Because FUSE is effectively treated as a network filesystem in this kind of situation, a random unprivileged user on your system can use FUSE and deadlock potentially thousands of their processes on it, again causing your system loadavg to shoot up to crazy values, despite no impact on actual system performance.

That's why modern tools don't look at those values anymore but instead monitor actual scheduling metrics and CPU usage, look at memory pressure rather than pure memory consumption, ...

(And note that I've mentioned cgroups here, not containers; that's because you can absolutely create a cgroup inside of your container and cause that kind of weird loadavg value even inside a container. The same is true of the FUSE case, as that can also be done inside a container.)


kobuki commented Apr 6, 2024

Thanks for the extensive answer, @stgraber. I think that for me the most important part is this:

The bottom of it is that glibc used to parse /proc/loadavg, now it uses the sysinfo system call.

A seccomp workaround could be feasible and I'm going to look into it, but in the case of Zabbix at least, I can just use other methods for sampling, based on /proc/loadavg.


tych0 commented Apr 6, 2024

A seccomp workaround could be feasible and I'm going to look into it, but in the case of Zabbix at least, I can just use other methods for sampling, based on /proc/loadavg.

We have some code for this that we really should open source; cc @sdab, who wrote it. Maybe we need to work on cleaning up TSA?
