LXC 5 load averages are wrong in Debian 12 container #4372
Hello, we're UT Austin students doing open-source contributions for a final project in a class. We'd like to learn more about this issue so we can approach it. Do you know what could be causing it?
BUMP on this, as I and many others are affected by this issue.
This isn't an LXC or even LXCFS bug. What's going on is that your C library, in this case, is calling the sysinfo system call rather than reading the proc file. LXC integrates with LXCFS to mask some proc files and provide container-specific values instead of the original host values. This mechanism can only apply to files, so if your software gets the values without reading from proc, you'll get the host-wide values.

There are ways to handle sysinfo, and LXC actually provides the basics to do so. That's usually achieved through userspace system call interception and emulation. LXC allows you to set the seccomp policy needed for the interception, but you then need to use your own code to actually emulate the system call.

Or you can use a higher-level container manager like Incus, which does use that LXC functionality and implements that emulation logic. On Incus, setting security.syscalls.intercept.sysinfo=true will do what you want in this case.
I'm not sure I completely understand everything you say about intercepting system calls, as I'm not familiar with LXC internals, but here are some additional points.
I would like to get to the bottom of this, and currently I don't clearly see that it has indeed nothing to do with LXC. It's not my custom library, and not even my application. It is one shipped with Debian that used to work before LXC 5. As I've mentioned in my OP, I noticed this in Zabbix, looked at its source, and made the POC based on the system call they use. Since rewriting each and every affected application, or specializing them for running in LXC, including those people use for load monitoring, is obviously out of the question, what do you suggest going forward with the issue?
By "your C library", I meant the C library in your container. The bottom of it is that glibc used to parse /proc/loadavg and now uses the sysinfo system call. LXCFS, which often comes installed alongside LXC, provides filesystem emulation that includes /proc/loadavg, but now that glibc isn't reading this file, it's not getting its value anymore. Your options basically are:
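For the interception route, LXC's side of it is a seccomp policy using the notify action, plus a proxy process that performs the emulation. A sketch of what that configuration might look like (file paths and the socket address are hypothetical, the exact policy syntax can vary by LXC version, and as noted above you still need your own code listening on the proxy socket to emulate sysinfo):

```
# /etc/lxc/seccomp-sysinfo.conf (hypothetical path), LXC seccomp policy v2:
2
blacklist
[x86_64]
sysinfo notify

# In the container config, point at the policy and at the notify proxy
# that will receive and emulate the intercepted calls:
lxc.seccomp.profile = /etc/lxc/seccomp-sysinfo.conf
lxc.seccomp.notify.proxy = unix:/run/sysinfo-proxy.sock
```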
Background on potential future plans

LXCFS has been a hack from the beginning. It's been a needed hack, as the kernel community has shown no interest in having those proc files reflect cgroup limits or reflect the container they're being read from. The fact that glibc and others are moving away from reading those files, using system calls like sysinfo instead, means that hack no longer covers them.

We can keep patching things with approaches like system call interception, but that's very, very tricky to get right and adds a bunch of latency and CPU usage to the mix, which isn't exactly ideal.

Our current thinking (@tych0, @brauner and myself at least) is that we need to spend time reviving another project of ours, libresource, which will provide reliable functions to fetch all kinds of resource consumption and limit values that affect the running process. It will do the actual dance of fetching the system-wide value, either by reading files or by hitting some system calls, but it will also be aware of cgroups and other process limits and apply those to the returned values.

It's going to be a very long game though: even once that library exists and is easy to consume from just about every language, we'll need to see adoption from the various language runtimes, C libraries and individual pieces of software. It will likely take at least a decade before it's widespread enough that hacks like LXCFS and the system call interception logic can fully go away, but we've got to start at some point, since we're more than 10 years into the current hacks and haven't seen much improvement in either the kernel community or userspace.

Rant about loadavg

Separately, there's also the whole issue of loadavg being a terrible thing to monitor in the first place. The only time loadavg was correct in the way most monitoring systems use it was back when you had systems without cgroups, without network filesystems and with only a single CPU.
On such systems, your load average would correctly reflect how loaded the system was: 0 being no load, 1 being fully loaded, and any value above that being overloaded. The loadavg value indicates the number of processes waiting to be scheduled at any one time.

Once you start considering cgroups and network filesystems, the value becomes downright useless.

Take for example a system where you have placed a process inside a cgroup with a 1% CPU time limit on it (CFS quota, 1ms/100ms). If that cgroup has one process asking for CPU all the time, your TOTAL system load is now increased by 1, despite that process, on a 48-thread system, only being allowed 1/4800 of the total system CPU time. Now say that process starts 2000 child processes: you'll see your system loadavg go up to 2000, even if nothing else is running on the system. Your system will still be perfectly responsive; there will be nothing wrong with it at all. It's just that one cgroup has a bunch of tasks waiting for scheduling and no CPU time to schedule them on.

Network filesystems can cause a similar mess: if they're slow to respond, the calling process gets stuck in I/O wait, which similarly counts towards the load average. Because FUSE is effectively treated as a network filesystem in this kind of situation, a random unprivileged user on your system can use FUSE to deadlock potentially thousands of their processes, again causing your system loadavg to shoot up to crazy values, despite no impact on actual system performance.

That's why modern tools don't look at those values anymore; instead they monitor actual scheduling metrics and CPU usage, look at memory pressure rather than pure memory consumption, and so on.
(And note that I've said cgroup here, not container: you can absolutely create a cgroup inside of your container and cause that kind of weird loadavg value even inside a container. The same is true of the FUSE case, as that can also be done inside a container.)
Thanks for the extensive answer, @stgraber. I think that for me the most important part is this:
A seccomp workaround could be feasible and I'm going to look into it, but in the case of Zabbix at least, I can just use other methods for sampling, based on
We have some code for this that we really should open source; cc @sdab, who wrote it. Maybe we need to work on cleaning up TSA?
Required information

* lxc-start --version: 5.0.2
* lxc-checkconfig:
* uname -a: Linux pve1 6.5.11-6-pve #1 SMP PREEMPT_DYNAMIC PMX 6.5.11-6 (2023-11-29T08:32Z) x86_64 GNU/Linux
* cat /proc/self/cgroup: 0::/user.slice/user-0.slice/session-91.scope
* cat /proc/1/mounts:

Issue description
I originally discovered the issue with Zabbix, monitoring Debian 12 containers. Delving into the issue a bit more, I discovered a deeper problem with some basic system calls. Calling the getloadavg() libc function produces values that are directly related to the load of the host, not the container. The values are exactly the host loadavg values divided by the number of cores allotted to the container.
I'm not seeing the same issue on earlier Debian versions (tried on 10, 11). I'm not entirely sure if it's related to LXC or Debian, the kernel or something else.
Steps to reproduce
I'm posting a simple POC below. The commands have been executed inside the CT.
Information to attach
I'll provide more info if required.