System calls with timeouts receive EINTR with runc version 1.0.0-rc91 and onwards #3071
Ah, yes. This is related to a change I made to devices cgroup handling a while ago (#2391): when running under systemd, any resource update will cause the container to be temporarily frozen during the update. The reason for this is that the devices cgroup is modified by systemd in a way that causes spurious device errors even if no device rules are changed (so programs that are writing to [truncated]). There is a discussion by some Kubernetes folks about disabling this freezing mechanism in #3065, but given the above reasoning I'm not entirely convinced that is a reasonable thing to do.
Has there been any communication with the systemd people about this (I can't seem to find any)? (Tho they might tell us to just use cgroup v2 instead 😅)
To make it totally clear, the change needed for k8s is not directly related to this, since in that situation runc is used as a cgroup manager (see #3065 (comment)). #3065 is only a bugfix to avoid situations where containers freeze permanently.
@eganvas can you clarify which syscall is returning EINTR? As pointed out by Aleksa above, if we just skip the freeze, the underlying containers will experience a brief EPERM instead of EINTR, which is worse: EINTR is kind of expected and is not fatal (the app is expected to retry), while EPERM is considered fatal (the app is not supposed to retry). In any case, there are multiple ways to fix it.
The syscall that returned EINTR was recvmsg() with the socket option SO_RCVTIMEO set, confirmed via an strace snippet. Yes, the CPU manager should not update the cpuset if there are no changes; it only needs to update the cpuset when new pods are scheduled. So eventually the processes will get EINTR and they need to handle it. My worry is that there could be many applications that were working fine without handling EINTR on 1.0.0-rc10 and then start experiencing issues on rc91 and onwards. So this is not about any single application, but about a general behavioural change.
The cpumanager reconcile-state function updating the cpuset irrespective of whether there is a change is now fixed in the latest K8s as part of kubernetes/kubernetes#100906. It now calls update only when there is a legitimate reason to, which should make the whole solution scalable as well as avoid unnecessary runc updates. Container applications will still get EINTR, but only when there is a real need for the cpu manager to update the cpusets.
Fixed in runc 1.0.1; kubernetes PRs: kubernetes/kubernetes#103743, kubernetes/kubernetes#103746.
Thanks a lot. Really great job getting the fixes into 1.0.1.
We have a containerised application that uses UDP sockets with SO_RCVTIMEO. When it is run as a pod we see that every 10 seconds or so the syscall returns EINTR. The system is configured with cpumanager policy static and reserved-cpus. When the cpu manager reconcile loop runs (by default every 10 seconds) it tries to set the cpuset, and at this point we think the container is put in the FROZEN state and then returned to running. This is when the container application sees EINTR.
systemd --version
systemd 234
+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 -IDN default-hierarchy=hybrid
sudo cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-3,6,8-51,54,56-95","entries":{"38e714ed-b6f5-402d-8303-4877245856f8":{"trex":"4,52"},"b516c5cd-2082-45da-9932-dbc7d129129e":{"trex-sriov":"5,7,53,55"}},"checksum":3895732860}
periodic reconcile loop by the cpumanager that updates the cpuset for the containers
2021-07-07T21:10:11.167630+00:00 pool1-dc309-1-wk1-n26 kubelet[24156]: I0707 23:10:11.167565 24156 cpu_manager.go:407] "ReconcileState: ignoring terminated container" pod="kube-system/eric-tm-external-connectivity-frontend-speaker-2krg5" containerID="9d83fc3b4c8e7e8728368f5bf09919a59641393ccf36c72856d5d76f4865e012"
The system is not using cgroup v2
sudo ls /sys/fs/cgroup/cgroup.controllers
ls: cannot access '/sys/fs/cgroup/cgroup.controllers': No such file or directory
The test application that can be used to recreate the issue is iperf2 (https://sourceforge.net/projects/iperf2/) run inside a container.
There are two pods with iperf
server : iperf -s -u -e -i 30 -p 5201 and
client : iperf -c -u -p 5201 -t 300 -i 300 -z -e -b 100pps
If we do an strace we can see the EINTR, and the server will keep creating a new connection each time it gets EINTR, since the server does not handle EINTR.
This issue is seen with runc v1.0.0-rc91 and above. If we run v1.0.0-rc10 the issue is never observed. We are aware that there are a lot of changes between these two versions.
We did kernel tracing, and it looks like something is putting the container into the frozen state; at this point the system call gets EINTR. We also tested whether the cpu manager is involved: if we change the reconcile period, the EINTR timing aligns with the new period, and if we do not set cpu manager policy static the issue is never seen with either old or new runc versions. If we swap runc versions while keeping all other settings the same, the behaviour follows the version. So we are pretty sure this is something runc introduces. We have also tried v1.0.0 and the issue is seen there as well.
The strace of the iperf server was taken with strace -f -p -o, where you can see the EINTR and interrupted-system-call logs.
Attachment: server-trace-non-working.txt