System calls with timeouts receive EINTR with runc version 1.0.0-rc91 and onwards #3071
Ah, yes. This is related to a change I made to devices cgroup handling a while ago (#2391): when running under systemd, any resource update will cause the container to be temporarily frozen during the update. The reason for this is that the devices cgroup is modified by systemd in a way that causes spurious device errors even if no device rules are changed (so programs that are writing to [truncated]). There is a discussion by some Kubernetes folks about disabling this freezing mechanism in #3065, but given the above reasoning I'm not entirely convinced that is a reasonable thing to do.
Has there been any communication with the systemd people about this (I can't seem to find any)? (Tho they might tell us to just use cgroup v2 instead 😅)
To make it totally clear, the change needed for k8s is not directly related to this, since in that situation runc is used as a cgroup manager (see #3065 (comment)). #3065 is only a bugfix to avoid situations where containers freeze permanently.
@eganvas can you clarify which syscall is returning EINTR? As pointed out by Aleksa above, if we just skip the freeze, the underlying containers will experience a brief EPERM instead of EINTR, which is worse: EINTR is kind of expected and is not fatal (the app is expected to retry), while EPERM is considered fatal (the app is not supposed to retry). In any case, there are multiple ways to fix it.
The syscall that returned EINTR was recvmsg() with the socket option SO_RCVTIMEO set, confirmed via an strace snippet. Yes, the CPU manager should not update the cpuset if there are no changes; it only needs to update the cpuset when new pods are scheduled. So eventually the processes will get EINTR and they need to handle it. My worry is that there could be many applications that were working fine without handling EINTR on 1.0.0-rc10 and then start experiencing issues on rc91 and onwards. So this is not about any single application, but about a general behavioural change.
The cpumanager reconcile-state function updating the cpuset irrespective of whether there is a change is now fixed in the latest K8s as part of kubernetes/kubernetes#100906. It now calls update only when there is a legitimate reason to, which should make the whole solution scalable as well as avoid unnecessary runc updates. Container applications will still get EINTR, but only when there is a real need for the cpu manager to update the cpusets.
Fixed in runc 1.0.1; kubernetes PRs: kubernetes/kubernetes#103743, kubernetes/kubernetes#103746.
Thanks a lot. Really great job getting the fixes into 1.0.1.
We have a containerised application that uses UDP sockets with SO_RCVTIMEO. When it is run as a pod we see that every 10 seconds or so the syscall returns EINTR. The system is configured with cpumanager policy static and reserved-cpus. When the cpu manager reconcile loop runs (by default every 10 seconds) it tries to set the cpuset, and at this point we think the container is put in the FROZEN state and then returned to running. This is when the container application sees EINTR.
systemd --version
systemd 234
+PAM -AUDIT +SELINUX -IMA +APPARMOR -SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 -IDN default-hierarchy=hybrid
sudo cat /var/lib/kubelet/cpu_manager_state
{"policyName":"static","defaultCpuSet":"0-3,6,8-51,54,56-95","entries":{"38e714ed-b6f5-402d-8303-4877245856f8":{"trex":"4,52"},"b516c5cd-2082-45da-9932-dbc7d129129e":{"trex-sriov":"5,7,53,55"}},"checksum":3895732860}
periodic reconcile loop by the cpumanager that updates the cpuset for the containers
2021-07-07T21:10:11.167630+00:00 pool1-dc309-1-wk1-n26 kubelet[24156]: I0707 23:10:11.167565 24156 cpu_manager.go:407] "ReconcileState: ignoring terminated container" pod="kube-system/eric-tm-external-connectivity-frontend-speaker-2krg5" containerID="9d83fc3b4c8e7e8728368f5bf09919a59641393ccf36c72856d5d76f4865e012"
The system is not using cgroup v2
sudo ls /sys/fs/cgroup/cgroup.controllers
ls: cannot access '/sys/fs/cgroup/cgroup.controllers': No such file or directory
The test application that can be used to recreate the issue is iperf2 (https://sourceforge.net/projects/iperf2/) run inside a container.
There are two pods with iperf
server : iperf -s -u -e -i 30 -p 5201 and
client : iperf -c -u -p 5201 -t 300 -i 300 -z -e -b 100pps
If we do an strace we can see the EINTR, and the server will keep creating a new connection each time it gets EINTR, since the server does not handle EINTR.
This issue is seen with runc v1.0.0-rc91 and above. If we run v1.0.0-rc10 the issue is never observed. We are aware that there are a lot of changes between these two versions.
We did kernel tracing, and it looks like something is putting the container into the frozen state; at this point the system call gets EINTR. We also tested whether the cpu manager is involved: if we change the reconcile period, the EINTR timing aligns with the new period, and if we do not set cpu manager policy static the issue is never seen with either old or new runc versions. If we swap runc versions while keeping all other settings the same, the behaviour follows the version. So we are pretty sure this is something runc introduces. We have also tried v1.0.0 and the issue is seen there as well.
The strace of the iperf server was taken with strace -f -p -o, where you can see the EINTR and interrupted-system-call logs.
Attachment: server-trace-non-working.txt