net: *TCPConn deadlines missed by more than 25% on NetBSD and OpenBSD #50189
Change https://golang.org/cl/372215 mentions this issue:
…penBSD

Decrease the slop everywhere else, since NetBSD and OpenBSD seem to be the only ones that miss by that much.

For #50189
Updates #36108

Change-Id: I22ac39cc7c254e40358fcd933b5a6016629602c3
Reviewed-on: https://go-review.googlesource.com/c/go/+/372215
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Change https://golang.org/cl/383175 mentions this issue:
A theory on what might be causing this extra "slop": on NetBSD (not fully sure about OpenBSD), the granularity of sleeps and similar timed operations is governed by a kernel constant named HZ, whose default value is 100. Thus, jitter up to the next 10-millisecond boundary is unfortunately expected.
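If that theory holds, the effect is simply rounding up to the next scheduler tick. A minimal sketch of that rounding, assuming a 10ms tick (HZ = 100); the constant and helper below are purely illustrative, not taken from the NetBSD sources:

```go
package main

import (
	"fmt"
	"time"
)

// tick models the scheduling granularity implied by HZ = 100. The real
// value is a kernel build-time constant; 10ms is assumed here only for
// illustration.
const tick = 10 * time.Millisecond

// roundUpToTick returns the smallest multiple of tick that is >= d,
// i.e. the earliest moment an HZ-driven timer could fire.
func roundUpToTick(d time.Duration) time.Duration {
	if rem := d % tick; rem != 0 {
		d += tick - rem
	}
	return d
}

func main() {
	for _, d := range []time.Duration{3 * time.Millisecond, 25 * time.Millisecond, 3 * time.Second} {
		fmt.Printf("requested %v -> earliest wakeup %v\n", d, roundUpToTick(d))
	}
}
```

Note that this kind of rounding adds at most one tick (10ms) of overshoot, which matters for the reply below: the observed misses are far larger than that.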
@bsiegert, empirically the amount of slop is not bounded to a constant — or, if it is, that constant is greater than a second. Many of those test failures ended in an attempt to sleep on the order of 3–5 seconds, and missed the target by well over a second. (I agree that rounding up to a system-dependent subsecond granularity would be reasonable and expected, but that doesn't seem to be the case here.)
The 'd' constant is intentionally set fairly long to allow for builder jitter; however, dFromDuration previously hard-coded only 1ms of downward timing slop. That slop can be introduced due to time spent between the call to WaitN in the previous runWait and the call to time.Now in the current runWait, and empirically may be a bit larger than 1ms on certain builders (especially the android-amd64-emu builders, which also have more clock drift than many other platforms; see golang/go#42513).

In addition, on some BSD platforms the slop in the upward direction may actually be longer than d no matter how generously d is set. That appears to be a platform bug (see golang/go#50189).

This change adjusts dFromDuration to round to the nearest d instead of biasing in one direction or the other, and allows an additional factor of slop on the affected BSD platforms.

Fixes #44067

Change-Id: Id4c073bee545be2291ad98158d764e19db0160cb
Reviewed-on: https://go-review.googlesource.com/c/time/+/383175
Trust: Benny Siegert <bsiegert@gmail.com>
Reviewed-by: Sameer Ajmani <sameer@golang.org>
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
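For reference, rounding to the nearest multiple of d (rather than truncating after subtracting a fixed 1ms) can be sketched as below; the constant and the signature are illustrative stand-ins, not the actual x/time test helper:

```go
package main

import (
	"fmt"
	"time"
)

// d is the test's base interval; 100ms is assumed here for illustration.
const d = 100 * time.Millisecond

// dFromDuration reports how many whole multiples of d a measured
// duration represents, rounding to the nearest multiple instead of
// biasing downward by a hard-coded 1ms of slop.
func dFromDuration(dur time.Duration) int {
	return int((dur + d/2) / d)
}

func main() {
	fmt.Println(dFromDuration(149 * time.Millisecond)) // 1
	fmt.Println(dFromDuration(151 * time.Millisecond)) // 2
}
```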
Change https://go.dev/cl/415234 mentions this issue:
This applies the net package CL 365334, CL 366176, CL 372215 to the os package.

CL 365334: These tests were checking for fairly narrow timing windows, but were running in parallel and heavily dependent on timer and goroutine scheduling. This change eliminates unnecessary goroutines, runs the tests sequentially (dramatically shortening the timeouts to reduce the penalty of doing so), and uses timestamp comparison instead of background timers to hopefully gain some robustness from monotonic timestamps. Many of the other tests from this package would benefit from similar simplifications, which we can apply if and when we notice flaky failures or want to improve the latency of running the test.

CL 366176: It appears that at least the OpenBSD kernel gets sloppier the longer the timeout we give it, up to an observed overhead of around 25%. Let's give it a little more than that (33%) in the comparison, and also increase the growth curve to match the actual observed times instead of exponential initial growth.

CL 372215: Decrease the slop everywhere else, since NetBSD and OpenBSD seem to be the only ones that miss by that much.

For #36108
For #50189
Fixes #50725 (we hope)

Change-Id: I0854d27af67ca9fcf0f9d9e4ff67acff4c2effc8
Reviewed-on: https://go-review.googlesource.com/c/go/+/415234
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
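The shape of the resulting check is roughly the following; the function name, the exact multipliers, and the platform handling are illustrative, not the literal net/os test code:

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// timeoutUpperBound returns the longest elapsed time the test is willing
// to tolerate for a requested timeout d before reporting a failure.
func timeoutUpperBound(d time.Duration) time.Duration {
	switch runtime.GOOS {
	case "openbsd", "netbsd":
		// These kernels have been observed to overshoot long timeouts by
		// 25% or more, so allow a larger margin there (here, 50%).
		return d + d/2
	default:
		// Other platforms hit their deadlines much more reliably, so a
		// tighter bound suffices.
		return d + d/10
	}
}

func main() {
	d := 3 * time.Second
	fmt.Printf("on %s, a %v timeout may take up to %v before the test fails\n",
		runtime.GOOS, d, timeoutUpperBound(d))
}
```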
In CL 366176 (for #36108), I increased the timeout slop in net.TestWriteTimeoutFluctuation and net.TestReadTimeoutFluctuation to 33%, even for tests with very generous (multiple-second) timeouts. At that scale, that margin of slop should be trivial for even a heavily-loaded builder to hit. Unfortunately, the NetBSD and OpenBSD builders still do not reliably hit it, even on an n1 instance that does not appear to be affected by #49209.

Given the other issues we've had with NetBSD and OpenBSD, I suspect a kernel bug. I plan to further raise the slop for those two platforms, narrow it everywhere else, and leave it at that without investigating further. However, I suggest that folks who care about these kernels (@bsiegert, @coypoop, @4a6f656c?) may want to look into whether the underlying system calls are adding unnecessary slop in their timeout handling.
greplogs --dashboard -md -l -e 'FAIL: TestWriteTimeoutFluctuation' --since=2021-11-23
2021-12-13T16:09:22-d198a36/openbsd-amd64-70-n2d
2021-11-29T19:45:58-f598e29/netbsd-amd64-9_0-n1
2021-11-29T16:08:23-a59ab29/openbsd-amd64-68
2021-11-25T00:07:28-f7e34e7/openbsd-amd64-70-n1
2021-11-25T00:07:11-c58243a/openbsd-amd64-70-n1
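For anyone who wants to reproduce the overshoot locally, here is a minimal measurement sketch of what these tests exercise: set a read deadline on a loopback *net.TCPConn that never receives data, and see how far past the deadline the Read actually returns. The durations and setup are illustrative, not the net package's test code:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// Accept and hold the connection open without ever writing, so the
	// client's Read can only return via its deadline.
	go func() {
		c, err := ln.Accept()
		if err != nil {
			return
		}
		defer c.Close()
		time.Sleep(time.Minute)
	}()

	conn, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	for _, d := range []time.Duration{100 * time.Millisecond, 1 * time.Second, 3 * time.Second} {
		start := time.Now()
		if err := conn.SetReadDeadline(start.Add(d)); err != nil {
			panic(err)
		}
		_, err := conn.Read(make([]byte, 1)) // expected to fail with a timeout error
		elapsed := time.Since(start)
		fmt.Printf("deadline %v: Read returned after %v (overshoot %.1f%%, err=%v)\n",
			d, elapsed, 100*float64(elapsed-d)/float64(d), err)
	}
}
```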