Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: *TCPConn deadlines missed by more than 25% on NetBSD and OpenBSD #50189

Open
bcmills opened this issue Dec 15, 2021 · 6 comments
Open

net: *TCPConn deadlines missed by more than 25% on NetBSD and OpenBSD #50189

bcmills opened this issue Dec 15, 2021 · 6 comments
Labels
ExpertNeeded NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-NetBSD OS-OpenBSD
Milestone

Comments

@bcmills
Copy link
Member

bcmills commented Dec 15, 2021

In CL 366176 (for #36108), I increased the timeout slop in net.TestWriteTimeoutFluctuation and net.TestReadTimeoutFluctuation to 33% even for tests with very generous (multiple-second) timeouts. At that scale, that margin of slop should be trivial for even a heavily-loaded builder to hit.

Unfortunately, the NetBSD and OpenBSD builders still do not reliably hit it, even on an n1 instance that does not appear to be affected by #49209.

Given the other issues we've had with NetBSD and OpenBSD, I suspect a kernel bug. I plan to further raise the slop for those two platforms, narrow it everywhere else, and call it at that without investigating further. However, I suggest that folks who care about these kernels (@bsiegert, @coypoop, @4a6f656c?) may want to look into whether the underlying system calls may be adding unnecessary slop in their timeout handling.

greplogs --dashboard -md -l -e 'FAIL: TestWriteTimeoutFluctuation' --since=2021-11-23

2021-12-13T16:09:22-d198a36/openbsd-amd64-70-n2d
2021-11-29T19:45:58-f598e29/netbsd-amd64-9_0-n1
2021-11-29T16:08:23-a59ab29/openbsd-amd64-68
2021-11-25T00:07:28-f7e34e7/openbsd-amd64-70-n1
2021-11-25T00:07:11-c58243a/openbsd-amd64-70-n1

@bcmills bcmills added OS-NetBSD OS-OpenBSD NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Dec 15, 2021
@bcmills bcmills added this to the Unplanned milestone Dec 15, 2021
@gopherbot
Copy link

gopherbot commented Dec 15, 2021

Change https://golang.org/cl/372215 mentions this issue: net: increase timing slop in TimeoutFluctuation tests on NetBSD and OpenBSD

@bcmills
Copy link
Member Author

bcmills commented Dec 15, 2021

Expanding the regexp a bit to catch related tests:

greplogs --dashboard -md -l -e 'FAIL: Test(?:Read|ReadFrom|Write)TimeoutFluctuation' --since=2021-11-19

2021-12-13T16:09:22-d198a36/openbsd-amd64-70-n2d
2021-12-03T14:28:11-a174638/netbsd-386-9_0-n2d
2021-12-02T05:25:23-00dbcb3/netbsd-386-9_0
2021-12-01T22:20:39-c3a7fb2/openbsd-amd64-70
2021-11-30T20:04:58-7ccbcc9/openbsd-amd64-70-n1
2021-11-29T22:02:51-ebd0b77/openbsd-amd64-70
2021-11-29T22:02:45-1970e3e/openbsd-amd64-70
2021-11-29T19:45:58-f598e29/netbsd-amd64-9_0-n1
2021-11-29T16:08:23-a59ab29/openbsd-amd64-68
2021-11-27T23:27:52-9f2a075/openbsd-amd64-70-n1
2021-11-27T19:49:32-a142d65/openbsd-amd64-70-n1
2021-11-25T00:07:28-f7e34e7/openbsd-amd64-70-n1
2021-11-25T00:07:11-c58243a/openbsd-amd64-70-n1
2021-11-24T21:09:36-7e5331a/openbsd-amd64-70-n1
2021-11-24T20:53:54-a3b8f62/openbsd-amd64-70-n1
2021-11-22T18:55:55-2d7ae3f/openbsd-amd64-70
2021-11-22T18:50:19-9e94cc3/netbsd-amd64-9_0-n1
2021-11-22T18:50:19-9e94cc3/openbsd-amd64-70
2021-11-22T16:53:57-cd0bf38/netbsd-386-9_0-n1
2021-11-22T16:53:57-cd0bf38/netbsd-amd64-9_0
2021-11-22T16:15:36-6275b54/netbsd-386-9_0
2021-11-22T16:15:36-6275b54/netbsd-amd64-9_0-n1
2021-11-22T15:54:11-e73c6c8/netbsd-amd64-9_0
2021-11-22T12:30:26-b2aa138/netbsd-amd64-9_0-n1
2021-11-22T12:30:26-b2aa138/openbsd-amd64-70
2021-11-22T12:29:44-ffb6c79/netbsd-386-9_0
2021-11-22T12:29:44-ffb6c79/netbsd-amd64-9_0-n1
2021-11-22T12:29:44-ffb6c79/openbsd-amd64-70
2021-11-22T04:27:29-e30ebaa/netbsd-amd64-9_0
2021-11-22T03:24:07-a287c4a/openbsd-amd64-68
2021-11-22T03:24:01-0c3b4a3/netbsd-amd64-9_0
2021-11-22T03:24:01-0c3b4a3/openbsd-amd64-68
2021-11-22T03:24:01-0c3b4a3/openbsd-amd64-70
2021-11-20T08:47:36-91abe4b/netbsd-386-9_0-n1
2021-11-20T08:47:36-91abe4b/netbsd-amd64-9_0
2021-11-20T08:47:36-91abe4b/openbsd-amd64-70
2021-11-20T00:37:28-d2f4c93/netbsd-386-9_0
2021-11-20T00:32:49-57aba32/netbsd-amd64-9_0
2021-11-19T21:59:14-5e774b0/netbsd-386-9_0-n1
2021-11-19T21:59:14-5e774b0/openbsd-amd64-70

gopherbot pushed a commit that referenced this issue Dec 15, 2021
…penBSD

Decrease the slop everywhere else, since NetBSD and OpenBSD seem to be
the only ones that miss by that much.

For #50189
Updates #36108

Change-Id: I22ac39cc7c254e40358fcd933b5a6016629602c3
Reviewed-on: https://go-review.googlesource.com/c/go/+/372215
Trust: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@gopherbot
Copy link

gopherbot commented Feb 4, 2022

Change https://golang.org/cl/383175 mentions this issue: rate: allow for more timing slop in TestWaitSimple

@bsiegert
Copy link
Contributor

bsiegert commented Feb 4, 2022

A theory on what might be causing this extra "slop":

On NetBSD (not fully sure about OpenBSD), the granularity of sleeps and such is in terms of a kernel constant named HZ, whose default value is 100. Thus, jitter to the next 10-millisecond boundary is unfortunately expected.

@bcmills
Copy link
Member Author

bcmills commented Feb 4, 2022

@bsiegert, empirically the amount of slop is not bounded to a constant — or, if it is, that constant is greater than a second. Many of those test failures ended in an attempt to sleep on the order of 3–5 seconds, and missed the target by well over a second.

(I agree that rounding up to a system-dependent subsecond granularity would be reasonable and expected, but that doesn't seem to be the case here.)

gopherbot pushed a commit to golang/time that referenced this issue Feb 10, 2022
The 'd' constant is intentionally set fairly long to allow for builder
jitter; however, dFromDuration previously hard-coded only 1ms of
downward timing slop.

That slop can be introduced due to time spent between the call to
WaitN in the previous runWait and the call to time.Now in the current
runWait, and empirically may be a bit larger than 1ms on certain
builders (especially the android-amd64-emu builders, which also have
more clock drift than many other platforms; see golang/go#42513).

In addition, on some BSD platforms the slop in the upward direction
may actually be longer than d no matter how generously d is set. That
appears to be a platform bug (see golang/go#50189).

This change adjusts dFromDuration to round to the nearest d instead of
biasing in one direction or the other, and allows an additional factor
of slop on the affected BSD platforms.

Fixes #44067

Change-Id: Id4c073bee545be2291ad98158d764e19db0160cb
Reviewed-on: https://go-review.googlesource.com/c/time/+/383175
Trust: Benny Siegert <bsiegert@gmail.com>
Reviewed-by: Sameer Ajmani <sameer@golang.org>
Trust: Bryan Mills <bcmills@google.com>
Run-TryBot: Bryan Mills <bcmills@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link

gopherbot commented Jun 29, 2022

Change https://go.dev/cl/415234 mentions this issue: os: simplify deadline fluctuation tests

gopherbot pushed a commit that referenced this issue Jun 30, 2022
This applies the net package CL 365334, CL 366176, CL 372215 to the os
package.

CL 365334:

    These tests were checking for fairly narrow timing windows, but were
    running in parallel and heavily dependent on timer and goroutine
    scheduling. This change eliminates unnecessary goroutines, runs the
    tests sequentially (dramatically shortening the timeouts to reduce the
    penalty of doing so), and uses timestamp comparison instead of
    background timers to hopefully gain some robustness from monotonic
    timestamps.

    Many of the other tests from this package would benefit from similar
    simplifications, which we can apply if and when we notice flaky
    failures or want to improve the latency of running the test.

CL 366176:

    It appears that at least the OpenBSD kernel gets sloppier the longer
    the timeout we give it, up to an observed overhead of around 25%.
    Let's give it a little more than that (33%) in the comparison, and
    also increase the growth curve to match the actual observed times
    instead of exponential initial growth.

CL 372215:

    Decrease the slop everywhere else, since NetBSD and OpenBSD seem to be
    the only ones that miss by that much.

For #36108
For #50189
Fixes #50725 (we hope)

Change-Id: I0854d27af67ca9fcf0f9d9e4ff67acff4c2effc8
Reviewed-on: https://go-review.googlesource.com/c/go/+/415234
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
jproberts pushed a commit to jproberts/go that referenced this issue Aug 10, 2022
This applies the net package CL 365334, CL 366176, CL 372215 to the os
package.

CL 365334:

    These tests were checking for fairly narrow timing windows, but were
    running in parallel and heavily dependent on timer and goroutine
    scheduling. This change eliminates unnecessary goroutines, runs the
    tests sequentially (dramatically shortening the timeouts to reduce the
    penalty of doing so), and uses timestamp comparison instead of
    background timers to hopefully gain some robustness from monotonic
    timestamps.

    Many of the other tests from this package would benefit from similar
    simplifications, which we can apply if and when we notice flaky
    failures or want to improve the latency of running the test.

CL 366176:

    It appears that at least the OpenBSD kernel gets sloppier the longer
    the timeout we give it, up to an observed overhead of around 25%.
    Let's give it a little more than that (33%) in the comparison, and
    also increase the growth curve to match the actual observed times
    instead of exponential initial growth.

CL 372215:

    Decrease the slop everywhere else, since NetBSD and OpenBSD seem to be
    the only ones that miss by that much.

For golang#36108
For golang#50189
Fixes golang#50725 (we hope)

Change-Id: I0854d27af67ca9fcf0f9d9e4ff67acff4c2effc8
Reviewed-on: https://go-review.googlesource.com/c/go/+/415234
Run-TryBot: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Bryan Mills <bcmills@google.com>
Reviewed-by: Ian Lance Taylor <iant@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExpertNeeded NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. OS-NetBSD OS-OpenBSD
Projects
None yet
Development

No branches or pull requests

3 participants