Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

time: fix test flakiness on OpenBSD #9903

Closed
mdempsky opened this issue Feb 17, 2015 · 6 comments

Comments

Projects
None yet
5 participants
@mdempsky
Copy link
Member

commented Feb 17, 2015

The time tests seem flaky on OpenBSD for some reason. E.g., recent-ish time.TestAfterQueuing failures on openbsd/amd64:

http://build.golang.org/log/7cf643d530ccbe39c97376ffdb2af64db66bcb5d
http://build.golang.org/log/10a9c8a1ffc2a088f48a12fdc9379f91a8bbc8b8
http://build.golang.org/log/702430fd8bf22805d19672d0795f6bc0d1395e6b
http://build.golang.org/log/cb50ea140b06232bdecb4c926134c9113f94e56a

Here's a time.TestReset failure on openbsd/amd64:

http://build.golang.org/log/26bb60d01823523470ffe47c7ed5fa9cdb280513

Flakiness has been seen on openbsd/386 in the past too; e.g.

http://build.golang.org/log/bd19f8590db8c7f317b3f968ac1af5126f5449cf

I've previously conjectured that this could be related to random scheduling delays interacting with repeated absolute<->relative timeout conversions in package runtime, but after measuring this seems unlikely (see https://groups.google.com/d/msg/golang-dev/AbSUgOucZyk/6ea0yP3ba1QJ).

It's possible this is actually an OpenBSD kernel issue. One conjecture I have at the moment is that the tc_windup() function (http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/kern_tc.c) is not multiprocessor safe, because it doesn't use proper memory barriers. E.g., the final "timehands = th;" assignment might become visible to another thread before the modifications to the struct timehands's fields become visible.

Aside: I pointed this out privately to PHK in 2010, to which he responded "You are probably right, explicit memory barriers are probably called for." but FreeBSD still does not use memory barriers in tc_windup() either as far as I can tell: https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_tc.c

NetBSD appears to have added explicit memory barriers back in 2007: http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/kern/kern_tc.c#rev1.20

@mdempsky

This comment has been minimized.

Copy link
Member Author

commented Feb 17, 2015

Joel Sing tells me that the old openbsd-386-rootbsd builder was MP, but openbsd-amd64-rootbsd builders was SP. Looking through old build.golang.org logs, I can't find any evidence of time.TimeAfterQueuing failing on openbsd-amd64-rootbsd (which aligns with my memories of thinking it was an openbsd/386-specific bug).

However, now openbsd-amd64-gce56 is MP too, and it's also exhibiting the bug. That seems to support my theory that it's a memory barrier issue.

Following up with OpenBSD to see about fixing kern_tc.c.

@bradfitz

This comment has been minimized.

Copy link
Member

commented Feb 18, 2015

We should have a Github issue label for kernel-bugs-in-various-operating-systems-that-go-has-revealed

@mdempsky

This comment has been minimized.

Copy link
Member Author

commented Mar 1, 2015

Still jet lagged, so I dug into this some more early this morning. On my X201s running OpenBSD -current, I was able to reproduce this issue. I think the kern_tc.c race is still technically a bug, but it doesn't seem to be the root cause of this crash.

My way of easily reproducing currently is:

sed -i.tmp 's/const attempts = 3/const attempts = 1/' src/time/sleep_test.go
go test -c time
yes | head -n 10000 | GOMAXPROCS=4 xargs -P20 -n1 ./time.test -test.short=true -test.timeout=120s -test.run=TestAfterQueuing | grep -v '^PASS$'

(Notably, so far I've only repro'd with GOMAXPROCS>1, but it seems like the builders are using GOMAXPROCS=1.)

A couple things to explain:

  1. With -test.short=true, TestAfterQueuing lowers "delta" to 20ms.
  2. On OpenBSD (at least the arches supported by Go), the processor scheduler has a period of 10ms, whereas on Linux they're 1ms. I.e., CPU intensive threads can potentially run for up to 10ms uninterrupted on OpenBSD.
  3. When "go test std cmd" runs time.test, it's also busy compiling other packages/commands (i.e., CPU intensive work).

That leads to these consequences:

  1. The first call to testAfterQueuing() tries to create a bunch of interleaved timers on lines 261--263. If the test process is preempted for >=20ms while creating the timers, it's possible (e.g.) the final After(0 * delta) call occurred more than delta after the call to After(1 * delta). I've seen this happen in ktrace/kdump output.

(Note: Subsequent retries, the global "slots" array is already sorted, so this cause goes away. To trigger the failure below with "attempts = 1", move the sort.Ints() call earlier.)

  1. Even when the queued timers are for non-decreasing intervals, there's a possibility that the "timerproc" Go routine gets delayed for >=20ms. I've traced through timerproc to verify that even in this case, it processes the timers in the right order; but it's not guaranteed the "await" goroutines will be scheduled to execute in the same order.

(Interestingly, in a simplified test case, I'm able to reproduce this non-deterministic goroutine scheduling with GOMAXPROCS>1, but not with GOMAXPROCS=1. I need to investigate more to find out why this is.)

Takeaway: I don't think "kernel-bug" is an entirely accurate here, but because of OpenBSD's kernel behavior, it might make sense to only set "delta = 20 * Millisecond" for GOOS != "openbsd".

@rsc rsc removed os-openbsd labels Apr 10, 2015

@rsc rsc added this to the Unplanned milestone Apr 10, 2015

@rsc rsc added the OS-OpenBSD label Apr 15, 2015

@gopherbot

This comment has been minimized.

Copy link

commented Apr 29, 2015

CL https://golang.org/cl/9510 mentions this issue.

@mdempsky mdempsky closed this in d12b532 Apr 29, 2015

@mdempsky

This comment has been minimized.

Copy link
Member Author

commented May 4, 2015

Sigh, still flaking: http://build.golang.org/log/6ae361bdba9c95e35737fe173e291cd45959f1b3

(This was the build from 10f6d30 on build.golang.org for openbsd/386.)

@gopherbot

This comment has been minimized.

Copy link

commented May 6, 2015

CL https://golang.org/cl/9795 mentions this issue.

@bradfitz bradfitz closed this in b4bc7b4 May 6, 2015

@mikioh mikioh modified the milestones: Go1.5, Unplanned May 15, 2015

@golang golang locked and limited conversation to collaborators Jun 25, 2016

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.