time: fix test flakiness on OpenBSD #9903
The time tests seem flaky on OpenBSD for some reason. E.g., recent-ish time.TestAfterQueuing failures on openbsd/amd64:
Here's a time.TestReset failure on openbsd/amd64:
Flakiness has been seen on openbsd/386 in the past too; e.g.
I've previously conjectured that this could be related to random scheduling delays interacting with repeated absolute<->relative timeout conversions in package runtime, but after measuring this seems unlikely (see https://groups.google.com/d/msg/golang-dev/AbSUgOucZyk/6ea0yP3ba1QJ).
It's possible this is actually an OpenBSD kernel issue. One conjecture I have at the moment is that the tc_windup() function (http://cvsweb.openbsd.org/cgi-bin/cvsweb/src/sys/kern/kern_tc.c) is not multiprocessor safe, because it doesn't use proper memory barriers. E.g., the final "timehands = th;" assignment might become visible to another thread before the modifications to the struct timehands's fields become visible.
Aside: I pointed this out privately to PHK in 2010, to which he responded "You are probably right, explicit memory barriers are probably called for." but FreeBSD still does not use memory barriers in tc_windup() either as far as I can tell: https://github.com/freebsd/freebsd/blob/master/sys/kern/kern_tc.c
NetBSD appears to have added explicit memory barriers back in 2007: http://cvsweb.netbsd.org/bsdweb.cgi/src/sys/kern/kern_tc.c#rev1.20
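To make the tc_windup() conjecture above concrete, here is a minimal user-space sketch in Go of the ordering that is needed: fill in every field of the new snapshot first, and only then publish the pointer with a store that carries release semantics. This is not the kernel code. The `timehands` struct here is a much-simplified, hypothetical stand-in, and `windup` and `readOffset` are illustrative names. It assumes a single writer and Go 1.19+ for `atomic.Pointer`.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// timehands is a hypothetical, simplified stand-in for the kernel struct.
type timehands struct {
	offsetNanos int64
	generation  uint32
}

// current plays the role of the global "timehands" pointer.
var current atomic.Pointer[timehands]

// windup builds a complete new snapshot and then publishes it.
// The atomic Store guarantees that a reader who observes the new pointer
// also observes the field writes made before the Store.
// Assumption: only one goroutine ever calls windup (single writer).
func windup(offsetNanos int64) {
	next := &timehands{offsetNanos: offsetNanos, generation: 1}
	if prev := current.Load(); prev != nil {
		next.generation = prev.generation + 1
	}
	current.Store(next) // publish only after all fields are written
}

// readOffset loads the published snapshot; ok is false before the first windup.
func readOffset() (offset int64, ok bool) {
	th := current.Load()
	if th == nil {
		return 0, false
	}
	return th.offsetNanos, true
}

func main() {
	windup(1000)
	windup(2000)
	off, ok := readOffset()
	fmt.Println(off, ok) // prints "2000 true"
}
```

With a plain (non-atomic) pointer assignment in place of `current.Store`, nothing would stop the pointer write from becoming visible before the field writes on a multiprocessor, which is the hazard conjectured for the final "timehands = th;" assignment.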
Joel Sing tells me that the old openbsd-386-rootbsd builder was MP, but the openbsd-amd64-rootbsd builder was SP. Looking through old build.golang.org logs, I can't find any evidence of time.TestAfterQueuing failing on openbsd-amd64-rootbsd (which aligns with my memories of thinking it was an openbsd/386-specific bug).
However, now openbsd-amd64-gce56 is MP too, and it's also exhibiting the bug. That seems to support my theory that it's a memory barrier issue.
Following up with OpenBSD to see about fixing kern_tc.c.
Still jet lagged, so I dug into this some more early this morning. On my X201s running OpenBSD -current, I was able to reproduce this issue. I think the kern_tc.c race is still technically a bug, but it doesn't seem to be the root cause of this crash.
My way of easily reproducing currently is:
(Notably, so far I've only repro'd with GOMAXPROCS>1, but it seems like the builders are using GOMAXPROCS=1.)
A couple things to explain:
That leads to these consequences:
(Note: On subsequent retries, the global "slots" array is already sorted, so this cause goes away. To trigger the failure below with "attempts = 1", move the sort.Ints() call earlier.)
(Interestingly, in a simplified test case, I'm able to reproduce this non-deterministic goroutine scheduling with GOMAXPROCS>1, but not with GOMAXPROCS=1. I need to investigate more to find out why this is.)
Takeaway: I don't think "kernel-bug" is an entirely accurate label here, but because of OpenBSD's kernel behavior, it might make sense to only set "delta = 20 * Millisecond" for GOOS != "openbsd".
Sigh, still flaking: http://build.golang.org/log/6ae361bdba9c95e35737fe173e291cd45959f1b3
(This was the build from 10f6d30 on build.golang.org for openbsd/386.)