sync: TestWaitGroupMisuse2 hangs sometimes #20072
Labels
FrozenDueToAge
help wanted
NeedsFix
The path to resolution is known, but the work has not been done.
Milestone
I've been seeing the sync tests timeout quite a lot recently (on my 8 and 18 core linux/s390x VMs) and the cause appears to be TestWaitGroupMisuse2. I've also recreated the problem, albeit with more difficulty, on my 2-core darwin/amd64 laptop (note that by default this test doesn't run unless you have 5+ cores).
CL 36841 (83f95b8) added some new code that uses atomic operations to try and sync up the goroutines in order to make the test more reliable (to fix #11443). Basically all 3 goroutines atomically increment the uint32
here
and then poll it until it has the value 3. Unfortunately there is nothing to stop one of the goroutines being preempted in preparation for a GC 'stop the world' phase while the others are left spinning onhere
. This is a deadlock becausehere
will never be incremented by the preempted goroutine and the spinning goroutines aren't preemptible.The root cause of the problem is essentially #10958 and this issue can be fixed with either GOEXPERIMENT=preemptibleloops or GOGC=off. Although even with these fixes, at least on linux/s390x, the test still often takes a long time (up to 75s) and sometimes burns through all 1e6 iterations without triggering a panic. This probably explains why I see this a lot, since the longer the test runs the more likely a badly timed GC is.
Is it worth modifying this test or should we just wait until #10958 is fixed?
Side note: while recreating this on my 2-core darwin/amd64 laptop I noticed that repeatedly running TestWaitGroupMisuse2 leaks goroutines. I think this is because it doesn't necessarily wake up the goroutine that calls Wait. Not sure if we should care about that?
The text was updated successfully, but these errors were encountered: