sync: TestWaitGroupMisuse2 hangs sometimes #20072

Closed
mundaym opened this issue Apr 21, 2017 · 2 comments
Labels
FrozenDueToAge help wanted NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@mundaym
Member

mundaym commented Apr 21, 2017

I've been seeing the sync tests time out quite often recently (on my 8-core and 18-core linux/s390x VMs), and the cause appears to be TestWaitGroupMisuse2. I've also reproduced the problem, albeit with more difficulty, on my 2-core darwin/amd64 laptop (note that by default this test is skipped unless you have 5+ cores).

CL 36841 (83f95b8) added code that uses atomic operations to synchronize the goroutines and make the test more reliable (to fix #11443). Basically, all 3 goroutines atomically increment the uint32 variable `here` and then poll it until it has the value 3. Unfortunately, nothing stops one of the goroutines from being preempted in preparation for a GC 'stop the world' phase before it has incremented `here`, while the others are left spinning on it. This is a deadlock: the preempted goroutine will never increment `here`, and the spinning goroutines aren't preemptible, so the 'stop the world' phase can never complete.
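For illustration, here is a minimal sketch of the rendezvous pattern described above. It is not the actual test code: the function name `rendezvous` is hypothetical, and the `runtime.Gosched()` call inside the polling loop is an assumption added so the sketch cannot spin without yielding. The loop in the test, as described, has an empty body.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// rendezvous (hypothetical name) starts n goroutines. Each one atomically
// increments the shared counter and then polls it until all n have arrived.
// It returns the final counter value.
func rendezvous(n uint32) uint32 {
	var here uint32
	var wg sync.WaitGroup
	for i := uint32(0); i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			atomic.AddUint32(&here, 1)
			for atomic.LoadUint32(&here) != n {
				// Assumption for this sketch: yield on every poll so the
				// scheduler can always run a goroutine that has not yet
				// incremented the counter. With an empty loop body and no
				// loop preemption, this is where the spinners would block
				// a pending stop-the-world.
				runtime.Gosched()
			}
		}()
	}
	wg.Wait()
	return atomic.LoadUint32(&here)
}

func main() {
	fmt.Println(rendezvous(3))
}
```

Yielding in the loop is only one way to give the runtime a preemption point; it is shown here to make the hazard visible, not as the change that should go into the test.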

The root cause of the problem is essentially #10958, and this issue can be avoided with either GOEXPERIMENT=preemptibleloops or GOGC=off. Even with these settings, though, at least on linux/s390x, the test still often takes a long time (up to 75s) and sometimes burns through all 1e6 iterations without triggering a panic. This probably explains why I see the hang so often: the longer the test runs, the more likely a badly timed GC becomes.

Is it worth modifying this test or should we just wait until #10958 is fixed?

Side note: while reproducing this on my 2-core darwin/amd64 laptop, I noticed that repeatedly running TestWaitGroupMisuse2 leaks goroutines. I think this is because the test doesn't necessarily wake up the goroutine that calls Wait. Not sure whether we should care about that.

@andybons
Member

@rsc @aclements

@andybons added the help wanted, NeedsInvestigation, and NeedsFix labels and removed the NeedsInvestigation label on Apr 11, 2018
@andybons andybons added this to the Unplanned milestone Apr 11, 2018
@gopherbot
Contributor

Change https://golang.org/cl/112978 mentions this issue: sync: deflake TestWaitGroupMisuse2

@golang golang locked and limited conversation to collaborators May 14, 2019