runtime: TestGoexitCrash failure on linux-ppc64le-buildlet #34575
Observed on the
I can reproduce it roughly once every few runs of this command on the ppc64-linux-buildlet (each run itself executes the test 500 times):
When it fails, the test seems to hang (somehow the normal "deadlock" detection, which should fire when there are no runnable goroutines and no main thread because main called Goexit, doesn't happen), and then something forces a SIGQUIT after 60 seconds.
The same test command run locally on amd64 never fails.
The actual test program (that is supposed to deadlock when all threads end and main does the Goexit) is:
I can still reproduce the problem if I comment out the
I'll keep investigating and check whether it is present in 1.13. However, it seems unlikely that this has to be fixed for 1.14, since it is so rare and only happens when all other goroutines end and main exits via Goexit (which is most likely a programming mistake).
Is it more likely if you run with GOGC=20 or something very low? I ran into problems with this test while stress-testing other changes, and I think it was because of the stress testing, not the changes (but I'm not at a computer right now). My hunch was that some GC-related goroutines, or maybe the scavenger, were interfering with deadlock detection. @mknyszek, I recall you looking into a similar problem with the scavenger a while ago, but I don't remember the outcome.
It turns out that the problem is present in the latest release of Go 1.13, but not in the very first 1.13 releases from August, so I did a 'git bisect'. The change it came up with (not saying this is definitive at all) was:
runtime: redefine scavenge goal in terms of heap_inuse [mknyszek]
So, good guess that it might be GC/scavenger-related. Also, as I mentioned, it isn't reproducible at any commit if the runtime.GC() call is removed. Not sure why this would show up only on ppc64.
Will update further when I get a chance to try out GOGC=20.
@danscales thanks for the bisection, that helps a lot.
I printed out some diagnostic information and ran the program until it hung.
I found that in the cases where it hung, the scavenger had turned on but consistently found no work to do, and thus fell into its exponential back-off path. Since the program is no longer making progress at that point, the scavenger will never get memory to scavenge, so it sits there for all eternity preventing
There are indeed cases where the scavenger turned on but there was no hang, because it found memory to scavenge and achieved its goal, thereby turning off and allowing
The reason why removing
So, now the question is why does it turn on if there's no work to do? In this case,
My first guess was that since
There's a 5-page (40960-byte) discrepancy between
This indicates either a bug in the scavenging/treap code, or maybe that 5 pages are not being accounted for correctly. I'll dig further.
OK it appears to be neither of the things I thought it would be.
So it turns out there's a chance the computed rate could end up as +Inf, which means that "retained-want" (in the scavenger's calculation) is going to be some nonsense number.
This is the crux of the problem. The reason we see this on ppc64 and not on other platforms is its larger system page size of 64 KiB: with small heaps it's more likely that we end up with less than a single page of work, and so we fall into a situation where the scavenger should scavenge something (according to one set of sane pacing parameters) but, because of the nonsense number, calculates that it's always ahead of schedule, so it gets stuck in a loop.
The fix is easy: never let the nonsense number arise, either by always rounding up to one physical page worth of work, or by turning off the scavenger when there isn't at least one physical page worth of work. Both avoid the divide by zero that causes the +Inf, and with it the nonsense number later on.
This is not a performance problem, or anything else, in real running applications, because the nonsense dissipates in the following GC cycle, or the scavenger harmlessly backs off and does nothing (when there's really so little work to do that it doesn't matter). It is a problem for deadlock detection, though, which is useful for teaching, so I'll put up a fix.
Should we backport this?