
runtime: TestGoexitCrash failure on linux-ppc64le-buildlet #34575

bcmills opened this issue Sep 27, 2019 · 12 comments




@bcmills bcmills commented Sep 27, 2019

Observed on the linux-ppc64le-buildlet builder:

--- FAIL: TestGoexitCrash (60.00s)
    crash_test.go:95: testprog GoexitExit exit status: exit status 2
    crash_test.go:291: output:
        SIGQUIT: quit
        PC=0x68570 m=0 sigcode=0
        goroutine 6 [syscall]:
        runtime.notetsleepg(0x1d8d40, 0xc34fff89f, 0x1415)
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/lock_futex.go:227 +0x38 fp=0xc000036730 sp=0xc0000366f0 pc=0x198e8
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/time.go:308 +0x2f4 fp=0xc0000367b8 sp=0xc000036730 pc=0x59774
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/asm_ppc64x.s:884 +0x4 fp=0xc0000367b8 sp=0xc0000367b8 pc=0x67a44
        created by runtime.(*timersBucket).addtimerLocked
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/time.go:166 +0x13c
        r0   0xdd	r1   0xc000036648
        r2   0x2f2ec	r3   0x4
        r4   0x80	r5   0x0
        r6   0xc000036698	r7   0x0
        r8   0x0	r9   0x0
        r10  0x0	r11  0x0
        r12  0x0	r13  0x1ddac8
        r14  0x196d4	r15  0xc000036678
        r16  0x1	r17  0xffffffffb150046b
        r18  0xb150046b	r19  0x127474
        r20  0xc00008e020	r21  0x1d6a40
        r22  0x0	r23  0x39
        r24  0x10	r25  0x8
        r26  0x0	r27  0x0
        r28  0x0	r29  0x0
        r30  0xc00005e900	r31  0x130000
        pc   0x68570	ctr  0x0
        link 0x39094	xer  0x0
        ccr  0x54400002	trap 0xc00
        want output containing: no goroutines (main called runtime.Goexit) - deadlock!
FAIL	runtime	109.964s

CC @danscales @mknyszek; see previously #31966

@bcmills bcmills added this to the Go1.14 milestone Sep 27, 2019
@rsc rsc modified the milestones: Go1.14, Backlog Oct 9, 2019
@bcmills (Member, Author) commented Oct 21, 2019

@danscales @mknyszek @aclements: could someone from the runtime team take a look at this failure and at least assess whether it's something we should fix in 1.14?

@bcmills bcmills modified the milestones: Backlog, Go1.14 Oct 21, 2019

@danscales danscales commented Oct 21, 2019

I will take a look and see if I can reproduce it, or make a guess as to what might be happening.


@danscales danscales commented Oct 24, 2019

I am able to reproduce it roughly once every few runs of this command on the ppc64-linux buildlet (each run itself executes the test 500 times):

gomote run user-danscales-linux-ppc64-buildlet-0 go/bin/go test runtime -test.count=500 -test.run=TestGoexitCrash

When it fails, the test seems to hang (somehow the normal deadlock detection of no goroutines and no main thread, because Goexit was called, doesn't happen), and then something forces a SIGQUIT after 60 seconds.

The same test command run locally on amd64 never fails.

The actual test program (which is supposed to deadlock once all goroutines end and main has called Goexit) is:

```go
func GoexitExit() {
	go func() {
		time.Sleep(time.Millisecond)
	}()
	i := 0
	runtime.SetFinalizer(&i, func(p *int) {})
	runtime.GC()
	runtime.Goexit()
}
```
I can still reproduce the problem if I comment out the i := 0 line and the SetFinalizer line. However, I can't reproduce if I comment out the runtime.GC line.

I'll keep investigating and check whether it is present in 1.13. However, it seems unlikely that this has to be fixed for 1.14, since it is so rare and only happens when all goroutines end and main exits via Goexit (which is most likely a programming mistake).


@danscales danscales commented Oct 24, 2019

OK, it definitely also happens in go1.13. I haven't been able to reproduce in go1.12 yet, so it might be a slight regression in the go1.13 timeframe.

@bcmills bcmills modified the milestones: Go1.14, Backlog Oct 24, 2019

@aclements aclements commented Oct 25, 2019


@danscales danscales commented Oct 25, 2019

It turns out that the problem is present on the latest release of go1.13, but not on the very first go1.13 releases from August, so I did a `git bisect`. The change it came up with (not saying this is definitive at all) was:

runtime: redefine scavenge goal in terms of heap_inuse [mknyszek]

So, good guess that it might be GC/scavenger-related. Also, as I mentioned, it isn't reproducible at any commit if runtime.GC() is removed. Not sure why this would show up only on ppc64.

Will update further when I get a chance to try out GOGC=20.


@mknyszek mknyszek commented Oct 25, 2019

@danscales thanks for the bisection, that helps a lot.

I printed out some diagnostic information and ran the program until it hung.

I found that in the cases where it hung, the scavenger had turned on but consistently found no work to do, and thus fell into its exponential back-off case. Since the program is no longer making progress at that point, the scavenger will never get memory to scavenge, so it sits there for all eternity, preventing checkdead from firing, because checkdead thinks the scavenger is alive via a live timer.

There are indeed cases where the scavenger turned on but there was no hang, because it found memory to scavenge and achieved its goal, thereby turning off and allowing checkdead to throw. Most of the time it just never turns on at all, which is why this is hard to reproduce.

The reason why removing runtime.GC() fixes this is because the scavenger is always off for the first GC, so an explicit GC opens up the opportunity to potentially turn on the scavenger at all.

So, now the question is why does it turn on if there's no work to do? In this case, heap_sys - heap_released (the scavenger's retained heap memory metric) is greater than the goal it sets for itself, which, as of the CL @danscales mentioned, is based on heap_inuse at the end of the last GC, multiplied by the expected heap growth (next_gc / last_next_gc).

My first guess was that since last_heap_inuse is slightly stale information, there could be skew with the actual heap_inuse, meaning the whole heap is either in use or scavenged. But this guess was wrong: when I print out heap_inuse at the last scavenger pacing change before the hang, it's identical in value to last_heap_inuse.

There's a 5-page (40960-byte) discrepancy between heap_sys - heap_released and heap_inuse. In theory, heap_sys is all mapped memory, heap_released is all freed and scavenged memory, and heap_inuse is all in-use memory. Therefore those 40960 bytes should be free and unscavenged memory, but the scavenger can't find them. If it could, it would scavenge some portion of them and the application would exit quickly (as in the non-hanging scavenger-on cases, where 40960 bytes is the exact discrepancy).

This indicates either a bug in the scavenging/treap code, or that those 5 pages are not being accounted for correctly. I'll dig further.


@mknyszek mknyszek commented Oct 25, 2019

OK it appears to be neither of the things I thought it would be.

So it turns out there's a chance the computed rate could end up as +Inf, which means that "retained-want" (in the scavenger's calculation) is going to be some nonsense number.

This is the crux of the problem. The reason we see this on ppc64 and not other platforms is the larger system page size of 64 KiB: with small heaps it's more likely that we end up with less work than a single page, so we fall into a situation where the scavenger should scavenge something (according to one set of sane pacing parameters) but calculates that it's always ahead of schedule because of the nonsense number, so it gets stuck in a loop.

The fix is easy: never let the nonsense number happen by either always rounding up to 1 physical page worth of work, or turning off the scavenger if there isn't at least one physical page worth of work. Both avoid the divide by zero which causes the +Inf, and the nonsense number later on.

This is not a performance problem or anything else in real running applications because the nonsense could dissipate in the following GC cycle, or the scavenger will harmlessly back off and do nothing (when there's really so little work to do that it doesn't matter). It is a problem for deadlock detection though, which is useful for teaching, so I'll put up a fix.

Should we backport this?


@mknyszek mknyszek commented Oct 25, 2019

Uploaded a fix for this. With the patch, I can run the test 100,000 consecutive times on linux/ppc64 without issue.


@gopherbot gopherbot commented Oct 25, 2019

Change mentions this issue: runtime: turn off scavenger when there's <1 physical page of work


@mknyszek mknyszek commented Oct 29, 2019

Moved to the Go 1.14 milestone just because the fix is already up for review, is small, and close to landing.

@mknyszek mknyszek self-assigned this Oct 29, 2019
@gopherbot gopherbot closed this in 08a72c8 Nov 1, 2019
@golang golang locked and limited conversation to collaborators Oct 31, 2020