New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: deadlock/livelock in GC code #28695

Closed
RLH opened this Issue Nov 9, 2018 · 5 comments

Comments

Projects
None yet
6 participants
@RLH
Contributor

RLH commented Nov 9, 2018

The GC appears to be deadlock. The test in issue27695.go triggers it though the test seem unrelated to the deadlock.

go version devel +ac277d9234 Wed Nov 7 13:04:50 2018 +0000 linux/amd64

GOARCH="amd64"
GOOS="linux""

Noticed this running go/test/fixedbugs/issue27695.go from the standard distribution.

Once every 1000 or so runs it hangs. The easiest way to make this happen is to run
GOTRACEBACK=crash stress ../test/fixedbugs/issue27695

Run top and you will see 1 out of a thousand or so invocations of issue27695 hanging.
Grab the PID of one of the hung executions and run gdb and dump all the threads.

rlh@rlh:~/work/go$ gdb -p 49509
GNU gdb (Debian 7.12-6) 7.12.0.20161007-git
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 49509
[New LWP 49510]
[New LWP 49511]
[New LWP 49512]
[New LWP 49513]
[New LWP 49514]
[New LWP 49519]
runtime.futex () at /home/rlh/work/go/src/runtime/sys_linux_amd64.s:536
536		MOVL	AX, ret+40(FP)
warning: File "/home/rlh/work/go/src/runtime/runtime-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /home/rlh/work/go/src/runtime/runtime-gdb.py
line to your configuration file "/home/rlh/.gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/rlh/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
(gdb) thread apply all bt

Thread 7 (LWP 49519):
#0  runtime.futex () at /home/rlh/work/go/src/runtime/sys_linux_amd64.s:536
#1  0x00000000004261f0 in runtime.futexsleep (addr=0x587a48 <runtime.sched+296>, val=0, ns=100000) at /home/rlh/work/go/src/runtime/os_linux.go:63
#2  0x000000000040971e in runtime.notetsleep_internal (n=0x587a48 <runtime.sched+296>, ns=100000, ~r2=<optimized out>) at /home/rlh/work/go/src/runtime/lock_futex.go:193
#3  0x00000000004097f1 in runtime.notetsleep (n=0x587a48 <runtime.sched+296>, ns=100000, ~r2=<optimized out>) at /home/rlh/work/go/src/runtime/lock_futex.go:216
#4  0x000000000042d221 in runtime.forEachP (fn={void (runtime.p *)} 0xc00006ffb8) at /home/rlh/work/go/src/runtime/proc.go:1396
#5  0x000000000044d8fd in runtime.gcMarkDone.func1 () at /home/rlh/work/go/src/runtime/mgc.go:1407
#6  0x000000000044fce6 in runtime.systemstack () at /home/rlh/work/go/src/runtime/asm_amd64.s:351
#7  0x000000000042caf0 in ?? () at /home/rlh/work/go/src/runtime/proc.go:1082
#8  0x0000000000000000 in ?? ()

Thread 6 (LWP 49514):
#0  runtime.futex () at /home/rlh/work/go/src/runtime/sys_linux_amd64.s:536
#1  0x000000000042617b in runtime.futexsleep (addr=0xc0000604c0, val=0, ns=-1) at /home/rlh/work/go/src/runtime/os_linux.go:46
#2  0x000000000040958f in runtime.notesleep (n=0xc0000604c0) at /home/rlh/work/go/src/runtime/lock_futex.go:151
#3  0x000000000042dfe8 in runtime.stopm () at /home/rlh/work/go/src/runtime/proc.go:1936
#4  0x000000000042ea54 in runtime.gcstopm () at /home/rlh/work/go/src/runtime/proc.go:2133
#5  0x000000000042f2d4 in runtime.findrunnable (gp=0xc000026500, inheritTime=false) at /home/rlh/work/go/src/runtime/proc.go:2188
#6  0x000000000042fcb8 in runtime.schedule () at /home/rlh/work/go/src/runtime/proc.go:2525
#7  0x00000000004301d4 in runtime.goschedImpl (gp=0xc000063c80) at /home/rlh/work/go/src/runtime/proc.go:2620
#8  0x00000000004303e4 in runtime.gopreempt_m (gp=0xc000063c80) at /home/rlh/work/go/src/runtime/proc.go:2648
#9  0x000000000043dab9 in runtime.newstack () at /home/rlh/work/go/src/runtime/stack.go:1033
#10 0x000000000044fdbf in runtime.morestack () at /home/rlh/work/go/src/runtime/asm_amd64.s:429
#11 0x0000000000000000 in ?? ()

Thread 5 (LWP 49513):
#0  runtime.futex () at /home/rlh/work/go/src/runtime/sys_linux_amd64.s:536
#1  0x000000000042617b in runtime.futexsleep (addr=0xc000060140, val=0, ns=-1) at /home/rlh/work/go/src/runtime/os_linux.go:46
#2  0x000000000040958f in runtime.notesleep (n=0xc000060140) at /home/rlh/work/go/src/runtime/lock_futex.go:151
#3  0x000000000042dfe8 in runtime.stopm () at /home/rlh/work/go/src/runtime/proc.go:1936
#4  0x000000000042f13a in runtime.findrunnable (gp=0xc000028a00, inheritTime=false) at /home/rlh/work/go/src/runtime/proc.go:2399
#5  0x000000000042fcb8 in runtime.schedule () at /home/rlh/work/go/src/runtime/proc.go:2525
#6  0x00000000004301d4 in runtime.goschedImpl (gp=0xc000062d80) at /home/rlh/work/go/src/runtime/proc.go:2620
#7  0x00000000004303e4 in runtime.gopreempt_m (gp=0xc000062d80) at /home/rlh/work/go/src/runtime/proc.go:2648
#8  0x000000000043dab9 in runtime.newstack () at /home/rlh/work/go/src/runtime/stack.go:1033
#9  0x000000000044fdbf in runtime.morestack () at /home/rlh/work/go/src/runtime/asm_amd64.s:429
#10 0x0000000000000000 in ?? ()

Thread 4 (LWP 49512):
#0  runtime.futex () at /home/rlh/work/go/src/runtime/sys_linux_amd64.s:536
#1  0x000000000042617b in runtime.futexsleep (addr=0xc00003a840, val=0, ns=-1) at /home/rlh/work/go/src/runtime/os_linux.go:46
---Type <return> to continue, or q <return> to quit---
#2  0x000000000040958f in runtime.notesleep (n=0xc00003a840) at /home/rlh/work/go/src/runtime/lock_futex.go:151
#3  0x000000000042dfe8 in runtime.stopm () at /home/rlh/work/go/src/runtime/proc.go:1936
#4  0x000000000042f13a in runtime.findrunnable (gp=0xc000024000, inheritTime=false) at /home/rlh/work/go/src/runtime/proc.go:2399
#5  0x000000000042fcb8 in runtime.schedule () at /home/rlh/work/go/src/runtime/proc.go:2525
#6  0x000000000042ffcc in runtime.park_m (gp=0xc000062a80) at /home/rlh/work/go/src/runtime/proc.go:2605
#7  0x000000000044fc5b in runtime.mcall () at /home/rlh/work/go/src/runtime/asm_amd64.s:299
#8  0x0000000000000000 in ?? ()

Thread 3 (LWP 49511):
#0  runtime.procyield () at /home/rlh/work/go/src/runtime/asm_amd64.s:553
#1  0x000000000042c041 in runtime.scang (gp=0xc000063e00, gcw=0xc000027770) at /home/rlh/work/go/src/runtime/proc.go:923
#2  0x000000000044ddbd in runtime.markroot.func1 () at /home/rlh/work/go/src/runtime/mgcmark.go:221
#3  0x000000000041a796 in runtime.markroot (gcw=0xc000027770, i=26) at /home/rlh/work/go/src/runtime/mgcmark.go:202
#4  0x000000000041c1ff in runtime.gcDrainN (gcw=0xc000027770, scanWork=65536, ~r2=<optimized out>) at /home/rlh/work/go/src/runtime/mgcmark.go:1006
#5  0x000000000041af84 in runtime.gcAssistAlloc1 (gp=0xc000062c00, scanWork=65536) at /home/rlh/work/go/src/runtime/mgcmark.go:512
#6  0x000000000044de73 in runtime.gcAssistAlloc.func1 () at /home/rlh/work/go/src/runtime/mgcmark.go:423
#7  0x000000000044fce6 in runtime.systemstack () at /home/rlh/work/go/src/runtime/asm_amd64.s:351
#8  0x000000000042caf0 in ?? () at /home/rlh/work/go/src/runtime/proc.go:1082
#9  0x0000000000000000 in ?? ()

Thread 2 (LWP 49510):
#0  runtime.usleep () at /home/rlh/work/go/src/runtime/sys_linux_amd64.s:131
#1  0x0000000000433deb in runtime.sysmon () at /home/rlh/work/go/src/runtime/proc.go:4280
#2  0x000000000042cc43 in runtime.mstart1 () at /home/rlh/work/go/src/runtime/proc.go:1206
#3  0x000000000042cb5e in runtime.mstart () at /home/rlh/work/go/src/runtime/proc.go:1172
#4  0x0000000000453ab3 in runtime.clone () at /home/rlh/work/go/src/runtime/sys_linux_amd64.s:587
#5  0x0000000000000000 in ?? ()

Thread 1 (LWP 49509):
#0  runtime.futex () at /home/rlh/work/go/src/runtime/sys_linux_amd64.s:536
#1  0x000000000042617b in runtime.futexsleep (addr=0x588120 <runtime.m0+320>, val=0, ns=-1) at /home/rlh/work/go/src/runtime/os_linux.go:46
#2  0x000000000040958f in runtime.notesleep (n=0x588120 <runtime.m0+320>) at /home/rlh/work/go/src/runtime/lock_futex.go:151
#3  0x000000000042dfe8 in runtime.stopm () at /home/rlh/work/go/src/runtime/proc.go:1936
#4  0x000000000042f13a in runtime.findrunnable (gp=0xc00002af00, inheritTime=false) at /home/rlh/work/go/src/runtime/proc.go:2399
#5  0x000000000042fcb8 in runtime.schedule () at /home/rlh/work/go/src/runtime/proc.go:2525
#6  0x000000000042ffcc in runtime.park_m (gp=0xc000062780) at /home/rlh/work/go/src/runtime/proc.go:2605
#7  0x000000000044fc5b in runtime.mcall () at /home/rlh/work/go/src/runtime/asm_amd64.s:299
#8  0x000000000044fb79 in runtime.rt0_go () at /home/rlh/work/go/src/runtime/asm_amd64.s:201
#9  0x0000000000000000 in ?? ()
(gdb) 

This should be a good starting place.

@aclements

This comment has been minimized.

Member

aclements commented Nov 19, 2018

This is a deadlock between thread 7, which is attempting to force all Ps to a safe point, and thread 3, which must be trying to scan thread 7's user stack. Since neither can preempt the other, they're stuck.

Thread 7 (LWP 49519):
#0  runtime.futex () at /home/rlh/work/go/src/runtime/sys_linux_amd64.s:536
#1  0x00000000004261f0 in runtime.futexsleep (addr=0x587a48 <runtime.sched+296>, val=0, ns=100000) at /home/rlh/work/go/src/runtime/os_linux.go:63
#2  0x000000000040971e in runtime.notetsleep_internal (n=0x587a48 <runtime.sched+296>, ns=100000, ~r2=<optimized out>) at /home/rlh/work/go/src/runtime/lock_futex.go:193
#3  0x00000000004097f1 in runtime.notetsleep (n=0x587a48 <runtime.sched+296>, ns=100000, ~r2=<optimized out>) at /home/rlh/work/go/src/runtime/lock_futex.go:216
#4  0x000000000042d221 in runtime.forEachP (fn={void (runtime.p *)} 0xc00006ffb8) at /home/rlh/work/go/src/runtime/proc.go:1396
#5  0x000000000044d8fd in runtime.gcMarkDone.func1 () at /home/rlh/work/go/src/runtime/mgc.go:1407
#6  0x000000000044fce6 in runtime.systemstack () at /home/rlh/work/go/src/runtime/asm_amd64.s:351
#7  0x000000000042caf0 in ?? () at /home/rlh/work/go/src/runtime/proc.go:1082
#8  0x0000000000000000 in ?? ()
Thread 3 (LWP 49511):
#0  runtime.procyield () at /home/rlh/work/go/src/runtime/asm_amd64.s:553
#1  0x000000000042c041 in runtime.scang (gp=0xc000063e00, gcw=0xc000027770) at /home/rlh/work/go/src/runtime/proc.go:923
#2  0x000000000044ddbd in runtime.markroot.func1 () at /home/rlh/work/go/src/runtime/mgcmark.go:221
#3  0x000000000041a796 in runtime.markroot (gcw=0xc000027770, i=26) at /home/rlh/work/go/src/runtime/mgcmark.go:202
#4  0x000000000041c1ff in runtime.gcDrainN (gcw=0xc000027770, scanWork=65536, ~r2=<optimized out>) at /home/rlh/work/go/src/runtime/mgcmark.go:1006
#5  0x000000000041af84 in runtime.gcAssistAlloc1 (gp=0xc000062c00, scanWork=65536) at /home/rlh/work/go/src/runtime/mgcmark.go:512
#6  0x000000000044de73 in runtime.gcAssistAlloc.func1 () at /home/rlh/work/go/src/runtime/mgcmark.go:423
#7  0x000000000044fce6 in runtime.systemstack () at /home/rlh/work/go/src/runtime/asm_amd64.s:351
#8  0x000000000042caf0 in ?? () at /home/rlh/work/go/src/runtime/proc.go:1082
#9  0x0000000000000000 in ?? ()

So, the cause is clear. The fix is less so. It would be much better if all of these preemption operations were asynchronous; we've had plenty of other similar deadlocks in the past. To fix this bug, it may be enough to allow preemption of forEachP's user goroutine.

@andybons

This comment has been minimized.

Member

andybons commented Nov 28, 2018

@aclements feel free to re-assign as you see fit.

@mknyszek

This comment has been minimized.

Contributor

mknyszek commented Dec 6, 2018

I was able to reproduce and I think I have a fix. As soon as I can confirm it fixes the issue I'll put it up for review.

@gopherbot

This comment has been minimized.

gopherbot commented Dec 6, 2018

Change https://golang.org/cl/153077 mentions this issue: runtime: enable preemption of mark termination goroutine

@mknyszek mknyszek self-assigned this Dec 6, 2018

@mknyszek

This comment has been minimized.

Contributor

mknyszek commented Dec 6, 2018

Running the same program @RLH mentioned above, I'm at 35k iterations with no stale (deadlocked) processes, so hopefully this means its fixed. :)

@gopherbot gopherbot closed this in 578667f Dec 7, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment