
runtime: fatal error: checkdead: runnable g #40368

Closed

slon opened this issue Jul 23, 2020 · 27 comments

Labels: FrozenDueToAge, NeedsFix (the path to resolution is known, but the work has not been done)

@slon (Contributor) commented Jul 23, 2020

What version of Go are you using (go version)?

$ go version
go version go1.14.6 linux/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/home/prime/.cache/go-build"
GOENV="/home/prime/.config/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOINSECURE=""
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/prime/Code/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/home/prime/Code/go/src/a.yandex-team.ru/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build172964829=/tmp/go-build -gno-record-gcc-switches"

What did you do?

I encountered the following error. The panic reliably happens in roughly one of every 10,000 jobs on a MapReduce cluster. My binary is built with CGO_ENABLED=0 and is started with GOMAXPROCS=1.

runtime: checkdead: find g 1 in status 1
fatal error: checkdead: runnable g

runtime stack:
runtime.throw(0xb1ac74, 0x15)
	/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0x7ffef3709728 sp=0x7ffef37096f8 pc=0x438e02
runtime.checkdead()
	/usr/local/go/src/runtime/proc.go:4407 +0x3c9 fp=0x7ffef37097a8 sp=0x7ffef3709728 pc=0x4457a9
runtime.mput(0xf2e7e0)
	/usr/local/go/src/runtime/proc.go:4824 +0x50 fp=0x7ffef37097c8 sp=0x7ffef37097a8 pc=0x446ee0
runtime.stopm()
	/usr/local/go/src/runtime/proc.go:1832 +0x7d fp=0x7ffef37097f0 sp=0x7ffef37097c8 pc=0x43f03d
runtime.exitsyscall0(0xc000000180)
	/usr/local/go/src/runtime/proc.go:3268 +0xc6 fp=0x7ffef3709820 sp=0x7ffef37097f0 pc=0x442b86
runtime.mcall(0xf2e280)
	/usr/local/go/src/runtime/asm_amd64.s:318 +0x5b fp=0x7ffef3709830 sp=0x7ffef3709820 pc=0x46919b

goroutine 1 [runnable]:
syscall.Syscall(0x1, 0x4, 0xc0000f4000, 0x1000, 0x1000, 0x1000, 0x0)
	/usr/local/go/src/syscall/asm_linux_amd64.s:18 +0x5 fp=0xc0000bc6e8 sp=0xc0000bc6e0 pc=0x48cb05
syscall.write(0x4, 0xc0000f4000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/syscall/zsyscall_linux_amd64.go:914 +0xb9 fp=0xc0000bc768 sp=0xc0000bc6e8 pc=0x48ade9
syscall.Write(0x4, 0xc0000f4000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/syscall/syscall_unix.go:214 +0x70 fp=0xc0000bc7c8 sp=0xc0000bc768 pc=0x488ff0
internal/poll.(*FD).Write(0xc000054600, 0xc0000f4000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:268 +0x2b7 fp=0xc0000bc900 sp=0xc0000bc7c8 pc=0x4b0a77
os.(*File).write(0xc00000e7a0, 0xc0000f4000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/os/file_unix.go:280 +0x81 fp=0xc0000bc978 sp=0xc0000bc900 pc=0x4bba21
os.(*File).Write(0xc00000e7a0, 0xc0000f4000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
	/usr/local/go/src/os/file.go:153 +0xe9 fp=0xc0000bc9f8 sp=0xc0000bc978 pc=0x4b89c9
bufio.(*Writer).Flush(0xc0000ee180, 0x0, 0x0)
	/usr/local/go/src/bufio/bufio.go:591 +0x141 fp=0xc0000bcaf0 sp=0xc0000bc9f8 pc=0x543a81
bufio.(*Writer).WriteString(0xc0000ee180, 0xc00047bfc2, 0x1a, 0x0, 0x0, 0x0)
	/usr/local/go/src/bufio/bufio.go:694 +0x24b fp=0xc0000bcbd8 sp=0xc0000bcaf0 pc=0x544b0b
a.yandex-team.ru/yt/go/yson.(*Writer).str(0xc0000ee1c0, 0xc00047bfb0, 0x2c)
	/home/prime/Code/go/src/a.yandex-team.ru/yt/go/yson/writer.go:242 +0x29e fp=0xc0000bcce0 sp=0xc0000bcbd8 pc=0x65292e
a.yandex-team.ru/yt/go/yson.(*Writer).String(0xc0000ee1c0, 0xc00047bfb0, 0x2c)
	/home/prime/Code/go/src/a.yandex-team.ru/yt/go/yson/writer.go:371 +0x107 fp=0xc0000bcd40 sp=0xc0000bcce0 pc=0x653aa7
a.yandex-team.ru/yt/go/yson.encodeAny(0xc0000ee1c0, 0xa57740, 0xc000222a38, 0x0, 0x0)
	/home/prime/Code/go/src/a.yandex-team.ru/yt/go/yson/marshal.go:200 +0x20c1 fp=0xc0000bd170 sp=0xc0000bcd40 pc=0x63faf1
a.yandex-team.ru/yt/go/yson.encodeReflectStruct.func1(0xc0000d5d00, 0xa, 0x10, 0x0, 0x0)
	/home/prime/Code/go/src/a.yandex-team.ru/yt/go/yson/marshal.go:463 +0x316 fp=0xc0000bd2b8 sp=0xc0000bd170 pc=0x656dc6
a.yandex-team.ru/yt/go/yson.encodeReflectStruct(0xc0000ee1c0, 0xaf1e40, 0xc000222a10, 0x99, 0x0, 0x0)
	/home/prime/Code/go/src/a.yandex-team.ru/yt/go/yson/marshal.go:487 +0x33d fp=0xc0000bd3d0 sp=0xc0000bd2b8 pc=0x6414ad
a.yandex-team.ru/yt/go/yson.encodeReflect(0xc0000ee1c0, 0xaf1e40, 0xc000222a10, 0x99, 0x0, 0x0)
	/home/prime/Code/go/src/a.yandex-team.ru/yt/go/yson/marshal.go:350 +0x71b fp=0xc0000bd4c8 sp=0xc0000bd3d0 pc=0x64045b
a.yandex-team.ru/yt/go/yson.encodeAny(0xc0000ee1c0, 0xaf1e40, 0xc000222a10, 0x0, 0x0)
	/home/prime/Code/go/src/a.yandex-team.ru/yt/go/yson/marshal.go:326 +0x9a4 fp=0xc0000bd8f8 sp=0xc0000bd4c8 pc=0x63e3d4
a.yandex-team.ru/yt/go/yson.(*Writer).Any(0xc0000ee1c0, 0xaf1e40, 0xc000222a10)
	/home/prime/Code/go/src/a.yandex-team.ru/yt/go/yson/writer.go:668 +0x5d fp=0xc0000bd940 sp=0xc0000bd8f8 pc=0x655d3d
a.yandex-team.ru/yt/go/mapreduce.(*writer).Write(0xc0000e6630, 0xaf1e40, 0xc000222a10, 0x0, 0x0)
	/home/prime/Code/go/src/a.yandex-team.ru/yt/go/mapreduce/writer.go:34 +0x89 fp=0xc0000bd988 sp=0xc0000bd940 pc=0x8cb2c9
a.yandex-team.ru/yt/go/mapreduce.(*writer).MustWrite(0xc0000e6630, 0xaf1e40, 0xc000222a10)
	/home/prime/Code/go/src/a.yandex-team.ru/yt/go/mapreduce/writer.go:39 +0x43 fp=0xc0000bd9d8 sp=0xc0000bd988 pc=0x8cb363
a.yandex-team.ru/yt/jaeger.RebuildIndexJob.Do(0xbb5920, 0xc00006f940, 0xbbec80, 0xc0000b0480, 0xc00000da80, 0x2, 0x2, 0x0, 0x0)
	/home/prime/Code/go/src/a.yandex-team.ru/yt/jaeger/rebuild_index.go:28 +0x2e8 fp=0xc0000bdbf8 sp=0xc0000bd9d8 pc=0xa04998
a.yandex-team.ru/yt/jaeger.(*RebuildIndexJob).Do(0xf58688, 0xbb5920, 0xc00006f940, 0xbbec80, 0xc0000b0480, 0xc00000da80, 0x2, 0x2, 0x0, 0x0)
	<autogenerated>:1 +0xa9 fp=0xc0000bdc68 sp=0xc0000bdbf8 pc=0xa084e9
a.yandex-team.ru/yt/go/mapreduce.JobMain(0x0)
	/home/prime/Code/go/src/a.yandex-team.ru/yt/go/mapreduce/main.go:72 +0x9d1 fp=0xc0000bdf60 sp=0xc0000bdc68 pc=0x8c4d61
main.main()
	/home/prime/Code/go/src/a.yandex-team.ru/yt/jaeger/yt-jaeger/main.go:12 +0x35 fp=0xc0000bdf88 sp=0xc0000bdf60 pc=0xa0c6e5
runtime.main()
	/usr/local/go/src/runtime/proc.go:203 +0x1c8 fp=0xc0000bdfe0 sp=0xc0000bdf88 pc=0x43b4a8
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc0000bdfe8 sp=0xc0000bdfe0 pc=0x46b141

goroutine 2 [force gc (idle)]:
runtime.gopark(0xb3b238, 0xf2ccc0, 0x1411, 0x1)
	/usr/local/go/src/runtime/proc.go:304 +0xcb fp=0xc00003af88 sp=0xc00003af58 pc=0x43b86b
runtime.goparkunlock(0xf2ccc0, 0x1411, 0x1)
	/usr/local/go/src/runtime/proc.go:310 +0x53 fp=0xc00003afb8 sp=0xc00003af88 pc=0x43b923
runtime.forcegchelper()
	/usr/local/go/src/runtime/proc.go:253 +0xaa fp=0xc00003afe0 sp=0xc00003afb8 pc=0x43b70a
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00003afe8 sp=0xc00003afe0 pc=0x46b141
created by runtime.init.6
	/usr/local/go/src/runtime/proc.go:242 +0x35

goroutine 3 [GC sweep wait]:
runtime.gopark(0xb3b238, 0xf2d300, 0x46140c, 0x1)
	/usr/local/go/src/runtime/proc.go:304 +0xcb fp=0xc00003b780 sp=0xc00003b750 pc=0x43b86b
runtime.goparkunlock(0xf2d300, 0xba140c, 0x1)
	/usr/local/go/src/runtime/proc.go:310 +0x53 fp=0xc00003b7b0 sp=0xc00003b780 pc=0x43b923
runtime.bgsweep(0xc000052000)
	/usr/local/go/src/runtime/mgcsweep.go:89 +0x101 fp=0xc00003b7d8 sp=0xc00003b7b0 pc=0x427161
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00003b7e0 sp=0xc00003b7d8 pc=0x46b141
created by runtime.gcenable
	/usr/local/go/src/runtime/mgc.go:214 +0x5c

goroutine 4 [sleep]:
runtime.gopark(0xb3b238, 0xf2d2c0, 0xc000011313, 0x2)
	/usr/local/go/src/runtime/proc.go:304 +0xcb fp=0xc00003bf20 sp=0xc00003bef0 pc=0x43b86b
runtime.goparkunlock(0xf2d2c0, 0x2c99138c01313, 0x2)
	/usr/local/go/src/runtime/proc.go:310 +0x53 fp=0xc00003bf50 sp=0xc00003bf20 pc=0x43b923
runtime.scavengeSleep(0x928f8, 0x13b74e7)
	/usr/local/go/src/runtime/mgcscavenge.go:214 +0x83 fp=0xc00003bf80 sp=0xc00003bf50 pc=0x4255b3
runtime.bgscavenge(0xc000052000)
	/usr/local/go/src/runtime/mgcscavenge.go:337 +0x1ce fp=0xc00003bfd8 sp=0xc00003bf80 pc=0x4257ae
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00003bfe0 sp=0xc00003bfd8 pc=0x46b141
created by runtime.gcenable
	/usr/local/go/src/runtime/mgc.go:215 +0x7e

goroutine 5 [finalizer wait]:
runtime.gopark(0xb3b238, 0xf585d0, 0xf51410, 0x1)
	/usr/local/go/src/runtime/proc.go:304 +0xcb fp=0xc00003a728 sp=0xc00003a6f8 pc=0x43b86b
runtime.goparkunlock(0xf585d0, 0xb31410, 0x1)
	/usr/local/go/src/runtime/proc.go:310 +0x53 fp=0xc00003a758 sp=0xc00003a728 pc=0x43b923
runtime.runfinq()
	/usr/local/go/src/runtime/mfinal.go:175 +0x96 fp=0xc00003a7e0 sp=0xc00003a758 pc=0x41cec6
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00003a7e8 sp=0xc00003a7e0 pc=0x46b141
created by runtime.createfing
	/usr/local/go/src/runtime/mfinal.go:156 +0x61

goroutine 6 [GC worker (idle)]:
runtime.gopark(0xb3b0b8, 0xc0004340f0, 0x1418, 0x0)
	/usr/local/go/src/runtime/proc.go:304 +0xcb fp=0xc00003c758 sp=0xc00003c728 pc=0x43b86b
runtime.gcBgMarkWorker(0xc00002c000)
	/usr/local/go/src/runtime/mgc.go:1873 +0x119 fp=0xc00003c7d8 sp=0xc00003c758 pc=0x420659
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc00003c7e0 sp=0xc00003c7d8 pc=0x46b141
created by runtime.gcBgMarkStartWorkers
	/usr/local/go/src/runtime/mgc.go:1821 +0x77

I'm unable to reproduce this issue locally, but I was able to collect a core dump. The binary and core dump are inside the attached checkdead.zip archive. My binary reads data from stdin and writes data to stdout, without spawning any additional goroutines.

I would be glad to hear any suggestions on how to further diagnose this issue.
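
A hypothetical minimal analogue of such a job (not the actual code, just its shape: a single goroutine copying stdin to stdout, run with GOMAXPROCS=1):

package main

import (
        "bufio"
        "io"
        "os"
)

func main() {
        // A plain stdin -> stdout pipeline with no extra goroutines;
        // GOMAXPROCS=1 is set in the environment.
        in := bufio.NewReader(os.Stdin)
        out := bufio.NewWriter(os.Stdout)
        defer out.Flush()
        if _, err := io.Copy(out, in); err != nil {
                os.Exit(1)
        }
}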

@ALTree ALTree changed the title from "fatal error: checkdead: runnable g" to "runtime: fatal error: checkdead: runnable g" Jul 23, 2020
@ALTree ALTree added the NeedsInvestigation label (someone must examine and confirm this is a valid issue and not a duplicate of an existing one) Jul 23, 2020
@ALTree (Member) commented Jul 23, 2020

cc @aclements @mknyszek

@slon (Contributor, author) commented Jul 23, 2020

I tried running a couple of other configurations:

  • The issue does not reproduce with Go 1.13.14.
  • The issue does not reproduce with CGO_ENABLED=1.
  • The issue does not reproduce with GOMAXPROCS=2.

@aclements (Member)

cc @prattmic

@slon, can you try reproducing with GODEBUG=asyncpreemptoff=1?

@slon (Contributor, author) commented Jul 23, 2020

The issue still reproduces with GODEBUG=asyncpreemptoff=1.

@ianlancetaylor ianlancetaylor added this to the Go1.16 milestone Jul 23, 2020
@ianlancetaylor (Contributor)

Can you reproduce it with the environment variable GODEBUG=schedtrace=1,scheddetail=1?

@prattmic (Member)

Dealing with gdb is a pain, but if I'm reading the stack correctly, I believe this G exited the syscall, failed to acquire a P, added itself to the global runq, and then called stopm. That mput -> checkdead is the one throwing.

(gdb) p /x 'runtime.allgs'
$21 = []*runtime.g = {0xc000000180, 0xc000000600, 0xc000000780, 0xc000000900, 0xc000000d80, 0xc000000f00}
(gdb) p /x 'runtime.sched'.runq
$19 = {
  head = 0xc000000180, 
  tail = 0xc000000180
}
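
In outline, that path looks like this (abridged pseudocode based on go1.14's runtime/proc.go, not the verbatim source):

// Abridged pseudocode of the go1.14 syscall-exit slow path.
func exitsyscall0(gp *g) {
        lock(&sched.lock)
        _p_ := pidleget() // try to grab an idle P
        if _p_ == nil {
                globrunqput(gp) // no P: leave the G on the global runq
        }
        unlock(&sched.lock)
        if _p_ != nil {
                acquirep(_p_)
                execute(gp, false) // got a P: run the G; does not return
        }
        stopm() // no P: park this M; stopm -> mput -> checkdead,
                // which throws because gp is still runnable on the
                // global runq with no M left to run it
}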

@prattmic (Member)

Also of interest is that sysmon is blocked waiting on sched.lock:

Thread 3 (LWP 35):
#0  runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:568
#1  0x00000000004350b6 in runtime.futexsleep (addr=0xf2dfb8 <runtime.sched+24>, val=2, ns=-1) at /usr/local/go/src/runtime/os_linux.go:45
#2  0x000000000040bb0c in runtime.lock (l=0xf2dfb8 <runtime.sched+24>) at /usr/local/go/src/runtime/lock_futex.go:102
#3  0x000000000043ca57 in runtime.mcommoninit (mp=0xc00003e700) at /usr/local/go/src/runtime/proc.go:616
#4  0x000000000043e43d in runtime.allocm (_p_=0xc00002c000, fn={void (void)} 0xc00004dec8, ~r2=<optimized out>) at /usr/local/go/src/runtime/proc.go:1390
#5  0x000000000043eb89 in runtime.newm (fn={void (void)} 0xc00004def0, _p_=0xc00002c000) at /usr/local/go/src/runtime/proc.go:1704
#6  0x000000000043f269 in runtime.startm (_p_=0xc00002c000, spinning=false) at /usr/local/go/src/runtime/proc.go:1875
#7  0x0000000000445b3a in runtime.sysmon () at /usr/local/go/src/runtime/proc.go:4539
#8  0x000000000043db71 in runtime.mstart1 () at /usr/local/go/src/runtime/proc.go:1097
#9  0x000000000043da74 in runtime.mstart () at /usr/local/go/src/runtime/proc.go:1062
#10 0x000000000046d333 in runtime.clone () at /usr/local/go/src/runtime/sys_linux_amd64.s:619
#11 0x0000000000000000 in ?? ()

At first glance this isn't surprising since checkdead is holding sched.lock during the throw. What is surprising is that it is stuck on the lock in mcommoninit. startm also takes sched.lock but releases it before calling newm. This implies that the startm call started before checkdead, which makes me rather suspicious of a race condition with sysmon (perhaps taking back the P from the syscall G).

@slon (Contributor, author) commented Jul 23, 2020

I'm unable to reproduce it with GODEBUG=schedtrace=1,scheddetail=1.

@slon (Contributor, author) commented Jul 23, 2020

I think I found a sequence of events that might lead to this error.

  1. G1 is blocked in a syscall.
  2. sysmon takes the P and starts another M to run GC.
  3. G1 returns from the syscall and fails to acquire a P.
  4. GC finishes; the other M stops.
  5. G1 calls stopm() and triggers the panic.

Is that plausible?

@ianlancetaylor (Contributor)

@slon The code in exitsyscall0 acquires the scheduler lock, then tries to get a P. In this case it fails to get a P, so it adds the G to the global run queue. So I think the only way that your scenario can lead to trouble is if, when the M running GC stops, it does not decide to run the G that is on the global run queue. I don't yet see how that could happen.

@prattmic (Member)

For posterity, I've more explicitly walked the stack to confirm the call path in #40368 (comment) is correct.

(gdb) set $start = 0xc00004fdf0
(gdb) while $start < $end
 >printf "start: 0x%lx, pc: 0x%lx\n", $start, *(void **)($start+8)
 >info symbol *(void **)($start+8)
 >set $start = *(void **)($start)
 >end
start: 0xc00004fdf0, pc: 0x44feaa
runtime.crash + 26 in section .text of /home/mpratt_google_com/yt-jaeger
start: 0xc00004fe08, pc: 0x467d21
runtime.fatalthrow.func1 + 129 in section .text of /home/mpratt_google_com/yt-jaeger
start: 0xc00004fe50, pc: 0x438fc7
runtime.fatalthrow + 87 in section .text of /home/mpratt_google_com/yt-jaeger
start: 0xc00004fe88, pc: 0x438e02
runtime.throw + 114 in section .text of /home/mpratt_google_com/yt-jaeger
start: 0xc00004feb8, pc: 0x4457a9
runtime.checkdead + 969 in section .text of /home/mpratt_google_com/yt-jaeger
start: 0xc00004ff38, pc: 0x446ee0
runtime.mput + 80 in section .text of /home/mpratt_google_com/yt-jaeger
start: 0xc00004ff58, pc: 0x43f03d
runtime.stopm + 125 in section .text of /home/mpratt_google_com/yt-jaeger
start: 0xc00004ff80, pc: 0x442b86
runtime.exitsyscall0 + 198 in section .text of /home/mpratt_google_com/yt-jaeger
start: 0xc00004ffb0, pc: 0x46919b
runtime.mcall + 91 in section .text of /home/mpratt_google_com/yt-jaeger

@ianlancetaylor (Contributor)

@prattmic I think you've found the problem. sysmon calls startm which acquires the only P and releases the scheduler lock. exitsyscall0 acquires the scheduler lock, finds no P, puts the G on the global run queue, releases the lock, calls stopm which calls checkdead. At this point there is no P available, because startm snagged it. But there is also no M available, because startm hasn't yet made it to the line sched.mnext++ in mcommoninit, and it will never get there because checkdead is holding the scheduler lock.

In short, the problem is that sysmon -> startm -> newm can acquire a P before creating an M to run on that P. When GOMAXPROCS == 1, we are in a situation where there are no P's and no M's, so things look bleak.
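
The window is visible in startm itself. An abridged sketch based on go1.14's runtime/proc.go (comments are annotations, not the original source):

// Abridged from go1.14 runtime/proc.go; error paths omitted.
func startm(_p_ *p, spinning bool) {
        lock(&sched.lock)
        if _p_ == nil {
                _p_ = pidleget() // may take the only P
        }
        mp := mget()        // look for an idle M
        unlock(&sched.lock) // <-- the window opens here
        if mp == nil {
                var fn func()
                if spinning {
                        fn = mspinning
                }
                // sched.mnext is not incremented until newm -> allocm ->
                // mcommoninit reacquires sched.lock, so until then
                // checkdead's mcount() does not see the new M.
                newm(fn, _p_)
                return
        }
        // ... wake the idle M to run _p_ ...
}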

@ianlancetaylor (Contributor)

One fix might be for startm to say somewhere "I am starting an M" when mget returns nil, before releasing the scheduler lock. Then mcommoninit would somehow remove that statement after it increments sched.mnext.

@prattmic (Member)

After spending too long on this, I'm able to reproduce the crash with a program racing syscall exit with timer expiration, plus strategically placed sleeps in the runtime to help increase the likelihood of racing properly:

package main

import (
        "syscall"
        "time"
)

func main() {
        // The timer gives sysmon (and the spinning M's netpoll) a reason
        // to call startm at a predictable moment.
        t := time.NewTimer(100 * time.Millisecond)
        defer t.Stop()

        // Sleep just shy of the timer deadline, so that syscall exit
        // races with timer expiration.
        ts := syscall.Timespec{
                Nsec: int64(100*time.Millisecond - 4*time.Microsecond),
        }
        for {
                if err := syscall.Nanosleep(&ts, nil); err != syscall.EINTR {
                        break
                }
        }
}

A program like this is going to have three Ms:

  1. One running the main G in the nanosleep syscall.
  2. A spinning M in the findrunnable netpoll, waiting for timer expiration.
  3. The sysmon M, also waiting for timer expiration.

In order to trigger the crash, we need the race to proceed in this order:

  1. Sysmon wakes up, calls startm to handle the timer.
  2. startm takes the only P, but does not find an idle M. Releases sched.lock. (Does not make it to lock in mcommoninit)
  3. netpoll in findrunnable completes, runs timer (chan send with no waiter), then goes to sleep (stopm, this ensures checkdead won't see this M as running).
  4. Syscall completes, takes sched.lock, finds no P, adds to global runq and goes to sleep.
  5. checkdead detects that there are no running Ms (because mcommoninit hasn't run yet). Panic!

Perhaps a more complex program could convince the spinning M to already be stopped, in which case this race would be much easier to trigger? As-is, it is very tricky. I'll try to send out a fix on Monday, but it will likely not include a test due to the difficulty of triggering.

runtime: checkdead: find g 1 in status 1                                                                                                     
fatal error: checkdead: runnable g                                                                                                           
                                                                                                                                             
runtime stack:                                                                                                                               
runtime.throw(0x48c091, 0x15)                                                                                                                
        /usr/local/google/home/mpratt/src/go/src/runtime/panic.go:1116 +0x72                          
runtime.checkdead()                                                                                                                          
        /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:4665 +0x548                          
runtime.mput(0x4f0680)                                                                                                                       
        /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:5119 +0x9e                           
runtime.stopm()                                                                                                                              
        /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:1947 +0x7f                           
runtime.exitsyscall0(0xc000000180)                                                                                                           
        /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:3490 +0x1c5                          
runtime.mcall(0x0)                                                                                                                           
        /usr/local/google/home/mpratt/src/go/src/runtime/asm_amd64.s:318 +0x5b                        
                                                                                                                                             
goroutine 1 [runnable]:                                                                                                                      
syscall.Syscall(0x23, 0xc000034750, 0x0, 0x0, 0x0, 0x0, 0x0)                                                                                 
        /usr/local/google/home/mpratt/src/go/src/syscall/asm_linux_amd64.s:18 +0x5
syscall.Nanosleep(0xc000034750, 0x0, 0x7fcc503a2000, 0x1a099)
        /usr/local/google/home/mpratt/src/go/src/syscall/zsyscall_linux_amd64.go:640 +0x5f
main.main()
        /usr/local/google/home/mpratt/src/go/src/40368.go:31 +0x7e
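
Step 5 falls out of how checkdead counts running Ms. An abridged sketch based on go1.14's runtime/proc.go:

// Abridged from go1.14 runtime/proc.go.
func checkdead() {
        // mcount() is sched.mnext - sched.nmfreed, so an M whose ID has
        // not yet been allocated (mcommoninit has not run) is invisible.
        run := mcount() - sched.nmidle - sched.nmidlelocked - sched.nmsys
        if run > 0 {
                return // at least one M appears to be running
        }
        // No running Ms: any runnable G means deadlock-with-work.
        for _, gp := range allgs {
                switch readgstatus(gp) {
                case _Grunnable, _Grunning, _Gsyscall:
                        print("runtime: checkdead: find g ", gp.goid,
                                " in status ", readgstatus(gp), "\n")
                        throw("checkdead: runnable g") // the fatal error above
                }
        }
        // ...
}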

@prattmic prattmic modified the milestones: Go1.16, Go1.14.7 Jul 24, 2020
@prattmic prattmic added the NeedsFix label (the path to resolution is known, but the work has not been done) and removed the NeedsInvestigation label Jul 24, 2020
@prattmic (Member)

@gopherbot backport to 1.14 please. This crash could affect any program running with GOMAXPROCS=1.

@gopherbot

Backport issue(s) opened: #40398 (for 1.14).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@prattmic prattmic self-assigned this Jul 24, 2020
@networkimprov

@prattmic the backport issue automatically has a 1.14.x milestone, so this would be 1.15 (and release-blocker?) or 1.16.

@ALTree (Member) commented Jul 25, 2020

As networkimprov said, the backport already has a 1.14.7 milestone, so I'm putting this one in 1.16 (move it to 1.15 if you think this is 1.15 material).

@ALTree ALTree modified the milestones: Go1.14.7, Go1.16 Jul 25, 2020
@apmattil

Would this kind of backtrace be related to this issue?
(dlv) bt
0 0x0000000000465511 in runtime.raise
at /usr/local/go/src/runtime/sys_linux_amd64.s:165
1 0x0000000000449c7b in runtime.crash
at /usr/local/go/src/runtime/signal_unix.go:822
2 0x000000c0000098e0 in ???
at ?:-1
3 0x0000000000448b84 in runtime.restoreGsignalStack
at /usr/local/go/src/runtime/signal_unix.go:1156
4 0x0000000000448b84 in runtime.sigtrampgo
at /usr/local/go/src/runtime/signal_unix.go:455
5 0x0000000000465873 in runtime.sigtramp
at /usr/local/go/src/runtime/sys_linux_amd64.s:389
6 0x0000000000465960 in runtime.sigreturn
at /usr/local/go/src/runtime/sys_linux_amd64.s:481
7 0x0000000000449e0a in runtime.noSignalStack
at /usr/local/go/src/runtime/signal_unix.go:870
8 0x0000000001355bc0 in ???
at ?:-1
9 0x0000000000433c47 in runtime.startpanic_m
at /usr/local/go/src/runtime/panic.go:1277
10 0x00007ffc657265e8 in ???
at ?:-1
11 0x0000000000440a12 in runtime.checkdead
at /usr/local/go/src/runtime/proc.go:4386
12 0x0000000000439eb5 in runtime.startm
at /usr/local/go/src/runtime/proc.go:1851
13 0x000000c00003c000 in ???
at ?:-1
14 0x000000000043c03c in runtime.schedule
at /usr/local/go/src/runtime/proc.go:2485
15 0x000000000043cff6 in runtime.reentersyscall
at /usr/local/go/src/runtime/proc.go:2935
16 0x000000c00004c7c0 in ???
at ?:-1
17 0x0000000000463956 in runtime.goexit
at /usr/local/go/src/runtime/asm_amd64.s:1374
(dlv) frame 3

runtime.raise() /usr/local/go/src/runtime/sys_linux_amd64.s:165 (PC: 0x465511)
Warning: debugging optimized function
Frame 3: /usr/local/go/src/runtime/signal_unix.go:1156 (PC: 448b84)
1151: // before entering the signal handler.
1152: //go:nosplit
1153: //go:nowritebarrierrec
1154: func restoreGsignalStack(st *gsignalStack) {
1155: gp := getg().m.gsignal
=>1156: gp.stack = st.stack
1157: gp.stackguard0 = st.stackguard0
1158: gp.stackguard1 = st.stackguard1
1159: gp.stktopsp = st.stktopsp
1160: }
1161:
(dlv) p st
Command failed: could not find symbol value for st
(dlv) locals
(no locals)
(dlv) args
(no args)
(dlv) p st
Command failed: could not find symbol value for st
(dlv) p gp
Command failed: could not find symbol value for gp
(dlv) frame 9
runtime.raise() /usr/local/go/src/runtime/sys_linux_amd64.s:165 (PC: 0x465511)
Warning: debugging optimized function
Frame 9: /usr/local/go/src/runtime/panic.go:1277 (PC: 433c47)
1272: case 2:
1273: // This is a genuine bug in the runtime, we couldn't even
1274: // print the stack trace successfully.
1275: g.m.dying = 3
1276: print("stack trace unavailable\n")
=>1277: exit(4)
1278: fallthrough
1279: default:
1280: // Can't even print! Just exit.
1281: exit(5)
1282: return false // Need to return something.
(dlv) args
(no args)


@prattmic (Member)

@networkimprov @ALTree thanks, this is fine. If it doesn't make 1.15 it will probably want to be in the first minor release, but I don't think it needs to be blocking.

@apmattil it is difficult to tell from the stack trace. What was the panic message printed by that crash?

@apmattil commented Jul 27, 2020 via email

@gopherbot

Change https://golang.org/cl/245018 mentions this issue: runtime: ensure startm new M is consistently visible to checkdead

@prattmic (Member)

@apmattil If you are running 1.14.6, then the line number in the stack trace indicates it is this case: https://github.com/golang/go/blob/go1.14.6/src/runtime/proc.go#L4386

That is different from this bug, so you should file a separate issue for that.

@apmattil commented Jul 27, 2020 via email

@prattmic (Member)

@apmattil thanks, that's helpful. I opened #40434, as that is a different issue.

@prattmic prattmic modified the milestones: Go1.16, Go1.15 Jul 28, 2020
@gopherbot

Change https://golang.org/cl/245297 mentions this issue: [release-branch.go1.14] runtime: ensure startm new M is consistently visible to checkdead

@gopherbot

Change https://golang.org/cl/246199 mentions this issue: [release-branch.go1.13] runtime: ensure startm new M is consistently visible to checkdead

gopherbot pushed a commit that referenced this issue Aug 22, 2020
[release-branch.go1.14] runtime: ensure startm new M is consistently visible to checkdead

If no M is available, startm first grabs an idle P, then drops
sched.lock and calls newm to start a new M to run that P.

Unfortunately, that leaves a window in which a G (e.g., returning from a
syscall) may find no idle P, add to the global runq, and then in stopm
discover that there are no running M's, a condition that should be
impossible with runnable G's.

To avoid this condition, we pre-allocate the new M ID in startm before
dropping sched.lock. This ensures that checkdead will see the M as
running, and since that new M must eventually run the scheduler, it will
handle any pending work as necessary.

Outside of startm, most other calls to newm/allocm don't have a P at
all. The only exception is startTheWorldWithSema, which always has an M
if there is 1 P (i.e., the currently running M), and if there is >1 P
the findrunnable spinning dance ensures the problem never occurs.

This has been tested with strategically placed sleeps in the runtime to
help induce the correct race ordering, but the timing on this is too
narrow for a test that can be checked in.

For #40368
Fixes #40398

Change-Id: If5e0293a430cc85154b7ed55bc6dadf9b340abe2
Reviewed-on: https://go-review.googlesource.com/c/go/+/245018
Run-TryBot: Michael Pratt <mpratt@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
(cherry picked from commit 85afa2e)
Reviewed-on: https://go-review.googlesource.com/c/go/+/245297
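
An abridged sketch of startm after this fix (based on the change as it landed for go1.15; mReserveID is the helper it introduced):

// Abridged sketch of the fixed startm (CL 245018, go1.15).
func startm(_p_ *p, spinning bool) {
        lock(&sched.lock)
        if _p_ == nil {
                _p_ = pidleget()
        }
        mp := mget()
        if mp == nil {
                // Reserve the new M's ID *before* dropping sched.lock,
                // so mcount() (and hence checkdead) already accounts
                // for the M we are about to create.
                id := mReserveID()
                unlock(&sched.lock)
                var fn func()
                if spinning {
                        fn = mspinning
                }
                newm(fn, _p_, id) // newm/allocm now take the reserved ID
                return
        }
        unlock(&sched.lock)
        // ... wake the idle M to run _p_ ...
}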
@golang golang locked and limited conversation to collaborators Jul 31, 2021
@prattmic prattmic self-assigned this Jun 24, 2022