Description
What version of Go are you using (go version)?
1.20.3
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
This happens on m3.medium AWS instances. I could not reproduce it anywhere else so far. This is a single-threaded (one vCPU) amd64 instance.
What did you do?
- Compiled a binary with 1.20.3, with lots of goroutines, heavy network activity
- Launched it on an m3.medium AWS instance
- Saw it crashing with "runtime: checkdead: find g XX in status 1"
What did you expect to see?
No crash
What did you see instead?
A crash with "runtime: checkdead: find g XX in status 1". The crash happens on average after a few hours and appears correlated with network activity: the more network activity (i.e., the more goroutines waiting on the network), the more often it happens.
runtime: checkdead: find g 34876 in status 1
fatal error: checkdead: runnable g

runtime stack:
runtime.throw({0x105b8e6?, 0x300?})
	/usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0xc000a55d58 sp=0xc000a55d28 pc=0x43711d
runtime.checkdead.func1(0xc0009fba00)
	/usr/local/go/src/runtime/proc.go:5216 +0xe5 fp=0xc000a55d90 sp=0xc000a55d58 pc=0x444dc5
runtime.forEachG(0xc000a55df0)
	/usr/local/go/src/runtime/proc.go:591 +0x5c fp=0xc000a55dc0 sp=0xc000a55d90 pc=0x43a8bc
runtime.checkdead()
	/usr/local/go/src/runtime/proc.go:5203 +0xf7 fp=0xc000a55e10 sp=0xc000a55dc0 pc=0x444ab7
runtime.mput(0xc000a55e40?)
	/usr/local/go/src/runtime/proc.go:5696 +0x33 fp=0xc000a55e20 sp=0xc000a55e10 pc=0x446033
runtime.stopm()
	/usr/local/go/src/runtime/proc.go:2332 +0x67 fp=0xc000a55e50 sp=0xc000a55e20 pc=0x43dce7
runtime.findRunnable()
	/usr/local/go/src/runtime/proc.go:3007 +0xabc fp=0xc000a55f58 sp=0xc000a55e50 pc=0x43f55c
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:3360 +0xb1 fp=0xc000a55f90 sp=0xc000a55f58 pc=0x440391
runtime.park_m(0xc000611380?)
	/usr/local/go/src/runtime/proc.go:3511 +0x12d fp=0xc000a55fc0 sp=0xc000a55f90 pc=0x4408ad
runtime.mcall()
	/usr/local/go/src/runtime/asm_amd64.s:452 +0x43 fp=0xc000a55fd0 sp=0xc000a55fc0 pc=0x469f23
Additional details
We've encountered a race condition in the Go runtime that leads to a panic in the checkdead() function when running on a monothreaded instance (GOMAXPROCS=1). The issue seems related to a specific scenario in findRunnable() where a goroutine is readied in the window between findRunnable()'s final checks and the stopm() call on line 3007 acquiring the scheduler lock, leaving the scheduler in an inconsistent state.
Recap:
- findRunnable() cannot find any goroutine to return, reaches the end of the function, and calls stopm().
- stopm() acquires the scheduler lock and calls mput().
- mput() adds the M to the idle pool and calls checkdead().
- There are no active Ms, but checkdead() finds a runnable G and panics.
The race condition appears to be related to the "delicate dance" code in findRunnable(), which tries to handle a race between the network poller readying a goroutine and the M being stopped. However, this race can still occur in rare cases, leading to a panic in checkdead().
To investigate this issue further, we tried to create a minimal reproducer, but we could not trigger the bug consistently because of its rarity. Nonetheless, the runtime stack traces from the checkdead() panics in the real-world scenario clearly show that the panic originates from the specific stopm() call in findRunnable(), indicating a strong correlation between the race condition and the panic.
A temporary fix has been patched into the checkdead() function to detect the inconsistent state caused by the race condition and handle it without panicking immediately. Instead, it collects information and prints debug logs about the state of the goroutines and the runtime stack. It only panics when the inconsistent state occurs too many times in a row or persists for an extended period of time. All occurrences so far were already resolved (i.e., an M was active again) by the next call to checkdead().
This is of course not a viable solution, as it does not fix the root cause of the issue and may introduce other problems.
--- 1.20.3-orig/src/runtime/proc.go
+++ 1.20.3-botigo/src/runtime/proc.go
@@ -5154,6 +5154,9 @@
 	}
 	unlock(&sched.lock)
 }
+
+var inconsistentStateCount uint8
+var inconsistentStateTime int64
 
 // Check for deadlock situation.
 // The check is based on number of running M's, if 0 -> deadlock.
@@ -5199,6 +5202,7 @@
 		throw("checkdead: inconsistent counts")
 	}
 
+	inconsistentState := false
 	grunning := 0
 	forEachG(func(gp *g) {
 		if isSystemGoroutine(gp, false) {
@@ -5212,10 +5216,76 @@
 		case _Grunnable,
 			_Grunning,
 			_Gsyscall:
-			print("runtime: checkdead: find g ", gp.goid, " in status ", s, "\n")
-			throw("checkdead: runnable g")
+			// No M is available, but at least one G is runnable, running, or
+			// in a syscall: this is an inconsistent state.
+			inconsistentState = true
+
+			// If the G is merely runnable, this might just be the race
+			// condition: it was readied by an event right before the sched
+			// lock was acquired. If a G is running or in a syscall, it is a
+			// real issue.
+			if s&^_Gscan != _Grunnable {
+				print("runtime: checkdead: find g ", gp.goid, " in status ", s, "\n")
+				throw("checkdead: runnable g")
+			}
+
+			print("[checkdead] find g ", gp.goid, " in status ", s,
+				" - mcount=", mcount(), " sched.nmidle=", sched.nmidle, " sched.nmidlelocked=", sched.nmidlelocked,
+				" sched.nmsys=", sched.nmsys, " sched.nmspinning=", sched.nmspinning.Load(), "\n")
 		}
 	})
+
+	// If an inconsistent state due to the race condition was detected, print
+	// some debug info. Panic if we get too many occurrences in a row, or if
+	// we have been stuck for too long without reaching that limit.
+	if inconsistentState {
+		// Increase the count of successive inconsistent-state occurrences.
+		inconsistentStateCount++
+
+		// If we hit the inconsistent state too many times in a row, panic.
+		if inconsistentStateCount == 10 {
+			throw("runtime: checkdead: too many successive occurrences of inconsistent state")
+		}
+
+		// On the first occurrence, record the time; on later ones, check how
+		// long we have been stuck.
+		if inconsistentStateCount == 1 {
+			inconsistentStateTime = nanotime()
+		} else if inconsistentStateTime < nanotime()-5e9 { // 5s
+			throw("runtime: checkdead: too much time stuck in successive occurrences of inconsistent state")
+		}
+
+		// Print the runtime stack trace.
+		{
+			print("[checkdead] runtime stack\n")
+			pc := getcallerpc()
+			sp := getcallersp()
+			gp := getg()
+			traceback(pc, sp, 0, gp)
+		}
+
+		// Print each G's state, omitting dead ones, with stack traces for
+		// runnable ones.
+		print("[checkdead] goroutines\n")
+		forEachG(func(gp *g) {
+			if isSystemGoroutine(gp, false) {
+				return
+			}
+
+			s := gp.atomicstatus.Load()
+
+			if s&^_Gscan == _Gdead {
+				return
+			}
+
+			goroutineheader(gp)
+			if s&^_Gscan == _Grunnable {
+				traceback(^uintptr(0), ^uintptr(0), 0, gp)
+			}
+		})
+	} else if inconsistentStateCount > 0 {
+		// Things got back to normal; clear the counters.
+		inconsistentStateCount = 0
+		inconsistentStateTime = 0
+	}
+
 	if grunning == 0 { // possible if main goroutine calls runtime·Goexit()
 		unlock(&sched.lock) // unlock so that GODEBUG=scheddetail=1 doesn't hang
 		fatal("no goroutines (main called runtime.Goexit) - deadlock!")