Description
What version of Go are you using (go version)?
1.20.3
Does this issue reproduce with the latest release?
Yes
What operating system and processor architecture are you using (go env)?
This happens on m3.medium AWS instances. I could not reproduce it anywhere else so far. This is a single-threaded (one vCPU) amd64 instance.
What did you do?
- Compiled a binary with 1.20.3, with lots of goroutines, heavy network activity
- Launched it on an m3.medium AWS instance
- Saw it crashing with "runtime: checkdead: find g XX in status 1"
What did you expect to see?
No crash
What did you see instead?
A crash with "runtime: checkdead: find g XX in status 1". The crash happens on average after a few hours and appears correlated with network activity: the more network activity (i.e., the more goroutines waiting on the network), the more often it happens.
runtime: checkdead: find g 34876 in status 1
fatal error: checkdead: runnable g

runtime stack:
runtime.throw({0x105b8e6?, 0x300?})
	/usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0xc000a55d58 sp=0xc000a55d28 pc=0x43711d
runtime.checkdead.func1(0xc0009fba00)
	/usr/local/go/src/runtime/proc.go:5216 +0xe5 fp=0xc000a55d90 sp=0xc000a55d58 pc=0x444dc5
runtime.forEachG(0xc000a55df0)
	/usr/local/go/src/runtime/proc.go:591 +0x5c fp=0xc000a55dc0 sp=0xc000a55d90 pc=0x43a8bc
runtime.checkdead()
	/usr/local/go/src/runtime/proc.go:5203 +0xf7 fp=0xc000a55e10 sp=0xc000a55dc0 pc=0x444ab7
runtime.mput(0xc000a55e40?)
	/usr/local/go/src/runtime/proc.go:5696 +0x33 fp=0xc000a55e20 sp=0xc000a55e10 pc=0x446033
runtime.stopm()
	/usr/local/go/src/runtime/proc.go:2332 +0x67 fp=0xc000a55e50 sp=0xc000a55e20 pc=0x43dce7
runtime.findRunnable()
	/usr/local/go/src/runtime/proc.go:3007 +0xabc fp=0xc000a55f58 sp=0xc000a55e50 pc=0x43f55c
runtime.schedule()
	/usr/local/go/src/runtime/proc.go:3360 +0xb1 fp=0xc000a55f90 sp=0xc000a55f58 pc=0x440391
runtime.park_m(0xc000611380?)
	/usr/local/go/src/runtime/proc.go:3511 +0x12d fp=0xc000a55fc0 sp=0xc000a55f90 pc=0x4408ad
runtime.mcall()
	/usr/local/go/src/runtime/asm_amd64.s:452 +0x43 fp=0xc000a55fd0 sp=0xc000a55fc0 pc=0x469f23
Additional details
We've encountered a race condition in the Go runtime that leads to a panic in the checkdead() function when running on a monothreaded instance (GOMAXPROCS=1). The issue seems related to a specific scenario in findRunnable() where a goroutine is readied in the window between findRunnable()'s final checks and the stopm() call on line 3007 acquiring the scheduler lock, leaving the scheduler in an inconsistent state.
Recap:
- findRunnable() cannot find any goroutine to return, reaches the end of the function, and calls stopm().
- stopm() acquires the scheduler lock and calls mput().
- mput() adds the M to the idle pool and calls checkdead().
- There are no active Ms, but checkdead() finds a runnable G and panics.
The race condition appears to be related to the "delicate dance" code in findRunnable(), which tries to handle a race between the network poller readying a goroutine and the M being stopped. However, this race can still occur in rare cases, leading to a panic in checkdead().
To investigate this issue further, we tried to create a minimal reproducer, but we could not trigger the bug consistently because of its rarity. Nonetheless, the runtime stack traces from the checkdead() panics in the real-world scenario clearly show that the panic originates from the specific stopm() call in findRunnable(), indicating a strong correlation between the race condition and the panic.
A temporary fix has been patched into the checkdead() function to detect the inconsistent state caused by the race condition and handle it without panicking immediately. Instead, it collects information and prints debug logs about the state of the goroutines and the runtime stack. It only panics when the inconsistent state occurs too many times in a row or persists for an extended period of time. All occurrences so far were already resolved (i.e., an M was active again) by the next call to checkdead().
This is of course not a viable solution, as it does not fix the root cause of the issue and may introduce other problems.
--- 1.20.3-orig/src/runtime/proc.go
+++ 1.20.3-botigo/src/runtime/proc.go
@@ -5154,6 +5154,9 @@
 	}
 	unlock(&sched.lock)
 }
+
+var inconsistentStateCount uint8
+var inconsistentStateTime int64
 
 // Check for deadlock situation.
 // The check is based on number of running M's, if 0 -> deadlock.
@@ -5199,6 +5202,7 @@
 		throw("checkdead: inconsistent counts")
 	}
 
+	inconsistentState := false
 	grunning := 0
 	forEachG(func(gp *g) {
 		if isSystemGoroutine(gp, false) {
@@ -5212,10 +5216,76 @@
 		case _Grunnable,
 			_Grunning,
 			_Gsyscall:
-			print("runtime: checkdead: find g ", gp.goid, " in status ", s, "\n")
-			throw("checkdead: runnable g")
+			// No M is available, but at least one G is runnable, running, or
+			// in a syscall: this is an inconsistent state.
+			inconsistentState = true
+
+			// If the G is merely runnable, this might just be the race
+			// condition: it was readied by an event right before the sched
+			// lock was acquired. If a G is running or in a syscall, it is a
+			// real issue.
+			if s&^_Gscan != _Grunnable {
+				print("runtime: checkdead: find g ", gp.goid, " in status ", s, "\n")
+				throw("checkdead: runnable g")
+			}
+
+			print("[checkdead] find g ", gp.goid, " in status ", s,
+				" - mcount=", mcount(), " sched.nmidle=", sched.nmidle, " sched.nmidlelocked=", sched.nmidlelocked,
+				" sched.nmsys=", sched.nmsys, " sched.nmspinning=", sched.nmspinning.Load(), "\n")
 		}
 	})
+
+	// If an inconsistent state due to the race condition was detected, print
+	// some debug info. Panic if we get too many occurrences in a row, or if
+	// we have been stuck for too long without reaching that limit.
+	if inconsistentState {
+		// Increase the count of successive inconsistent-state occurrences.
+		inconsistentStateCount++
+
+		// If we hit the inconsistent state too many times in a row, panic.
+		if inconsistentStateCount == 10 {
+			throw("runtime: checkdead: too many successive occurrences of inconsistent state")
+		}
+
+		// On the first occurrence, record the time; on later ones, check how
+		// long we have been stuck.
+		if inconsistentStateCount == 1 {
+			inconsistentStateTime = nanotime()
+		} else if inconsistentStateTime < nanotime()-5e9 { // 5s
+			throw("runtime: checkdead: too much time stuck in successive occurrences of inconsistent state")
+		}
+
+		// Print the runtime stack trace.
+		{
+			print("[checkdead] runtime stack\n")
+			pc := getcallerpc()
+			sp := getcallersp()
+			gp := getg()
+			traceback(pc, sp, 0, gp)
+		}
+
+		// Print each G's state, omitting dead ones, with stack traces for
+		// runnable ones.
+		print("[checkdead] goroutines\n")
+		forEachG(func(gp *g) {
+			if isSystemGoroutine(gp, false) {
+				return
+			}
+
+			s := gp.atomicstatus.Load()
+
+			if s&^_Gscan == _Gdead {
+				return
+			}
+
+			goroutineheader(gp)
+			if s&^_Gscan == _Grunnable {
+				traceback(^uintptr(0), ^uintptr(0), 0, gp)
+			}
+		})
+	} else if inconsistentStateCount > 0 {
+		// Things got back to normal; clear the counters.
+		inconsistentStateCount = 0
+		inconsistentStateTime = 0
+	}
+
 	if grunning == 0 { // possible if main goroutine calls runtime·Goexit()
 		unlock(&sched.lock) // unlock so that GODEBUG=scheddetail=1 doesn't hang
 		fatal("no goroutines (main called runtime.Goexit) - deadlock!")