runtime: SIGQUIT crash loop + startm race condition deadlock #65138
Labels: compiler/runtime, NeedsDecision
Discovered by @bcmills in #64752 (comment). Bryan's analysis copied here:
Some analysis of the test log reported by watchflakes.

The program ends up running six goroutines in total. When run on my workstation, at the time of the crash the goroutines are parked in the following locations:
- `main` in `C.trigger_crash`
- `forcegchelper` in `goparkunlock`
- `bgsweep` in `goparkunlock`
- `bgscavenge` in `goparkunlock`
- `runfinq` in `gopark`
- `sysmon` in `notetsleep`

(The `sysmon` goroutine is not included in the output from `tracebackothers`, I guess because it doesn't have an associated P or G?)

In a successful run on my local workstation, I see `SIGQUIT: quit` logs for five background threads:
- `notetsleep` via `sysmon`
- `notesleep` via `templateThread`
- `stopm` via `schedule` and `findRunnable`
In the builder log, we see `SIGQUIT: quit` logs for only four background threads:
- `notetsleep` via `sysmon` (as expected)
- `notesleep` via `templateThread` (as expected)
- `allocm` via `schedule`, `resetspinning`, and `newm`
- `stopm` via `schedule` and `startlockedm`
The thread in `allocm` looks suspiciously like a race in the runtime. In particular:
- `startm` calls `mReserveID` (incrementing `mnext`) before it calls `newm`.
- The `newm` thread was interrupted by `SIGQUIT` during `allocm`, which is before the new thread is actually started for the M (in `newm1`).

So this looks like a genuine watchdog failure: the kernel defers delivery of the final `SIGQUIT` signal because all of the threads that could receive it are already handling a `SIGQUIT` (with that signal presumably masked), and the `main` thread is spinning in the `docrash` loop waiting for acknowledgement from a fifth thread that had an M ID reserved but was never actually started.

Perhaps `startm` needs to block signals from some point before calling `mReserveID` until `newm` has returned?

cc @golang/runtime
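To make the suspected window concrete, here is a minimal toy model of the accounting, not the runtime's actual code. The `toySched` type, its fields, `simulate`, and the starting counts are all hypothetical. It also assumes the crash loop expects one acknowledgement per reserved M ID, which is how the analysis above reads. The only thing taken from the analysis is the ordering: reserve an ID first, start the thread later.

```go
package main

import "fmt"

// toySched is a hypothetical stand-in for the scheduler bookkeeping.
// It is not the runtime's real data structure.
type toySched struct {
	reserved int // M IDs handed out so far (plays the role of mnext)
	started  int // Ms whose OS thread is actually running
}

// startM models the ordering described above: reserve an ID, then start
// the thread. sigquitPending says whether a SIGQUIT is waiting for the
// calling thread. If signals are not blocked, the handler runs inside the
// window and the caller never gets as far as starting the new M.
func (s *toySched) startM(sigquitPending, blockSignals bool) {
	s.reserved++ // analogous to mReserveID
	if sigquitPending && !blockSignals {
		return // interrupted between the reservation and the thread start
	}
	s.started++ // analogous to newm completing
}

// watchdogStuck reports whether a crash loop that waits for one
// acknowledgement per reserved ID (an assumption) can never finish.
func (s *toySched) watchdogStuck() bool {
	return s.started < s.reserved
}

// simulate runs one startM call with a SIGQUIT pending and reports
// whether the watchdog would be left waiting forever.
func simulate(blockSignals bool) bool {
	// Illustrative starting point: five Ms already running.
	s := &toySched{reserved: 5, started: 5}
	s.startM(true, blockSignals)
	return s.watchdogStuck()
}

func main() {
	fmt.Println("stuck without blocking:", simulate(false))
	fmt.Println("stuck with blocking:", simulate(true))
}
```

In this model, blocking signals across the reservation and the thread start closes the window, which is the shape of the fix suggested above. Whether that is the right fix in the real runtime is still the open question.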