
runtime: SIGQUIT crash loop + startm race condition deadlock #65138

@prattmic


Discovered by @bcmills in #64752 (comment). Bryan's analysis copied here:

Some analysis of the test log reported by watchflakes.

The program ends up running six goroutines in total. When run on my workstation, at the time of the crash the goroutines are parked in the following locations:

In a successful run on my local workstation, I see SIGQUIT: quit logs for five background threads:

In the builder log, we see SIGQUIT: quit logs for only four background threads:

  • one in notetsleep via sysmon (as expected)
  • one in notesleep via templateThread (as expected)
  • one in allocm via schedule, resetspinning, and newm
  • one in stopm via schedule and startlockedm

The thread in allocm looks suspiciously like a race in the runtime. In particular:

  • startm calls mReserveID (incrementing mnext) before it calls newm
  • The newm thread was interrupted by SIGQUIT during allocm, which is before the new thread is actually started for the M (in newm1).

So this looks like a genuine watchdog failure: the kernel defers delivery of the final SIGQUIT signal because all of the threads that could receive it are already handling a SIGQUIT (with that signal presumably masked), and the main thread is spinning in the docrash loop waiting for acknowledgement from a fifth thread that had an M ID reserved but was never actually started.
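To make the failure mode above concrete, here is a toy model in ordinary Go. It is not runtime code: `model`, `reserveID`, `start`, `waitForAcks`, and `simulate` are hypothetical names made up for this sketch, goroutines stand in for Ms, a closed channel stands in for SIGQUIT delivery, and a timeout replaces the unbounded spin so that the program terminates.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// model is a toy stand-in for scheduler state (hypothetical, not runtime code).
type model struct {
	reserved atomic.Int64 // IDs handed out so far, loosely analogous to mnext
	acked    atomic.Int64 // threads that have acknowledged the crash signal
}

// reserveID mimics reserving an M ID before the thread exists.
func (m *model) reserveID() int64 { return m.reserved.Add(1) }

// start mimics the thread actually coming up; only a started thread can ack.
func (m *model) start(crash <-chan struct{}) {
	go func() {
		<-crash // wait for the simulated signal
		m.acked.Add(1)
	}()
}

// waitForAcks polls until every reserved ID has acked or the timeout expires.
// The timeout exists only so this sketch terminates.
func (m *model) waitForAcks(timeout time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		if m.acked.Load() == m.reserved.Load() {
			return true
		}
		time.Sleep(time.Millisecond)
	}
	return false
}

// simulate reserves five IDs. If interruptLast is set, the fifth ID is
// reserved but its thread is never started, as in the allocm interruption.
func simulate(interruptLast bool) bool {
	var m model
	crash := make(chan struct{})
	for i := 0; i < 5; i++ {
		m.reserveID()
		if interruptLast && i == 4 {
			continue // ID reserved, thread never started
		}
		m.start(crash)
	}
	close(crash) // deliver the simulated signal to every started thread
	return m.waitForAcks(200 * time.Millisecond)
}

func main() {
	fmt.Println("all threads started, all acked:", simulate(false))
	fmt.Println("one ID reserved but never started, all acked:", simulate(true))
}
```

In this model the second call can never succeed: the waiter counts reserved IDs, but only started threads can acknowledge, so the counts stay one apart until the timeout.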

Perhaps startm needs to block signals from some point before calling mReserveID until newm has returned?
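A rough sketch of the idea behind that suggestion, again as a toy model rather than a runtime patch. Here a mutex stands in for "signals blocked" (an assumption made for illustration; real signal masking is per-thread and works differently), and `sched`, `startThread`, `snapshot`, and `observeWhileStarting` are hypothetical names. The point is only that if reserving the ID and starting the thread cannot be observed separately, the observer never sees a reserved-but-unstarted ID.

```go
package main

import (
	"fmt"
	"sync"
)

// sched is a toy model (hypothetical, not runtime code).
type sched struct {
	mu       sync.Mutex // stands in for "signals blocked" in this sketch
	reserved int        // IDs reserved
	started  int        // threads actually started
}

// startThread reserves an ID and starts the thread as one indivisible step
// from the observer's point of view.
func (s *sched) startThread() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.reserved++
	// Allocation work for the new thread would happen here.
	s.started++
}

// snapshot is what a crash handler would observe in this model.
func (s *sched) snapshot() (reserved, started int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.reserved, s.started
}

// observeWhileStarting starts n threads concurrently while repeatedly taking
// snapshots, and reports whether every snapshot had reserved == started.
func observeWhileStarting(n int) bool {
	var s sched
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.startThread()
		}()
	}
	consistent := true
	for i := 0; i < 1000; i++ {
		if r, st := s.snapshot(); r != st {
			consistent = false
		}
	}
	wg.Wait()
	r, st := s.snapshot()
	return consistent && r == n && st == n
}

func main() {
	fmt.Println("observer always saw reserved == started:", observeWhileStarting(8))
}
```

Whether the real fix should mask signals, reorder mReserveID, or change how docrash counts threads is exactly the open question in this issue; the sketch only shows why closing the window between reservation and start removes the inconsistent state.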

cc @golang/runtime

Labels

NeedsDecision: Feedback is required from experts, contributors, and/or the community before a change can be made.
compiler/runtime: Issues related to the Go compiler and/or runtime.
