runtime: deadlock involving gcControllerState.enlistWorker #19112

@rsc

We have seen one instance of a production job suddenly spinning to 100% CPU and becoming unresponsive. In that one instance, a SIGQUIT was sent after 328 minutes of spinning, and the stacks showed a single goroutine in "IO wait (scan)" state.

Looking for code that could hang if a goroutine were stuck while scanning a stack, we found that injectglist does:

lock(&sched.lock)
var n int
for n = 0; glist != nil; n++ {
	gp := glist
	glist = gp.schedlink.ptr()
	casgstatus(gp, _Gwaiting, _Grunnable)
	globrunqput(gp)
}
unlock(&sched.lock)

and that casgstatus spins on gp.atomicstatus until the _Gscan bit is cleared. In effect, the _Gscan bit acts as a lock: this code acquires sched.lock and then, while still holding sched.lock, waits to acquire gp.atomicstatus.

The code that is doing the scan is:

if castogscanstatus(gp, s, s|_Gscan) {
	if !gp.gcscandone {
		scanstack(gp, gcw)
		gp.gcscandone = true
	}
	restartg(gp)
	break loop
}

More analysis showed that scanstack can, in a rare case, end up calling back into code that acquires sched.lock. For example:

runtime.scanstack at proc.go:866
calls runtime.gentraceback at mgcmark.go:842
calls runtime.scanstack$1 at traceback.go:378
calls runtime.scanframeworker at mgcmark.go:819
calls runtime.scanblock at mgcmark.go:904
calls runtime.greyobject at mgcmark.go:1221
calls (*runtime.gcWork).put at mgcmark.go:1412
calls (*runtime.gcControllerState).enlistWorker at mgcwork.go:127
calls runtime.wakep at mgc.go:632
calls runtime.startm at proc.go:1779
acquires runtime.sched.lock at proc.go:1675

This path was found with an automated deadlock-detecting tool by @aclements. There are many such paths but they all go through enlistWorker -> wakep.
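The cycle can be reproduced in miniature outside the runtime. The sketch below is an assumption-laden model, not the runtime's implementation: simulate is a hypothetical helper, a sync.Mutex stands in for sched.lock, a uint32 stands in for gp.atomicstatus, and the 50 ms deadline exists only so the demonstration can report the hang instead of hanging. The "injector" goroutine plays injectglist; the calling goroutine plays the scanner that reaches wakep.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

// Toy status values. The names are illustrative; this is not runtime code.
const (
	gRunnable uint32 = 1
	gWaiting  uint32 = 4
	gScan     uint32 = 0x1000
)

// simulate reports whether the "scanner" was unable to take schedLock
// while an "injector" held it and spun on the scan bit.
// wakeDuringScan models whether the scan path tries to take schedLock
// (the enlistWorker -> wakep call) before releasing the scan bit.
func simulate(wakeDuringScan bool) bool {
	var schedLock sync.Mutex
	status := gWaiting | gScan // the scanner already owns the scan bit

	locked := make(chan struct{})
	done := make(chan struct{})
	go func() { // injector: lock first, then wait for the scan bit
		defer close(done)
		schedLock.Lock()
		defer schedLock.Unlock()
		close(locked)
		for !atomic.CompareAndSwapUint32(&status, gWaiting, gRunnable) {
			runtime.Gosched()
		}
	}()
	<-locked // the injector now holds schedLock and is spinning

	stuck := false
	if wakeDuringScan {
		// Scanner wants schedLock while still holding the scan bit.
		// The real code would block forever; the deadline is demo-only.
		deadline := time.Now().Add(50 * time.Millisecond)
		acquired := false
		for time.Now().Before(deadline) {
			if schedLock.TryLock() {
				acquired = true
				schedLock.Unlock()
				break
			}
			runtime.Gosched()
		}
		stuck = !acquired
	}

	// Release the scan bit so the demo can terminate cleanly.
	atomic.StoreUint32(&status, gWaiting)
	<-done
	return stuck
}

func main() {
	fmt.Println("scan path takes the lock, stuck:", simulate(true))
	fmt.Println("scan path skips the lock, stuck:", simulate(false))
}
```

In this model the first call should report the scanner as stuck, because neither side can proceed until the other releases what it holds, while the second call, which never takes the lock on the scan path, should not.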

The evidence strongly suggests that one of these paths is what caused the deadlock we observed. We're running those jobs with GOTRACEBACK=crash now to try to get more information if it happens again.

Further refinement and analysis by @aclements and me show that if we drop the wakep call from enlistWorker, the few remaining deadlock cycles found by the tool are all false positives, caused by the tool not understanding the effect of calls to func variables.

For Go 1.8 we intend to drop the enlistWorker -> wakep call. It was intended only as a performance optimization and rarely executes, but if it does execute at just the wrong time it can (and plausibly did) cause the deadlock we saw.
