Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: loop over allp causes a nil pointer dereference crash #23098

Closed
mikioh opened this issue Dec 12, 2017 · 10 comments

Comments

Projects
None yet
6 participants
@mikioh
Copy link
Contributor

commented Dec 12, 2017

See https://build.golang.org/log/ff2e3f95d35cbaec0291bf387a6409c1f71e417f

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x8078907]

runtime stack:
runtime.throw(0x8150279, 0x2a)
	/tmp/workdir/go/src/runtime/panic.go:616 +0x6b
runtime.sigpanic()
	/tmp/workdir/go/src/runtime/signal_unix.go:366 +0x230
runtime.runqempty(0x0, 0xffffff01)
	/tmp/workdir/go/src/runtime/proc.go:4638 +0x17
runtime.findrunnable(0x383e6000, 0x0)
	/tmp/workdir/go/src/runtime/proc.go:2342 +0x18d
runtime.schedule()
	/tmp/workdir/go/src/runtime/proc.go:2530 +0x10b
runtime.mstart1(0x0)
	/tmp/workdir/go/src/runtime/proc.go:1232 +0x7a
runtime.mstart()
	/tmp/workdir/go/src/runtime/proc.go:1188 +0x59

Seems like there is an exception to the assumption that "none of allp[i] are nil or in state _Pdead." in http://golang.org/cl/45575.

Hej Hej, @aclements

@mikioh mikioh added this to the Go1.10 milestone Dec 12, 2017

@bradfitz bradfitz added the NeedsFix label Dec 12, 2017

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 4, 2018

The failures are all in programs that call runtime.GOMAXPROCS to increase the number of P's. The failures in all cases are in the // check all runqueues once again loop in findrunnable. Perhaps of particular interest are three failures running go/test/fixedbugs/issue13160.go, a program that calls runtime.GOMAXPROCS once at the start of main. In one of those (https://build.golang.org/log/e9f4fd250aa74c540edb5e75f824d341f9d8a1e1) the failure is in a call to gopark calling schedule, likely from a blocking channel operation. The code in that case is apparently running well after the call to runtime.GOMAXPROCS has returned. So somehow findrunnable is seeing a nil entry in allp after GOMAXPROCS returns. Which seems impossible.

@aclements

This comment has been minimized.

Copy link
Member

commented Jan 4, 2018

The failures in all cases are in the // check all runqueues once again loop in findrunnable.

That loop is concerning because it runs (somewhat non-obviously) without a P, so it could be running concurrently with a STW. I'm not sure how that would be a problem in that particular test, since, as you point out, allp should never be changing after the initial GOMAXPROCS.

I have a fix, which I will send shortly. I can't reproduce this issue, so it's hard to say if this actually fixes the problem, but it's certainly not a bad thing to fix.

@gopherbot

This comment has been minimized.

Copy link

commented Jan 4, 2018

Change https://golang.org/cl/86215 mentions this issue: runtime: avoid race on allp in findrunnable

gopherbot pushed a commit that referenced this issue Jan 4, 2018

runtime: avoid race on allp in findrunnable
findrunnable loops over allp to check run queues *after* it has
dropped its own P. This is unsafe because allp can change when nothing
is blocking safe-points. Hence, procresize could change allp
concurrently with findrunnable's loop. Beyond generally violating Go's
memory model, in the best case this could findrunnable to observe a
nil P pointer if allp has been grown but the new slots not yet
initialized. In the worst case, the reads of allp could tear, causing
findrunnable to read a word that isn't even a valid *P pointer.

Fix this by taking a snapshot of the allp slice header (but not the
backing store) before findrunnable drops its P and iterating over this
snapshot. The actual contents of allp are immutable up to len(allp),
so this fixes the race.

Updates #23098 (may fix).

Change-Id: I556ae2dbfffe9fe4a1bf43126e930b9e5c240ea8
Reviewed-on: https://go-review.googlesource.com/86215
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>

@ianlancetaylor ianlancetaylor modified the milestones: Go1.10, Go1.11 Jan 10, 2018

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jan 10, 2018

I'm troubled by https://build.golang.org/log/71e1620c62a2255007ea450fc02e9c5d6f5af068 which shows a crash in a different part of findrunnable that refers to allp. In that case we are clearly running with a p, and the entry in allp is not nil, but it still somehow got a nil pointer dereference.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jul 11, 2018

Several of these seem to be nil pointer dereferences:

##### ../misc/cgo/test
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x80 addr=0x0 pc=0x43b498]

runtime stack:
runtime.throw(0x9301d0, 0x2a)
	/data/mipsle/go/src/runtime/panic.go:608 +0x58
runtime.sigpanic()
	/data/mipsle/go/src/runtime/signal_unix.go:374 +0x3b0
runtime.findrunnable(0x1022000, 0x0)
	/data/mipsle/go/src/runtime/proc.go:2349 +0xbc
runtime.schedule()
	/data/mipsle/go/src/runtime/proc.go:2612 +0x154
runtime.park_m(0x1192380)
	/data/mipsle/go/src/runtime/proc.go:2675 +0xb0

proc.go:2349 is

if gp := runqsteal(_p_, allp[enum.position()], stealRunNextG); gp != nil {

It's hard to see a nil dereference there. runqsteal is not being inlined.

@ianlancetaylor ianlancetaylor modified the milestones: Go1.11, Go1.12 Jul 11, 2018

@mknyszek

This comment has been minimized.

Copy link
Contributor

commented Dec 6, 2018

I grep'd over the logs since June 29th, and I couldn't find any new instances of this bug. That is, specifically nil pointer dereferences in findrunnable that happen at the same line (just looked at it myself to check). There was one that looked fishy from plan9-arm, but the dereference was cgo_yield and so seemed unrelated.

I'll bisect to try to figure out what commit "fixed" it for linux-mipsle.

With that said though, should this still be considered a release blocker?

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Dec 6, 2018

If it hasn't happened since June 29 I think we can close this.

@mknyszek

This comment has been minimized.

Copy link
Contributor

commented Dec 6, 2018

OK! I just took a second, slow look over the logs and I'm pretty confident in my assessment.

I'll close this bug off.

@mknyszek mknyszek closed this Dec 6, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.