Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: SIGSEGV in runtime.deltimer on linux-mips-rtrk during ReadMemStats #43712

Closed
bcmills opened this issue Jan 15, 2021 · 12 comments
Closed

runtime: SIGSEGV in runtime.deltimer on linux-mips-rtrk during ReadMemStats #43712

bcmills opened this issue Jan 15, 2021 · 12 comments

Comments

@bcmills
Copy link
Member

@bcmills bcmills commented Jan 15, 2021

##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x46ee1c]

runtime stack:
runtime.throw(0x6a72fd, 0x2a)
	/tmp/gobuilder-mips64/go/src/runtime/panic.go:1112 +0x60 fp=0x6fb81e04 sp=0x6fb81df0 pc=0x4461cc
runtime.sigpanic()
	/tmp/gobuilder-mips64/go/src/runtime/signal_unix.go:714 +0x3b4 fp=0x6fb81e1c sp=0x6fb81e04 pc=0x463048
runtime.deltimer(0xc80000, 0x0)
	/tmp/gobuilder-mips64/go/src/runtime/time.go:317 +0x164 fp=0x6fb81e30 sp=0x6fb81e20 pc=0x46ee1c
time.stopTimer(...)
	/tmp/gobuilder-mips64/go/src/runtime/time.go:219
runtime.wakeScavenger()
	/tmp/gobuilder-mips64/go/src/runtime/mgcscavenge.go:197 +0xa4 fp=0x6fb81e40 sp=0x6fb81e30 pc=0x42c274
runtime.sysmon()
	/tmp/gobuilder-mips64/go/src/runtime/proc.go:5191 +0x428 fp=0x6fb81e94 sp=0x6fb81e40 pc=0x457198
runtime.mstart1()
	/tmp/gobuilder-mips64/go/src/runtime/proc.go:1275 +0xf0 fp=0x6fb81ea4 sp=0x6fb81e94 pc=0x44c830
runtime.mstart()
	/tmp/gobuilder-mips64/go/src/runtime/proc.go:1240 +0x68 fp=0x6fb81eb8 sp=0x6fb81ea4 pc=0x44c710

…

goroutine 176543 [waiting]:
runtime.systemstack_switch()
	/tmp/gobuilder-mips64/go/src/runtime/asm_mipsx.s:159 +0x8 fp=0xcea978 sp=0xcea974 pc=0x48be64
runtime.stopTheWorld(0x6994d3, 0xe)
	/tmp/gobuilder-mips64/go/src/runtime/proc.go:984 +0xa0 fp=0xcea998 sp=0xcea978 pc=0x44be60
runtime.ReadMemStats(0xcea9dc)
	/tmp/gobuilder-mips64/go/src/runtime/mstats.go:473 +0x48 fp=0xcea9ac sp=0xcea998 pc=0x43e358
testing.AllocsPerRun(0x3e8, 0xcebf9c, 0x0, 0x0)
	/tmp/gobuilder-mips64/go/src/testing/allocs.go:28 +0xcc fp=0xcebf78 sp=0xcea9ac pc=0x50c340
runtime_test.TestStringConcatenationAllocs(0xd121c0)
	/tmp/gobuilder-mips64/go/src/runtime/malloc_test.go:140 +0x60 fp=0xcebfa4 sp=0xcebf78 pc=0x5c459c
testing.tRunner(0xd121c0, 0x6ae020)
	/tmp/gobuilder-mips64/go/src/testing/testing.go:1194 +0x130 fp=0xcebfe4 sp=0xcebfa4 pc=0x5176d0
runtime.goexit()
	/tmp/gobuilder-mips64/go/src/runtime/asm_mipsx.s:635 +0x4 fp=0xcebfe4 sp=0xcebfe4 pc=0x48e2ec
created by testing.(*T).Run
	/tmp/gobuilder-mips64/go/src/testing/testing.go:1239 +0x294

2021-01-14T21:55:29-eb33002/linux-mips-rtrk
2020-12-17T20:25:45-8fcf318/linux-mips-rtrk

CC @aclements @mknyszek @prattmic @mengzhuo; compare #43625.

@bcmills bcmills added this to the Backlog milestone Jan 15, 2021
@bcmills
Copy link
Member Author

@bcmills bcmills commented Jan 15, 2021

Both of those failures are during the current-cycle code freeze, as is the one in #43625, but I notice that #43625 was unmarked as release-blocker without further comment.

@golang/release, should this be milestoned to 1.16, given that it may be a regression from (or exposed by) the timer changes in 1.16? (Marking as release-blocker until that is answered, but I'm not particularly invested in the answer.)

@bcmills bcmills modified the milestones: Backlog, Go1.16 Jan 15, 2021
@mengzhuo
Copy link
Contributor

@mengzhuo mengzhuo commented Jan 15, 2021

I think this issue is not related to #43625

Might related to #35541 since tpp shouldn't be nil

atomic.Xadd(&tpp.deletedTimers, 1)

CC @cherrymui

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Jan 15, 2021

I don't think this is related to #35541, at least not directly. There, the problem is likely a malformed pointer but not 0. (Of course a malformed pointer could lead to memory corruption which could lead to anything.)

I'm not really familiar with the timer code to tell for sure that tpp cannot be 0.

				// Must fetch t.pp before changing status,
				// as cleantimers in another goroutine
				// can clear t.pp of a timerDeleted timer.
				tpp := t.pp.ptr()
				if !atomic.Cas(&t.status, timerModifying, timerDeleted) {

This comment, and MIPS being a weak memory model machine, make me worried.

@prattmic
Copy link
Member

@prattmic prattmic commented Jan 15, 2021

I've been looking at the same thing. t.pp should be set before transition to timerModifying (and doesn't change on transition to timerModifiedLater), and the strongly-ordered CAS should act as a barrier, but I'm still going through to see if we missed a case.

@ianlancetaylor
Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Jan 20, 2021

We only have two failing cases to look at, but they are very similar.

In both cases it is running the test GOMAXPROCS=2 runtime -cpu=1,2,4 -quick.

The test running is TestStringConcatenationAllocs. It calls testing.AllocsPerRun which calls runtime.ReadMemStats which calls stopTheWorld. In both stack traces the goroutine is waiting for a call to systemstack to complete. That call is to stopTheWorldWithSema. We have no stack trace for the call to stopTheWorldWithSema, so we don't know where it is.

In both failures sysmon wakes up while the world is stopping and decides to call wakeScavenger. wakeScavenger calls stopTimer(scavenge.timer), and that crashes because scavenge.timer.pp == 0 while scavenge.timer.status is either timerWaiting or timerModifiedLater. This should be impossible.

testing.AllocsPerRun starts by calling runtime.GOMAXPROCS(1). It seems likely that this is while the test is running with GOMAXPROCS set to 2 or 4. That means that we will have gone through (*p).destroy which calls moveTimers. This is interesting because if the timer status is timerWaiting, moveTimers sets t.pp = 0 without first adjusting the status. It then calls doaddtimer which sets t.pp to the new P. I don't see how this could possibly cause a problem, as there are quite a few atomic operations around, but it is a case where we have status == timerWaiting and t.pp = 0, which is the case that crashes.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 20, 2021

Change https://golang.org/cl/284775 mentions this issue: runtime: don't adjust timer pp field in timerWaiting status

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Jan 20, 2021

@ianlancetaylor I think you've figured it out: sysmon runs without a P and calls deltimer, and that call could happen while the world is stopped, while timers are being moved due to destroy. Since there is this window of time where t.pp == 0 while in timerWaiting in moveTimers and a deltimer could run concurrently and assume t.pp != nil after observing timerWaiting, yeah I think that's it.

I think your fix is exactly right, since then the deltimer will see timerMoving and will spin.

More broadly, maybe the scavenger really shouldn't use timers, or sysmon really shouldn't be the one kicking it. Sigh.

@gopherbot gopherbot closed this in d2d155d Jan 20, 2021
@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Jan 21, 2021

It turns out that this is also a problem in Go 1.15.

@gopherbot Please open a backport issue for Go 1.15.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 21, 2021

Backport issue(s) opened: #43833 (for 1.15).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://golang.org/wiki/MinorReleases.

@dmitshur
Copy link
Contributor

@dmitshur dmitshur commented Jan 21, 2021

@mknyszek Is it known whether it's also a problem for 1.14?

@mknyszek
Copy link
Contributor

@mknyszek mknyszek commented Jan 21, 2021

I don't believe so. wakeScavenger and the whole mechanism around it was only added in Go 1.15.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 27, 2021

Change https://golang.org/cl/287092 mentions this issue: [release-branch.go1.15] runtime: don't adjust timer pp field in timerWaiting status

gopherbot pushed a commit that referenced this issue Feb 3, 2021
…Waiting status

Before this CL, the following sequence was possible:

* GC scavenger starts and sets up scavenge.timer
* GC calls readyForScavenger, but sysmon is sleeping
* program calls runtime.GOMAXPROCS to shrink number of processors
* procresize destroys a P, the one that scavenge.timer is on
* (*pp).destroy calls moveTimers, which gets to the scavenger timer
* scavenger timer is timerWaiting, and moveTimers clears t.pp
* sysmon wakes up and calls wakeScavenger
* wakeScavengers calls stopTimer on scavenger.timer, still timerWaiting
* stopTimer calls deltimer which loads t.pp, which is still nil
* stopTimer tries to increment deletedTimers on nil t.pp, and crashes

The point of vulnerability is the time that t.pp is set to nil by
moveTimers and the time that t.pp is set to non-nil by moveTimers,
which is a few instructions at most. So it's not likely and in
particular is quite unlikely on x86. But with a more relaxed memory
model the area of vulnerability can be somewhat larger. This appears
to tbe the cause of two builder failures in a few months on linux-mips.

This CL fixes the problem by making moveTimers change the status from
timerWaiting to timerMoving while t.pp is clear. That will cause
deltimer to wait until the status is back to timerWaiting, at which
point t.pp has been set again.

For #43712
Fixes #43833

Change-Id: I66838319ecfbf15be66c1fac88d9bd40e2295852
Reviewed-on: https://go-review.googlesource.com/c/go/+/284775
Trust: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
(cherry picked from commit d2d155d)
Reviewed-on: https://go-review.googlesource.com/c/go/+/287092
Run-TryBot: Carlos Amedee <carlos@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
8 participants