
runtime: test timeout with truncated stack trace on linux-ppc64le-power9osu #43175

Closed
bcmills opened this issue Dec 14, 2020 · 4 comments

@bcmills bcmills commented Dec 14, 2020

This looks suspiciously like a deadlock in the runtime's stack-dumping code: the test times out after 3 minutes, but hasn't finished dumping stacks after 4m, and appears to have been interrupted in the middle of a GC.

CC @danscales @mknyszek @aclements

2020-12-14T15:03:28-451b6b3/linux-ppc64le-power9osu

panic: test timed out after 3m0s

goroutine 92156 [running]:
panic(0x10209800, 0xc0003ce000)
	/workdir/go/src/runtime/panic.go:1060 +0x540 fp=0xc00052af58 sp=0xc00052ae78 pc=0x10039da0
testing.(*M).startAlarm.func1()
	/workdir/go/src/testing/testing.go:1701 +0xcc fp=0xc00052afc0 sp=0xc00052af58 pc=0x100ef2dc
runtime.goexit()
	/workdir/go/src/runtime/asm_ppc64x.s:869 +0x4 fp=0xc00052afc0 sp=0xc00052afc0 pc=0x100774f4
created by time.goFunc
	/workdir/go/src/time/sleep.go:169 +0x44

goroutine 1 [chan receive, 2 minutes, locked to thread]:
runtime.gopark(0x10277560, 0xc00016e118, 0x170e, 0x2)
	/workdir/go/src/runtime/proc.go:336 +0xec fp=0xc0000409b0 sp=0xc000040980 pc=0x1003cc9c
runtime.chanrecv(0xc00016e0c0, 0x0, 0xc00030e201, 0xc000080101)
	/workdir/go/src/runtime/chan.go:576 +0x388 fp=0xc000040a50 sp=0xc0000409b0 pc=0x10007ab8
runtime.chanrecv1(0xc00016e0c0, 0x0)
	/workdir/go/src/runtime/chan.go:439 +0x24 fp=0xc000040a90 sp=0xc000040a50 pc=0x100076d4
testing.tRunner.func1(0xc000084480)
	/workdir/go/src/testing/testing.go:1160 +0x230 fp=0xc000040b90 sp=0xc000040a90 pc=0x100eee40
testing.tRunner(0xc000084480, 0xc000040cc8)
	/workdir/go/src/testing/testing.go:1198 +0x108 fp=0xc000040bf0 sp=0xc000040b90 pc=0x100ea8a8
testing.runTests(0xc0000a4048, 0x103eaee0, 0x15f, 0x15f, 0xbfedfe69b91e7fd5, 0x29e8e5d854, 0x10404ee0, 0x71ad96470680)
	/workdir/go/src/testing/testing.go:1510 +0x2b4 fp=0xc000040ce8 sp=0xc000040bf0 pc=0x100ec0e4
testing.(*M).Run(0xc0000d6000, 0x0)
	/workdir/go/src/testing/testing.go:1418 +0x1a0 fp=0xc000040e28 sp=0xc000040ce8 pc=0x100eb1b0
runtime_test.TestMain(0xc0000d6000)
	/workdir/go/src/runtime/crash_test.go:27 +0x2c fp=0xc000040e88 sp=0xc000040e28 pc=0x1016579c
main.main()
	_testmain.go:1245 +0x134 fp=0xc000040f50 sp=0xc000040e88 pc=0x101e7774
runtime.main()
	/workdir/go/src/runtime/proc.go:225 +0x270 fp=0xc000040fc0 sp=0xc000040f50 pc=0x1003c830
runtime.goexit()
	/workdir/go/src/runtime/asm_ppc64x.s:869 +0x4 fp=0xc000040fc0 sp=0xc000040fc0 pc=0x100774f4

goroutine 2 [force gc (idle), 2 minutes]:
runtime.gopark(0x10277918, 0x10404600, 0x1411, 0x1)
	/workdir/go/src/runtime/proc.go:336 +0xec fp=0xc000032f80 sp=0xc000032f50 pc=0x1003cc9c
runtime.goparkunlock(...)
	/workdir/go/src/runtime/proc.go:342
runtime.forcegchelper()
	/workdir/go/src/runtime/proc.go:276 +0xcc fp=0xc000032fc0 sp=0xc000032f80 pc=0x1003cb0c
runtime.goexit()
	/workdir/go/src/runtime/asm_ppc64x.s:869 +0x4 fp=0xc000032fc0 sp=0xc000032fc0 pc=0x100774f4
created by runtime.init.5
	/workdir/go/src/runtime/proc.go:264 +0x34

goroutine 3 [GC sweep wait]:
runtime.gopark(0x10277918, 0x10404aa0, 0x140c, 0x1)
	/workdir/go/src/runtime/proc.go:336 +0xec fp=0xc000033778 sp=0xc000033748 pc=0x1003cc9c
runtime.goparkunlock(...)
	/workdir/go/src/runtime/proc.go:342
runtime.bgsweep(0xc000016070)
	/workdir/go/src/runtime/mgcsweep.go:182 +0x188 fp=0xc0000337b8 sp=0xc000033778 pc=0x10027028
runtime.goexit()
	/workdir/go/src/runtime/asm_ppc64x.s:869 +0x4 fp=0xc0000337b8 sp=0xc0000337b8 pc=0x100774f4
created by runtime.gcenable
	/workdir/go/src/runtime/mgc.go:217 +0x58

goroutine 4 [GC scavenge wait]:
runtime.gopark(0x10277918, 0x10404be0, 0x140d, 0x1)
	/workdir/go/src/runtime/proc.go:336 +0xec fp=0xc000041f38 sp=0xc000041f08 pc=0x1003cc9c
runtime.goparkunlock(...)
	/workdir/go/src/runtime/proc.go:342
runtime.bgscavenge(0xc000016070)
	/workdir/go/src/runtime/mgcscavenge.go:314 +0x29c fp=0xc000041fb8 sp=0xc000041f38 pc=0x1002535c
runtime.goexit()
	/workdir/go/src/runtime/asm_ppc64x.s:869 +0x4 fp=0xc000041fb8 sp=0xc000041fb8 pc=0x100774f4
created by runtime.gcenable
	/workdir/go/src/runtime/mgc.go:218 +0x78

goroutine 18 [finalizer wait]:
runtime.gopark(0x10277918, 0x10438350, 0xc000011410, 0x1)
	/workdir/go/src/runtime/proc.go:336 +0xec fp=0xc000270f18 sp=0xc000270ee8 pc=0x1003cc9c
runtime.goparkunlock(...)
	/workdir/go/src/runtime/proc.go:342
runtime.runfinq()
	/workdir/go/src/runtime/mfinal.go:175 +0xc8 fp=0xc000270fc0 sp=0xc000270f18 pc=0x1001c158
runtime.goexit()
	/workdir/go/src/runtime/asm_ppc64x.s:869 +0x4 fp=0xc000270fc0 sp=0xc000270fc0 pc=0x100774f4
created by runtime.createfing
	/workdir/go/src/runtime/mfinal.go:156 +0x90

goroutine 161 [GC worker (idle), 2 minutes]:
runtime.gopark(0x10277670, 0xc00001c080, 0x10071418, 0x0)
	/workdir/go/src/runtime/proc.go:336 +0xec fp=0xc000034718 sp=0xc0000346e8 pc=0x1003cc9c
runtime.gcBgMarkWorker()
	/workdir/go/src/runtime/mgc.go:1911 +0xfc fp=0xc0000347c0 sp=0xc000034718 pc=0x1001ff1c
runtime.goexit()
	/workdir/go/src/runtime/asm_ppc64x.s:869 +0x4 fp=0xc0000347c0 sp=0xc0000347c0 pc=0x100774f4
created by runtime.gcBgMarkStartWorkers
	/workdir/go/src/runtime/mgc.go:1835 +0x38

…
goroutine 92151 [runnable]:
syscall.Syscall(0x3, 0xa, 0xc0012c6071, *** Test killed: ran too long (4m0s).
@bcmills bcmills added this to the Backlog milestone Dec 14, 2020
@mknyszek mknyszek commented Dec 14, 2020

Unfortunately there is a known possibility for deadlock in the stack-dumping code. (See #42062 (comment).)

The static lock checker has known about this for a while. @prattmic has a fix which is basically to say "if we're going down anyway, it's probably OK to just race; better than a deadlock, anyway." (golang.org/cl/270861).

I think this is a duplicate of #42669 as a result.
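
The shape of that fix can be illustrated outside the runtime. The sketch below is a hypothetical simplification (the names `allgs` and `dumpAll` are made up; the real change is in `tracebackothers`, golang.org/cl/270861): when the process is already crashing, skip taking the lock and accept a possible race rather than risk a self-deadlock.

```go
package main

import (
	"fmt"
	"sync"
)

// allgs mimics the runtime's list of goroutines; allglock guards it.
var (
	allglock sync.Mutex
	allgs    = []string{"g1", "g2", "g3"}
)

// dumpAll prints every entry. crashing reports whether we are already in
// a fatal panic; if so, skip the lock: a possibly stale read beats a
// deadlock, because the goroutine that is crashing may itself be the
// holder of allglock.
func dumpAll(crashing bool) {
	if !crashing {
		allglock.Lock()
		defer allglock.Unlock()
	}
	for _, g := range allgs {
		fmt.Println("goroutine", g)
	}
}

func main() {
	// Simulate a fatal throw taken while allglock is held, as in the
	// allocator case from this issue.
	allglock.Lock()
	dumpAll(true) // a plain Lock here would deadlock forever
	allglock.Unlock()
}
```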

@bcmills bcmills commented Dec 14, 2020

Ah, so now we have a new data point for this part of the comment!

> I don't think in practice (2) even hurts diagnostics for us because in the worst case it causes a deadlock, and eventually we'll try to dump everything much more forcefully without acquiring locks (e.g. someone sends the process a SIGABRT) since we'll freeze the world (or try and dump it all anyway... maybe this is only true with GOTRACEBACK=crash).

I think this is a case where the deadlock does hurt diagnostics for us: the test process itself is not in a position to dump everything more forcefully, and the go command sends SIGKILL (via cmd.Process.Kill), not SIGABRT or any other signal that the runtime can handle. (It assumes that if the test process has deadlocked hard enough that its self-signal doesn't work, another catchable signal won't help.)

@gopherbot gopherbot commented Jan 5, 2021

Change https://golang.org/cl/270861 mentions this issue: runtime: don't take allglock in tracebackothers

@prattmic prattmic commented Jan 5, 2021

Since we think this is caused by the allglock deadlock fixed in https://golang.org/cl/270861, I'm going to close this for now. We can reopen if we find this still occurring.

@prattmic prattmic closed this Jan 5, 2021
gopherbot pushed a commit that referenced this issue Jan 5, 2021
tracebackothers is called from fatal throw/panic.

A fatal throw may be taken with allglock held (notably in the allocator
when allglock is held), which would cause a deadlock in tracebackothers
when we try to take allglock again. Locking allglock here is also often
a lock order violation w.r.t. the locks held when throw was called.

Avoid the deadlock and ordering issues by skipping locking altogether.
It is OK to miss concurrently created Gs (which are generally avoided by
freezetheworld(), and which were possible previously anyways if created
after the loop).

Fatal throw/panic freezetheworld(), which should freeze other threads
that may be racing to modify allgs. However, freezetheworld() does _not_
guarantee that it stops all other threads, so we can't simply drop the
lock.

Fixes #42669
Updates #43175

Change-Id: I657aec46ed35fd5d1b3f1ba25b500128ab26b088
Reviewed-on: https://go-review.googlesource.com/c/go/+/270861
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Trust: Michael Pratt <mpratt@google.com>