runtime: unresponsive tests with a running on other thread goroutine created by runtime.gcenable #64062

Closed
gopherbot opened this issue Nov 10, 2023 · 13 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@gopherbot
Contributor

gopherbot commented Nov 10, 2023

#!watchflakes
post <- log ~ `Test killed with quit` && log ~ `goroutine \d+ \[running\]:\n\s+goroutine running on other thread; stack unavailable\n\s*created by runtime\.gcenable`

Issue created automatically to collect these failures.

Example (log):

SIGQUIT: quit
PC=0x873fc m=0 sigcode=0

goroutine 2 [running]:
runtime.systemstack_switch()
	/workdir/go/src/runtime/asm_ppc64x.s:212 +0x10 fp=0xc00003cd70 sp=0xc00003cd50 pc=0x83180
runtime.(*mheap).freeSpan(...)
	/workdir/go/src/runtime/mheap.go:1551
runtime.(*sweepLocked).sweep(0x5a02c0?, 0x0)
	/workdir/go/src/runtime/mgcsweep.go:757 +0x73c fp=0xc00003ce78 sp=0xc00003cd70 pc=0x396fc
...
r20  0x54	r21  0x584680
r22  0x7ffff3f4ab70	r23  0xc00003cdf0
r24  0x561a40	r25  0x16
r26  0x0	r27  0x0
r28  0x0	r29  0xc000030258
r30  0x583a40	r31  0x1bee0
pc   0x873fc	ctr  0x0
link 0x48b50	xer  0x0
ccr  0x34448084	trap 0xc00
*** Test killed with quit: ran too long (22m0s).

watchflakes

@gopherbot gopherbot added the NeedsInvestigation label Nov 10, 2023
@gopherbot gopherbot added the Tools label Nov 10, 2023
@gopherbot
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
default <- pkg == "golang.org/x/tools/internal/gcimporter" && test == ""
2023-11-09 21:01 linux-ppc64le-power10osu tools@e3f67986 go@130baf3d x/tools/internal/gcimporter (log)
SIGQUIT: quit
PC=0x873fc m=0 sigcode=0

goroutine 2 [running]:
runtime.systemstack_switch()
	/workdir/go/src/runtime/asm_ppc64x.s:212 +0x10 fp=0xc00003cd70 sp=0xc00003cd50 pc=0x83180
runtime.(*mheap).freeSpan(...)
	/workdir/go/src/runtime/mheap.go:1551
runtime.(*sweepLocked).sweep(0x5a02c0?, 0x0)
	/workdir/go/src/runtime/mgcsweep.go:757 +0x73c fp=0xc00003ce78 sp=0xc00003cd70 pc=0x396fc
...
r20  0x54	r21  0x584680
r22  0x7ffff3f4ab70	r23  0xc00003cdf0
r24  0x561a40	r25  0x16
r26  0x0	r27  0x0
r28  0x0	r29  0xc000030258
r30  0x583a40	r31  0x1bee0
pc   0x873fc	ctr  0x0
link 0x48b50	xer  0x0
ccr  0x34448084	trap 0xc00
*** Test killed with quit: ran too long (22m0s).
2023-11-10 13:59 linux-ppc64le-power10osu tools@92a8009c go@abf84221 x/tools/internal/gcimporter (log)
SIGQUIT: quit
PC=0x875bc m=0 sigcode=0

goroutine 2 [running]:
	goroutine running on other thread; stack unavailable
created by runtime.init.5 in goroutine 1
	/workdir/go/src/runtime/proc.go:314 +0x2c

goroutine 3 [running]:
	goroutine running on other thread; stack unavailable
...
r20  0xaa	r21  0x5846e0
r22  0x7fffe31f23a0	r23  0xc000d42b10
r24  0xc000771930	r25  0xc000771908
r26  0xc003100c39	r27  0xc003100c35
r28  0x0	r29  0xc000032e98
r30  0x583aa0	r31  0x1c2b0
pc   0x875bc	ctr  0x0
link 0x48bd0	xer  0x0
ccr  0x34448024	trap 0xc00
*** Test killed with quit: ran too long (22m0s).

watchflakes

@gopherbot gopherbot added this to the Unreleased milestone Nov 10, 2023
@bcmills bcmills added the compiler/runtime label Nov 10, 2023
@bcmills bcmills changed the title from "x/tools/internal/gcimporter: unrecognized failures" to "runtime: unresponsive tests in systemstack_switch" Nov 10, 2023
@bcmills bcmills removed the Tools label Nov 10, 2023
@bcmills
Contributor

bcmills commented Nov 10, 2023

@golang/runtime: this looks to me like a runtime deadlock or livelock. Notably, the test binary's internal timer failed to fire.

@gopherbot

This comment was marked as off-topic.

@bcmills
Contributor

bcmills commented Nov 10, 2023

Huh. Yeah, my attempt at a watchflakes pattern is clearly not right — it's catching #56418 but not all the failures for this issue. 🙃
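
To see what the second log pattern in the rule accepts, here is a minimal Go sketch. It assumes the `~` operator behaves roughly like a match with Go's `regexp` package. The helper `matchesFlake` and the second log fragment are hypothetical and only serve as illustration; the first fragment is shaped like the goroutine 2 header in the logs above.

```go
package main

import (
	"fmt"
	"regexp"
)

// flakePattern is the second log pattern from the watchflakes rule quoted in this issue.
var flakePattern = regexp.MustCompile(`goroutine \d+ \[running\]:\n\s+goroutine running on other thread; stack unavailable\n\s*created by runtime\.gcenable`)

// matchesFlake (hypothetical helper) reports whether a log contains the
// goroutine header that the pattern looks for.
func matchesFlake(log string) bool {
	return flakePattern.MatchString(log)
}

func main() {
	// Shaped like the goroutine 2 header in the logs above: created by runtime.init.5.
	fromInit := "goroutine 2 [running]:\n\tgoroutine running on other thread; stack unavailable\ncreated by runtime.init.5 in goroutine 1\n"
	// Hypothetical fragment in which the creator is runtime.gcenable.
	fromGCEnable := "goroutine 3 [running]:\n\tgoroutine running on other thread; stack unavailable\ncreated by runtime.gcenable in goroutine 1\n"

	fmt.Println(matchesFlake(fromInit))     // false
	fmt.Println(matchesFlake(fromGCEnable)) // true
}
```

The pattern only fires when the goroutine reported as running on another thread was created by `runtime.gcenable`, so a log needs such a goroutine somewhere in its full text to be collected here.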

@bcmills bcmills changed the title from "runtime: unresponsive tests in systemstack_switch" to "runtime: unresponsive tests with a running on other thread goroutine created by runtime.gcenable" Nov 10, 2023
@mknyszek
Contributor

I believe this is the same issue as #64050. Working on it now.

@mknyszek
Contributor

Wait, actually I see the issue here very clearly in that stack trace. ... Huh. This seems unrelated to basically anything else that has landed recently, but I'm positive it's the problem.

@gopherbot
Contributor Author

Change https://go.dev/cl/541635 mentions this issue: runtime: call enableMetadataHugePages and its callees on the systemstack

@mknyszek
Contributor

Closing as a duplicate of #64067.

@mknyszek mknyszek closed this as not planned Nov 10, 2023
@pmur
Contributor

pmur commented Nov 10, 2023

@mknyszek thanks for looking into this. Reverting CL 533455 seemed to avoid the issue on ppc64, though as you note, it's likely not the cause.

I had to reboot a couple of the PPC64 VMs to unclog them. @bcmills do you know if make.bash is subject to a timeout during CI? Some of the jobs seemed to be stuck there for many hours.

@bcmills
Contributor

bcmills commented Nov 10, 2023

@pmur, I honestly don't know. 😅

You could check the code in x/build/cmd/buildlet, I think? But then I especially don't know whether there is a timeout in the LUCI version of the builders either.

@gopherbot
Contributor Author

Found new dashboard test flakes for:

#!watchflakes
post <- log ~ `Test killed with quit` && log ~ `goroutine \d+ \[running\]:\n\s+goroutine running on other thread; stack unavailable\n\s*created by runtime\.gcenable`
2023-11-10 18:46 linux-ppc64le-power10osu tools@3b6876f0 go@31887586 x/tools/internal/gcimporter (log)
SIGQUIT: quit
PC=0x875bc m=0 sigcode=0

goroutine 2 [running]:
	goroutine running on other thread; stack unavailable
created by runtime.init.5 in goroutine 1
	/workdir/go/src/runtime/proc.go:314 +0x2c

goroutine 3 [running]:
	goroutine running on other thread; stack unavailable
...
r20  0x150	r21  0x5846e0
r22  0x7fffd8a2bab0	r23  0x7fffd8a3b388
r24  0xc0152a40b0	r25  0x31a460
r26  0xc028463d30	r27  0xc028463d2c
r28  0x0	r29  0xc00002fc78
r30  0x583aa0	r31  0x1c2b0
pc   0x875bc	ctr  0x0
link 0x48bd0	xer  0x0
ccr  0x34428008	trap 0xc00
*** Test killed with quit: ran too long (22m0s).

watchflakes

@mknyszek
Contributor

@bcmills @pmur I think it might indeed be the case that there's no timeout on make.bash, which isn't great.

In the LUCI world there's always a single timeout for the overall build. I think it's quite high by default, at something like 2 hours, but it is there. We can reconfigure that fairly easily. For LUCI builders that shard out make.bash we can also set a tighter timeout for just that portion and for each test shard. But I believe the overall build will just time out and fail and the shards will get cancelled, since I think we configured the build shards to never outlive their parent. (If not, that's also easy to enable.)

gopherbot pushed a commit that referenced this issue Nov 13, 2023
These functions acquire the heap lock. If they're not called on the
systemstack, a stack growth could cause a self-deadlock since stack
growth may allocate memory from the page heap.

This has been a problem for a while. If this is what's plaguing the
ppc64 port right now, it's very surprising (and probably just
coincidental) that it's showing up now.

For #64050.
For #64062.
Fixes #64067.

Change-Id: I2b95dc134d17be63b9fe8f7a3370fe5b5438682f
Reviewed-on: https://go-review.googlesource.com/c/go/+/541635
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
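
The self-deadlock described in this commit message can be illustrated with a user-level analogy. The sketch below is not runtime code: `heapLock`, `growStack`, and `lockedOperation` are hypothetical stand-ins, and a `sync.Mutex` plays the role of the heap lock because it is likewise not reentrant. `TryLock` (Go 1.18+) is used so the program reports the problem instead of hanging.

```go
package main

import (
	"fmt"
	"sync"
)

// heapLock is a stand-in for the runtime's heap lock. A sync.Mutex is not
// reentrant: its holder cannot acquire it a second time.
var heapLock sync.Mutex

// growStack models a stack growth that allocates from the page heap and so
// needs heapLock. It reports false when the lock is already held; a plain
// Lock call here would block forever in that situation.
func growStack() bool {
	if !heapLock.TryLock() {
		return false
	}
	heapLock.Unlock()
	return true
}

// lockedOperation models a function that acquires heapLock. When needsGrowth
// is true, a stack growth happens while the lock is still held.
func lockedOperation(needsGrowth bool) bool {
	heapLock.Lock()
	defer heapLock.Unlock()
	if needsGrowth {
		return growStack()
	}
	return true
}

func main() {
	// No growth while the lock is held: the case the fix aims for.
	fmt.Println(lockedOperation(false)) // true
	// Growth while the lock is held: the second acquisition cannot succeed.
	fmt.Println(lockedOperation(true)) // false
}
```

With a blocking `Lock` in place of `TryLock`, the second call would never return, which is the hang pattern the commit message describes.
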
@gopherbot
Contributor Author

Change https://go.dev/cl/541955 mentions this issue: [release-branch.go1.21] runtime: call enableMetadataHugePages and its callees on the systemstack

gopherbot pushed a commit that referenced this issue Nov 28, 2023
… callees on the systemstack

These functions acquire the heap lock. If they're not called on the
systemstack, a stack growth could cause a self-deadlock since stack
growth may allocate memory from the page heap.

This has been a problem for a while. If this is what's plaguing the
ppc64 port right now, it's very surprising (and probably just
coincidental) that it's showing up now.

For #64050.
For #64062.
For #64067.
Fixes #64073.

Change-Id: I2b95dc134d17be63b9fe8f7a3370fe5b5438682f
Reviewed-on: https://go-review.googlesource.com/c/go/+/541635
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
(cherry picked from commit 5f08b44)
Reviewed-on: https://go-review.googlesource.com/c/go/+/541955
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
@golang golang locked and limited conversation to collaborators Nov 12, 2024
Projects
Archived in project
4 participants