Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: self-deadlock on mheap_.lock #64067

Closed
mknyszek opened this issue Nov 10, 2023 · 6 comments
Closed

runtime: self-deadlock on mheap_.lock #64067

mknyszek opened this issue Nov 10, 2023 · 6 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Milestone

Comments

@mknyszek
Copy link
Contributor

mknyszek commented Nov 10, 2023

#!watchflakes
default <- (log ~ `test timed out` || log ~ `Test killed with quit`) && (goarch == "ppc64" || goarch == "ppc64le")

After landing a bunch of changes today (and yesterday) the ppc64 builders started timing out:

The root cause of the issue is a self-deadlock on mheap_.lock. An invariant of this lock is that it must be held only on the system stack, because a stack growth may cause the lock to be acquired again when allocating a new stack.

This invariant isn't followed in a very particular function that was added last release. The fix is simple: follow the invariant.

What's bizarre is why this started showing up now, a very long time later. Another question is why it started showing up in response to a set of changes I landed today, and how those changes perturbed the runtime into triggering this case more often. This is also a potential problem on all platforms, not just ppc64. I don't know what makes ppc64 more likely to trigger it.

One theory is that this call path is taken when the heap goal exceeds a certain threshold. The thing is, the heap goal is simply set arbitrarily high if GOGC=off (which is actually kind of a bug in the check, but that part is mostly harmless), so any test setting GOGC=off would be susceptible. I suspect it has something to do with this, but I am unsure.

I'm also unclear as to why this is happening in the compiler so often. Perhaps some packages we build create a heap big enough to exceed the check? Either that, or it could be related to the manipulation of GOGC that it does.

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Nov 10, 2023
@mknyszek mknyszek self-assigned this Nov 10, 2023
@mknyszek mknyszek added this to the Go1.22 milestone Nov 10, 2023
@mknyszek mknyszek added NeedsFix The path to resolution is known, but the work has not been done. release-blocker labels Nov 10, 2023
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/541635 mentions this issue: runtime: call enableMetadataHugePages and its callees on the systemstack

@bcmills
Copy link
Contributor

bcmills commented Nov 10, 2023

Huh! Is there any chance this might also explain #56418?

@mknyszek
Copy link
Contributor Author

Hm... I don't obviously see goroutines blocked on the same lock in #56418, so I can't be sure. But two goroutines are suspiciously reported as "running on another thread" and they could in theory be blocked on this lock, so I can't rule it out though. I suppose we'll find out. :)

As an aside, I think we should have a checker mode for this invariant. In the vast majority of cases (like this one) it should be pretty obvious statically that we're acquiring this lock not on the system stack. The runtime does not make very many indirect calls.

@mknyszek
Copy link
Contributor Author

@gopherbot Please open a backport issue for Go 1.21.

This issue can cause deadlocks in running programs. It's rare but there are no workarounds. It also only applies to code that landed in Go 1.21.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #64073 (for 1.21).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/541955 mentions this issue: [release-branch.go1.21] runtime: call enableMetadataHugePages and its callees on the systemstack

gopherbot pushed a commit that referenced this issue Nov 28, 2023
… callees on the systemstack

These functions acquire the heap lock. If they're not called on the
systemstack, a stack growth could cause a self-deadlock since stack
growth may allocate memory from the page heap.

This has been a problem for a while. If this is what's plaguing the
ppc64 port right now, it's very surprising (and probably just
coincidental) that it's showing up now.

For #64050.
For #64062.
For #64067.
Fixes #64073.

Change-Id: I2b95dc134d17be63b9fe8f7a3370fe5b5438682f
Reviewed-on: https://go-review.googlesource.com/c/go/+/541635
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Run-TryBot: Michael Knyszek <mknyszek@google.com>
Auto-Submit: Michael Knyszek <mknyszek@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Paul Murphy <murp@ibm.com>
(cherry picked from commit 5f08b44)
Reviewed-on: https://go-review.googlesource.com/c/go/+/541955
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Dmitri Shuralyov <dmitshur@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Projects
None yet
Development

No branches or pull requests

3 participants