runtime: "fatal: morestack on g0" with PGO enabled on arm64 #62120

Closed
dmac opened this issue Aug 17, 2023 · 16 comments

dmac commented Aug 17, 2023

What version of Go are you using (go version)?

$ go version
go version go1.21.0 linux/amd64

Does this issue reproduce with the latest release?

go1.21.0 is the latest release at the time of this writing.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=''
GOARCH='amd64'
GOBIN='/home/dmac/go/bin'
GOCACHE='/home/dmac/.cache/go-build'
GOENV='/home/dmac/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS='-tags=noasm'
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/dmac/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/dmac/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/local/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.0'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/dmac/work/src/REDACTED/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1546012642=/tmp/go-build -gno-record-gcc-switches'

What did you do?

I took a pprof profile from a web server running on a linux/arm64 instance, then used that profile with -pgo to cross-compile a new arm64 binary from my amd64 development computer. Running that binary on the arm64 instance crashes due to a segfault within a few seconds of starting, after beginning to run its normal code paths.
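For readers unfamiliar with the workflow described above: the input to -pgo is an ordinary CPU pprof profile. The following is a minimal sketch of producing one with runtime/pprof, not the reporter's setup (a web server would more typically expose the profile over HTTP). The function busyWork, the helper writeCPUProfile, and the output file name are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime/pprof"
)

// busyWork is a hypothetical stand-in for the server's real hot path.
func busyWork(n int) int {
	sum := 0
	for i := 0; i < n; i++ {
		sum += i % 7
	}
	return sum
}

// writeCPUProfile records a CPU profile of work() into the file at path.
func writeCPUProfile(path string, work func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	// Deferred calls run in reverse order: the profile is flushed by
	// StopCPUProfile before the file is closed.
	defer pprof.StopCPUProfile()
	work()
	return nil
}

func main() {
	// Hypothetical output location; any path works.
	path := filepath.Join(os.TempDir(), "default.pgo")
	if err := writeCPUProfile(path, func() { busyWork(50_000_000) }); err != nil {
		fmt.Fprintln(os.Stderr, "profile failed:", err)
		os.Exit(1)
	}
	fmt.Println("wrote", path)
	// Cross-compiling with the profile, as described above, would then be
	// something like:
	//   GOOS=linux GOARCH=arm64 go build -pgo=/path/to/default.pgo
}
```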

This issue reproduces easily with this program, but it does not reproduce with a simple "hello world" web server. Unfortunately, this program is proprietary, but it is around 46,000 lines of code and the size of the compiled binary is about 50 MB.

I can provide a more detailed, redacted crash log if that would be helpful. I may also be able to supply a binary through a private channel if that turns out to be necessary. Please let me know if there's any more information I can provide that would be helpful.

What did you expect to see?

I expect the program to run without crashing.

What did you see instead?

The program crashes with a segfault. The head and tail of the crash log are:

fatal: morestack on g0
SIGSEGV: segmentation violation
PC=0x8e824 m=12 sigcode=1

goroutine 0 [idle]:
runtime.abort()
	runtime/asm_arm64.s:1187 +0x4 fp=0x4000692370 sp=0x4000692370 pc=0x8e824
runtime.morestack()
	runtime/asm_arm64.s:283 +0x14 fp=0x4000692370 sp=0x4000692370 pc=0x8c484

[...]

r0      0x0
r1      0x14e647b
r2      0x17
r3      0x7dbbc
r4      0x40006821a0
r5      0x1
r6      0x20bc4b0
r7      0x16019
r8      0x40
r9      0x0
r10     0x2
r11     0x2483d40
r12     0x0
r13     0x0
r14     0x1
r15     0x1
r16     0x40006923a0
r17     0x4000692330
r18     0x0
r19     0x4000095400
r20     0x40006925e8
r21     0x40006928e8
r22     0x4
r23     0x1
r24     0x1d
r25     0x12c
r26     0x0
r27     0x253d000
r28     0x40006821a0
r29     0x4000692368
lr      0x8c484
sp      0x4000692370
pc      0x8e824
fault   0x0
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 17, 2023
@dmitshur dmitshur added this to the Backlog milestone Aug 17, 2023
@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 17, 2023
@dmitshur
Contributor

CC @cherrymui, @golang/runtime.

@cherrymui
Member

The actual error is fatal: morestack on g0. The segfault happens because the runtime detects this bad condition and intentionally crashes the program.

If you could reproduce the failure in a debugger, that might be helpful. We might be able to find out why we call morestack there in a debugger.

If you build the program with -gcflags=all=-d=pgodebug=1, it will print the optimizations due to PGO. Then you could try manually disabling the optimizations one by one, e.g. add //go:noinline to disable inlining, or add some comments in the function so the line numbers (relative to the function start) won't match the profile. We could probably also provide a patch for binary search. Once you identify the function being miscompiled, we could probably go from there.
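As an illustration of the two manual techniques suggested above, here is a minimal sketch. The function checksum is hypothetical, standing in for a function that the pgodebug output reported as affected by PGO; only the //go:noinline directive and the comment-shifting idea come from the suggestion itself.

```go
package main

import "fmt"

// checksum is a hypothetical function that the pgodebug output listed as a
// PGO inlining candidate. The directive below tells the compiler not to
// inline it, which takes that one optimization out of the picture.
//
//go:noinline
func checksum(data []byte) uint32 {
	// Extra comment lines placed here shift the line offsets of the
	// statements below relative to the function start, so samples recorded
	// in an older profile no longer line up with them.
	var sum uint32
	for _, b := range data {
		sum = sum*31 + uint32(b)
	}
	return sum
}

func main() {
	fmt.Println(checksum([]byte("pgo")))
}
```

Rebuilding with the same profile after each such change, and checking whether the crash still occurs, narrows down which optimization matters.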

Thanks.

@cespare cespare changed the title from runtime: SIGSEGV: segmentation violation with PGO enabled on arm64 to runtime: "fatal: morestack on g0" with PGO enabled on arm64 on Aug 18, 2023
@dmac
Author

dmac commented Aug 22, 2023

I'm able to reproduce the crash in a debugger when compiling this small repro case with the .pgo file from the original build. When I expose this program to our production HTTP traffic it crashes within a minute or two.

This is probably not an absolutely minimal case, but small changes to the above program don't reproduce the error, including:

  • Removing the unused _ "net/http/pprof" import.
  • Modifying the ReadTimeout field.
  • Modifying the ConnState callback.
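The linked repro program is not reproduced in this thread. Purely to illustrate the three features named in the list, and not as the reporter's code, a server with an unused net/http/pprof import, a ReadTimeout, and a ConnState callback might look like the sketch below. The timeout value, the newConns counter, the newServer helper, and the use of httptest to keep the example self-terminating are all assumptions.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // unused here; only registers handlers on http.DefaultServeMux
	"sync/atomic"
	"time"
)

// newConns counts connections that entered http.StateNew.
var newConns atomic.Int64

// newServer returns a server with the two fields mentioned above set.
// The concrete values are hypothetical.
func newServer(h http.Handler) *http.Server {
	return &http.Server{
		Handler:     h,
		ReadTimeout: 10 * time.Second,
		ConnState: func(_ net.Conn, state http.ConnState) {
			if state == http.StateNew {
				newConns.Add(1)
			}
		},
	}
}

func main() {
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, "ok\n")
	})

	// A test server on a loopback port keeps the sketch self-contained.
	ts := httptest.NewUnstartedServer(handler)
	ts.Config = newServer(handler)
	ts.Start()
	defer ts.Close()

	resp, err := ts.Client().Get(ts.URL)
	if err != nil {
		panic(err)
	}
	body, _ := io.ReadAll(resp.Body)
	resp.Body.Close()
	fmt.Printf("status=%d body=%q new connections=%d\n",
		resp.StatusCode, body, newConns.Load())
}
```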

Since there is no longer any private information in the crash, I can share the full output here.

I also recorded the build output with -gcflags=all=-d=pgodebug=1 and -gcflags=all=-d=pgodebug=2.

I'm not quite sure how to determine which function is being miscompiled. I put a breakpoint on runtime.abort and let the program crash in a debugger (I'm using Delve). backtrace reports a deep sequence of calls to runtime.morestack.

@aclements
Member

@cherrymui , want to roll your "morestack on g0" emergency traceback CL for arm64?

This could be a use for PGO bisection?

@cherrymui
Member

Thanks for the small reproducer and the information! The recursive morestack is probably because the debugger doesn't understand the special calling convention of morestack, which is called in a very special way. To get the actual stack trace we'd need to dump the registers and some memory and do manual unwinding.

If you could share the profile and/or a core file, I probably could try to reproduce the crash and do the manual unwinding.

@aclements sure I'll try to work out something for the "morestack on g0" traceback. My old code has a merge conflict with the new unwinder. I'll need to rebase, and port to ARM64.

PGO bisection would probably also help. I'll hack up something for that, too.

@dmac
Author

dmac commented Aug 23, 2023

No problem, here is the executable and the core file, taken from the breakpoint on runtime.abort.

I may be able to privately share the profile as well; I'm discussing it with the security team at my company.

Thanks for taking a look at this, and please let me know if there's more information I can provide (other than the profile).

@cherrymui
Member

@dmac thanks for sharing. Unfortunately, it seems the core file you shared does not include register information (or at least the GDB I used cannot read the registers from it). So it would be hard to investigate. Could you share a new one that includes the registers?

Also, CL https://go.dev/cl/419435 attempts to print a stack trace when a "morestack on g0" error occurs. Could you patch that CL (apply to the Go runtime, and rebuild your binary) and see if it prints anything helpful?

Thanks.

@dmac
Author

dmac commented Aug 29, 2023

Unfortunately, it seems the core file you shared does not include register information (or at least the GDB I used cannot read the registers from it). [...] Could you share a new one that includes the registers?

I generated that core file with Delve, and while I can see registers when inspecting it with Delve, I also can't see registers when inspecting it with GDB.

I performed the repro with the minimal executable using GDB this time and generated a new core file. Does that work better?

CL https://go.dev/cl/419435 attempts to print a stack trace when a "morestack on g0" error occurs. Could you patch that CL (apply to the Go runtime, and rebuild your binary) and see if it prints anything helpful?

I was able to repro the crash using that patch on our full program; the partial output can be viewed here. However, I haven't been able to reproduce the issue using the CL patch on my minimal program yet. I might need to find a new minimal repro when building with the patch. (I'm not sure if the new stack trace is useful without the full binary/core.)

@cherrymui
Member

Thanks!

This looks like an actual stack overflow. The g0 stack size is 8 KB, which matches the size we allocate at https://cs.opensource.google/go/go/+/master:src/runtime/proc.go;l=1941 (I assume this is a non-cgo program). An 8 KB g0 stack looks rather small to me. Maybe due to PGO the stack frames are larger and that just pushes it over the limit... See also #62489.

Could you check whether just increasing the g0 stack size to 16 KB fixes the issue? That is, apply the patch in #62489 (comment) and rebuild the program with the same profile.

Thanks.

@dmac
Author

dmac commented Sep 7, 2023

I assume this is a non-cgo program

Correct, not cgo. And yes, that patch seems to fix the issue in both the minimal program and the original program.

@gopherbot
Contributor

Change https://go.dev/cl/526995 mentions this issue: runtime: increase g0 stack size in non-cgo case

@cherrymui
Member

@dmac thanks for confirming!

@cherrymui
Member

@gopherbot please backport this to Go 1.21. This can cause programs built with PGO to fail to run. There is no workaround except disabling PGO.

@gopherbot
Contributor

Backport issue(s) opened: #62537 (for 1.21).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Contributor

Change https://go.dev/cl/527055 mentions this issue: [release-branch.go1.21] runtime: increase g0 stack size in non-cgo case

@dmac
Author

dmac commented Sep 8, 2023

Thank you!

@dmitshur dmitshur modified the milestones: Backlog, Go1.22 Sep 20, 2023
@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. labels Sep 20, 2023
gopherbot pushed a commit that referenced this issue Sep 22, 2023
Currently, for non-cgo programs, the g0 stack size is 8 KiB on
most platforms. With PGO which could cause aggressive inlining in
the runtime, the runtime stack frames are larger and could
overflow the 8 KiB g0 stack. Increase it to 16 KiB. This is only
one per OS thread, so it shouldn't increase memory use much.

Updates #62120.
Updates #62489.
Fixes #62537.

Change-Id: I565b154517021f1fd849424dafc3f0f26a755cac
Reviewed-on: https://go-review.googlesource.com/c/go/+/526995
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
(cherry picked from commit c6d550a)
Reviewed-on: https://go-review.googlesource.com/c/go/+/527055
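The commit message's claim that the change "shouldn't increase memory use much" can be checked with simple arithmetic: each OS thread has one g0 stack, and each grows by 8 KiB. The sketch below assumes the non-cgo case on platforms where the 8 KiB size applied; the thread counts and the helper name extraG0Bytes are hypothetical.

```go
package main

import "fmt"

const (
	oldG0StackBytes = 8 << 10  // 8 KiB, the size before the change
	newG0StackBytes = 16 << 10 // 16 KiB, the size after the change
)

// extraG0Bytes returns the additional g0 stack memory used by a process
// with the given number of OS threads after the change.
func extraG0Bytes(threads int) int {
	return threads * (newG0StackBytes - oldG0StackBytes)
}

func main() {
	// Hypothetical thread counts, from a small service to a heavily
	// threaded one.
	for _, threads := range []int{8, 64, 512} {
		fmt.Printf("%4d threads: +%d KiB\n", threads, extraG0Bytes(threads)/1024)
	}
}
```

Even at 512 OS threads the extra cost is 4 MiB, which is small next to the roughly 50 MB binary mentioned in the original report.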
bradfitz pushed a commit to tailscale/go that referenced this issue Sep 25, 2023