Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: traceback stuck in runtime.systemstack #55851

Closed
lizthegrey opened this issue Sep 24, 2022 · 14 comments
Closed

runtime: traceback stuck in runtime.systemstack #55851

lizthegrey opened this issue Sep 24, 2022 · 14 comments
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@lizthegrey
Copy link

lizthegrey commented Sep 24, 2022

Similar to #54332

What version of Go are you using (go version)?

$ go version
go version go1.19.1 linux/arm64

Does this issue reproduce with the latest release?

Yes, with go1.19.1

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/home/lizf/.cache/go-build"
GOENV="/home/lizf/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/home/lizf/hny/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/home/lizf/hny/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/snap/go/current"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/snap/go/current/pkg/tool/linux_arm64"
GOVCS=""
GOVERSION="go1.19.1"
GCCGO="gccgo"
GOAMD64="v1"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build3236722962=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Continuously profile for over one month.

0x7aec0
runtime.systemstack
/usr/local/go/src/runtime/asm_arm64.s:206
0x110000
encoding/binary.intDataSize
/usr/local/go/src/encoding/binary/binary.go:768

https://github.com/golang/go/blob/go1.19.1/src/runtime/asm_arm64.s#L206

What did you expect to see?

No traceback stuck assert

What did you see instead?

Traceback stuck assert

runtime: traceback stuck. pc=0x110000 sp=0x40d9082bc0
stack: frame={sp:0x40d9082bc0, fp:0x40d9082bc0} stack=[0x40d9082000,0x40d9084000)
0x00000040d9082ac0:  0x000000000151cd68 <github.com/honeycombio/hound/cmd/shepherd/app.beelinePanicMiddleware.func1+0x0000000000000078>  0x00000000002ab728 <net/http.HandlerFunc.ServeHTTP+0x0000000000000038> 
0x00000040d9082ad0:  0x000000000151bcc0 <github.com/honeycombio/hound/cmd/shepherd/app.instrDatasetSlug.func1+0x00000000000000e0>  0x00000000002ab728 <net/http.HandlerFunc.ServeHTTP+0x0000000000000038> 
0x00000040d9082ae0:  0x000000000151ca9c <github.com/honeycombio/hound/cmd/shepherd/app.(*App).instrMiddleware.func1+0x000000000000040c>  0x00000000002ab728 <net/http.HandlerFunc.ServeHTTP+0x0000000000000038> 
0x00000040d9082af0:  0x00000000007aebf8 <github.com/honeycombio/beeline-go/wrappers/hnygorilla.Middleware.func1+0x00000000000004b8>  0x00000000002ab728 <net/http.HandlerFunc.ServeHTTP+0x0000000000000038> 
0x00000040d9082b00:  0x00000000006a4b94 <github.com/gorilla/mux.(*Router).ServeHTTP+0x00000000000001a4>  0x000000000151bb10 <github.com/honeycombio/hound/cmd/shepherd/app.trailingSlashMiddleware.func1+0x00000000000000d0> 
0x00000040d9082b10:  0x00000000002ab728 <net/http.HandlerFunc.ServeHTTP+0x0000000000000038>  0x000000000151d308 <github.com/honeycombio/hound/cmd/shepherd/app.(*App).ServeHTTP+0x00000000000002a8> 
0x00000040d9082b20:  0x00000000006a2f2c <github.com/NYTimes/gziphandler.GzipHandlerWithOpts.func1.1+0x000000000000022c>  0x00000000002ab728 <net/http.HandlerFunc.ServeHTTP+0x0000000000000038> 
0x00000040d9082b30:  0x0000000001533028 <github.com/honeycombio/hound/cmd/shepherd/app.renameBeelineTraceHeaders.func1+0x0000000000000188>  0x00000040d9082b48 
0x00000040d9082b40:  0x00000000000399c4 <runtime.(*mheap).alloc+0x0000000000000054>  0x00000040d9082b98 
0x00000040d9082b50:  0x0000000000026024 <runtime.(*mcache).allocLarge+0x0000000000000074>  0x00000040d9082b70 
0x00000040d9082b60:  0x0000000000000000  0x0000000000000000 
0x00000040d9082b70:  0x0000000000039a00 <runtime.(*mheap).alloc.func1+0x0000000000000000>  0x0000000002f241c0 
0x00000040d9082b80:  0x00000000000000ab  0x0000000000000001 
0x00000040d9082b90:  0x00000040d9082b68  0x00000040d9082be8 
0x00000040d9082ba0:  0x000000000001c650 <runtime.mallocgc+0x0000000000000530>  0x0000000000156000 <crypto/internal/nistec.(*P256Point).Add+0x0000000000000150> 
0x00000040d9082bb0:  0x00000000000000ab  0x000000417b40a000 
0x00000040d9082bc0: >0x0000000000110000 <encoding/binary.intDataSize+0x00000000000005f0>  0x01010040d9082be8 
0x00000040d9082bd0:  0x000000000001c940 <runtime.mallocgc+0x0000000000000820>  0x00000000000000ab 
0x00000040d9082be0:  0x0000000000156000 <crypto/internal/nistec.(*P256Point).Add+0x0000000000000150>  0x00000040d9082c58 
0x00000040d9082bf0:  0x00000000000620e4 <runtime.growslice+0x0000000000000404>  0x00000040cc15f6c0 
0x00000040d9082c00:  0x0000000000156000 <crypto/internal/nistec.(*P256Point).Add+0x0000000000000150>  0x0000004000184001 
0x00000040d9082c10:  0x000000000001cafc <runtime.newobject+0x000000000000002c>  0x7000000101012c88 
0x00000040d9082c20:  0x0000000000061b6c <runtime.makeslice+0x000000000000007c>  0x0000ffffa676af18 
0x00000040d9082c30:  0x0000000000110000 <encoding/binary.intDataSize+0x00000000000005f0>  0x000000400058c400 
0x00000040d9082c40:  0x000000417b40a000  0x0000000000000000 
0x00000040d9082c50:  0x00000040cc15f6c0  0x00000040d9082cb8 
0x00000040d9082c60:  0x00000000000aacfc <io.ReadAll+0x000000000000009c>  0x0000000000156000 <crypto/internal/nistec.(*P256Point).Add+0x0000000000000150> 
0x00000040d9082c70:  0x0000000000000000  0x00000040cc15f600 
0x00000040d9082c80:  0x0000000000110001 <encoding/binary.intDataSize+0x00000000000005f1>  0x0000000000156000 <crypto/internal/nistec.(*P256Point).Add+0x0000000000000150> 
0x00000040d9082c90:  0x0000000000110000 <encoding/binary.intDataSize+0x00000000000005f0>  0x0000000000156000 <crypto/internal/nistec.(*P256Point).Add+0x0000000000000150> 
0x00000040d9082ca0:  0x0000000000110000 <encoding/binary.intDataSize+0x00000000000005f0>  0x000000417b40a000 
0x00000040d9082cb0:  0x000000417b40a000  0x00000040d9082d48 
fatal error: traceback stuck
runtime.sigprof(0x7aec0, 0x4000aa6c48?, 0x0?, 0x4000583a00, 0x400058c400)	
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Sep 24, 2022
@lizthegrey lizthegrey changed the title runtime: systemstack missing SPWRITE, rarely causes traceback stuck runtime: traceback stuck in runtime.systemstack Sep 24, 2022
@ianlancetaylor
Copy link
Contributor

CC @golang/runtime

@dmitshur dmitshur added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Sep 26, 2022
@dmitshur dmitshur added this to the Backlog milestone Sep 26, 2022
@lizthegrey
Copy link
Author

I wouldn't prioritise this super high right now, it appears to be very rare (since it's only happened once in the past month fleet-wide for us, whereas we were seeing multiple crashes a day with #54332)

@cherrymui cherrymui self-assigned this Sep 28, 2022
@cherrymui
Copy link
Member

Thanks for the information.

The signal PC is the entry PC of systemstack. There are two possibilities:

  1. We're on a user G, calling into systemstack to switch, but the actual switch hasn't happened (because we're still at the function entry). In this case, gp will be a user G (not g0) so the jumpstack logic https://cs.opensource.google/go/go/+/refs/tags/go1.19.1:src/runtime/traceback.go;l=182 won't kick in. In particular, we won't clear the SPWRITE flag and the traceback will simply stop.

  2. We're on g0 at a recursive systemstack (it is not the first one as we're already on g0). In this case the jumpstack logic will kick in, and we'll try to unwind on the user G stack which does the switch, There are two problems:

    1. There are frames between entering the g0 stack and this recursive systemstack call. Those frames will be lost.

    2. Worse, the frame size calculation is wrong because the jumpstack logic assumes we're at a state where a stack switch happened, in particular, the SP delta of the systemstack frame is non-zero (16 for ARM64). But we're still at the entry, so SP delta is 0. This makes https://cs.opensource.google/go/go/+/refs/tags/go1.19.1:src/runtime/traceback.go;l=191 compute a wrong frame.fp (which in turn will lead to a wrong frame.lr and everything will go off).

For a stop-gap solution I think we should not use the jumpstack logic if the SP delta is 0. I'll try sending a CL.

(As with mentioned in a few other issues) using a per-PC SPWRITE marker may be a better fix. If we haven't written the SP, we haven't switched the stack so we can just unwind like a normal function. I think this may also make recursive systemstack case work without losing frames -- recursive systemstack doesn't write SP but just do a normal call (or tail call), so if we don't see SPWRITE we can just unwind normally.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/437299 mentions this issue: runtime: don't jump stack if at entry of systemstack

@lizthegrey
Copy link
Author

lizthegrey commented Oct 28, 2022

Just saw another instance of this crash, go1.19.1. so, validates that it is infrequent but has now happened more than once. I can try backporting the patch into our runtime but it will take several months to know whether it worked ;)

$ go tool addr2line shepherd.full 
0x7af60
runtime.systemstack
/usr/local/go/src/runtime/asm_arm64.s:206

@lizthegrey
Copy link
Author

lizthegrey commented Nov 7, 2022

@cherrymui would you mind making a backport patchset for go1.19 series? thanks! nevermind, it's just the lines after that are subtly different, I think I got it

@cherrymui
Copy link
Member

Yeah, it probably make sense to backport, as it can cause runtime crashes, albeit rare. I'll make a CL. Thanks.

@cherrymui
Copy link
Member

@gopherbot please backport this to previous releases. This may cause runtime crashes. Thanks.

@gopherbot
Copy link
Contributor

Backport issue(s) opened: #56635 (for 1.18), #56636 (for 1.19).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/448516 mentions this issue: [release-branch.go1.19] runtime: don't jump stack if at entry of systemstack

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/448517 mentions this issue: [release-branch.go1.18] runtime: don't jump stack if at entry of systemstack

gopherbot pushed a commit that referenced this issue Nov 9, 2022
…emstack

The traceback code has special "jump stack" logic, to trace back
stack switches through systemstack. If we're at the entry of
systemstack, the stack switch hasn't happened, so don't jump to
user stack.

The jump stack logic is only used if we're on the g0 stack. It can
happen that we're at the entry of a recursive systemstack call on
the g0 stack. In we jump stack here, there will be two problems:
1. There are frames between entering the g0 stack and this
   recursive systemstack call. Those frames will be lost.
2. Worse, we switched frame.sp but frame.fp calculation will use
   the entry SP delta (0), which will be wrong, which in turn
   leads wrong frame.lr and things will go off.

For now, don't jump stack if we're at entry of systemstack (SP
delta is 0).

Using a per-PC SPWRITE marker may be a better fix. If we haven't
written the SP, we haven't switched the stack so we can just
unwind like a normal function.

Updates #55851.
Fixes #56636.

Change-Id: I2b624c8c086b235b34d9c7d3cebd4a37264f00f8
Reviewed-on: https://go-review.googlesource.com/c/go/+/437299
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
(cherry picked from commit 500bc6b)
Reviewed-on: https://go-review.googlesource.com/c/go/+/448516
gopherbot pushed a commit that referenced this issue Nov 9, 2022
…emstack

The traceback code has special "jump stack" logic, to trace back
stack switches through systemstack. If we're at the entry of
systemstack, the stack switch hasn't happened, so don't jump to
user stack.

The jump stack logic is only used if we're on the g0 stack. It can
happen that we're at the entry of a recursive systemstack call on
the g0 stack. In we jump stack here, there will be two problems:
1. There are frames between entering the g0 stack and this
   recursive systemstack call. Those frames will be lost.
2. Worse, we switched frame.sp but frame.fp calculation will use
   the entry SP delta (0), which will be wrong, which in turn
   leads wrong frame.lr and things will go off.

For now, don't jump stack if we're at entry of systemstack (SP
delta is 0).

Using a per-PC SPWRITE marker may be a better fix. If we haven't
written the SP, we haven't switched the stack so we can just
unwind like a normal function.

Updates #55851.
Fixes #56635.

Change-Id: I2b624c8c086b235b34d9c7d3cebd4a37264f00f8
Reviewed-on: https://go-review.googlesource.com/c/go/+/437299
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
(cherry picked from commit 500bc6b)
Reviewed-on: https://go-review.googlesource.com/c/go/+/448517
andrew-d pushed a commit to tailscale/go that referenced this issue Dec 7, 2022
…emstack

The traceback code has special "jump stack" logic, to trace back
stack switches through systemstack. If we're at the entry of
systemstack, the stack switch hasn't happened, so don't jump to
user stack.

The jump stack logic is only used if we're on the g0 stack. It can
happen that we're at the entry of a recursive systemstack call on
the g0 stack. In we jump stack here, there will be two problems:
1. There are frames between entering the g0 stack and this
   recursive systemstack call. Those frames will be lost.
2. Worse, we switched frame.sp but frame.fp calculation will use
   the entry SP delta (0), which will be wrong, which in turn
   leads wrong frame.lr and things will go off.

For now, don't jump stack if we're at entry of systemstack (SP
delta is 0).

Using a per-PC SPWRITE marker may be a better fix. If we haven't
written the SP, we haven't switched the stack so we can just
unwind like a normal function.

Updates golang#55851.
Fixes golang#56636.

Change-Id: I2b624c8c086b235b34d9c7d3cebd4a37264f00f8
Reviewed-on: https://go-review.googlesource.com/c/go/+/437299
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
(cherry picked from commit 500bc6b)
Reviewed-on: https://go-review.googlesource.com/c/go/+/448516
@lizthegrey
Copy link
Author

lizthegrey commented Jan 4, 2023

Sorry for the delay, we've just deployed go1.19.4 and will report back whether we continue to see crashing.

We previously saw 7 crashes in 30 days, so if we go two weeks with no crashes that'll be pretty indicative.

@aciduck
Copy link

aciduck commented Feb 7, 2023

This also happens to us, even after upgrading to 1.19.5. It happens in various environments and components, around 30 times per week, with different stack frames each time. Everything is running on Ubuntu on arm64.

Here are some samples:

runtime: traceback stuck. pc=0x42f850 sp=0x40060eff50
stack: frame={sp:0x40060eff50, fp:0x40060eff50} stack=[0x40060e8000,0x40060f0000)
0x00000040060efe50: 0x0000000000000010 0x00000040011bdc00
0x00000040060efe60: 0x00000040061c0cb0 0x0000000000000000
0x00000040060efe70: 0x000000000042f624 <google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x0000000000000074> 0x00000040060efee8
0x00000040060efe80: 0x00000040060efeb0 0x0000004001342a58
0x00000040060efe90: 0x0000000000000000 0x00000040039eee01
0x00000040060efea0: 0x000000400474c001 0x0000000000000000
0x00000040060efeb0: 0x0001000000010000 0x0000004003e0c420
0x00000040060efec0: 0x00000040041a6a80 0x0000000003316f07
0x00000040060efed0: 0x0000004003e0c420 0x0000000000000000
0x00000040060efee0: 0x000000000042f610 <google.golang.org/grpc/internal/transport.(*loopyWriter).run+0x0000000000000060> 0x00000040039f9140
0x00000040060efef0: 0x0000000000000000 0x00000040039f9260
0x00000040060eff00: 0x0000000000000000 0x0000000000000000
0x00000040060eff10: 0x0000000000444fd8 <google.golang.org/grpc/internal/transport.NewServerTransport.func2+0x00000000000000d8> 0x0000004003e0c410
0x00000040060eff20: 0x0000000001ab4001 0x0000004005d494a0
0x00000040060eff30: 0x00000000004b4b84 <google.golang.org/grpc.(*Server).Serve.func3+0x0000000000000064> 0x0100004002e40f68
0x00000040060eff40: 0x0000000000000000 0x0000000000000000
0x00000040060eff50: >0x000000000042f850 <google.golang.org/grpc/internal/transport.(*loopyWriter).run.func1+0x0000000000000000> 0x00000040060eff40
0x00000040060eff60: 0x00000040060eff50 0x0000000000000000
0x00000040060eff70: 0x0000000000081a64 <runtime.goexit+0x0000000000000004> 0x00000040048925a0
0x00000040060eff80: 0x00000040036c0e00 0x0000004003e0c410
0x00000040060eff90: 0x0000004003e0c460 0x0000000000000009
0x00000040060effa0: 0x0000000002cf8898 0x0000004003dde338
0x00000040060effb0: 0x0000000002cf8898 0x000000400362da00
0x00000040060effc0: 0x0000004003dde338 0x0000000000000000
0x00000040060effd0: 0x0000000000000000 0x000000000441b0e8
0x00000040060effe0: 0x0000000000000000 0x0000000000000000
0x00000040060efff0: 0x000000000441b0e8 0x0000000000000000
fatal error: traceback stuck
runtime: traceback stuck. pc=0x1424420 sp=0x4004695eb0
stack: frame={sp:0x4004695eb0, fp:0x4004695eb0} stack=[0x4004694000,0x4004696000)
0x0000004004695db0: 0x000000400347fba0 0x0000004003229ec0
0x0000004004695dc0: 0x00000040036ffe60 0x0000004005e028a0
0x0000004004695dd0: 0x000000400347fc28 0x000000400347fce8
0x0000004004695de0: 0x0000000000000000 0x0000000000000004
0x0000004004695df0: 0x0000000000000000 0x0000004002a83e18
0x0000004004695e00: 0x000000000142425c <github.com/wiz-sec/wiz/commonlib/memmonitor.(*memoryMonitor).run.func1+0x000000000000007c> 0x0000004003c088a0
0x0000004004695e10: 0x0000000000000000 0x0000004002c04f48
0x0000004004695e20: 0x0000000000ef4f84 <github.com/wiz-sec/wiz/commonlib/waitgroup.(*SafeWaitGroup).Go.func1+0x0000000000000024> 0x0000004004695ec0
0x0000004004695e30: 0x0000004003229e80 0x00000040036fd0e0
0x0000004004695e40: 0x0000000000000000 0x0000000000000002
0x0000004004695e50: 0x000000000033b601 <github.com/aws/aws-sdk-go/aws/credentials.(*Credentials).asyncIsExpired.func1+0x00000000000001e1> 0x0000004002c96f00
0x0000004004695e60: 0x010000400342d318 0x0000000000000000
0x0000004004695e70: 0x0000000004377a98 0x0000004002cd9ea8
0x0000004004695e80: 0x0000000002920000 <github.com/wiz-sec/wiz/graph-dal/internal/graph.mutationResolver.GetExportUrlArguments+0x0000000000000050> 0x00000040036e7090
0x0000004004695e90: 0x0000004002fd0960 0x0000000000000000
0x0000004004695ea0: 0x00000040036fd0e0 0x0000004002cd9ec8
0x0000004004695eb0: >0x0000000001424420 <github.com/wiz-sec/wiz/commonlib/memmonitor.(*memoryMonitor).run.func1.1+0x0000000000000000> 0x00000040036e7090
0x0000004004695ec0: 0x00000040036ffe60 0x0000000000000000
0x0000004004695ed0: 0x0000000000000000 0x0000000000000000
0x0000004004695ee0: 0x0000004004695eb0 0x0000004002cd9ef8
0x0000004004695ef0: 0x0000000000ef4bb8 <github.com/wiz-sec/wiz/commonlib/waitgroup.(*SafeErrorGroup).Go.func1+0x0000000000000088> 0x0000004002cd9f58
0x0000004004695f00: 0x0000000000ef40bc <golang.org/x/sync/errgroup.(*Group).Go.func1+0x000000000000005c> 0x000000400345ea20
0x0000004004695f10: 0x030000400345ea80 0x0000000000000000
0x0000004004695f20: 0x0000000000000000 0x0000000000ef4c20 <github.com/wiz-sec/wiz/commonlib/waitgroup.(*SafeErrorGroup).Go.func1.1+0x0000000000000000>
0x0000004004695f30: 0x0000000004377a98 0x00000040036fd0e0
0x0000004004695f40: 0x0000004004695f18 0x00000000037ce6f0
0x0000004004695f50: 0x0000004004695f28 0x0000000000000000
0x0000004004695f60: 0x000000000007fe04 <runtime.goexit+0x0000000000000004> 0x0000000000000000
0x0000004004695f70: 0x0000000000000000 0x0000000000000000
0x0000004004695f80: 0x0100000000000000 0x0000004003c065c0
0x0000004004695f90: 0x0000000000ef41b0 <golang.org/x/sync/errgroup.(*Group).Go.func1.2+0x0000000000000000> 0x0000004003c065c0
0x0000004004695fa0: 0x0000000000000000 0x0000000000000000
fatal error: traceback stuck

@lizthegrey
Copy link
Author

Please open a new issue, the cause of your problem may be different.

@golang golang locked and limited conversation to collaborators Feb 7, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

6 participants