
runtime: gentraceback() dead loop on arm64 caused the process hang #52116

Closed · tiancaiamao opened this issue Apr 2, 2022 · 20 comments
Labels: NeedsInvestigation

@tiancaiamao (Author) commented Apr 2, 2022

What version of Go are you using (go version)?

$ go version
go version go1.18 linux/arm64

Does this issue reproduce with the latest release?

Yes, it reproduces with the latest Go 1.18.

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="arm64"
GOBIN=""
GOCACHE="/root/.cache/go-build"
GOENV="/root/.config/go/env"
GOEXE=""
GOEXPERIMENT=""
GOFLAGS=""
GOHOSTARCH="arm64"
GOHOSTOS="linux"
GOINSECURE=""
GOMODCACHE="/root/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="linux"
GOPATH="/root/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/linux_arm64"
GOVCS=""
GOVERSION="go1.18"
GCCGO="gccgo"
AR="ar"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="/dev/null"
GOWORK=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build1414115226=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Run a benchmark workload in our database: pingcap/tidb#31477
After we recently upgraded the toolchain to Go 1.18, it reproduces stably.
It takes maybe 6~8h, or at most 24h, for the tidb-server to hang.
When that happens, the tidb-server CPU is at 100% (one thread fully occupied) and it stops serving any requests.

I used gdb and dlv to debug it, and found that one thread is stuck in a dead loop in the gentraceback() function (which also seems to be holding a lock), while all the other threads are in futex() (blocked on that lock).

I don't know how to reproduce it with a minimal code snippet. Maybe I can provide the core dump of the process? But that would be too large.

Some more information I can provide:

  • This bug only reproduces on arm64; on x86 we have never encountered it.
  • It was said to be first observed in TiDB 5.4, which used Go 1.16.3 or Go 1.16.4 (I don't remember exactly), also on arm64.
  • Go 1.18 seems to trigger the bug more easily; we ran TiDB v5.1.4, which is built with Go 1.16.4, and it didn't reproduce after more than 36h.

The phenomenon looks like another issue, #50772, but I've checked that that fix is included in Go 1.18, so this might be a new case.

More details:
The code doesn't enter this block

go/src/runtime/traceback.go

Lines 357 to 377 in 4aa1efe

if inldata := funcdata(f, _FUNCDATA_InlTree); inldata != nil {
	inltree := (*[1 << 20]inlinedCall)(inldata)
	for {
		ix := pcdatavalue(f, _PCDATA_InlTreeIndex, tracepc, &cache)
		if ix < 0 {
			break
		}
		if inltree[ix].funcID == funcID_wrapper && elideWrapperCalling(lastFuncID) {
			// ignore wrappers
		} else if skip > 0 {
			skip--
		} else if n < max {
			(*[1 << 20]uintptr)(unsafe.Pointer(pcbuf))[n] = pc
			n++
		}
		lastFuncID = inltree[ix].funcID
		// Back up to an instruction in the "caller".
		tracepc = frame.fn.entry() + uintptr(inltree[ix].parentPc)
		pc = tracepc + 1
	}
}

Then the code runs into this branch

go/src/runtime/traceback.go

Lines 379 to 380 in 4aa1efe

if f.funcID == funcID_wrapper && elideWrapperCalling(lastFuncID) {
	// Ignore wrapper functions (except when they trigger panics).

funcID is funcID_wrapper

Then line

n-- // offset n++ below

and line

Note: after the n-- and the n++, the value of n never changes, so the condition n < max can never break the loop.

This code block sets frame to its caller's frame, but frame.fn and flr are the same one!

go/src/runtime/traceback.go

Lines 480 to 486 in 4aa1efe

// Unwind to next frame.
frame.fn = flr
frame.pc = frame.lr
frame.lr = 0
frame.sp = frame.fp
frame.fp = 0
frame.argmap = nil

Thus it results in a dead loop.

I can work around this bug with this patch: tiancaiamao@5d1aea4
But I still can't figure out the root cause of the bug.

Print the stack when debugging in dlv:

(dlv) thread 1945
Switched from 1925 to 1945
(dlv) bt
 0  0x00000000013211d8 in runtime.funcdata
    at /usr/local/go1.18/src/runtime/symtab.go:1097
 1  0x00000000013211d8 in runtime.gentraceback
    at /usr/local/go1.18/src/runtime/traceback.go:357
 2  0x00000000012caa8c in runtime.makechan
    at :0
 3  0x000000000135a19c in context.(*cancelCtx).Done
    at /usr/local/go1.18/src/context/context.go:367
 4  0x000000000135bcd4 in context.(*valueCtx).Done
    at <autogenerated>:1
 5  0x000000000135bcd4 in context.(*valueCtx).Done
    at <autogenerated>:1
 6  0x000000000135bcd4 in context.(*valueCtx).Done
    at <autogenerated>:1
 7  0x000000000135bcd4 in context.(*valueCtx).Done
    at <autogenerated>:1
 8  0x000000000135bcd4 in context.(*valueCtx).Done
    at <autogenerated>:1
 9  0x000000000135bcd4 in context.(*valueCtx).Done
    at <autogenerated>:1
10  0x000000000135bcd4 in context.(*valueCtx).Done
    at <autogenerated>:1
11  0x000000000135bcd4 in context.(*valueCtx).Done
    at <autogenerated>:1
12  0x0000000001359878 in context.propagateCancel
    at /usr/local/go1.18/src/context/context.go:251
13  0x0000000001359744 in context.WithCancel
    at /usr/local/go1.18/src/context/context.go:237
14  0x0000000001bdb318 in github.com/tikv/client-go/v2/internal/retry.(*Backoffer).Fork
    at /root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220321123529-f4eae62b7ed5/internal/retry/backoff.go:259
15  0x0000000001cbc258 in github.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker.func1
    at /root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220321123529-f4eae62b7ed5/txnkv/transaction/2pc.go:1992
16  0x000000000132eaa4 in runtime.goexit
    at /usr/local/go1.18/src/runtime/asm_arm64.s:1259

(dlv) gr
Thread 1945 at /usr/local/go1.18/src/runtime/symtab.go:845
Goroutine 1703734534:
        Runtime: /usr/local/go1.18/src/runtime/symtab.go:845 runtime.findfunc (0x1319340)
        User: /usr/local/go1.18/src/context/context.go:367 context.(*cancelCtx).Done (0x135a19c)
        Go: /root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220321123529-f4eae62b7ed5/txnkv/transaction/2pc.go:1978 github.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker (0x1cbbf80)
        Start: /root/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220321123529-f4eae62b7ed5/txnkv/transaction/2pc.go:1978 github.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker.func1 (0x1cbc1c0)
        Labels: "plan_digest":"�a-g+\x00c��+B��Xl�\"]��\x1e3ߩ�\x11i.]���", "sql_digest":"1�:,U[�pb\vҔ�\x1a�\x0f��\x1a\x01DJ-�b���_XM�"

In gdb:

(gdb) bt
#0  0x0000000001321c08 in runtime.gentraceback (pc0=20059304, sp0=281472080470624, lr0=<optimized out>, gp=0x40009081a0, skip=0, pcbuf=0x4000042968, max=64, callback={void (runtime.stkframe *, void *, bool *)} 0x4000042940, v=0x0, flags=6, ~r0=<optimized out>)
    at /usr/local/go1.18/src/runtime/traceback.go:379
#1  0x00000000013072d0 in runtime.sigprof (pc=6, sp=<optimized out>, lr=<optimized out>, gp=0x15, mp=0x4000600c00) at /usr/local/go1.18/src/runtime/proc.go:4507
#2  0x0000000001311e64 in runtime.sighandler (sig=<optimized out>, info=<optimized out>, ctxt=<optimized out>, gp=0x15) at /usr/local/go1.18/src/runtime/signal_unix.go:613
#3  0x0000000001311614 in runtime.sigtrampgo (sig=27, info=0x4000042da0, ctx=0x4000042e20) at /usr/local/go1.18/src/runtime/signal_unix.go:477
#4  0x00000000013302ac in runtime.sigtrampgo (sig=27, info=0x4000042da0, ctx=0x4000042e20) at <autogenerated>:1
#5  0x000000000132fbc4 in runtime.sigtramp () at /usr/local/go1.18/src/runtime/sys_linux_arm64.s:489
#6  <signal handler called>
#7  0x00000000013214a8 in runtime.gentraceback (pc0=19704460, sp0=276041063296, lr0=<optimized out>, gp=0x40562c0ea0, skip=0, pcbuf=0x0, max=2147483647, callback={void (runtime.stkframe *, void *, bool *)} 0xffff535ee5c0, v=0xffff535ee618, flags=0, ~r0=<optimized out>)
    at /usr/local/go1.18/src/runtime/traceback.go:331
#8  0x0000000001315550 in runtime.copystack (gp=0x40562c0ea0, newsize=4096) at /usr/local/go1.18/src/runtime/stack.go:930
#9  0x00000000013159ac in runtime.newstack () at /usr/local/go1.18/src/runtime/stack.go:1110
#10 0x000000000132c4a0 in runtime.morestack () at /usr/local/go1.18/src/runtime/asm_arm64.s:310
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

The stacks in dlv and gdb don't look exactly the same, although the innermost frames are both in gentraceback (or its callees).
I tried to get the stack information from pcbuf to reason about the real stack; the pc-to-symbol mapping below is just my guess.

(gdb) p pcbuf[0]
$2 = 20059305
13214a9 = runtime.gentraceback?

(gdb) p pcbuf[1]
$3 = 20010320
1315550 = runtime.copystack?

(gdb) p pcbuf[2]
$4 = 20011436
13159ac = runtime.newstack?

(gdb) p pcbuf[3]
$5 = 19704460
12caa8c = runtime.makechan?

(gdb) p pcbuf[4]
$6 = 19912820
12fd874

(gdb) p pcbuf[5]
$7 = 19912865
12fd8a1

(gdb) p pcbuf[6]
$8 = 0

Anything else I can provide for you to debug it?

What did you expect to see?

No dead loop in gentraceback() causing the server to hang.

What did you see instead?

gentraceback() dead loop on arm64 caused the process to hang.

@mengzhuo added the NeedsInvestigation label Apr 2, 2022

@mengzhuo (Contributor) commented Apr 2, 2022

cc @cherrymui @zhangfannie

@zhangfannie (Contributor) commented Apr 6, 2022

@tiancaiamao

This code block will set frame to its upper frame, but frame.fn and flr are the
same one!

Can you print the information of each unwind stack frame (pc, lr, sp, fp)?

I looked at the code and found that the following two lines can cause frame.pc == frame.lr.
Here

frame.fp = frame.sp + uintptr(funcspdelta(f, frame.pc, &cache))

computes frame.fp; if funcspdelta() returns 0, frame.fp becomes equal to frame.sp. Then here

frame.lr = *(*uintptr)(unsafe.Pointer(lrPtr))

we get frame.lr, which will then always be equal to frame.pc.

  1. Maybe we need to figure out why funcspdelta() returns 0? Because the current frame is the wrapper's frame? I am not familiar with this function.
  2. The assignment of frame.lr
    frame.lr = *(*uintptr)(unsafe.Pointer(lrPtr))
    does not check whether frame.sp is equal to frame.fp. If frame.sp == frame.fp, we need to skip unwinding its caller's stack and unwind its caller's caller's stack instead?

Thank you.

@tiancaiamao (Author) commented Apr 6, 2022

Can you print the information of each unwind stack frame (pc, lr, sp, fp)?

I'm not sure how to get it. Iterate over frame.lr and print each frame struct?

From my previous observation, frame.fp is equal to frame.sp,
so funcspdelta() returned 0; is this not expected?

We've reinstalled the cluster to test something else, so the reproduction is not available right now.
I'll provide more information when we get the reproduction again. @zhangfannie

@tiancaiamao (Author) commented Apr 11, 2022

Sorry for the long wait; now I have a reproduction, using Go 1.16.3.

Here is how funcspdelta() returns 0: the value is found through the cache:

// Returns the PCData value, and the PC where this value starts.
// TODO: the start PC is returned only when cache is nil.
func pcvalue(f funcInfo, off uint32, targetpc uintptr, cache *pcvalueCache, strict bool) (int32, uintptr) {
	if off == 0 {
		return -1, 0
	}

	// Check the cache. This speeds up walks of deep stacks, which
	// tend to have the same recursive functions over and over.
	//
	// This cache is small enough that full associativity is
	// cheaper than doing the hashing for a less associative
	// cache.
	if cache != nil {
		x := pcvalueCacheKey(targetpc)
		for i := range cache.entries[x] {
			// We check off first because we're more
			// likely to have multiple entries with
			// different offsets for the same targetpc
			// than the other way around, so we'll usually
			// fail in the first clause.
			ent := &cache.entries[x][i]
			if ent.off == off && ent.targetpc == targetpc {
				return ent.val, 0 // <- returns here
			}
		}

and the value of the cache:

(gdb) p	cache
$11 = (runtime.pcvalueCache *) 0x400058a6d8
(gdb) p	*cache
$12 = {entries = {{{targetpc = 49394656, off = 136908, val = 0}, {targetpc = 0,	off = 0, val = 0}, {targetpc = 20234160, off = 104068, val = 416}, {targetpc = 0, off = 0, val = 0}, {targetpc = 49284320, off = 6097298, val = -1}, {targetpc = 49284320, off = 6097170, 
        val = 0}, {targetpc = 20252212,	off = 107867, val = 128}, {targetpc = 0, off = 0, val = 0}}, {{targetpc = 31036827, off = 2392400, val = -1}, {targetpc = 0, off = 0, val = 0}, {targetpc = 0, off = 0,	val = 0}, {targetpc = 30988204,	off = 2377579, val = 0}, {
        targetpc = 0, off = 0, val = 0}, {targetpc = 0,	off = 0, val = 0}, {targetpc = 31036828, off = 2392162,	val = 192}, {targetpc = 30988203, off = 2379034, val = -1}}}}
(gdb) p	targetpc
$13 = 49394656

@tiancaiamao (Author) commented Apr 11, 2022

The value in the pcbuf:

(gdb) p	pcbuf[0]
$19 = 20253253
(gdb) p	pcbuf[1]
$20 = 20253221
(gdb) p	pcbuf[2]
$21 = 20252212
(gdb) p	pcbuf[3]
$22 = 20234160
(gdb) p	pcbuf[4]
$23 = 30988204
(gdb) p	pcbuf[5]
$24 = 31036828
(gdb) p	pcbuf[6]
$25 = 31036828
(gdb) p	pcbuf[7]
$26 = 49284321
(gdb) p	pcbuf[8]

The call stack this time:

(gdb) bt
#0  runtime.gentraceback (pc0=20253252, sp0=281472796255936, lr0=20252212, gp=0x4000582180, skip=0, pcbuf=0x400058a938, max=64, callback={void (runtime.stkframe *, void *, bool *)} 0x400058a910, v=0x0, flags=6, ~r10=<optimized out>)
    at /usr/local/go/src/runtime/traceback.go:379
#1  0x000000000133d584 in runtime.sigprof (pc=20253252, sp=281472796255936, lr=20252212, gp=0x4000582180, mp=0x4000580000) at /usr/local/go/src/runtime/proc.go:4499
#2  0x0000000001348968 in runtime.sighandler (sig=27, info=0x400058ada0, ctxt=0x400058ae20, gp=0x4000582180) at /usr/local/go/src/runtime/signal_linux_arm64.go:51
#3  0x0000000001347e50 in runtime.sigtrampgo (sig=27, info=0x400058ada0, ctx=0x400058ae20) at /usr/local/go/src/runtime/signal_unix.go:459
#4  0x000000000136a364 in runtime.sigtramp () at /usr/local/go/src/runtime/sys_linux_arm64.s:458
#5  <signal handler called>
#6  runtime.readvarint (p=..., read=<optimized out>, val=<optimized out>) at /usr/local/go/src/runtime/symtab.go:1003
#7  runtime.step (p=<error reading variable: access outside bounds of object referenced via synthetic pointer>, pc=0xffff7e08e750, val=0xffff7e08e740, first=false, newp=..., ok=true) at /usr/local/go/src/runtime/symtab.go:991
#8  0x0000000001350634 in runtime.funcMaxSPDelta (f=..., ~r1=<optimized out>) at /usr/local/go/src/runtime/symtab.go:920
#9  0x000000000134bfb0 in runtime.newstack () at /usr/local/go/src/runtime/stack.go:1056
#10 0x0000000001366ea0 in runtime.morestack () at /usr/local/go/src/runtime/asm_arm64.s:321
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Can you print the information of each unwind stack frame (pc, lr, sp, fp)?

(gdb) p	frame.pc
$19 = 49394656
(gdb) p	frame.lr
$20 = 49394656
(gdb) p	frame.sp
$21 = 276551227360
(gdb) p	frame.fp
$22 = 276551227360

I don't know how to get each stack frame's info; I'd need to know the calling convention for the stack layout.

(gdb) p *(frame.sp+16)
$32 = 1905865760
(gdb) p *(frame.sp+8) 
$33 = 20586664
(gdb) p *(frame.sp)   
$34 = 49394656
(gdb) p	frame.pc
$35 = 49394656

@cherrymui (Member) commented Apr 13, 2022

Could you print the function names at the PC and LR? You could probably do p/a frame.pc and p/a frame.lr in GDB. Thanks.

@tiancaiamao (Author) commented Apr 14, 2022

Could you print the function names at the PC and LR? You could probably do p/a frame.pc and p/a frame.lr in GDB. Thanks.

(gdb) p/a frame.pc
$25 = 0x2f1b3e0 <github.com/pingcap/tidb/util/expensivequery.(*Handle).LogOnQueryExceedMemQuota-fm>
(gdb) p/a frame.lr
$26 = 0x2f1b3e0 <github.com/pingcap/tidb/util/expensivequery.(*Handle).LogOnQueryExceedMemQuota-fm>
(gdb) p/a pcbuf[0]
$16 = 0x1350a45 <runtime.step+245>
(gdb) p/a pcbuf[1]
$17 = 0x1350a25 <runtime.step+213>
(gdb) p/a pcbuf[2]
$18 = 0x1350634 <runtime.funcMaxSPDelta+180>
(gdb) p/a pcbuf[3]
$19 = 0x134bfb0 <runtime.newstack+400>
(gdb) p/a pcbuf[4]
$20 = 0x1d8d7ac <github.com/tikv/client-go/v2/txnkv/transaction.actionCommit.handleSingleBatch+6444>
(gdb) p/a pcbuf[5]
$21 = 0x1d9959c <github.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker.func1+188>
(gdb) p/a pcbuf[6]
$22 = 0x1d9959c <github.com/tikv/client-go/v2/txnkv/transaction.(*batchExecutor).startWorker.func1+188>
(gdb) p/a pcbuf[7]
$23 = 0x2f004e1 <github.com/pingcap/tidb/executor.FormatSQL.func1+1>
(gdb) p/a pcbuf[8]
$24 = 0x0

@cherrymui

@tiancaiamao (Author) commented Apr 14, 2022

pc and lr are the same; it's already trapped in the dead loop.

The code of the last stack frame, LogOnQueryExceedMemQuota, is here.
When we log the error, maybe the log package obtains the calling stack and enters gentraceback().

@gopherbot commented Apr 15, 2022

Change https://go.dev/cl/400575 mentions this issue: runtime: use saved LR when unwinding through morestack

@cherrymui (Member) commented Apr 15, 2022

Could you try if CL https://go.dev/cl/400575 fixes it? Thanks.

@tiancaiamao (Author) commented Apr 19, 2022

Could you try if CL https://go.dev/cl/400575 fixes it? Thanks.

Now we've tried that CL and watched it for 19h; no reproduction yet.
It seems to work. We'll continue the test for at least 1 or 2 more days.
@cherrymui

@tiancaiamao (Author) commented Apr 22, 2022

Several days have passed, and still no reproduction.
Confirmed that CL https://go.dev/cl/400575 solves our issue.
Thanks @cherrymui

@tiancaiamao (Author) commented May 7, 2022

Will this fix be cherry-picked to 1.16? @cherrymui

@Ivy-YinSu commented May 10, 2022

Will this fix be cherry-picked to 1.18? And when will we get the first version with this issue fixed? @cherrymui

tiancaiamao added a commit to tiancaiamao/go that referenced this issue May 23, 2022

On LR machine, consider F calling G calling H, which grows stack.
The stack looks like
...
G's frame:
	... locals ...
	saved LR = return PC in F  <- SP points here at morestack
H's frame (to be created)

At morestack, we save
	gp.sched.pc = H's morestack call
	gp.sched.sp = H's entry SP (the arrow above)
	gp.sched.lr = return PC in G

Currently, when unwinding through morestack (if _TraceJumpStack
is set), we switch PC and SP but not LR. We then have
	frame.pc = H's morestack call
	frame.sp = H's entry SP (the arrow above)
As LR is not set, we load it from stack at *sp, so
	frame.lr = return PC in F
As the SP hasn't decremented at the morestack call,
	frame.fp = frame.sp = H's entry SP

Unwinding a frame, we have
	frame.pc = old frame.lr = return PC in F
	frame.sp = old frame.fp = H's entry SP a.k.a. G's SP
The PC and SP don't match. The unwinding will go off if F and G
have different frame sizes.

Fix this by preserving the LR when switching stack.

Also add code to detect infinite loop in unwinding.

TODO: add some test. I can reproduce the infinite loop (or throw
with added check) but the frequency is low.

May fix golang#52116.
EricMountain pushed a commit to DataDog/go that referenced this issue May 27, 2022
Change-Id: I6e1294f1c6e55f664c962767a1cf6c466a0c0eff
Reviewed-on: https://go-review.googlesource.com/c/go/+/400575
TryBot-Result: Gopher Robot <gobot@golang.org>
Run-TryBot: Cherry Mui <cherryyz@google.com>
Reviewed-by: Eric Fang <eric.fang@arm.com>
Reviewed-by: Benny Siegert <bsiegert@gmail.com>
@felixge (Contributor) commented May 27, 2022

We're also experiencing this issue and would appreciate a backport of this fix for Go 1.18.

@cherrymui (Member) commented May 27, 2022

@gopherbot please backport this to previous releases. This is a runtime bug which can cause programs to hang or crash. Thanks.

@gopherbot commented May 27, 2022

Backport issue(s) opened: #53111 (for 1.17), #53112 (for 1.18).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot commented May 27, 2022

Change https://go.dev/cl/408821 mentions this issue: [release-branch.go1.18] runtime: use saved LR when unwinding through morestack

@gopherbot commented May 27, 2022

Change https://go.dev/cl/408822 mentions this issue: [release-branch.go1.17] runtime: use saved LR when unwinding through morestack

@lizthegrey commented May 30, 2022

Thanks for finding this. We've also been encountering it on go1.18.2 and were stumped, because we'd found a similar hang in go1.17.6 (#50772) but knew that one had already been fixed.
