
runtime: stack split at a bad time on mipsle #21431

Closed
josharian opened this issue Aug 14, 2017 · 13 comments

Comments

Projects
None yet
6 participants
@josharian
Copy link
Contributor

commented Aug 14, 2017

Recent build failure from the dashboard (linux-mipsle at 816deac): https://build.golang.org/log/e101d6d552cc5dce1ebf5b5560c0b399dbe86f6e

runtime: newstack sp=0x8861994 stack=[0x8861000, 0x8862000]
	morebuf={pc:0x93440 sp:0x8861994 lr:0x0}
	sched={pc:0x55e3c sp:0x8861994 lr:0x93440 ctxt:0x0}
syscall.Syscall(0x90010, 0xfa3, 0x5, 0x8861a0c, 0x4, 0x0, 0x0)
	/usr/local/go-bootstrap-mipsle/src/syscall/asm_linux_mipsx.s:31 +0x58
syscall.readlen(0x19, 0x0, 0x888e780, 0x8861ac8, 0x234960, 0x6)
	/usr/local/go-bootstrap-mipsle/src/syscall/zsyscall_linux_mipsle.go:1050 +0x5c
created by main.cmdbootstrap
	/data/mipsle/go/src/cmd/dist/build.go:1071 +0x498
fatal error: runtime: stack split at bad time

[stack traces]
@aclements (Member) commented Nov 22, 2017

It looks like we've hit this a few times:

$ greplogs -dashboard -E "stack split at bad time" -l -md | grep linux-mips
2016-12-21T12:27:41-0ef4815/linux-mipsle
2017-04-21T14:34:10-bb6309c/linux-mipsle
2017-08-13T17:33:10-816deac/linux-mipsle

In all three, the failure is somewhere around syscalls, though the details differ each time. In all cases, the traceback of the crashing goroutine is curiously truncated.

2017-08-13: The crash appears to happen when syscall.Syscall calls runtime.exitsyscall, but exitsyscall is marked nosplit, so it doesn't even have a morestack prologue. However, there may be other runtime frames in there since newstack manually calls traceback to print the traceback, but the default behavior is to hide runtime frames, so we may just not be seeing them.

The arguments are also odd: SYS_READ is 4003 (0xfa3), which appears as the second argument to syscall.Syscall rather than the first. The next argument to syscall.Syscall is supposed to be the FD, which is a completely reasonable 5 in the traceback, but none of the printed arguments to syscall.readlen is 5. My guess is that the traceback code somehow got confused by syscall.Syscall and went off the rails.

The location of the failure is also odd. The exact location in syscall.Syscall indicates that the read syscall failed, but the only place we use readlen is in forkExec in exec_unix.go, and I can't think of any reason why that read could fail.

Unfortunately, I haven't been able to reproduce the binary to check the specific PCs because this is the dist binary, so it was compiled with the bootstrap compiler on 2017-08-13 and I don't know what that was (ping @bradfitz).

2017-04-21: It's not even at a call, though, again, maybe the traceback is messed up. I tried to track down the morebuf PC. I can't locally build a cgo-enabled runtime_test linux/mipsle binary (and there seems to be no hope of reaching the linux-mipsle gomote), so I assumed syscall.Syscall was being called by readlen (the arguments suggest it is), assumed 0x48ff68 was the return address of the call to syscall.Syscall in readlen (because syscall.Syscall declares a zero-sized frame, but exitsyscall takes a dummy argument, that dummy argument should line up with syscall.Syscall's saved LR), and used that as a relocation delta against the non-cgo binary I could build locally. Unfortunately, that put me in the middle of entersyscallblock_handoff, which is called on the system stack, so isn't the culprit.

@gopherbot commented Nov 22, 2017

Change https://golang.org/cl/79518 mentions this issue: runtime: print runtime frames in throwsplit trace

@gopherbot commented Nov 22, 2017

Change https://golang.org/cl/79517 mentions this issue: runtime: call throw on systemstack in exitsyscall

gopherbot pushed a commit that referenced this issue Nov 22, 2017

runtime: call throw on systemstack in exitsyscall
If exitsyscall tries to grow the stack it will panic, but throw calls
print, which can grow the stack. Move the two bare throws in
exitsyscall to the system stack.

Updates #21431.

Change-Id: I5b29da5d34ade908af648a12075ed327a864476c
Reviewed-on: https://go-review.googlesource.com/79517
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>

gopherbot pushed a commit that referenced this issue Nov 22, 2017

runtime: print runtime frames in throwsplit trace
newstack manually prints the stack trace if we try to grow the stack
when throwsplit is set. However, the default behavior is to omit
runtime frames. Since runtime frames can be critical to understanding
this crash, this change fixes this traceback to include them.

Updates #21431.

Change-Id: I5aa43f43aa2f10a8de7d67bcec743427be3a3b5d
Reviewed-on: https://go-review.googlesource.com/79518
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@gopherbot commented Nov 24, 2017

Change https://golang.org/cl/79815 mentions this issue: runtime: fix final stack split in exitsyscall

gopherbot pushed a commit that referenced this issue Nov 24, 2017

runtime: fix final stack split in exitsyscall
exitsyscall should be recursively nosplit, but we don't have a way to
annotate that right now (see #21314). There's exactly one remaining
place where this is violated right now: exitsyscall -> casgstatus ->
print. The other prints in casgstatus are wrapped in systemstack
calls. This fixes the remaining print.

Updates #21431 (in theory could fix it, but that would just indicate
that we have a different G status-related crash and we've *never* seen
that failure on the dashboard.)

Change-Id: I9a5e8d942adce4a5c78cfc6b306ea5bda90dbd33
Reviewed-on: https://go-review.googlesource.com/79815
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@odeke-em (Member) commented Dec 11, 2017

Hello @aclements, gently pinging here to see if there is more work that needs to be done for Go1.10 for this issue.

@aclements (Member) commented Dec 14, 2017

I wish I knew what to do; then I would do more for Go1.10. :)

It looks like it's happened a few more times recently:
2017-11-22T21:16:29-c6c0f47/linux-mipsle
2017-11-30T14:39:19-4435fcf/linux-mipsle
2017-12-05T18:22:53-eb441e6/linux-mipsle
2017-12-05T18:42:38-49fec9b/linux-mipsle

These failures are all essentially the same. As before, they're all around system calls. And, as before, the traceback doesn't make much sense. runtime.reentersyscall is nosplit, so it can't be calling morestack, and reentersyscall doesn't call reentersyscall.

@cherrymui and I spent some time digging into the new failures. Some observations:

According to morebuf and the stack trace, the call to the stack-splitting function is at reentersyscall+0xe4. This corresponds to the nil check of p.ptr() in if _g_.m.p.ptr().runSafePointFn != 0. This could in principle cause a SIGSEGV that turns into a synthesized call to sigpanic from this PC, and sigpanic can indeed split the stack. If sched.pc from the panic message is sigpanic, that would confirm this, but I still haven't had any luck building any of the failing binaries. I'm going to make a CL to add this symbolization directly to the panic so we can hopefully catch it in the wild.

The traceback claims reentersyscall is being called from reentersyscall+0x74, which is a call, but to runtime.casgstatus. We're not sure what to make of that. Given that the traceback cuts off at that point, it's probably failing to walk the stack and simply found the remnants of some earlier call.

We also noticed that syscall.Syscall on mipsle isn't technically following the MIPS C calling convention, which is supposed to always save space for registers on the stack (even those passed in registers), and always save at least 16 bytes for the argument space. syscall.Syscall doesn't save any space. Fixing this would be non-trivial because Go's calling convention places the saved LR in this argument scratch space, so we'd need to move that around. It also doesn't seem to matter both because it would be really weird for the kernel to scribble over this space, and because none of the runtime syscalls reserve this space and they seem to be fine.

@gopherbot commented Dec 14, 2017

Change https://golang.org/cl/84115 mentions this issue: runtime: symbolize morestack caller in throwsplit panic

@cherrymui (Contributor) commented Dec 14, 2017

(from earlier failure)

The arguments are also odd: SYS_READ is 4003 (0xfa3), which appears as the second argument to syscall.Syscall rather than the first.

The "first arg" there, 0x48a528, looks pretty much like a PC. This suggests there is an off-by-one error somewhere, because one word below the first arg is the saved LR.

preparePanic subtracts 4 from SP and stores the faulting address there. The traceback code handles this specially if it sees sigpanic's PC (https://go.googlesource.com/go/+/67295d6eb0b9ef8d40fcddf052d18ebaa03566e4/src/runtime/traceback.go#482). But if sigpanic's prologue tried to grow the stack and failed, we're not really in sigpanic yet, so the traceback code doesn't apply that special case. That leaves argp off by one word, and a saved LR gets printed as the first arg.

This also agrees with what we see in more recent failures:

runtime.reentersyscall(0x48d588, 0x48d588)
	/data/mipsle/go/src/runtime/proc.go:2811 +0xe4 fp=0x91152c4 sp=0x91152a8 pc=0x43dab4
runtime.reentersyscall(0x0, 0x0)
	/data/mipsle/go/src/runtime/proc.go:2790 +0x74 fp=0x91152e0 sp=0x91152c4 pc=0x43da44

It appears that reentersyscall gets called with two identical args, but that can't actually happen. In fact, reentersyscall can be called by entersyscall with getcallerpc() as its first arg, which is also the saved LR in entersyscall's frame. The off-by-one error makes them appear as two args. So the actual call stack may be

reentersyscall
entersyscall
function at 0x48d588 (probably syscall.Syscall?)

The traceback code thinks reentersyscall recursed into itself because it uses the LR saved by preparePanic as its return address, which is whatever was in the LR register when the fault happened; that happens to be the return address of the casgstatus call in reentersyscall. From there, the backtrace goes far off the rails. This also explains why the backtrace is always truncated.

gopherbot pushed a commit that referenced this issue Dec 15, 2017

runtime: symbolize morestack caller in throwsplit panic
This attempts to symbolize the PC of morestack's caller when there's a
stack split at a bad time. The stack trace starts at the *caller* of
the function that attempted to grow the stack, so this is useful if it
isn't obvious what's being called at that point, such as in #21431.

Change-Id: I5dee305d87c8069611de2d14e7a3083d76264f8f
Reviewed-on: https://go-review.googlesource.com/84115
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@aclements (Member) commented Jan 12, 2018

This hasn't happened again since 2017-12-05, so no new diagnostics, unfortunately. However, I think @cherrymui's theory is pretty sound.

This got me thinking that sigpanic itself really ought to be nosplit, since it can be "called" from contexts where it's unsafe to grow the stack. Then, if throwsplit is set, it should switch to the system stack and throw rather than attempt any further handling of the signal panic (which may grow the stack).

Of course, that wouldn't help with whatever underlying issue is causing the signal here (assuming that's what's really going on), but it would help debug issues like these.

@gopherbot commented Jan 12, 2018

Change https://golang.org/cl/87595 mentions this issue: runtime: don't grow the stack on sigpanic if unsafe

@gopherbot commented Jan 22, 2018

Change https://golang.org/cl/89016 mentions this issue: runtime: print hexdump on traceback failure

gopherbot pushed a commit that referenced this issue Jan 22, 2018

runtime: print hexdump on traceback failure
Currently, if anything goes wrong when printing a traceback, we simply
cut off the traceback without any further diagnostics. Unfortunately,
right now, we have a few issues that are difficult to debug because
the traceback simply cuts off (#21431, #23484).

This is an attempt to improve the debuggability of traceback failure
by printing a diagnostic message plus a hex dump around the failed
traceback frame when something goes wrong.

The failures look like:

goroutine 5 [running]:
runtime: unexpected return pc for main.badLR2 called from 0xbad
stack: frame={sp:0xc42004dfa8, fp:0xc42004dfc8} stack=[0xc42004d800,0xc42004e000)
000000c42004dea8:  0000000000000001  0000000000000001
000000c42004deb8:  000000c42004ded8  000000c42004ded8
000000c42004dec8:  0000000000427eea <runtime.dopanic+74>  000000c42004ded8
000000c42004ded8:  000000000044df70 <runtime.dopanic.func1+0>  000000c420001080
000000c42004dee8:  0000000000427b21 <runtime.gopanic+961>  000000c42004df08
000000c42004def8:  000000c42004df98  0000000000427b21 <runtime.gopanic+961>
000000c42004df08:  0000000000000000  0000000000000000
000000c42004df18:  0000000000000000  0000000000000000
000000c42004df28:  0000000000000000  0000000000000000
000000c42004df38:  0000000000000000  000000c420001080
000000c42004df48:  0000000000000000  0000000000000000
000000c42004df58:  0000000000000000  0000000000000000
000000c42004df68:  000000c4200010a0  0000000000000000
000000c42004df78:  00000000004c6400  00000000005031d0
000000c42004df88:  0000000000000000  0000000000000000
000000c42004df98:  000000c42004dfb8  00000000004ae7d9 <main.badLR2+73>
000000c42004dfa8: <00000000004c6400  00000000005031d0
000000c42004dfb8:  000000c42004dfd0 !0000000000000bad
000000c42004dfc8: >0000000000000000  0000000000000000
000000c42004dfd8:  0000000000451821 <runtime.goexit+1>  0000000000000000
000000c42004dfe8:  0000000000000000  0000000000000000
000000c42004dff8:  0000000000000000
main.badLR2(0x0)
	/go/src/runtime/testdata/testprog/badtraceback.go:42 +0x49

For #21431, #23484.

Change-Id: I8718fc76ced81adb0b4b0b4f2293f3219ca80786
Reviewed-on: https://go-review.googlesource.com/89016
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>

gopherbot pushed a commit that referenced this issue Jan 23, 2018

runtime: don't grow the stack on sigpanic if throwsplit
Currently, if a _SigPanic signal arrives in a throwsplit context,
nothing is stopping the runtime from injecting a call to sigpanic that
may attempt to grow the stack. This will fail and, in turn, mask the
real problem.

Fix this by checking for throwsplit in the signal handler itself
before injecting the sigpanic call.

Updates #21431, where this problem is likely masking the real problem.

Change-Id: I64b61ff08e8c4d6f6c0fb01315d7d5e66bf1d3e2
Reviewed-on: https://go-review.googlesource.com/87595
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@aclements (Member) commented Jan 23, 2018

Still no new reproductions since 2017-12-05T18:42:38-49fec9b/linux-mipsle.

@aclements modified the milestones: Go1.10, Go1.11 (Jan 23, 2018)

@ianlancetaylor (Contributor) commented Jul 11, 2018

Hasn't happened in six months. I don't know what fixed this, but I'm calling it fixed.
