runtime: stack split at a bad time on mipsle #21431
Recent build failure from the dashboard (linux-mipsle at 816deac):
It looks like we've hit this a few times:
In all three, it's been something around syscalls, though all of the details differ. In all cases the traceback of the crashing goroutine is curiously truncated.
2017-08-13: The crash appears to happen when syscall.Syscall calls runtime.exitsyscall, but exitsyscall is marked nosplit, so it doesn't even have a morestack prologue. However, there may be other runtime frames in there since newstack manually calls traceback to print the traceback, but the default behavior is to hide runtime frames, so we may just not be seeing them.
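For readers less familiar with the directive, here is a minimal sketch of what "marked nosplit" means. The function `add` is purely illustrative and is not runtime code; the point is only that `//go:nosplit` tells the compiler to omit the stack-growth check, so a function like this has no prologue that could call morestack.

```go
package main

import "fmt"

// add is a hypothetical example function, not part of the runtime.
// The go:nosplit directive asks the compiler to omit the stack-bound
// check in the prologue, so this function itself never calls morestack.
//
//go:nosplit
func add(a, b int) int {
	return a + b
}

func main() {
	fmt.Println(add(2, 3))
}
```

If a stack split is reported inside a function like this, either the report is attributing the split to the wrong frame or something else is running on that stack.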
The arguments are also odd: SYS_READ is 4003 (0xfa3), which appears as the second argument to syscall.Syscall rather than the first. The next argument of syscall.Syscall is supposed to be the FD, which is a completely reasonable 5 in the traceback, but none of the printed arguments to syscall.readlen is 5. My guess here is that the traceback code maybe got confused somehow by syscall.Syscall and went off the rails.
The location of the failure is also odd. The exact location in syscall.Syscall indicates that the read syscall failed, but the only place we use readlen is in forkExec in exec_unix.go, and I can't think of any reason why that read could fail.
Unfortunately, I haven't been able to reproduce the binary to check the specific PCs because this is the dist binary, so it was compiled with the bootstrap compiler on 2017-08-13 and I don't know what that was (ping @bradfitz).
2017-04-21: It's not even at a call, though, again, maybe the traceback is messed up. I tried to track down the morebuf PC. I can't locally build a cgo-enabled runtime_test linux/mipsle binary (and there seems to be no hope of reaching the linux-mipsle gomote), so I made two assumptions. First, that syscall.Syscall was being called by readlen (the arguments suggest it is). Second, that 0x48ff68 was the return address of the call to syscall.Syscall in readlen: syscall.Syscall declares a zero-sized frame but exitsyscall takes a dummy argument, so that dummy argument should line up with syscall.Syscall's saved LR. I then used that address as a relocation delta against the non-cgo binary I could build locally. Unfortunately, that put me in the middle of entersyscallblock_handoff, which is called on the system stack, so it isn't the culprit.
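The relocation-delta trick can be sketched roughly as follows. Only 0x48ff68 comes from the crash report; `localRet` and `morebufPC` are hypothetical placeholder addresses, and the whole approach assumes the two binaries differ by a constant offset in the region of interest, which may well not hold between a cgo and a non-cgo build.

```go
package main

import "fmt"

// delta returns the offset between an address observed in the crashing
// binary and the address of the same instruction in a local build.
func delta(observed, local uint32) int64 {
	return int64(observed) - int64(local)
}

// toLocal maps a PC from the crashing binary into the local binary's
// address space, assuming a constant offset d between the two.
func toLocal(observed uint32, d int64) uint32 {
	return uint32(int64(observed) - d)
}

func main() {
	const observedRet = 0x48ff68 // assumed return address, from the crash report
	const localRet = 0x47ff68    // hypothetical: same call site in the local binary
	const morebufPC = 0x43a1c4   // hypothetical: PC we want to look up locally

	d := delta(observedRet, localRet)
	fmt.Printf("delta=%#x local=%#x\n", d, toLocal(morebufPC, d))
}
```

The translated address can then be looked up in the local binary's symbol table or disassembly.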
I wish I knew what to do; then I would do more for Go1.10. :)
It looks like it's happened a few more times recently:
These failures are all essentially the same. As before, they're all around system calls. And, as before, the traceback doesn't make much sense.
@cherrymui and I spent some time digging into the new failures. Some observations:
According to morebuf and the stack trace, the call to the stack-splitting function is at
The traceback claims
We also noticed that
(from earlier failure)
The "first arg" there, 0x48a528, looks pretty much like a PC. This suggests there is an off-by-one error somewhere, because one word below the first arg is the saved LR.
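To illustrate the suspected off-by-one, here is a toy model of the frame as a slice of words in increasing address order. The layout is a simplification assumed for this sketch, and the values mix ones quoted in this thread (0x48a528, 0xfa3, 5) with hypothetical ones for the remaining slots; it is not taken from the runtime's traceback code.

```go
package main

import "fmt"

// readArgs returns n words from the modeled stack starting at index base.
func readArgs(stack []uint32, base, n int) []uint32 {
	return stack[base : base+n]
}

func main() {
	// Modeled stack words, lowest address first.
	stack := []uint32{
		0x48a528, // saved LR (PC-like value quoted in the thread)
		0xfa3,    // trap number: SYS_READ
		5,        // fd
		0xc0de00, // hypothetical buffer pointer
		0x8,      // hypothetical length
	}
	const argBase = 1 // index of the real first argument in this model

	fmt.Printf("correct:    %#x\n", readArgs(stack, argBase, 3))
	fmt.Printf("off by one: %#x\n", readArgs(stack, argBase-1, 3))
}
```

Reading one word too low makes the saved LR show up as the first argument and shifts SYS_READ into the second slot, which matches the shape of the odd argument lists in the tracebacks.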
This also agrees with what we see in more recent failures:
It appears that
The traceback code thinks
This hasn't happened again since 2017-12-05, so no diagnostics, unfortunately. However, I think @cherrymui's theory is pretty sound.
This got me thinking that
Of course, that wouldn't help with whatever underlying issue is causing the signal here (assuming that's what's really going on), but it would help debug issues like these.