New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: misc/cgo/test.TestSetgid panic with unexpected return pc for runtime.sigpanic
on linux-arm64-packet since 2022-06-04
#53374
Comments
(attn @golang/runtime) Marking as release-blocker for Go 1.19 because this appears to be a regression on |
It looks like what happened here is that the stack frame was somehow messed up and we ended up returning to a stack address instead of a valid PC. This caused a SIGSEGV, and while trying to print a traceback we encounted the same messed up frame and reported the "unexpected return pc". I don't think it was actually sigpanic that made the bad return. I think that is just the frame we injected after the fault occurred. |
I agree with @prattmic 's assessment. Picking apart 2022-06-04T16:11:54-fc66cae/linux-arm64-packet in detail: 0x0000ffff537fd830: 0x0000aaaabb49ac24 <runtime.sigpanic+0x00000000000001e4> Saved LR from sigpanic => throw("unexpected signal during runtime execution") minit frame: mstart1 frame: mstart0 frame: So the mstart0 => mstart1 => minit chain makes sense, but I lose the thread. It looks like we may have called minitSignals, but things aren't quite where I would expect for that. |
Based on the fact that this is happening in TestSetgid, I wonder if the a SIGSETXID came in as a new thread was starting up and we received it the moment |
I was thinking on that, too, but I also couldn't think of how it could happen. We set up the signal stack before setting the signal mask. So a signal shouldn't clobber the g0 stack. |
It's extra weird because this is a cgo binary, so we should just be using libc's setxid signal handler for all of this. I wonder if the integration with libc on this is messed up on early thread init? |
Good point. We start a thread with signals blocked https://cs.opensource.google/go/go/+/master:src/runtime/cgo/gcc_linux_arm64.c;l=28 . So we can get a signal early on before the signal stack is set?... |
@cherrymui You're right. That seems entirely possible. Argh. |
Oh no! That does seem plausible. I'm surprised the two failure stacks are so identical, but maybe that just reflects when we made some syscall and the kernel checked the pending signals. We're running on the g0 stack at this point, so it should be okay to take the signal, but maybe something bad happens in our signal path before or after we hand off to libc? |
Change https://go.dev/cl/412474 mentions this issue: |
Cherry, could we test your hypothesis about the signal frame clobbering the stack by doing a self- |
The CL above makes it not rely on the LR saved below the SP (even temporarily). Maybe that will fix it. Or, maybe we can call the actual syscall to block all the signals for thread creation? We'll restore the old signal mask anyway? That may race with a SIGSETXID signal delivering to the parent thread, though, so it may miss the signal and deadlock? |
@aclements Maybe. If it is really what the CL above is trying to fix, the signal has to land between the instruction storing the LR and the instruction decrementing the SP, so perhaps it has to be written ... in assembly (or have the assembler insert it)? |
Oh good point. Maybe a stress test where we constantly create new threads and bombard the process with SIGSETXID? |
Not sure this is the best way to stress this, but running this on thelinux-arm64-packet builder it has a not-too-small rate to fail with seg fault. Plain seg fault with no extra output, not the unwinder error like above, though.
With the CL above, no failure in 1000 run. I'll see if I can make it more likely to fail. |
If I add this to the runtime
The program above fails 100%. I think this is consistent with the theory above. |
I found that this doesn't fail with Go 1.18, even with the runtime hack above. I think this is due to https://go-review.googlesource.com/c/go/+/379075 changes the prologue. Previously, for small frames it is a single So another possibility is to undo CL 379075 for small frames. It is two instructions either way. (Previously |
We chatted about this and the plan is to return to the old prologue for small frames as a stop-gap for 1.19, confirm by hand that this generates signal-safe prologues for the handful of functions that run before we have a signal stack, and look into enhancing the traceback metadata for 1.20 to capture ranges where the frame is non-empty but the LR should be retrieved from the LR register instead of the stack. |
Change https://go.dev/cl/413428 mentions this issue: |
greplogs -l -e '(?ms)unexpected return pc for runtime\.sigpanic.*exit status 2\nFAIL\s+misc/cgo/test'
2022-06-08T15:47:58-4afb0b9/linux-arm64-packet
2022-06-04T16:11:54-fc66cae/linux-arm64-packet
The text was updated successfully, but these errors were encountered: