runtime: misc/cgo/test.TestSetgid panic with unexpected return pc for runtime.sigpanic on linux-arm64-packet since 2022-06-04 #53374

Open
bcmills opened this issue Jun 14, 2022 · 19 comments
Labels: NeedsFix, release-blocker

@bcmills (Member) commented Jun 14, 2022

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x2 addr=0xffff537fd898 pc=0xffff537fd898]

runtime stack:
runtime.throw({0xaaaabb5aaf4f?, 0x8?})
	/workdir/go/src/runtime/panic.go:1047 +0x40 fp=0xffff537fd860 sp=0xffff537fd830 pc=0xaaaabb4836d0
runtime: g 0: unexpected return pc for runtime.sigpanic called from 0xffff537fd898
stack: frame={sp:0xffff537fd860, fp:0xffff537fd890} stack=[0xffff52ffdd00,0xffff537fd900)
0x0000ffff537fd760:  0x0100aaaabb483c90  0x000000000000000a 
0x0000ffff537fd770:  0x000000000000001f  0x0000ffff537fd898 
0x0000ffff537fd780:  0x0000ffff537fd898  0x0000000000000002 
0x0000ffff537fd790:  0x0000aaaabb4836d0 <runtime.throw+0x0000000000000040>  0x0000ffff537fd830 
0x0000ffff537fd7a0:  0x0000aaaabb5a811e  0x0000ffff537fd7e8 
0x0000ffff537fd7b0:  0x0000aaaabb483968 <runtime.fatalthrow+0x0000000000000058>  0x00000040001021a0 
0x0000ffff537fd7c0:  0x0000aaaabb484f18 <runtime.printunlock+0x0000000000000048>  0x0000000000000001 
0x0000ffff537fd7d0:  0x0000ffff537fd830  0x0000aaaabb4836d0 <runtime.throw+0x0000000000000040> 
0x0000ffff537fd7e0:  0x00000040001021a0  0x0000ffff537fd828 
0x0000ffff537fd7f0:  0x0000aaaabb4836d0 <runtime.throw+0x0000000000000040>  0x0000ffff537fd808 
0x0000ffff537fd800:  0x00000040001021a0  0x0000aaaabb483980 <runtime.fatalthrow.func1+0x0000000000000000> 
0x0000ffff537fd810:  0x00000040001021a0  0x0000aaaabb4836d0 <runtime.throw+0x0000000000000040> 
0x0000ffff537fd820:  0x0000ffff537fd830  0x0000ffff537fd858 
0x0000ffff537fd830:  0x0000aaaabb49ac24 <runtime.sigpanic+0x00000000000001e4>  0x0000ffff537fd840 
0x0000ffff537fd840:  0x0000aaaabb4836e0 <runtime.throw.func1+0x0000000000000000>  0x0000aaaabb5aaf4f 
0x0000ffff537fd850:  0x000000000000002a  0x0000ffff537fd898 
0x0000ffff537fd860: <0x0000ffff537fd898  0x0000aaaabb5aaf4f 
0x0000ffff537fd870:  0x0000000000000008  0x0000000000000002 
0x0000ffff537fd880:  0x0000000000000000  0x0000ffff537fd898 
0x0000ffff537fd890: >0x0000ffff537fd898  0x0000aaaabb4805ac <runtime.minit+0x000000000000001c> 
0x0000ffff537fd8a0:  0x0000aaaabb48894c <runtime.mstart1+0x000000000000004c>  0x0000ffff812f7000 
0x0000ffff537fd8b0:  0x0000ffff58b270d0  0x0000ffff537fd8e8 
0x0000ffff537fd8c0:  0x0000aaaabb4888d8 <runtime.mstart0+0x0000000000000068>  0x0000ffff812f7000 
0x0000ffff537fd8d0:  0x0000000000000000  0x0000ffff812c7a98 
0x0000ffff537fd8e0:  0x00000040001021a0  0x0000ffff537fd918 
0x0000ffff537fd8f0:  0x0000aaaabb4b3a00 <runtime.mstart+0x0000000000000010>  0x000000770000006e 
runtime.sigpanic()
	/workdir/go/src/runtime/signal_unix.go:819 +0x1e4 fp=0xffff537fd890 sp=0xffff537fd860 pc=0xaaaabb49ac24

goroutine 1 [chan receive]:
runtime.gopark(0x4000091a68?, 0xaaaabb4527c4?, 0x78?, 0x1a?, 0xaaaabb45a20c?)
	/workdir/go/src/runtime/proc.go:363 +0xe4 fp=0x40000919f0 sp=0x40000919d0 pc=0xaaaabb4860a4
runtime.chanrecv(0x4000128070, 0x4000091aff, 0x1)
	/workdir/go/src/runtime/chan.go:583 +0x454 fp=0x4000091a80 sp=0x40000919f0 pc=0xaaaabb453b14
runtime.chanrecv1(0x4000112060?, 0x1?)
	/workdir/go/src/runtime/chan.go:442 +0x14 fp=0x4000091ab0 sp=0x4000091a80 pc=0xaaaabb4536b4
testing.(*T).Run(0x40001024e0, {0xaaaabb5a1d6a?, 0x92716c0d48a5e?}, 0xaaaabb62cd10)
	/workdir/go/src/testing/testing.go:1494 +0x314 fp=0x4000091b50 sp=0x4000091ab0 pc=0xaaaabb516ed4
testing.runTests.func1(0x0?)
	/workdir/go/src/testing/testing.go:1846 +0x70 fp=0x4000091ba0 sp=0x4000091b50 pc=0xaaaabb518b40
testing.tRunner(0x40001024e0, 0x4000091cb8)
	/workdir/go/src/testing/testing.go:1446 +0x10c fp=0x4000091bf0 sp=0x4000091ba0 pc=0xaaaabb51623c
testing.runTests(0x400011e000?, {0xaaaabb71b600, 0x60, 0x60}, {0x4000091d28?, 0xa0aaaabb4635e8?, 0xaaaabb735c00?})
	/workdir/go/src/testing/testing.go:1844 +0x3f0 fp=0x4000091ce0 sp=0x4000091bf0 pc=0xaaaabb518a00
testing.(*M).Run(0x400011e000)
	/workdir/go/src/testing/testing.go:1726 +0x4f0 fp=0x4000091ee0 sp=0x4000091ce0 pc=0xaaaabb5177a0
main.main()
	_testmain.go:243 +0x1d0 fp=0x4000091f70 sp=0x4000091ee0 pc=0xaaaabb59bc90
runtime.main()
	/workdir/go/src/runtime/proc.go:250 +0x24c fp=0x4000091fd0 sp=0x4000091f70 pc=0xaaaabb485cdc
runtime.goexit()
	/workdir/go/src/runtime/asm_arm64.s:1155 +0x4 fp=0x4000091fd0 sp=0x4000091fd0 pc=0xaaaabb4b60e4

goroutine 2 [force gc (idle)]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/workdir/go/src/runtime/proc.go:363 +0xe4 fp=0x4000042fa0 sp=0x4000042f80 pc=0xaaaabb4860a4
runtime.goparkunlock(...)
	/workdir/go/src/runtime/proc.go:369
runtime.forcegchelper()
	/workdir/go/src/runtime/proc.go:302 +0xac fp=0x4000042fd0 sp=0x4000042fa0 pc=0xaaaabb485f3c
runtime.goexit()
	/workdir/go/src/runtime/asm_arm64.s:1155 +0x4 fp=0x4000042fd0 sp=0x4000042fd0 pc=0xaaaabb4b60e4
created by runtime.init.6
	/workdir/go/src/runtime/proc.go:290 +0x24

goroutine 3 [GC sweep wait]:
runtime.gopark(0x0?, 0x0?, 0x0?, 0x0?, 0x0?)
	/workdir/go/src/runtime/proc.go:363 +0xe4 fp=0x4000043770 sp=0x4000043750 pc=0xaaaabb4860a4
runtime.goparkunlock(...)
	/workdir/go/src/runtime/proc.go:369
runtime.bgsweep(0x0?)
	/workdir/go/src/runtime/mgcsweep.go:278 +0xa4 fp=0x40000437b0 sp=0x4000043770 pc=0xaaaabb4714e4
runtime.gcenable.func1()
	/workdir/go/src/runtime/mgc.go:178 +0x28 fp=0x40000437d0 sp=0x40000437b0 pc=0xaaaabb465af8
runtime.goexit()
	/workdir/go/src/runtime/asm_arm64.s:1155 +0x4 fp=0x40000437d0 sp=0x40000437d0 pc=0xaaaabb4b60e4
created by runtime.gcenable
	/workdir/go/src/runtime/mgc.go:178 +0x74

goroutine 4 [GC scavenge wait]:
runtime.gopark(0x400006c000?, 0xaaaabb5dc9f0?, 0x1?, 0x0?, 0x0?)
	/workdir/go/src/runtime/proc.go:363 +0xe4 fp=0x4000043f50 sp=0x4000043f30 pc=0xaaaabb4860a4
runtime.goparkunlock(...)
	/workdir/go/src/runtime/proc.go:369
runtime.(*scavengerState).park(0xaaaabb735c80)
	/workdir/go/src/runtime/mgcscavenge.go:389 +0x5c fp=0x4000043f80 sp=0x4000043f50 pc=0xaaaabb46f4cc
runtime.bgscavenge(0x0?)
	/workdir/go/src/runtime/mgcscavenge.go:617 +0x44 fp=0x4000043fb0 sp=0x4000043f80 pc=0xaaaabb46fa44
runtime.gcenable.func2()
	/workdir/go/src/runtime/mgc.go:179 +0x28 fp=0x4000043fd0 sp=0x4000043fb0 pc=0xaaaabb465a98
runtime.goexit()
	/workdir/go/src/runtime/asm_arm64.s:1155 +0x4 fp=0x4000043fd0 sp=0x4000043fd0 pc=0xaaaabb4b60e4
created by runtime.gcenable
	/workdir/go/src/runtime/mgc.go:179 +0xb8

goroutine 5 [finalizer wait]:
runtime.gopark(0xaaaabb736440?, 0x40000036c0?, 0x0?, 0x0?, 0x1?)
	/workdir/go/src/runtime/proc.go:363 +0xe4 fp=0x4000042580 sp=0x4000042560 pc=0xaaaabb4860a4
runtime.goparkunlock(...)
	/workdir/go/src/runtime/proc.go:369
runtime.runfinq()
	/workdir/go/src/runtime/mfinal.go:180 +0x120 fp=0x40000427d0 sp=0x4000042580 pc=0xaaaabb464d20
runtime.goexit()
	/workdir/go/src/runtime/asm_arm64.s:1155 +0x4 fp=0x40000427d0 sp=0x40000427d0 pc=0xaaaabb4b60e4
created by runtime.createfing
	/workdir/go/src/runtime/mfinal.go:157 +0x84

goroutine 18 [runnable]:
runtime.gopark(0x400003e6c8?, 0x2?, 0xb8?, 0xe5?, 0x400003e6bc?)
	/workdir/go/src/runtime/proc.go:363 +0xe4 fp=0x400003e560 sp=0x400003e540 pc=0xaaaabb4860a4
runtime.selectgo(0x400003e6c8, 0x400003e6b8, 0xaaaabb6586be?, 0x0, 0x51b?, 0x1)
	/workdir/go/src/runtime/select.go:328 +0x684 fp=0x400003e680 sp=0x400003e560 pc=0xaaaabb496d24
misc/cgo/test.runTestSetgid()
	/workdir/go/misc/cgo/test/setgid_linux.go:30 +0xb4 fp=0x400003e6f0 sp=0x400003e680 pc=0xaaaabb592d34
misc/cgo/test.testSetgid(0x4000102680)
	/workdir/go/misc/cgo/test/setgid_linux.go:40 +0x20 fp=0x400003e740 sp=0x400003e6f0 pc=0xaaaabb592de0
misc/cgo/test.TestSetgid(0x0?)
	/workdir/go/misc/cgo/test/cgo_linux_test.go:16 +0x1c fp=0x400003e760 sp=0x400003e740 pc=0xaaaabb55e31c
testing.tRunner(0x4000102680, 0xaaaabb62cd10)
	/workdir/go/src/testing/testing.go:1446 +0x10c fp=0x400003e7b0 sp=0x400003e760 pc=0xaaaabb51623c
testing.(*T).Run.func1()
	/workdir/go/src/testing/testing.go:1493 +0x2c fp=0x400003e7d0 sp=0x400003e7b0 pc=0xaaaabb516f7c
runtime.goexit()
	/workdir/go/src/runtime/asm_arm64.s:1155 +0x4 fp=0x400003e7d0 sp=0x400003e7d0 pc=0xaaaabb4b60e4
created by testing.(*T).Run
	/workdir/go/src/testing/testing.go:1493 +0x300
exit status 2
FAIL	misc/cgo/test	0.029s
2022/06/08 16:27:39 Failed: exit status 1

greplogs -l -e '(?ms)unexpected return pc for runtime\.sigpanic.*exit status 2\nFAIL\s+misc/cgo/test'
2022-06-08T15:47:58-4afb0b9/linux-arm64-packet
2022-06-04T16:11:54-fc66cae/linux-arm64-packet

@bcmills (Member, Author) commented Jun 14, 2022

(attn @golang/runtime)

Marking as release-blocker for Go 1.19 because this appears to be a regression on linux/arm64, which is a first-class port.

@bcmills added this to the Go1.19 milestone Jun 14, 2022
@bcmills added the release-blocker and NeedsInvestigation labels Jun 14, 2022
@prattmic (Member) commented Jun 14, 2022

It looks like what happened here is that the stack frame was somehow messed up and we ended up returning to a stack address instead of a valid PC. This caused a SIGSEGV, and while trying to print a traceback we encountered the same messed-up frame and reported the "unexpected return pc".

I don't think it was actually sigpanic that made the bad return. I think that is just the frame we injected after the fault occurred.

@aclements (Member) commented Jun 15, 2022

I agree with @prattmic's assessment.

Picking apart 2022-06-04T16:11:54-fc66cae/linux-arm64-packet in detail:

0x0000ffff537fd830: 0x0000aaaabb49ac24 <runtime.sigpanic+0x00000000000001e4> Saved LR from sigpanic => throw("unexpected signal during runtime execution")
0x0000ffff537fd838: 0x0000ffff537fd840 Might be a saved BP
0x0000ffff537fd840: 0x0000aaaabb4836e0 <runtime.throw.func1+0x0000000000000000> Probably an argument to throw => systemstack
0x0000ffff537fd848: 0x0000aaaabb5aaf4f
0x0000ffff537fd850: 0x000000000000002a
0x0000ffff537fd858: 0x0000ffff537fd898 Looks like the saved BP from minit
0x0000ffff537fd860: <0x0000ffff537fd898 Looks like the saved BP from minit
0x0000ffff537fd868: 0x0000aaaabb5aaf4f
0x0000ffff537fd870: 0x0000000000000008
0x0000ffff537fd878: 0x0000000000000002
0x0000ffff537fd880: 0x0000000000000000
0x0000ffff537fd888: 0x0000ffff537fd898 Looks like the saved BP from minit
0x0000ffff537fd890: >0x0000ffff537fd898 Looks like the saved BP from minit

minit frame:
0x0000ffff537fd898: 0x0000aaaabb4805ac <runtime.minit+0x000000000000001c> Should be saved BP from minit, but instead saved LR from minit => minitSignals
0x0000ffff537fd8a0: 0x0000aaaabb48894c <runtime.mstart1+0x000000000000004c> Saved LR from mstart1 => minit; minit leaves RSP pointing here
0x0000ffff537fd8a8: 0x0000ffff812f7000
0x0000ffff537fd8b0: 0x0000ffff58b270d0

mstart1 frame:
0x0000ffff537fd8b8: 0x0000ffff537fd8e8 Saved BP from mstart1
0x0000ffff537fd8c0: 0x0000aaaabb4888d8 <runtime.mstart0+0x0000000000000068> Saved LR from mstart0 => mstart1
0x0000ffff537fd8c8: 0x0000ffff812f7000
0x0000ffff537fd8d0: 0x0000000000000000
0x0000ffff537fd8d8: 0x0000ffff812c7a98
0x0000ffff537fd8e0: 0x00000040001021a0

mstart0 frame:
0x0000ffff537fd8e8: 0x0000ffff537fd918 Saved BP from mstart0
0x0000ffff537fd8f0: 0x0000aaaabb4b3a00 <runtime.mstart+0x0000000000000010> Saved LR from mstart => mstart0
0x0000ffff537fd8f8: 0x000000770000006e

So the mstart0 => mstart1 => minit chain makes sense, but I lose the thread. It looks like we may have called minitSignals, but things aren't quite where I would expect for that.

@aclements (Member) commented Jun 15, 2022

Based on the fact that this is happening in TestSetgid, I wonder if a SIGSETXID came in as a new thread was starting up and we received it the moment minitSignalMask set the signal mask. I haven't yet pieced together whether that could actually cause what we're seeing.

@cherrymui (Member) commented Jun 15, 2022

I was thinking about that, too, but I also couldn't think of how it could happen. We set up the signal stack before setting the signal mask, so a signal shouldn't clobber the g0 stack.
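
For context, the ordering being referred to is roughly the following, a paraphrased sketch of the runtime's minitSignals (simplified rather than quoted from src/runtime/signal_unix.go):

// Paraphrased sketch of the runtime's per-thread signal setup; not the
// verbatim source.
func minitSignals() {
	// Install this thread's alternate signal stack first (sigaltstack),
	// pointing it at the gsignal stack.
	minitSignalStack()

	// Only then set the thread's signal mask (sigprocmask) so ordinary
	// signals can be delivered. SIGSETXID is the exception: glibc never
	// lets it be blocked.
	minitSignalMask()
}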

@aclements (Member) commented Jun 15, 2022

It's extra weird because this is a cgo binary, so we should just be using libc's setxid signal handler for all of this. I wonder if our integration with libc is messed up during early thread init?

@cherrymui (Member) commented Jun 15, 2022

Good point. We start a thread with signals blocked https://cs.opensource.google/go/go/+/master:src/runtime/cgo/gcc_linux_arm64.c;l=28 .
But apparently glibc doesn't really block SIGSETXID? https://sourceware.org/git/?p=glibc.git;a=blob;f=nptl/pthread_sigmask.c;h=20f811cc6b844969eb0e5436659c0ef0655301ea;hb=HEAD#l29

So we can get a signal early on before the signal stack is set?...

@ianlancetaylor (Contributor) commented Jun 15, 2022

@cherrymui You're right. That seems entirely possible. Argh.

@aclements (Member) commented Jun 15, 2022

Oh no! That does seem plausible. I'm surprised the two failure stacks are so identical, but maybe that just reflects when we made some syscall and the kernel checked the pending signals.

We're running on the g0 stack at this point, so it should be okay to take the signal, but maybe something bad happens in our signal path before or after we hand off to libc?

@gopherbot commented Jun 15, 2022

Change https://go.dev/cl/412474 mentions this issue: cmd/internal/obj/arm64: save LR after decrementing SP

@aclements (Member) commented Jun 15, 2022

Cherry, could we test your hypothesis about the signal frame clobbering the stack by doing a self-kill in, say, minit?

@cherrymui (Member) commented Jun 15, 2022

The CL above makes the prologue not rely on the LR saved below the SP (even temporarily). Maybe that will fix it.

Or, maybe we could use the actual syscall to block all signals for thread creation? We would restore the old signal mask anyway. That may race with a SIGSETXID signal being delivered to the parent thread, though, so it might miss the signal and deadlock?

@cherrymui (Member) commented Jun 15, 2022

@aclements Maybe. If that is really what the CL above is trying to fix, the signal has to land between the instruction storing the LR and the instruction decrementing the SP, so perhaps the self-kill has to be written ... in assembly (or have the assembler insert it)?

@aclements (Member) commented Jun 15, 2022

Oh good point. Maybe a stress test where we constantly create new threads and bombard the process with SIGSETXID?

@cherrymui (Member) commented Jun 15, 2022

package main

/*
#include <sys/types.h>
#include <unistd.h>
*/
import "C"
import "runtime"

var c = make(chan int, 1000)

func main() {
	for i := 0; i < 1000; i++ {
		go F(i)
	}
	for i := 0; i < 1000; i++ {
		<-c
	}
}

func F(i int) {
	runtime.LockOSThread() // so every goroutine uses a thread
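	// glibc implements setgid by broadcasting a SIGSETXID signal to every
	// thread in the process, so each call below bombards all running
	// threads with that signal.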
	C.setgid(C.uint(i))
	c <- 1
}

Not sure this is the best way to stress this, but running it on the linux-arm64-packet builder fails with a segfault at a not-too-small rate. It's a plain segfault with no extra output, though, not the unwinder error above.

5s: 12 runs so far, 0 failures
10s: 33 runs so far, 0 failures

/tmp/go-stress-20220615T200050-1378220594


ERROR: signal: segmentation fault


/tmp/go-stress-20220615T200050-1649915091


ERROR: signal: segmentation fault

15s: 50 runs so far, 2 failures (4.00%)

With the CL above, no failures in 1000 runs.

I'll see if I can make it more likely to fail.

@cherrymui (Member) commented Jun 15, 2022

If I add this to the runtime:

diff --git a/runtime/os_linux.go b/tmp/os_linux.go
index 25aea6522d..4328d4371d 100644
--- a/runtime/os_linux.go
+++ b/tmp/os_linux.go
@@ -390,6 +390,10 @@ func gettid() uint32
 // Called to initialize a new m (including the bootstrap m).
 // Called on the new thread, cannot allocate memory.
 func minit() {
+       for i := 0; i < 1000; i++ {
+               dummy()
+       }
+
        minitSignals()
 
        // Cgo-created threads and the bootstrap m are missing a
@@ -398,6 +402,12 @@ func minit() {
        getg().m.procid = uint64(gettid())
 }
 
+//go:noinline
+func dummy() { dummy2() }
+
+//go:noinline
+func dummy2() {}
+
 // Called from dropm to undo the effect of an minit.
 //
 //go:nosplit

The program above fails 100% of the time. I think this is consistent with the theory above.

@prattmic added the NeedsFix label and removed the NeedsInvestigation label Jun 16, 2022
@cherrymui (Member) commented Jun 16, 2022

I found that this doesn't fail with Go 1.18, even with the runtime hack above. I think this is because https://go-review.googlesource.com/c/go/+/379075 changed the prologue. Previously, for small frames it was a single MOVD.W instruction that saves the LR and decrements the SP (followed by an instruction that saves the frame pointer, which doesn't really matter here). Now it is two separate instructions. And the functions from mstart up to setting the signal stack probably all have small frames.

So another possibility is to undo CL 379075 for small frames. It is two instructions either way (previously a MOVD.W storing the LR, then a MOVD storing the FP; now a SUB decrementing the SP, then an STP storing the LR and FP). And make sure all the functions we call up to sigaltstack have small frames. (Then maybe we don't need to change the large-frame cases.)

@aclements (Member) commented Jun 22, 2022

We chatted about this. The plan is to return to the old prologue for small frames as a stop-gap for 1.19, confirm by hand that this generates signal-safe prologues for the handful of functions that run before we have a signal stack, and look into enhancing the traceback metadata for 1.20 to capture ranges where the frame is non-empty but the LR should be retrieved from the LR register instead of the stack.

@gopherbot commented Jun 22, 2022

Change https://go.dev/cl/413428 mentions this issue: cmd/internal/obj/mips,s390x: save LR after decrementing SP
