runtime: unexpected return pc crash on linux-amd64-alpine builder #54306

Closed
rsc opened this issue Aug 5, 2022 · 17 comments
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. FrozenDueToAge NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.

@rsc
Contributor

rsc commented Aug 5, 2022

The revived linux-amd64-alpine builder has flaked twice in its short new lifetime with 'unexpected return pc' crashes during the cgo tests.

Here is a repro case using a gomote (note that if you ssh in, you have to set up your environment manually, and in particular you have to put /workdir/go/bin at the front of PATH and have to set GOROOT_BOOTSTRAP=/workdir/go1.4). Not sure why the environment is so messed up on Alpine. gomote run does not have these problems, only gomote ssh.

VM=$(gomote create linux-amd64-alpine)
gomote push $VM
gomote run $VM go/src/make.bash
gomote put -mode 0777 $VM - try.sh <<'EOF'
#!/bin/bash
cd /workdir/go/misc/cgo/test
for i in $(seq 100); do 
    date
    if ! /workdir/go/bin/go test >log 2>&1; then
        cat log
    fi
done
EOF
gomote run $VM try.sh

You may need to repeat the try.sh a few times depending on how flaky the machine is feeling but most runs get at least one failure.

Here are some failures from that script:

runtime: g 3: unexpected return pc for runtime.gcenable.func1 called from 0x0
stack: frame={sp:0xc0000557c8, fp:0xc0000557e0} stack=[0xc000055000,0xc000055800)
0x000000c0000556c8:  0x000000c000055750  0x000000000040d21d <runtime.chansend+0x000000000000055d> 
0x000000c0000556d8:  0x0000000000581220  0x000000c00007e060 
0x000000c0000556e8:  0x00000000005e9f78  0x0000000000000000 
0x000000c0000556f8:  0x0000000000000000  0x0000000000000000 
0x000000c000055708:  0x0000000000000000  0x0000000000000000 
0x000000c000055718:  0x000000c00007e058  0x0000000000000000 
0x000000c000055728:  0x0000000000000000  0x0000000000000000 
0x000000c000055738:  0x0000000000000000  0x0000000000000000 
0x000000c000055748:  0x0000000000000000  0x000000c000055780 
0x000000c000055758:  0x000000000040cc9d <runtime.chansend1+0x000000000000001d>  0x000000c00007e000 
0x000000c000055768:  0x0000000000440bb6 <runtime.gopark+0x00000000000000d6>  0x0000000000000001 
0x000000c000055778:  0x0000000000000000  0x000000c0000557b8 
0x000000c000055788:  0x000000000042ba2e <runtime.bgsweep+0x000000000000008e>  0x0000000000000000 
0x000000c000055798:  0x0000000000000000  0x0000000000000000 
0x000000c0000557a8:  0x0000000000000000  0x000000c00007e000 
0x000000c0000557b8:  0x000000c0000557d0  0x0000000000420706 <runtime.gcenable.func1+0x0000000000000026> 
0x000000c0000557c8: <0x00007f8a890934b6  0x00007f8a61816b64 
0x000000c0000557d8: !0x0000000000000000 >0x0000000000000000 
0x000000c0000557e8:  0x0000000000000000  0x00007f8a890d3600 
0x000000c0000557f8:  0x00007f8a89092acf 
fatal error: unknown caller pc
runtime: g 19: unexpected return pc for runtime.gcenable.func2 called from 0x0
stack: frame={sp:0xc000050fc8, fp:0xc000050fe0} stack=[0xc000050800,0xc000051000)
0x000000c000050ec8:  0x000000000000000e  0x000000c0000061a0 
0x000000c000050ed8:  0x000000c000050f60  0x000000000040d265 <runtime.chansend+0x00000000000005a5> 
0x000000c000050ee8:  0x0000000000000050  0x000000c00009c000 
0x000000c000050ef8:  0x0000000000000000  0x0000010000000000 
0x000000c000050f08:  0x0000000000000003  0x0000000000000030 
0x000000c000050f18:  0x0000000000000000  0x0000000000000050 
0x000000c000050f28:  0x000000c000096058  0x000000c00007e000 
0x000000c000050f38:  0x0000000000000000  0x0000000000000000 
0x000000c000050f48:  0x0000000000440bb6 <runtime.gopark+0x00000000000000d6>  0x000000000040d320 <runtime.chansend.func1+0x0000000000000000> 
0x000000c000050f58:  0x000000c000096000  0x000000c000050f90 
0x000000c000050f68:  0x0000000000429ad3 <runtime.(*scavengerState).park+0x0000000000000053>  0x000000c000096000 
0x000000c000050f78:  0x00000000005e9f78  0x0000000000000001 
0x000000c000050f88:  0x0000000000000000  0x000000c000050fb8 
0x000000c000050f98:  0x000000000042a0a5 <runtime.bgscavenge+0x0000000000000045>  0x00000000006f9960 
0x000000c000050fa8:  0x0000000000000000  0x000000c000096000 
0x000000c000050fb8:  0x000000c000050fd0  0x00000000004206a6 <runtime.gcenable.func2+0x0000000000000026> 
0x000000c000050fc8: <0x00007f47256144b6  0x00007f46fdea3b64 
0x000000c000050fd8: !0x0000000000000000 >0x0000000000000000 
0x000000c000050fe8:  0x0000000000000000  0x00007f4725654600 
0x000000c000050ff8:  0x00007f4725613acf 
fatal error: unknown caller pc

This one did not happen during garbage collection:

runtime: g 20: unexpected return pc for testing.tRunner called from 0x7feeabb0dacf
stack: frame={sp:0xc000051770, fp:0xc0000517c0} stack=[0xc000051000,0xc000051800)
0x000000c000051670:  0x000000012a05f200  0x000000c0000880a0 
0x000000c000051680:  0x000000c000094180  0x000000c0000516f8 
0x000000c000051690:  0x000000c000102b80  0x000000c000102b60 
0x000000c0000516a0:  0x0000000000000000  0x00000000005890c0 
0x000000c0000516b0:  0x00000000006d7d50  0x0000000000000000 
0x000000c0000516c0:  0x0000000000000000  0x0000000000000000 
0x000000c0000516d0:  0x0000000000000000  0x000000c000051730 
0x000000c0000516e0:  0x0000000000454a36 <runtime.sigpanic+0x00000000000002f6>  0x00000000005890c0 
0x000000c0000516f0:  0x00000000006d7d50  0x000000c000051748 
0x000000c000051700:  0x0000000000561ceb <misc/cgo/test.testSetgid+0x00000000000000ab>  0x000000c0001121e0 
0x000000c000051710:  0x000000c000102b60  0x0000000000000001 
0x000000c000051720:  0x00000000006ea660  0x00000000005eb418 
0x000000c000051730:  0x000000c000051760  0x0000000000478bfe <sync.(*RWMutex).Lock+0x000000000000001e> 
0x000000c000051740:  0x0000000000000000  0x000000c000051760 
0x000000c000051750:  0x0000000000526bd9 <misc/cgo/test.TestSetgid+0x0000000000000019>  0x000000c0001029c0 
0x000000c000051760:  0x000000c0000517b0  0x00000000004d6d15 <testing.tRunner+0x0000000000000115> 
0x000000c000051770: <0x0000000000000000  0x0300000000000000 
0x000000c000051780:  0x00000000004d6d80 <testing.tRunner.func2+0x0000000000000000>  0x00007feeabb0e4b6 
0x000000c000051790:  0x00007feeabb4ed8c  0x0000000000000000 
0x000000c0000517a0:  0x0000000000000000  0x0000000000000000 
0x000000c0000517b0:  0x00007feeabb4e600 !0x00007feeabb0dacf 
0x000000c0000517c0: >0x0000000000000000  0x00000000ffffffff 
0x000000c0000517d0:  0x0000000000000000  0x00000000004710a1 <runtime.goexit+0x0000000000000001> 
0x000000c0000517e0:  0x0000000000000000  0x0000000000000000 
0x000000c0000517f0:  0x0000000000000000  0x00007feeabb0e5d2 
fatal error: unknown caller pc

runtime stack:
runtime.throw({0x5ae5a1?, 0x6ea660?})
	/workdir/go/src/runtime/panic.go:1047 +0x5d fp=0x7fee843e3648 sp=0x7fee843e3618 pc=0x43de7d
runtime.gentraceback(0x100000000467aba?, 0xc000100000?, 0xc000102b60?, 0x7fee843e3a18?, 0x0, 0x0, 0x7fffffff, 0x7fee843e3a08, 0x0?, 0x0)
	/workdir/go/src/runtime/traceback.go:258 +0x1cf7 fp=0x7fee843e39b8 sp=0x7fee843e3648 pc=0x4658b7
runtime.addOneOpenDeferFrame.func1()
	/workdir/go/src/runtime/panic.go:645 +0x6b fp=0x7fee843e3a30 sp=0x7fee843e39b8 pc=0x43d00b
runtime.systemstack()
	/workdir/go/src/runtime/asm_amd64.s:492 +0x49 fp=0x7fee843e3a38 sp=0x7fee843e3a30 pc=0x46eee9

goroutine 20 [running]:
runtime.systemstack_switch()
	/workdir/go/src/runtime/asm_amd64.s:459 fp=0xc0000515e8 sp=0xc0000515e0 pc=0x46ee80
runtime.addOneOpenDeferFrame(0xc0000221e0?, 0xc000094180?, 0xc000112180?)
	/workdir/go/src/runtime/panic.go:644 +0x69 fp=0xc000051628 sp=0xc0000515e8 pc=0x43cf49
panic({0x5890c0, 0x6d7d50})
	/workdir/go/src/runtime/panic.go:844 +0x112 fp=0xc0000516e8 sp=0xc000051628 pc=0x43d792
runtime.panicmem(...)
	/workdir/go/src/runtime/panic.go:260
runtime.sigpanic()
	/workdir/go/src/runtime/signal_unix.go:837 +0x2f6 fp=0xc000051740 sp=0xc0000516e8 pc=0x454a36
sync.(*RWMutex).Lock(0x0?)
	/workdir/go/src/sync/rwmutex.go:147 +0x1e fp=0xc000051770 sp=0xc000051740 pc=0x478bfe

Here are the two build dashboard failures:

https://build.golang.org/log/658036e08c7a1d218c33808fdd1d6612b40502d8

runtime: g 2: unexpected return pc for runtime.forcegchelper called from 0x0
stack: frame={sp:0xc000056fb0, fp:0xc000056fe0} stack=[0xc000056800,0xc000057000)
0x000000c000056eb0:  0x0000000000000000  0x0000000000000000 
0x000000c000056ec0:  0x0000000000000000  0x0000000000000000 
0x000000c000056ed0:  0x0000000000000000  0x0000000000000000 
0x000000c000056ee0:  0x0000000000000000  0x0000000000000000 
0x000000c000056ef0:  0x0000000000000000  0x0000000000000000 
0x000000c000056f00:  0x0000000000000000  0x0000000000000000 
0x000000c000056f10:  0x0000000000000000  0x0000000000000000 
0x000000c000056f20:  0x0000000000000000  0x0000000000000000 
0x000000c000056f30:  0x0000000000000000  0x0000000000000000 
0x000000c000056f40:  0x0000000000000000  0x0000000000000000 
0x000000c000056f50:  0x0000000000000000  0x0000000000000000 
0x000000c000056f60:  0x0000000000000000  0x0000000000000000 
0x000000c000056f70:  0x0000000000000000  0x0000000000000000 
0x000000c000056f80:  0x0000000000000000  0x00005637530dbdb6 <runtime.gopark+0x00000000000000d6> 
0x000000c000056f90:  0x0000000000000000  0x0000000000000000 
0x000000c000056fa0:  0x000000c000056fd0  0x00005637530dbc4d <runtime.forcegchelper+0x00000000000000ad> 
0x000000c000056fb0: <0x0000000000000000  0x0000000000000000 
0x000000c000056fc0:  0x0000000000000000  0x00007efee325e4b6 
0x000000c000056fd0:  0x00007efebba04b64 !0x0000000000000000 
0x000000c000056fe0: >0x0000000000000000  0x0000000000000000 
0x000000c000056ff0:  0x00007efee329e600  0x00007efee325dacf 
fatal error: unknown caller pc

and

https://build.golang.org/log/94cf14d78b116487dc76a921baf6ba76480a4c7a

runtime: g 5: unexpected return pc for runtime.sigpanic called from 0x7f52c162dd8c
stack: frame={sp:0xc000058700, fp:0xc000058758} stack=[0xc000058000,0xc000058800)
0x000000c000058600:  0x0000564cf403107b <runtime.write+0x000000000000003b>  0x0000000000000002 
0x000000c000058610:  0x000000c000058648  0x0000564cf40109ce <runtime.recordForPanic+0x000000000000004e> 
0x000000c000058620:  0x0000564cf403107b <runtime.write+0x000000000000003b>  0x0000000000000002 
0x000000c000058630:  0x0000564cf4144017  0x0000000000000001 
0x000000c000058640:  0x0000000000000001  0x000000c000058680 
0x000000c000058650:  0x0000564cf4010cd2 <runtime.gwrite+0x00000000000000f2>  0x0000564cf4144017 
0x000000c000058660:  0x0000000000000001  0x0000000000000001 
0x000000c000058670:  0x000000c0000586e2  0x000000000000000e 
0x000000c000058680:  0x0000564cf4040210 <runtime.systemstack+0x0000000000000030>  0x0000564cf400f3cc <runtime.fatalthrow+0x000000000000006c> 
0x000000c000058690:  0x000000c0000586a0  0x000000c000007ba0 
0x000000c0000586a0:  0x0000564cf400f400 <runtime.fatalthrow.func1+0x0000000000000000>  0x000000c000007ba0 
0x000000c0000586b0:  0x0000564cf400f07f <runtime.throw+0x000000000000005f>  0x000000c0000586d0 
0x000000c0000586c0:  0x000000c0000586f0  0x0000564cf400f07f <runtime.throw+0x000000000000005f> 
0x000000c0000586d0:  0x000000c0000586d8  0x0000564cf400f0a0 <runtime.throw.func1+0x0000000000000000> 
0x000000c0000586e0:  0x0000564cf414445e  0x0000000000000005 
0x000000c0000586f0:  0x000000c000058748  0x0000564cf4025ca5 <runtime.sigpanic+0x00000000000002c5> 
0x000000c000058700: <0x0000564cf414445e  0x000000c0000161e0 
0x000000c000058710:  0x000000c000058728  0x0000000000000001 
0x000000c000058720:  0x00007f52c162dd8c  0x000000c000007ba0 
0x000000c000058730:  0x0000564cf41800e0  0x0000564cf40a7e14 <testing.tRunner+0x0000000000000034> 
0x000000c000058740:  0x0000000000000000  0x00007f52c15ed4b6 
0x000000c000058750: !0x00007f52c162dd8c >0x0000000000000000 
0x000000c000058760:  0x0000000000000000  0x0000000000000000 
0x000000c000058770:  0x00007f52c162d600  0x00007f52c15ecacf 
0x000000c000058780:  0x0000000000000000  0x00000000ffffffff 
0x000000c000058790:  0x0000564cf40a7fa0 <testing.tRunner.func1+0x0000000000000000>  0x000000c000007a00 
0x000000c0000587a0:  0x000000c000058780  0x000000c000058790 
0x000000c0000587b0:  0x000000c0000587d0  0x00007f52c15ed5d2 
0x000000c0000587c0:  0x00007f52c15f0080  0x00007f52c162d600 
0x000000c0000587d0:  0x00000000ffffffff  0x00007f52c15efbbb 
0x000000c0000587e0:  0x0000000000000000  0x00007f52c15efb6d 
0x000000c0000587f0:  0x00007f52c162d604  0x0000000000000000 

Perhaps this is Alpine-specific, or perhaps it is musl-related.
The Alpine image may have an old Linux kernel; maybe we should update it.

There are a few other open 'unexpected return pc' issues.
Maybe they are all stale:

#35005 is the most interesting one but the repro case is a very large program running under Docker.

@rsc rsc added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Aug 5, 2022
@rsc rsc added this to the Go1.20 milestone Aug 5, 2022
@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Aug 5, 2022
@prattmic
Member

prattmic commented Aug 8, 2022

@gopherbot
Contributor

Change https://go.dev/cl/422097 mentions this issue: env/linux-x86-alpine: update to Alpine 3.16

@mknyszek
Contributor

@rsc Assigning to you right now while you're updating the image, but feel free to unassign once you're done.

Updates from internal discussion:

  • Seems to reproduce after the update, too.
  • Maybe related to stack smashing protections?

CC @cherrymui @mdempsky

@rsc
Contributor Author

rsc commented Aug 10, 2022

I updated the image already, just need to submit the CL.

gopherbot pushed a commit to golang/build that referenced this issue Aug 10, 2022
Also update Go version in buildlet/stage0 to make
build work again.

For golang/go#54306 (but does not fix it).

Change-Id: I7dd656de9cb9f563b816929330fa53059c93b5b8
Reviewed-on: https://go-review.googlesource.com/c/build/+/422097
Run-TryBot: Russ Cox <rsc@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Auto-Submit: Russ Cox <rsc@golang.org>
TryBot-Result: Gopher Robot <gobot@golang.org>
@prattmic
Member

prattmic commented Aug 16, 2022

Failures since the 10th, when https://go.dev/cl/422097 was submitted (I'm almost certain the coordinator has been redeployed since then):

2022-08-16T20:39:44-e49e876/linux-amd64-alpine
2022-08-12T16:38:52-f001df5/linux-amd64-alpine
2022-08-12T01:51:51-449691b/linux-amd64-alpine

(Edit: rereading the CL, I see it wasn't intended to fix this)

@prattmic
Member

This is reproducible with go test in misc/cgo/test. Specifically, on a linux-amd64-alpine gomote, I had success with:

$ for i in $(seq 1 100); do gomote run -dir ./go/misc/cgo/test $INSTANCE ./go/bin/go test -count 1; done

@prattmic
Member

Elsewhere, @cherrymui mentioned that this looks like it could be corruption from a stack overflow. I agree. In the partial trace below, everything below 0xc000122770 looks like a Go text or heap address, but above 0xc000122770 we see lots of system-looking addresses (0x00007fd...). And this is right at the top of the stack, where the stack below may be overflowing.

goroutine 37 [syscall (scan)]:
runtime: g 37: unexpected return pc for runtime.notetsleepg called from 0x0
stack: frame={sp:0xc000122768, fp:0xc0001227a0} stack=[0xc000122000,0xc000122800)
0x000000c000122000:  0x0000000000000000  0x0000000000000000 
... mostly zeroes ...
0x000000c0001225f0:  0x0000000000000000  0x0000000000000000 
0x000000c000122600:  0x0000000000000000  0x0000000000000000 
0x000000c000122610:  0x0000000000000000  0x0000000000000000 
0x000000c000122620:  0x0000000000000000  0x0000000000000000 
0x000000c000122630:  0x0000000000000000  0x0000000000000000 
0x000000c000122640:  0x000000c046505845  0x000000c000122688 
0x000000c000122650:  0x0000000000487d77 <time.NewTimer+0x00000000000000b7>  0x000000012a05f200 
0x000000c000122660:  0x0000000000000001  0x000005ff66f38c41 
0x000000c000122670:  0x000000012a05f200  0x000000c000118050 
0x000000c000122680:  0x000000c000126060  0x000000c0001226f8 
0x000000c000122690:  0x0000000000564979 <misc/cgo/test.runTestSetgid+0x0000000000000079>  0x0000000000606b28 
0x000000c0001226a0:  0x0000000000000000  0x000000000061acf3 
0x000000c0001226b0:  0x0000000000000022  0x000000000000051b 
0x000000c0001226c0:  0x00000000004d9940 <testing.tRunner+0x0000000000000000>  0x00000000006af138 
0x000000c0001226d0:  0x000000000043bdd6 <runtime.futexsleep+0x0000000000000036>  0x0000000000773660 
0x000000c0001226e0:  0x0000000000000080  0x0000000000000000 
0x000000c0001226f0:  0x0000000000000000  0x0000000000000000 
0x000000c000122700:  0x00000000005bf4f0  0x0000000300000002 
0x000000c000122710:  0x000000c00010a820  0x000000c000122758 
0x000000c000122720:  0x0000000000415045 <runtime.notetsleep_internal+0x0000000000000185>  0x000000c000122760 
0x000000c000122730:  0x00000000004d97a5 <testing.callerName+0x0000000000000045>  0x00000000004d9974 <testing.tRunner+0x0000000000000034> 
0x000000c000122740:  0x0000000000000000  0xffffffffffffffff 
0x000000c000122750:  0x000000c00010a820  0x000000c000122790 
0x000000c000122760:  0x0000000000415165 <runtime.notetsleepg+0x0000000000000045> <0x0000000000773660 
0x000000c000122770:  0x0000000000000000  0x00007fd7fa3c46fa 
0x000000c000122780:  0x00007fd7fa408b84  0x0000000000000000 
0x000000c000122790:  0x0000000000000000 !0x0000000000000000 
0x000000c0001227a0: >0xffffffffffffffff  0x00007fd7fa3c3cf5 
0x000000c0001227b0:  0x0000000000000000  0x0000000000000001 
0x000000c0001227c0:  0x000000c00010a340  0x00000000005bcb50 
0x000000c0001227d0:  0x00007fd7fa408400  0x00007fd7fa3c4820 
0x000000c0001227e0:  0x00007fd7fa3c74a3  0x00007fd7fa408400 
0x000000c0001227f0:  0x0000000000000000  0x00007fd7fa3c6fca 
runtime.notetsleepg(0xffffffffffffffff?, 0x7fd7fa3c3cf5?)
        /workdir/go/src/runtime/lock_futex.go:236 +0x34 fp=0xc0001227a0 sp=0xc000122768 pc=0x415154
created by os/signal.Notify.func1.1
        /workdir/go/src/os/signal/signal.go:151 +0x2a
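The address-range reasoning applied to this trace can be sketched in Go. This is only an illustration: the `classify` helper and its ranges are assumptions read off this particular dump (Go text and data below 0x1000000, Go heap and goroutine stacks at 0xc000000000, mmap'd regions at 0x00007f...), not anything the runtime itself does.

```go
package main

import "fmt"

// Region is a coarse guess at what a 64-bit stack word points to.
type Region string

const (
	Zero   Region = "zero"
	GoText Region = "go-text-or-data"
	GoHeap Region = "go-heap-or-stack"
	Mmap   Region = "mmap-region"
	Other  Region = "other"
)

// classify buckets a word using address ranges assumed from the trace above.
// The ranges are heuristics for this dump, not guaranteed layouts.
func classify(w uint64) Region {
	switch {
	case w == 0:
		return Zero
	case w >= 0x400000 && w < 0x1000000:
		return GoText // non-PIE Go binary text/data in this trace
	case w >= 0xc000000000 && w < 0xc100000000:
		return GoHeap // Go heap arena, which also holds goroutine stacks
	case w>>40 == 0x7f:
		return Mmap // shared libraries and thread stacks usually live here
	}
	return Other
}

func main() {
	// A few words copied from the dump above.
	words := []uint64{
		0x0000000000415165, // runtime.notetsleepg+0x45
		0x000000c000122790,
		0x00007fd7fa3c46fa,
		0x00007fd7fa408b84,
		0xffffffffffffffff,
	}
	for _, w := range words {
		fmt.Printf("0x%016x  %s\n", w, classify(w))
	}
}
```

Words classified as mmap-region inside a goroutine stack are the suspicious ones here, since Go code would not normally store such values in its own frames.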

@prattmic
Member

FWIW, I've been unable to reproduce this with _StackLimit increased by 10x, which seems consistent with a stack overflow somewhere.

@prattmic
Member

I take that back. It took about an hour (instead of the usual ~5 minutes), but I did get a repro with 10 * _StackLimit.

@prattmic prattmic assigned prattmic and unassigned rsc Aug 19, 2022
@prattmic
Member

I've been making slow progress on this. The most notable finding is that this reproduces when running only TestSetgid and TestSetgidStress, while it does not reproduce when running only the various other tests I've tried. (I haven't tried each test individually, as there are dozens and the repro time is a bit high.) So this may be related to setgid, or just signals in general.

@prattmic
Member

It looks like the problem is that signal 34 (SIGRT_2) used by musl for setgid is not getting SA_ONSTACK set.

If I'm interpreting strace correctly, it looks like this signal is still SIG_DFL when Go queries (it would set SA_ONSTACK if a handler was already installed):

1184993 rt_sigaction(SIGRT_2, NULL, {sa_handler=SIG_DFL, sa_mask=[], sa_flags=0}, 8) = 0

Only later does musl install the signal handler:

1184993 rt_sigaction(SIGRT_2, {sa_handler=0x7f29efb24078, sa_mask=~[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f29efb15c9f},  <unfinished ...>
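To make the flag values in these strace lines concrete, here is a small Go sketch that decodes an sa_flags word and reports whether SA_ONSTACK is present. The constants are the Linux x86-64 values; the `decodeFlags` and `usesAltStack` helpers are hypothetical names for illustration and are not part of the Go runtime.

```go
package main

import (
	"fmt"
	"strings"
)

// Linux x86-64 sa_flags bits (only the ones relevant to this issue).
const (
	saSiginfo  uint64 = 0x00000004
	saRestorer uint64 = 0x04000000
	saOnstack  uint64 = 0x08000000
	saRestart  uint64 = 0x10000000
)

var flagNames = []struct {
	bit  uint64
	name string
}{
	{saSiginfo, "SA_SIGINFO"},
	{saRestorer, "SA_RESTORER"},
	{saOnstack, "SA_ONSTACK"},
	{saRestart, "SA_RESTART"},
}

// decodeFlags renders a flags word in strace-like form.
// Bits not in flagNames are appended as a hex remainder.
func decodeFlags(flags uint64) string {
	if flags == 0 {
		return "0"
	}
	var parts []string
	for _, f := range flagNames {
		if flags&f.bit != 0 {
			parts = append(parts, f.name)
			flags &^= f.bit
		}
	}
	if flags != 0 {
		parts = append(parts, fmt.Sprintf("%#x", flags))
	}
	return strings.Join(parts, "|")
}

// usesAltStack reports whether a handler with these flags
// would run on the sigaltstack.
func usesAltStack(flags uint64) bool { return flags&saOnstack != 0 }

func main() {
	// Flags musl passes in the strace output above.
	musl := saRestorer | saRestart
	fmt.Println(decodeFlags(musl), usesAltStack(musl)) // SA_RESTORER|SA_RESTART false

	// What Go would need the handler to carry.
	fixed := musl | saOnstack
	fmt.Println(decodeFlags(fixed), usesAltStack(fixed)) // SA_RESTORER|SA_ONSTACK|SA_RESTART true
}
```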

@prattmic
Member

musl does not install the SIGRT_2 handler until the first call to __synccall (actually, it reinstalls it on every call), which is why we don't know to add SA_ONSTACK at startup.

__synccall is used by setgroups, setrlimit, and setxid.

@cherrymui
Member

> reinstalls it on every call

(!)

That means even if we set SA_ONSTACK for their handler, they will reinstall and overwrite it?

@cherrymui
Member

https://git.musl-libc.org/cgit/musl/tree/src/thread/synccall.c#n102

Does it mean that they remove the handler at exit of the call? Hm....

@prattmic
Member

Correct, they don't even try to match the existing flags or forward to an existing handler, so we can't install a dummy SA_ONSTACK handler.

> Does it mean that they remove the handler at exit of the call?

Yes, that is what I see:

1184993 rt_sigaction(SIGRT_2, {sa_handler=0x7f29efb24078, sa_mask=~[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f29efb15c9f},  <unfinished ...>
1184993 tkill(1184998, SIGRT_2)         = 0
1184998 --- SIGRT_2 {si_signo=SIGRT_2, si_code=SI_TKILL, si_pid=1184993, si_uid=0} ---
1184993 tkill(1184996, SIGRT_2 <unfinished ...>
1184996 --- SIGRT_2 {si_signo=SIGRT_2, si_code=SI_TKILL, si_pid=1184993, si_uid=0} ---
1184993 tkill(1184995, SIGRT_2 <unfinished ...>
1184995 --- SIGRT_2 {si_signo=SIGRT_2, si_code=SI_TKILL, si_pid=1184993, si_uid=0} ---
1184993 tkill(1184997, SIGRT_2 <unfinished ...>
1184997 --- SIGRT_2 {si_signo=SIGRT_2, si_code=SI_TKILL, si_pid=1184993, si_uid=0} ---
1184993 tkill(1184994, SIGRT_2 <unfinished ...>
1184994 --- SIGRT_2 {si_signo=SIGRT_2, si_code=SI_TKILL, si_pid=1184993, si_uid=0} ---
1184993 rt_sigaction(SIGRT_2, {sa_handler=SIG_IGN, sa_mask=~[], sa_flags=SA_RESTORER|SA_RESTART, sa_restorer=0x7f29efb15c9f},  <unfinished ...>

@prattmic
Member

prattmic commented Aug 19, 2022

To summarize:

  • Go uses small goroutine stacks, so there is no guarantee that there is enough space on the stack for signal context and frame at all times.
  • To handle this, Go creates a separate signal stack for each thread installed with sigaltstack. All signal handlers must set SA_ONSTACK to use the signal stack and avoid smashing the goroutine stack.
  • To try to cooperate with libc, at startup Go inspects all signal handlers (even ones it doesn't care to handle), and adds SA_ONSTACK if it is not already set.
  • musl uses signal 34 for the various setxid calls, but does not install the handler at startup. Instead, it is temporarily installed on each call to the setxid functions (in __synccall).
  • As a result, Go never has a chance to add SA_ONSTACK.
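The sequence in the summary above can be illustrated with a toy Go model. Nothing here touches real signals: the `process` type, the handler names, and the use of signal 10 as an example of a handler installed before startup are all assumptions made for the sketch.

```go
package main

import "fmt"

const (
	saOnstack uint64 = 0x08000000 // Linux SA_ONSTACK
	saRestart uint64 = 0x10000000 // Linux SA_RESTART
)

// sigaction is a toy model of one signal disposition.
type sigaction struct {
	handler string // "SIG_DFL", "SIG_IGN", or a made-up handler name
	flags   uint64
}

// process models a per-process signal table.
type process struct {
	actions map[int]sigaction
}

// goStartup models the startup pass described above: handlers that are
// already installed get SA_ONSTACK added; default/ignored ones are left alone.
func (p *process) goStartup() {
	for sig, sa := range p.actions {
		if sa.handler != "SIG_DFL" && sa.handler != "SIG_IGN" {
			sa.flags |= saOnstack
			p.actions[sig] = sa
		}
	}
}

// muslSynccall models __synccall: install a handler with its own flags,
// deliver the signal, then reset the disposition. It returns the flags
// that were in effect while the signal was delivered.
func (p *process) muslSynccall(sig int) uint64 {
	p.actions[sig] = sigaction{handler: "synccall_handler", flags: saRestart}
	delivered := p.actions[sig].flags
	p.actions[sig] = sigaction{handler: "SIG_IGN", flags: saRestart}
	return delivered
}

func main() {
	p := &process{actions: map[int]sigaction{
		10: {handler: "preinstalled_handler"}, // hypothetical early handler
		34: {handler: "SIG_DFL"},              // musl's signal, not yet installed
	}}
	p.goStartup()
	fmt.Println("signal 10 on altstack:", p.actions[10].flags&saOnstack != 0)
	fmt.Println("signal 34 on altstack:", p.muslSynccall(34)&saOnstack != 0)
}
```

In this model the preinstalled handler picks up SA_ONSTACK, while the signal 34 handler is delivered without it, so it would run on whatever goroutine stack the thread happens to be using.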

I don't see how we can work around this in Go given that we can't adjust the signal handler flags, nor does __synccall respect flags from an existing signal handler. We would have to make goroutine stacks much larger, which would be a significant increase in stack allocations.

There are several changes on the musl side that could address this:

  • musl could install the signal 34 handler once at startup so that Go can adjust the flag.
  • Or, __synccall could query for an existing signal handler, and if it has SA_ONSTACK then keep that flag for their handler. In this case, Go would install a dummy signal 34 handler at startup just to expose SA_ONSTACK.
  • Or, even simpler, according to man 2 sigaction's SA_ONSTACK description: "If an alternate stack is not available, the default stack will be used." If this is accurate (I haven't verified), then __synccall could set SA_ONSTACK unconditionally, which would normally make no difference, but would use Go's sigaltstack when linked with Go.
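The last two options differ only in how __synccall would pick its flags. A minimal Go model of that choice follows; the `policy` type and `synccallFlags` function are hypothetical names for illustration, and real musl code is C and would look different.

```go
package main

import "fmt"

const (
	saOnstack uint64 = 0x08000000 // Linux SA_ONSTACK
	saRestart uint64 = 0x10000000 // Linux SA_RESTART
)

// policy selects how the modeled __synccall chooses its handler flags.
type policy int

const (
	current policy = iota // today: ignore the existing disposition
	inherit               // keep SA_ONSTACK if the existing handler has it
	always                // set SA_ONSTACK unconditionally
)

// synccallFlags returns the flags the modeled handler would be installed
// with, given the flags of whatever handler was there before.
func synccallFlags(existing uint64, p policy) uint64 {
	flags := saRestart
	switch p {
	case inherit:
		flags |= existing & saOnstack
	case always:
		flags |= saOnstack
	}
	return flags
}

func main() {
	// Suppose Go installed a dummy handler carrying SA_ONSTACK at startup.
	dummy := saOnstack
	for _, p := range []policy{current, inherit, always} {
		f := synccallFlags(dummy, p)
		fmt.Printf("policy %d: SA_ONSTACK=%v\n", p, f&saOnstack != 0)
	}
}
```

Under the inherit policy Go still has to install a dummy handler first; under the always policy no cooperation from Go is needed, which is why that option is described as simpler.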

@prattmic
Copy link
Member

Ah, it turns out this is a duplicate of #39857, which has been discussed at some length but not resolved.
