runtime: SIGSEGV in misc/cgo/testshared on s390x since CL 358674 #49386

Closed
bcmills opened this issue Nov 5, 2021 · 12 comments
Labels
arch-s390x NeedsInvestigation okay-after-beta1 release-blocker
Milestone

Comments

@bcmills
Member Author

@bcmills bcmills commented Nov 5, 2021

This is a regression with a fairly clear starting point, so marking as release-blocker for Go 1.18.

@bcmills bcmills added arch-s390x NeedsInvestigation release-blocker labels Nov 5, 2021
@bcmills bcmills added this to the Go1.18 milestone Nov 5, 2021
@jonathan-albrecht-ibm
Contributor

@jonathan-albrecht-ibm jonathan-albrecht-ibm commented Nov 9, 2021

@mknyszek, I started debugging this yesterday to see if I can help in any way. I'm still trying to make sense of it but it seems to be related to the writeBarrier checks that are enabled when the GODEBUG flag cgocheck=2 is set.

I can reproduce the same failure on go1.17.3 on s390x in the misc/cgo/testshared TestGopathShlib test by running:

go test -c
GODEBUG=cgocheck=2 ./testshared.test -test.run TestGopathShlib

I ran ./testshared.test -test.v -testwork -test.run TestGopathShlib to keep the exe file created by the test and tried to debug it, but there are no debug symbols. I was able to inspect the memory for the runtime.writeBarrier struct that is referenced in the os.newFile function that causes the segfault. At least I think so; please let me know if I got any of the stuff below wrong.

(gdb) x/40ni $pc
=> 0x2aa00093f48 <os.newFile+24>:       lg      %r0,80(%r15)
<... skip some lines ...>
   0x2aa00093fca <os.newFile+154>:      lgrl    %r11,0x2aa00112bf8
   0x2aa00093fd0 <os.newFile+160>:      llgf    %r2,0(%r11)
   0x2aa00093fd6 <os.newFile+166>:      cijne   %r2,0,0x2aa00093fec <os.newFile+188>
   0x2aa00093fdc <os.newFile+172>:      lg      %r4,88(%r15)
   0x2aa00093fe2 <os.newFile+178>:      stg     %r4,56(%r1)
   0x2aa00093fe8 <os.newFile+184>:      j       0x2aa00093ffe <os.newFile+206>
   0x2aa00093fec <os.newFile+188>:      aghik   %r2,%r1,56
   0x2aa00093ff2 <os.newFile+194>:      lg      %r3,88(%r15)
   0x2aa00093ff8 <os.newFile+200>:      brasl   %r14,0x2aa00066610 <runtime.gcWriteBarrier@plt>
   0x2aa00093ffe <os.newFile+206>:      llgc    %r4,63(%r15)
(gdb) b *0x2aa00093fd0
Breakpoint 2 at 0x2aa00093fd0
(gdb) c
Continuing.

Thread 1 "exe" hit Breakpoint 2, 0x000002aa00093fd0 in os.newFile ()
(gdb) info reg $r11
r11            0x3fffdf32220       4398012113440
(gdb) x/16xb 0x3fffdf32220
0x3fffdf32220 <runtime.writeBarrier>:   0x01    0x00    0x00    0x00    0x00    0x01    0x00    0x00
0x3fffdf32228 <runtime.writeBarrier+8>: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00

and then continuing will generate the segfault. IIUC, the two 0x01 bytes are the runtime.writeBarrier.enabled and runtime.writeBarrier.cgo fields.

If I do the same with GODEBUG=cgocheck=0 gdb ./gopath/bin/exe the runtime.writeBarrier looks like:

(gdb)  x/16xb 0x3fffdf32220
0x3fffdf32220 <runtime.writeBarrier>:   0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
0x3fffdf32228 <runtime.writeBarrier+8>: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00

and no segfault.

But if I do the same (GODEBUG=cgocheck=0 gdb ./gopath/bin/exe) with a build from the source checked out at 961aab2, where the s390x builder started failing, it appears the runtime.writeBarrier.enabled and runtime.writeBarrier.cgo fields are true:

(gdb) x/16xb  0x3fffdf66220
0x3fffdf66220 <runtime.writeBarrier>:   0x01    0x00    0x00    0x00    0x01    0x00    0x00    0x00
0x3fffdf66228 <runtime.writeBarrier+8>: 0x00    0x00    0x00    0x00    0x00    0x00    0x00    0x00
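
For readers comparing the two non-zero dumps above, here is a minimal sketch that decodes the first eight bytes. It assumes the runtime.writeBarrier layout from the Go runtime source of that era (enabled at offset 0, three padding bytes, needed at offset 4, cgo at offset 5); that layout is an assumption here and is not stated in this thread. The wbFlags type and decode function are hypothetical names used only for illustration.

```go
package main

import "fmt"

// wbFlags holds the three boolean fields of runtime.writeBarrier.
// ASSUMED layout (not confirmed in this thread):
//   offset 0: enabled, offsets 1-3: padding, offset 4: needed, offset 5: cgo
type wbFlags struct {
	enabled, needed, cgo bool
}

// decode interprets the first bytes of a gdb x/16xb dump of
// runtime.writeBarrier under the assumed layout above.
func decode(b []byte) wbFlags {
	return wbFlags{
		enabled: b[0] != 0,
		needed:  b[4] != 0,
		cgo:     b[5] != 0,
	}
}

func main() {
	// go1.17.3 with GODEBUG=cgocheck=2 (first dump in the comment).
	cgocheck2 := []byte{0x01, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00}
	// Build at 961aab2 with GODEBUG=cgocheck=0 (third dump in the comment).
	at961aab2 := []byte{0x01, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00}

	fmt.Printf("%+v\n", decode(cgocheck2)) // {enabled:true needed:false cgo:true}
	fmt.Printf("%+v\n", decode(at961aab2)) // {enabled:true needed:true cgo:false}
}
```

If that layout assumption holds, the second set byte in the 961aab2 dump sits at offset 4 rather than offset 5, so it may be the needed field rather than cgo, which would fit a GC being in progress at that point.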

If I break at the beginning of the os.newFile function, wait a bit, and then continue until after $r11 is loaded, I usually find that the runtime.writeBarrier.enabled and runtime.writeBarrier.cgo fields have been flipped back to false, and then there is no segfault.

So I think there is a bug on s390x with cgocheck=2 that's been around for a while, and it's been exposed by some kind of issue with the runtime.writeBarrier fields being set or being visible. I'll keep looking but I'm not sure I'll be able to get much farther. Happy to help in any way I can though.

@mknyszek
Contributor

@mknyszek mknyszek commented Nov 9, 2021

@jonathan-albrecht-ibm Thanks for looking into it and for the detailed analysis! It confirms my suspicions.

The runtime.writeBarrier fields are likely just set because a GC is actively happening. This is related to 961aab2 because that enabled the new pacer, which lowers the minimum heap size, so a whole bunch of tests started executing GCs that didn't before. Write barriers are now on, so if one is triggered where it's not safe to do so, it can cause a crash (usually because the executing code doesn't have a valid P, in Go scheduler terminology).

I suspect the cgocheck mode is failing for exactly the same reason: it's enabling write barriers in more places, and a write barrier is happening where it's not safe to do so. However, it's also almost certainly correct to have the write barrier on in os.newFile here, so I suspect there's something about the execution environment of the code that makes it not safe to do regular Go things.

@jonathan-albrecht-ibm
Contributor

@jonathan-albrecht-ibm jonathan-albrecht-ibm commented Nov 9, 2021

Thanks for the explanation @mknyszek. That helps clear things up a bit. I'll continue looking at it, but it will likely be slow going.

Note that s390x trial VMs are available at https://linuxone.cloud.marist.edu/ if anyone is interested.

@jonathan-albrecht-ibm
Contributor

@jonathan-albrecht-ibm jonathan-albrecht-ibm commented Nov 11, 2021

From stepping through the code, I think the PLT symbol code (not sure what to call it) might be clobbering some registers. I found that the addpltsym function in src/cmd/link/internal/s390x/asm.go generates those instructions, so I'm going to look at that next.

@jeremyfaller jeremyfaller added the okay-after-beta1 label Nov 12, 2021
@jeremyfaller
Contributor

@jeremyfaller jeremyfaller commented Nov 12, 2021

This builder seems to have gone away. @cherrymui has some ideas on quick tests to help diagnose.

Moving this to OK after Beta1.

@jonathan-albrecht-ibm
Contributor

@jonathan-albrecht-ibm jonathan-albrecht-ibm commented Nov 12, 2021

I had a look at the builders and they look ok. Would they have been disabled after some number of failures?

Glad to hear @cherrymui has some ideas on testing. Let me know if I can help with that.

My guess is that register R1 and maybe others are being clobbered by the branch to runtime.gcWriteBarrier@plt so I'm trying to understand the code that sets up that call. Here is some debug info in case it helps:

<...SKIP SOME LINES...>
<R1 == 0xc000218000>
   0x2aa00078888 <os.newFile+200>: brasl %r14,0x2aa00047ea8 <runtime.gcWriteBarrier@plt>
<R1 == 0x3fffde16f90>
   0x2aa0007888e <os.newFile+206>: llgc %r4,63(%r15)
   0x2aa00078894 <os.newFile+212>: stc %r4,81(%r1)        <=== SEGFAULT HERE

where inside runtime.gcWriteBarrier@plt it looked like:

<R1 == 0xc000218000>
   0x2aa00047ea8 <runtime.gcWriteBarrier@plt>:  larl    %r1,0x2aa000f2e68 <runtime.gcWriteBarrier@got.plt>
<R1 == 0x2aa000f2e68>
   0x2aa00047eae <runtime.gcWriteBarrier@plt+6>:        lg      %r1,0(%r1)
<R1 == 0x3fffde16f90>
   0x2aa00047eb4 <runtime.gcWriteBarrier@plt+12>:       br      %r1
   0x2aa00047eb6 <runtime.gcWriteBarrier@plt+14>:       basr    %r1,%r0
   0x2aa00047eb8 <runtime.gcWriteBarrier@plt+16>:       lgf     %r1,12(%r1)
   0x2aa00047ebe <runtime.gcWriteBarrier@plt+22>:       jg      0x2aa00046e08
   0x2aa00047ec4 <runtime.gcWriteBarrier@plt+28>:       .long   0x00000c60
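
The trace above can be modelled with a small toy program. This is only an illustrative sketch, not linker or runtime code: regs and pltStub are hypothetical names, the addresses are copied from the trace, and the model assumes the callee itself leaves R1 untouched.

```go
package main

import "fmt"

// regs is a toy register file; only R1 matters for this illustration.
type regs struct{ r1 uint64 }

// pltStub models the first three instructions of the stub in the trace:
//
//	larl %r1, <got.plt entry>  ; R1 = address of the GOT slot
//	lg   %r1, 0(%r1)           ; R1 = target address stored in that slot
//	br   %r1                   ; jump to the target
//
// It returns the address that would be branched to.
func pltStub(r *regs, gotSlot uint64, got map[uint64]uint64) uint64 {
	r.r1 = gotSlot   // larl
	r.r1 = got[r.r1] // lg
	return r.r1      // br
}

func main() {
	const (
		heapPtr = 0xc000218000  // R1 before the brasl, per the trace
		gotSlot = 0x2aa000f2e68 // runtime.gcWriteBarrier@got.plt
		target  = 0x3fffde16f90 // value loaded from the GOT slot
	)
	r := regs{r1: heapPtr}
	got := map[uint64]uint64{gotSlot: target}

	pltStub(&r, gotSlot, got)

	// The caller later uses R1 as a base address (stc %r4,81(%r1)),
	// but R1 no longer holds the heap pointer.
	fmt.Printf("R1 after call: %#x\n", r.r1) // R1 after call: 0x3fffde16f90
}
```

In this model the caller's pointer in R1 is lost as soon as the stub runs, which matches the R1 values recorded before and after the brasl in the trace.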

@cherrymui
Member

@cherrymui cherrymui commented Nov 13, 2021

@jonathan-albrecht-ibm Could you test if CL https://go-review.googlesource.com/c/go/+/363698 helps? Thanks!

(I cannot test it myself because the builder has disappeared.)

@gopherbot

@gopherbot gopherbot commented Nov 13, 2021

Change https://golang.org/cl/363698 mentions this issue: cmd/compile, runtime: mark R1 as clobbered for write barrier call

@jonathan-albrecht-ibm
Contributor

@jonathan-albrecht-ibm jonathan-albrecht-ibm commented Nov 13, 2021

Thanks @cherrymui, it looks good. I ran all.bash and all of the tests pass. I also ran:

cd ../misc/cgo/testshared/
~/src/goroot-r1-fix/bin/go test -c
GODEBUG=cgocheck=2 ./testshared.test

and that also passes.

@cherrymui
Member

@cherrymui cherrymui commented Nov 15, 2021

@jonathan-albrecht-ibm thanks!

gopherbot pushed a commit that referenced this issue Nov 15, 2021
If the call to gcWriteBarrier is via PLT, the PLT stub will
clobber R1. Mark R1 clobbered.

For #49386.

Change-Id: I72df5bb3b8d10381fec5c567b15749aaf7d2ad70
Reviewed-on: https://go-review.googlesource.com/c/go/+/363698
Trust: Cherry Mui <cherryyz@google.com>
Run-TryBot: Cherry Mui <cherryyz@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
@cherrymui
Member

@cherrymui cherrymui commented Nov 15, 2021

Fixed by the CL above.


6 participants