runtime: SIGBUS / SIGSEGV during asmcgocall #46170

Open
bcmills opened this issue May 14, 2021 · 18 comments

Comments

@bcmills
Member

@bcmills bcmills commented May 14, 2021

One while running cmd/go:

2021-05-14T14:37:54-d137b74/solaris-amd64-oraclerel

##### 
fatal error: unexpected signal during runtime execution
[signal SIGBUS: bus error code=0x3 addr=0x7fff929fefd8 pc=0x46eadd]

runtime stack:
runtime.throw({0xa53921, 0x2a})
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/panic.go:1198 +0x74 fp=0x7fff929fef80 sp=0x7fff929fef50 pc=0x43a2d4
runtime.sigpanic()
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/signal_unix.go:719 +0x4a5 fp=0x7fff929fefe0 sp=0x7fff929fef80 pc=0x451be5
runtime.asmcgocall(0x0, 0x0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/asm_amd64.s:795 +0xbd fp=0x7fff929fefe8 sp=0x7fff929fefe0 pc=0x46eadd

Another while running cmd/vet:

2021-04-14T19:38:22-b161b57/solaris-amd64-oraclerel

# vendor/golang.org/x/net/nettest
fatal error: unexpected signal during runtime execution
[signal SIGBUS: bus error code=0x3 addr=0x7fff945feff8 pc=0x468fd2]

runtime stack:
runtime.throw(0x728723, 0x2a)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/panic.go:1191 +0x74
runtime.sigpanic()
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/signal_unix.go:719 +0x4a5
runtime.asmcgocall(0x0, 0x0)
	/tmp/workdir-host-solaris-oracle-amd64-oraclerel/go/src/runtime/asm_amd64.s:796 +0xb2
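
When comparing failures like the two above, it can help to pull the signal name, code, fault address, and pc out of the `[signal ...]` header mechanically. The following is a minimal sketch, not part of the Go tooling; the `fault` type and `parseFault` helper are hypothetical names introduced only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// fault holds the fields of a Go runtime "[signal ...]" crash header.
// (Hypothetical type, introduced for this sketch.)
type fault struct {
	Signal string
	Code   uint64
	Addr   uint64
	PC     uint64
}

// signalLine matches headers such as
// "[signal SIGBUS: bus error code=0x3 addr=0x7fff929fefd8 pc=0x46eadd]".
var signalLine = regexp.MustCompile(
	`\[signal (SIG[A-Z]+): .*? code=(0x[0-9a-f]+) addr=(0x[0-9a-f]+) pc=(0x[0-9a-f]+)\]`)

// parseFault extracts the signal name, code, fault address, and pc
// from a single log line. It reports false if the line is not a header.
func parseFault(line string) (fault, bool) {
	m := signalLine.FindStringSubmatch(line)
	if m == nil {
		return fault{}, false
	}
	var nums [3]uint64
	for i, s := range m[2:] {
		// Base 0 lets ParseUint honor the "0x" prefix.
		v, err := strconv.ParseUint(s, 0, 64)
		if err != nil {
			return fault{}, false
		}
		nums[i] = v
	}
	return fault{Signal: m[1], Code: nums[0], Addr: nums[1], PC: nums[2]}, true
}

func main() {
	// The two headers quoted in this issue.
	lines := []string{
		"[signal SIGBUS: bus error code=0x3 addr=0x7fff929fefd8 pc=0x46eadd]",
		"[signal SIGBUS: bus error code=0x3 addr=0x7fff945feff8 pc=0x468fd2]",
	}
	for _, l := range lines {
		if f, ok := parseFault(l); ok {
			fmt.Printf("%s code=%#x addr=%#x pc=%#x\n", f.Signal, f.Code, f.Addr, f.PC)
		}
	}
}
```

Both headers carry the same signal and code, and differ only in the fault address and pc.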

To me that smells like a runtime or compiler bug, and since these are the only two in the logs it looks like a Go 1.17 regression.

CC @prattmic @cherrymui @randall77

@bcmills bcmills added this to the Go1.17 milestone May 14, 2021
@bcmills
Member Author

@bcmills bcmills commented May 14, 2021

Marking as OS-Solaris for now, but with only two occurrences it's not obvious to me whether this is a Solaris-specific bug or just more readily triggered by something on that particular builder (such as signal-delivery timing or CPU count).

@prattmic
Member

@prattmic prattmic commented May 14, 2021

This instruction is CALL AX, and the fault address looks like a stack address, so we are perhaps trying to jump into the stack? Trying to jump to a non-PROT_EXEC mapping would be SIGSEGV on Linux, but perhaps it is SIGBUS on Solaris?
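
One way to check the "looks like a stack address" observation is to compare the fault address with the sp reported for the runtime.asmcgocall frame in the first trace. This is a rough sketch; the 4 KiB window and the `nearStack` helper are assumptions made for illustration, not anything the runtime defines.

```go
package main

import "fmt"

// Values copied from the first solaris-amd64 trace in this issue.
const (
	faultAddr = 0x7fff929fefd8 // addr= in the signal header
	frameSP   = 0x7fff929fefe0 // sp= of the runtime.asmcgocall frame
)

// nearStack reports whether addr lies within window bytes of sp,
// in either direction. (Hypothetical helper; the window is arbitrary.)
func nearStack(addr, sp, window uint64) bool {
	if addr > sp {
		return addr-sp <= window
	}
	return sp-addr <= window
}

func main() {
	fmt.Printf("sp - addr = %d bytes\n", uint64(frameSP)-uint64(faultAddr))
	fmt.Printf("within 4 KiB of sp: %v\n", nearStack(faultAddr, frameSP, 4096))
}
```

For this trace the fault address sits 8 bytes below the reported sp, which is consistent with it being a stack address, though it does not by itself say why the access faulted.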

Either way, perhaps this is a regabi-related regression?

cc @mknyszek

@bcmills
Member Author

@bcmills bcmills commented May 14, 2021

> Trying to jump to a non-PROT_EXEC mapping would be SIGSEGV on Linux, but perhaps it is SIGBUS on Solaris?

Ooh, neat! That yields two more matching logs, which both bear a strong similarity to #46080.
2021-05-06T19:28:34-c0140e8/openbsd-386-64
2021-04-30T20:08:34-7a6108e/openbsd-386-64

@bcmills bcmills changed the title runtime: SIGBUS during asmcgocall on solaris-amd64-oraclerel runtime: SIGBUS / SIGSEGV during asmcgocall May 14, 2021
@prattmic
Member

@prattmic prattmic commented May 14, 2021

Yeah, I'd say these are all related. It's curious that the openbsd fault addresses are all page-aligned, while the solaris ones aren't.
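
The page-alignment observation for the Solaris addresses can be reproduced with a few lines of arithmetic. This sketch assumes a 4096-byte page size, which is an assumption about the builders rather than something stated in the logs; the OpenBSD fault addresses are not quoted in this thread, so only the Solaris ones are checked.

```go
package main

import "fmt"

// pageSize is an assumption; the logs do not state the page size.
const pageSize = 4096

// pageOffset returns the offset of addr within its page.
func pageOffset(addr uint64) uint64 { return addr % pageSize }

// isPageAligned reports whether addr is the first byte of a page.
func isPageAligned(addr uint64) bool { return pageOffset(addr) == 0 }

func main() {
	// Fault addresses from the two solaris-amd64 traces above.
	for _, addr := range []uint64{0x7fff929fefd8, 0x7fff945feff8} {
		fmt.Printf("%#x: offset in page %#x, page-aligned: %v\n",
			addr, pageOffset(addr), isPageAligned(addr))
	}
}
```

Under that assumption, both Solaris addresses fall near the end of a page (offsets 0xfd8 and 0xff8) rather than on a page boundary.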

@mknyszek
Contributor

@mknyszek mknyszek commented May 14, 2021

I'm looking at the Solaris ones first, because those OpenBSD failures are on 386 and so are less likely to be affected by anything regabi related. I'm not totally convinced it's the same issue, as a result, though maybe it's correlated with some change in compiler behavior.

As I'm looking at the code for this, I don't fully understand what value is supposed to be passed to asmcgocall. From the first failure (2021-05-14T14:37:54-d137b74/solaris-amd64-oraclerel), I gather that goroutine 387 is making a call to a libc function, and that this is where the asmcgocall gets its fn value. The value that's passed is unsafe.Pointer(&asmsysvicall6x), but looking at asmsysvicall6x I see:

type libcFunc uintptr

//go:linkname asmsysvicall6x runtime.asmsysvicall6
var asmsysvicall6x libcFunc // name to take addr of asmsysvicall6

func asmsysvicall6() // declared for vet; do NOT call

This is weird. The faulting address does appear to be a stack address, but how did it get there? We're clearly in the 'nosave' path of asmcgocall, and AX is untouched between when it gets the fn value and where we do CALL AX. Note that all regabi flags are turned off for Solaris, including ABI wrappers.

In the 1.17 release, asmcgocall only changed in one line, and that's the CALL gosave_systemstack_switch<>(SB) line which I'm fairly confident is not getting exercised because we're in the 'nosave' path.

My only thought is that the stack address somehow propagates from the caller, but unsafe.Pointer(&asmsysvicall6x) appears to be... the address of a global variable? Which honestly doesn't seem right. But because Solaris is a libc platform, this is getting called a lot, and the Solaris builder isn't totally broken, so clearly this is correct in some sense.

Perhaps the trick here is that it shouldn't be going on the nosave path at all. That doesn't explain why AX seems busted, but the call is happening from what appears to be a regular goroutine, so there should be a system stack switch there. Maybe the line numbers are messed up somehow?

Oh, actually, yes! There IS a system stack switch that happens, because asmcgocall has a system-stack-like address (indicating a stack obtained from the OS) and its supposed caller has a heap-like address (indicating a stack created by Go). I think the line numbers here might actually be busted and this could be somehow related to golang.org/cl/288799, which landed in February.
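
The "system-stack-like" versus "heap-like" distinction above is a reading of the address ranges. A rough sketch of that heuristic follows; the range boundaries and the `classify` helper are assumptions chosen for illustration on amd64, not constants exported by the Go runtime, and the second address in `main` is hypothetical because the caller's sp is not quoted in this thread.

```go
package main

import "fmt"

// Assumed address ranges for amd64. These are heuristics for reading
// tracebacks, not values defined by the Go runtime.
const (
	goLo = 0x00c000000000 // assumed start of the range the Go heap typically occupies
	goHi = 0x010000000000 // assumed end of the range treated as Go-allocated here
	osLo = 0x7f0000000000 // assumed start of the high user-space region
	osHi = 0x800000000000 // assumed end of amd64 user space
)

// classify guesses where a stack address came from based on its range.
func classify(addr uint64) string {
	switch {
	case addr >= goLo && addr < goHi:
		return "Go-allocated-like"
	case addr >= osLo && addr < osHi:
		return "OS-allocated-like"
	default:
		return "unknown"
	}
}

func main() {
	// sp of the runtime.asmcgocall frame in the first trace.
	fmt.Println(classify(0x7fff929fefe0))
	// Hypothetical goroutine stack address, for contrast only.
	fmt.Println(classify(0xc000123456))
}
```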

@mknyszek
Contributor

@mknyszek mknyszek commented May 14, 2021

@bcmills Out of curiosity, how far back do these failures go? What you posted above, is that all of them?

It's possible this is some fun combination of https://golang.org/cl/288799 and a regabi-related CL, too, given the current timeline.

@cherrymui
Contributor

@cherrymui cherrymui commented May 14, 2021

> unsafe.Pointer(&asmsysvicall6x) appears to be... the address of a global variable? Which honestly doesn't seem right

This is actually expected. This is how the Solaris port works. asmsysvicall6x is a C function. It is declared as a variable (!) and uses linkname to connect to the C function. Arguably it doesn't look nice.

@mknyszek
Contributor

@mknyszek mknyszek commented May 14, 2021

@cherrymui Thanks, good to know. At least that part makes sense now.

@cherrymui
Contributor

@cherrymui cherrymui commented May 14, 2021

The "nosave" path may not be wrong, either. We may already be on the system stack, so asmcgocall doesn't switch stacks. But the traceback stops at asmcgocall, because it doesn't know how to unwind through it (maybe it is possible to teach the traceback code about that case, if it can tell we are on the "nosave" path).

@cherrymui
Contributor

@cherrymui cherrymui commented May 14, 2021

Could the SIGBUS be an alignment fault? What if the C function called by asmcgocall uses an instruction that requires 16-byte alignment to load something from the stack, but the address is not aligned?
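
The 16-byte alignment of the addresses in the logs can be checked directly. This is a small sketch using the sp and fault addresses quoted in the traces above; the `isAligned` helper is a hypothetical name, and the result only describes the addresses, not what instruction actually faulted.

```go
package main

import "fmt"

// isAligned reports whether addr is a multiple of align.
// align must be a power of two.
func isAligned(addr, align uint64) bool { return addr&(align-1) == 0 }

func main() {
	addrs := []struct {
		name string
		addr uint64
	}{
		{"asmcgocall sp (first trace)", 0x7fff929fefe0},
		{"fault addr (first trace)", 0x7fff929fefd8},
		{"fault addr (second trace)", 0x7fff945feff8},
	}
	for _, a := range addrs {
		fmt.Printf("%-28s %#x mod 16 = %d, 16-byte aligned: %v\n",
			a.name, a.addr, a.addr%16, isAligned(a.addr, 16))
	}
}
```

The reported sp is 16-byte aligned, while both fault addresses are 8 mod 16. That is compatible with an alignment problem but does not establish one.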

@mknyszek
Contributor

@mknyszek mknyszek commented May 14, 2021

> The "nosave" path may not be wrong, either. We may already be on the system stack, so asmcgocall doesn't switch stacks. But the traceback stops at asmcgocall, because it doesn't know how to unwind through it (maybe it is possible to teach the traceback code about that case, if it can tell we are on the "nosave" path).

Perhaps I'm missing something. Do you have any idea where the stack switch happens? I thought that it must be in asmcgocall, given that its caller is just ordinary Go code in the runtime package, and the caller's sp and fp in the traceback look like Go pointers. Unless entersyscallblock switches to the system stack?

Though the goroutine that I'm interested in (number 387) is runnable, not in _Gsyscall as it should be if it went through entersyscallblock...

@mknyszek
Contributor

@mknyszek mknyszek commented May 14, 2021

> Can the SIGBUS be alignment? if the C function called by asmcgocall uses a 16-byte aligned instruction to load something from the stack, but the address is not aligned?

Ah! Yeah, that's a good point. I think that might actually be it.

@bcmills
Member Author

@bcmills bcmills commented May 14, 2021

> Out of curiosity, how far back do these failures go? What you posted above, is that all of them?

That's all of the recent ones I could find using greplogs. (There is a bit of a discontinuity and then some failures from darwin-386 back in 2019, but there's enough distance in between that I'm not sure those old darwin failures are related.)

@bcmills
Member Author

@bcmills bcmills commented May 14, 2021

> if the C function called by asmcgocall uses a 16-byte aligned instruction to load something from the stack, but the address is not aligned?

Oh, hey, that sounds familiar: see previously #17641.

@cherrymui
Contributor

@cherrymui cherrymui commented May 14, 2021

If asmcgocall is called by goroutine 387 in syscall_sysvicall6, then it does look weird, both the stack switch and the G status. Maybe it is called from somewhere else?

@cherrymui
Contributor

@cherrymui cherrymui commented May 14, 2021

If it is alignment, I would expect it to fail more consistently, instead of very rarely. Maybe it is something else. Maybe an OS bug...

@cherrymui
Contributor

@cherrymui cherrymui commented May 14, 2021

Both OpenBSD failures have a stack like this:

goroutine 5155 [runnable]:
syscall.syscall(0x80b4a10, 0xc, 0x0, 0x0)
	/tmp/workdir/go/src/runtime/sys_openbsd3.go:22 +0x20
syscall.Close(0xc)
	/tmp/workdir/go/src/syscall/zsyscall_openbsd_386.go:513 +0x39
syscall.forkExec({0x893a4ff0, 0x16}, {0x7ac92c40, 0xe, 0xe}, 0x7b038bb0)
	/tmp/workdir/go/src/syscall/exec_unix.go:220 +0x3f3
syscall.StartProcess(...)
	/tmp/workdir/go/src/syscall/exec_unix.go:264
os.startProcess({0x893a4ff0, 0x16}, {0x7ac92c40, 0xe, 0xe}, 0x7b038c74)
	/tmp/workdir/go/src/os/exec_posix.go:55 +0x256
os.StartProcess({0x893a4ff0, 0x16}, {0x7ac92c40, 0xe, 0xe}, 0x7b038c74)
	/tmp/workdir/go/src/os/exec.go:106 +0x57
os/exec.(*Cmd).Start(0x7e042160)
	/tmp/workdir/go/src/os/exec/exec.go:422 +0x588
os/exec.(*Cmd).Run(0x7e042160)
	/tmp/workdir/go/src/os/exec/exec.go:338 +0x1b

It might be related to #34988.

@bcmills
Member Author

@bcmills bcmills commented May 14, 2021

That's a good point, and a significant difference compared to the Solaris failures.
