Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: frequent TestSegv failures since 2021-10-26 #49182

Closed
bcmills opened this issue Oct 27, 2021 · 31 comments
Closed

runtime: frequent TestSegv failures since 2021-10-26 #49182

bcmills opened this issue Oct 27, 2021 · 31 comments
Assignees
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Milestone

Comments

@bcmills bcmills added NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker labels Oct 27, 2021
@bcmills bcmills added this to the Go1.18 milestone Oct 27, 2021
@cuonglm
Copy link
Member

cuonglm commented Oct 27, 2021

I think since https://go-review.googlesource.com/c/go/+/339990

CL https://go-review.googlesource.com/c/go/+/339989 causes windows-arm64-10 builder fails.

@bcmills
Copy link
Member Author

bcmills commented Oct 27, 2021

Filed the windows-arm64-10 issue separately as #49188.

@gopherbot
Copy link

gopherbot commented Oct 27, 2021

Change https://golang.org/cl/359254 mentions this issue: runtime: disable TestSegv on darwin, illumos, solaris

gopherbot pushed a commit that referenced this issue Oct 28, 2021
CL 339990 made this test more strict, exposing pre-existing issues on
these OSes. Skip for now until they can be resolved.

Updates #49182

Change-Id: I3ac400dcd21b801bf4ab4eeb630e23b5c66ba563
Reviewed-on: https://go-review.googlesource.com/c/go/+/359254
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
TryBot-Result: Go Bot <gobot@golang.org>
@bcmills
Copy link
Member Author

bcmills commented Oct 28, 2021

Hmm, looks like relaxing the check didn't completely solve the problem:

greplogs --dashboard -md -l -e 'FAIL: TestSegv' --since=2021-10-28T16:54:00

2021-10-28T16:54:58-6bd0e7f/linux-arm-aws

@bcmills
Copy link
Member Author

bcmills commented Nov 2, 2021

linux-arm-aws is the only one that appears to still be flaky:

greplogs --dashboard -md -l -e '(?ms)TestSegv.*unexpectedly saw "runtime: "'

2021-11-02T20:47:30-b246873/linux-arm-aws
2021-11-02T17:01:14-62b29b0/linux-arm-aws
2021-10-28T16:54:58-6bd0e7f/linux-arm-aws

@aclements
Copy link
Member

aclements commented Nov 3, 2021

This looks like the signal is landing while we're in the VDSO, presumably executing a kernel-provided atomic for casgstatus.

@prattmic, you looked at and fixed a few similar issues recently (e.g., 86f6bf1). Any insights on this one?

@aclements
Copy link
Member

aclements commented Nov 3, 2021

Yeah, the "unknown PC" is in the kernel-provided CAS implementation. The problem may just be that we're entirely missing the vdsoSP protection around the VDSO cas call (presumably the same thing applies to memory_barrier<>, too).

@prattmic
Copy link
Member

prattmic commented Nov 3, 2021

Yes, I believe that is the case. I'll get those covered as well.

@prattmic prattmic self-assigned this Nov 8, 2021
@gopherbot
Copy link

gopherbot commented Nov 9, 2021

Change https://golang.org/cl/362796 mentions this issue: runtime/internal/atomic: treat ARM kernel helpers as VDSO

@gopherbot
Copy link

gopherbot commented Nov 9, 2021

Change https://golang.org/cl/362795 mentions this issue: runtime: refactor ARM VDSO call setup to helper

@gopherbot
Copy link

gopherbot commented Nov 10, 2021

Change https://golang.org/cl/362977 mentions this issue: runtime: start ARM atomic kernel helper traceback in caller

@jeremyfaller jeremyfaller added the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Nov 12, 2021
@jeremyfaller
Copy link
Contributor

jeremyfaller commented Nov 12, 2021

Mostly fixed. Not a Beta1 blocker as it's not a new breakage, just a stricter test.

@bcmills
Copy link
Member Author

bcmills commented Nov 17, 2021

Looks like this is fixed on ARM and skipped on amd64, but still occasionally failing on some of the more exotic architectures.
(It's not clear to me whether the remaining failures are arch-specific bugs.)

greplogs --dashboard -md -l -e 'FAIL: TestSegv ' --since=2021-11-13

2021-11-17T04:55:12-1d004fa/linux-mips64le-mengzhuo
2021-11-13T00:23:16-39bc666/linux-riscv64-unmatched

@bcmills bcmills reopened this Nov 17, 2021
@bcmills
Copy link
Member Author

bcmills commented Nov 17, 2021

Looks like both of those failures are in the SegvInCgo subtest.

@bcmills
Copy link
Member Author

bcmills commented Dec 1, 2021

https://storage.googleapis.com/go-build-log/3c4f5e79/linux-riscv64-jsing_db2bc678.log (A linux-riscv64-jsing SlowBot), also failed in SegvInCgo.

@prattmic
Copy link
Member

prattmic commented Dec 3, 2021

The netbsd and openbsd failures there appear to be #49209.

@mknyszek
Copy link
Contributor

mknyszek commented Dec 6, 2021

Excluding the NetBSD and OpenBSD failures, all the ones that are left are in TestSegvInCgo like the two @bcmills posted about before. I don't think this is the same problem as the one @prattmic fixed earlier, because even an arm64 builder is failing there.

I'll poke at this one a bit.

@mknyszek
Copy link
Contributor

mknyszek commented Dec 6, 2021

In all these cases it looks like the signal is landing in some part of the cgocall path (not all that surprising), though I'm not sure I understand why gentraceback has issues producing a traceback in these cases.

@mknyszek
Copy link
Contributor

mknyszek commented Dec 6, 2021

oh oh, ok, actually for the linux/arm64 failure, the PC looks like the PC for the branch that @prattmic added in https://go.dev/cl/362977 (i.e. the top bits of the PC (in a 48-bit address space) are 0xffff). So there's some kind of other VDSO (?) call on at least this platform where there's more going on.

The other platforms' failures don't seem to look like this, however.

@prattmic
Copy link
Member

prattmic commented Dec 6, 2021

The arm64 failure could occur even with 3634594 if one of the kernel helpers made a CALL (thus setting LR). I thought I verified that none of the helpers we used did this, but I guess I should go back and double-check on the specific kernel version on the builders.

Beyond this, it looks like we have failures on linux-mips64le, linux-riscv64, and linux-s390x.

@mknyszek
Copy link
Contributor

mknyszek commented Dec 7, 2021

I'll note that 3634594 only applies to 32-bit ARM, AFAICT, so it shouldn't be taking that path. It could still be that we have some VDSO call that makes frames.

The RISC-V failures are kind of interesting because there's no hexdump of the stack -- I'm not really sure why. Notably, on cgoSigtramp, the LR is 0 (cgoSigtramp is a thin wrapper around sigtramp that just jumps and links a zero LR.

@cherrymui cherrymui removed the okay-after-beta1 Used by release team to mark a release-blocker issue as okay to resolve either before or after beta1 label Dec 14, 2021
@prattmic
Copy link
Member

prattmic commented Dec 16, 2021

@prattmic
Copy link
Member

prattmic commented Dec 16, 2021

In the linux-386 failure, PC=0xf7f3db09 seems likely to be in the VDSO.

The VDSO load address is randomized, but this seems in approximately the correct range. Additionally, page offset 0xb09 is a valid instruction just after a syscall, which seems like a likely place to get a signal delivered.

   0xf7fd1b00 <__kernel_vsyscall>:      push   %ecx
   0xf7fd1b01 <__kernel_vsyscall+1>:    push   %edx
   0xf7fd1b02 <__kernel_vsyscall+2>:    push   %ebp
   0xf7fd1b03 <__kernel_vsyscall+3>:    mov    %esp,%ebp
   0xf7fd1b05 <__kernel_vsyscall+5>:    sysenter 
   0xf7fd1b07 <__kernel_vsyscall+7>:    int    $0x80
   0xf7fd1b09 <__kernel_vsyscall+9>:    pop    %ebp
   0xf7fd1b0a <__kernel_vsyscall+10>:   pop    %edx
   0xf7fd1b0b <__kernel_vsyscall+11>:   pop    %ecx
   0xf7fd1b0c <__kernel_vsyscall+12>:   ret  

Go doesn't use __kernel_vsyscall directly, but libc does. Maybe we get there somehow in a context where the signal handler doesn't realize this is cgo?

Edit: Note that assembly dump is from an unrelated runtime.test process that did not fail.

@prattmic
Copy link
Member

prattmic commented Dec 16, 2021

e.g., __kernel_vsyscall is reachable from cgo_start_thread, which is called via asmcgocall.

It looks like the signal handler won't properly detect raw asmcgocall calls as cgo calls because sighandler checks g.m.incgo, which is only set in cgocall.

@cherrymui does this sound correct to you, or did I miss some detection mechanism?

@cherrymui
Copy link
Member

cherrymui commented Dec 16, 2021

@prattmic I think that is right. Direct asmcgocall is used in limited situations, but I guess it is still possible to receive an external signal.

VDSO calls made from Go would have m.vdsoPC and m.vdsoSP set, so we can tell. But if it is made from non-Go code then we can't tell.

@cherrymui
Copy link
Member

cherrymui commented Dec 16, 2021

I still think it is too strong to require that we can unwind the stack any time we receive an external signal. It think it is important to print SIGSEGV, but the unwinding could be best effort. Maybe we could try unwinding the curg stack if the signal is landed on g0 (even if incgo is not set) so it could be somewhat helpful.

@prattmic
Copy link
Member

prattmic commented Jan 7, 2022

I still think it is too strong to require that we can unwind the stack any time we receive an external signal.

I tend to agree that as a general feature getting correct tracebacks when an external SIGSEGV lands in an esoteric location is not important to users. I've been pushing on these bugs primarily to support testing. To ensure that the common cases work and don't regress, I want to be able to write a straightforward test that tracebacks work. It is very difficult to write such a test if there are lots of edge cases that don't work properly.

For now, I'll skip this part of the test on 386, and I've filed #50504 as a low priority issue for this asmcgocall problem.

@gopherbot
Copy link

gopherbot commented Jan 7, 2022

Change https://golang.org/cl/376656 mentions this issue: runtime: skip TestSegv traceback check on 386

@bcmills
Copy link
Member Author

bcmills commented Jan 14, 2022

The linux-mips64le-mengzhuo failures are still occurring on the builder. It's not obvious to me whether that has the same root cause as the 386 failure mode, so I filed it separately as #50605.

jproberts pushed a commit to jproberts/go that referenced this issue Jun 21, 2022
The VDSO (__kernel_vsyscall) is reachable via
asmcgocall(cgo_start_thread) on linux-386, which causes traceback to
throw.

Fixes golang#49182.
For golang#50504.

Change-Id: Idb78cb8de752203ce0ed63c2dbd2d12847338688
Reviewed-on: https://go-review.googlesource.com/c/go/+/376656
Reviewed-by: Cherry Mui <cherryyz@google.com>
Trust: Michael Pratt <mpratt@google.com>
Run-TryBot: Michael Pratt <mpratt@google.com>
@prattmic prattmic self-assigned this Jun 24, 2022
gopherbot pushed a commit that referenced this issue Aug 17, 2022
We have a very complex process to make VDSO calls on ARM. Create a
wrapper helper function which reduces duplication and allows for
additional calls from other packages.

vdsoCall has a few differences from the original code in
walltime/nanotime:

* It does not use R0-R3, as they are passed through as arguments to fn.
* It does not save g if g.m.gsignal.stack.lo is zero. This may occur if
it called at startup on g0 between assigning g0.m.gsignal and setting
its stack.

For #49182

Change-Id: I51aca514b4835b71142011341d2f09125334d30f
Reviewed-on: https://go-review.googlesource.com/c/go/+/362795
Run-TryBot: Michael Pratt <mpratt@google.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Cherry Mui <cherryyz@google.com>
TryBot-Result: Gopher Robot <gobot@golang.org>
@gopherbot
Copy link

gopherbot commented Sep 20, 2022

Change https://go.dev/cl/431975 mentions this issue: runtime: enable TestSegv on darwin, illumos, solaris

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. release-blocker
Projects
None yet
Development

No branches or pull requests

8 participants