runtime: frequent SIGSEGV in clock_gettime_trampoline on openbsd-386 #49532

Closed
bcmills opened this issue Nov 11, 2021 · 16 comments
Labels
NeedsInvestigation, OS-OpenBSD
Comments

@bcmills
Member

bcmills commented Nov 11, 2021

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0xf1 pc=0x80a3000]

runtime stack:
runtime.throw({0x816ebe4, 0x2a})
	/tmp/workdir/go/src/runtime/panic.go:992 +0x64
runtime.sigpanic()
	/tmp/workdir/go/src/runtime/signal_unix.go:781 +0x22b
runtime.clock_gettime_trampoline()
	/tmp/workdir/go/src/runtime/sys_openbsd_386.s:524 +0x20

goroutine 8 [running]:
runtime.systemstack_switch()
	/tmp/workdir/go/src/runtime/asm_386.s:337 fp=0x72c2b768 sp=0x72c2b764 pc=0x80a0ba0
runtime.libcCall(0x80a2fe0, 0x72c2b794)
	/tmp/workdir/go/src/runtime/sys_libc.go:48 +0x5a fp=0x72c2b77c sp=0x72c2b768 pc=0x809386a
runtime.nanotime1()
	/tmp/workdir/go/src/runtime/sys_openbsd2.go:161 +0x51 fp=0x72c2b7a0 sp=0x72c2b77c pc=0x8093d71
runtime.nanotime(...)
	/tmp/workdir/go/src/runtime/time_nofake.go:19
runtime.gcBgMarkWorker()
	/tmp/workdir/go/src/runtime/mgc.go:1245 +0x13d fp=0x72c2b7f0 sp=0x72c2b7a0 pc=0x805e65d
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_386.s:1311 +0x1 fp=0x72c2b7f4 sp=0x72c2b7f0 pc=0x80a1ee1
created by runtime.gcBgMarkStartWorkers
	/tmp/workdir/go/src/runtime/mgc.go:1122 +0x1f

greplogs --dashboard -md -l -e \(\?ms\)\\Aopenbsd-.\*clock_gettime_trampoline

2021-11-11T16:17:21-666fc17/openbsd-386-68
2021-11-11T04:02:33-3949faf/openbsd-386-68
2021-11-10T21:53:03-229b909/openbsd-386-68
2021-11-10T17:15:54-8a3be15/openbsd-386-68
2021-11-10T05:08:25-17980df/openbsd-386-68
2021-11-10T02:26:41-02d7eab/openbsd-386-68
2021-11-10T00:45:37-ec86bb5/openbsd-386-68
2021-11-09T21:58:03-805b4d5/openbsd-386-68
2021-11-09T00:08:09-bee0c73/openbsd-386-68
2021-11-08T17:46:34-2e210b4/openbsd-386-68
2021-11-06T19:41:14-cfb3dc7/openbsd-386-68
2021-11-06T16:43:43-3544082/openbsd-386-68
2021-11-05T22:55:56-e83a204/openbsd-386-68
2021-11-05T21:34:10-bb53fd7/openbsd-386-68
2021-11-05T21:28:34-90462df/openbsd-386-68
2021-11-05T21:27:34-7aed6dd/openbsd-386-68
2021-11-05T21:27:19-58ec925/openbsd-386-68
2021-11-05T00:52:04-1c4cfd8/openbsd-386-68
2021-11-04T23:56:29-0e5f287/openbsd-386-68
2021-11-04T21:41:49-156abe5/openbsd-386-68
2021-11-04T20:01:10-9b2dd1f/openbsd-386-68
2021-11-04T20:00:54-961aab2/openbsd-386-68
2021-10-06T22:28:59-d477ef3-b18ba59/openbsd-386-64
2021-09-30T20:30:12-1c35f2a-c035d82/openbsd-386-64
2021-09-29T20:06:10-1c35f2a-5930cff/openbsd-386-64
2021-09-27T22:22:35-ba6b94c-cd4d592/openbsd-386-64
2021-09-21T13:18:09-fe076c8-39e08c6/openbsd-386-64
2021-09-09T00:10:46-076821b-e30a090/openbsd-386-64
2021-08-30T02:40:46-3e0d083-56c3856/openbsd-386-64
2021-08-16T12:54:44-a55d515-c88e3ff/openbsd-386-64
2021-07-14T17:25:06-5061c41-60ddf42/openbsd-386-64
2021-06-21T20:53:11-d25f906-761edf7/openbsd-386-64
2021-06-08T20:19:02-689f4c7/openbsd-386-68
2021-06-08T20:19:02-689f4c7/openbsd-amd64-64
2021-06-03T16:41:39-4abb1e2-e0d029f/openbsd-386-64
2021-06-02T21:39:28-384c392-dd7ba3b/openbsd-386-64
2021-05-12T20:59:48-8287d5d-6db7480/openbsd-386-64
2021-05-10T23:42:56-79d39ff-5c48951/openbsd-386-64
2021-05-08T17:03:18-f05e912-b38b1b2/openbsd-386-64
2021-05-07T02:17:32-c0140e8-d2b0311/openbsd-386-64
2021-05-05T21:37:16-1949673-cf73f1a/openbsd-386-64

bcmills added the OS-OpenBSD, NeedsInvestigation, and release-blocker labels Nov 11, 2021
bcmills added this to the Go1.18 milestone Nov 11, 2021
@bcmills
Member Author

bcmills commented Nov 11, 2021

This looks like a fairly old bug, but the failure rate varies. It seems to have gone from “a couple per week” to “several per day” around 2021-11-04.
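One rough way to check the change in failure rate is to bucket the greplogs results by date. The following is a minimal sketch, not part of any Go tooling: the helper name `failuresPerDay` is hypothetical, and it assumes each result line starts with a `YYYY-MM-DD` date followed by `T`, as in the list above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// failuresPerDay counts greplogs result lines by their date prefix.
// It assumes (hypothetically) that every result looks like
// "2021-11-05T21:34:10-bb53fd7/openbsd-386-68"; other lines are skipped.
func failuresPerDay(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		line = strings.TrimSpace(line)
		// A result line has a 10-character date followed by 'T'.
		if len(line) < 11 || line[10] != 'T' {
			continue
		}
		counts[line[:10]]++
	}
	return counts
}

func main() {
	// A handful of entries copied from the greplogs output above.
	lines := []string{
		"2021-11-05T22:55:56-e83a204/openbsd-386-68",
		"2021-11-05T21:34:10-bb53fd7/openbsd-386-68",
		"2021-11-05T21:28:34-90462df/openbsd-386-68",
		"2021-11-04T23:56:29-0e5f287/openbsd-386-68",
		"2021-10-06T22:28:59-d477ef3-b18ba59/openbsd-386-64",
	}
	counts := failuresPerDay(lines)

	// Print the days in chronological order.
	days := make([]string, 0, len(counts))
	for day := range counts {
		days = append(days, day)
	}
	sort.Strings(days)
	for _, day := range days {
		fmt.Printf("%s %d\n", day, counts[day])
	}
}
```

Feeding in the full list instead of the sample would show how the per-day counts change around 2021-11-04.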

@bcmills
Member Author

bcmills commented Nov 11, 2021

I wonder if this is related to #47629.

@4a6f656c, any ideas?

@cherrymui
Member

cherrymui commented Nov 11, 2021

> /tmp/workdir/go/src/runtime/sys_openbsd_386.s:524

That line intentionally faults when the clock_gettime syscall fails. Why could the syscall fail? And why is it nondeterministic?

jeremyfaller added the okay-after-beta1 label Nov 12, 2021
@jeremyfaller
Contributor

jeremyfaller commented Nov 12, 2021

Investigations lead us to believe this is okay after Beta1 because this has been happening for some time, seems to be confined to this platform, and isn't a first-class port.

@prattmic
Member

prattmic commented Nov 16, 2021

Just a shot in the dark, but does anyone know if openbsd's libc requires any particular alignment on SP? runtime·clock_gettime_trampoline doesn't do any alignment, so I wonder if that is tripping up a check in libc or the kernel.

@cherrymui
Member

cherrymui commented Nov 16, 2021

I thought about that. I don't know. But we also don't align the SP in other syscall trampolines, whereas the failure always occurs here. Also, if that were the case, I'd expect it to fail much more deterministically.

@4a6f656c
Contributor

4a6f656c commented Nov 18, 2021

I would have to double check, but I do not believe there are alignment requirements beyond those normally required by the hardware architecture itself (it is also trivial to write test code to verify this) - clock_gettime(2) should rarely/never fail and is only documented as returning EINVAL (invalid clock which is highly unlikely given how we call in) and EFAULT (bad memory). It is also worth noting that there is timekeep memory that can avoid libc from having to make an actual syscall, although if this was involved we'd probably be seeing faults within libc itself.

I suspect the underlying issue is some form of memory corruption given we're seeing other failures like faulting on unlock:

https://build.golang.org/log/3e594f9d7a015c79ab62629337b14a2e21a86e9d

Additionally, the openbsd-386-68 builder had been passing fairly consistently during the development cycle and things seem to have turned red around early November (although in limited testing I've not been able to trigger failures myself).

@prattmic
Member

prattmic commented Nov 18, 2021

> I would have to double check, but I do not believe there are alignment requirements beyond those normally required by the hardware architecture itself (it is also trivial to write test code to verify this) - clock_gettime(2) should rarely/never fail and is only documented as returning EINVAL (invalid clock which is highly unlikely given how we call in) and EFAULT (bad memory). It is also worth noting that there is timekeep memory that can avoid libc from having to make an actual syscall, although if this was involved we'd probably be seeing faults within libc itself.

I also clicked through the source in libc and the kernel and did not notice anything particularly noteworthy. No explicit alignment checks, no odd error paths beyond the ones @4a6f656c mentioned. The only thing of note is that it appeared that on 386, libc may really need to make a system call. I didn't see the fast path getting set up for 386 like it does for amd64, though I may well have missed it.

@4a6f656c
Contributor

4a6f656c commented Nov 19, 2021

> I would have to double check, but I do not believe there are alignment requirements beyond those normally required by the hardware architecture itself (it is also trivial to write test code to verify this) - clock_gettime(2) should rarely/never fail and is only documented as returning EINVAL (invalid clock which is highly unlikely given how we call in) and EFAULT (bad memory). It is also worth noting that there is timekeep memory that can avoid libc from having to make an actual syscall, although if this was involved we'd probably be seeing faults within libc itself.

> I also clicked through the source in libc and the kernel and did not notice anything particularly noteworthy. No explicit alignment checks, no odd error paths beyond the ones @4a6f656c mentioned. The only thing of note is that it appeared that on 386, libc may really need to make a system call. I didn't see the fast path getting set up for 386 like it does for amd64, though I may well have missed it.

Ah, no you're correct - I neglected to recall that _tc_get_timecount is not set on OpenBSD i386, which means we're always calling the clock_gettime(2) syscall. That means that we're almost certainly failing due to EFAULT (ktrace would confirm if we can get a reproducible test case).

@cherrymui
Member

cherrymui commented Nov 23, 2021

It seems the failure only occurs on the openbsd-386-68 builder, not on the -70 or -70-n1 builders.

@cherrymui
Member

cherrymui commented Nov 24, 2021

It is also interesting that all failures seem to come from the cmd/dist binary, not anything else.

@gopherbot

gopherbot commented Dec 1, 2021

Change https://golang.org/cl/368334 mentions this issue: runtime: print errno on clock_gettime failure on OpenBSD

gopherbot pushed a commit that referenced this issue Dec 2, 2021
For #49532.

Change-Id: I5afc64c987f0519903128550a7dac3a0f5e592cf
Reviewed-on: https://go-review.googlesource.com/c/go/+/368334
Trust: Austin Clements <austin@google.com>
Run-TryBot: Austin Clements <austin@google.com>
Reviewed-by: Joel Sing <joel@sing.id.au>
Reviewed-by: Cherry Mui <cherryyz@google.com>
@aclements
Member

aclements commented Dec 2, 2021

CL 368334, which I just submitted, will change the message printed during this failure to a "clock_gettime failed" fatal error. New combined greplogs (not yet tested because there haven't been any new failures):

greplogs -dashboard -l -e "(?ms)\Aopenbsd-.*(clock_gettime_trampoline|fatal error: clock_gettime failed)"
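The combined pattern can be sanity-checked with Go's regexp package, which accepts the same `(?ms)` flags and `\A` anchor. This is only a sketch: the sample log strings are made up for illustration (real build logs are much longer), and it assumes, as the pattern itself does, that a log begins with the builder name.

```go
package main

import (
	"fmt"
	"regexp"
)

// failureRE is the combined pattern from the greplogs command above.
// (?s) lets "." cross newlines; \A anchors at the very start of the log.
var failureRE = regexp.MustCompile(
	`(?ms)\Aopenbsd-.*(clock_gettime_trampoline|fatal error: clock_gettime failed)`)

// matchesFailure reports whether a whole build log matches the pattern.
func matchesFailure(log string) bool {
	return failureRE.MatchString(log)
}

func main() {
	// Hypothetical, heavily shortened logs for illustration only.
	oldStyle := "openbsd-386-68\nfatal error: unexpected signal during runtime execution\nruntime.clock_gettime_trampoline()\n"
	newStyle := "openbsd-386-68\nfatal error: clock_gettime failed\n"
	otherOS := "linux-386\nruntime.clock_gettime_trampoline()\n"

	fmt.Println(matchesFailure(oldStyle))
	fmt.Println(matchesFailure(newStyle))
	fmt.Println(matchesFailure(otherOS))
}
```

Both the old trampoline fault and the new fatal error match, while a log from a non-OpenBSD builder does not, because `\A` ties the `openbsd-` prefix to the first byte of the log.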

@aclements
Member

aclements commented Dec 6, 2021

It's possible I made this worse. I haven't seen any clock_gettime failures since my CL went in, but we did get a new "stopm holding p" with, unfortunately, no useful backtrace, which we haven't seen before.

cherrymui removed the okay-after-beta1 label Dec 14, 2021
@cherrymui
Member

cherrymui commented Dec 15, 2021

Seems this hasn't occurred since @aclements's CL...

@ianlancetaylor
Contributor

ianlancetaylor commented Jan 29, 2022

For better or for worse, this is no longer happening, so closing.
