runtime: frequent SIGSEGV in clock_gettime_trampoline on openbsd-386 #49532

Open
bcmills opened this issue Nov 11, 2021 · 11 comments

Comments

Member

@bcmills bcmills commented Nov 11, 2021

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0xf1 pc=0x80a3000]

runtime stack:
runtime.throw({0x816ebe4, 0x2a})
	/tmp/workdir/go/src/runtime/panic.go:992 +0x64
runtime.sigpanic()
	/tmp/workdir/go/src/runtime/signal_unix.go:781 +0x22b
runtime.clock_gettime_trampoline()
	/tmp/workdir/go/src/runtime/sys_openbsd_386.s:524 +0x20

goroutine 8 [running]:
runtime.systemstack_switch()
	/tmp/workdir/go/src/runtime/asm_386.s:337 fp=0x72c2b768 sp=0x72c2b764 pc=0x80a0ba0
runtime.libcCall(0x80a2fe0, 0x72c2b794)
	/tmp/workdir/go/src/runtime/sys_libc.go:48 +0x5a fp=0x72c2b77c sp=0x72c2b768 pc=0x809386a
runtime.nanotime1()
	/tmp/workdir/go/src/runtime/sys_openbsd2.go:161 +0x51 fp=0x72c2b7a0 sp=0x72c2b77c pc=0x8093d71
runtime.nanotime(...)
	/tmp/workdir/go/src/runtime/time_nofake.go:19
runtime.gcBgMarkWorker()
	/tmp/workdir/go/src/runtime/mgc.go:1245 +0x13d fp=0x72c2b7f0 sp=0x72c2b7a0 pc=0x805e65d
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_386.s:1311 +0x1 fp=0x72c2b7f4 sp=0x72c2b7f0 pc=0x80a1ee1
created by runtime.gcBgMarkStartWorkers
	/tmp/workdir/go/src/runtime/mgc.go:1122 +0x1f

greplogs --dashboard -md -l -e \(\?ms\)\\Aopenbsd-.\*clock_gettime_trampoline

2021-11-11T16:17:21-666fc17/openbsd-386-68
2021-11-11T04:02:33-3949faf/openbsd-386-68
2021-11-10T21:53:03-229b909/openbsd-386-68
2021-11-10T17:15:54-8a3be15/openbsd-386-68
2021-11-10T05:08:25-17980df/openbsd-386-68
2021-11-10T02:26:41-02d7eab/openbsd-386-68
2021-11-10T00:45:37-ec86bb5/openbsd-386-68
2021-11-09T21:58:03-805b4d5/openbsd-386-68
2021-11-09T00:08:09-bee0c73/openbsd-386-68
2021-11-08T17:46:34-2e210b4/openbsd-386-68
2021-11-06T19:41:14-cfb3dc7/openbsd-386-68
2021-11-06T16:43:43-3544082/openbsd-386-68
2021-11-05T22:55:56-e83a204/openbsd-386-68
2021-11-05T21:34:10-bb53fd7/openbsd-386-68
2021-11-05T21:28:34-90462df/openbsd-386-68
2021-11-05T21:27:34-7aed6dd/openbsd-386-68
2021-11-05T21:27:19-58ec925/openbsd-386-68
2021-11-05T00:52:04-1c4cfd8/openbsd-386-68
2021-11-04T23:56:29-0e5f287/openbsd-386-68
2021-11-04T21:41:49-156abe5/openbsd-386-68
2021-11-04T20:01:10-9b2dd1f/openbsd-386-68
2021-11-04T20:00:54-961aab2/openbsd-386-68
2021-10-06T22:28:59-d477ef3-b18ba59/openbsd-386-64
2021-09-30T20:30:12-1c35f2a-c035d82/openbsd-386-64
2021-09-29T20:06:10-1c35f2a-5930cff/openbsd-386-64
2021-09-27T22:22:35-ba6b94c-cd4d592/openbsd-386-64
2021-09-21T13:18:09-fe076c8-39e08c6/openbsd-386-64
2021-09-09T00:10:46-076821b-e30a090/openbsd-386-64
2021-08-30T02:40:46-3e0d083-56c3856/openbsd-386-64
2021-08-16T12:54:44-a55d515-c88e3ff/openbsd-386-64
2021-07-14T17:25:06-5061c41-60ddf42/openbsd-386-64
2021-06-21T20:53:11-d25f906-761edf7/openbsd-386-64
2021-06-08T20:19:02-689f4c7/openbsd-386-68
2021-06-08T20:19:02-689f4c7/openbsd-amd64-64
2021-06-03T16:41:39-4abb1e2-e0d029f/openbsd-386-64
2021-06-02T21:39:28-384c392-dd7ba3b/openbsd-386-64
2021-05-12T20:59:48-8287d5d-6db7480/openbsd-386-64
2021-05-10T23:42:56-79d39ff-5c48951/openbsd-386-64
2021-05-08T17:03:18-f05e912-b38b1b2/openbsd-386-64
2021-05-07T02:17:32-c0140e8-d2b0311/openbsd-386-64
2021-05-05T21:37:16-1949673-cf73f1a/openbsd-386-64

Member Author

@bcmills bcmills commented Nov 11, 2021

This looks like a fairly old bug, but the failure rate varies. It seems to have gone from “a couple per week” to “several per day” around 2021-11-04.
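The change in rate can be read off the greplogs output by counting log names per calendar day. The following is a minimal sketch, not part of greplogs itself: the helper name `failuresPerDay` is hypothetical, and the input is a subset of the log names listed above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// failuresPerDay counts log names per calendar date.
// Each name looks like "2021-11-05T21:34:10-bb53fd7/openbsd-386-68";
// the date is everything before the first 'T'.
// (Hypothetical helper for illustration only.)
func failuresPerDay(entries []string) map[string]int {
	counts := make(map[string]int)
	for _, e := range entries {
		i := strings.IndexByte(e, 'T')
		if i < 0 {
			continue // not a timestamped log name; skip it
		}
		counts[e[:i]]++
	}
	return counts
}

func main() {
	// A subset of the log names from the greplogs output above.
	entries := []string{
		"2021-11-05T22:55:56-e83a204/openbsd-386-68",
		"2021-11-05T21:34:10-bb53fd7/openbsd-386-68",
		"2021-11-05T21:28:34-90462df/openbsd-386-68",
		"2021-11-04T23:56:29-0e5f287/openbsd-386-68",
		"2021-11-04T21:41:49-156abe5/openbsd-386-68",
		"2021-10-06T22:28:59-d477ef3-b18ba59/openbsd-386-64",
		"2021-09-30T20:30:12-1c35f2a-c035d82/openbsd-386-64",
	}

	counts := failuresPerDay(entries)

	// Print the dates in chronological order.
	days := make([]string, 0, len(counts))
	for d := range counts {
		days = append(days, d)
	}
	sort.Strings(days)
	for _, d := range days {
		fmt.Printf("%s  %d\n", d, counts[d])
	}
}
```

Feeding it the full list rather than this subset would give the per-day counts behind the estimate above.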

Member Author

@bcmills bcmills commented Nov 11, 2021

I wonder if this is related to #47629.

@4a6f656c, any ideas?

Contributor

@cherrymui cherrymui commented Nov 11, 2021

> /tmp/workdir/go/src/runtime/sys_openbsd_386.s:524

That line intentionally faults when the clock_gettime syscall fails. Why would the syscall fail? And why is the failure nondeterministic?
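For readers unfamiliar with the pattern, here is a rough Go model of "fault on failure". It is only a sketch: the real trampoline is 386 assembly, and the names `clockFunc`, `nanotimeOrCrash`, and `errDeliberateFault` are hypothetical. A `panic` stands in for the deliberate bad memory access, which presumably is what produces the `addr=0xf1` in the report above.

```go
package main

import (
	"errors"
	"fmt"
)

// clockFunc models a libc-style clock_gettime: it stores the time
// through out and returns 0 on success or -1 on failure.
// (Hypothetical type for illustration only.)
type clockFunc func(clockID int32, out *int64) int32

// errDeliberateFault stands in for the intentional crash.
var errDeliberateFault = errors.New("deliberate fault: clock_gettime failed")

// nanotimeOrCrash mirrors the shape of the trampoline: call the clock
// and treat any failure as fatal instead of returning an error.
func nanotimeOrCrash(clock clockFunc) int64 {
	var ns int64
	if clock(0, &ns) == -1 {
		// The assembly faults on purpose at this point; the sketch panics.
		panic(errDeliberateFault)
	}
	return ns
}

func main() {
	working := func(_ int32, out *int64) int32 { *out = 42; return 0 }
	fmt.Println("time:", nanotimeOrCrash(working))

	failing := func(int32, *int64) int32 { return -1 }
	defer func() { fmt.Println("recovered:", recover()) }()
	nanotimeOrCrash(failing)
}
```

The point of the pattern is that the runtime has no useful way to continue without a clock, so a failed call is turned into an immediate, recognizable crash at a known line.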

Contributor

@jeremyfaller jeremyfaller commented Nov 12, 2021

Our investigation leads us to believe this is okay to address after Beta1: it has been happening for some time, it seems to be confined to this platform, and the platform isn't a first-class port.

Member

@prattmic prattmic commented Nov 16, 2021

Just a shot in the dark, but does anyone know if openbsd's libc requires any particular alignment on SP? runtime·clock_gettime_trampoline doesn't do any alignment, so I wonder if that is tripping up a check in libc or the kernel.

Contributor

@cherrymui cherrymui commented Nov 16, 2021

I thought about that. I don't know. But we also don't align the SP in other syscall trampolines, yet the failure always occurs in this one. Also, if that were the case, I'd expect it to fail much more deterministically.

Contributor

@4a6f656c 4a6f656c commented Nov 18, 2021

I would have to double check, but I do not believe there are alignment requirements beyond those normally required by the hardware architecture itself (it is also trivial to write test code to verify this). clock_gettime(2) should rarely, if ever, fail and is only documented as returning EINVAL (invalid clock, which is highly unlikely given how we call in) and EFAULT (bad memory). It is also worth noting that there is timekeep memory that can save libc from having to make an actual syscall, although if this were involved we'd probably be seeing faults within libc itself.

I suspect the underlying issue is some form of memory corruption given we're seeing other failures like faulting on unlock:

https://build.golang.org/log/3e594f9d7a015c79ab62629337b14a2e21a86e9d

Additionally, the openbsd-386-68 builder had been passing fairly consistently during the development cycle and things seem to have turned red around early November (although in limited testing I've not been able to trigger failures myself).

Member

@prattmic prattmic commented Nov 18, 2021

> I would have to double check, but I do not believe there are alignment requirements beyond those normally required by the hardware architecture itself (it is also trivial to write test code to verify this) - clock_gettime(2) should rarely/never fail and is only documented as returning EINVAL (invalid clock which is highly unlikely given how we call in) and EFAULT (bad memory). It is also worth noting that there is timekeep memory that can avoid libc from having to make an actual syscall, although if this was involved we'd probably be seeing faults within libc itself.

I also clicked through the source in libc and the kernel and did not notice anything particularly noteworthy. No explicit alignment checks, no odd error paths beyond the ones @4a6f656c mentioned. The only thing of note is that it appeared that on 386, libc may really need to make a system call. I didn't see the fast path getting set up for 386 like it does for amd64, though I may well have missed it.

Contributor

@4a6f656c 4a6f656c commented Nov 19, 2021

> > I would have to double check, but I do not believe there are alignment requirements beyond those normally required by the hardware architecture itself (it is also trivial to write test code to verify this) - clock_gettime(2) should rarely/never fail and is only documented as returning EINVAL (invalid clock which is highly unlikely given how we call in) and EFAULT (bad memory). It is also worth noting that there is timekeep memory that can avoid libc from having to make an actual syscall, although if this was involved we'd probably be seeing faults within libc itself.
>
> I also clicked through the source in libc and the kernel and did not notice anything particularly noteworthy. No explicit alignment checks, no odd error paths beyond the ones @4a6f656c mentioned. The only thing of note is that it appeared that on 386, libc may really need to make a system call. I didn't see the fast path getting set up for 386 like it does for amd64, though I may well have missed it.

Ah, no, you're correct. I neglected to recall that _tc_get_timecount is not set on OpenBSD i386, which means we're always calling the clock_gettime(2) syscall. That means we're almost certainly failing due to EFAULT (ktrace would confirm, if we can get a reproducible test case).

Contributor

@cherrymui cherrymui commented Nov 23, 2021

It seems the failure only occurs on the openbsd-386-68 builder, not on the -70 or -70-n1 builders.

Contributor

@cherrymui cherrymui commented Nov 24, 2021

It is also interesting that all failures seem to come from the cmd/dist binary, not anything else.
