runtime,sync: SIGSEGV in runtime.checkptrBase via (*Pool).pin #45977

Closed
bcmills opened this issue May 5, 2021 · 9 comments

Comments

bcmills (Member) commented May 5, 2021

2021-05-04T20:50:35-d19e549/freebsd-amd64-race

fatal error: unexpected signal during runtime execution
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x45a399]

goroutine 26 [running]:
runtime.throw({0x7d61ce, 0x2a})
	/tmp/workdir/go/src/runtime/panic.go:1198 +0x74 fp=0xc000061418 sp=0xc0000613e8 pc=0x47e9b4
runtime.sigpanic()
	/tmp/workdir/go/src/runtime/signal_unix.go:719 +0x4a5 fp=0xc000061478 sp=0xc000061418 pc=0x498545
runtime.spanOf(...)
	/tmp/workdir/go/src/runtime/mheap.go:652
runtime.findObject(0xc000234080, 0x0, 0x0)
	/tmp/workdir/go/src/runtime/mbitmap.go:385 +0x39 fp=0xc0000614b0 sp=0xc000061478 pc=0x45a399
runtime.checkptrBase(0xc000234080)
	/tmp/workdir/go/src/runtime/checkptr.go:68 +0x65 fp=0xc0000614f0 sp=0xc0000614b0 pc=0x44b805
runtime.checkptrArithmetic(0xc000234080, {0xc000061560, 0x1, 0x1})
	/tmp/workdir/go/src/runtime/checkptr.go:32 +0x45 fp=0xc000061520 sp=0xc0000614f0 pc=0x44b705
sync.indexLocal(...)
	/tmp/workdir/go/src/sync/pool.go:277
sync.(*Pool).pin(0x999e60)
	/tmp/workdir/go/src/sync/pool.go:204 +0x91 fp=0xc000061578 sp=0xc000061520 pc=0x4c5f91
sync.(*Pool).Get(0x999e60)
	/tmp/workdir/go/src/sync/pool.go:128 +0x45 fp=0xc0000615d0 sp=0xc000061578 pc=0x4c5905
fmt.newPrinter()
	/tmp/workdir/go/src/fmt/print.go:137 +0x3f fp=0xc000061608 sp=0xc0000615d0 pc=0x53435f
fmt.Sprintf({0x7c9cc7, 0x5}, {0xc0000616a8, 0x1, 0x1})
	/tmp/workdir/go/src/fmt/print.go:218 +0x34 fp=0xc000061660 sp=0xc000061608 pc=0x534dd4
testing.fmtDuration(0x6e8dfe)
	/tmp/workdir/go/src/testing/testing.go:628 +0xd8 fp=0xc0000616c8 sp=0xc000061660 pc=0x55f998
testing.tRunner.func1.2({0x78bec0, 0x999090})
	/tmp/workdir/go/src/testing/testing.go:1186 +0x236 fp=0xc0000617b0 sp=0xc0000616c8 pc=0x5629b6
testing.tRunner.func1(0xc0001fbba0)
	/tmp/workdir/go/src/testing/testing.go:1195 +0x41f fp=0xc000061938 sp=0xc0000617b0 pc=0x5622df
runtime.call16(0x0, 0x7dbc80, 0xc000061fa0, 0x8, 0x8, 0x8, 0xc000061988)
	/tmp/workdir/go/src/runtime/asm_amd64.s:625 +0x49 fp=0xc000061958 sp=0xc000061938 pc=0x4b4589
panic({0x78bec0, 0x999090})
	/tmp/workdir/go/src/runtime/panic.go:1052 +0x2fe fp=0xc000061a30 sp=0xc000061958 pc=0x47e35e
runtime.panicmem(...)
	/tmp/workdir/go/src/runtime/panic.go:221
runtime.sigpanic()
	/tmp/workdir/go/src/runtime/signal_unix.go:735 +0x40e fp=0xc000061a90 sp=0xc000061a30 pc=0x4984ae
runtime.spanOf(...)
	/tmp/workdir/go/src/runtime/mheap.go:652
runtime.findObject(0x9a36e0, 0x0, 0x0)
	/tmp/workdir/go/src/runtime/mbitmap.go:385 +0x39 fp=0xc000061ac8 sp=0xc000061a90 pc=0x45a399
runtime.checkptrBase(0x9a36e0)
	/tmp/workdir/go/src/runtime/checkptr.go:68 +0x65 fp=0xc000061b08 sp=0xc000061ac8 pc=0x44b805
runtime.checkptrAlignment(0x9a36e0, 0x79d9c0, 0x1)
	/tmp/workdir/go/src/runtime/checkptr.go:19 +0x6c fp=0xc000061b38 sp=0xc000061b08 pc=0x44b62c
sync/atomic.(*Value).Load(0x9a36e0)
	/tmp/workdir/go/src/sync/atomic/value.go:29 +0x5c fp=0xc000061b70 sp=0xc000061b38 pc=0x4b9c3c
internal/testlog.Logger(...)
	/tmp/workdir/go/src/internal/testlog/log.go:43
internal/testlog.Getenv({0x7da73f, 0x5})
	/tmp/workdir/go/src/internal/testlog/log.go:52 +0x3f fp=0xc000061ba0 sp=0xc000061b70 pc=0x4ec8df
os.Getenv({0x7da73f, 0x5})
	/tmp/workdir/go/src/os/env.go:102 +0x47 fp=0xc000061be8 sp=0xc000061ba0 pc=0x4f1de7
internal/testenv.GoToolPath({0x836d38, 0xc0001fbba0})
	/tmp/workdir/go/src/internal/testenv/testenv.go:92 +0x1cc fp=0xc000061c68 sp=0xc000061be8 pc=0x75e10c
cmd/vet_test.vetCmd(0xc0001fbba0, {0x7cf0f8, 0x16}, {0x7ca8ef, 0x8})
	/tmp/workdir/go/src/cmd/vet/vet_test.go:72 +0x65 fp=0xc000061d58 sp=0xc000061c68 pc=0x75ef05
cmd/vet_test.TestVet.func1(0xc0001fbba0)
	/tmp/workdir/go/src/cmd/vet/vet_test.go:113 +0xfe fp=0xc000061ed0 sp=0xc000061d58 pc=0x75f43e
testing.tRunner(0xc0001fbba0, 0xc0001be7b0)
	/tmp/workdir/go/src/testing/testing.go:1242 +0x22e fp=0xc000061fd0 sp=0xc000061ed0 pc=0x561d8e
runtime.goexit()
	/tmp/workdir/go/src/runtime/asm_amd64.s:1581 +0x1 fp=0xc000061fd8 sp=0xc000061fd0 pc=0x4b62e1
created by testing.(*T).Run
	/tmp/workdir/go/src/testing/testing.go:1289 +0x78e
bcmills (Member, Author) commented May 5, 2021

Marking as release-blocker until we better understand whether this is a 1.17 regression.

(Certainly there have been many runtime changes this cycle; it's not clear to me whether or how they interact with sync.Pool.)

CC @mknyszek @mdempsky

heschi (Contributor) commented May 13, 2021

Weekly check-in: this needs to be investigated before beta 1.

mdempsky (Member) commented May 13, 2021

There are actually two spanOf panics in the stack trace. The original panic is in sync/atomic.(*Value).Load, via spanOf (within checkptr); the panic handler then ends up calling sync.(*Pool).Get, which panics again in spanOf (within checkptr).

This makes me think it's a runtime issue: the spans table has somehow been corrupted. But I'm sure @mknyszek has a better idea of what might be going wrong here.

mknyszek (Contributor) commented May 14, 2021

These spanOf failures are interesting. They both fall on accessing the L2 map, which, as the comment in spanOf indicates, should always be non-nil on amd64. Based on the panic, though, it does appear to be nil. I don't think it's a compiler issue: although spanOf is inlined, it's inlined into findObject, and if that were broken somehow, we'd be seeing much bigger, louder failures. I'm inclined to think there is some degree of memory corruption happening here, as @mdempsky suggests, and that somehow a zero value is written into mheap_.arenas. However, unless we get more data, I'm not entirely sure how to proceed.

Side note: technically, the L2 map is not actually guaranteed to be non-nil by any part of initialization. In theory, checkptr could fail this way if no arenas had been created at all. However, any stack allocation forces an arena to be created, and we always allocate at least one goroutine stack during initialization (not to mention the heap allocations we do before the GC is turned on), so in practice it is effectively guaranteed. Perhaps we should make that a more explicit part of the initialization procedure.

mknyszek (Contributor) commented May 18, 2021

I started looking at dashboard error logs containing "SIGSEGV" and I think we have broader memory corruption issues on freebsd/amd64. The other related failures appear to be #46103 and #46182.

@bcmills do you mind if I close those and merge them all into a general "FreeBSD memory corruption" issue?

ayang64 (Contributor) commented May 19, 2021

I can't reproduce this under 12.2-RELEASE or 14-CURRENT. Is this running on one of the 12.2 builders? Does the error occur on every test run?

mknyszek (Contributor) commented May 19, 2021

@ayang64 The failures are varied and rare, but all appear to be some form of memory corruption (at least to my eyes).

I suspect that reproducing this error specifically is going to be difficult or impossible. I think if we want to try to nail this we just need to stress running all.bash on a whole bunch of FreeBSD machines and grab a core dump when things go wrong.

bcmills (Member, Author) commented May 19, 2021

@mknyszek, if we believe that the underlying problem is “undiagnosed memory corruption” I'm fine with deduping all of the issues we suspect share that cause to a single issue. (We can always open new issues if there are still crashes after the suspected root cause is diagnosed and fixed.)

mknyszek (Contributor) commented May 19, 2021

Closing as a duplicate of #46272.

mknyszek closed this May 19, 2021