Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: SIGSEGV in runtime.(*fixalloc).alloc #47302

prattmic opened this issue Jul 20, 2021 · 6 comments

runtime: SIGSEGV in runtime.(*fixalloc).alloc #47302

prattmic opened this issue Jul 20, 2021 · 6 comments
NeedsInvestigation release-blocker


Copy link

@prattmic prattmic commented Jul 20, 2021

On Go 1.17rc1 internally, we have seen numerous crashes on amd64/linux in runtime.(*fixalloc).alloc. The crashes do not occur on 1.16. The crashes all look something like this:

SIGSEGV: segmentation violation code=0x2 addr=0x7fa554c133c8 pc=0x561666c1a7b2]

runtime stack:
runtime.throw({0x5616599d2a91, 0x0})
        /root/go/src/runtime/panic.go:1204 +0x71 fp=0x7fa51b3abc50 sp=0x7fa51b3abc20 pc=0x561666c37451
        /root/go/src/runtime/signal_unix.go:720 +0x416 fp=0x7fa51b3abca8 sp=0x7fa51b3abc50 pc=0x561666c4ddd6
        /root/go/src/runtime/mfixalloc.go:72 +0x32 fp=0x7fa51b3abce0 sp=0x7fa51b3abca8 pc=0x561666c1a7b2
        /root/go/src/runtime/mcache.go:88 +0x37 fp=0x7fa51b3abd00 sp=0x7fa51b3abce0 pc=0x561666c17d97
        /root/go/src/runtime/mcache.go:86 +0x4b fp=0x7fa51b3abd30 sp=0x7fa51b3abd00 pc=0x561666c17ceb
runtime.(*p).init(0xc000264800, 0x5)
        /root/go/src/runtime/proc.go:4920 +0x105 fp=0x7fa51b3abd50 sp=0x7fa51b3abd30 pc=0x561666c43a25
        /root/go/src/runtime/proc.go:5096 +0x365 fp=0x7fa51b3abe00 sp=0x7fa51b3abd50 pc=0x561666c442a5
        /root/go/src/runtime/proc.go:1263 +0x8b fp=0x7fa51b3abe60 sp=0x7fa51b3abe00 pc=0x561666c3c4cb
        /root/go/src/runtime/proc.go:1098 +0x1b fp=0x7fa51b3abe78 sp=0x7fa51b3abe60 pc=0x561666c63c3b
        /root/go/src/runtime/asm_amd64.s:383 +0x46 fp=0x7fa51b3abe80 sp=0x7fa51b3abe78 pc=0x561666c6a6c6

goroutine 313 [running]:
        /root/go/src/runtime/asm_amd64.s:350 fp=0xc05ebe3e00 sp=0xc05ebe3df8 pc=0x561666c6a660
        /root/go/src/runtime/proc.go:1098 +0x28 fp=0xc05ebe3e30 sp=0xc05ebe3e00 pc=0x561666c3c048
        /root/go/src/runtime/proc.go:1131 +0x19 fp=0xc05ebe3e58 sp=0xc05ebe3e30 pc=0x561666c3c199
        /root/go/src/runtime/debug.go:33 +0x95 fp=0xc05ebe3e80 sp=0xc05ebe3e58 pc=0x561666c08175

These crashes have previously been observed on FreeBSD in #46103 and #46272, but were believed to be limited to FreeBSD. The latter bug also contains other types of crashes that we have not observed on Linux.

What we know so far:

  • The crash occurs when dereferencing fixalloc.list here.
    • In core dumps, fixalloc.list indeed contains the fault address.
  • The crash always occurs in mheap_.cachealloc.alloc.
  • The crash always occurs during procresize.
    • It should only be possible for mheap_.cachealloc.list to be set if GOMAXPROCS is decreased. In two cores checked so far, runtime.allp has len < cap, indicating there was a indeed a decrease.
  • The SIGSEGV always has code=2 (SEGV_ACCERR), implying that the fault address is a valid, but PROT_NONE, mapping.
  • In one case the fault address was in a pagealloc mapping. This is not true in all crashes, but suggests that this is not a case of a pointer that was previously valid and only later mprotect'd, but rather the pointer was always bad.
  • In several, but not all, crashes, the page offset of the faulting address is 0x3c8.
  • In two core dumps that have been checked, the fault address is only in memory at mheap_.cachealloc.list, and in the curg.sigcode1, curg.m.gsignal stack, and curg.m.g0 stack. All of the latter references are from after the crash while we prepare to panic.

cc @mknyszek @aclements

@prattmic prattmic added NeedsInvestigation release-blocker labels Jul 20, 2021
@prattmic prattmic added this to the Go1.17 milestone Jul 20, 2021
Copy link

@gopherbot gopherbot commented Jul 22, 2021

Change mentions this issue: runtime: move mem profile sampling into m-acquired section

Copy link
Member Author

@prattmic prattmic commented Jul 22, 2021

We believe we have figured out the issue. mallocgc uses the mcache after releasem. A preemption between releasem and the use could free the mcache to the fixalloc free list, and then the adjustment to c.nextSample would corrupt the next pointer in the free list.

This code is not new in 1.17, however added explicit preemption points between the releasem and the mcache, which makes this possible (while preemption was theoretically possible before, there were no morestack calls in the critical section).

Thanks to @dr2chase and @mknyszek for finding the root cause, and @aclements and @ianlancetaylor and others for all the help!

Copy link

@mknyszek mknyszek commented Jul 22, 2021

@dr2chase also came up with a small reproducer:

Running under stress it had about a 1% failure rate.

Copy link

@dr2chase dr2chase commented Jul 22, 2021

This reproducer, under stress, fails over 50%:
It has failed on a Mac laptop, on a beefy Linux cloud machine, and on a small ODroid Linux arm64.

Copy link

@dr2chase dr2chase commented Jul 22, 2021

By-the-way, technique for generating this was to run it under stress with a bunch of different seeds, record a list of failures, then modify to randomly chose from an array of just those known-failing seeds, and then choose the seed that failed the most (there were a few that failed at much higher rates). Then I repeated the pattern 10 times to see if it upped the failure rate (it did), then cut the sleep time to see if it still failed often (it did), then increased the repetition count some more to further increase the failure rate within a target no-fail runtime of about 5 seconds.

steeve pushed a commit to znly/go that referenced this issue Aug 19, 2021
It was not safe to do mcache profiling updates outside the critical
section, but we got lucky because the runtime was not preemptible.
Adding chunked memory clearing (CL 270943) created preemption
opportunities, which led to corruption of runtime data structures.

Fixes golang#47304.
Fixes golang#47302.

Change-Id: I461615470d62328a83ccbac537fbdc6dcde81c85
Trust: David Chase <>
Run-TryBot: David Chase <>
TryBot-Result: Go Bot <>
Reviewed-by: Austin Clements <>
Reviewed-by: Keith Randall <>
Copy link

@gopherbot gopherbot commented Oct 30, 2021

Change mentions this issue: runtime: add always-preempt maymorestack hook

gopherbot pushed a commit that referenced this issue Nov 5, 2021
This adds a maymorestack hook that forces a preemption at every
possible cooperative preemption point. This would have helped us catch
several recent preemption-related bugs earlier, including #47302,
 #47304, and #47441.

For #48297.

Change-Id: Ib82c973589c8a7223900e1842913b8591938fb9f
Trust: Austin Clements <>
Run-TryBot: Austin Clements <>
TryBot-Result: Go Bot <>
Reviewed-by: Cherry Mui <>
Reviewed-by: Michael Pratt <>
Reviewed-by: David Chase <>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
NeedsInvestigation release-blocker
None yet

No branches or pull requests

4 participants