
runtime: SIGSEGV in runtime.(*fixalloc).alloc #47302

Closed

prattmic opened this issue Jul 20, 2021 · 5 comments

prattmic (Member) commented Jul 20, 2021

On Go 1.17rc1 internally, we have seen numerous crashes on amd64/linux in runtime.(*fixalloc).alloc. The crashes do not occur on 1.16. The crashes all look something like this:

[signal SIGSEGV: segmentation violation code=0x2 addr=0x7fa554c133c8 pc=0x561666c1a7b2]

runtime stack:
runtime.throw({0x5616599d2a91, 0x0})
        /root/go/src/runtime/panic.go:1204 +0x71 fp=0x7fa51b3abc50 sp=0x7fa51b3abc20 pc=0x561666c37451
runtime.sigpanic()
        /root/go/src/runtime/signal_unix.go:720 +0x416 fp=0x7fa51b3abca8 sp=0x7fa51b3abc50 pc=0x561666c4ddd6
runtime.(*fixalloc).alloc(0x0)
        /root/go/src/runtime/mfixalloc.go:72 +0x32 fp=0x7fa51b3abce0 sp=0x7fa51b3abca8 pc=0x561666c1a7b2
runtime.allocmcache.func1()
        /root/go/src/runtime/mcache.go:88 +0x37 fp=0x7fa51b3abd00 sp=0x7fa51b3abce0 pc=0x561666c17d97
runtime.allocmcache()
        /root/go/src/runtime/mcache.go:86 +0x4b fp=0x7fa51b3abd30 sp=0x7fa51b3abd00 pc=0x561666c17ceb
runtime.(*p).init(0xc000264800, 0x5)
        /root/go/src/runtime/proc.go:4920 +0x105 fp=0x7fa51b3abd50 sp=0x7fa51b3abd30 pc=0x561666c43a25
runtime.procresize(0x6)
        /root/go/src/runtime/proc.go:5096 +0x365 fp=0x7fa51b3abe00 sp=0x7fa51b3abd50 pc=0x561666c442a5
runtime.startTheWorldWithSema(0x0)
        /root/go/src/runtime/proc.go:1263 +0x8b fp=0x7fa51b3abe60 sp=0x7fa51b3abe00 pc=0x561666c3c4cb
runtime.startTheWorld.func1()
        /root/go/src/runtime/proc.go:1098 +0x1b fp=0x7fa51b3abe78 sp=0x7fa51b3abe60 pc=0x561666c63c3b
runtime.systemstack()
        /root/go/src/runtime/asm_amd64.s:383 +0x46 fp=0x7fa51b3abe80 sp=0x7fa51b3abe78 pc=0x561666c6a6c6

goroutine 313 [running]:
runtime.systemstack_switch()
        /root/go/src/runtime/asm_amd64.s:350 fp=0xc05ebe3e00 sp=0xc05ebe3df8 pc=0x561666c6a660
runtime.startTheWorld()
        /root/go/src/runtime/proc.go:1098 +0x28 fp=0xc05ebe3e30 sp=0xc05ebe3e00 pc=0x561666c3c048
runtime.startTheWorldGC()
        /root/go/src/runtime/proc.go:1131 +0x19 fp=0xc05ebe3e58 sp=0xc05ebe3e30 pc=0x561666c3c199
runtime.GOMAXPROCS(0x6)
        /root/go/src/runtime/debug.go:33 +0x95 fp=0xc05ebe3e80 sp=0xc05ebe3e58 pc=0x561666c08175
[snip]

These crashes have previously been observed on FreeBSD in #46103 and #46272, but were believed to be limited to FreeBSD. The latter bug also contains other types of crashes that we have not observed on Linux.

What we know so far:

  • The crash occurs when dereferencing fixalloc.list here.
    • In core dumps, fixalloc.list indeed contains the fault address.
  • The crash always occurs in mheap_.cachealloc.alloc.
  • The crash always occurs during procresize.
    • It should only be possible for mheap_.cachealloc.list to be set if GOMAXPROCS is decreased. In two cores checked so far, runtime.allp has len < cap, indicating there was indeed a decrease.
  • The SIGSEGV always has code=2 (SEGV_ACCERR), implying that the fault address is a valid, but PROT_NONE, mapping.
  • In one case the fault address was in a pagealloc mapping. This is not true in all crashes, but suggests that this is not a case of a pointer that was previously valid and only later mprotect'd, but rather the pointer was always bad.
  • In several, but not all, crashes, the page offset of the faulting address is 0x3c8.
  • In two core dumps that have been checked, the fault address is only in memory at mheap_.cachealloc.list, and in the curg.sigcode1, curg.m.gsignal stack, and curg.m.g0 stack. All of the latter references are from after the crash while we prepare to panic.

cc @mknyszek @aclements

gopherbot commented Jul 22, 2021

Change https://golang.org/cl/336449 mentions this issue: runtime: move mem profile sampling into m-acquired section

prattmic (Member, Author) commented Jul 22, 2021

We believe we have figured out the issue. mallocgc uses the mcache after releasem. A preemption between releasem and the use could free the mcache to the fixalloc free list, and then the adjustment to c.nextSample would corrupt the next pointer in the free list.

This code is not new in 1.17; however, golang.org/cl/270943 added explicit preemption points between releasem and the use of the mcache, which makes this failure possible (while preemption was theoretically possible before, there were no morestack calls in the critical section).
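In outline, the racy window looks like the following simplified pseudocode (runtime-internal names; this does not compile outside the runtime and is not the actual mallocgc source):

```go
mp := acquirem()
c := getMCache(mp)   // mcache owned by this m's P
// ... carve the object out of c ...
releasem(mp)         // from here on, preemption is possible:
                     // procresize may free c back to fixalloc,
                     // whose free list reuses c's first words
// ...
c.nextSample -= size // BUG: may scribble over the free-list link

// CL 336449 moves the nextSample update (and the profiling sample)
// before releasem, so the mcache is never touched after the m is
// released.
```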

Thanks to @dr2chase and @mknyszek for finding the root cause, and @aclements and @ianlancetaylor and others for all the help!

mknyszek (Contributor) commented Jul 22, 2021

@dr2chase also came up with a small reproducer: https://play.golang.org/p/zioaIirWPSp

Running under stress, it had about a 1% failure rate.

gopherbot closed this in fdb45ac Jul 22, 2021
dr2chase (Contributor) commented Jul 22, 2021

This reproducer, under stress, fails over 50% of the time: https://play.golang.org/p/LjMLSsNtrLk
It has failed on a Mac laptop, on a beefy Linux cloud machine, and on a small ODroid arm64 Linux board.

dr2chase (Contributor) commented Jul 22, 2021

By the way, the technique for generating this was to run it under stress with a bunch of different seeds and record a list of failures, then modify the program to randomly choose from an array of just those known-failing seeds, and then pick the seed that failed the most (a few failed at much higher rates than the rest). I then repeated the pattern 10 times to see if that upped the failure rate (it did), cut the sleep time to see if it still failed often (it did), and increased the repetition count further to push the failure rate up within a target no-fail runtime of about 5 seconds.
