Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

syscall: document or fix liveness during Syscall #13372

Closed
tkadauke opened this issue Nov 23, 2015 · 26 comments
Closed

syscall: document or fix liveness during Syscall #13372

tkadauke opened this issue Nov 23, 2015 · 26 comments
Milestone

Comments

@tkadauke
Copy link

@tkadauke tkadauke commented Nov 23, 2015

I am working on a new Go tool that reads events via the Linux perf API from a bunch of mmap'ed files and then extracts useful information about those events. The fewer processors I inspect at the same time, the sooner I run into an issue with the Go garbage collector:

runtime: free list of span 0x7ffff7fb9370:
0xc82067c000 -> 0xc82067c010 -> 0xc82067c020 -> 0xc82067c030 -> 0xc82067c040 -> 0xc82067c050 -> 0xc82067c060 -> 0xc82067c070 -> 0xc82067c080 -> 0x100c82067c090 (BAD)
fatal error: free list corrupted

runtime stack:
runtime.throw(0x912ec0, 0x13)
        /usr/local/go/src/runtime/panic.go:527 +0x90
runtime.mSpan_Sweep(0x7ffff7fb9370, 0x4b00000000, 0xc800000001)
        /usr/local/go/src/runtime/mgcsweep.go:186 +0x800
runtime.sweepone(0x4377c2)
        /usr/local/go/src/runtime/mgcsweep.go:97 +0x154
runtime.gosweepone.func1()
        /usr/local/go/src/runtime/mgcsweep.go:109 +0x21
runtime.systemstack(0xc8200967c0)
        /usr/local/go/src/runtime/asm_amd64.s:262 +0x79
runtime.mstart()
        /usr/local/go/src/runtime/proc1.go:674

It's either this backtrace (GC has trouble sweeping because the free list is corrupt), or a SIGSEGV (runtime/malloc.go:585 crashes because it picked a corrupted entry from the free list).

This is 100% reproducible for me (Sorry, can't share the source code just yet). In the current configuration that I have running, it takes between 1 and 60 seconds to crash, with or without GDB.

Please note that the corrupt pointer looks like an intact pointer & 0x1000000000000.

I get the crash in Go 1.5 and Go 1.5.1.

@bradfitz
Copy link
Contributor

@bradfitz bradfitz commented Nov 23, 2015

/cc @aclements for triage (likely a dup?)

@aclements
Copy link
Member

@aclements aclements commented Nov 23, 2015

It's quite possible this is a dup of #13287, but I'm going to leave it open for now.

@tkadauke, it would be great if you could share (possibly reduced) code to reproduce this. We have a few free list corruption reports, but yours seems by far the most reproducible and most likely to lead to a fix.

Also, what OS/architecture is this?

If it's easy, I'd be curious if you can reproduce this with the latest code from the master branch. I don't think we've fixed anything that would obviously have caused this, but if it is fixed on master, we can bisect it and probably narrow right in on the problem without you having to share your code.

@aclements aclements added this to the Go1.6 milestone Nov 23, 2015
@tkadauke
Copy link
Author

@tkadauke tkadauke commented Nov 23, 2015

@aclements, this is a 40-core x86_64. I could reproduce it on a 32-core x84_64 (to exclude a hardware problem).

I might be able to isolate it in a way so I can share the code with you (but not publicly) it that's ok with you.

If that doesn't work, I can look into compiling against master.

@aclements
Copy link
Member

@aclements aclements commented Nov 23, 2015

I might be able to isolate it in a way so I can share the code with you (but not publicly) it that's ok with you.

That would be okay. I have access to many-core machines if that's necessary to reproduce it.

I assume the OS is Linux?

@tkadauke
Copy link
Author

@tkadauke tkadauke commented Nov 23, 2015

Yes, a slightly patched Linux 3.10.72

@tkadauke
Copy link
Author

@tkadauke tkadauke commented Nov 23, 2015

@aclements: I just managed to isolate the problem somewhat and was able to trigger a segfault after a few seconds. GDB says:

Program received signal SIGSEGV, Segmentation fault.
[Switching to LWP 3363593]
0x000000000040e3a4 in runtime.mallocgc (size=8, typ=0x4dec40, flags=1, ~r3=0x30) at /usr/lib/golang/src/runtime/malloc.go:585
585                             s.freelist = v.ptr().next
(gdb) print v
$1 = 859534984192
(gdb) print c.alloc[2]
$2 = (struct runtime.mspan *) 0x7ffff7faec70
(gdb) print *c.alloc[2]
$3 = {next = 0x7ffff57ad378, prev = 0x7ffff56e73f0, start = 104923704, npages = 1, freelist = 282334511694880, sweepgen = 16, divMul = 2147483648, ref = 66, sizeclass = 2 '\002', incache = true, state = 0 '\000', needzero = 0 '\000', divShift = 4 '\004', divShift2 = 31 '\037', elemsize = 16, 
  unusedsince = 0, npreleased = 0, limit = 859534991360, speciallock = {key = 0}, specials = 0x0, baseMask = 18446744073709551600}
(gdb) print *c.alloc[2].freelist
Cannot access memory at address 0x100c820470420

So the same symptom.

Where can I send the source archive?

@ianlancetaylor ianlancetaylor changed the title fatal error: free list corrupted runtime: fatal error: free list corrupted Nov 23, 2015
@rsc
Copy link
Contributor

@rsc rsc commented Nov 23, 2015

Austin is austin@google.com. Thanks.

@tkadauke
Copy link
Author

@tkadauke tkadauke commented Nov 23, 2015

Sent an email with instructions to Austin. Hope you guys can figure out what's going on! :)

@aclements
Copy link
Member

@aclements aclements commented Nov 23, 2015

Thanks for the code! It takes several minutes, but I can reproduce the mallocgc failure exactly like you saw it. I haven't been able to reproduce the sweep panic yet.

@tkadauke
Copy link
Author

@tkadauke tkadauke commented Nov 23, 2015

Awesome! For me, it alternates between the two outcomes (SIGSEGV in malloc.go/panic in the GC) but the root cause in both cases is the corrupted free list.

--Thomas

Sent from my iPhone

On 23 Nov 2015, at 23:38, Austin Clements notifications@github.com wrote:

Thanks for the code! It takes several minutes, but I can reproduce the mallocgc failure exactly like you saw it. I haven't been able to reproduce the sweep panic yet.


Reply to this email directly or view it on GitHub.

@aclements
Copy link
Member

@aclements aclements commented Nov 23, 2015

I can also reproduce this at the head of master.

@tkadauke
Copy link
Author

@tkadauke tkadauke commented Nov 23, 2015

So, from a first glance, do you think it’s a bug in the Go runtime or something that I’m doing wrong? I have ways to suppress this bug for a couple of hours, but I won’t get my program stable without a proper fix.

—Thomas

On 23 Nov 2015, at 23:41, Thomas Kadauke thomas.kadauke@googlemail.com wrote:

Awesome! For me, it alternates between the two outcomes (SIGSEGV in malloc.go/panic in the GC) but the root cause in both cases is the corrupted free list.

--Thomas

Sent from my iPhone

On 23 Nov 2015, at 23:38, Austin Clements <notifications@github.com mailto:notifications@github.com> wrote:

Thanks for the code! It takes several minutes, but I can reproduce the mallocgc failure exactly like you saw it. I haven't been able to reproduce the sweep panic yet.


Reply to this email directly or view it on GitHub #13372 (comment).

@aclements
Copy link
Member

@aclements aclements commented Nov 24, 2015

I think this is a bug in the Go runtime. I skimmed over your unsafe code and none of it jumped out at me as likely to cause this, mostly because you usually don't write through unsafe pointers and the things you read through unsafe pointers don't themselves contain pointers*. It also seems unlikely you'd corrupt the free list in this particular way; you'd have to either back up one byte from an allocation and scribble a 1 over it or skip forward just the right number of bytes from the end of an allocation without mangling the intervening bytes.

* That said, some of it is a bit scary. For example, your BinReader does nothing to keep the backing store alive, but all of the call sites happen to.

@hus787
Copy link

@hus787 hus787 commented Nov 30, 2015

It's hard to reproduce this but I have run into this several times today.

@rsc
Copy link
Contributor

@rsc rsc commented Nov 30, 2015

@hus787 do you have the same program as @tkadauke? If not, please file a separate issue, including crash stack traces and ideally the source code you're using. Thanks.

@rsc
Copy link
Contributor

@rsc rsc commented Nov 30, 2015

@tkadauke Thanks very much for sending the simplified program. Austin and I were able to reproduce it on multiple machines. I'm sorry to say that it's not a runtime bug, at least not directly. Your program wraps sys_poll like this:

const (
    POLLIN   = 0x0001
    POLLPRI  = 0x0002
    POLLOUT  = 0x0004
    POLLERR  = 0x0008
    POLLHUP  = 0x0010
    POLLNVAL = 0x0020
)

type PollFd struct {
    fd      int32
    events  int16
    revents int16
}

func sys_poll(files []*os.File, timeout uint) (int, error) {
    fds := make([]PollFd, 0, len(files))

    for _, file := range files {
        fds = append(fds, PollFd{
            fd:     int32(file.Fd()),
            events: POLLIN | POLLERR | POLLHUP,
        })
    }

    r1, _, e := syscall.Syscall(syscall.SYS_POLL, uintptr(unsafe.Pointer(&fds[0])), uintptr(len(fds)), uintptr(timeout))

    if e != 0 {
        return int(r1), os.NewSyscallError("poll", e)
    }

    return int(r1), nil
}

The problem is that nothing refers to fds after the syscall.Syscall arguments are prepared, so the compiler does not record fds as live during the system call. While the system call is blocked, the garbage collector can come along and reclaim the memory pointed at by fds. Then when the kernel is ready to return a result, it writes POLLIN (0x0001) into, say, fds[0].revents, which ends up overwriting the top 16 bits of a 64-bit word with 0x0001, in a block of memory that has already been freed and is sitting on a free list. That is how we end up in the crash with what looks like a valid pointer except for the 0x0001000000000000 high bit.

If you change the code to use fds again after the syscall.Syscall, it will be marked live for the duration of the call and will therefore stay valid for the kernel to write to.

One way to do this is to add something like this to your package:

var useGlobal uintptr

func use(x unsafe.Pointer) {
    atomic.StoreUintptr(&useGlobal, uintptr(x))
}

and then after the syscall.Syscall, add a call

use(unsafe.Pointer(&fds[0]))

to keep fds live during the syscall.Syscall. This is basically what package syscall itself does, although there func use is implemented in assembly.

With a change like this, Austin and I were both able to run the sample program for over an hour, when it used to crash after a few minutes.

I've been thinking about what we should do in general here. At the least we need better documentation about how to use syscall.Syscall, but perhaps we can make this kind of thing less necessary. I'm going to leave this issue to track that work.

Thanks very much for the report. I'm sad it wasn't (or doesn't appear to be) the runtime bug we're chasing in the other reports, but it's still certainly something we need to address.

@rsc rsc changed the title runtime: fatal error: free list corrupted syscall: document or fix liveness during Syscall Nov 30, 2015
@hus787
Copy link

@hus787 hus787 commented Nov 30, 2015

@rsc No, it's not the same program. And it's unlikely that I'll be able to share to source code. But sure I'll file a separate issue with the (probably a partial) stacktrace.

@dr2chase
Copy link
Contributor

@dr2chase dr2chase commented Nov 30, 2015

It's tempting to think of either a "go vet" check, or something that the compiler might do automatically, since a pattern along the lines of "go pointer -> unsafe.Pointer -> uintptr -> cgo call arg & go pointer is now not live" is a red flag with a siren. I'd want a tool like that to make it easier to deal with bugs like this, might as well cut out the middle man and put it into go vet or the compiler itself.

@rsc
Copy link
Contributor

@rsc rsc commented Nov 30, 2015

@tkadauke
Copy link
Author

@tkadauke tkadauke commented Nov 30, 2015

@rsc, thank you so much for your looking into this. Your findings are very fascinating and line up with the weird crashing behavior that less traced CPUs trigger crashes more often. This is because there are more calls to poll then, which means more opportunities for the GC to collect fds.

@gopherbot
Copy link

@gopherbot gopherbot commented Jan 13, 2016

CL https://golang.org/cl/18584 mentions this issue.

@gopherbot gopherbot closed this in 1ac637c Jan 14, 2016
@gopherbot
Copy link

@gopherbot gopherbot commented Jan 14, 2016

CL https://golang.org/cl/18640 mentions this issue.

gopherbot pushed a commit that referenced this issue Jan 22, 2016
Add docs for valid uses of Pointer.
Then document change made for #13372 in CL 18584.

Fixes #8994.

Change-Id: Ifba71e5aeafd11f684aed0b7ddacf3c8ec07c580
Reviewed-on: https://go-review.googlesource.com/18640
Reviewed-by: Alan Donovan <adonovan@google.com>
Reviewed-by: Robert Griesemer <gri@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
Reviewed-by: Rob Pike <r@golang.org>
Reviewed-by: Austin Clements <austin@google.com>
Reviewed-by: Rick Hudson <rlh@golang.org>
@gopherbot
Copy link

@gopherbot gopherbot commented May 15, 2016

CL https://golang.org/cl/23123 mentions this issue.

@gopherbot
Copy link

@gopherbot gopherbot commented May 15, 2016

CL https://golang.org/cl/23124 mentions this issue.

gopherbot pushed a commit to golang/sys that referenced this issue May 15, 2016
Updates golang/go#13372.

Change-Id: Idfd5001f4ad7bc80a4283df1c310f97612eba85c
Reviewed-on: https://go-review.googlesource.com/23124
Run-TryBot: Mikio Hara <mikioh.mikioh@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
gopherbot pushed a commit that referenced this issue May 15, 2016
Updates #13372.

Change-Id: Id2402a781474e9d0bb0901c5844adbd899f76cbd
Reviewed-on: https://go-review.googlesource.com/23123
Run-TryBot: Mikio Hara <mikioh.mikioh@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@gopherbot
Copy link

@gopherbot gopherbot commented Jul 4, 2016

CL https://golang.org/cl/24750 mentions this issue.

@gopherbot
Copy link

@gopherbot gopherbot commented Jul 4, 2016

CL https://golang.org/cl/24751 mentions this issue.

gopherbot pushed a commit to golang/sys that referenced this issue Jul 4, 2016
Updates golang/go#13372.

Change-Id: I623de97eb19880356148cbcb7d17759df82684aa
Reviewed-on: https://go-review.googlesource.com/24751
Run-TryBot: Mikio Hara <mikioh.mikioh@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
gopherbot pushed a commit that referenced this issue Jul 4, 2016
Updates #13372.

Change-Id: If383c14af14839a303425ba7b80b97e35ca9b698
Reviewed-on: https://go-review.googlesource.com/24750
Run-TryBot: Mikio Hara <mikioh.mikioh@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@golang golang locked and limited conversation to collaborators Jul 4, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants
You can’t perform that action at this time.