Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: "runtime: bad pointer in frame runtime.sellock" crash on some builders #15936

Closed
eliasnaur opened this issue Jun 2, 2016 · 15 comments

Comments

Projects
None yet
8 participants
@eliasnaur
Copy link
Contributor

commented Jun 2, 2016

While working on another builder crash, I noticed this once:

https://build.golang.org/log/e3cc1e2bd6c5d4e0c3726213974c3d0a1af12f5a

runtime: bad pointer in frame runtime.sellock at 0xa24cc68: 0xd

fatal error: invalid stack pointer



runtime stack:

runtime.throw(0x424be8a, 0x15)

    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/panic.go:566 +0x80 fp=0x16fd5f2a0 sp=0x16fd5f280

runtime.adjustpointers(0xa24cc68, 0x16fd5f440, 0x16fd5f7b8, 0x4285c58)

    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/stack.go:602 +0x2d8 fp=0x16fd5f3b0 sp=0x16fd5f2a0

runtime.adjustframe(0x16fd5f6c0, 0x16fd5f7b8, 0x16fd5f501)

    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/stack.go:674 +0x1d0 fp=0x16fd5f470 sp=0x16fd5f3b0

runtime.gentraceback(0x4103e50, 0xa24ca60, 0x40d9aa8, 0xa1ed380, 0x0, 0x0, 0x7fffffff, 0x425e8c0, 0x16fd5f7b8, 0x0, ...)

    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/traceback.go:378 +0xe0c fp=0x16fd5f710 sp=0x16fd5f470

runtime.copystack(0xa1ed380, 0x1000, 0xa1ed301)

    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/stack.go:895 +0x3e0 fp=0x16fd5f920 sp=0x16fd5f710

runtime.newstack()

    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/stack.go:1063 +0xa80 fp=0x16fd5faa0 sp=0x16fd5f920

runtime.morestack()

    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/asm_arm64.s:300 +0x5c fp=0x16fd5faa0 sp=0x16fd5faa0

Filing this issue if the crash is interesting in itself or in case in happens again on the same or another builder.

@ianlancetaylor ianlancetaylor added this to the Go1.7 milestone Jun 2, 2016

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 2, 2016

@aclements

This comment has been minimized.

Copy link
Member

commented Jun 2, 2016

We've seen a few other "bad pointer in frame" failures:

$ greplogs -l -md -dashboard -E 'bad pointer in frame'
2015-07-29T23:50:20-d3ffc97/linux-arm64-canonical
2015-09-08T09:27:31-8d08167/windows-amd64-2003kardianos
2016-02-02T18:08:46-3297a4f/darwin-arm64-a7ios
2016-03-23T10:59:55-a4be24c/plan9-amd64-9front
2016-06-02T00:21:58-36358b1/darwin-arm64-a7ios

I don't believe the plan9-amd64-9front failure since basically every test crashed on that run, but the others look entirely plausible.

This may suggest another bug in the liveness/stack map information. Probably the right next step is to build at one of the failing commits/configurations and look at the machine code and stack map around the call with the bad frame.

@eliasnaur eliasnaur changed the title runtime: runtime: bad pointer in frame runtime.sellock on the darwin/arm64 builder runtime: "runtime: bad pointer in frame runtime.sellock" crash on some builders Jun 7, 2016

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 8, 2016

I looked at the most recent link above, https://build.golang.org/log/df89109d252dc12ecd0fa587cc00c11cdb0af1c5 . It doesn't seem like a liveness problem. The crash is when newstack is called while panicking here:

runtime.sigpanic()
    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/sigpanic_unix.go:12 +0x50 fp=0xa24cbe0 sp=0xa24cb90
runtime/internal/atomic.Cas64(0x65, 0x0, 0x1)
    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/internal/atomic/atomic_arm64.s:76 +0xc fp=0xa24cbf0 sp=0xa24cbf0
runtime.lock(0x65)
    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/lock_sema.go:43 +0x94 fp=0xa24cc40 sp=0xa24cbf0
runtime.sellock(0xa1eef28, 0x1, 0x1, 0xa1eef88, 0x1, 0x1)
    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/select.go:170 +0xc0 fp=0xa24cc90 sp=0xa24cc40
runtime.selectgoImpl(0xa1eef10, 0x18, 0xa251000)
    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/select.go:426 +0x115c fp=0xa24cea0 sp=0xa24cc90
runtime.selectgo(0xa1eef10)
    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/runtime/select.go:238 +0x10 fp=0xa24cec0 sp=0xa24cea0
net.(*netFD).connect.func2(0x0, 0xa1eef30, 0x0, 0x0)
    /private/var/folders/0m/n0150xb53y5cg1rq75_3qv3r0000gn/T/workdir/go/src/net/fd_unix.go:110 +0x154 fp=0xa24cf80 sp=0xa24cec0

This stack trace shows that the pointer value is already bad (I don't know if it is really 0x65, but it is clearly bad or we would not have gotten a signal here). The pointer value is not on the stack--it's obtained from a field of the hselect which is on the stack. Specifically it's lock(&c.lock) in sellock. I checked the liveness info of the function where the hselect is allocated, net.(*netFD).connect.func2 and it is pretty straightforward.

net/fd_unix.go:110: live at call to selectgo: ctx fd done autotmp_3704

TYPE    "".autotmp_3704-128(SP)type.struct { tcase uint16; ncase uint16; pollorder *uint8; lockorder *uint8; scase [2]struct { elem *uint8; chan *uint8; pc uintptr; kind uint16; so uint16; receivedp *uint8; releasetime uint64 }; lockorderarr [2]uint16; pollorderarr [2]uint16 }, $128

I don't know what is going on but I don't see how there could be any liveness confusion here.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2016

I wrote a CL to clobber all dead pointer fields immediately after all call sites. It didn't find anything interesting, unfortunately.
https://go-review.googlesource.com/#/c/23924/

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2016

The fact that this seems to happen on arm64 (the single amd64 case is from a long time ago, and is not in the select code) makes me wonder if something in the channel/select stack transfer code is not following the weak arm64 memory model.

@randall77

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2016

Actually, I did find something, the bug in #16016. I just didn't realize it was a bug. I'm not sure whether that bug is the cause for any of these failures, but it is possible.

@josharian

This comment has been minimized.

Copy link
Contributor

commented Jun 12, 2016

I've kicked off three tests on my pine64 linux/arm64 machine, based on the failure logs: stress ./bufio.test, stress ./time.test, stress ./syslog.test. I'll report back. Suggestions for other tests to run would be welcome.

@josharian

This comment has been minimized.

Copy link
Contributor

commented Jun 13, 2016

24 hours later, I see no failures. I'm going to try running all.bash in a loop for 24 hours instead.

@aclements

This comment has been minimized.

Copy link
Member

commented Jun 13, 2016

It may be worth running a stress test on the commit before e3f1c66, which fixed #16016. If we can reproduce it there, we'll know that was probably the cause and that it's fixed.

@josharian

This comment has been minimized.

Copy link
Contributor

commented Jun 13, 2016

Good idea. I've started the same three stress tests using 1bdf1c3.

@josharian

This comment has been minimized.

Copy link
Contributor

commented Jun 14, 2016

No failures after a day at 1bdf1c3, either. Unfortunately, I'll be away from my hardware for the next two weeks.

@adg

This comment has been minimized.

Copy link
Contributor

commented Jun 27, 2016

Looks like we might punt this to 1.8.

@adg adg modified the milestones: Go1.8, Go1.7 Jun 27, 2016

@quentinmit

This comment has been minimized.

Copy link
Contributor

commented Sep 29, 2016

@josharian Have you found any other instances of this? Otherwise I think we should probably close this as not reproducible.

@aclements

This comment has been minimized.

Copy link
Member

commented Oct 11, 2016

There haven't been any instances of this on the dashboard since June 2nd (#15936 (comment)). I think it's safe to close this.

@quentinmit quentinmit closed this Oct 12, 2016

@gopherbot

This comment has been minimized.

Copy link

commented Feb 3, 2017

CL https://golang.org/cl/23924 mentions this issue.

gopherbot pushed a commit that referenced this issue Apr 21, 2017

cmd/compile: experiment which clobbers all dead pointer fields
The experiment "clobberdead" clobbers all pointer fields that the
compiler thinks are dead, just before and after every safepoint.
Useful for debugging the generation of live pointer bitmaps.

Helped find the following issues:
Update #15936
Update #16026
Update #16095
Update #18860

Change-Id: Id1d12f86845e3d93bae903d968b1eac61fc461f9
Reviewed-on: https://go-review.googlesource.com/23924
Run-TryBot: Keith Randall <khr@golang.org>
Reviewed-by: Matthew Dempsky <mdempsky@google.com>
Reviewed-by: Cherry Zhang <cherryyz@google.com>

@golang golang locked and limited conversation to collaborators Feb 3, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
You can’t perform that action at this time.