runtime: can not read stacktrace from a core file #25218

Open
aarzilli opened this Issue May 2, 2018 · 22 comments

aarzilli commented May 2, 2018

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

This problem was introduced by b1d1ec9 (CL 110065) and is still present at tip.

Does this issue reproduce with the latest release?

No.

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOCACHE="/home/a/.cache/go-build"
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/a/n/go/"
GORACE=""
GOROOT="/usr/local/go-tip"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go-tip/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build495464568=/tmp/go-build -gno-record-gcc-switches"

What did you do?

Given:

package main

import "runtime/debug"

func main() {
	debug.SetTraceback("crash")
	crash()
}

func crash() {
	panic("panic!")
}

Running it under gdb will produce this stacktrace:

(gdb) bt
#0  0x000000000044fff4 in runtime.raise () at <autogenerated>:1
#1  0x0000000000438c2b in runtime.dieFromSignal (sig=6) at /usr/local/go-tip/src/runtime/signal_unix.go:424
#2  0x0000000000438dea in runtime.crash () at /usr/local/go-tip/src/runtime/signal_unix.go:526
#3  0x0000000000426487 in runtime.fatalpanic (msgs=<optimized out>) at /usr/local/go-tip/src/runtime/panic.go:696
#4  0x0000000000425e6b in runtime.gopanic (e=...) at /usr/local/go-tip/src/runtime/panic.go:502
#5  0x0000000000470ac9 in main.crash () at /home/a/temp/simple.go:11
#6  0x0000000000470a7b in main.main () at /home/a/temp/simple.go:7

However, letting it produce a core file and then reading that core file with gdb produces this:

$ gdb ./simple simple-core
...
(gdb) bt
#0  0x000000000044fff4 in runtime.raise () at <autogenerated>:1
#1  0x0000000000438c2b in runtime.dieFromSignal (sig=6) at /usr/local/go-tip/src/runtime/signal_unix.go:424
#2  0x00000000004390a8 in runtime.sigfwdgo (ctx=0xc000009ac0, info=0xc000009bf0, sig=6, ~r3=<optimized out>) at /usr/local/go-tip/src/runtime/signal_unix.go:637
#3  0x0000000000438488 in runtime.sigtrampgo (ctx=0xc000009ac0, info=0xc000009bf0, sig=<optimized out>) at /usr/local/go-tip/src/runtime/signal_unix.go:289
#4  0x00000000004502e3 in runtime.sigtramp () at <autogenerated>:1
#5  0x00000000004503d0 in ?? () at <autogenerated>:1
#6  0x0000000000000001 in ?? ()
#7  0x0000000000000000 in ?? ()

I'm not sure what's happening here. Is the signal handler running and overwriting part of the stack?


andybons added this to the Unplanned milestone May 2, 2018

heschik added the Debugging label May 2, 2018

heschik commented May 2, 2018

I assume this is all with GOTRACEBACK=crash. GDB is breaking at the first SIGABRT, which isn't what actually kills the process:

(gdb) run
Starting program: panic 
panic: panic!
[snip]
Program received signal SIGABRT, Aborted.
runtime.raise () at go/src/runtime/sys_linux_amd64.s:146
146		RET
(gdb) bt
#0  runtime.raise () at go/src/runtime/sys_linux_amd64.s:146
#1  0x0000000000438c2b in runtime.dieFromSignal (sig=6) at go/src/runtime/signal_unix.go:424
#2  0x0000000000438dea in runtime.crash () at go/src/runtime/signal_unix.go:526
#3  0x0000000000426497 in runtime.fatalpanic (msgs=<optimized out>) at go/src/runtime/panic.go:696
#4  0x0000000000425e7b in runtime.gopanic (e=...) at go/src/runtime/panic.go:502
#5  0x00000000004709c9 in main.crash () at go/src/panic.go:11
#6  0x000000000047097b in main.main () at go/src/panic.go:7
(gdb) c
Continuing.

Program received signal SIGABRT, Aborted.
runtime.raise () at go/src/runtime/sys_linux_amd64.s:146
146		RET
(gdb) bt
#0  runtime.raise () at go/src/runtime/sys_linux_amd64.s:146
#1  0x0000000000438c2b in runtime.dieFromSignal (sig=6) at go/src/runtime/signal_unix.go:424
#2  0x00000000004390a8 in runtime.sigfwdgo (ctx=0xc000009ac0, info=0xc000009bf0, sig=6, ~r3=<optimized out>) at go/src/runtime/signal_unix.go:637
#3  0x0000000000438488 in runtime.sigtrampgo (ctx=0xc000009ac0, info=0xc000009bf0, sig=<optimized out>) at go/src/runtime/signal_unix.go:289
#4  0x00000000004502a3 in runtime.sigtramp () at go/src/runtime/sys_linux_amd64.s:353
#5  0x0000000000450390 in ?? () at go/src/runtime/sys_linux_amd64.s:437
#6  0x0000000000000007 in ?? ()
#7  0x0000000000000000 in ?? ()
(gdb) c
Continuing.

Program terminated with signal SIGABRT, Aborted.
The program no longer exists.

So that much makes sense. Is the real complaint here that the application's stack trace doesn't appear in the core dump?

aarzilli commented May 2, 2018

> So that much makes sense. Is the real complaint here that the application's stack trace doesn't appear in the core dump?

Yes. I would expect some goroutine's stack to have a frame for main.main and main.crash in it, at least.

heschik commented May 2, 2018

Well, it would still be on the goroutine's stack, but the thread is on the signal handler stack when it dies, not the goroutine stack. And IIRC Austin said there's no way to convince GDB to show you a different stack for a thread when it's reading a core file, short of writing a custom backtracer, and even then it was very hard or didn't work.

My inclination would be to deregister the signal handler for sig in dieFromSignal. If the first signal kills the process then you'd get the first stack in the core file, which is what we want. But all this stuff is incredibly subtle, so there's probably a reason that wouldn't work, maybe some cgo interop thing.

Does Delve do any better than GDB here? In principle it could show the goroutine stack along with gsignal.

aarzilli commented May 2, 2018

> Does Delve do any better than GDB here? In principle it could show the goroutine stack along with gsignal.

It used to do better, but it doesn't anymore after that commit. In particular:

> Well, it would still be on the goroutine's stack, but the thread is on the signal handler stack when it dies, not the goroutine stack.

doesn't look true to me. It looks to me like the thread is running on the normal stack of goroutine 1; if it were a signal-handling stack, it would have goid == 0, right? Also, how does sigfwdgo call dieFromSignal?

heschik commented May 2, 2018

Where are you seeing goroutine 1 that you trust? The panic backtrace shows goroutine 1, but that happens before dieFromSignal. gdb shows 1 in info goroutines, but it combines scheduler information with thread state in a way that'll just end up showing whatever the thread's up to, not the user code that was running for goroutine 1.

> Also, how does sigfwdgo call dieFromSignal?

dieFromSignal(sig)

or am I missing something?

I'm probably out of my depth at this point. Hopefully Austin or Elias can weigh in.

aarzilli commented May 2, 2018

> Where are you seeing goroutine 1 that you trust?

I'm seeing it while debugging Delve: the TLS for the first thread contains a pointer to the same g struct that's in the first entry of runtime.allgs, and that g has goid == 1.
I was under the impression that when a thread is running on one of those special stacks the pointer to g is changed accordingly; am I wrong? Should I check whether sp is inside [g.m.gsignal.stack.stacklo, g.m.gsignal.stack.stackhi]?

> or am I missing something?

Oh, I didn't see that.

heschik commented May 2, 2018

> I was under the impression that when a thread is running on one of those special stacks the pointer to g is changed accordingly; am I wrong? Should I check whether sp is inside [g.m.gsignal.stack.stacklo, g.m.gsignal.stack.stackhi]?

If you want to handle this case I think you have to. This area of the code is responsible for handling signals that might be caused by C code, so it can't blindly muck with Go stuff until it knows what's up. The setg call you hoped had run is here:

setg(g.m.gsignal)

and only runs if sigfwdgo doesn't do anything. sigtrampgo checks the stack pointer to decide if gsignal or g0 is running:
	if sp < g.m.gsignal.stack.lo || sp >= g.m.gsignal.stack.hi {
		if sp >= g.m.g0.stack.lo && sp < g.m.g0.stack.hi {
so that ought to be a decent model to follow.
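
For a debugger reading a core file, a minimal sketch of that model might look like this (the names are hypothetical: sp would come from the thread's registers in the core, and the lo/hi bounds would be read out of the g structs):

// stackBounds mirrors runtime.stack: the [lo, hi) addresses of one stack.
type stackBounds struct{ lo, hi uint64 }

// classifyStack reports which of the thread's stacks sp falls on,
// applying the same checks sigtrampgo uses above.
func classifyStack(sp uint64, gsignal, g0, curg stackBounds) string {
	switch {
	case sp >= gsignal.lo && sp < gsignal.hi:
		return "gsignal" // signal-handler stack
	case sp >= g0.lo && sp < g0.hi:
		return "g0" // scheduler/system stack
	case sp >= curg.lo && sp < curg.hi:
		return "curg" // the goroutine's own stack
	default:
		return "unknown"
	}
}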

(I still kinda think clearing out the signal handler in dieFromSignal is reasonable.)

aarzilli commented May 2, 2018

Ok, the thread's sp is actually inside gsignal's stack. Do you know where the sp for the normal goroutine stack is saved in this case? g.sched.sp is zero.

heschik commented May 2, 2018

That's a good point. The goroutine wasn't preempted normally, so nothing in the runtime will know what its stack pointer was. The only place you could find it is the ctx argument to the signal handler (the thing we wrap a sigctxt around), and even that will be awkward to interpret in the case of a chain of signal handlers.
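
For reference, pulling the interrupted sp and pc out of that context is mechanical on linux/amd64, assuming the standard ucontext_t layout (readWord is a hypothetical helper that reads 8 bytes of target memory from the core):

// Offsets into the linux/amd64 ucontext_t handed to a signal handler:
// uc_flags (8) + uc_link (8) + uc_stack (24) put uc_mcontext at byte 40,
// and its gregs array stores RSP at index 15 and RIP at index 16.
const (
	ucMcontext = 40
	regRSP     = ucMcontext + 15*8
	regRIP     = ucMcontext + 16*8
)

// spPCFromContext recovers the interrupted SP and PC from the signal
// context, the same state the runtime wraps in its sigctxt type.
func spPCFromContext(ctx uint64, readWord func(addr uint64) (uint64, error)) (sp, pc uint64, err error) {
	if sp, err = readWord(ctx + regRSP); err != nil {
		return
	}
	pc, err = readWord(ctx + regRIP)
	return
}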

aclements commented May 2, 2018

> I still kinda think clearing out the signal handler in dieFromSignal is reasonable.

I think the concern here is if the signal needs to be forwarded. E.g., if you have some SIGABRT-trapping crash handling service installed, we want the signal to get forwarded to it. Maybe dieFromSignal could check if we're forwarding it (which we never will be in pure Go programs) and, if not, go straight to its fallback path that clears the handler and raises?
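
The rough shape of that suggestion, as a sketch only (setsig, raise, and _SIG_DFL are the runtime's internal primitives, and sigIsForwarded is a hypothetical stand-in for however "are we forwarding this signal?" gets answered):

func dieFromSignal(sig uint32) {
	if !sigIsForwarded(sig) {
		// Nothing to forward to (the pure Go case), so make the very
		// first delivery fatal: the core dump then captures the frame
		// that raised, not a second trip through the signal handler.
		setsig(sig, _SIG_DFL)
		raise(sig)
	}
	// ... otherwise forward as today, keeping crash reporters working ...
}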

cc @ianlancetaylor @bcmills

ianlancetaylor commented May 2, 2018

It is not so crazy to use the signal context to figure out where a signal handler was invoked. That is what gdb does. The key point is to reliably determine whether you are running in a signal handler. I think gdb does that by recognizing the signal trampoline that is on the return stack, which resumes normal execution if the signal handler returns.
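
A sketch of that recognition step as a debugger might do it for a Go binary, assuming the trampoline can be identified by symbol (symRange is a hypothetical symbol-table lookup):

// inSignalFrame reports whether a frame's return address lands inside
// the signal-return trampoline (runtime.sigreturn on linux/amd64); if
// it does, the frames below were interrupted by a signal and execution
// resumes through the saved signal context rather than a normal return.
func inSignalFrame(retaddr uint64, symRange func(name string) (lo, hi uint64, ok bool)) bool {
	lo, hi, ok := symRange("runtime.sigreturn")
	return ok && retaddr >= lo && retaddr < hi
}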

aclements commented May 2, 2018

> The only place you could find it is the ctx argument to the signal handler (the thing we wrap a sigctxt around), and even that will be awkward to interpret in the case of a chain of signal handlers.

I don't think the chained signal handlers are a problem. We don't manipulate the context when calling down the chain.

I was thinking we might be able to stash this away in g.sched.sp when we enter the signal handler, but there's still a window between switching to the signal stack and doing that, and it's possible we can receive a signal while g.sched.sp is set to something important.

ianlancetaylor commented May 2, 2018

I don't think we need to do anything in dieFromSignal; I think that in crash we could clear the signal handler. At least at first glance, I'm not worried about forwarding the signal once we are calling crash.

bcmills commented May 2, 2018

If we don't forward the signal, we potentially bypass crash reporters (particularly on iOS and Android) and skew people's crash metrics.

But I think I'm missing something here:

> the thread is on the signal handler stack when it dies, not the goroutine stack

The saved base pointer in the root frame of the signal stack should point back to the goroutine stack, right? (Do we know why the debuggers aren't following that base pointer? Am I just confused?)
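
As a sketch, following that chain from a core file would look something like this (readWord is again a hypothetical 8-byte target-memory read; on amd64, [BP] holds the caller's saved BP and [BP+8] the return address):

// walkFramePointers collects return addresses by chasing saved base
// pointers, which is how an unwinder can hop from the signal stack
// back onto the goroutine stack when frame pointers are intact.
func walkFramePointers(bp uint64, readWord func(addr uint64) (uint64, error)) ([]uint64, error) {
	var pcs []uint64
	for bp != 0 {
		ret, err := readWord(bp + 8) // return address sits just above the saved BP
		if err != nil {
			return pcs, err
		}
		pcs = append(pcs, ret)
		next, err := readWord(bp) // saved BP of the caller's frame
		if err != nil {
			return pcs, err
		}
		if next <= bp { // never walk toward smaller addresses (gdb's cycle guard)
			break
		}
		bp = next
	}
	return pcs, nil
}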

heschik commented May 2, 2018

Since that was sort of addressed to me, I'll say that, looking with GDB, I think you're right that following the base pointer chain gets you to the right stack, and that seems better than using the signal handler context. Based on Stack Overflow, I guess GDB only follows base pointers when it's out of other options.

ianlancetaylor commented May 2, 2018

Following the base pointer will presumably only work when the base pointer is used by the code being interrupted, so it will be unreliable if the signal fires while executing C code compiled with -fomit-frame-pointer.

aclements commented May 3, 2018

I think gdb also won't follow a frame pointer if it points to a smaller address, to avoid walking corrupted stacks or going in cycles.

But why doesn't it unwind through the signal context? That I would expect to work here.

aarzilli commented May 3, 2018

Maybe it's because you aren't telling it where you saved rdx on the stack?

aarzilli commented May 3, 2018

Never mind, I see now that x86_64_fallback_frame_state always knows how to get the context. My next guess is that, if gdb's documentation is correct, x86_64_fallback_frame_state is only used when no FDE is found, which is not the case for runtime.sigreturn.
This was also the problem for Delve: if I ignore runtime.sigreturn's FDE, it just follows the frame pointer and ends up back on the right stack.

heschik commented May 25, 2018

@aarzilli, did you want anything else here?

aarzilli commented May 25, 2018

I'm satisfied with blacklisting runtime.sigreturn's FDE on Delve's side. I've left this issue open because I suspect it would be useful to other debuggers to remove that FDE from the executable entirely.
