New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: crash with "invalid pc-encoded table" #27540

Open
crxpandion opened this Issue Sep 6, 2018 · 21 comments

Comments

Projects
None yet
6 participants
@crxpandion

crxpandion commented Sep 6, 2018

What version of Go are you using (go version)?

go 1.11

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOPATH="/home/users/kai.hayashi/go"
GORACE=""
GOROOT="/usr/local/go"
GOTOOLDIR="/usr/local/go/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build387775402=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"

What did you do?

I am running a fairly run-of-the-mill production api server. I have observed one crash so far on this code since upgrading to go1.11 with fatal error: invalid runtime symbol table. I am running a library in this server code that periodically captures a cpu, block, and mutex profile every few minutes. The sample rates on the mutex profile is 1/10 and 1 block profile every second.

What did you expect to see?

No crashes

What did you see instead?

runtime: invalid pc-encoded table [...]
fatal error: invalid runtime symbol table
goroutine 0 [idle]:
runtime.throw(0x10501de, 0x1c)
      /usr/local/go/src/runtime/panic.go:608 +0x72 fp=0xc00064f258 sp=0xc00064f228 pc=0x42c2b2
runtime.pcvalue(0x179ed90, 0x19f6800, 0xXXXXXXXX005ea0e7, 0xaaf94c, 0xc00064f480, 0xXXXXXXXXXXXXXX01, 0xXXXXXXXX00000000)
      /usr/local/go/src/runtime/symtab.go:791 +0x50a fp=0xc00064f300 sp=0xc00064f258 pc=0x449d3a
runtime.funcspdelta(0x179ed90, 0x19f6800, 0xaaf94c, 0xc00064f480, 0xXXXXXXXX00000000)
     /usr/local/go/src/runtime/symtab.go:843 +0x5f fp=0xc00064f370 sp=0xc00064f300 pc=0x44a1ef
runtime.gentraceback(0x45d5d5, 0xc0005f3e00, 0x0, 0xc0004d1080, 0x0, 0xc00064f6d8, 0x40, 0x0, 0x0, 0x6, ...)
      /usr/local/go/src/runtime/traceback.go:208 +0x1828 fp=0xc00064f678 sp=0xc00064f370 pc=0x450ca8
runtime.sigprof(0x45d5d5, 0xc0005f3e00, 0x0, 0xc0004d1080, 0xc000244380)
      /usr/local/go/src/runtime/proc.go:3823 +0x4b6 fp=0xc00064f928 sp=0xc00064f678 pc=0x436cd6
runtime.sighandler(0xc00000001b, 0xc00064fbf0, 0xc00064fac0, 0xc0004d1080)
      /usr/local/go/src/runtime/signal_sighandler.go:38 +0x73c fp=0xc00064f9b8 sp=0xc00064f928 pc=0x440dec

This crash is reminiscent of a bug I was hoping would be fixed with the 1.11 release related to #24925.
The crashes do seem less frequent than when I was running the same code under 1.10.

CC @heschik and @rhysh

@crxpandion crxpandion changed the title from runtime: crash with "invalid ps-encoded table" to runtime: crash with "invalid pc-encoded table" Sep 6, 2018

@heschik

This comment has been minimized.

Contributor

heschik commented Sep 6, 2018

Yeah, this is pretty similar. I don't think I can do anything at all about this without seeing the part of the error you've hidden behind [...] and 0xXXXXXX though.

@heschik

This comment has been minimized.

Contributor

heschik commented Sep 6, 2018

...actually, reading back through the internal bug where we diagnosed this, I think that error might not be relevant. (All the logs I was looking have expired so I'm trying to piece it back together from memory.) What we need is the stack that was being scanned by gentraceback. So, starting with a core file and a crash dump like the above:

  • What is the function at the first argument of sigprof, above 0x45d5d5?
  • What do the first, say, 64 words of the stack starting at the second argument of sigprof, above 0xc0005f3e00, look like? Many of them will be saved return PCs, in which case it would help to have them annotated with the functions they're in.

@andybons andybons added this to the Unplanned milestone Sep 6, 2018

@crxpandion

This comment has been minimized.

crxpandion commented Sep 6, 2018

(gdb) x 0x45d5d5
0x45d5d5 <runtime.nanotime+69>:	0x00f88348

(gdb) x/64xg 0xc0005f3e00
0xc0005f3e00:	0x0000000000aaf94c	0x000000003713e20f
0xc0005f3e10:	0x000000c0005f3e68	0x0000000000430000 <runtime.scang+192 at /usr/local/go/src/runtime/proc.go:981>:	0x244c8b4824048b48
0xc0005f3e20:	0x0027ced50000000a	0x0000000200001002
0xc0005f3e30:	0x0000000000000001	0x000000c0005f3ea8
0xc0005f3e40:	0x0027ced59b81e2c3	0x0000000000000cbc
0xc0005f3e50:	0x0027ced59b81e2c3	0x00000000019ed033 <runtime.oneptrmask>:	0x4f7d7b007c0a0101
0xc0005f3e60:	0x000000c012a66fb4	0x000000c0005f3ea8
0xc0005f3e70:	0x000000000045776d <runtime.markroot.func1+109 at /usr/local/go/src/runtime/mgcmark.go:264>:	0x75c084172444b60f	0x000000c012a66f00
0xc0005f3e80:	0x000000c000063c70	0x00007f6c035a2100
0xc0005f3e90:	0x00007f6c03f71248	0x000000c000206780
0xc0005f3ea0:	0x00007f6c03f712d0	0x000000c0005f3f28
0xc0005f3eb0:	0x000000000041d009 <runtime.markroot+777 at /usr/local/go/src/runtime/mgcmark.go:189>:	0x03f983fffffe07e9	0x000000c0005f3f10
0xc0005f3ec0:	0x000000000000002e	0x00007f6c068e7cc3
0xc0005f3ed0:	0x000000c000063c70	0x0000000000000000
0xc0005f3ee0:	0x0000000300000024	0x0000000000000003
0xc0005f3ef0:	0x0000000000000001	0x0000000000000000
0xc0005f3f00:	0x000000c000048f00	0x000000c00000c000
0xc0005f3f10:	0x0000000000457700 <runtime.markroot.func1 at /usr/local/go/src/runtime/mgcmark.go:245>:0xfffff8250c8b4864	0x000000c012a66f00
0xc0005f3f20:	0x000000c000063c70	0x000000c0005f3f80
0xc0005f3f30:	0x000000000041e507 <runtime.gcDrain+279 at /usr/local/go/src/runtime/mgcmark.go:883>:	0xd285484024548b48	0x000000c000063c70
0xc0005f3f40:	0x0000000000000c33	0x0000000000000000
0xc0005f3f50:	0x7fffffffffffffff	0x000000000042f295 <runtime.ready+165 at /usr/local/go/src/runtime/proc.go:663>:	0xc085016076d5058b
0xc0005f3f60:	0x000000c000048f00	0x000000c00f766d80
0xc0005f3f70:	0x000000c000206780	0x0000000000000000
0xc0005f3f80:	0x000000c0005f3fc0	0x00000000004575b0 <runtime.gcBgMarkWorker.func2+128 at /usr/local/go/src/runtime/mgc.go:1840>:	0x4800841824448b48
0xc0005f3f90:	0x000000c000063c70	0x0000000000000005
0xc0005f3fa0:	0x0000000000000004	0x000000c000206780
0xc0005f3fb0:	0x000000c000062a00	0x000000c000063c70
0xc0005f3fc0:	0x000000c0002eff58	0x0000000000459c76 <runtime.systemstack+102 at /usr/local/go/src/runtime/asm_amd64.s:355>:	0xfffff825048b4864
0xc0005f3fd0:	0x0000000000430b60 <runtime.mstart at /usr/local/go/src/runtime/proc.go:1229>:	0x246c894820ec8348	0xXXXXXXXXXXXXXXXX
0xc0005f3fe0:	0xXXXXXXXXXXXXXXXX 	0x000000c0005f2000
0xc0005f3ff0:	0x000000c0005a9ed0	0x000000000045da33 <runtime.clone+115 at /usr/local/go/src/runtime/sys_linux_amd64.s:586>:	0x003cb80000006fbf

Sorry about the 0xXX stuff - trying to avoid exposing memory with ascii.

@heschik

This comment has been minimized.

Contributor

heschik commented Sep 6, 2018

No problem.

This looks exactly the same as the previous bug, which is worrying. gentraceback was called from

go/src/runtime/proc.go

Lines 3822 to 3823 in 41e62b8

} else if traceback {
n = gentraceback(pc, sp, lr, gp, 0, &stk[0], len(stk), nil, nil, _TraceTrap|_TraceJumpStack)
which means traceback was true. That means that either mp was nil or vdsoSP was 0. We can see in the backtrace that mp (last arg to sigprof) was 0xc000244380, so vsdoSP must have been 0. (You could confirm this by printing *mp in gdb.) vdsoSP is supposed to be set in nanotime well before the stack alignment that confuses the backtrace:
MOVQ DX, m_vdsoPC(BX)

so I don't get it. Maybe the pcsp table is wrong even before the stack alignment? What does disass 0x45d5d5 look like? This is unmodified go 1.11?

@crxpandion

This comment has been minimized.

crxpandion commented Sep 7, 2018

(gdb) disass 0x45d5d5
Dump of assembler code for function runtime.nanotime:
   0x000000000045d590 <+0>:	mov    %rsp,%rbp
   0x000000000045d593 <+3>:	mov    %fs:0xfffffffffffffff8,%rax
   0x000000000045d59c <+12>:	mov    0x30(%rax),%rbx
   0x000000000045d5a0 <+16>:	mov    (%rsp),%rdx
   0x000000000045d5a4 <+20>:	mov    %rdx,0x330(%rbx)
   0x000000000045d5ab <+27>:	lea    (%rsp),%rdx
   0x000000000045d5af <+31>:	mov    %rdx,0x328(%rbx)
   0x000000000045d5b6 <+38>:	cmp    0xc0(%rbx),%rax
   0x000000000045d5bd <+45>:	jne    0x45d5c6 <runtime.nanotime+54 at /usr/local/go/src/runtime/sys_linux_amd64.s:254>
   0x000000000045d5bf <+47>:	mov    (%rbx),%rdx
   0x000000000045d5c2 <+50>:	mov    0x38(%rdx),%rsp
   0x000000000045d5c6 <+54>:	sub    $0x10,%rsp
   0x000000000045d5ca <+58>:	and    $0xfffffffffffffff0,%rsp
   0x000000000045d5ce <+62>:	mov    0x15f90c3(%rip),%rax        # 0x1a56698 <runtime.vdsoClockgettimeSym>
   0x000000000045d5d5 <+69>:	cmp    $0x0,%rax
   0x000000000045d5d9 <+73>:	je     0x45d60d <runtime.nanotime+125 at /usr/local/go/src/runtime/sys_linux_amd64.s:274>
   0x000000000045d5db <+75>:	mov    $0x1,%edi
   0x000000000045d5e0 <+80>:	lea    (%rsp),%rsi
   0x000000000045d5e4 <+84>:	callq  *%rax
   0x000000000045d5e6 <+86>:	mov    (%rsp),%rax
   0x000000000045d5ea <+90>:	mov    0x8(%rsp),%rdx
   0x000000000045d5ef <+95>:	mov    %rbp,%rsp
   0x000000000045d5f2 <+98>:	movq   $0x0,0x328(%rbx)
   0x000000000045d5fd <+109>:	imul   $0x3b9aca00,%rax,%rax
   0x000000000045d604 <+116>:	add    %rdx,%rax
   0x000000000045d607 <+119>:	mov    %rax,0x8(%rsp)
   0x000000000045d60c <+124>:	retq   
   0x000000000045d60d <+125>:	lea    (%rsp),%rdi
   0x000000000045d611 <+129>:	mov    $0x0,%rsi
   0x000000000045d618 <+136>:	mov    0x158ff39(%rip),%rax        # 0x19ed558 <runtime.vdsoGettimeofdaySym>
   0x000000000045d61f <+143>:	callq  *%rax
   0x000000000045d621 <+145>:	mov    (%rsp),%rax
   0x000000000045d625 <+149>:	mov    0x8(%rsp),%edx
   0x000000000045d629 <+153>:	mov    %rbp,%rsp
   0x000000000045d62c <+156>:	movq   $0x0,0x328(%rbx)
   0x000000000045d637 <+167>:	imul   $0x3e8,%rdx,%rdx
   0x000000000045d63e <+174>:	imul   $0x3b9aca00,%rax,%rax
   0x000000000045d645 <+181>:	add    %rdx,%rax
   0x000000000045d648 <+184>:	mov    %rax,0x8(%rsp)
   0x000000000045d64d <+189>:	retq   
End of assembler dump.

yes this is unmodified go1.11

@heschik

This comment has been minimized.

Contributor

heschik commented Sep 7, 2018

Cool! This makes absolutely no sense to me. Can you print/x *mp in the sigprof frame? I'd like to confirm that it generally matches the other args passed to sigprof.

@crxpandion

This comment has been minimized.

crxpandion commented Sep 7, 2018

(gdb) x 0xc000244380
0xc000244380:	0x004d1080
(gdb) x 0x004d1080
0x4d1080 <fmt.(*ss).hexByte+464>:	0xfff38a4c

I have not been able to drop into the sigprof frame (which is weird to me a thread apply all bt does not contain sigprof) but I admit gdb confounds me.

@heschik

This comment has been minimized.

Contributor

heschik commented Sep 7, 2018

Okay, then please do (from memory):

set lang c
print/x *(('runtime.m'*)0xc000244380)

which should print the structure of an m with contents that make sense.

@crxpandion

This comment has been minimized.

crxpandion commented Sep 7, 2018

(gdb) print/x *(('runtime.m'*)0xc000244380)
$1 = {g0 = 0xc0004d1080, morebuf = {sp = 0x0, pc = 0x0, g = 0x0, ctxt = 0x0, 
    ret = 0x0, lr = 0x0, bp = 0x0}, divmod = 0x0, procid = 0x7055, 
  gsignal = 0xc0004d0f00, goSigStack = {stack = {lo = 0x0, hi = 0x0}, 
    stackguard0 = 0x0, stackguard1 = 0x0, stktopsp = 0x0}, sigmask = {0x0, 
    0x0}, tls = {0xc0004d0f00, 0x0, 0x0, 0x0, 0x0, 0x0}, mstartfn = 0xc0, 
  curg = 0xc000206780, caughtsig = 0x0, p = 0xc000062a00, nextp = 0x0, 
  id = 0x19, mallocing = 0x2, throwing = 0x1, preemptoff = {str = 0x0, 
    len = 0x0}, locks = 0x1, dying = 0x1, profilehz = 0x64, helpgc = 0x0, 
  spinning = 0x0, blocked = 0x0, inwb = 0x0, newSigstack = 0x1, 
  printlock = 0x0, incgo = 0x0, freeWait = 0x0, fastrand = {0x80981164, 
    0x4ba9cbdf}, needextram = 0x0, traceback = 0x0, ncgocall = 0x0, 
  ncgo = 0x0, cgoCallersUse = 0x0, cgoCallers = 0x0, park = {key = 0x0}, 
  alllink = 0xc00049d880, schedlink = 0xc0001a3500, mcache = 0x7f6c068ea768, 
  lockedg = 0x0, createstack = {0x0 <repeats 32 times>}, lockedExt = 0x0, 
  lockedInt = 0x0, nextwaitm = 0x0, waitunlockf = 0x0, waitlock = 0x0, 
  waittraceev = 0x19, waittraceskip = 0x4, startingtrace = 0x0, 
  syscalltick = 0x6d522d4, thread = 0x0, freelink = 0x0, libcall = {
    fn = 0x0, n = 0x0, args = 0x0, r1 = 0x0, r2 = 0x0, err = 0x0}, 
  libcallpc = 0x0, libcallsp = 0x0, libcallg = 0x0, syscall = {fn = 0x0, 
    n = 0x0, args = 0x0, r1 = 0x0, r2 = 0x0, err = 0x0}, vdsoSP = 0x0, 
  vdsoPC = 0x452a8a, mOS = {<No data fields>}}
@heschik

This comment has been minimized.

Contributor

heschik commented Sep 7, 2018

Okay. I don't understand how vdsoSP can possibly be 0 given that it would have been set immediately before the point at which the signal was caught.

@aclements, @ianlancetaylor, any ideas?

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Sep 7, 2018

What is the function at address 0x452a8a?

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Sep 7, 2018

This is very unlikely, but does your program ever call runtime.SetCgoTraceback? Does your program use cgo at all?

@crxpandion

This comment has been minimized.

crxpandion commented Sep 7, 2018

(gdb) x 0x452a8a
0x452a8a <runtime.goroutineheader+538>:	0x24048b48
@crxpandion

This comment has been minimized.

crxpandion commented Sep 7, 2018

This program doesn't use cgo, but it does mmap a file.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Sep 7, 2018

The value of m.vdsoPC shows that nanotime was called by goroutineheader. The value of g passed to sighandler and the printout of *mp shows that the program was running on g0 when the signal occurred. The fact that the traceback field of *mp is 0 means that goroutineheader was not called by runtime.Stack. As far as I can tell the only other ways that we could see a call to goroutineheader all involve some sort of program crash or panic, but in that case I would have expected that something else would be printed out first. And you would have told us in that case, right?

My conclusion so far is like @heschik 's: this is impossible. Unless somehow mp.g0.m != mp, where mp is the value 0xc000244380 you printed out above. But that too should be impossible.

@crxpandion

This comment has been minimized.

crxpandion commented Sep 7, 2018

I definitely would have told you if I found something else printed out. :)
Maybe the best thing to do is to see if another of these happens and try and get more information from a second crash.

@mark-rushakoff

This comment has been minimized.

Contributor

mark-rushakoff commented Sep 7, 2018

Since nobody else has asked yet: are any data races detected if you run your executable built with -race?

@crxpandion

This comment has been minimized.

crxpandion commented Sep 7, 2018

no races detected. the frequency of crashes observed so far is low (only 1 per week so far under reasonable load) as well.

@cherrymui

This comment has been minimized.

Contributor

cherrymui commented Sep 8, 2018

I noticed that in sys_linux_amd64.s,

        LEAQ    ret+0(SP), DX
        MOVQ    DX, m_vdsoSP(BX)

the ret+0(SP) should probably be ret+0(FP).

But even so, I don't understand how vdsoSP could be 0.

@ianlancetaylor

This comment has been minimized.

Contributor

ianlancetaylor commented Sep 8, 2018

I didn't spell it out above, but vdsoSP could be 0 if something in the signal handler, or anything else using the same m, called nanotime between the time that the profiling signal was received and the time that we started asking for the traceback. The value of vdsoPC makes it clear that the function that called nanotime was goroutineheader. The question is whether goroutineheader was called before or after the profiling signal (if after, it was presumably when the profiling signal occurred during a different execution of nanotime).

The problem is that no matter what I don't see any way that goroutineheader could have called nanotime without causing other side effects. So clearly I'm missing something.

@heschik

This comment has been minimized.

Contributor

heschik commented Sep 10, 2018

Sorry, I think it was a mistake to ask for the printout of mp, it's a red herring. I believe this is a core dump for a process that threw with GOTRACEBACK=crash, so all the output we're getting is after the crash dump was printed. I didn't realize that the process of crashing would make a vdso call -- since it does, everything in m will have been overwritten.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment