Skip to content

runtime: corrupt gp.syscallsp after cgo call #9875

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
IvanUkhov opened this issue Feb 14, 2015 · 9 comments
Closed

runtime: corrupt gp.syscallsp after cgo call #9875

IvanUkhov opened this issue Feb 14, 2015 · 9 comments

Comments

@IvanUkhov
Copy link
Contributor

Hello,

I’m experiencing a panic when calling a C function. It happens only occasionally, and, unfortunately, I don’t have a sensibly small piece of code that could reproduce it. I’ve been trying to trace what happens and have even switched to master (from 1.4.1).

Let me please explain it the way I see it.

  1. cgo inserts a call to cgocall_errno after each invocation of a C function.
  2. cgocall_errno finishes by calling exitsyscall.
  3. exitsyscall throws an exception “syscall frame is no longer valid.”

However, I don’t see that message when the program panics. I discovered those calls in the stack trace. What actually happens next is as follows.

  1. throw tries to print the error message.
  2. Somehow morestack gets in the way and calls newstack.
  3. newstack throws “stack split at bad time.”

And this is the error message that is printed in the terminal.

I would be grateful for any feedback. Thank you.

Regards,
Ivan

go version devel +b7bfb54 Sat Feb 14 04:22:56 2015 +0000 linux/amd64
@dvyukov
Copy link
Member

dvyukov commented Feb 14, 2015

Replace

    if getcallersp(unsafe.Pointer(&dummy)) > _g_.syscallsp {
        throw("exitsyscall: syscall frame is no longer valid")
    }

with

    if getcallersp(unsafe.Pointer(&dummy)) > _g_.syscallsp {
                systemstack(func() {
            println("exitsyscall: syscall frame is no longer valid", unsafe.Pointer(&dummy)), g_.syscallsp)
            throw("exitsyscall: syscall frame is no longer valid")
        })
    }

What does it print?

@IvanUkhov
Copy link
Contributor Author

@dvyukov, thanks for the quick reply! I get:

exitsyscall: syscall frame is no longer valid 0xc20802eb08 0
fatal error: exitsyscall: syscall frame is no longer valid

runtime stack:
runtime.throw(0xbd5030, 0x2d)
    /home/ivauk/Software/go/src/runtime/panic.go:511 +0xa0
runtime.func·054()
    /home/ivauk/Software/go/src/runtime/proc1.go:1772 +0xa1
runtime.systemstack(0xc209b82390)
    /home/ivauk/Software/go/src/runtime/asm_amd64.s:249 +0x79
runtime.mstart()
    /home/ivauk/Software/go/src/runtime/proc1.go:667

goroutine 518 [running, locked to thread]:
runtime.systemstack_switch()
    /home/ivauk/Software/go/src/runtime/asm_amd64.s:203 fp=0xc20802eac8 sp=0xc20802eac0
runtime.exitsyscall(0x0)
    /home/ivauk/Software/go/src/runtime/proc1.go:1773 +0x71 fp=0xc20802eb08 sp=0xc20802eac8
runtime.cgocall_errno(0x43aa20, 0xc20802eb50, 0xc200000000)
    /home/ivauk/Software/go/src/runtime/cgocall.go:132 +0x138 fp=0xc20802eb30 sp=0xc20802eb08
…

@dvyukov
Copy link
Member

dvyukov commented Feb 14, 2015

OK, so g.syscallsp is 0. That's already something.
What OS are you running? Do your cgo calls do anything special? For example, callbacks? callbacks on non-Go threads? clone syscall calls?

@IvanUkhov
Copy link
Contributor Author

Here is my OS:

$ uname -a
Linux fermi 3.10.10-1-ARCH #1 SMP PREEMPT Fri Aug 30 11:30:06 CEST 2013 x86_64 GNU/Linux

The machine has around 23G of RAM and 16 cores like this:

$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 26
model name  : Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz
stepping    : 5
microcode   : 0xb
cpu MHz     : 1600.000
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 0
cpu cores   : 4
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 11
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ida dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips    : 4534.90
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:
…

And I’m trying to use them all. So I have 16 goroutines calling a C function; the CPU utilization is around 1600%. The function just does some linear algebra and nothing of what you mentioned.

@dvyukov
Copy link
Member

dvyukov commented Feb 14, 2015

Please try the following patch.
I don't see how it can fix the crash. But it's just a suspicious place.

--- a/src/runtime/proc1.go
+++ b/src/runtime/proc1.go
@@ -1651,6 +1651,8 @@ func reentersyscall(pc, sp uintptr) {
        if atomicload(&sched.sysmonwait) != 0 { // TODO: fast atomic
                systemstack(entersyscall_sysmon)
                save(pc, sp)
+               _g_.syscallsp = sp
+               _g_.syscallpc = pc
        }

        _g_.m.syscalltick = _g_.m.p.syscalltick
@@ -1660,6 +1662,8 @@ func reentersyscall(pc, sp uintptr) {
        if sched.gcwaiting != 0 {
                systemstack(entersyscall_gcwait)
                save(pc, sp)
+               _g_.syscallsp = sp
+               _g_.syscallpc = pc
        }

@IvanUkhov
Copy link
Contributor Author

Patched. Strange things started to happen. I got another “stack split at bad time,” but this time in a completely different part of my program where I spin off my goroutines by go func() { … }(). I also got one “Segmentation fault” without any other output; I haven’t seen this one before for sure. Also,

runtime: g318: leftover defer sp=0x0 pc=0x0
    defer 0xc208c74000 sp=0xc20a8d5b08 pc=0x9b083c
fatal error: traceback has leftover defers

runtime stack:
runtime.throw(0xbcb3d0, 0x1d)
    /home/ivauk/Software/go/src/runtime/panic.go:511 +0xa0
runtime.gentraceback(0x9ce3b0, 0xc2080192d0, 0x0, 0xc209f91c20, 0x0, 0x0, 0x7fffffff, 0xbf5ee8, 0x7f6cc27fbcf8, 0x0, ...)
    /home/ivauk/Software/go/src/runtime/traceback.go:430 +0x749
runtime.copystack(0xc209f91c20, 0x1000)
    /home/ivauk/Software/go/src/runtime/stack1.go:570 +0x18f
runtime.newstack()
    /home/ivauk/Software/go/src/runtime/stack1.go:758 +0xbc5
runtime.morestack()
    /home/ivauk/Software/go/src/runtime/asm_amd64.s:317 +0x82

goroutine 318 [copystack]:
runtime.mProf_Malloc(0xc20b6d30e0, 0xa0)
    /home/ivauk/Software/go/src/runtime/mprof.go:233 fp=0xc2080192d8 sp=0xc2080192d0
runtime.profilealloc(0xc2080d6400, 0xc20b6d30e0, 0xa0)
    /home/ivauk/Software/go/src/runtime/malloc.go:310 +0x9b fp=0xc208019300 sp=0xc2080192d8
runtime.mallocgc(0xa0, 0xb138e0, 0xc200000001, 0xc2080a2960)
    /home/ivauk/Software/go/src/runtime/malloc.go:234 +0x43b fp=0xc2080193b8 sp=0xc208019300
runtime.newarray(0xb138e0, 0x14, 0xc2080a2960)
    /home/ivauk/Software/go/src/runtime/malloc.go:276 +0xc8 fp=0xc2080193f0 sp=0xc2080193b8
runtime.makeslice(0xb09ac0, 0x14, 0x14, 0x0, 0x0, 0x0)
    /home/ivauk/Software/go/src/runtime/slice.go:32 +0x16c fp=0xc208019438 sp=0xc2080193f0
…

Also,

invalid m->locked = 2
fatal error: internal lockOSThread error

runtime stack:
runtime.throw(0xbc85d0, 0x1b)
    /home/ivauk/Software/go/src/runtime/panic.go:511 +0xa0
runtime.goexit0(0xc20aa76fc0)
    /home/ivauk/Software/go/src/runtime/proc1.go:1564 +0x15c
runtime.mcall(0x0)
    /home/ivauk/Software/go/src/runtime/asm_amd64.s:191 +0x5e
…

And sometimes everything works just fine. Maybe it’s not specific to cgo after all. I guess I’m doing something horrible somewhere without realizing it. I need to carefully go through my code and to spend more time experimenting with my setup. Right now it looks rather random.

@dvyukov, thank you for your help!

@mikioh mikioh changed the title Fatal error: “syscall frame is no longer valid” followed by “stack split at bad time” runtime: “fatal error: exitsyscall: syscall frame is no longer valid” followed by “stack split at bad time” Feb 15, 2015
@dvyukov
Copy link
Member

dvyukov commented Feb 16, 2015

I agree that it looks like a heap corruption.
Things to look for:

  • data races (run with -race if not yet)
  • look for incorrect usages of unsafe
  • look for incorrect usages of cgo, in particular if you pass and manipulate Go pointers in C, or use finalizers to free C resources

@IvanUkhov
Copy link
Contributor Author

I’ve tried running my code on another machine, and I haven’t observed anything similar there. Everything works as it should. Might be related to #9906.

@rsc rsc changed the title runtime: “fatal error: exitsyscall: syscall frame is no longer valid” followed by “stack split at bad time” runtime: corrupt gp.syscallsp after cgo call Apr 10, 2015
@rsc rsc added this to the Go1.5Maybe milestone Apr 10, 2015
@rsc
Copy link
Contributor

rsc commented Jul 21, 2015

I suspect this is the same as #9906 - bad hardware.

@rsc rsc closed this as completed Jul 21, 2015
@golang golang locked and limited conversation to collaborators Jul 20, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants