New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cmd/compile: ppc64le broken by encourage inlining of functions with single-call bodies #28679

Open
bradfitz opened this Issue Nov 8, 2018 · 13 comments

Comments

Projects
None yet
5 participants
@bradfitz
Member

bradfitz commented Nov 8, 2018

https://go-review.googlesource.com/c/go/+/147361 broke ppc64le.

https://build.golang.org/log/cee992f2c28faffaa1c9526349f4923f9cb16c83

##### GOMAXPROCS=2 runtime -cpu=1,2,4 -quick
--- FAIL: TestGdbPython (0.58s)
    runtime-gdb_test.go:66: gdb version 7.7
    runtime-gdb_test.go:213: gdb output: Loading Go Runtime support.
        Loaded  Script                                                                 
        Yes     /tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/runtime-gdb.py      
        Breakpoint 1 at 0x92cac: file /tmp/go-build132537636/main.go, line 15.
        hi
        
        Breakpoint 1, main.main () at /tmp/go-build132537636/main.go:17
        17		gslice = slicevar
        BEGIN info goroutines
        * 1 running  syscall.Syscall
          2 runnable runtime.forcegchelper
          17 waiting  runtime.gopark
          18 runnable runtime.runfinq
        END
        BEGIN print mapvar
        $1 = map[string]string = {["abc"] = "def", ["ghi"] = "jkl"}
        END
        BEGIN print strvar
        $2 = "abc"
        END
        BEGIN info locals
        mapvar = map[string]string = {["abc"] = "def", ["ghi"] = "jkl"}
        strvar = "abc"
        slicevar =  []string
        END
        BEGIN goroutine 1 bt
        #0  main.main () at /tmp/go-build132537636/main.go:17
        END
        BEGIN goroutine 2 bt
        #0  runtime.forcegchelper () at /tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/proc.go:242
        #1  0x0000000000092cac in fmt.Println (a=..., n=<optimized out>, err=...) at /tmp/workdir-host-linux-ppc64le-osu/go/src/fmt/print.go:275
        #2  main.main () at /tmp/go-build132537636/main.go:14
        Traceback (most recent call last):
          File "/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/runtime-gdb.py", line 536, in invoke
            gdb.parse_and_eval('$sp = $save_sp')
        gdb.error: Attempt to assign to an unmodifiable value.
        Error occurred in Python command: Attempt to assign to an unmodifiable value.
        END
        Breakpoint 2 at 0x92ce0: file /tmp/go-build132537636/main.go, line 18.
        runtime: newstack sp=0xc000032fc0 stack=[0xc000064000, 0xc000065000]
        	morebuf={pc:0x92cac sp:0xc000032fc0 lr:0x0}
        	sched={pc:0x38bb4 sp:0xc000032fc0 lr:0x92cac ctxt:0x0}
        runtime: gp=0xc000000180, goid=1, gp->status=0x2
         runtime: split stack overflow: 0xc000032fc0 < 0xc000064000
        fatal error: runtime: split stack overflow
        
        runtime stack:
        runtime.throw(0xd4b34, 0x1d)
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/panic.go:608 +0x5c
        runtime.newstack()
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/stack.go:995 +0x7a4
        runtime.morestack()
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/asm_ppc64x.s:333 +0x60
        
        goroutine 1 [running]:
        runtime.forcegchelper()
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/proc.go:242 +0x14 fp=0xc000032fc0 sp=0xc000032fc0 pc=0x38bb4
        runtime: unexpected return pc for main.main called from 0x0
        stack: frame={sp:0xc000032fc0, fp:0xc000033080} stack=[0xc000064000,0xc000065000)
        
        fmt.Println(...)
        	/tmp/workdir-host-linux-ppc64le-osu/go/src/fmt/print.go:275
        main.main()
        	/tmp/go-build132537636/main.go:14 +0x1ec fp=0xc000033080 sp=0xc000032fc0 pc=0x92cac
        [Inferior 1 (process 11715) exited with code 02]
        BEGIN goroutine 1 bt at the end
        No such goroutine:  1
        END
        
    runtime-gdb_test.go:286: goroutine 1 bt at the end failed: No such goroutine:  1
FAIL
FAIL	runtime	36.282s
@randall77

This comment has been minimized.

Contributor

randall77 commented Nov 8, 2018

Odd.
The backtrace for the background GC thread is all wrong:

BEGIN goroutine 2 bt
        #0  runtime.forcegchelper () at /tmp/workdir-host-linux-ppc64le-osu/go/src/runtime/proc.go:242
        #1  0x0000000000092cac in fmt.Println (a=..., n=<optimized out>, err=...) at /tmp/workdir-host-linux-ppc64le-osu/go/src/fmt/print.go:275
        #2  main.main () at /tmp/go-build132537636/main.go:14

The backtrace starts at a GC helper thread, then somehow jumps to the user's thread, at a nonsensical location.

Here's goroutine 2's stack trace from a correct run (on linux):

        BEGIN goroutine 2 bt
        #0  runtime.gopark (unlockf=<optimized out>, lock=0x752d20 <runtime.forcegc>, reason=16 '\020', traceEv=20 '\024', traceskip=1) at /usr/local/google/home/khr/sandbox/inl/src/runtime/proc.go:302
        #1  0x0000000000429766 in runtime.goparkunlock (lock=<optimized out>, reason=<optimized out>, traceEv=<optimized out>, traceskip=<optimized out>) at /usr/local/google/home/khr/sandbox/inl/src/runtime/proc.go:307
        #2  runtime.forcegchelper () at /usr/local/google/home/khr/sandbox/inl/src/runtime/proc.go:250
        #3  0x0000000000450671 in runtime.goexit () at /usr/local/google/home/khr/sandbox/inl/src/runtime/asm_amd64.s:1340
        #4  0x0000000000000000 in ?? ()
        END

The error isn't deterministic; sometimes it happens and sometimes it doesn't.
I think this issue is probably a bug in the underlying mid-stack inlining, not anything to do with my inlining change directly.
I will investigate further.
@davidlazar @aclements

@laboger

This comment has been minimized.

Contributor

laboger commented Nov 15, 2018

I haven't been able to reproduce this error on any of our systems, using various distros.
Is there something different about the way this run or the builder system?

I tried the full all.bash or just by doing this:
go tool dist test runtime:cpu124

@bradfitz

This comment has been minimized.

Member

bradfitz commented Nov 15, 2018

Our ppc64le builders run Debian jessie with kernel:

Linux go-le-1 3.16.0-4-powerpc64le #1 SMP Debian 3.16.7-ckt25-1 (2016-03-06) ppc64le GNU/Linux

@randall77

This comment has been minimized.

Contributor

randall77 commented Nov 15, 2018

Check gdb versions as well, that is likely to be part of the issue.

@bradfitz

This comment has been minimized.

Member

bradfitz commented Nov 15, 2018

$ gdb --version
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1
@laboger

This comment has been minimized.

Contributor

laboger commented Nov 15, 2018

I ran it on a comparable system but with a more recent kernel:
Linux lorentz 3.16.0-6-powerpc64le #1 SMP Debian 3.16.56-1+deb8u1 (2018-05-08) ppc64le GNU/Linux

gdb --version
GNU gdb (Debian 7.7.1+dfsg-5) 7.7.1

I don't understand the point of setting GOMAXPROCS=2 and then -test.cpu=1,2,4? Setting GOMAXPROCS=2 but -test.cpu=4 seems wrong. Could something be getting messed up because of that?

What is the ppc64 system? Seems strange that is not failing in this way.

@laboger

This comment has been minimized.

Contributor

laboger commented Nov 26, 2018

Could we get a newer distro and kernel on the ppc64le builder. This is the second time we've seen failures in the builder that don't happen on other distros. Debian 8 seems pretty old.

I don't work with OSU much so I don't know who your contact is for this system or how to make this request.

@bradfitz

This comment has been minimized.

Member

bradfitz commented Nov 26, 2018

Could we get a newer distro and kernel on the ppc64le builder.

Sorry, but that won't happen anytime soon. We have a backlog of builder work and this wouldn't be near the top of the priority list. I recommend just skipping GDB tests if the GDB or kernel is too old.

/cc @dmitshur

@laboger

This comment has been minimized.

Contributor

laboger commented Nov 27, 2018

It would be good if we could upgrade at some point. It's not a certainty that's the problem in this case, just a guess since it seems to work right so many places.

I built the programs for main.go, the program being run by gdb when it fails, before and after the bad commit. One thing I saw that looks suspicious is that starting in the commit where it started failing, Fprintln is now inlined from Println, so main doesn't directly call Println anymore. And fmt.Println is what shows up in the backtrace in the output when it fails. So I'm wondering if that is significant? I don't know what has to happen to the dwarf or whatever information gdb has to look at to figure out the backtrace in the case where inlining happened to get it right, but maybe there is something that causes gdb to look at uninitalized data causing random behavior? I also see that Println and Fprintln have variable argument lists, not common.

@laboger

This comment has been minimized.

Contributor

laboger commented Nov 30, 2018

I'm not sure what's happening on the ppc64le builder, but I can reproduce the exact same error output on any machine by doing this:

GOMAXPROCS=1 ./runtime.test -test.run=GdbPython$

I can also use gdb with the source for the program being run in the testcase and step through and see what's wrong. I don't think the problem is related to the compiler doing something wrong but because the testcase is different and behaves differently. I can try this test with an older compiler and get the same failure.

The problem occurs if forcegchelper is at the top of the stack for one of the goroutines and the goroutine's pc is at the start of the function. At this point the value for sp has not been set yet for forcegchelper, and since this is the initial function for a goroutine I'm not sure what sp would be at that point.

// start forcegc helper goroutine
func init() {
        go forcegchelper()
}

So when we have this situation, and 'goroutine 2 bt' is invoked for a goroutine with this at the top of its stack, things go wrong after that, probably because the sp does not make sense at this point. If I break at the end and do goroutine 2 bt no errors occur.

@gopherbot

This comment has been minimized.

gopherbot commented Dec 4, 2018

Change https://golang.org/cl/152540 mentions this issue: runtime: fix runtime-gdb.py when switching sp value

@laboger

This comment has been minimized.

Contributor

laboger commented Dec 5, 2018

To see the reason for the above change, here is a reference in the gdb documentation. https://sourceware.org/gdb/onlinedocs/gdb/Registers.html.
See the footnote where it is providing information on another way to set $sp.

This is a way of removing one word from the stack, on machines where stacks grow downward in memory (most machines, nowadays). This assumes that the innermost stack frame is selected; setting $sp is not allowed when other stack frames are selected.

@milanknezevic

This comment has been minimized.

Contributor

milanknezevic commented Dec 5, 2018

@laboger
The same issue can be reproduced on mipsle, but not on mips, I'm not sure why.
In our case, goroutine $num bt happens at point in main where $pc = $ra(LR register). When goroutine command sets the $pc it points at the start of runtime.runfinq (before the $ra is saved on stack), so the gdb uses $ra to unwind. At this point we have bad backtrace:

(gdb) goroutine 17 bt
#0  runtime.runfinq () 
#1  0x000a4ea0 in fmt.Println 
#2  main.main ()

and gdb frame command returns that main.main frame is current frame.

(gdb) frame
#2  main.main () 

Your fix helps, thank you.
I fixed it locally by saving/restoring $ra along with $pc and $sp. Not sure if that should be done for all non-x86 targets, regardless of this issue.

gopherbot pushed a commit that referenced this issue Dec 7, 2018

runtime: fix runtime-gdb.py when switching sp value
After a recent change to runtime-gdb_test.go the ppc64le builder
has had intermittent failures. The failures occur when trying to
invoke the goroutineCmd function to display the backtrace for
a selected goroutine. There is nothing wrong with the testcase
but it seems to intermittently leave goroutines in a state
where an error can occur.

The error message indicates that the problem occurs when trying
to change the sp back to the original after displaying the
stacktrace for the goroutine.

gdb.error: Attempt to assign to an unmodifiable value.

After some searching I found that this error message can happen
if the sp register is changed when on a frame that is not the
top-most frame. To fix the problem, frame 0 is selected before
changing the value of sp. This fixes the problem in my
reproducer environment, and hopefully will fix the problem on
the builder.

Updates #28679

Change-Id: I329bc95b30f8c95acfb161b0d9cfdcbd917a1954
Reviewed-on: https://go-review.googlesource.com/c/152540
Run-TryBot: Lynn Boger <laboger@linux.vnet.ibm.com>
Reviewed-by: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment