Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: gdb stepping behavior when debugging Go program on ppc64le changed when split-stack prologue marked non-preemptible #37126

Open
laboger opened this issue Feb 7, 2020 · 11 comments

Comments

@laboger
Copy link
Contributor

@laboger laboger commented Feb 7, 2020

What version of Go are you using (go version)?

$ go version
go1.14rc1

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
linux/ppc64le

What did you do?

Tried to debug Go program on ppc64le using gdb by breaking at a function's address then stepping through the first several instructions.

What did you expect to see?

Normal gdb behavior, gdb prompt returning within a reasonable amount of time (usually immediately) when stepping through code at the beginning of a function. This type of stepping works normally for programs compiled with Go 1.13.

I found that this behavior started with b2482e4, so with that knowledge I now know that if I break after the split-stack prologue code the hanging/delay doesn't happen and only happens with functions that have split-stack prologue.

What did you see instead?

Break at the beginning of a function that has a split-stack prolog using its address. After typing 'ni' to go to the next instruction, gdb hangs and the prompt can take a long time and sometimes does not return (or I didn't wait long enough).

@dmitshur

This comment has been minimized.

Copy link
Member

@dmitshur dmitshur commented Feb 14, 2020

@laboger

This comment has been minimized.

Copy link
Contributor Author

@laboger laboger commented Feb 24, 2020

So if I set GODEBUG=asyncpreemptoff=1 then the problem doesn't happen. I'm not entirely sure that it only happens at the beginning of a function.

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Mar 2, 2020

I don't see how this could be related specifically to marking the split-stack prologue non-preemptible (/cc @cherrymui), but I could see asynchronous preemption affecting this.

My hunch is that this is just an extension of a problem we've had for a long time. If you try to step over a split stack prologue and it actually grows the stack, GDB gets very confused because the SP changes dramatically, but GDB assumes the SP is a stable indication of what frame you're in. Though I'm not quite sure why asynchronous preemption would make that worse because that won't actually grow the stack (it can't). It may be that the preemption gets injected and switches to a different goroutine and the dramatic SP change of the goroutine switch confuses GDB. It wouldn't surprise me if GDB sees the SP increase dramatically from the goroutine switch and assumes you've exited the frame you were trying to next in, and it just gives up on nexting.

@laboger

This comment has been minimized.

Copy link
Contributor Author

@laboger laboger commented Mar 5, 2020

Here is what I just tried, using math.test built from the math package. I set a break at the start of the sin function.

Thread 1 "math.test" hit Breakpoint 1, math.asin (~r1=<optimized out>, x=0.77388724745781046) at /home/boger/golang/base/go/src/math/asin.go:21
21	func asin(x float64) float64 {
(gdb) ni
0x000000000008e324	21	func asin(x float64) float64 {
(gdb) ni
0x000000000008e328	21	func asin(x float64) float64 {
(gdb) ni  <---- hung here for a long time. top shows 100% in gdb
^C
Thread 2 "math.test" received signal SIGINT, Interrupt.
[Switching to LWP 92085]
0x00000000000741f0 in runtime.usleep () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:134
134		SYSCALL	$SYS_nanosleep
(gdb) x/i 0x8e328
   0x8e328 <math.asin+8>:	blt     0x8e338 <math.asin+24>
(gdb) bt
#0  0x00000000000741f0 in runtime.usleep () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:134
#1  0x0000000000050240 in runtime.sysmon () at /home/boger/golang/base/go/src/runtime/proc.go:4468
#2  0x0000000000047af4 in runtime.mstart1 () at /home/boger/golang/base/go/src/runtime/proc.go:1098
#3  0x0000000000047a00 in runtime.mstart () at /home/boger/golang/base/go/src/runtime/proc.go:1063
#4  0x00000000000747ec in runtime.clone () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:566

If I set GODEBUG=asyncpreemptoff=1 this doesn't happen. If I use Go 1.13 this doesn't happen. If I break at a later spot, the ni doesn't hang either, at least in this case. The last instruction it tried was the blt. I don't think it is related to morestack because it will happen on any function, and I don't think it is called that often. I don't think this is a problem that has been around for a long time, I use gdb on programs built by Golang all the time and have never seen this before.

@cherrymui

This comment has been minimized.

Copy link
Contributor

@cherrymui cherrymui commented Mar 5, 2020

My guess is that when one thread is stopped in the prologue, which is non-preemptible, other threads may keep sending signals to try to preempt it. Maybe the signals keep gdb busy? Maybe try info threads to see what other threads are doing, whether they are sending signals. Maybe try set scheduler-locking on to block other threads from running when single stepping?

@laboger

This comment has been minimized.

Copy link
Contributor Author

@laboger laboger commented Mar 5, 2020

I was not familiar with set scheduler-locking but that does seem to fix it. In any case, when it hangs, here is the output from the active threads and their backtraces. I don't see anything that would be sending signals.

I also tried this same experiment with Go 1.13 and if I stop at the same location, the threads and their stacks look the same.

21	func asin(x float64) float64 {
(gdb) ni
0x000000000008e324	21	func asin(x float64) float64 {
(gdb) x/20ni $pc
=> 0x8e324 <math.asin+4>:	cmpld   r3,r1
   0x8e328 <math.asin+8>:	blt     0x8e338 <math.asin+24>
   0x8e32c <math.asin+12>:	mflr    r5
   0x8e330 <math.asin+16>:	bl      0x71310 <runtime.morestack_noctxt>
   0x8e334 <math.asin+20>:	b       0x8e320 <math.asin>
   0x8e338 <math.asin+24>:	mflr    r31
   0x8e33c <math.asin+28>:	stdu    r31,-48(r1)
   0x8e340 <math.asin+32>:	lfd     f1,80(r1)
   0x8e344 <math.asin+36>:	xxlxor  vs2,vs2,vs2
   0x8e348 <math.asin+40>:	fcmpu   cr0,f1,f2
   0x8e34c <math.asin+44>:	beq     0x8e40c <math.asin+236>
   0x8e350 <math.asin+48>:	bge     0x8e404 <math.asin+228>
   0x8e354 <math.asin+52>:	fneg    f3,f1
   0x8e358 <math.asin+56>:	lis     r31,26
   0x8e35c <math.asin+60>:	lfd     f4,25976(r31)
   0x8e360 <math.asin+64>:	fcmpu   cr0,f3,f4
   0x8e364 <math.asin+68>:	bgt     0x8e3e8 <math.asin+200>
   0x8e368 <math.asin+72>:	fmul    f5,f3,f3
   0x8e36c <math.asin+76>:	fsub    f4,f4,f5
   0x8e370 <math.asin+80>:	fsqrt   f4,f4
(gdb) info reg $r3
r3             0xc00003cb80	824633969536
(gdb) info reg $r1
r1             0xc00003ce60	824633970272
(gdb) ni
^C
Thread 2 "math.test" received signal SIGINT, Interrupt.
[Switching to LWP 103396]
0x00000000000741f0 in runtime.usleep () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:134
134		SYSCALL	$SYS_nanosleep
(gdb) info thread
  Id   Target Id         Frame 
  1    LWP 103393 "math.test" 0x000000000008e324 in math.asin (~r1=<optimized out>, x=0.49790119248836734)
    at /home/boger/golang/base/go/src/math/asin.go:21
* 2    LWP 103396 "math.test" 0x00000000000741f0 in runtime.usleep () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:134
  3    LWP 103397 "math.test" 0x0000000000074730 in runtime.futex () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:501
  4    LWP 103398 "math.test" 0x0000000000074730 in runtime.futex () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:501
  5    LWP 103399 "math.test" 0x0000000000074730 in runtime.futex () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:501
(gdb) bt
#0  0x00000000000741f0 in runtime.usleep () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:134
#1  0x0000000000050240 in runtime.sysmon () at /home/boger/golang/base/go/src/runtime/proc.go:4468
#2  0x0000000000047af4 in runtime.mstart1 () at /home/boger/golang/base/go/src/runtime/proc.go:1098
#3  0x0000000000047a00 in runtime.mstart () at /home/boger/golang/base/go/src/runtime/proc.go:1063
#4  0x00000000000747ec in runtime.clone () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:566
(gdb) thread 3
[Switching to thread 3 (LWP 103397)]
#0  0x0000000000074730 in runtime.futex () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:501
501		SYSCALL	$SYS_futex
(gdb) bt
#0  0x0000000000074730 in runtime.futex () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:501
#1  0x000000000003f26c in runtime.futexsleep (addr=0xc0000444c8, val=0, ns=-1) at /home/boger/golang/base/go/src/runtime/os_linux.go:44
#2  0x000000000001a618 in runtime.notesleep (n=0xc0000444c8) at /home/boger/golang/base/go/src/runtime/lock_futex.go:151
#3  0x0000000000049158 in runtime.stopm () at /home/boger/golang/base/go/src/runtime/proc.go:1829
#4  0x000000000004a930 in runtime.findrunnable (gp=<optimized out>, inheritTime=<optimized out>)
    at /home/boger/golang/base/go/src/runtime/proc.go:2361
#5  0x000000000004b6e0 in runtime.schedule () at /home/boger/golang/base/go/src/runtime/proc.go:2521
#6  0x000000000004bc88 in runtime.park_m (gp=0xc000000c00) at /home/boger/golang/base/go/src/runtime/proc.go:2691
#7  0x0000000000071178 in runtime.mcall () at /home/boger/golang/base/go/src/runtime/asm_ppc64x.s:202
#8  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thread 4
[Switching to thread 4 (LWP 103398)]
#0  0x0000000000074730 in runtime.futex () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:501
501		SYSCALL	$SYS_futex
(gdb) bt
#0  0x0000000000074730 in runtime.futex () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:501
#1  0x000000000003f26c in runtime.futexsleep (addr=0xc000080148, val=0, ns=-1) at /home/boger/golang/base/go/src/runtime/os_linux.go:44
#2  0x000000000001a618 in runtime.notesleep (n=0xc000080148) at /home/boger/golang/base/go/src/runtime/lock_futex.go:151
#3  0x0000000000049158 in runtime.stopm () at /home/boger/golang/base/go/src/runtime/proc.go:1829
#4  0x000000000004a930 in runtime.findrunnable (gp=<optimized out>, inheritTime=<optimized out>)
    at /home/boger/golang/base/go/src/runtime/proc.go:2361
#5  0x000000000004b6e0 in runtime.schedule () at /home/boger/golang/base/go/src/runtime/proc.go:2521
#6  0x000000000004bc88 in runtime.park_m (gp=0xc000000a80) at /home/boger/golang/base/go/src/runtime/proc.go:2691
#7  0x0000000000071178 in runtime.mcall () at /home/boger/golang/base/go/src/runtime/asm_ppc64x.s:202
#8  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) thread 5 bt
[Switching to thread 5 (LWP 103399)]
#0  0x0000000000074730 in runtime.futex () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:501
501		SYSCALL	$SYS_futex
(gdb) bt
#0  0x0000000000074730 in runtime.futex () at /home/boger/golang/base/go/src/runtime/sys_linux_ppc64x.s:501
#1  0x000000000003f26c in runtime.futexsleep (addr=0xc000044848, val=0, ns=-1) at /home/boger/golang/base/go/src/runtime/os_linux.go:44
#2  0x000000000001a618 in runtime.notesleep (n=0xc000044848) at /home/boger/golang/base/go/src/runtime/lock_futex.go:151
#3  0x0000000000049158 in runtime.stopm () at /home/boger/golang/base/go/src/runtime/proc.go:1829
#4  0x000000000004a930 in runtime.findrunnable (gp=<optimized out>, inheritTime=<optimized out>)
    at /home/boger/golang/base/go/src/runtime/proc.go:2361
#5  0x000000000004b6e0 in runtime.schedule () at /home/boger/golang/base/go/src/runtime/proc.go:2521
#6  0x000000000004bc88 in runtime.park_m (gp=0xc000114000) at /home/boger/golang/base/go/src/runtime/proc.go:2691
#7  0x0000000000071178 in runtime.mcall () at /home/boger/golang/base/go/src/runtime/asm_ppc64x.s:202
#8  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
@cherrymui

This comment has been minimized.

Copy link
Contributor

@cherrymui cherrymui commented Mar 5, 2020

Thanks, @laboger . Okay, it seems other threads are not actively sending signals (which is expected), even not actively running. So maybe it is gdb itself stuck? I guess the next step is to see what gdb does at this point. Maybe run gdb inside gdb to find out?

Also, does this happen with other non-preemptible places? For example, if you single step in runtime.memmove or any assembly function (but not the prologue), is it stuck?

@cherrymui

This comment has been minimized.

Copy link
Contributor

@cherrymui cherrymui commented Mar 5, 2020

Normally, when single stepping in one thread, other threads are still running. set scheduler-locking on will let gdb suspend other threads.

@laboger

This comment has been minimized.

Copy link
Contributor Author

@laboger laboger commented Mar 6, 2020

Today was kind of strange, I found some systems where I didn't hit the error when at one time I thought it happened all the time. I am fairly sure in Go code it only happens within the code that checks the stack and calls morestack and that I can still reproduce on some systems. I thought I was hitting this a lot while debugging an asm function called update in poly1305 which now I see does not have the morestack sequence but today I can't reproduce it. I kind of got into the habit of setting the asyncpremptoff=1 while debugging because it was so annoying so I didn't really keep track of when and where it happened and didn't happen. I can't really make it happen today in other asm files.

I don't know what's different, although I do see some gdb's have the scheduler-locking default set to replay while some have it set to step. I can't find the documentation that tells what those two values mean so not sure if it makes a difference.

If this only affects Power I don't really expect a fix. Probably no one else does this type of debugging except me. (When you set a break at a symbol it sets it after the morestack sequence). I don't know if there is documentation for using gdb with Go but maybe could be mentioned there how to work around this problem.

@cherrymui

This comment has been minimized.

Copy link
Contributor

@cherrymui cherrymui commented Mar 6, 2020

(gdb) help set scheduler-locking 
Set mode for locking scheduler during execution.
off    == no locking (threads may preempt at any time)
on     == full locking (no thread except the current thread may run)
          This applies to both normal execution and replay mode.
step   == scheduler locked during stepping commands (step, next, stepi, nexti).
          In this mode, other threads may run during other commands.
          This applies to both normal execution and replay mode.
replay == scheduler locked in replay mode and unlocked during normal execution.

So, if it is set to step, other threads will be suspended while single stepping. If it is set to replay, other threads can still run.

I tried to reproduce on linux/amd64 with math.test but couldn't.

Do the systems you run have the same version of GDB?

Does it only happen if it actually calls morestack, or it can happen no matter whether it actually grows the stack?

@laboger

This comment has been minimized.

Copy link
Contributor Author

@laboger laboger commented Mar 9, 2020

OK, the scheduler-locking values make sense. On the system where the default is step, it does not hang, but all other systems default to replay and on those it hangs.

There are different versions of gdb on the systems I tried.
works: GNU gdb (GDB) Red Hat Enterprise Linux 8.2-6.el8_0; defaults to step
doesn't work: GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1; default to replay
doesn't work: GNU gdb (Debian 8.2.1-2+b3) 8.2.1; defaults to replay
doesn't work: GNU gdb (Ubuntu 8.1-0ubuntu3) 8.1.0.20180409-git: defaults to replay

Once it gets into this code sequence the delays when stepping happen pretty consistently. I don't think it is related to the morestack call, because the delay in stepping starts before it gets to the call to morestack, either on the first cmp or the blt.

I also tried breaking at the call to math/big.mulAddVWW which is asm without the morestack sequence and I do see some step delays in the code as well on the systems with replay as the default.

As mentioned above, once scheduler-locking is set to step or off then the delays are gone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
4 participants
You can’t perform that action at this time.