
rr fails to trace gdb #2452

Closed

multun opened this issue Mar 9, 2020 · 20 comments
@multun
Copy link

@multun multun commented Mar 9, 2020

Running rr record gdb -q ./helloworld fails to record a trace. There's no error message and trace collection appears to continue, but the hello world program never starts.
It would be lovely to be able to debug gdb using rr!

Tested with rr 5.2 and 5.3

@rocallahan rocallahan commented Mar 9, 2020

gdb uses ptrace aggressively and I suspect rr's ptrace emulation may not be adequate. It might not be too hard to make rr's ptrace emulation good enough for normal gdb usage. There are some ptrace features (in particular, PTRACE_SYSCALL) that would be prohibitively difficult to support under rr in general. (Currently rr relies on certain syscalls not being interruptible, but with PTRACE_SYSCALL you need to be able to interrupt every syscall.)

Most gdb issues are probably debuggable under rr by having gdb talk to a separate gdbserver and recording gdb under rr, but not the gdbserver.

@rocallahan rocallahan commented Mar 9, 2020

The problem in this case is that gdb is using rt_sigsuspend to block waiting for ptrace SIGCHLD and that is interacting with rr's ptrace emulation in a buggy way. rr gets into an infinite loop of restarting syscalls. I can reproduce this in a testcase.

@rocallahan rocallahan commented Mar 9, 2020

We send a synthetic SIGCHLD to the ptracer as expected, but the rt_sigsuspend syscall returns with ERESTARTNOHAND, which seems wrong.

@rocallahan rocallahan commented Mar 9, 2020

> the rt_sigsuspend syscall returns with ERESTARTNOHAND, which seems wrong.

Actually that's right in this case, because I didn't set a signal handler for SIGCHLD, and in that case sigsuspend is not supposed to return.

@rocallahan rocallahan commented Mar 9, 2020

The problem is code added in fb8d716. That is making us defer the SIGCHLD. Maybe SIGCHLD doesn't interrupt waitpid/waitid/wait4 but it certainly does interrupt rt_sigsuspend.

@rocallahan rocallahan commented Mar 10, 2020

I have a fix for that and am running the tests now. Basically, the wait syscalls prefer to return the notifying child pid rather than returning ERESTARTSYS for the signal generated by the notification.

However, we run into more problems with ptrace emulation. For some reason gdb forks, ptraces the child, then the child forks with gdb ptracing the grandchild, then gdb SIGKILLs the grandchild and waits for its exit code. The problem is that under rr, gdb is neither the real parent of the grandchild nor the real ptracer of the grandchild, so the kernel does not forward the grandchild's exit notification to gdb's wait syscall or send the appropriate SIGCHLD. We have logic for sending a fake exit notification to the ptracer, but it only works for exit syscalls. To make it work for SIGKILL we have to move that logic to our PTRACE_EVENT_EXIT handler, but at that point we destroy the task, which breaks the wait logic that handles those fake exit notifications, because that logic currently requires a pointer to a RecordTask.

@rocallahan rocallahan commented Mar 10, 2020

So that leaves a couple of options, neither of which are nice:

  1. Refactor/build new notification machinery so we can keep around some form of "not yet reaped task" that our emulation of wait syscalls can use after the RecordTask is gone.
  2. Instead of destroying RecordTask after PTRACE_EVENT_EXIT, keep it around until the task has been reaped ... or something like that.

The sigsuspend fix landed as fe6dd2d.

@rocallahan rocallahan commented Mar 10, 2020

Option 2 doesn't look too bad, but there is one serious problem: we don't actually reap any tasks during recording or replay :-(.

    if (tg->task_set().empty() && !session().is_recording()) {
      // Reap the zombie.

task_set().empty() is never true because the current task belongs to task_set()!

@rocallahan rocallahan commented Mar 10, 2020

We actually do need to refrain from reporting task exit to a tracee wait-ish syscall until the task has really exited. Otherwise we get observable, nasty races where task A hasn't really exited but we report to task B that it has, and then B does things that don't work because A hasn't really exited yet.

@rocallahan rocallahan commented Mar 10, 2020

Currently we ptrace-detach from a task before (failing to) wait for it. That can't actually work reliably because once we aren't ptracing a task, if we're not its direct parent (which we usually aren't), we can't wait for its exit (not without pidfds anyway).

So I tried making us not ptrace-detach, just PTRACE_CONT out of the PTRACE_EVENT_EXIT so the task exits while we're still ptrace-attached. But we get new problems :-(. I observe a hang simply waiting for a ptraced task (a child process that is the only thread in its own thread group) that has exited normally via exit_group(). The task is a zombie, so unfortunately I can't easily get a kernel stack to figure out why this is happening.

@rocallahan rocallahan commented Mar 10, 2020

I was able to fix that issue: 34a9f5d
Bit of a hack, but adequate.

@rocallahan rocallahan commented Mar 10, 2020

Now I can record gdb operating on simple programs:

[roc@localhost rr]$ rr record gdb -q ~/tmp/test
rr: Saving execution to trace directory `/home/roc/.local/share/rr/gdb-11'.
Warning: /usr/local/lib/rustlib/etc: No such file or directory.
Reading symbols from /home/roc/tmp/test...
(No debugging symbols found in /home/roc/tmp/test)
(gdb) c
The program is not being run.
(gdb) run
Starting program: /home/roc/tmp/test 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Hello World
[Inferior 1 (process 819773) exited normally]
(gdb) q
[roc@localhost rr]$ rr ps
PID	PPID	EXIT	CMD
819764	--	0	gdb -q /home/roc/tmp/test
819772	819764	0	iconv -l
819773	819764	0	/bin/bash -c exec /home/roc/tmp/test 
819774	819764	-9	(forked without exec)
819775	819774	-9	(forked without exec)
819776	819764	-9	(forked without exec)
@rocallahan rocallahan commented Mar 10, 2020

It's very slow during symbol table loading, though. It looks like every time gdb demangles a symbol, it calls rt_sigaction a couple of times, plus rt_sigprocmask, to set and remove a SIGSEGV handler, and these calls are much slower under rr than in normal execution (because we don't accelerate rt_sigaction with syscall buffering).

Using rr record gdb -ex 'maint set catch-demangler-crashes off' -q ... fixes the problem.

@rocallahan rocallahan commented Mar 10, 2020

A single cycle looks like this:

{
  real_time:3418938.397948 global_time:6003, event:`SYSCALL: rt_sigaction' (state:ENTERING_SYSCALL) tid:826080, ticks:37041848
rax:0xffffffffffffffda rbx:0xd rcx:0xffffffffffffffff rdx:0x7fffa8dda660 rsi:0x7fffa8dda5c0 rdi:0xb rbp:0x681fffa0 rsp:0x681ffe00 r8:0x7fffa8dda810 r9:0x7fffa8dda80c r10:0x8 r11:0x246 r12:0x7fffa8dda770 r13:0x0 r14:0x5610d75c9dec r15:0x5610d75c9dec rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xd fs_base:0x7f5bd39319c0 gs_base:0x0
}
{
  real_time:3418938.398001 global_time:6004, event:`SYSCALL: rt_sigaction' (state:EXITING_SYSCALL) tid:826080, ticks:37041848
rax:0x0 rbx:0xd rcx:0xffffffffffffffff rdx:0x7fffa8dda660 rsi:0x7fffa8dda5c0 rdi:0xb rbp:0x681fffa0 rsp:0x681ffe00 r8:0x7fffa8dda810 r9:0x7fffa8dda80c r10:0x8 r11:0x246 r12:0x7fffa8dda770 r13:0x0 r14:0x5610d75c9dec r15:0x5610d75c9dec rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xd fs_base:0x7f5bd39319c0 gs_base:0x0
  { tid:826080, addr:0x7fffa8dda5c0, length:0x20 }
  { tid:826080, addr:0x7fffa8dda660, length:0x20 }
}
{
  real_time:3418938.398133 global_time:6005, event:`SYSCALLBUF_FLUSH' tid:826080, ticks:37041918
  { syscall:'rt_sigprocmask', ret:0x0, size:0x18 }
}
{
  real_time:3418938.398148 global_time:6006, event:`SYSCALL: rt_sigaction' (state:ENTERING_SYSCALL) tid:826080, ticks:37041918
rax:0xffffffffffffffda rbx:0xd rcx:0xffffffffffffffff rdx:0x0 rsi:0x7fffa8dda5c0 rdi:0xb rbp:0x681fffa0 rsp:0x681ffe00 r8:0x0 r9:0x0 r10:0x8 r11:0x246 r12:0x0 r13:0x0 r14:0x5610d75c9dec r15:0x5610d75c9dec rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xd fs_base:0x7f5bd39319c0 gs_base:0x0
}
{
  real_time:3418938.398156 global_time:6007, event:`SYSCALLBUF_RESET' tid:826080, ticks:37041918
}
{
  real_time:3418938.398202 global_time:6008, event:`SYSCALL: rt_sigaction' (state:EXITING_SYSCALL) tid:826080, ticks:37041918
rax:0x0 rbx:0xd rcx:0xffffffffffffffff rdx:0x0 rsi:0x7fffa8dda5c0 rdi:0xb rbp:0x681fffa0 rsp:0x681ffe00 r8:0x0 r9:0x0 r10:0x8 r11:0x246 r12:0x0 r13:0x0 r14:0x5610d75c9dec r15:0x5610d75c9dec rip:0x70000002 eflags:0x246 cs:0x33 ss:0x2b ds:0x0 es:0x0 fs:0x0 gs:0x0 orig_rax:0xd fs_base:0x7f5bd39319c0 gs_base:0x0
  { tid:826080, addr:0x7fffa8dda5c0, length:0x20 }
}
@rocallahan rocallahan commented Mar 10, 2020

Adding and removing the signal handler around each symbol demangle (2-3 system calls per symbol) surely makes gdb a lot slower even without rr. They should set it once and then demangle lots of symbols.

Accelerating rt_sigaction under rr is possible but would be pretty tricky and arguably only useful for code that's already quite poorly implemented, like here. So let's ignore that for now and call this done.

@rocallahan rocallahan closed this Mar 10, 2020
@multun multun commented Mar 10, 2020

Thanks a lot for this amazingly efficient response!

@simark simark commented Mar 18, 2020

> Adding and removing the signal handler around each symbol demangle (2-3 system calls per symbol) surely makes gdb a lot slower even without rr. They should set it once and then demangle lots of symbols.
>
> Accelerating rt_sigaction under rr is possible but would be pretty tricky and arguably only useful for code that's already quite poorly implemented, like here. So let's ignore that for now and call this done.

Just wondering, which GDB version did you test this with? Today, GDB normally sets a SIGSEGV handler once, and then just updates a global callback variable when demangling a symbol:

https://github.com/bminor/binutils-gdb/blob/fe64b263e025c5abc02738138d9283f53641ca42/gdb/cp-support.c#L1545

The change introducing this new behavior was in this commit, from about a year ago:

bminor/binutils-gdb@3b3978b

@rocallahan rocallahan commented Mar 18, 2020

GNU gdb (GDB) Fedora 8.3.50.20190824-30.fc31

@simark simark commented Mar 18, 2020

Actually, it might have been that commit:

bminor/binutils-gdb@17bfe55

But in both cases, they were merged after the GDB you used was built (the first commit I mentioned was written over a year ago, but was only merged last November).

@rocallahan rocallahan commented Mar 18, 2020

That's good news.
