Go can't be recorded #1861

jld · 2016-10-21T21:47:28Z

Setup: rr revision f024ebc, Ubuntu 14.04 on amd64, golang package version 2:1.2.1-2ubuntu1

STR: rr record go help produces this:

rr: Saving execution to trace directory `/home/jld/.local/share/rr/go-6'.
unexpected fault address 0x7fca91944f08
fatal error: fault
[signal 0xb code=0x1 addr=0x7fca91944f08 pc=0x7fca9167c1d0]

goroutine 1 [running]:
runtime.throw(0xb7c917)
        /usr/lib/go/src/pkg/runtime/panic.c:464 +0x69 fp=0x7fca919456f0
runtime: unexpected return pc for runtime.sigpanic called from 0x7fca9167c1d0
runtime.sigpanic()
        /usr/lib/go/src/pkg/runtime/os_linux.c:237 +0xe9 fp=0x7fca91945708

goroutine 3 [syscall]:
os/signal.loop()
        /usr/lib/go/src/pkg/os/signal/signal_unix.go:21 +0x1e
created by os/signal.init·1
        /usr/lib/go/src/pkg/os/signal/signal_unix.go:27 +0x31

goroutine 4 [chan send]:
text/template/parse.lexText(0xc210050700, 0x87d808)
        /usr/lib/go/src/pkg/text/template/parse/lex.go:216 +0x18e
text/template/parse.(*lexer).run(0xc210050700)
        /usr/lib/go/src/pkg/text/template/parse/lex.go:198 +0x40
created by text/template/parse.lex
        /usr/lib/go/src/pkg/text/template/parse/lex.go:191 +0x117

However, rr record --no-syscall-buffer seems to work, slowly but usably, both on go help and the less-trivial original test case (the BoringSSL test suite).

The text was updated successfully, but these errors were encountered:

rocallahan · 2016-10-22T03:15:19Z

WFM with Go 1.6 in Fedora 24 and Ubuntu 16.06. Guess I'll have to try it in Ubuntu 14.04.

rocallahan · 2016-10-22T11:08:06Z

This is a stack overflow calling into the syscallbuf. Presumably this is Go's split-stack implementation allocating a small initial stack and allocating new stack pages based on assumptions about stack usage which are broken by syscallbuf.

Need to think about what to do here.

rocallahan · 2016-10-22T22:39:54Z

I think we need to have the syscallbuf code switch to an alternate stack.

The question is where to put those stacks. Perhaps the simplest thing to do would be to use the thread_locals page. (Currently we assume syscallbuf only needs 2K of stack but that's conservative and could be reduced further.) The problem is that all those thread_locals pages are at the same address so this will muck up watchpoints and other debugging of anything on the syscallbuf stack.

rocallahan · 2016-10-22T22:47:09Z

Hmm in fact if we used the thread_locals page then whenever you're stopped in gdb and you try to get a backtrace of another thread that's blocked in syscallbuf, you'll get the wrong answer. That would be really bad.

rocallahan · 2016-10-22T22:55:19Z

BTW the reason this doesn't show up as easily in Go 1.6 is that in Go 1.3 they switched to contiguous stacks: when they run out of stack space, a new stack is created by copying the old stack into a larger area. Since they're still relying on manual stack-size checking and not kernel stack extension, this is probably still an rr issue, just harder to trigger.

rocallahan · 2016-10-22T22:58:35Z

Another possible approach is to detect SIGSEGVs triggered by the instruction that extends the syscallbuf stack, and in that case redirect program state to a stub that just triggers a non-buffered syscall and returns. Unfortunately these stacks might not have a guard page (they don't seem to in Go 1.2) so we can't rely on the SIGSEGV being triggered --- we might just start overwriting some other memory. So let's not do this.

rocallahan · 2016-10-22T23:04:24Z

Hmm, apart from the stack used by the syscallbuf hook code itself, there's also the 256-byte area we skip in X64SyscallStubExtendedJump to avoid corrupting the redzone.

rocallahan · 2016-10-22T23:07:32Z

I think we need to allocate a per-thread page for a syscallbuf alternate stack. We can store its address in the preload_thread_locals page. X64SyscallStubExtendedJump should switch stacks to this page to avoid the redzone issue.

The tricky thing is, how do we allocate and deallocate these pages?

rocallahan · 2016-10-22T23:34:08Z

Ah, I think we can just use part of the per-thread scratch buffer! Extend it by a page and use the last page.

rocallahan · 2016-10-22T23:34:52Z

Getting the CFI right so stack unwinding works will be a bit tricky but we should be able to manage it.

rocallahan · 2016-10-24T03:03:23Z

A potential problem with this is that a signal handler that runs during a blocked syscallbuf syscall runs on this alternate stack :-(.

khuey · 2016-10-24T03:04:13Z

Can we just use sigaltstack for that?

rocallahan · 2016-10-24T03:08:00Z

No because these are application signal handlers. If they expect to see the stack pointer pointing to an application stack, they'll get confused no matter which alt-stack we use.

Another issue which I've just uncovered is the signal handler calling a hooked system call. It tries to reuse the alt-stack, corrupting the stack frames already using it...

rocallahan · 2016-10-24T03:40:17Z

I can avoid the latter issue with some extra code to detect nested usage of the alt-stack, and for nested calls, suppress the stack switch.

The question is whether signal handlers will get confused in the presence of our stack switch, e.g. if they try to just scan from current $rsp to some top-of-stack saved in TLS.

rocallahan · 2016-10-24T03:41:48Z

Go, of course, uses SA_STACK for its signal handlers so they all use their own alt-stacks. So we have the situation that for Go we need to use our own alt-stack, but possibly other applications could be harmed by that.

rocallahan · 2016-10-24T04:11:40Z

I think that it might be possible, if necessary, to have rr switch the stack pointer back to the original stack pointer before delivering the signal, and restore it if/when the handler does a sigreturn. So I'm going to go with this approach.

rocallahan · 2016-10-24T08:49:12Z

Handling clone is also interesting with an alt-stack. The cloned child resumes with its rsp pointing to the parent's stack, and we return through that stack. That's kind of a problem since the parent could reuse that stack. We need to copy the stack page from parent to child --- after the child has set up its scratch buffer --- and adjust rsp in the syscallbuf.

rocallahan · 2016-10-25T02:59:24Z

The cloned child resumes with its rsp pointing to the parent's stack, and we return through that stack.

This is wrong. clone children are assigned a stack pointer by the syscall so they don't return through the syscallbuf alt-stack.

rocallahan · 2016-10-25T03:16:24Z

fork() and vfork() are OK, at least for now, because the scratch memory we're using is forked so it's safe for the child to use it, and that scratch is not being unmapped.

rocallahan · 2016-10-25T03:23:20Z

This change was quite complicated and some additional scrutiny would be welcome :-).

rocallahan closed this as completed in 38a799a Oct 25, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Go can't be recorded #1861

Go can't be recorded #1861

jld commented Oct 21, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 24, 2016

khuey commented Oct 24, 2016

rocallahan commented Oct 24, 2016

rocallahan commented Oct 24, 2016

rocallahan commented Oct 24, 2016

rocallahan commented Oct 24, 2016

rocallahan commented Oct 24, 2016

rocallahan commented Oct 25, 2016

rocallahan commented Oct 25, 2016

rocallahan commented Oct 25, 2016

Go can't be recorded #1861

Go can't be recorded #1861

Comments

jld commented Oct 21, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 22, 2016

rocallahan commented Oct 24, 2016

khuey commented Oct 24, 2016

rocallahan commented Oct 24, 2016

rocallahan commented Oct 24, 2016

rocallahan commented Oct 24, 2016

rocallahan commented Oct 24, 2016

rocallahan commented Oct 24, 2016

rocallahan commented Oct 25, 2016

rocallahan commented Oct 25, 2016

rocallahan commented Oct 25, 2016