Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Go can't be recorded #1861

Closed
jld opened this issue Oct 21, 2016 · 20 comments
Closed

Go can't be recorded #1861

jld opened this issue Oct 21, 2016 · 20 comments

Comments

@jld
Copy link

jld commented Oct 21, 2016

Setup: rr revision f024ebc, Ubuntu 14.04 on amd64, golang package version 2:1.2.1-2ubuntu1

STR: rr record go help produces this:

rr: Saving execution to trace directory `/home/jld/.local/share/rr/go-6'.
unexpected fault address 0x7fca91944f08
fatal error: fault
[signal 0xb code=0x1 addr=0x7fca91944f08 pc=0x7fca9167c1d0]

goroutine 1 [running]:
runtime.throw(0xb7c917)
        /usr/lib/go/src/pkg/runtime/panic.c:464 +0x69 fp=0x7fca919456f0
runtime: unexpected return pc for runtime.sigpanic called from 0x7fca9167c1d0
runtime.sigpanic()
        /usr/lib/go/src/pkg/runtime/os_linux.c:237 +0xe9 fp=0x7fca91945708

goroutine 3 [syscall]:
os/signal.loop()
        /usr/lib/go/src/pkg/os/signal/signal_unix.go:21 +0x1e
created by os/signal.init·1
        /usr/lib/go/src/pkg/os/signal/signal_unix.go:27 +0x31

goroutine 4 [chan send]:
text/template/parse.lexText(0xc210050700, 0x87d808)
        /usr/lib/go/src/pkg/text/template/parse/lex.go:216 +0x18e
text/template/parse.(*lexer).run(0xc210050700)
        /usr/lib/go/src/pkg/text/template/parse/lex.go:198 +0x40
created by text/template/parse.lex
        /usr/lib/go/src/pkg/text/template/parse/lex.go:191 +0x117

However, rr record --no-syscall-buffer seems to work, slowly but usably, both on go help and the less-trivial original test case (the BoringSSL test suite).

@rocallahan
Copy link
Collaborator

WFM with Go 1.6 in Fedora 24 and Ubuntu 16.06. Guess I'll have to try it in Ubuntu 14.04.

@rocallahan
Copy link
Collaborator

This is a stack overflow calling into the syscallbuf. Presumably this is Go's split-stack implementation allocating a small initial stack and allocating new stack pages based on assumptions about stack usage which are broken by syscallbuf.

Need to think about what to do here.

@rocallahan
Copy link
Collaborator

I think we need to have the syscallbuf code switch to an alternate stack.

The question is where to put those stacks. Perhaps the simplest thing to do would be to use the thread_locals page. (Currently we assume syscallbuf only needs 2K of stack but that's conservative and could be reduced further.) The problem is that all those thread_locals pages are at the same address so this will muck up watchpoints and other debugging of anything on the syscallbuf stack.

@rocallahan
Copy link
Collaborator

Hmm in fact if we used the thread_locals page then whenever you're stopped in gdb and you try to get a backtrace of another thread that's blocked in syscallbuf, you'll get the wrong answer. That would be really bad.

@rocallahan
Copy link
Collaborator

BTW the reason this doesn't show up as easily in Go 1.6 is that in Go 1.3 they switched to contiguous stacks: when they run out of stack space, a new stack is created by copying the old stack into a larger area. Since they're still relying on manual stack-size checking and not kernel stack extension, this is probably still an rr issue, just harder to trigger.

@rocallahan
Copy link
Collaborator

Another possible approach is to detect SIGSEGVs triggered by the instruction that extends the syscallbuf stack, and in that case redirect program state to a stub that just triggers a non-buffered syscall and returns. Unfortunately these stacks might not have a guard page (they don't seem to in Go 1.2) so we can't rely on the SIGSEGV being triggered --- we might just start overwriting some other memory. So let's not do this.

@rocallahan
Copy link
Collaborator

Hmm, apart from the stack used by the syscallbuf hook code itself, there's also the 256-byte area we skip in X64SyscallStubExtendedJump to avoid corrupting the redzone.

@rocallahan
Copy link
Collaborator

I think we need to allocate a per-thread page for a syscallbuf alternate stack. We can store its address in the preload_thread_locals page. X64SyscallStubExtendedJump should switch stacks to this page to avoid the redzone issue.

The tricky thing is, how do we allocate and deallocate these pages?

@rocallahan
Copy link
Collaborator

Ah, I think we can just use part of the per-thread scratch buffer! Extend it by a page and use the last page.

@rocallahan
Copy link
Collaborator

Getting the CFI right so stack unwinding works will be a bit tricky but we should be able to manage it.

@rocallahan
Copy link
Collaborator

A potential problem with this is that a signal handler that runs during a blocked syscallbuf syscall runs on this alternate stack :-(.

@khuey
Copy link
Collaborator

khuey commented Oct 24, 2016

Can we just use sigaltstack for that?

@rocallahan
Copy link
Collaborator

No because these are application signal handlers. If they expect to see the stack pointer pointing to an application stack, they'll get confused no matter which alt-stack we use.

Another issue which I've just uncovered is the signal handler calling a hooked system call. It tries to reuse the alt-stack, corrupting the stack frames already using it...

@rocallahan
Copy link
Collaborator

I can avoid the latter issue with some extra code to detect nested usage of the alt-stack, and for nested calls, suppress the stack switch.

The question is whether signal handlers will get confused in the presence of our stack switch, e.g. if they try to just scan from current $rsp to some top-of-stack saved in TLS.

@rocallahan
Copy link
Collaborator

Go, of course, uses SA_STACK for its signal handlers so they all use their own alt-stacks. So we have the situation that for Go we need to use our own alt-stack, but possibly other applications could be harmed by that.

@rocallahan
Copy link
Collaborator

I think that it might be possible, if necessary, to have rr switch the stack pointer back to the original stack pointer before delivering the signal, and restore it if/when the handler does a sigreturn. So I'm going to go with this approach.

@rocallahan
Copy link
Collaborator

Handling clone is also interesting with an alt-stack. The cloned child resumes with its rsp pointing to the parent's stack, and we return through that stack. That's kind of a problem since the parent could reuse that stack. We need to copy the stack page from parent to child --- after the child has set up its scratch buffer --- and adjust rsp in the syscallbuf.

@rocallahan
Copy link
Collaborator

The cloned child resumes with its rsp pointing to the parent's stack, and we return through that stack.

This is wrong. clone children are assigned a stack pointer by the syscall so they don't return through the syscallbuf alt-stack.

@rocallahan
Copy link
Collaborator

fork() and vfork() are OK, at least for now, because the scratch memory we're using is forked so it's safe for the child to use it, and that scratch is not being unmapped.

@rocallahan
Copy link
Collaborator

This change was quite complicated and some additional scrutiny would be welcome :-).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants