
Go crash with uretprobe #1320

Open
ggaurav10 opened this issue Aug 25, 2017 · 12 comments

@ggaurav10

I am attempting to get function latencies from a module written in 'Go'. I am using uprobe and uretprobe in a custom eBPF program to track the function entry and exit respectively.
Below is the system config I am working on:

4.4.0-81-generic #104~14.04.1-Ubuntu SMP Wed Jun 14 12:45:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

The issue is that the Go process crashes because of a conflict between the way return probes work and the way Go manages its stack.
I would like to know whether there is a solution or workaround for this. Does uretprobe work with Go in any later kernel versions, or has this issue been fixed?

@4ast
Member

4ast commented Aug 26, 2017

Are you saying that setting a uprobe somewhere in Go code makes it fail?
That is certainly a kernel bug. Please describe the steps to reproduce.
If the uprobe is responsible, it should fail without bcc. Just adding a uprobe
as described here:
https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt
should make it fail.

@shashankmjain

If we replace the uretprobe with a uprobe in the bcc code, it works. If a uretprobe is used, it crashes the process.

@brendangregg
Member

Here's a comment on my go tracing blog post[1] by Suresh Kumar:

Another problem I ran into: the uretprobe seems to place the return probes by modifying the stack, which is in conflict with how Go manages its stack (stacks in Go can grow/shrink at any time; Go does so by copying the entire stack to a new, larger area, adjusting the pointers in the stack to point to the new area, etc.). So if we are doing a uretprobe, and the stack happens to grow (or shrink) at that time, it can lead to Go runtime panics. Please see here for an example panic message: https://github.com/surki/misc/blob/master/go.stp#L32-L58

[1] http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html

@4ast
Member

4ast commented Aug 27, 2017

@brendangregg thanks!
Indeed that's the case.
A uprobe works, whereas a uretprobe makes C++ exception handling fail:

$ cat throw.cxx
#include <stdexcept>
int foo(int x) { if (x == 1) throw std::runtime_error("oops!"); return x; }
int main(int ac) {
    try { foo(ac); } catch (std::runtime_error&) {}
    return 0;
}
$ clang++ throw.cxx
$ sudo /usr/share/bcc/tools/trace './a.out:_Z3fooi "%d" arg1' &
$ ./a.out
PID    TID    COMM         FUNC             -
2376753 2376753 a.out        _Z3fooi          1
2376841 2376841 a.out        _Z3fooi          1
$ sudo /usr/share/bcc/tools/trace 'r:./a.out:_Z3fooi "%d" retval' &
$ ./a.out
terminate called after throwing an instance of 'std::runtime_error'
  what():  oops!
$ ./a.out 2
PID    TID    COMM         FUNC             -
2413547 2413547 a.out        _Z3fooi          2

cc @yonghong-song @markdrayton @palmtenor

@shashankmjain

shashankmjain commented Aug 27, 2017

So is there a solution to this problem for Golang? Or can you suggest any other means of capturing function latency?

@gianlucaborello

gianlucaborello commented Jul 25, 2018

I have been hitting this problem as well: almost every time the runtime decides to shrink or expand a goroutine stack while a uretprobe is placed, the process crashes because the stack gets corrupted.

For Golang, the solution I'm currently experimenting with is to "simulate" a uretprobe with a series of uprobes. In particular:

  1. Given a Golang binary, parse the ELF symbol table and obtain the address of the symbol we want to trace. If needed, attach a uprobe at that address, for example to record the timestamp at which the function began executing for latency purposes (be careful to key any map by a unique id such as the goroutine id; don't use the tid, since goroutines migrate across threads).

  2. Instead of attaching a uretprobe to the symbol address, read the ELF text section starting at that address and start decoding instructions until the end of the symbol is reached. While scanning, place a uprobe at every instruction that returns from the procedure (e.g. for x86-64 that would be RETN instructions, opcodes 0xC2 and 0xC3). For the symbols I'm interested in, there's always a very manageable number of RETN instructions, in the 1-5 range, so that's reasonable.

  3. When one of the uprobes installed at the previous point triggers, it's effectively as if we were executing a uretprobe, except that we haven't messed with the stack and so the solution is robust enough to not crash when the Go runtime moves the stack (or so it seems). Also, since the uprobe is placed right before the RET instruction, the stack pointer is already conveniently placed at the beginning of the frame, so we have easy access to the input arguments and return values at a constant offset, since they are all stored on the stack in Go.
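Step 2 above can be illustrated with a toy sketch. Note the loud caveat: a real implementation must decode instructions one by one with a proper disassembler (as the comment describes), because the bytes 0xC2/0xC3 can also occur inside the operands of longer instructions; the naive byte scan below is only illustrative.

```python
def find_ret_offsets(code: bytes) -> list:
    """Return the offsets of x86-64 near-return opcodes in `code`.

    0xC3 is RET; 0xC2 imm16 is a RET that also pops imm16 bytes of
    arguments. ILLUSTRATIVE ONLY: a correct tool must walk decoded
    instruction boundaries, not raw bytes, to avoid false positives.
    """
    return [i for i, b in enumerate(code) if b in (0xC2, 0xC3)]

# Machine code for a trivial function:
#   push %rbp; mov %rsp,%rbp; pop %rbp; ret
code = bytes([0x55, 0x48, 0x89, 0xE5, 0x5D, 0xC3])
print(find_ret_offsets(code))  # [5]
```

Each returned offset, added to the symbol's start address, is a candidate uprobe location for step 3.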

Also, this approach has some mild performance benefits, since we avoid the uretprobe overhead. By tracing a tight loop of a simple function from libc:

Non-instrumented function call: 2 ns/call
Instrumented function call with uretprobe: 4 us/call
Instrumented function call with 2 uprobes (at enter + RET instructions): 3 us/call

The drawback is that we now have to decode instructions in userspace, so it's significantly more annoying than the standard alternative, and it's not currently possible with bcc (I'm working with uprobes in a separate project, so I have my own BPF loader).

I would appreciate some feedback on this approach, I am not an expert on this matter and would like to know if I'm missing something important.

Thanks

@brendangregg
Member

Sounds good. But I have to wonder: doesn't the kernel already have the code for walking instructions and finding RETNs for the creation of uretprobes? Could a future solution be a variant or flag for uretprobes so the kernel did this behavior?

@gianlucaborello

In general, since uretprobes are implemented by overwriting the return address on the stack, the kernel currently doesn't need to walk the function instructions.

However, as you correctly point out, the per-arch uprobes code already has a pretty comprehensive support for decoding instructions for all the major archs (e.g. arch_uprobe_analyze_insn() function in the tree), which is currently used mostly for validation and to know how to properly handle the out of line execution of the probed instruction.

I could definitely see a future feature where the kernel would, optionally, directly walk the function and place uprobes at the various return instructions. Userspace would just have to provide the symbol address and likely the symbol length, so a small API change overall.

Why hasn't it been done this way in the first place? I have briefly searched the relevant kernel commit logs, lkml archives and the original uprobes paper, but couldn't find a single mention about this approach. Naively I would say because it's more complicated, but it would be interesting to hear some kernel developer's point of view.

In my specific case, I was sharing the userspace solution because I am trying to introduce this functionality in sysdig, and being able to support this without any kernel changes would mean a much wider user adoption, as opposed to waiting for the feature to be upstreamed and, more painfully, waiting for it to make its way into the various distributions.

Thanks for the feedback!

@gianlucaborello

gianlucaborello commented Nov 26, 2018

Why hasn't it been done this way in the first place?

After spending a bit more time on this, I can easily answer my original question, which might be useful if someone else reads my previous comments.

Simply placing uprobes when arch-specific return instructions are encountered is not enough. Even if we forget for a second "corner cases" such as panic() in Go, we still have to deal with tail call optimizations done by many compilers.

For example, the function gnutls_record_recv from GnuTLS goes like this:

gnutls_record_recv(gnutls_session_t session, void *data, size_t data_size)
{
	...

	return _gnutls_recv_int(session, GNUTLS_APPLICATION_DATA,
				data, data_size, NULL,
				session->internals.record_timeout_ms);
}

The dynamic symbol table tells us gnutls_record_recv is stored at 0x317e0 and is 26 bytes long, so we can inspect the code:

317e0:    44 8b 8f 3c 15 00 00     mov     0x153c(%rdi),%r9d
317e7:    48 89 d1                 mov     %rdx,%rcx
317ea:    45 31 c0                 xor     %r8d,%r8d
317ed:    48 89 f2                 mov     %rsi,%rdx
317f0:    be 17 00 00 00           mov     $0x17,%esi
317f5:    e9 f6 f8 ff ff           jmpq    0x310f0

The function doesn't return itself, it simply jumps into _gnutls_recv_int, which is at 0x310f0, instead of properly calling it and then explicitly returning. This already makes it difficult to proceed, because we don't know if some other function is calling _gnutls_recv_int, so by recursively instrumenting it (which we can do programmatically, by following jumps until we exhaust the search) we might end up instrumenting too much, as opposed to doing the stack manipulation done by uretprobes.
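Following a direct tail jump like the `e9 f6 f8 ff ff` above is mechanical, since the `jmpq rel32` encoding contains a signed displacement relative to the next instruction. A minimal sketch:

```python
import struct

def direct_jmp_target(insn: bytes, insn_addr: int) -> int:
    """Compute the target of an x86-64 `jmpq rel32` (opcode 0xE9).

    The 32-bit displacement is signed, little-endian, and relative to
    the address of the *next* instruction (insn_addr + 5).
    """
    assert len(insn) == 5 and insn[0] == 0xE9
    (rel32,) = struct.unpack("<i", insn[1:5])
    return insn_addr + 5 + rel32

# The tail jump at the end of gnutls_record_recv shown above:
#   317f5: e9 f6 f8 ff ff   jmpq 0x310f0
print(hex(direct_jmp_target(bytes([0xE9, 0xF6, 0xF8, 0xFF, 0xFF]), 0x317F5)))
# 0x310f0
```

This only helps for direct jumps, though; as the next example shows, indirect tail calls through a function pointer defeat static analysis entirely.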

The problem is worse when the tail call optimization is done through a variable function pointer. For example, this is PR_Recv from the NSPR library:

PR_IMPLEMENT(PRInt32) PR_Recv(PRFileDesc *fd, void *buf, PRInt32 amount, PRIntn flags, PRIntervalTime timeout)
{
	return((fd->methods->recv)(fd,buf,amount,flags,timeout));
}

In this case, there's nothing we can do with static analysis, since the function pointer is obtained at runtime by dereferencing an argument, as we can see from the code:

e690:       48 8b 07                mov    (%rdi),%rax
e693:       48 8b 80 88 00 00 00    mov    0x88(%rax),%rax
e69a:       ff e0                   jmpq   *%rax

Placing a uretprobe via stack manipulation seems to be the only feasible way.

"Luckily", it seems Golang, when using the default GC compiler, is not currently doing any tail call optimization [1] [2], and in fact I've done a few experiments and even with very simple functions the call stack is properly kept. So, for the moment the original method I proposed seems to work, but it's important to keep in mind it might break when/if the GC compiler decides to introduce the optimization.

[1] https://blog.gopheracademy.com/recursion/
[2] golang/go#22624

@yzhao1012
Contributor

Sorry to reopen this old thread. We ran into this same problem.

@gianlucaborello Hi Gianluca, for your suggested workaround in
#1320 (comment)
Is the code open-sourced?

@yzhao1012
Contributor

Just to get back to this issue:

We did what's suggested in #1320: https://github.com/pixie-io/pixie/blob/5960b3535bbed52763dec0bbb6b2bf20b74c23ac/src/stirling/obj_tools/elf_reader.cc#L502:44

AFAIK, the known issues in this approach are:

Additionally, we have not investigated the behavior in exotic cases such as recursive functions, which could be a source of additional bugs.

@mariomac

With respect to this comment:

Also, since the uprobe is placed right before the RET instruction, the stack pointer is already conveniently placed at the beginning of the frame, so we have easy access to the input arguments and return values at a constant offset, since they are all stored on the stack in Go.

It seems that since Go 1.17 arguments are passed in registers, so I don't see a way to read an arbitrary argument from a uprobe inserted before the function returns:
https://go.googlesource.com/go/+/refs/heads/dev.regabi/src/cmd/compile/internal-abi.md#function-call-argument-and-result-passing
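To make the concern concrete, here is a small sketch of the register assignment described in the internal ABI document linked above: on amd64, the first nine integer arguments go into a fixed register sequence instead of the stack (this list is taken from that document; treat it as an assumption to verify against your Go version). A uprobe handler would have to read these from pt_regs, and by the time a uprobe at a RET site fires, the compiler may already have clobbered them.

```python
from typing import Optional

# amd64 integer argument register sequence per Go's internal ABI
# document (assumption: verify against the Go version you target).
# Arguments beyond these nine spill to the stack.
GO_AMD64_INT_ARG_REGS = ["rax", "rbx", "rcx", "rdi", "rsi",
                         "r8", "r9", "r10", "r11"]

def go_int_arg_register(n: int) -> Optional[str]:
    """Register holding the n-th (0-based) integer argument under the
    Go 1.17+ register ABI, or None if it is passed on the stack."""
    if 0 <= n < len(GO_AMD64_INT_ARG_REGS):
        return GO_AMD64_INT_ARG_REGS[n]
    return None

print(go_int_arg_register(0), go_int_arg_register(9))  # rax None
```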
