Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

USDT probes #327

Closed
brendangregg opened this issue Jan 29, 2016 · 55 comments
Closed

USDT probes #327

brendangregg opened this issue Jan 29, 2016 · 55 comments

Comments

@brendangregg
Copy link
Member

User-level statically defined tracing probes have been placed in various applications and runtimes, including Java, Node.js, MySQL, and PostgreSQL. These allow API-stable scripts to be written, that do not depend on tracing raw user-level functions (uprobes).

As an example of hacking in USDT tracing using ftrace, see: http://www.brendangregg.com/blog/2015-07-03/hacking-linux-usdt-ftrace.html . The unpublished script I referred to is: https://gist.github.com/brendangregg/f1b3d09c14088522065b

For a simple example to trace:

1. Create tick-dtrace.d:
provider tick {
        probe loop(int);
}

#pragma D attributes Evolving/Evolving/ISA provider node provider
#pragma D attributes Private/Private/Unknown provider node module
#pragma D attributes Private/Private/Unknown provider node function
#pragma D attributes Private/Private/ISA provider node name
#pragma D attributes Evolving/Evolving/ISA provider node args
2. Then create an object file:
# apt-get install -y systemtap-sdt-dev        # adds "dtrace"
# dtrace -G -s tick-dtrace.d -o tick-dtrace.o
3. Create the target program, tick-main.c:
#include <stdio.h>
#include <unistd.h>
/* from systemtap-sdt */
#include <sys/sdt.h>

int
main(int argc, char *argv[])
{
        int i;

        for (i = 0; i < 5; i++) {
                DTRACE_PROBE1(tick, loop, i);
                printf("hi: %d\n", i);
                sleep(1);
        }

        return (0);
}
4. Compile tick-main:
gcc -c tick-main.c
gcc -o tick tick-main.o tick-dtrace.o
5. Check it has USDT probes:
readelf -n tick

Notes at offset 0x0000021c with length 0x00000020:
  Owner                 Data size       Description
  GNU                  0x00000010       NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 2.6.35

Notes at offset 0x0000023c with length 0x00000024:
  Owner                 Data size       Description
  GNU                  0x00000014       NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: 2b94c0e7c684a001a34d685c862b33aa51ff7672

Notes at offset 0x00000a08 with length 0x00000044:
  Owner                 Data size       Description
  stapsdt              0x0000002e       NT_STAPSDT (SystemTap probe descriptors)
    Provider: tick
    Name: loop
    Location: 0x0000000000400558, Base: 0x0000000000400628, Semaphore: 0x0000000000000000
    Arguments: -4@-4(%rbp)

See NT_STAPSDT etc.

This is a basic probe. There is another type, isenabled, which I discussed in the blog post, and requires a semaphore to activate.

@4ast
Copy link
Member

4ast commented Jan 30, 2016

Definitely very useful, not sure how to tackle this yet.

@goldshtn
Copy link
Collaborator

@brendangregg I want to start hacking on this, and was wondering if you know how I can easily simulate a probe that requires a semaphore. I've looked at this SystemTap page and can't see any mention of it.

@goldshtn
Copy link
Collaborator

BTW, the SystemTap page I linked to above says that you can simply #include <sys/sdt.h> and then start using the DTRACE_PROBE macros. Indeed, it works directly, with no need for a .d file or running dtrace to generate an .o file. Is there any real difference/value added by the dtrace preprocessing?

@4ast
Copy link
Member

4ast commented Mar 23, 2016

@brendangregg can you explain what sema in that script is for?
Also we need to standardize on std.h first before going much further.
Is this going to be from dtrace:
https://github.com/dtrace4linux/linux/tree/master/uts/common/sys/
or from stap:
https://sourcecodebrowser.com/systemtap/1.0/sdt_8h_source.html
because corresponding tooling support will be different.
Also I don't think the actual dtrace probe is necessary, since we're going with uprobe.
Just std.h macro.

@brendangregg
Copy link
Member Author

@goldshtn you're right, so steps 1 & 2 can just be:

# apt-get install -y systemtap-sdt-dev        # adds sdt.h and sdt-config.h

Or we can include our own sdt.h. I haven't compared both, but I'd guess the SystemTap one has had more work to make it behave well on Linux.

Another thing: on the iovisor call today (11am PST biweekly) we briefly discussed that there were at least 3 ways to do USDT probes. This way, using uprobes, is the most obvious and immediate solution, and we should go ahead with it. But later on (much later on) we might investigate other approaches in addition or instead of, including LD_PRELOAD so that tracing can be user-mode to user-mode, reducing overhead. These other approaches should greatly reduce the overhead of memleak too.

I'll see if I can dig up a semaphore example.

@4ast Why some probes need semaphores? It's so the program can do some expensive argument preparation only when the probe is enabled. For example:

  • IPv6 send/receive. Lets say the probe is somewhere where the IPv6 header length is not known, but we wanted that as an argument (handy for finding the offset to the TCP header, etc). Unless Linux has a trick to it, calculating IPv6 header length involves a loop, walking the optional next headers until you run out of headers. We don't want to do this for every packet when tracing isn't even enabled; so a semaphore is used in a conditional. Basically: if (I am being traced) { arg1 = expensive work; TRACEPOINT(arg0, arg1, ...) }.
  • User level programs, same story. You might want probe arguments that are string usernames, URLs, hostnames, etc, yet these aren't available when you put down the probe, and add overhead to calculate. So you wrap the calculation in a is-enabled conditional.

@brendangregg
Copy link
Member Author

USDT is-enabled example:

1. Create tick-dtrace.d:
provider tick {
        probe loop1(int);
        probe loop2(int);
}

#pragma D attributes Evolving/Evolving/ISA provider node provider
#pragma D attributes Private/Private/Unknown provider node module
#pragma D attributes Private/Private/Unknown provider node function
#pragma D attributes Private/Private/ISA provider node name
#pragma D attributes Evolving/Evolving/ISA provider node args
2. Then create header and object files:
# apt-get install -y systemtap-sdt-dev        # adds "dtrace"
# dtrace -G -s tick-dtrace.d -o tick-dtrace.o
# dtrace -h -s tick-dtrace.d -o tick-dtrace.h

Note that the generated tick-dtrace.h contains the ENABLED macros and semaphores. Here's the file:

/* Generated by the Systemtap dtrace wrapper */

#define _SDT_HAS_SEMAPHORES 1


#define STAP_HAS_SEMAPHORES 1 /* deprecated */


#include <sys/sdt.h>

/* TICK_LOOP1 ( int) */
#if defined STAP_SDT_V1
#define TICK_LOOP1_ENABLED() __builtin_expect (loop1_semaphore, 0)
#define tick_loop1_semaphore loop1_semaphore
#else
#define TICK_LOOP1_ENABLED() __builtin_expect (tick_loop1_semaphore, 0)
#endif
__extension__ extern unsigned short tick_loop1_semaphore __attribute__ ((unused)) __attribute__ ((section (".probes")));
#define TICK_LOOP1(arg1) \
STAP_PROBE1(tick,loop1,arg1)

/* TICK_LOOP2 ( int) */
#if defined STAP_SDT_V1
#define TICK_LOOP2_ENABLED() __builtin_expect (loop2_semaphore, 0)
#define tick_loop2_semaphore loop2_semaphore
#else
#define TICK_LOOP2_ENABLED() __builtin_expect (tick_loop2_semaphore, 0)
#endif
__extension__ extern unsigned short tick_loop2_semaphore __attribute__ ((unused)) __attribute__ ((section (".probes")));
#define TICK_LOOP2(arg1) \
STAP_PROBE1(tick,loop2,arg1)
3. Create the target program, tick-main.c:
#include <stdio.h>
#include <unistd.h>
#include "tick-dtrace.h"

int
main(int argc, char *argv[])
{
        int i;

        for (i = 0; i < 5; i++) {
                DTRACE_PROBE1(tick, loop1, i);
                if (TICK_LOOP2_ENABLED()) {
                        DTRACE_PROBE1(tick, loop2, i);
                }
                printf("hi: %d\n", i);
                sleep(1);
        }

        return (0);
}

So the loop1 probe is normal, and loop2 is wrapped in is-enabled. Should help testing.

4. Compile tick-main:
gcc -c tick-main.c
gcc -o tick tick-main.o tick-dtrace.o
5. Check it has USDT probes:
readelf -n tick

Notes at offset 0x00000254 with length 0x00000020:
  Owner                 Data size   Description
  GNU                  0x00000010   NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 2.6.24

Notes at offset 0x00000274 with length 0x00000024:
  Owner                 Data size   Description
  GNU                  0x00000014   NT_GNU_BUILD_ID (unique build ID bitstring)
    Build ID: b56b39ec25bf88a5eac9f687e41567ecdbb87b49

Notes at offset 0x00001058 with length 0x00000088:
  Owner                 Data size   Description
  stapsdt              0x0000002f   NT_STAPSDT (SystemTap probe descriptors)
    Provider: tick
    Name: loop1
    Location: 0x0000000000400564, Base: 0x00000000004006a4, Semaphore: 0x0000000000601028
    Arguments: -4@-4(%rbp)
  stapsdt              0x0000002f   NT_STAPSDT (SystemTap probe descriptors)
    Provider: tick
    Name: loop2
    Location: 0x0000000000400574, Base: 0x00000000004006a4, Semaphore: 0x000000000060102a
    Arguments: -4@-4(%rbp)

See NT_STAPSDT.

@4ast
Copy link
Member

4ast commented Mar 24, 2016

yes, but looks like steps 1 and 2 are not really necessary. Such .h can be simplified into a macro without any need for generation ?

@brendangregg
Copy link
Member Author

@4ast right

@goldshtn
Copy link
Collaborator

Cool, thanks for the detailed example. But it looks like the consumer doesn't know if the traced program uses the semaphore or not. So it should increment the semaphore in any case. Correct?

@brendangregg
Copy link
Member Author

Sounds correct. Unless there's a check to see if the semaphore is in the compiled binary. Which you might test for anyway by checking if the semaphore address is non-zero (I assume if it's not included in the binary, readelf will show zero).

@goldshtn
Copy link
Collaborator

Yes, I've seen it reported as zero when not using the dtrace-generated infrastructure.

BTW, I'm a bit worried about probing things like -4(%rbp). From some basic experiments it seems to produce an invalid BPF program.

@goldshtn
Copy link
Collaborator

Another thing I just stumbled across: some probe arguments have a format that mentions some global names. For example, in libc I found these probes:

/lib64/libc.so.6 libc:memory_mallopt_mxfast [0x18e010] ARGS: -4@%ebp 8@global_max_fast(%rip)
/lib64/libc.so.6 libc:memory_mallopt_arena_test [0x18e010] ARGS: -4@%ebp 8@24+mp_(%rip)

@brendangregg, do you know how we're supposed to resolve this kind of argument? I assume global_max_fast might be a global variable, and because x64 addressing is RIP-relative it is referring to %rip. But then there's also 24+mp_ -- is that a global variable named mp_ which I am supposed to add 24 to, and dereference the resulting address? I'm just guessing at this point.

(This output, by the way, is from code I'm planning to integrate into tplist, so you could list USDT probes like tracepoints. For example, tplist -l pthread would list the USDT probes in libpthread.)

@goldshtn
Copy link
Collaborator

And here's another interesting phenomenon from my libc -- the same probe name appears multiple times, but at different locations. How would the user specify which probe they are interested in? Or do we probe all the locations?

  stapsdt              0x00000041       NT_STAPSDT (SystemTap probe descriptors)
    Provider: libc
    Name: memory_memalign_retry
    Location: 0x0000000000086000, Base: 0x000000000018e010, Semaphore: 0x0000000000000000
    Arguments: 8@%r12 8@%rbx
  stapsdt              0x00000041       NT_STAPSDT (SystemTap probe descriptors)
    Provider: libc
    Name: memory_memalign_retry
    Location: 0x0000000000086750, Base: 0x000000000018e010, Semaphore: 0x0000000000000000
    Arguments: 8@%r12 8@%rbx
  stapsdt              0x00000041       NT_STAPSDT (SystemTap probe descriptors)
    Provider: libc
    Name: memory_memalign_retry
    Location: 0x00000000000869d0, Base: 0x000000000018e010, Semaphore: 0x0000000000000000
    Arguments: 8@%r12 8@%rbx
  stapsdt              0x00000041       NT_STAPSDT (SystemTap probe descriptors)
    Provider: libc
    Name: memory_memalign_retry
    Location: 0x00000000000875ed, Base: 0x000000000018e010, Semaphore: 0x0000000000000000
    Arguments: 8@%rbx 8@%r12

EDITED: After some testing, this is pretty obvious -- when the probe is used in multiple places in the source program, there are multiple notes for that probe. I also noticed that the arguments don't have to be in the same locations. For example, if the program specifies an immediate value, like 42, the probe's arguments at that location will not use registers, but rather something like 4@$42. The result of this is that the probing program (argdist, trace, etc.) will have to probe each location separately and obtain the arguments in a separate way.

@goldshtn
Copy link
Collaborator

All right, I have a version of tplist that can display USDT probes on this branch. Next, I'm going to add support for USDT probes to argdist and trace. I'm thinking of initializing __arg1, __arg2, etc. local variables and having arg1, arg2 etc. expressions provided by the user refer to these. This will be 100% natural for trace, and a new concept for argdist.

For example, the user says:

# trace 'u:libc:memory_realloc_retry (arg1 < 100000) "%llu %llx", arg1, arg2'

Now, trace knows this probe has two arguments of type u64 in R13 and RBP, and generates the following:

int probe(struct pt_regs *ctx)
{
  u64 __arg1 = (u64)ctx->r13;
  u64 __arg2 = (u64)ctx->rbp;
  if (__arg1 < 100000) return 0;
  /* etc., where references to arg1 and arg2 now replaced with __arg1 and __arg2 */
  return 0;
}

What do you think?

@brendangregg
Copy link
Member Author

@goldshtn you might have figured it out, but I assume if a probe is in multiple locations, then they all must be traced at the same time. It's an advantage of static probes: you can define a logical tracepoint, and the end-user doesn't need to know if it's instrumenting one location to satisfy that tracepoint or ten (because of the implementation).

I don't know about the global_max_fast and mp_ stuff yet.

That trace example looks good.

@goldshtn
Copy link
Collaborator

@brendangregg: I have an updated version on my usdt branch. Before I create a PR, do you have any comments on the final syntax?

argdist example:

./argdist.py -C 'u:pthread:pthread_start():u64:arg2'
[11:18:09]
u:pthread:pthread_start():u64:arg2
        COUNT      EVENT
[11:18:10]
u:pthread:pthread_start():u64:arg2
        COUNT      EVENT
        2          arg2 = 4196385

trace example:

./trace.py 'u:/home/vagrant/basic_usdt:loop_iter "i = %d", arg1' -p $(pidof basic_usdt)
TIME     PID    COMM         FUNC             -
11:17:28 18672  basic_usdt   loop_iter        i = 42
11:17:28 18672  basic_usdt   loop_iter        i = 0
11:17:28 18672  basic_usdt   loop_iter        i = 1
11:17:28 18672  basic_usdt   loop_iter        i = 2
11:17:28 18672  basic_usdt   loop_iter        i = 3
11:17:28 18672  basic_usdt   loop_iter        i = 4
11:17:28 18672  basic_usdt   loop_iter        i = 5
11:17:28 18672  basic_usdt   loop_iter        i = 6
11:17:28 18672  basic_usdt   loop_iter        i = 7
11:17:28 18672  basic_usdt   loop_iter        i = 8

@brendangregg
Copy link
Member Author

Looks good.

I think we ultimately may want to add long-form versions of these:

  • u: == usdt:
  • t: == tracepoint:
  • p: == probe:
  • r: == return:

and when teaching bcc, people can begin with the long-form, then move onto the shortcuts. Can be added later.

tplist needs an _example.txt.

Probably my broken system, but when I tried to build it I got llvm errors:

root@bgregg-build-i-xxxxxxxx:/mnt/src/usdt/bcc/build# make 2>&1 | grep -i error
/mnt/src/llvm/include/../tools/clang/include/clang/AST/CharUnits.h:177:26: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
/mnt/src/llvm/include/../tools/clang/include/clang/AST/TypeLoc.h:258:15: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
/mnt/src/llvm/include/../tools/clang/include/clang/AST/TypeLoc.h:357:12: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
/mnt/src/llvm/include/../tools/clang/include/clang/AST/TypeLoc.h:403:12: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
/mnt/src/llvm/include/../tools/clang/include/clang/AST/TypeLoc.h:410:12: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
/mnt/src/llvm/include/../tools/clang/include/clang/AST/ExprCXX.h:1436:31: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
/mnt/src/llvm/include/../tools/clang/include/clang/AST/ExprCXX.h:1443:31: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
/mnt/src/llvm/include/../tools/clang/include/clang/AST/OpenMPClause.h:89:13: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
/mnt/src/llvm/include/../tools/clang/include/clang/AST/OpenMPClause.h:101:13: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
/mnt/src/llvm/include/../tools/clang/include/clang/AST/OpenMPClause.h:147:13: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
/mnt/src/llvm/include/../tools/clang/include/clang/AST/StmtOpenMP.h:73:23: error: ‘RoundUpToAlignment’ is not a member of ‘llvm’
/mnt/src/llvm/include/llvm/Support/AlignOf.h:79:52: error: invalid application of ‘sizeof’ to incomplete type ‘clang::VarDecl::UninstantiatedDefaultArgument’
/usr/include/c++/4.8/type_traits:548:12: error: invalid use of incomplete type ‘struct clang::VarDecl::UninstantiatedDefaultArgument’
/mnt/src/llvm/include/../tools/clang/include/clang/AST/Decl.h:737:10: error: forward declaration of ‘struct clang::VarDecl::UninstantiatedDefaultArgument’
/mnt/src/llvm/include/llvm/Support/AlignOf.h:79:13: error: ‘value’ is not a member of ‘std::is_abstract<clang::VarDecl::UninstantiatedDefaultArgument>’
/mnt/src/llvm/include/llvm/Support/AlignOf.h:79:52: error: invalid application of ‘sizeof’ to incomplete type ‘clang::VarDecl::UnparsedDefaultArgument’
/usr/include/c++/4.8/type_traits:548:12: error: invalid use of incomplete type ‘struct clang::VarDecl::UnparsedDefaultArgument’
/mnt/src/llvm/include/../tools/clang/include/clang/AST/Decl.h:733:10: error: forward declaration of ‘struct clang::VarDecl::UnparsedDefaultArgument’
/mnt/src/llvm/include/llvm/Support/AlignOf.h:79:13: error: ‘value’ is not a member of ‘std::is_abstract<clang::VarDecl::UnparsedDefaultArgument>’
/mnt/src/llvm/include/llvm/ADT/DenseMap.h:295:57: error: ‘isEqual’ is not a member of ‘llvm::DenseMapInfo<std::basic_string<char> >’
/mnt/src/llvm/include/llvm/ADT/DenseMap.h:294:53: error: ‘isEqual’ is not a member of ‘llvm::DenseMapInfo<std::basic_string<char> >’
In file included from /mnt/src/llvm/include/llvm/Support/ErrorOr.h:19:0,
/mnt/src/llvm/include/llvm/ADT/PointerIntPair.h:129:3: error: static assertion failed: PointerIntPair with integer size too large for pointer
/mnt/src/llvm/include/llvm/ADT/DenseMap.h:368:34: error: ‘getEmptyKey’ is not a member of ‘llvm::DenseMapInfo<std::basic_string<char> >’
/mnt/src/llvm/include/llvm/ADT/DenseMap.h:371:38: error: ‘getTombstoneKey’ is not a member of ‘llvm::DenseMapInfo<std::basic_string<char> >’
/mnt/src/llvm/include/llvm/Support/AlignOf.h:79:52: error: invalid application of ‘sizeof’ to incomplete type ‘clang::driver::JobAction’
/usr/include/c++/4.8/type_traits:548:12: error: invalid use of incomplete type ‘class clang::driver::JobAction’
/mnt/src/llvm/include/../tools/clang/include/clang/Driver/Util.h:21:9: error: forward declaration of ‘class clang::driver::JobAction’
/mnt/src/llvm/include/llvm/Support/AlignOf.h:79:13: error: ‘value’ is not a member of ‘std::is_abstract<clang::driver::JobAction>’
make[2]: *** [src/cc/frontends/clang/CMakeFiles/clang_frontend.dir/loader.cc.o] Error 1
make[1]: *** [src/cc/frontends/clang/CMakeFiles/clang_frontend.dir/all] Error 2
make: *** [all] Error 2

@goldshtn
Copy link
Collaborator

I haven't made any changes to the LLVM stuff, so it's probably your system 😺
I'll clean everything up and submit a PR soon.

@goldshtn
Copy link
Collaborator

OK, this issue is now mostly taken care of by #451. I still have to figure out what happens to the semaphore address when the probe is in a shared library. I expect the semaphore address to be relative to the library's load address, in which case the enabling program has to take care of it.

EDITED: Yes, it's relative, and it seems to be working now.

@goldshtn
Copy link
Collaborator

@brendangregg I think this can be closed now.

@goldshtn
Copy link
Collaborator

Figured out how to build Node with USDT probes. And it seems to work -- even though I haven't checked what the arguments mean yet.

# trace -p $(pidof node) -o \
'u:/home/vagrant/node/node:gc__start "%d %d %lld" arg1, arg2, arg3' \
'u:/home/vagrant/node/node:gc__done "%d %d %lld" arg1, arg2, arg3'  
TIME     PID    COMM         FUNC             -            
3.916463 21676  node         gc__start        4 0 33869920 
3.916553 21676  node         gc__done         4 0 33869920 
4.146330 21676  node         gc__start        2 0 33869920 
4.149811 21676  node         gc__done         2 0 33869920 
4.152456 21676  node         gc__start        1 0 33869920 
4.153467 21676  node         gc__done         1 0 33869920 
4.154198 21676  node         gc__start        1 0 33869920 
4.155255 21676  node         gc__done         1 0 33869920 
4.156458 21676  node         gc__start        1 0 33869920 
4.157312 21676  node         gc__done         1 0 33869920 
4.157947 21676  node         gc__start        1 0 33869920 
4.158857 21676  node         gc__done         1 0 33869920 
4.160153 21676  node         gc__start        1 0 33869920 
4.161080 21676  node         gc__done         1 0 33869920 
4.161780 21676  node         gc__start        1 0 33869920 
4.162749 21676  node         gc__done         1 0 33869920 
4.166224 21676  node         gc__start        1 0 33869920 
4.166396 21676  node         gc__done         1 0 33869920 
4.168663 21676  node         gc__start        1 0 33869920 
4.169020 21676  node         gc__done         1 0 33869920 
4.172296 21676  node         gc__start        1 0 33869920 
4.173119 21676  node         gc__done         1 0 33869920 
4.176806 21676  node         gc__start        4 0 33869920 
4.176907 21676  node         gc__done         4 0 33869920 
4.178813 21676  node         gc__start        2 0 33869920 
4.179444 21676  node         gc__done         2 0 33869920 
4.182515 21676  node         gc__start        8 0 33869920 
4.182531 21676  node         gc__done         8 0 33869920 
^C                           

@goldshtn
Copy link
Collaborator

Even better:

# trace -p $(pidof node) 'u:/home/vagrant/node/node:http__server__request "%s %s (from %s:%d)" arg5, arg6, arg3, arg4'
TIME     PID    COMM         FUNC             -
04:50:44 22185  node         http__server__request GET /foofoo (from ::1:51056)
^C

@4ast
Copy link
Member

4ast commented Mar 29, 2016

awesome! could you do a step-by-step description how to reproduce the same

@goldshtn
Copy link
Collaborator

Yes -- I've written a 1000+ word blog post that I intend to turn into a doc file. Hope to get it done tomorrow. Shows how to compile your own USDT probes (following Brendan's examples) and how to use Node with USDT.

@brendangregg
Copy link
Member Author

FWIW, last time I tested the Node.js USDT probes:

apt-get install systemtap-sdt-dev       # adds "dtrace", used by node build
wget https://nodejs.org/dist/v4.4.1/node-v4.4.1.tar.gz  # from https://nodejs.org/en/download/
tar xvf node-v4.4.1.tar.gz 
cd node-v4.4.1
./configure --with-dtrace
make -j 8

readelf should show things like this:

# readelf -n node
[...]
  stapsdt              0x00000082       NT_STAPSDT (SystemTap probe descriptors)
    Provider: node
    Name: http__client__request
    Location: 0x0000000000bf48ff, Base: 0x0000000000f22464, Semaphore: 0x0000000001243024
    Arguments: 8@%rax 8@%rdx 8@-136(%rbp) -4@-140(%rbp) 8@-72(%rbp) 8@-80(%rbp) -4@-144(%rbp)
[...]

Example server (basic-server.js):

// from https://nodejs.org/api/synopsis.html
const http = require('http');

http.createServer( (request, response) => {
  response.writeHead(200, {'Content-Type': 'text/plain'});
  response.end('Hello World\n');
}).listen(8124);

console.log('Server running at http://127.0.0.1:8124/');

Run using node basic-server.js.

Client can be curl, ab, or node, etc. curl http://127.0.0.1:8124/

Looking forward to @goldshtn 's blog post!

@brendangregg
Copy link
Member Author

Because my system was lacking a newer GLIB, I switched to node v0.11 (http://nodejs.org/dist/v0.11.0/node-v0.11.0.tar.gz), plus an older version of the HTTP server code. I can now build node, and list probes fine:

# ./tplist.py -p `pgrep -n node`
/mnt/src/node-v0.11.0/out/Release/node node:gc__start
/mnt/src/node-v0.11.0/out/Release/node node:gc__done
/mnt/src/node-v0.11.0/out/Release/node node:net__server__connection
/mnt/src/node-v0.11.0/out/Release/node node:net__stream__end
/mnt/src/node-v0.11.0/out/Release/node node:net__socket__read
/mnt/src/node-v0.11.0/out/Release/node node:net__socket__write
/mnt/src/node-v0.11.0/out/Release/node node:http__server__response
/mnt/src/node-v0.11.0/out/Release/node node:http__client__request
/mnt/src/node-v0.11.0/out/Release/node node:http__client__response
/mnt/src/node-v0.11.0/out/Release/node node:http__server__request

But not instrument them:

# ./trace.py 'u:/mnt/src/node-v0.11.0/node:http__server__request'
could not find library Usage: /usr/bin/which [-a] args

I used bcc to debug itself:

# ./execsnoop.py -n which
PCOMM            PID    RET ARGS
python           6138     0 /usr/bin/which --skip-alias /mnt/src/node-v0.11.0/node 2>/dev/n
sh               6139     0 /usr/bin/which --skip-alias /mnt/src/node-v0.11.0/node
^C
# /usr/bin/which --skip-alias /mnt/src/node-v0.11.0/node
/usr/bin/which: illegal option -- -
Usage: /usr/bin/which [-a] args

My which doesn't have the --skip-aliases.

@brendangregg
Copy link
Member Author

BTW, I deleted the --skip-aliases from trace.py and src/python/bcc/usdt.py, and it worked:

# ./trace.py 'u:/mnt/src/node-v0.11.0/node:http__server__request'
TIME     PID    COMM         FUNC             
00:05:06 4192   node         http__server__request 
00:05:06 4192   node         http__server__request 
00:05:09 4192   node         http__server__request 

Arguments didn't work (as well as your example), but that might be my build.

@goldshtn
Copy link
Collaborator

@4ast @brendangregg And here's the blog post I promised yesterday: http://blogs.microsoft.co.il/sasha/2016/03/30/usdt-probe-support-in-bpfbcc/

@4ast
Copy link
Member

4ast commented Mar 30, 2016

awesome!
I think we gotta simplify that part:
$ dtrace -h -s myapp.d -o myapp-trace.h
since it's kinda in the way.
It's possible to do the same without creating .d, calling extra tool, including generated .h.
if (static_key_false) {
expensive args
probe(args)
}
should be part of normal application code. Probably will require using our own sdt.h
which is a good thing anyway.

@goldshtn
Copy link
Collaborator

@4ast: I haven't looked into how <sys/sdt.h> works, but it looks like the hardest part is to generate the information on where to find the arguments (registers, globals, offsets, etc.). Would that be worth replicating in our own sdt.h?

@brendangregg: There is another low-hanging fruit, which is tapping into OpenJDK's USDT probes. It seems that there have been USDT markers in OpenJDK for a long time now. Quick example:

# ./trace.py -p $(pidof java) 'u:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.77-1.b03.fc22.x86_64/jre/lib/amd64/server/libjvm.so:gc__begin'
TIME     PID    COMM         FUNC
13:37:31 28599  java         gc__begin
^C

@brendangregg
Copy link
Member Author

@goldshtn yes (did you build your own openjdk for that?). So the JVM probes have had limited success: there's some interesting things, like GC, class, and thread activity, but the probes that would be obviously useful -- the Java method ones -- are usually too prohibitive to use due to their overhead. DTrace's overhead, that is.

Plus one needs -XX:+ExtendedDTraceProbes to make the method ones available, which adds a lot of overhead straight away (although that might have become dynamic https://blogs.oracle.com/sundararajan/entry/dtrace_java_turning_on_off ).

@4ast
Copy link
Member

4ast commented Mar 31, 2016

@goldshtn I was actually proposing to reuse sdt.h and argument storing into notes, since I can do

#include <sys/sdt.h>
int main(int ac, char **av)
{
  int tmp = 1;
  if (ac < 0) {
    DTRACE_PROBE2(foo, bar, ac + tmp, &tmp);
  }
}

and it will generate 'nop' insn and correct notes section without doing 'dtrace -...' steps.
Semaphors are good idea, but too hard to use, so I cannot see myself convincing production folks to use it.
Also even above program is buggy.
The way stap's sdt.h is written makes gcc 4.8 and 5.2 to generate wrong code for '&tmp',
since it doesn't track dependencies.
May be current sdt.h users learned to live with this limitation.
imo kernel-like (static_key_false approach) would be better.

@goldshtn
Copy link
Collaborator

@4ast: But you would still want that static variable to be enabled externally by the tracing program? How is that different from the current semaphores?

@goldshtn
Copy link
Collaborator

@brendangregg: I will experiment some more with enabling them dynamically. Still, the ones that are on by default like GC, loader, JNI can be very relevant for production monitoring.

@goldshtn
Copy link
Collaborator

@brendangregg: forgot to mention, I just did dnf install java on fc22 and got a libjvm with probes.

@goldshtn
Copy link
Collaborator

Just posted some more examples of using JVM probes: http://blogs.microsoft.co.il/sasha/2016/03/31/probing-the-jvm-with-bpfbcc/

@4ast
Copy link
Member

4ast commented Mar 31, 2016

@goldshtn enabling a variable externally is technically not different from semaphore. The difference that it doesn't need 'dtrace ..' step and extra .o

@goldshtn
Copy link
Collaborator

goldshtn commented Apr 1, 2016

OK, but we would still need the location of that variable recorded in the ELF note, right?

@4ast
Copy link
Member

4ast commented Apr 1, 2016

the stap approach is prone to compile bugs as shown in the example. If we do it kernel style then we can have all probes call single dummy function and pass arguments via standard x64 abi, so args will be in di,si,dx,cx,r8 and no parsing of args necessary, but notes section is still needed to know which sdt probes are there.
If we can live with current limitations of stap's sdt.h, great, but it's a hard sell when sdt depends on external binary and extra packages. If we can make it single sdt.h file, then we can add it to folly and large set of application will be able to easily use it.

@goldshtn
Copy link
Collaborator

goldshtn commented Apr 1, 2016

It's just that JDK and Node are already instrumented with stap's sdt.h. We could support both I suppose :-)

Is that in scope for this issue, or should we open another issue to discuss this? Also, is this something you think is worth investing in at this point in time?

@4ast
Copy link
Member

4ast commented Apr 1, 2016

support for existing stap's sdt is already done and it's absolutely awesome. I think we can push sdt to the next level.

@goldshtn
Copy link
Collaborator

goldshtn commented Apr 1, 2016

I have a few questions :-)
OK, so in the new approach, would we have a single function for all probes that takes a probe identifier and arguments? Does the probe consumer always call that function and the function checks if the probe is enabled, or is the check done at the call site before preparing the arguments? How do we optimize for probes where the arguments are already available in a register or are constants, and putting them in registers is just extra work? How do we optimize the function call (in the stap solution there is no function, just a nop you instrument)? Can we reuse any of the macros from trace_events.h or would it be overkill?

@goldshtn
Copy link
Collaborator

goldshtn commented Apr 1, 2016

I guess if the check is done at the call site, the function call cost would only be relevant when the probe is enabled, and then it's probably negligible compared to the int 3, kernel transition, BPF execution and so on. But the function call at the call site still makes the calling code a bit bigger.

@goldshtn
Copy link
Collaborator

goldshtn commented Apr 4, 2016

@brendangregg: It looks like enabling the ExtendedDTraceProbes flag dynamically is not supported in OpenJDK. I get the following

$ jinfo -flag ExtendedDTraceProbes $(pidof java)
-XX:-ExtendedDTraceProbes
$ jinfo -flag +ExtendedDTraceProbes $(pidof java)
Exception in thread "main" com.sun.tools.attach.AttachOperationFailedException: flag 'ExtendedDTraceProbes' cannot be changed

        at sun.tools.attach.LinuxVirtualMachine.execute(LinuxVirtualMachine.java:229)
        at sun.tools.attach.HotSpotVirtualMachine.executeCommand(HotSpotVirtualMachine.java:261)
        at sun.tools.attach.HotSpotVirtualMachine.setFlag(HotSpotVirtualMachine.java:234)
        at sun.tools.jinfo.JInfo.flag(JInfo.java:140)
        at sun.tools.jinfo.JInfo.main(JInfo.java:81)

But the rest of it works fine. I'm going to move on to perf-map support for usyms.py now.

@brendangregg
Copy link
Member Author

Since USDT does work via trace.py, I want to close this ticket, however, we should have (as part of this ticket or another one) a hello world example. I tried to write one. I think a couple more helper functions would make a big difference.

Here is rough attempt at a basic USDT script:

#!/usr/bin/python
#
# nodejs_http_server    Basic example of node.js USDT tracing.
#                       For Linux, uses BCC, BPF. Embedded C.
#
# USAGE: nodejs_http_server PID

from __future__ import print_function
from bcc import BPF, ProcUtils, USDTReader
import sys
import signal

if len(sys.argv) < 2:
    print("USAGE: nodejs_http_server PID")
    exit()
pid = sys.argv[1]
debug = 0

# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
int do_trace(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid();
    if (pid != PID)
        return 0;
    int __loc_id = 0;
    ARGS
    bpf_trace_printk("%s\\n", arg6);

    return 0;
};
"""
bpf_text = bpf_text.replace('PID', pid)

binary = "node"
probe = "http__server__request"

#
# XXX: the following steps should be simplified, and as much as possible
# moved to the bcc usdt library.
#

# 1. find the binary, and the probe:
path = ProcUtils.which(binary)
if path is None or len(path) == 0:
    print("ERROR: could not find %s path. Exiting." % binary)
    exit()
reader = USDTReader(bin_path=path)
for p in reader.probes:
    if p.name == probe:
        usdtprobe = p
        break
if usdtprobe == None:
    print("ERROR: could not find probe %s. Exiting." % probe)
    exit()

# 2. do probe args
args = usdtprobe.generate_usdt_cases(pid=int(pid))
bpf_text = bpf_text.replace('ARGS', args)

# 3. try to enable the probe, after setting up disable on Ctrl-C
def signal_exit(signal, frame):
    if (usdtprobe.need_enable()):
        usdtprobe.disable(int(pid))
    exit()
try:
    signal.signal(signal.SIGINT, signal_exit)
    if (usdtprobe.need_enable()):
        usdtprobe.enable(int(pid))
except:
    print("ERROR: could not enable probe %s. Exiting." % probe)
    exit()

# 4. attach usdt probe:
if debug:
    print(bpf_text)
b = BPF(text=bpf_text)
for i, location in enumerate(usdtprobe.locations):
    # same USDT probe can appear in more than one code location
    b.attach_uprobe(name=path, addr=location.address,
        fn_name="do_trace", pid=int(pid))

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "ARGS"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

It works.

Now here's one proposal for how this could be improved (see the two "XXX"s):

#!/usr/bin/python
#
# nodejs_http_server    Basic example of node.js USDT tracing.
#                       For Linux, uses BCC, BPF. Embedded C.
#
# USAGE: nodejs_http_server PID

from __future__ import print_function
from bcc import BPF, ProcUtils, USDTReader
import sys
import signal

if len(sys.argv) < 2:
    print("USAGE: nodejs_http_server PID")
    exit()
pid = sys.argv[1]
debug = 0

# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
int do_trace(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid();
    if (pid != PID)
        return 0;
    unsigned long long arg6 = bcc_get_usdtarg(6);    /* XXX */
    bpf_trace_printk("%s\\n", arg6);

    return 0;
};
"""
bpf_text = bpf_text.replace('PID', pid)
if debug:
    print(bpf_text)
b = BPF(text=bpf_text)

binary = "node"
probe = "http__server__request"

# XXX attach_usdt_probe() finds the binary, probe, enables it if necessary, and
# calls attach_uprobe(). It also adds a signal handler to disable the probe on
# script exit. If anything goes wrong, it returns an error.
if b.attach_usdt_probe(binary=binary, probe=probe, fn_name="do_trace",
    pid=int(pid)) != 0:
        print("ERROR: could not instrument %s:%s. Exiting." % (binary, probe))
        exit()

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "ARGS"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

@vmg
Copy link
Contributor

vmg commented Apr 28, 2016

I dig it. bcc_get_usdtarg would need to be bcc_get_usdtarg(ctx, 6), but besides that, it's totally doable and looks like a clean API.

The USDT implementation has been fully ported to C++ in #498, and it comes with a lot of extra goodies (including new helpers that would make implementing attach_usdt_probe quite trivial).

I'd like to get the old USDT Python APIs working on the new C++ implementation first. Once that's done, I'll get started with your suggestions. The PR could use a couple extra eyes btw.

@brendangregg
Copy link
Member Author

Ok, thanks. I need to learn the new lua stuff. /examples/lua is a big help!

@vmg vmg mentioned this issue May 1, 2016
@brendangregg
Copy link
Member Author

Based on the lua USDT api (#518), I think this is how the Python api could look:

#!/usr/bin/python
#
# nodejs_http_server    Basic example of node.js USDT tracing.
#                       For Linux, uses BCC, BPF. Embedded C.
#
# USAGE: nodejs_http_server PID

from __future__ import print_function
from bcc import BPF, USDT
import sys

if len(sys.argv) < 2:
    print("USAGE: nodejs_http_server PID")
    exit()
pid = sys.argv[1]
debug = 1

# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
int do_trace(struct pt_regs *ctx) {
    uint64_t addr;
    char path[128];
    bpf_usdt_readarg(6, ctx, &addr); 
    bpf_probe_read(&path, sizeof(path), (void *)addr);
    bpf_trace_printk("path: %s\\n", path);
    return 0;
};
"""

# enable USDT probe from given PID
u = USDT(pid=pid)
u.enable_probe(event="http__server__request", fn_name="do_trace")
if debug:
    print(u.get_text())
    print(bpf_text)

# initialize BPF
b = BPF(text=bpf_text, usdt=u)

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "ARGS"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        continue
    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

@vmg did you start work porting it to Python already?

Looks like there are three things to do:

  1. Add the USDT class and u.enable_probe function to src/python/bcc/usdt.py, which can come from src/lua/bcc/usdt.lua (most of the real code is already shared in src/cc/usdt.cc).
  2. The BPF object needs updating to check for a usdt argument, and if so, prepend its get_text() output to the BPF program. I'd also make get_text() public, so we can include it in debug output (like I did in that example).
  3. The USDT cleanup routine would need to be added to BPF.

Is that it?

It's a little odd, since normally we're creating a BPF object, and then doing attach_kprobe() later. Here we're doing a USDT enable_probe() and then initializing BPF. Seems backwards. But I get that it's easier to setup this way.

CC @goldshtn , who I'm sure is interested as well.

@brendangregg
Copy link
Member Author

Here's a start; I'm missing _attach_uprobes() and cleanup.

diff --git a/src/python/bcc/__init__.py b/src/python/bcc/__init__.py
index 048712a..fd0d29a 100644
--- a/src/python/bcc/__init__.py
+++ b/src/python/bcc/__init__.py
@@ -142,7 +142,7 @@ class BPF(object):
                     raise Exception("Could not find file %s" % filename)
         return filename

-    def __init__(self, src_file="", hdr_file="", text=None, cb=None, debug=0, cflags=[]):
+    def __init__(self, src_file="", hdr_file="", text=None, cb=None, usdt=None, debug=0, cflags=[]):
         """Create a a new BPF module with the given source code.

         Note:
@@ -166,6 +166,8 @@ class BPF(object):
         self.tables = {}
         cflags_array = (ct.c_char_p * len(cflags))()
         for i, s in enumerate(cflags): cflags_array[i] = s.encode("ascii")
+        if text and usdt:
+            text = usdt.get_text() + text
         if text:
             self.module = lib.bpf_module_create_c_from_string(text.encode("ascii"),
                     self.debug, cflags_array, len(cflags_array))
@@ -773,5 +775,5 @@ class BPF(object):
         except KeyboardInterrupt:
             exit()

-from .usdt import USDTReader
+from .usdt import USDTReader, USDT

diff --git a/src/python/bcc/usdt.py b/src/python/bcc/usdt.py
index b4e3151..5e8b66d 100644
--- a/src/python/bcc/usdt.py
+++ b/src/python/bcc/usdt.py
@@ -12,6 +12,9 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.

+import ctypes as ct
+from .libbcc import lib
+
 import os
 import struct
 import re
@@ -19,6 +22,21 @@ import re
 from . import BPF
 from . import ProcStat, ProcUtils

+class USDT(object):
+    def __init__(self, pid=None):
+        if pid != None:
+            self.context = lib.bcc_usdt_new_frompid(int(pid))
+            if self.context == 0:
+                raise ValueError("valid pid is required for USDT()")
+
+    def enable_probe(self, probe, fn_name):
+        if lib.bcc_usdt_enable_probe(self.context, probe.encode("ascii"),
+                                     fn_name.encode("ascii")) < 0:
+            raise ValueError("failed to enable USDT probe: " + probe)
+
+    def get_text(self):
+        return ct.cast(lib.bcc_usdt_genargs(self.context), ct.c_char_p).value
+
 class USDTArgument(object):
         def __init__(self, size, is_signed, location,

And example program:

#!/usr/bin/python
#
# nodejs_http_server    Basic example of node.js USDT tracing.
#                       For Linux, uses BCC, BPF. Embedded C.
#
# USAGE: nodejs_http_server PID

from __future__ import print_function
from bcc import BPF, USDT
import sys

if len(sys.argv) < 2:
    print("USAGE: nodejs_http_server PID")
    exit()
pid = sys.argv[1]
debug = 1

# load BPF program
bpf_text = """
#include <uapi/linux/ptrace.h>
int do_trace(struct pt_regs *ctx) {
    uint64_t addr;
    char path[128];
    bpf_usdt_readarg(6, ctx, &addr); 
    bpf_probe_read(&path, sizeof(path), (void *)addr);
    bpf_trace_printk("path:%s\\n", path);
    return 0;
};
"""

# enable USDT probe from given PID
u = USDT(pid=pid)
u.enable_probe(probe="http__server__request", fn_name="do_trace")
if debug:
    print(u.get_text())
    print(bpf_text)

# initialize BPF
b = BPF(text=bpf_text, usdt=u)

# header
print("%-18s %-16s %-6s %s" % ("TIME(s)", "COMM", "PID", "ARGS"))

# format output
while 1:
    try:
        (task, pid, cpu, flags, ts, msg) = b.trace_fields()
    except ValueError:
        print("value error")
        continue
    print("%-18.9f %-16s %-6d %s" % (ts, task, pid, msg))

@goldshtn
Copy link
Collaborator

@brendangregg: This looks very good, and timely too -- I'm working on generalizing stackcount (#580), and I'm just about done with uprobes and tracepoints. Adding USDT support in a non-hacky way would be a great way to wrap this up.

@vmg has some work in progress that he shared over here. I think it's more important to get the basic features in the library -- we can take care of argdist and trace and the other tools later.

@brendangregg
Copy link
Member Author

Oh cool, so @vmg's is much further along. I'll test it out.

oh, so I was missing the libbcc.py stuff, and had to do all the ctype stuff manually...

@brendangregg
Copy link
Member Author

Fixed in #624

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants