Permalink
| Demonstrations of trace. | |
| trace probes functions you specify and displays trace messages if a particular | |
| condition is met. You can control the message format to display function | |
| arguments and return values. | |
| For example, suppose you want to trace all commands being exec'd across the | |
| system: | |
| # trace 'sys_execve "%s", arg1' | |
| PID COMM FUNC - | |
| 4402 bash sys_execve /usr/bin/man | |
| 4411 man sys_execve /usr/local/bin/less | |
| 4411 man sys_execve /usr/bin/less | |
| 4410 man sys_execve /usr/local/bin/nroff | |
| 4410 man sys_execve /usr/bin/nroff | |
| 4409 man sys_execve /usr/local/bin/tbl | |
| 4409 man sys_execve /usr/bin/tbl | |
| 4408 man sys_execve /usr/local/bin/preconv | |
| 4408 man sys_execve /usr/bin/preconv | |
| 4415 nroff sys_execve /usr/bin/locale | |
| 4416 nroff sys_execve /usr/bin/groff | |
| 4418 groff sys_execve /usr/bin/grotty | |
| 4417 groff sys_execve /usr/bin/troff | |
| ^C | |
| The ::sys_execve syntax specifies that you want an entry probe (which is the | |
| default), in a kernel function (which is the default) called sys_execve. Next, | |
| the format string to print is simply "%s", which prints a string. Finally, the | |
| value to print is the first argument to the sys_execve function, which happens | |
| to be the command that is exec'd. The above trace was generated by executing | |
| "man ls" in a separate shell. As you see, man executes a number of additional | |
| programs to finally display the man page. | |
| Next, suppose you are looking for large reads across the system. Let's trace | |
| the read system call and inspect the third argument, which is the number of | |
| bytes to be read: | |
| # trace 'sys_read (arg3 > 20000) "read %d bytes", arg3' | |
| PID COMM FUNC - | |
| 4490 dd sys_read read 1048576 bytes | |
| 4490 dd sys_read read 1048576 bytes | |
| 4490 dd sys_read read 1048576 bytes | |
| 4490 dd sys_read read 1048576 bytes | |
| ^C | |
| During the trace, I executed "dd if=/dev/zero of=/dev/null bs=1M count=4". | |
| The individual reads are visible, with the custom format message printed for | |
| each read. The parenthesized expression "(arg3 > 20000)" is a filter that is | |
| evaluated for each invocation of the probe before printing anything. | |
| You can also trace user functions. For example, let's simulate the bashreadline | |
| script, which attaches to the readline function in bash and prints its return | |
| value, effectively snooping all bash shell input across the system: | |
| # trace 'r:bash:readline "%s", retval' | |
| PID COMM FUNC - | |
| 2740 bash readline echo hi! | |
| 2740 bash readline man ls | |
| ^C | |
| The special retval keywords stands for the function's return value, and can | |
| be used only in a retprobe, specified by the 'r' prefix. The next component | |
| of the probe is the library that contains the desired function. It's OK to | |
| specify executables too, as long as they can be found in the PATH. Or, you | |
| can specify the full path to the executable (e.g. "/usr/bin/bash"). | |
| Multiple probes can be combined on the same command line. For example, let's | |
| trace failed read and write calls on the libc level, and include a time column: | |
| # trace 'r:c:read ((int)retval < 0) "read failed: %d", retval' \ | |
| 'r:c:write ((int)retval < 0) "write failed: %d", retval' -T | |
| TIME PID COMM FUNC - | |
| 05:31:57 3388 bash write write failed: -1 | |
| 05:32:00 3388 bash write write failed: -1 | |
| ^C | |
| Note that the retval variable must be cast to int before comparing to zero. | |
| The reason is that the default type for argN and retval is an unsigned 64-bit | |
| integer, which can never be smaller than 0. | |
| trace has also some basic support for kernel tracepoints. For example, let's | |
| trace the block:block_rq_complete tracepoint and print out the number of sectors | |
| transferred: | |
| # trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' -T | |
| TIME PID COMM FUNC - | |
| 01:23:51 0 swapper/0 block_rq_complete sectors=8 | |
| 01:23:55 10017 kworker/u64: block_rq_complete sectors=1 | |
| 01:23:55 0 swapper/0 block_rq_complete sectors=8 | |
| ^C | |
| To discover the tracepoint structure format (which you can refer to as the "args" | |
| pointer variable), use the tplist tool. For example: | |
| # tplist -v block:block_rq_complete | |
| block:block_rq_complete | |
| dev_t dev; | |
| sector_t sector; | |
| unsigned int nr_sector; | |
| int errors; | |
| char rwbs[8]; | |
| This output tells you that you can use "args->dev", "args->sector", etc. in your | |
| predicate and trace arguments. | |
| More and more high-level libraries are instrumented with USDT probe support. | |
| These probes can be traced by trace just like kernel tracepoints. For example, | |
| trace new threads being created and their function name: | |
| # trace 'u:pthread:pthread_create "%U", arg3' -T | |
| TIME PID COMM FUNC - | |
| 02:07:29 4051 contentions pthread_create primes_thread+0x0 | |
| 02:07:29 4051 contentions pthread_create primes_thread+0x0 | |
| 02:07:29 4051 contentions pthread_create primes_thread+0x0 | |
| 02:07:29 4051 contentions pthread_create primes_thread+0x0 | |
| ^C | |
| The "%U" format specifier tells trace to resolve arg3 as a user-space symbol, | |
| if possible. Similarly, use "%K" for kernel symbols. | |
| Ruby, Node, and OpenJDK are also instrumented with USDT. For example, let's | |
| trace Ruby methods being called (this requires a version of Ruby built with | |
| the --enable-dtrace configure flag): | |
| # trace 'u:ruby:method__entry "%s.%s", arg1, arg2' -p $(pidof irb) -T | |
| TIME PID COMM FUNC - | |
| 12:08:43 18420 irb method__entry IRB::Context.verbose? | |
| 12:08:43 18420 irb method__entry RubyLex.ungetc | |
| 12:08:43 18420 irb method__entry RuxyLex.debug? | |
| ^C | |
| In the previous invocation, arg1 and arg2 are the class name and method name | |
| for the Ruby method being invoked. | |
| You can also trace exported functions from shared libraries, or an imported | |
| function on the actual executable: | |
| # sudo ./trace.py 'r:/usr/lib64/libtinfo.so:curses_version "Version=%s", retval' | |
| # tput -V | |
| PID TID COMM FUNC - | |
| 21720 21720 tput curses_version Version=ncurses 6.0.20160709 | |
| ^C | |
| Occasionally, it can be useful to filter specific strings. For example, you | |
| might be interested in open() calls that open a specific file: | |
| # trace 'p:c:open (STRCMP("test.txt", arg1)) "opening %s", arg1' -T | |
| TIME PID COMM FUNC - | |
| 01:43:15 10938 cat open opening test.txt | |
| 01:43:20 10939 cat open opening test.txt | |
| ^C | |
| In the preceding example, as well as in many others, readability may be | |
| improved by providing the function's signature, which names the arguments and | |
| lets you access structure sub-fields, which is hard with the "arg1", "arg2" | |
| convention. For example: | |
| # trace 'p:c:open(char *filename) "opening %s", filename' | |
| PID TID COMM FUNC - | |
| 17507 17507 cat open opening FAQ.txt | |
| ^C | |
| # trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec' | |
| PID TID COMM FUNC - | |
| 777 785 automount SyS_nanosleep sleep for 500000000 ns | |
| 777 785 automount SyS_nanosleep sleep for 500000000 ns | |
| 777 785 automount SyS_nanosleep sleep for 500000000 ns | |
| 777 785 automount SyS_nanosleep sleep for 500000000 ns | |
| ^C | |
| Remember to use the -I argument include the appropriate header file. We didn't | |
| need to do that here because `struct timespec` is used internally by the tool, | |
| so it always includes this header file. | |
| As a final example, let's trace open syscalls for a specific process. By | |
| default, tracing is system-wide, but the -p switch overrides this: | |
| # trace -p 2740 'do_sys_open "%s", arg2' -T | |
| TIME PID COMM FUNC - | |
| 05:36:16 15872 ls do_sys_open /etc/ld.so.cache | |
| 05:36:16 15872 ls do_sys_open /lib64/libselinux.so.1 | |
| 05:36:16 15872 ls do_sys_open /lib64/libcap.so.2 | |
| 05:36:16 15872 ls do_sys_open /lib64/libacl.so.1 | |
| 05:36:16 15872 ls do_sys_open /lib64/libc.so.6 | |
| 05:36:16 15872 ls do_sys_open /lib64/libpcre.so.1 | |
| 05:36:16 15872 ls do_sys_open /lib64/libdl.so.2 | |
| 05:36:16 15872 ls do_sys_open /lib64/libattr.so.1 | |
| 05:36:16 15872 ls do_sys_open /lib64/libpthread.so.0 | |
| 05:36:16 15872 ls do_sys_open /usr/lib/locale/locale-archive | |
| 05:36:16 15872 ls do_sys_open /home/vagrant | |
| ^C | |
| In this example, we traced the "ls ~" command as it was opening its shared | |
| libraries and then accessing the /home/vagrant directory listing. | |
| Lastly, if a high-frequency event is traced you may overflow the perf ring | |
| buffer. This shows as "Lost N samples": | |
| # trace sys_open | |
| 5087 5087 pgrep sys_open | |
| 5087 5087 pgrep sys_open | |
| 5087 5087 pgrep sys_open | |
| 5087 5087 pgrep sys_open | |
| 5087 5087 pgrep sys_open | |
| Lost 764896 samples | |
| Lost 764896 samples | |
| Lost 764896 samples | |
| The perf ring buffer size can be changed with -b. The unit is size per-CPU buffer | |
| size and is measured in pages. The value must be a power of two and defaults to | |
| 64 pages. | |
| USAGE message: | |
| usage: trace [-h] [-b BUFFER_PAGES] [-p PID] [-L TID] [-v] [-Z STRING_SIZE] | |
| [-S] [-M MAX_EVENTS] [-t] [-T] [-K] [-U] [-I header] | |
| probe [probe ...] | |
| Attach to functions and print trace messages. | |
| positional arguments: | |
| probe probe specifier (see examples) | |
| optional arguments: | |
| -h, --help show this help message and exit | |
| -b BUFFER_PAGES, --buffer-pages BUFFER_PAGES | |
| number of pages to use for perf_events ring buffer | |
| (default: 64) | |
| -p PID, --pid PID id of the process to trace (optional) | |
| -L TID, --tid TID id of the thread to trace (optional) | |
| -v, --verbose print resulting BPF program code before executing | |
| -Z STRING_SIZE, --string-size STRING_SIZE | |
| maximum size to read from strings | |
| -S, --include-self do not filter trace's own pid from the trace | |
| -M MAX_EVENTS, --max-events MAX_EVENTS | |
| number of events to print before quitting | |
| -t, --timestamp print timestamp column (offset from trace start) | |
| -T, --time print time column | |
| -K, --kernel-stack output kernel stack trace | |
| -U, --user-stack output user stack trace | |
| -I header, --include header | |
| additional header files to include in the BPF program | |
| as either full path, or relative to '/usr/include' | |
| EXAMPLES: | |
| trace do_sys_open | |
| Trace the open syscall and print a default trace message when entered | |
| trace 'do_sys_open "%s", arg2' | |
| Trace the open syscall and print the filename being opened | |
| trace 'sys_read (arg3 > 20000) "read %d bytes", arg3' | |
| Trace the read syscall and print a message for reads >20000 bytes | |
| trace 'r::do_sys_open "%llx", retval' | |
| Trace the return from the open syscall and print the return value | |
| trace 'c:open (arg2 == 42) "%s %d", arg1, arg2' | |
| Trace the open() call from libc only if the flags (arg2) argument is 42 | |
| trace 'c:malloc "size = %d", arg1' | |
| Trace malloc calls and print the size being allocated | |
| trace 'p:c:write (arg1 == 1) "writing %d bytes to STDOUT", arg3' | |
| Trace the write() call from libc to monitor writes to STDOUT | |
| trace 'r::__kmalloc (retval == 0) "kmalloc failed!"' | |
| Trace returns from __kmalloc which returned a null pointer | |
| trace 'r:c:malloc (retval) "allocated = %x", retval' | |
| Trace returns from malloc and print non-NULL allocated buffers | |
| trace 't:block:block_rq_complete "sectors=%d", args->nr_sector' | |
| Trace the block_rq_complete kernel tracepoint and print # of tx sectors | |
| trace 'u:pthread:pthread_create (arg4 != 0)' | |
| Trace the USDT probe pthread_create when its 4th argument is non-zero | |
| trace 'p::SyS_nanosleep(struct timespec *ts) "sleep for %lld ns", ts->tv_nsec' | |
| Trace the nanosleep syscall and print the sleep duration in ns |