Kernel tracepoint support (ugly) #419
Comments
:) I was hoping it wouldn't come to that, but that's a reasonable 'workaround'. |
OK, suppose I go through perf - would it be possible to have perf discard the events and not write them out to perf.data? I'm looking to eliminate the inefficiency in doing analysis and filtering in user space, otherwise it isn't really worth pursuing... |
if perf_event_open is done for the task, then perf_trace* will be called, but it will not produce any output, since pid != pid_of_our_python_process. Another approach is to add a filter to it that always returns false. It's a bit slower, since alloc + copy of arguments and registers will be happening before discarding event. The main overhead is int3 anyway. |
OK, so just to make sure I understand the recommended approach. Turn on the tracepoints using Also adding @brendangregg if you have any thoughts. |
Turn the tracepoint on with the perf_event_open() syscall, add a filter via the PERF_EVENT_IOC_SET_FILTER ioctl, and add a kprobe to the corresponding perf_trace_* function. There are ~400 of them, one for each tracepoint, like perf_trace_block_bio_complete. There will be no trace record, only function args in ctx->di, si ... |
I was kind of hoping to use |
you mean to execute the perf binary?! that's more complex, slower, and more fragile than doing a syscall. |
Well I'm willing to try. While I have your attention, do you know if there's a simple example of turning on a tracepoint and adding a filter? |
bcc/src/cc/libbpf.c for opening an event. Doing the ioctl is trivial. Just pass a string to it: ioctl(fd, PERF_EVENT_IOC_SET_FILTER, filter); |
All right, I think I have something reasonable at this point -- https://github.com/goldshtn/bcc/tree/tpoint. There are a few small issues to figure out, but the main one is this: some tracepoints don't have a corresponding Here's an example that works nicely:
But there are many examples of tracepoints for which I can't find the corresponding
Lastly, the I'd appreciate any pointers that would help make this even more useful. /cc @brendangregg @4ast |
I've looked through the patches in your tree, but it was too hard for me to see what's going on. |
@4ast Thanks! I'll add it to the bcc module. About my other question, do you know why some tracepoints don't have corresponding |
syscall* tracepoints are slow and not recommended. It's faster to kprobe into sys_foo. |
Great work! And I didn't realize perf_trace_* was such a strong convention, but I see it now. I should have seen it sooner -- it's from the macros (TRACE_EVENT etc). So I don't get how we get to nr_bytes.
So the args should be: dev, sector, nr_sector, ... right? So eliding arg0 (as @4ast said), the one-liner should be:
I guess it's clear from your output anyway, since we aren't doing block device I/O of "8 -> 15" bytes :) As for @4ast's last Q, how we map tracepoint names to perf_trace_* functions, I think it can all come from the events pathnames:
I think we can parse that to give us the tracepoint name (last two components -> block:block_rq_complete), and also the function name (last component prefixed with perf_trace_). Although they don't all exist as those functions (eg, one kernel I'm on is missing perf_trace_block_bio_backmerge), so we'd have to check /sys/kernel/debug/tracing/available_filter_functions. |
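The path parsing described here can be sketched as follows. This assumes the usual debugfs layout under `/sys/kernel/debug/tracing/events/<category>/<event>`; the function name `names_from_event_path` is hypothetical, and the returned perf_trace_* name is only a candidate, since (as noted) not every tracepoint has such a function.

```python
import posixpath

# Assumed location of the per-event directories in debugfs.
EVENTS_ROOT = "/sys/kernel/debug/tracing/events"


def names_from_event_path(path):
    """Return (tracepoint name, candidate perf_trace_* function) for an event directory."""
    # The last two path components are the category and the event name.
    category, event = posixpath.normpath(path).split("/")[-2:]
    return "%s:%s" % (category, event), "perf_trace_" + event


tp, fn = names_from_event_path(EVENTS_ROOT + "/block/block_rq_complete")
print(tp, fn)
```

A tool would still have to confirm that the candidate function exists on the running kernel before attaching a kprobe to it.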
RE syscall tracing: we probably should have a docs directory, and a simple syscalls.md document to explain how to trace syscalls, and the caveats of kprobes (stability). I used to know the difference between sys_read and SyS_read, but I forget right now, and it's not easy to duckduckgo for (aka google for). I suppose bcc could map syscall tracepoints to the sys kprobes, and have a simple lookup table for that. This would be optional -- you could trace the kprobe directly, or try a syscall tracepoint mapping. The problem is maintenance of the lookup table, especially if we need one bcc repo to support different kernel versions. I'm reminded of SystemTap's tapsets, that get mired in kernel version tests. It might not be so bad if we just do it for syscalls and nothing else. |
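As a rough illustration of the optional mapping idea, the sketch below derives a kprobe target from a syscall tracepoint name instead of keeping a hand-written table. The helper `syscall_tracepoint_to_kprobe` is hypothetical, and the `prefix` parameter stands in for the kernel-version-dependent part (for example `sys_` versus `SyS_`), which is exactly the maintenance burden mentioned above.

```python
def syscall_tracepoint_to_kprobe(tracepoint, prefix="sys_"):
    """Map 'syscalls:sys_enter_read' to ('sys_read', False).

    Returns (function name, is_return); is_return is True for sys_exit_*
    tracepoints, which would correspond to a kretprobe.
    """
    category, _, event = tracepoint.partition(":")
    if category != "syscalls":
        raise ValueError("not a syscall tracepoint: %s" % tracepoint)
    for marker, is_return in (("sys_enter_", False), ("sys_exit_", True)):
        if event.startswith(marker):
            return prefix + event[len(marker):], is_return
    raise ValueError("unrecognized syscall event: %s" % event)


print(syscall_tracepoint_to_kprobe("syscalls:sys_enter_read"))
print(syscall_tracepoint_to_kprobe("syscalls:sys_exit_open", prefix="SyS_"))
```

Users could still attach to the kprobe directly; the mapping would only be a convenience layer.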
... block:block_bio_backmerge got inlined, hence no perf_trace_... |
@brendangregg re: available_filter_functions is for ftrace only. It's not applicable to kprobe or tracepoints. Anyway I'm working on proper support for tracepoints. The goal is to let bpf programs define the 2nd argument into the bpf prog as a special struct. Like for block/block_rq_complete the 2nd arg will be struct S { Technically it's possible to hack it already (without kernel changes). The first argument (that I said not to use) will be filled with such a struct, but the kprobe should be placed not on the first instruction but close to the last one, where ctx->di will be pointing to the filled-in struct like above. We'd need to bpf_probe_read() it first. |
@brendangregg Re nr_bytes vs. nr_sectors, it's weird but the
I don't think it's feasible to come up with the @4ast I will experiment with a kretprobe on the |
Hello I did a test looking at nr_bytes vs sector, as it didn’t make sense to me how we could get nr_bytes when it is used to work out the sector value. So maybe it didn’t matter what was passed in to the script.
root@henky:~# dd if=/dev/zero of=./aaaa oflag=direct bs=8192 count=10
root@henky:/var/tmp/march# ./argdist.py -H 't:block:block_rq_complete(void dummy, void *dummy2, void *dummy3, int allan):int:allan'
with nr_sector
root@henky:/var/tmp/march# # ./argdist.py -H 't:block:block_rq_complete(void dummy, void *dummy2, void *dummy3, int nr_sector):int:nr_sector'
|
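For the dd test above, a quick sanity check of what a histogram should show may help. This sketch assumes 512-byte sectors and power-of-two histogram buckets labeled "low -> high" (like the "8 -> 15" bucket mentioned earlier); both helper names are illustrative.

```python
SECTOR_SIZE = 512  # bytes per sector (assumption)


def bytes_to_sectors(nbytes):
    """Number of whole sectors covered by an I/O of nbytes."""
    return nbytes // SECTOR_SIZE


def log2_bucket(value):
    """Return the (low, high) power-of-two bucket that contains value (value >= 1)."""
    low = 1 << (value.bit_length() - 1)
    return low, 2 * low - 1


sectors = bytes_to_sectors(8192)  # one dd block with bs=8192
print(sectors, log2_bucket(sectors), log2_bucket(8192))
```

Under these assumptions an 8192-byte request is 16 sectors, so a sector count would land in a different bucket than a byte count, which is one way to tell which field is actually being read.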
I might be missing something, but did you simply run the same command with 'allan' and 'nr_sector' as the parameter names? If so, there will be no difference because the parameter names you give the traced method don't matter at all, and have nothing to do with the names used in the trace structure. |
@4ast I've tried hacking together an example that reads from the entry struct when exiting the
#!/usr/bin/env python
from bcc import BPF
from time import sleep

source = """
#include <linux/ptrace.h>
#include <linux/blkdev.h>

struct entry {
    dev_t dev;
    sector_t sector;
    unsigned int nr_sector;
    int errors;
    char rwbs[8];
};

BPF_HASH(history, u64, struct entry);
BPF_HASH(curr, u64, u64);

/* On entry: remember the first argument (ctx->di) per thread. */
int enter(struct pt_regs *ctx)
{
    u64 tid = bpf_get_current_pid_tgid();
    u64 val = ctx->di;
    curr.update(&tid, &val);
    return 0;
}

/* On return: read the struct that the saved pointer refers to. */
int probe(struct pt_regs *ctx)
{
    u64 tid = bpf_get_current_pid_tgid();
    u64 *enter_di = curr.lookup(&tid);
    if (enter_di == 0)
        return 0;
    u64 key = bpf_ktime_get_ns();
    struct entry e = {};
    bpf_probe_read(&e, sizeof(e), (void *)*enter_di);
    history.update(&key, &e);
    return 0;
}
"""

bpf = BPF(text=source)
bpf.attach_kprobe(event="perf_trace_block_rq_complete", fn_name="enter")
bpf.attach_kretprobe(event="perf_trace_block_rq_complete", fn_name="probe")
while True:
    sleep(1)
    print("****")
    for k, v in bpf.get_table("history").items():
        print(k.value, v.rwbs, v.nr_sector, v.sector)
|
@4ast @drzaeus77 Because this would also be my first PR to the bcc module, I wanted to know what you think about the desired API. How about |
@goldshtn sorry my mistake about first arg. remembering ctx->di isn't going to work. I misread kernel code. Trying to figure out another way... |
ok, the following changes were needed:
and
also I would suggest to use array type instead of hash to record ctx->di with key==0 |
@4ast Cool, will keep experimenting. Thanks for the help. BTW, not sure how an array with key=0 would work -- there could be multiple threads calling the trace methods, so it doesn't seem safe to record ctx->di with just one key for all these threads. Am I wrong? |
@4ast OK, so I set up a kprobe on When I enable the tracepoint with The code is still on my tpoint branch. Here are the test programs I use outside of the full argdist script. The first one enables a tracepoint using
#include <string.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/unistd.h>

int main()
{
    struct perf_event_attr attr = {};
    int pid = 0 /* current */, cpu = -1, group_fd = -1;
    int pfd;

    /* Print the constants so they can be reused from Python. */
    printf("__NR_perf_event_open = %d\n", __NR_perf_event_open);
    printf("PERF_TYPE_TRACEPOINT = %d\n", PERF_TYPE_TRACEPOINT);
    printf("PERF_SAMPLE_RAW = %d\n", PERF_SAMPLE_RAW);
    /* These three are not plain ints, so cast them before printing. */
    printf("PERF_FLAG_FD_CLOEXEC = %lu\n",
           (unsigned long)PERF_FLAG_FD_CLOEXEC);
    printf("PERF_EVENT_IOC_SET_FILTER = %lu\n",
           (unsigned long)PERF_EVENT_IOC_SET_FILTER);
    printf("PERF_EVENT_IOC_ENABLE = %lu\n",
           (unsigned long)PERF_EVENT_IOC_ENABLE);

    attr.config = 1116; /* net:net_dev_xmit */
    attr.type = PERF_TYPE_TRACEPOINT;
    attr.sample_type = PERF_SAMPLE_RAW;
    attr.sample_period = 1;
    attr.wakeup_events = 1;

    pfd = syscall(__NR_perf_event_open, &attr, pid, cpu, group_fd,
                  PERF_FLAG_FD_CLOEXEC);
    if (pfd < 0) {
        fprintf(stderr, "perf_event_open failed: %s\n",
                strerror(errno));
        return 1;
    }
    /* A filter that never matches, so no trace records are produced. */
    if (ioctl(pfd, PERF_EVENT_IOC_SET_FILTER, "common_pid == -17") < 0) {
        fprintf(stderr, "ioctl to set event filter failed\n");
        return 1;
    }
    if (ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
        fprintf(stderr, "ioctl to enable event failed\n");
        return 1;
    }
    printf("Hit ENTER to quit.\n");
    getchar();
    return 0;
}
And here's the script, following your recommendations:
#!/usr/bin/env python
from bcc import BPF
from time import sleep

source = """
#include <linux/ptrace.h>
#include <linux/blkdev.h>

struct entry2 {
    u64 __DONT_USE__;
    void *skbaddr;
    unsigned int len;
    int rc;
};

BPF_HASH(history, u64, struct entry2);
BPF_HASH(curr, u64, u64);

/* Remember the first argument (ctx->di) per thread. */
int enter(struct pt_regs *ctx)
{
    u64 tid = bpf_get_current_pid_tgid();
    u64 val = ctx->di;
    curr.update(&tid, &val);
    return 0;
}

/* On return from the perf_trace function, read the saved entry. */
int probe(struct pt_regs *ctx)
{
    u64 tid = bpf_get_current_pid_tgid();
    u64 *enter_di = curr.lookup(&tid);
    if (enter_di == 0)
        return 0;
    u64 key = bpf_ktime_get_ns();
    struct entry2 e = {};
    bpf_probe_read(&e, sizeof(e), (void *)*enter_di);
    history.update(&key, &e);
    return 0;
}
"""

bpf = BPF(text=source)
bpf.attach_kprobe(event="tracing_generic_entry_update", fn_name="enter")
bpf.attach_kretprobe(event="perf_trace_net_dev_xmit", fn_name="probe")
print(BPF.open_kprobes())
while True:
    sleep(1)
    print("****")
    for k, v in bpf.get_table("history").items():
        print(k.value, v.skbaddr, v.len, v.rc)
|
The name If this is only matching some tracepoints, then we might want a "tlist" tool that lists traceable tracepoints. Later on, when proper tracepoint support exists, tlist should just list them all. |
Yeah, now we just need to figure out how to get the entry struct and I think the experience can be pretty good. The tlist tool could print the tracepoint details and format, and then argdist and trace would let you access the parameters from the format fields directly. I don't think I can or should retrofit this magic into the BPF module though. No existing methods there actually introduce their own BPF program, and the tracepoint work would require it (the program that reads ctx->di with the entry address and the program that reads it in a kretprobe when it's already filled). |
I spent some time digging through the TRACE_* macros in the kernel, and I'm still optimistic that there's a way to extract the types automatically. I'll keep looking. |
I have everything set up to generate the structure from the debugfs format file. I just need a reliable way of getting a pointer to the structure that's filled by the |
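For readers following along, generating the structure from a debugfs format file could look roughly like the sketch below. The sample text is illustrative, modeled on the net:net_dev_xmit fields used in the script earlier in this thread, and is not copied from a real kernel; `struct_from_format` and the `__common` member are hypothetical names, and this is not the code from the tpoint branch.

```python
import re

# Illustrative format text; real files live under
# /sys/kernel/debug/tracing/events/<category>/<event>/format.
SAMPLE_FORMAT = """\
name: net_dev_xmit
ID: 1116
format:
\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;
\tfield:unsigned char common_flags;\toffset:2;\tsize:1;\tsigned:0;
\tfield:unsigned char common_preempt_count;\toffset:3;\tsize:1;\tsigned:0;
\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;

\tfield:void * skbaddr;\toffset:8;\tsize:8;\tsigned:0;
\tfield:unsigned int len;\toffset:16;\tsize:4;\tsigned:0;
\tfield:int rc;\toffset:20;\tsize:4;\tsigned:1;
"""

FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*size:(?P<size>\d+);")


def struct_from_format(text, struct_name="entry"):
    """Build a C struct declaration from tracepoint format text."""
    members = []
    common_bytes = 0
    for match in FIELD_RE.finditer(text):
        decl = " ".join(match.group("decl").split())  # normalize whitespace
        if re.search(r"\bcommon_\w+", decl):
            # Fold the common header fields into one opaque blob.
            common_bytes += int(match.group("size"))
        else:
            members.append("    %s;" % decl)
    if common_bytes:
        members.insert(0, "    char __common[%d];" % common_bytes)
    return "\n".join(["struct %s {" % struct_name] + members + ["};"])


print(struct_from_format(SAMPLE_FORMAT, "entry2"))
```

The sketch sums field sizes and ignores the reported offsets, so it assumes the common header has no padding; a more careful version would check each field's offset against the compiler's layout.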
I was hoping to |
I don't mind getting the structure from debugfs (especially since I already wrote the code). It's getting a pointer to the filled struct that's an issue. It's being filled by the |
It looks to me like the call to |
Yeah, that would be a good place to try. I will do it tomorrow. Unfortunately I don't see a way to figure out the tracepoint id when looking only at this function, so I will still need a probe in the specific perf_trace function followed by a probe in perf_tp_event (which the inlinable perf_trace_buf_submit calls). |
probe in perf_tp_event is not great, since it will be firing for all tracepoints. |
The weird thing is that it is called when I use perf to enable the tracepoint, even with the filter. Maybe it's worth looking more into what perf is doing. Maybe need to set up the mmap buffer in order to reach our desired code path. Can you also explain why perf_tp_event would be called for all tracepoints? Do you mean tracepoints not enabled by our code, so we would be probing unnecessarily? |
yes. it's easier to look at perf_trace_##call() in include/trace/perf.h to see what it's doing |
I looked at the code, indeed. I think I can record a temporary marker into a BPF array in the perf_trace_nnn function, and then check for that marker in perf_trace_buf_submit and clear it. That way we are still probing all tracepoints but not doing the actual reporting work for tracepoints we don't own. How's that sound? |
I don't see how that would work. The whole point of tracepoints is to see the args, so the main code of the program needs to operate after the { assign; } part. |
In the version of the function I'm looking at (http://lxr.free-electrons.com/source/include/trace/perf.h#L70), the submit function comes after the assign part. What am I missing? |
yes we're looking at the same stuff. It's called for all tracepoints, so how would a single bpf program deal with all variants of the struct passed as first arg into perf_trace_buf_submit? |
(So the perf_trace_nnn function records a marker that indicates which tracepoint it is, and then when submit is called I can tell if it's my tracepoint and do the main part of the program.) |
Oh, I would have a single BPF program that calls into child functions for each tracepoint. Each child function would recognize its own struct. |
if recorded tp id == 1106 I realize it would get us close to the BPF program size limit. But the tracing entry update function isn't called for some reason, so I figured it would be worth a shot. |
how is that going to work with multiple scripts doing it at the same time? |
It is not. But neither would placing a probe in tracing_generic_entry_update. I think pretty much all the tools place probes in functions and can't be run simultaneously. I am not proposing multiple probes in a single function in a single tool invocation. Argdist would have to generate a single BPF program that handles all the tracepoints the user asks for. |
I suspect that if you see tracing_generic_entry_update is not called then perf_trace_buf_submit won't be called either, so you probably need to debug it regardless. |
indeed,
head = this_cpu_ptr(event_call->perf_events); \
if (__builtin_constant_p(!__task) && !__task && \
    hlist_empty(head)) \
    return;
Not sure what this |
I figured it out after strace-ing What I wonder, though, is whether I now need to call
Sure enough, there are four calls to /cc @4ast |
not sure what's wrong with your strace... did you run it under root? |
I understand that adding kernel tracepoint support in BPF is going to take a while. In the meantime, would it be possible to put together a tool that relies on kprobes? What I had in mind was to enable the tracepoint using ftrace and then place a kprobe in the trace_event_raw_event_* function or in trace_event_buffer_commit, or some similar location. The kprobe would have access to the entry structure defined by TRACE_EVENT.
Is that a direction worth pursuing? Would it be possible (for efficiency) to instruct the ftrace infrastructure to discard the event and not even put it in the ftrace buffer?