
Kernel tracepoint support (ugly) #419

Closed
goldshtn opened this issue Feb 24, 2016 · 51 comments

Comments

@goldshtn
Collaborator

I understand that adding kernel tracepoint support in BPF is going to take a while. In the meantime, would it be possible to put together a tool that relies on kprobes? What I had in mind was to enable the tracepoint using ftrace and then place a kprobe in the trace_event_raw_event_* function or in trace_event_buffer_commit, or some similar location. The kprobe would have access to the entry structure defined by TRACE_EVENT.

Is that a direction worth pursuing? Would it be possible (for efficiency) to instruct the ftrace infrastructure to discard the event and not even put it in the ftrace buffer?

@4ast
Member

4ast commented Feb 24, 2016

:) I was hoping it wouldn't come to this, but that's a reasonable 'workaround'.
Enabling tracepoints via perf and adding a kprobe to perf_trace_* will probably be a tiny bit better, since the code that enables tracepoints via perf_event_open will be reused when tracepoint+bpf is ready.
Going via ftrace is also ok-ish. Both are ugly hacks, of course, and will be slower than a normal kprobe at the beginning of a function, since trace_event_raw* and perf_trace* are 'notrace', so the kprobe will do a full 'int 3'.
Also note that they will not have access to the struct of the trace_event, but only to the 'proto' part of it.

@goldshtn
Collaborator Author

OK, suppose I go through perf - would it be possible to have perf discard the events and not write them out to perf.data? I'm looking to eliminate the inefficiency in doing analysis and filtering in user space, otherwise it isn't really worth pursuing...

@4ast
Member

4ast commented Feb 25, 2016

if perf_event_open is done for the task, then perf_trace* will be called, but it will not produce any output, since pid != pid_of_our_python_process. Another approach is to add a filter to it that always returns false. It's a bit slower, since alloc + copy of arguments and registers will be happening before discarding event. The main overhead is int3 anyway.

@goldshtn
Collaborator Author

OK, so just to make sure I understand the recommended approach. Turn on the tracepoints using perf record with a filter that always returns false so that no data actually goes into a file. Place a kprobe on one of the methods that process the perf event, such as perf_tp_event and try to capture data out of the provided record. Makes sense?

Also adding @brendangregg if you have any thoughts.

@4ast
Member

4ast commented Feb 25, 2016

Turn the tracepoint on with the perf_event_open() syscall, add a filter via the PERF_EVENT_IOC_SET_FILTER ioctl, and add a kprobe to the corresponding perf_trace_* function. There are ~400 of them, one per tracepoint, like perf_trace_block_bio_complete. There will be no trace record; only the function's args in ctx->di, si, ...
Adding a kprobe to perf_tp_event is not really usable, since all events go through it and the record will be different and hard to parse.

@goldshtn
Collaborator Author

I was kind of hoping to use perf and not have to replicate the whole "turn on the tracepoint and set a filter" functionality :-) That might be a bit out of scope for something like this, which is by definition temporary until BPF supports tracepoints... What do you think?

@4ast
Member

4ast commented Feb 25, 2016

You mean to execute the perf binary?! That's more complex, slower, and more fragile than doing a syscall.

@goldshtn
Collaborator Author

Well I'm willing to try. While I have your attention, do you know if there's a simple example of turning on a tracepoint and adding a filter?

@4ast
Member

4ast commented Feb 25, 2016

See bcc/src/cc/libbpf.c for opening an event. Doing the ioctl is trivial; just pass a string to it: ioctl(fd, PERF_EVENT_IOC_SET_FILTER, filter);

@goldshtn
Collaborator Author

goldshtn commented Mar 5, 2016

All right, I think I have something reasonable at this point -- https://github.com/goldshtn/bcc/tree/tpoint. There are a few small issues to figure out, but the main one is this: some tracepoints don't have a corresponding perf_trace_ function, so I don't know where to place the kprobe.

Here's an example that works nicely:

# ./argdist.py -H 't:block:block_rq_complete(void *dummy, void *dummy2, void *dummy3, int nr_bytes):int:nr_bytes'
[05:36:59]
  nr_bytes            : count     distribution
      0 -> 1          : 0        |                                        |
      2 -> 3          : 0        |                                        |
      4 -> 7          : 0        |                                        |
      8 -> 15         : 1        |****************************************|

But there are many examples of tracepoints for which I can't find the corresponding perf_trace_ function. For example:

  • power:cpu_idle -- the tracepoint exists, but there's no perf_trace_cpu_idle function on my system or in fact anything in /proc/kallsyms matching perf_trace_.*idle
  • All syscalls:* tracepoints -- there's just perf_trace_sys_enter and perf_trace_sys_exit, but no specific function for each syscall

Lastly, the perf_trace_ functions' first argument is not declared in the TP_PROTO macro. Do you know what it is? I can try exploring the trace macro gore but if you happen to know...

I'd appreciate any pointers that would help make this even more useful.

/cc @brendangregg @4ast

@4ast
Member

4ast commented Mar 8, 2016

I've looked through the patches in your tree, but it was too hard for me to see what's going on.
My only recommendation is to do it in the generic bcc python module instead of in argdist.
Other than that, clean it up and send a pull req.
Don't bother parsing the first arg. It's a pointer to a raw 'data' buffer: the tracepoint code stores its args in there and then the perf side copies this buffer into userspace.

@goldshtn
Collaborator Author

goldshtn commented Mar 8, 2016

@4ast Thanks! I'll add it to the bcc module. About my other question, do you know why some tracepoints don't have corresponding perf_trace_ functions? Is there any workaround?

@4ast
Member

4ast commented Mar 8, 2016

syscall* tracepoints are slow and not recommended. It's faster to kprobe into sys_foo.
The perf_trace* for power:cpu_idle is actually done by perf_trace_cpu.
$ sudo cat /sys/kernel/debug/tracing/events/power/cpu_frequency/format
print fmt: "state=%lu cpu_id=%lu"
which is the same format; the perf_trace_cpu() helper function is used by both power:cpu_idle and power:cpu_frequency.
I don't know how to easily tell which perf_trace* corresponds to a given tracepoint name.

@brendangregg
Member

Great work! And I didn't realize perf_trace_* was such a strong convention, but I see it now. I should have seen it sooner -- it's from the macros (TRACE_EVENT etc).

So I don't get how we get to nr_bytes.

# cat /sys/kernel/debug/tracing/events/block/block_rq_complete/format 
name: block_rq_complete
ID: 777
format:
        field:unsigned short common_type;       offset:0;       size:2; signed:0;
        field:unsigned char common_flags;       offset:2;       size:1; signed:0;
        field:unsigned char common_preempt_count;       offset:3;       size:1; signed:0;
        field:int common_pid;   offset:4;       size:4; signed:1;

        field:dev_t dev;        offset:8;       size:4; signed:0;
        field:sector_t sector;  offset:16;      size:8; signed:0;
        field:unsigned int nr_sector;   offset:24;      size:4; signed:0;
        field:int errors;       offset:28;      size:4; signed:1;
        field:char rwbs[8];     offset:32;      size:8; signed:1;
        field:__data_loc char[] cmd;    offset:40;      size:4; signed:1;

So the args should be: dev, sector, nr_sector, ... right? Eliding arg0 (as @4ast said), the one-liner should be:

./argdist.py -H 't:block:block_rq_complete(void *dummy2, void *dummy3, int nr_sector):int:nr_sector'

I guess it's clear from your output anyway, since we aren't doing block device I/O of "8 -> 15" bytes :)

As for @4ast's last Q, how we map tracepoint names to perf_trace_* functions, I think it can all come from the events pathnames:

/sys/kernel/debug/tracing/events/block/block_rq_complete

I think we can parse that to give us the tracepoint name (last two components -> block:block_rq_complete), and also the function name (last component prefixed with perf_trace_). Although they don't all exist as those functions (eg, one kernel I'm on is missing perf_trace_block_bio_backmerge), so we'd have to check /sys/kernel/debug/tracing/available_filter_functions.
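
A rough sketch of that parsing (here checking candidates against /proc/kallsyms, which was grepped earlier in this thread; whether kallsyms or available_filter_functions is the right existence check is an assumption):

import os

TRACING = "/sys/kernel/debug/tracing/events"

def perf_trace_candidates():
    """Yield (tracepoint, function) pairs, e.g.
    ("block:block_rq_complete", "perf_trace_block_rq_complete")."""
    with open("/proc/kallsyms") as f:
        symbols = set(line.split()[2] for line in f)
    for category in os.listdir(TRACING):
        cat_dir = os.path.join(TRACING, category)
        if not os.path.isdir(cat_dir):
            continue
        for event in os.listdir(cat_dir):
            if not os.path.isdir(os.path.join(cat_dir, event)):
                continue
            func = "perf_trace_" + event
            if func in symbols:     # not every tracepoint has its own function
                yield ("%s:%s" % (category, event), func)

if __name__ == "__main__":
    for tp, func in perf_trace_candidates():
        print("%-40s -> %s" % (tp, func))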

@brendangregg
Member

RE syscall tracing: we probably should have a docs directory, and a simple syscalls.md document to explain how to trace syscalls, and the caveats of kprobes (stability).

I used to know the difference between sys_read and SyS_read, but I forget right now, and it's not easy to duckduckgo for (aka google for).

I suppose bcc could map syscall tracepoints to the sys kprobes, and have a simple lookup table for that. This would be optional -- you could trace the kprobe directly, or try a syscall tracepoint mapping. The problem is maintenance of the lookup table, especially if we need one bcc repo to support different kernel versions. I'm reminded of SystemTap's tapsets, that get mired in kernel version tests. It might not be so bad if we just do it for syscalls and nothing else.
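
A minimal sketch of what such a lookup could look like (the helper and the "strip sys_enter_/sys_exit_ and prepend sys_" rule are illustrative assumptions, not a decided design):

def syscall_tracepoint_to_kprobe(tracepoint):
    """Map e.g. "syscalls:sys_enter_open" to the kprobe target "sys_open".
    Hypothetical helper; irregular syscalls would need explicit entries."""
    overrides = {
        # per-kernel-version exceptions would be maintained here
    }
    category, event = tracepoint.split(":")
    assert category == "syscalls"
    name = event.replace("sys_enter_", "").replace("sys_exit_", "")
    return overrides.get(name, "sys_" + name)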

@brendangregg
Member

... block:block_bio_backmerge got inlined, hence no perf_trace_...

@4ast
Member

4ast commented Mar 9, 2016

@brendangregg re: available_filter_functions is for ftrace only. It's not applicable to kprobes or tracepoints.
Most of the time the last two dirs of /sys/kernel/debug/tracing/events/block/block_bio_backmerge will map to perf_trace_aa_bb, but not always. In this case the function is perf_trace_block_bio_merge. Inside include/trace/events/block.h there is the line "DEFINE_EVENT(block_bio_merge, block_bio_backmerge", which means that the tracepoints block:block_bio_backmerge (and block:block_bio_frontmerge) are served by the same perf_trace_block_bio_merge() handler. I don't know how to derive that from anything other than the source code.
I'm not worried about the mapping of syscalls. I don't think anyone cares about collecting stats for all syscalls, since new ones are constantly added. There is a standard set of syscalls like sys_open whose prototypes are quite stable. Most of the time adding sys_ will do the trick.
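
If one did want to derive it from the source, a rough sketch (the kernel source path and the regex are assumptions) could scan the trace headers for DEFINE_EVENT lines; tracepoints declared directly with TRACE_EVENT simply map to their own name:

import glob
import re

# DEFINE_EVENT(class, name, ...): tracepoint <name> is served by the handlers
# generated for <class>, e.g. perf_trace_<class>.
DEFINE_EVENT_RE = re.compile(r"DEFINE_EVENT(?:_\w+)?\(\s*(\w+)\s*,\s*(\w+)")

def tracepoint_classes(kernel_src="/usr/src/linux"):
    mapping = {}
    for header in glob.glob(kernel_src + "/include/trace/events/*.h"):
        with open(header) as f:
            for cls, name in DEFINE_EVENT_RE.findall(f.read()):
                mapping[name] = "perf_trace_" + cls
    return mapping

# e.g. tracepoint_classes()["block_bio_backmerge"] -> "perf_trace_block_bio_merge"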

Anyway, I'm working on proper support for tracepoints. The goal is to let bpf programs receive the 2nd argument as a special struct. For example, for block/block_rq_complete the 2nd arg will be struct S {
dev_t dev;
sector_t sector;
unsigned int nr_sector;
int errors;
char rwbs[8];
};
and bpf_prog(struct pt_regs *ctx, struct S *tp) will be able to access it directly as 'tp->nr_sector'.

Technically it's possible to hack it already (without kernel changes). The first argument (that I said not to use) will be filled with such a struct, but the kprobe should be placed not on the first instruction but close to the last, and ctx->di will be pointing to a filled-in struct like the above. We'd need to bpf_probe_read() it first.
Why would it work? Because this is how perf_trace_* are defined. See include/trace/perf.h:70

@goldshtn
Collaborator Author

@brendangregg Re nr_bytes vs. nr_sectors, it's weird but the TP_PROTO macro mentions nr_bytes. See here: http://lxr.free-electrons.com/source/include/trace/events/block.h#L145

TRACE_EVENT(block_rq_complete,
         TP_PROTO(struct request_queue *q, struct request *rq,
                  unsigned int nr_bytes),
...

I don't think it's feasible to come up with the perf_trace_ function name by parsing the source code, so I'll just settle for what I'm currently doing and warn that it might fail for some tracepoints. Indeed, savvy users can simply figure out the perf_trace_ function they need and place a kprobe there directly :-)

@4ast I will experiment with a kretprobe on the perf_trace_ functions and see if I have access to the formatted entry struct. That would be best because that's what the tracepoint consumers are going to expect. (As an aside, is BPF tracepoint support expected in 4.6, 4.7, something later?)

@mcaleavya
Contributor

Hello

I did a test looking at nr_bytes vs sector as it didn’t make sense to me how we could get nr_bytes when it is used to work out the sector value. So maybe it didn’t matter what was passed in to the script.

root@henky:~# dd if=/dev/zero of=./aaaa oflag=direct bs=8192 count=10
10+0 records in
10+0 records out
81920 bytes (82 kB) copied, 0.00516186 s, 15.9 MB/s

root@henky:/var/tmp/march# ./argdist.py -H 't:block:block_rq_complete(void *dummy, void *dummy2, void *dummy3, int allan):int:allan'
[19:51:10]
  allan               : count     distribution
      0 -> 1          : 0        |                                        |
      2 -> 3          : 0        |                                        |
      4 -> 7          : 0        |                                        |
      8 -> 15         : 1        |****************************************|
[19:51:11]
  allan               : count     distribution
      0 -> 1          : 0        |                                        |
      2 -> 3          : 0        |                                        |
      4 -> 7          : 0        |                                        |
      8 -> 15         : 1        |****************************************|
[19:51:12]
  allan               : count     distribution
      0 -> 1          : 0        |                                        |
      2 -> 3          : 0        |                                        |
      4 -> 7          : 0        |                                        |
      8 -> 15         : 2        |****************************************|
[19:51:13]
  allan               : count     distribution
      0 -> 1          : 0        |                                        |
      2 -> 3          : 0        |                                        |
      4 -> 7          : 0        |                                        |
      8 -> 15         : 2        |********                                |
     16 -> 31         : 0        |                                        |
     32 -> 63         : 0        |                                        |
     64 -> 127        : 0        |                                        |
    128 -> 255        : 0        |                                        |
    256 -> 511        : 0        |                                        |
    512 -> 1023       : 0        |                                        |
   1024 -> 2047       : 0        |                                        |
   2048 -> 4095       : 0        |                                        |
   4096 -> 8191       : 0        |                                        |
   8192 -> 16383      : 10       |****************************************|

with nr_sector

root@henky:/var/tmp/march# ./argdist.py -H 't:block:block_rq_complete(void *dummy, void *dummy2, void *dummy3, int nr_sector):int:nr_sector'
[20:05:32]
[20:05:33]
  nr_sector           : count     distribution
      0 -> 1          : 0        |                                        |
      2 -> 3          : 0        |                                        |
      4 -> 7          : 0        |                                        |
      8 -> 15         : 1        |****************************************|
[20:05:35]
  nr_sector           : count     distribution
      0 -> 1          : 0        |                                        |
      2 -> 3          : 0        |                                        |
      4 -> 7          : 0        |                                        |
      8 -> 15         : 1        |****                                    |
     16 -> 31         : 0        |                                        |
     32 -> 63         : 0        |                                        |
     64 -> 127        : 0        |                                        |
    128 -> 255        : 0        |                                        |
    256 -> 511        : 0        |                                        |
    512 -> 1023       : 0        |                                        |
   1024 -> 2047       : 0        |                                        |
   2048 -> 4095       : 0        |                                        |
   4096 -> 8191       : 0        |                                        |
   8192 -> 16383      : 10       |****************************************|
[20:05:36]
  nr_sector           : count     distribution
      0 -> 1          : 0        |                                        |
      2 -> 3          : 0        |                                        |
      4 -> 7          : 0        |                                        |
      8 -> 15         : 2        |********                                |
     16 -> 31         : 0        |                                        |
     32 -> 63         : 0        |                                        |
     64 -> 127        : 0        |                                        |
    128 -> 255        : 0        |                                        |
    256 -> 511        : 0        |                                        |
    512 -> 1023       : 0        |                                        |
   1024 -> 2047       : 0        |                                        |
   2048 -> 4095       : 0        |                                        |
   4096 -> 8191       : 0        |                                        |
   8192 -> 16383      : 10       |****************************************|


@goldshtn
Collaborator Author

I might be missing something, but did you simply run the same command with 'allan' and 'nr_sector' as the parameter names? If so, there will be no difference because the parameter names you give the traced method don't matter at all, and have nothing to do with the names used in the trace structure.

@goldshtn
Collaborator Author

@4ast I've tried hacking together an example that reads from the entry struct when exiting the perf_trace_ function. I record ctx->di on entry, and then bpf_probe_read from it on exit and populate the entry struct. I might be doing something wrong, but I'm getting nonsensical values. To reproduce, put the following in a .py file and run it, and in the background run perf record -e 'block:block_rq_complete'.

#!/usr/bin/env python                                               

from bcc import BPF                                                 
from time import sleep                                              

source = """                                                       
#include <linux/ptrace.h>
#include <linux/blkdev.h>

struct entry {
        dev_t dev;
        sector_t sector;
        unsigned int nr_sector;
        int errors;
        char rwbs[8];
};

BPF_HASH(history, u64, struct entry);
BPF_HASH(curr, u64, u64);

int enter(struct pt_regs *ctx)
{
        u64 tid = bpf_get_current_pid_tgid();
        u64 val = ctx->di;
        curr.update(&tid, &val);
        return 0;
}

int probe(struct pt_regs *ctx)
{
        u64 tid = bpf_get_current_pid_tgid();
        u64 *enter_di = curr.lookup(&tid);
        if (enter_di == 0)
                return 0;

        u64 key = bpf_ktime_get_ns();
        struct entry e = {};
        bpf_probe_read(&e, sizeof(e), (void *)*enter_di);
        history.update(&key, &e);
        return 0;
}
"""

bpf = BPF(text=source)
bpf.attach_kprobe(event="perf_trace_block_rq_complete", fn_name="enter")
bpf.attach_kretprobe(event="perf_trace_block_rq_complete", fn_name="probe")

while True:
        sleep(1)
        print("****")
        for k, v in bpf.get_table("history").items():
                print(k.value, v.rwbs, v.nr_sector, v.sector)

@goldshtn
Collaborator Author

@4ast @drzaeus77 Because this would also be my first PR to the bcc module, I wanted to know what you think about the desired API. How about BPF.attach_tracepoint, which takes the tracepoint category and event (e.g., "block", "block_rq_complete") and probe function name and does the perf_event_open magic? What should it do on failure (when it can't resolve perf_trace_nnn for the specified tracepoint)?
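
Roughly, I'm imagining usage like the sketch below (the signature is exactly what I'm asking about, so treat it as a proposal, not an existing bcc API):

from bcc import BPF

source = """
int probe(struct pt_regs *ctx) { return 0; }
"""

b = BPF(text=source)
# Proposed call: resolve perf_trace_block_rq_complete, enable the tracepoint
# via perf_event_open, and attach "probe" as the kprobe.
b.attach_tracepoint(category="block", event="block_rq_complete", fn_name="probe")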

@4ast
Member

4ast commented Mar 12, 2016

@goldshtn Sorry, my mistake about the first arg. Remembering ctx->di isn't going to work; I misread the kernel code. Trying to figure out another way...

@4ast
Member

4ast commented Mar 13, 2016

ok, the following changes were needed:

struct entry {
        u64 dontuse;
        dev_t dev;
        sector_t sector;
        unsigned int nr_sector;
        int errors;
        char rwbs[8];
};
...
bpf.attach_kprobe(event="tracing_generic_entry_update", fn_name="enter")

and
perf record -a -e 'block:block_rq_complete'
output:

(8653982714137064L, 'WS', 0L, 978592712L)
(8653981359998531L, 'W', 16L, 1121216128L)
(8653982714130641L, 'FWS', 0L, 18446744073709551615L)
(8653981364275177L, 'W', 24L, 822182352L)

Also, I would suggest using an array type instead of a hash to record ctx->di, with key==0.
It will reduce overhead a lot.
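
For example, the hash could become a one-element array along these lines (a sketch using bcc's generic BPF_TABLE macro; whether a single shared slot is safe is discussed just below):

source = """
#include <linux/ptrace.h>

BPF_TABLE("array", int, u64, curr, 1);    /* one slot instead of a hash by tid */

int enter(struct pt_regs *ctx)
{
        int zero = 0;
        u64 val = ctx->di;
        curr.update(&zero, &val);         /* slot 0 holds the latest pointer */
        return 0;
}
"""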

@goldshtn
Collaborator Author

@4ast Cool, will keep experimenting. Thanks for the help. BTW, not sure how an array with key=0 would work -- there could be multiple threads calling the trace methods, so it doesn't seem safe to record ctx->di with just one key for all these threads. Am I wrong?

@goldshtn
Collaborator Author

@4ast OK, so I set up a kprobe on tracing_generic_entry_update, a kretprobe on perf_trace_nnn, and I enable the tracepoint just like I did before (with perf_event_open). The tracing_generic_entry_update function is never called, but the perf_trace_nnn function is.

When I enable the tracepoint with perf record, everything works fine, even if I specify an always-false filter (I use perf record -e 'net:net_dev_xmit' --filter 'common_pid==-17' -a). But it doesn't work when I enable the tracepoint myself.

The code is still on my tpoint branch. Here are the test programs I use outside of the full argdist script. The first one enables a tracepoint using perf_event_open (this is the same as what I do in argdist, but from Python through ctypes):

#include <string.h>              
#include <linux/perf_event.h>    
#include <stdio.h>               
#include <unistd.h>              
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/unistd.h>

int main()
{
        struct perf_event_attr attr = {};
        int pid = 0 /* current */, cpu = -1, group_fd = -1;
        int pfd;

        printf("__NR_perf_event_open = %d\n", __NR_perf_event_open);
        printf("PERF_TYPE_TRACEPOINT = %d\n", PERF_TYPE_TRACEPOINT);
        printf("PERF_SAMPLE_RAW = %d\n", PERF_SAMPLE_RAW);
        printf("PERF_FLAG_FD_CLOEXEC = %d\n", PERF_FLAG_FD_CLOEXEC);
        printf("PERF_EVENT_IOC_SET_FILTER = %d\n", PERF_EVENT_IOC_SET_FILTER);
        printf("PERF_EVENT_IOC_ENABLE = %d\n", PERF_EVENT_IOC_ENABLE);

        attr.config = 1116;        /* net:net_dev_xmit */
        attr.type = PERF_TYPE_TRACEPOINT;
        attr.sample_type = PERF_SAMPLE_RAW;
        attr.sample_period = 1;
        attr.wakeup_events = 1;

        pfd = syscall(__NR_perf_event_open, &attr, pid, cpu, group_fd,
                      PERF_FLAG_FD_CLOEXEC);
        if (pfd < 0) {
                fprintf(stderr, "perf_event_open failed: %s\n",
                        strerror(errno));
                return 1;
        }

        if (ioctl(pfd, PERF_EVENT_IOC_SET_FILTER, "common_pid == -17") < 0) {
                fprintf(stderr, "ioctl to set event filter failed\n");
                return 1;
        }

        if (ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0) < 0) {
                fprintf(stderr, "ioctl to enable event failed\n");
                return 1;
        }

        printf("Hit ENTER to quit.\n");
        getchar();

        return 0;
}

And here's the script, following your recommendations:

#!/usr/bin/env python                                  

from bcc import BPF                                    
from time import sleep                                 

source = """                                         
#include <linux/ptrace.h>
#include <linux/blkdev.h>

struct entry2 {
        u64 __DONT_USE__;
        void *skbaddr;
        unsigned int len;
        int rc;
};

BPF_HASH(history, u64, struct entry2);
BPF_HASH(curr, u64, u64);

int enter(struct pt_regs *ctx)
{
        u64 tid = bpf_get_current_pid_tgid();
        u64 val = ctx->di;
        curr.update(&tid, &val);
        return 0;
}

int probe(struct pt_regs *ctx)
{
        u64 tid = bpf_get_current_pid_tgid();
        u64 *enter_di = curr.lookup(&tid);
        if (enter_di == 0)
                return 0;

        u64 key = bpf_ktime_get_ns();
        struct entry2 e = {};
        bpf_probe_read(&e, sizeof(e), (void *)*enter_di);
        history.update(&key, &e);
        return 0;
}
"""

bpf = BPF(text=source)
bpf.attach_kprobe(event="tracing_generic_entry_update", fn_name="enter")
bpf.attach_kretprobe(event="perf_trace_net_dev_xmit", fn_name="probe")
print(BPF.open_kprobes())

while True:
        sleep(1)
        print("****")
        for k, v in bpf.get_table("history").items():
                print(k.value, v.skbaddr, v.len, v.rc)

@brendangregg
Member

The name BPF.attach_tracepoint sounds good, and if it returns an error status indicating whether it matched or not, the program can deal with it appropriately: print a warning, exit, or ignore it.

If this is only matching some tracepoints, then we might want a "tlist" tool that lists traceable tracepoints. Later on, when proper tracepoint support exists, tlist should just list them all.

@goldshtn
Collaborator Author

Yeah, now we just need to figure out how to get the entry struct and I think the experience can be pretty good. The tlist tool could print the tracepoint details and format, and then argdist and trace would let you access the parameters from the format fields directly.

I don't think I can or should retrofit this magic into the BPF module though. No existing methods there actually introduce their own BPF program, and the tracepoint work would require it (the program that reads ctx->di with the entry address and the program that reads it in a kretprobe when it's already filled).

@drzaeus77
Collaborator

I spent some time digging through the TRACE_* macros in the kernel, and I'm still optimistic that there's a way to extract the types automatically. I'll keep looking.

@goldshtn
Collaborator Author

I have everything set up to generate the structure from the debugfs format file. I just need a reliable way of getting a pointer to the structure that's filled by the perf_trace functions. Is that what you are referring to in "extract the types automatically"?

@drzaeus77
Collaborator

I was hoping to #include and/or use some of the rewriter functionality to glean the struct definition directly from kernel headers, rather than rely on debugfs, though the end result should be the same. Besides there being a bunch of nested macros, it's not magic how the structs are defined, so it should be possible.

@goldshtn
Collaborator Author

I don't mind getting the structure from debugfs (especially since I already wrote the code). It's getting a pointer to the filled struct that's an issue. It's being filled by the assign part in the macros. Any suggestion for where to put a probe to get the structure pointer would help a lot.

@drzaeus77
Collaborator

It looks to me like the call to perf_trace_buf_submit just after perf_trace_##call is probed will have the filled struct entry in arg0. I haven't been following this discussion closely enough to know if that submit will occur or if the way you have the tracepoint set up will skip over the submit bits. I guess I'll have to try your code myself.

@goldshtn
Collaborator Author

Yeah, that would be a good place to try. I will do it tomorrow. Unfortunately I don't see a way to figure out the tracepoint id when looking only at this function, so I will still need a probe in the specific perf_trace function followed by a probe in perf_tp_event (which the inlinable perf_trace_buf_submit calls).

@4ast
Member

4ast commented Mar 16, 2016

A probe in perf_tp_event is not great, since it will be firing for all tracepoints.
Specifying a 'common_pid == -1' filter will be sort of ignored from our hack's point of view, since the tracepoint data will still be stored in the buffer, and the kretprobe will be called and can see the data.
See perf_trace_##call() in include/trace/perf.h for details on how it works.
Currently I cannot come up with a reasonable explanation for why perf_trace_xx is called but tracing_generic_entry_update is not.

@goldshtn
Collaborator Author

The weird thing is that it is called when I use perf to enable the tracepoint, even with the filter. Maybe it's worth looking more into what perf is doing. Maybe we need to set up the mmap buffer in order to reach our desired code path.

Can you also explain why perf_tp_event would be called for all tracepoints? Do you mean tracepoints not enabled by our code, so we would be probing unnecessarily?

@4ast
Member

4ast commented Mar 16, 2016

Yes. It's easier to look at perf_trace_##call() in include/trace/perf.h to see what it's doing and why a kprobe at perf_tp_event() is not going to work. IMO the code is self-explanatory.

@goldshtn
Collaborator Author

I looked at the code, indeed. I think I can record a temporary marker into a BPF array in the perf_trace_nnn function, and then check for that marker in perf_trace_buf_submit and clear it. That way we are still probing all tracepoints but not doing the actual reporting work for tracepoints we don't own. How's that sound?

@4ast
Member

4ast commented Mar 16, 2016

I don't see how that would work. The whole point of tracepoints is to see the args, so the main code of the program needs to operate after the { assign; } part.
What I was proposing is to get a pointer from perf_trace_buf_prepare->tracing_generic_entry_update and process that pointer in the kretprobe of perf_trace_xxx. Does it make more sense now after looking at the perf_trace_xxx code?

@goldshtn
Collaborator Author

In the version of the function I'm looking at (http://lxr.free-electrons.com/source/include/trace/perf.h#L70), the submit function comes after the assign part. What am I missing?

@4ast
Member

4ast commented Mar 16, 2016

Yes, we're looking at the same stuff. It's called for all tracepoints, so how would a single bpf program deal with all the variants of the struct passed as the first arg into perf_trace_buf_submit?

@goldshtn
Collaborator Author

(So the perf_trace_nnn function records a marker that indicates which tracepoint it is, and then when submit is called I can tell if it's my tracepoint and do the main part of the program.)

@goldshtn
Collaborator Author

Oh, I would have a single BPF program that calls into child functions for each tracepoint. Each child function would recognize its own struct.

@goldshtn
Collaborator Author

if recorded tp id == 1106
process tp struct for 1106
else ...

I realize it would get us close to the BPF program size limit. But the tracing entry update function isn't called for some reason, so I figured it would be worth a shot.

@4ast
Member

4ast commented Mar 16, 2016

How is that going to work with multiple scripts doing it at the same time?

@goldshtn
Collaborator Author

It is not. But neither would placing a probe in tracing_generic_entry_update. I think pretty much all the tools place probes in functions and can't be run simultaneously.

I am not proposing multiple probes in a single function in a single tool invocation. Argdist would have to generate a single BPF program that handles all the tracepoints the user asks for.

@4ast
Member

4ast commented Mar 16, 2016

I suspect that if you see tracing_generic_entry_update is not called then perf_trace_buf_submit won't be called either, so you probably need to debug it regardless.
No other ideas so far...

@goldshtn
Collaborator Author

Indeed, perf_trace_buf_submit is not called either. I suppose that means I'm hitting the early-return condition in the perf_trace_nnn function:

head = this_cpu_ptr(event_call->perf_events);                     \
if (__builtin_constant_p(!__task) && !__task &&                   \
                        hlist_empty(head))                         \
        return;

Not sure what this perf_events list should be, but could it be something perf record initializes and just enabling the tracepoint with perf_event_open doesn't initialize?

@goldshtn
Collaborator Author

I figured it out after strace-ing perf record .... I was providing pid=0, cpu=-1 in my call to perf_event_open, which means monitor only the current process on any CPU. I suspect this causes the task-related check above to fail early in perf_trace_nnn. Now I'm using pid=-1, cpu=0, which means monitor any process on CPU 0, and I see the perf_trace_buf_submit function is being called, and I can get the entry structure successfully.

What I wonder, though, is whether I now need to call perf_event_open for each CPU. It would seem that way (e.g. from this thread), but if you can confirm for sure, that would be great. It doesn't seem that perf record ... -a does it -- here's the relevant excerpt from strace:

perf_event_open(0x7ffdb4f4f200, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 4
close(4)                                = 0
perf_event_open(0x33c5e48, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = -1 ENOENT
perf_event_open(0x33c5e48, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = -1 ENOENT
perf_event_open(0x33c5e48, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 4
close(4)                                = 0
perf_event_open(0x33c5e48, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 4
close(4)                                = 0
perf_event_open(0x33b67d8, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 4

Sure enough, there are four calls to perf_event_open that succeed and return the same fd (4), but it's only the perf_event_attr parameter that is sometimes different; the pid and cpu parameters are always -1 and 0.

/cc @4ast

@4ast
Member

4ast commented Mar 18, 2016

not sure what's wrong with your strace... did you run it under root?
here is what I see:
perf_event_open(0x886f40, -1, 0, -1, 0x8 /* PERF_FLAG_??? */) = 4
perf_event_open(0x886f40, -1, 1, -1, 0x8 /* PERF_FLAG_??? */) = 5
perf_event_open(0x886f40, -1, 2, -1, 0x8 /* PERF_FLAG_??? */) = 6
perf_event_open(0x886f40, -1, 3, -1, 0x8 /* PERF_FLAG_??? */) = 8
perf_event_open(0x886f40, -1, 4, -1, 0x8 /* PERF_FLAG_??? */) = 9
perf_event_open(0x886f40, -1, 5, -1, 0x8 /* PERF_FLAG_??? */) = 10
perf_event_open(0x886f40, -1, 6, -1, 0x8 /* PERF_FLAG_??? */) = 11
perf_event_open(0x886f40, -1, 7, -1, 0x8 /* PERF_FLAG_??? */) = 12
it has to be called for all cpus.
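
In Python/ctypes terms, a minimal sketch of opening the tracepoint on every CPU (assuming x86_64 constants and a padded perf_event_attr; an illustration, not a drop-in replacement for the C test program above):

import ctypes
import os

class perf_event_attr(ctypes.Structure):
    _fields_ = [("type", ctypes.c_uint32),
                ("size", ctypes.c_uint32),
                ("config", ctypes.c_uint64),
                ("sample_period", ctypes.c_uint64),
                ("sample_type", ctypes.c_uint64),
                ("read_format", ctypes.c_uint64),
                ("flags", ctypes.c_uint64),
                ("wakeup_events", ctypes.c_uint32),
                ("pad", ctypes.c_uint32 * 35)]    # room for the newer fields

NR_PERF_EVENT_OPEN = 298           # x86_64
PERF_TYPE_TRACEPOINT = 2
PERF_SAMPLE_RAW = 1 << 10
PERF_FLAG_FD_CLOEXEC = 1 << 3
PERF_EVENT_IOC_ENABLE = 0x2400     # _IO('$', 0)

libc = ctypes.CDLL(None, use_errno=True)

def open_tracepoint_all_cpus(event_id):
    """event_id comes from /sys/kernel/debug/tracing/events/<cat>/<event>/id."""
    fds = []
    for cpu in range(os.sysconf("SC_NPROCESSORS_ONLN")):
        attr = perf_event_attr(type=PERF_TYPE_TRACEPOINT, config=event_id,
                               sample_type=PERF_SAMPLE_RAW, sample_period=1,
                               wakeup_events=1)
        fd = libc.syscall(NR_PERF_EVENT_OPEN, ctypes.byref(attr),
                          -1, cpu, -1, PERF_FLAG_FD_CLOEXEC)
        if fd < 0:
            raise OSError(ctypes.get_errno(), "perf_event_open failed")
        libc.ioctl(fd, PERF_EVENT_IOC_ENABLE, 0)
        fds.append(fd)               # keep the fds open while tracing
    return fds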

@goldshtn
Collaborator Author

Whew, looks like it's done. As I wrote in PR #439, I haven't added it to the BPF module because it requires adding a probe, and I felt it would make the module do some unusual things which aren't its responsibility. @4ast, thanks a lot for your help and patience.

@2opremio mentioned this issue Apr 5, 2016