[RFC] prepare for bpf_probe_read() split into user/kernel #614

brendangregg · 2019-05-09T18:47:40Z

As was discussed at LSFMM, bpf_probe_read() may be split in the future into bpf_probe_read_kernel() and bpf_probe_read_user(), to support other architectures (like SPARC) where the kernel/user address space overlaps, and the pointer isn't sufficient to identify the mode. This will be a major change.

We currently read memory addresses in at least these two ways:

*addr: dereference
str(addr): fetch NULL-terminated string

bpftrace will need a way to determine whether to call the _kernel() or _user() version of bpf_probe_read(). Here's what I propose, which breaks the fewest tools:

*addr: dereference kernel
str(addr): fetch NULL-terminated kernel string
ucopy(addr): dereference user
ustr(addr): fetch NULL-terminated user string

Right now, ucopy() and ustr() would be dummy functions that do nothing, and later on can be switched to call bpf_probe_read_user(). Adding these dummy functions now will help us prevent breaking the bpftrace API in the future.

I picked "u"something() since we have other user functions that start with "u". ucopy() is short for copy_from_user(), and ustr() is the user version of str(). I think these names make sense.

If we are in agreement, we should add these dummy functions sooner rather than later for API stabilization.

Note that #28 calls for a uwrite() function, so if we want symmetry then we could call it uread() instead of ucopy().

Edit: this is option A.

The text was updated successfully, but these errors were encountered:

ajor · 2019-05-09T21:55:22Z

If we're in a kprobe/kernel tracepoint will all pointers be to kernel memory, and vice versa for uprobes/usdts?

brendangregg · 2019-05-09T22:06:47Z

@ajor Ah right, this was suggested as one possible solution. I'll check all my tools to see how feasible that is.

Following that approach, we could leave these with their current syntax and the following meanings, with their behavior changing when bpf_probe_read() is split:

*addr: dereference event-context memory (kernel/user based on event)
str(addr): fetch NULL-terminated event-context string (kernel/user based on event)

And add these optional qualified versions (going with the *read() version here):

kread(addr): dereference kernel
kstr(addr): fetch NULL-terminated kernel string
uread(addr): dereference user
ustr(addr): fetch NULL-terminated user string

Edit: this is option B.

danobi · 2019-05-10T19:06:40Z

What about if you kprobed

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count);

And tried to read from buf? If my understanding is correct, buf is a userspace address despite us being in kprobe context.

brendangregg · 2019-05-10T20:21:08Z

Right, such functions are the exception, and now that I think of it, all the syscalls tracepoints that use buffers will have the same problem. Eg:

# bpftrace -lv 't:syscalls:sys_enter_connect'
failed to initialize usdt context for path syscalls
tracepoint:syscalls:sys_enter_connect
    int __syscall_nr;
    int fd;
    struct sockaddr * uservaddr;
    int addrlen;

With options A or B you'd need to use uread() to handle this.

If this happens for all syscall tracepoints, then I wonder if they can default to user-memory accesses. Would complicate documentation: tracepoints are kernel context unless they are syscall tracepoints.

danobi · 2019-05-10T20:36:04Z

I suspect being explicit would be the better long term option for maintainability. Special casing a bunch of things would probably lead to a bunch of tricky gotcha moments. Especially b/c it's hard to tell sometimes if you're reading junk memory or actual data.

How likely is the bpf_probe_read split going to happen?

ajor · 2019-05-12T21:51:27Z

Yeah, I don't think we have enough information to identify which kprobe arguments are pointers to user space and which are to kernel space.

I'm in favour of option B, as I'd like to keep the familiar C-style syntax as the common one. We could have the same four options from the original post:

*addr: dereference kernel
str(addr): fetch NULL-terminated kernel string
ucopy(addr): dereference user
ustr(addr): fetch NULL-terminated user string

for kprobes and kernel tracepoints, but for uprobes and USDTs just use *addr and str(addr) as we currently do, since I don't think they'll ever have kernel pointers. I'd also suggest not allowing ucopy and ustr in user space probes, so we don't introduce multiple ways of doing the same thing.

Also, maybe the name ucopy should be something different, since currently *addr doesn't actually copy the thing being pointed to into BPF memory. That only happens if it then gets assigned to a map. Possibly something like uptr, but not sure.

brendangregg · 2019-05-12T22:50:08Z

I may save a kernel pointer (eg bio) in a per-tid map and then want to fetch it and print members during a uprobe/usdt (eg, MySQL query done). So Id like having addr being probe context sensitive as proposed, but I think we probably need kstr and kread in uprobe/usdt

…

On Sun, May 12, 2019, 2:51 PM Alastair Robertson ***@***.***> wrote: Yeah, I don't think we have enough information to identify which kprobe arguments are pointers to user space and which are to kernel space. I'm in favour of option B, as I'd like to keep the familiar C-style syntax as the common one. We could have the same four options from the original post: - *addr: dereference kernel - str(addr): fetch NULL-terminated kernel string - ucopy(addr): dereference user - ustr(addr): fetch NULL-terminated user string for kprobes and kernel tracepoints, but for uprobes and USDTs just use *addr and str(addr) as we currently do, since I don't think they'll ever have kernel pointers. I'd also suggest not allowing ucopy and ustr in user space probes, so we don't introduce multiple ways of doing the same thing. Also, maybe the name ucopy should be something different, since currently *addr doesn't actually copy the thing being pointed to into BPF memory. That only happens if it then gets assigned to a map. Possibly something like uptr, but not sure. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#614 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAIM3G3JKGJUWZGZCU3TE5DPVCGOBANCNFSM4HL5BXRQ> .

brendangregg · 2019-06-05T05:30:38Z

for reference: https://lore.kernel.org/lkml/CAHk-=wib9VSbwbS+N82ZPNtvt4vrvYyHyQduhFimX8nyjCyZyA@mail.gmail.com/

brendangregg · 2019-06-07T01:44:52Z

Summarizing the current plan so far. It involves adding 4 builtins, and setting probe memory context.

builtins:

uread(addr): dereference user address
ustr(addr): fetch NULL-terminated user string
kread(addr): dereference kernel address
kstr(addr): fetch NULL-terminated kernel string

context (for *addr and str()):

kprobes/kretprobes: kernel
uprobes/uretprobes: user
tracepoints: kernel (with the exception of syscall tracepoints: user)
other probe types: kernel

Context for each probe type would be documented in the reference guide.

ajor · 2019-06-13T17:24:52Z

I think kptr() and uptr() would make more sense than kcopy/kread() and ucopy/uread() if these functions are just to dereference pointers. Dereferencing doesn't actually involve copying any data in bpftrace - that only happens when the dereferenced pointer is assigned to a map.

brendangregg · 2019-06-13T18:22:20Z

Ok. Hmm. Maybe it should be kptr/uptr.

I'm going through tracepoints and tools to see how feasible the plan is. Eg, to test if these pointers are user or kernel:

# bpftrace -lv 't:syscalls:sys_enter_read'
tracepoint:syscalls:sys_enter_read
    int __syscall_nr;
    unsigned int fd;
    char * buf;
    size_t count;
# bpftrace -lv 't:syscalls:sys_enter_openat'
tracepoint:syscalls:sys_enter_openat
    int __syscall_nr;
    int dfd;
    const char * filename;
    int flags;
    umode_t mode;

testing with bpftest.c:

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
	char filename[] = "bpftest.c";
	char buf[256];
	int fd, bytes;

	printf("filename = 0x%llx\n", &filename);
	printf("buf = 0x%llx\n", &buf);

	fd = open(filename, O_RDONLY);
	if (fd < 0) { printf("ERROR open %d\n", fd); return(1); }

	bytes = read(fd, &buf, sizeof(buf));
	close(fd);

	return 0;
}

running and tracing it:

# bpftrace -e 't:syscalls:sys_enter_openat /comm == "bpftest"/ { printf("%llx %s\n", args->filename, str(args->filename)); }' -c ./bpftest
Attaching 1 probe...
7f51deb6f428 /etc/ld.so.cache
7f51ded77dd0 /lib/x86_64-linux-gnu/libc.so.6
filename = 0x7fff21006526
buf = 0x7fff21006530
7fff21006526 bpftest.c
# bpftrace -e 't:syscalls:sys_enter_read /comm == "bpftest"/ { printf("%llx\n", args->buf); }' -c ./bpftest
Attaching 1 probe...
7ffd241c1208
filename = 0x7ffd241c1906
buf = 0x7ffd241c1910
7ffd241c1910

Ok, so args->filename and args->buf are user-space pointers: bpftrace is printing the same address that the user-space program sees.

brendangregg · 2019-06-24T07:19:03Z

What if kptr() and uptr() just set the address mode for the address argument, and no more? ie, did not dereference. We'd then have:

*addr: dereference address from probe address space context
str(): fetch NULL-terminated string from probe address space context
*uptr(addr): dereference user address
str(uptr(addr)): fetch NULL-terminated user string
*kptr(addr): dereference kernel address
str(kptr(addr)): fetch NULL-terminated kernel string

So we'd only be adding kptr() and uptr(). I think their usage will be super rare, so maybe this is sufficient.

ajor · 2019-06-24T08:50:22Z

Yes I really like this idea!

brendangregg · 2019-06-26T03:01:46Z

In the documentation, we can also make this clear:

kprobes/kretprobes:

arg0...argN, retval: when dereferenced, are kernel addresses.

uprobes/uretprobes:

arg0...argN, retval: when dereferenced, are user addresses.

tracepoints:

args->field: when dereferenced, are kernel addresses, except syscall tracepoints which are user.

fbs · 2020-02-11T08:48:02Z

Sounds good to me. I'll give it a go

fbs · 2020-09-01T15:58:10Z

@sumanthkorikkar do you have time to continue work on this? :)

sumanthkorikkar · 2020-09-02T05:12:50Z

ok. sure. I will check and give this a try. Good opportunity to know various use cases. Thanks.

1. Usecases are described on bpftrace#614 2. bpftrace#1427 Implemented usecases: 1. *addr(), str(addr) -> Use bpf_probe_read_kernel or bpf_probe_read_user based on addrspace context. Addrspace context: 1. kprobe, kretprobe, kfunc, kretfunc, tracepoints without syscalls -> bpf_probe_read_kernel 2. usdt, uprobes, uretprobes, usdt, tracepoint with syscalls -> bpf_probe_read_user 2. *uptr(addr), str(uptr(addr)) -> use bpf_probe_read_user 3. *kptr(addr), str(kptr(addr)) -> use bpf_probe_read_kernel 4. argX, retval on kernel context -> bpf_probe_read_kernel 5. argX, retval on user context -> bpf_probe_read_user 6. args->field -> bpf_probe_read_user Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>

1. Usecases are described on bpftrace#614 2. bpftrace#1427 Implemented usecases: 1. *addr(), str(addr) -> Use bpf_probe_read_kernel or bpf_probe_read_user based on addrspace context. Addrspace context: 1. kprobe, kretprobe, kfunc, kretfunc, tracepoints without syscalls -> bpf_probe_read_kernel 2. usdt, uprobes, uretprobes, usdt, tracepoint with syscalls -> bpf_probe_read_user 2. *uptr(addr), str(uptr(addr)) -> use bpf_probe_read_user 3. *kptr(addr), str(kptr(addr)) -> use bpf_probe_read_kernel 4. argX, retval on kernel context -> bpf_probe_read_kernel 5. argX, retval on user context -> bpf_probe_read_user Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>

1. Usecases are described on #614 2. #1427 Implemented usecases: 1. *addr(), str(addr) -> Use bpf_probe_read_kernel or bpf_probe_read_user based on addrspace context. Addrspace context: 1. kprobe, kretprobe, kfunc, kretfunc, tracepoints without syscalls -> bpf_probe_read_kernel 2. usdt, uprobes, uretprobes, usdt, tracepoint with syscalls -> bpf_probe_read_user 2. *uptr(addr), str(uptr(addr)) -> use bpf_probe_read_user 3. *kptr(addr), str(kptr(addr)) -> use bpf_probe_read_kernel 4. argX, retval on kernel context -> bpf_probe_read_kernel 5. argX, retval on user context -> bpf_probe_read_user Signed-off-by: Sumanth Korikkar <sumanthk@linux.ibm.com>

fbs · 2021-05-04T23:52:00Z

do we still have todos for this? I think it works? @sumanthkorikkar

mmarchini mentioned this issue Jun 6, 2019

Reading from the offset/buffer (dtrace's copyin analogue) #650

Closed

brendangregg mentioned this issue Jun 9, 2019

optional parameter crash #735

Closed

brendangregg changed the title ~~[RFC] add copy_from_user() equivalent~~ [RFC] prepare for bpf_probe_read() split into user/kernel Jun 13, 2019

mmarchini added the enhancement New feature or request, changes on existing features label Jul 24, 2019

fbs mentioned this issue Feb 11, 2020

ARM64: several test case failure in runtime-tests.sh #1137

Closed

danobi linked a pull request Aug 18, 2020 that will close this issue

Addrspace basics #1427

Merged

fbs mentioned this issue Aug 19, 2020

Addrspace basics #1427

Merged

fbs added this to the 0.12.0 milestone Aug 19, 2020

fbs closed this as completed in #1427 Aug 28, 2020

fbs reopened this Sep 2, 2020

mmisono mentioned this issue Sep 8, 2020

uprobe char pointer arguments, output abnormal #1503

Closed

sumanthkorikkar mentioned this issue Sep 8, 2020

Set addrspace info for various builtins #1504

Merged

3 tasks

fbs modified the milestones: 0.12.0, 0.13.0 Mar 1, 2021

russjones mentioned this issue Apr 2, 2021

Move from BCC to libbpf with CO-RE gravitational/teleport#6027

Closed

fbs closed this as completed Jun 9, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[RFC] prepare for bpf_probe_read() split into user/kernel #614

[RFC] prepare for bpf_probe_read() split into user/kernel #614

brendangregg commented May 9, 2019 •

edited

ajor commented May 9, 2019

brendangregg commented May 9, 2019 •

edited

danobi commented May 10, 2019 •

edited

brendangregg commented May 10, 2019

danobi commented May 10, 2019

ajor commented May 12, 2019

brendangregg commented May 12, 2019 via email

brendangregg commented Jun 5, 2019

brendangregg commented Jun 7, 2019

ajor commented Jun 13, 2019 •

edited

brendangregg commented Jun 13, 2019

brendangregg commented Jun 24, 2019

ajor commented Jun 24, 2019

brendangregg commented Jun 26, 2019

fbs commented Feb 11, 2020

fbs commented Sep 1, 2020

sumanthkorikkar commented Sep 2, 2020

fbs commented May 4, 2021

[RFC] prepare for bpf_probe_read() split into user/kernel #614

[RFC] prepare for bpf_probe_read() split into user/kernel #614

Comments

brendangregg commented May 9, 2019 • edited

ajor commented May 9, 2019

brendangregg commented May 9, 2019 • edited

danobi commented May 10, 2019 • edited

brendangregg commented May 10, 2019

danobi commented May 10, 2019

ajor commented May 12, 2019

brendangregg commented May 12, 2019 via email

brendangregg commented Jun 5, 2019

brendangregg commented Jun 7, 2019

ajor commented Jun 13, 2019 • edited

brendangregg commented Jun 13, 2019

brendangregg commented Jun 24, 2019

ajor commented Jun 24, 2019

brendangregg commented Jun 26, 2019

fbs commented Feb 11, 2020

fbs commented Sep 1, 2020

sumanthkorikkar commented Sep 2, 2020

fbs commented May 4, 2021

brendangregg commented May 9, 2019 •

edited

brendangregg commented May 9, 2019 •

edited

danobi commented May 10, 2019 •

edited

ajor commented Jun 13, 2019 •

edited