usdt probes requiring semaphore cannot be used on google container OS #2230

Closed
dalehamel opened this issue Feb 25, 2019 · 37 comments

Comments

@dalehamel
Member

This isn't necessarily a bug in bcc per se, more a write-up of what doesn't work and why. Hopefully it will at least keep others from going down the same rabbit hole I did.

There may be a way for bcc to fix this by finding another way to increment the semaphore, but I don't really see how.

In the Chromium OS kernel source, this code in fs/proc/base.c rejects writes to /proc/<pid>/mem for security reasons:

static ssize_t mem_write(struct file *file, const char __user *buf,
       size_t count, loff_t *ppos)
{
#ifdef CONFIG_SECURITY_CHROMIUMOS_READONLY_PROC_SELF_MEM
  return -EACCES;
#else
  return mem_rw(file, (char __user*)buf, count, ppos, 1);
#endif
}

This config option is enabled by default in the kernel used by container OS, such as in Google's GKE offering: https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/master/overlay-lakitu/sys-kernel/lakitu-kernel-4_14/files/base.config#3016

This means that anyone trying to use bcc on a chromium derived OS, and especially anyone trying to use bcc on GKE, will probably also hit this if they try to use usdt probes.

During the process of enabling a usdt probe, some probes need to be enabled by writing to a semaphore - this isn't true of all usdt probes, but is probably true of many (I ran into this with ruby's usdt probes). As the dtrace docs indicate, this is a means to avoid expensive processing around the probe, only adding this extra info/processing if the probe is enabled.

The code that handles this in bcc is here:

bcc/src/cc/usdt/usdt.cc

Lines 109 to 113 in c2e2a26

  if (::lseek(memfd, address, SEEK_SET) < 0 ||
      ::write(memfd, &original, 2) != 2) {
    ::close(memfd);
    return false;
  }

And it is essentially the same as the approach described here.

However, this leads to probes silently failing to be enabled if run against a kernel with the above hardening. Using strace, it is obvious why it fails:

openat(AT_FDCWD, "/proc/726288/mem", O_RDWR) = 72
lseek(72, 94200854600568, SEEK_SET)     = 94200854600568
read(72, "\0\0", 2)                     = 2
lseek(72, 94200854600568, SEEK_SET)     = 94200854600568
write(72, "\1\0", 2)                    = -1 EACCES (Permission denied)
close(72)                               = 0

Note that this will only happen for probes where readelf --notes indicates a value for the semaphore:

  stapsdt              0x00000059       NT_STAPSDT (SystemTap probe descriptors)
    Provider: ruby
    Name: cmethod__entry
    Location: 0x000000000019999d, Base: 0x00000000002d8ec0, Semaphore: 0x000000000052bb54
    Arguments: 8@32(%rsp) 8@40(%rsp) 8@48(%rsp) -4@56(%rsp)

As there actually is a semaphore indicated here, this USDT probe would be affected. A similar probe in libc would not be affected and can be attached to, as the semaphore is not required for the "enable" mechanism:

  stapsdt              0x0000003c       NT_STAPSDT (SystemTap probe descriptors)
    Provider: libc
    Name: memory_heap_free
    Location: 0x000000000019bfd0, Base: 0x00000000001bdd48, Semaphore: 0x0000000000000000
    Arguments: 8@%r11 8@%rax
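
As a rough illustration of the distinction above, here is a minimal Python sketch that scans `readelf --notes` text and keeps only the probes whose semaphore address is nonzero. The function name, the regular expression, and the embedded sample text (copied from the two notes above) are illustrative assumptions, not part of bcc.

```python
import re

# Sample text copied from the two stapsdt notes shown in this issue.
SAMPLE_NOTES = """\
  stapsdt              0x00000059       NT_STAPSDT (SystemTap probe descriptors)
    Provider: ruby
    Name: cmethod__entry
    Location: 0x000000000019999d, Base: 0x00000000002d8ec0, Semaphore: 0x000000000052bb54
    Arguments: 8@32(%rsp) 8@40(%rsp) 8@48(%rsp) -4@56(%rsp)
  stapsdt              0x0000003c       NT_STAPSDT (SystemTap probe descriptors)
    Provider: libc
    Name: memory_heap_free
    Location: 0x000000000019bfd0, Base: 0x00000000001bdd48, Semaphore: 0x0000000000000000
    Arguments: 8@%r11 8@%rax
"""

# Hypothetical pattern: matches one Provider/Name/Semaphore triple per note.
NOTE_RE = re.compile(
    r"Provider:\s*(\S+)\s+Name:\s*(\S+)\s+Location:.*?Semaphore:\s*(0x[0-9a-fA-F]+)",
    re.S,
)


def probes_needing_semaphore(readelf_notes):
    """Return (provider, name, semaphore_address) for probes with a nonzero semaphore."""
    result = []
    for provider, name, semaphore in NOTE_RE.findall(readelf_notes):
        address = int(semaphore, 16)
        if address:
            result.append((provider, name, address))
    return result


for provider, name, address in probes_needing_semaphore(SAMPLE_NOTES):
    print(f"{provider}:{name} needs semaphore at {address:#x}")
```

Only the ruby probe is reported; the libc probe has a zero semaphore field and is skipped.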
@dalehamel
Member Author

dalehamel commented Feb 25, 2019

It looks like this error isn't being propagated, even though it is pretty clearly not succeeding. We should probably at least determine why that is, and ensure that it doesn't fail silently (as is now the case).
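
To make the failure visible rather than silent, the read-modify-write could surface the OS error to the caller. The following is a minimal Python sketch of that idea, not bcc's code: `bump_semaphore` is a hypothetical helper that mirrors the 2-byte read and write seen in the strace output and raises instead of returning false.

```python
import os
import struct


def bump_semaphore(mem_path, address, delta):
    """Add delta to the 16-bit semaphore at address and return the new value.

    Any failure (open, short read, EACCES on write, ...) is raised as OSError
    so the caller cannot miss it.
    """
    fd = os.open(mem_path, os.O_RDWR)
    try:
        raw = os.pread(fd, 2, address)
        if len(raw) != 2:
            raise OSError(f"short read of semaphore at {address:#x} in {mem_path}")
        # The semaphore is treated as a native-endian int16_t, as in the bcc snippet.
        (value,) = struct.unpack("=h", raw)
        new_value = value + delta
        written = os.pwrite(fd, struct.pack("=h", new_value), address)
        if written != 2:
            raise OSError(f"short write of semaphore at {address:#x} in {mem_path}")
        return new_value
    finally:
        os.close(fd)
```

Called with a path such as `/proc/<pid>/mem` on a kernel with the hardening described above, the write step would be expected to raise `PermissionError` (EACCES), which a tool can then report to the user.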

@yonghong-song
Collaborator

@dalehamel I agree with you that the error should be propagated back to the user, as failing to update the semaphore essentially prevents usdt from working. Do you want to work on a pull request to fix the issue?

BTW, Song Liu implemented a mechanism for the kernel to update the semaphore, so all this user-space semaphore-writing logic won't be needed. It is available in 4.20.

commit a6ca88b241d5e929e6e60b12ad8cd288f0ffa256
Author: Song Liu <songliubraving@fb.com>
Date:   Mon Oct 1 22:36:36 2018 -0700

    trace_uprobe: support reference counter in fd-based uprobe
    
    This patch enables uprobes with reference counter in fd-based uprobe.
    Highest 32 bits of perf_event_attr.config is used to stored offset
    of the reference count (semaphore).
    
    Format information in /sys/bus/event_source/devices/uprobe/format/ is
    updated to reflect this new feature.
    
    Link: http://lkml.kernel.org/r/20181002053636.1896903-1-songliubraving@fb.com
    
    Cc: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Reviewed-and-tested-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
    Signed-off-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

I think this mechanism should work in your environment as it won't go through mem_write. We just need to implement this in bcc.

@dalehamel
Member Author

@yonghong-song thanks for pointing me to that commit, that is very encouraging.

I'll see if I can prove the concept by doing a custom build with this patch, and if I can I might try my hand at adding this mechanism (gated by kernel version) into BCC.

If I'm in that code already, I may as well also look into improving the error propagation.

@yonghong-song
Collaborator

That sounds great. Thanks!

@palmtenor
Member

@dalehamel Thank you for providing a good use case. The kernel-side patch landed a while ago, but we never got the chance to support it in BCC. I will have some time in the next few weeks to add the support. Meanwhile, I also agree it would be good to report the write error, as we will need that fallback logic anyway.

@dalehamel
Member Author

@palmtenor thanks for weighing in!

I read through the patch, but from the diff alone it's not obvious to me what is required to trigger the semaphore increment, as it seems to just be storing a mask. Could you briefly explain what bcc needs to do to trigger this, if you have the time? I'm not able to see the full picture right now.

@palmtenor
Member

USDT uses uprobes under the hood: we ask the kernel to attach to a (binary, address) pair, and the kernel actually does the attaching on an (inode, address) pair. The patch (that one, and a few it was based on) adds the ability to provide an additional semaphore address when attaching a uprobe, so that the kernel increments / decrements that value over the uprobe's lifecycle.

We have a working prototype using that feature, but for BCC a small refactor would be needed across the uprobe API as well as the USDT library to accommodate this and keep it backward compatible. I'm figuring it out :)

@dalehamel
Member Author

Thank you for explaining, @palmtenor. I understand the approach now and have a rough idea of the scope of the work. This is definitely an exciting way forward, as I thought we had basically hit a brick wall in our environment under the current implementation.

When you have the time to create the BCC patch I would love to review it, mostly to improve my understanding.

@dalehamel
Member Author

dalehamel commented Nov 22, 2019

@palmtenor have you had any chance to look at this? I just ran into this issue again recently, and it looks like the code is the same as it was when this issue was written:

bcc/src/cc/usdt/usdt.cc

Lines 97 to 120 in 5ce16e4

  std::string procmem = tfm::format("/proc/%d/mem", pid_.value());
  int memfd = ::open(procmem.c_str(), O_RDWR);
  if (memfd < 0)
    return false;
  int16_t original;
  if (::lseek(memfd, address, SEEK_SET) < 0 ||
      ::read(memfd, &original, 2) != 2) {
    ::close(memfd);
    return false;
  }
  original = original + val;
  if (::lseek(memfd, address, SEEK_SET) < 0 ||
      ::write(memfd, &original, 2) != 2) {
    ::close(memfd);
    return false;
  }
  ::close(memfd);
  return true;
}

@dalehamel
Member Author

I did some digging, it looks like this function would need to be modified:

bcc/src/cc/libbpf.c

Lines 980 to 986 in 992e482

  res = snprintf(buf, PATH_MAX, "%c:%ss/%s %s:0x%lx", attach_type==BPF_PROBE_ENTRY ? 'p' : 'r',
                 event_type, ev_alias, config1, (unsigned long)offset);
  if (res < 0 || res >= PATH_MAX) {
    fprintf(stderr, "Event alias (%s) too long for buffer\n", ev_alias);
    close(kfd);
    return -1;
  }

This is what ultimately (I think) creates the uprobe from the call to bpf_attach_uprobe. If the semaphore address is passed along (I guess after checking that the kernel version is >= 4.20), then it should be possible to create the uprobe with this offset and let the kernel enable the semaphore.

I suppose there would also need to be some conditional logic in the USDT code, for it to rely on the kernel to set the semaphore values if the feature is supported.

In my production environment, I am being held up in rolling out USDT tools without some workaround for this functionality. If you have any draft PR @palmtenor, I can try picking it up; otherwise I can poke around and see if I can make progress on this issue when I get a chance to.

@yonghong-song
Collaborator

@dalehamel The right place is the function bpf_try_perf_event_open_with_probe.
You do not need to check the kernel version; you need to check the following on the host:

[$ /sys/bus/event_source/devices/uprobe/format]# cat ref_ctr_offset
config:32-63
[$ /sys/bus/event_source/devices/uprobe/format]#

similar to other FD based uprobe/kprobe checking.

Basically, the offset for the semaphore will occupy the top 32 bits of perf_event_attr.config.
I am not sure whether this setting is available in your environment. If it is, you can enhance bcc to handle such cases.
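
The feature check described above could be sketched in Python roughly as follows. The sysfs path and the `config:32-63` format come from this thread; the helper names `parse_format_spec` and `ref_ctr_offset_spec` are hypothetical and do not exist in bcc.

```python
from pathlib import Path

# Path taken from the comment above; present only on kernels with the feature.
FORMAT_FILE = Path("/sys/bus/event_source/devices/uprobe/format/ref_ctr_offset")


def parse_format_spec(spec):
    """Parse a perf format spec such as 'config:32-63' into (field, low_bit, high_bit)."""
    field, _, bits = spec.strip().partition(":")
    low, _, high = bits.partition("-")
    # A single-bit spec such as 'config:0' has no '-' part.
    return field, int(low), int(high or low)


def ref_ctr_offset_spec(path=FORMAT_FILE):
    """Return the parsed spec, or None when the kernel does not expose the file."""
    try:
        return parse_format_spec(Path(path).read_text())
    except OSError:
        return None


spec = ref_ctr_offset_spec()
if spec is None:
    print("ref_ctr_offset not supported; fall back to writing /proc/<pid>/mem")
else:
    print("ref_ctr_offset lives in %s bits %d-%d" % spec)
```

Probing for the file rather than comparing kernel versions also covers kernels that have the patch backported.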

@dalehamel
Member Author

Thanks @yonghong-song, my development host is kernel 5.3 and appears to have this field:

$ cat /sys/bus/event_source/devices/uprobe/format/ref_ctr_offset
config:32-63

I will look into devising a proof-of-concept patch to add support for this method of semaphore incrementation to bcc. Our production kernel is still 4.14, but once we are able to upgrade to 4.20 or later, I'd like to already have the functionality landed in bcc :)

@dalehamel
Member Author

I started working on a patch today. I think I have this figured out; I just need to test it out.

@dalehamel
Member Author

dalehamel commented Feb 13, 2020

@yonghong-song @palmtenor I have been having a hard time getting my branch to work. It would seem that passing the semaphore address as-is doesn't do the trick.

In userspace, we resolve the 32 bit offset to a 64 bit global address, where we then seek in memory to increment the semaphore.

I believe that I am correctly shifting the bits:

REF CTR OFFSET 34eb60
DEBUG before:           34eb60 after:   34eb6000000000

Where 34eb60 matches what I see for the semaphore address from readelf --notes.
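
For reference, the bit shift shown in the debug output can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; `encode_uprobe_config` is a hypothetical helper, and the 32-63 bit range is the one reported by the `ref_ctr_offset` format file earlier in this thread.

```python
def encode_uprobe_config(ref_ctr_offset, low_bit=32, high_bit=63):
    """Place ref_ctr_offset into bits low_bit..high_bit of a 64-bit config value."""
    width = high_bit - low_bit + 1
    if not 0 <= ref_ctr_offset < (1 << width):
        raise ValueError(f"offset {ref_ctr_offset:#x} does not fit in {width} bits")
    return ref_ctr_offset << low_bit


def decode_uprobe_config(config, low_bit=32, high_bit=63):
    """Recover the offset stored in bits low_bit..high_bit of config."""
    width = high_bit - low_bit + 1
    return (config >> low_bit) & ((1 << width) - 1)


config = encode_uprobe_config(0x34EB60)
print(f"before: {0x34EB60:x} after: {config:x}")
```

The printed "after" value is 34eb6000000000, matching the debug line above.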

Looking at the kernel source, it appears to use find_ref_ctr_vma to iterate through the modules in the memory map:

https://github.com/torvalds/linux/blob/1cc33161a83d20b5462b1e93f95d3ce6388079ee/kernel/events/uprobes.c#L365

So I would expect that I am passing it in the correct format. However, the semaphore is never incremented, and the probe is never enabled.

Any ideas of what I might be doing wrong in my branch on #2738 ?

@dalehamel
Member Author

I've verified that the offset is being passed with this kprobe:

$ bpftrace -e 'kprobe:uprobe_register_refctr { printf("%4x\n", arg2);}'
Attaching 1 probe...
34eb60

@dalehamel
Member Author

dalehamel commented Feb 13, 2020

From further kernel probing, it appears that while update_ref_ctr https://github.com/torvalds/linux/blob/1cc33161a83d20b5462b1e93f95d3ce6388079ee/kernel/events/uprobes.c#L426 is being called

__update_ref_ctr isn't (https://github.com/torvalds/linux/blob/1cc33161a83d20b5462b1e93f95d3ce6388079ee/kernel/events/uprobes.c#L377)

This makes me think the issue is perhaps that this code: https://github.com/torvalds/linux/blob/1cc33161a83d20b5462b1e93f95d3ce6388079ee/kernel/events/uprobes.c#L433-L443 isn't finding the virtual memory address, which again leads me to believe that I might be passing in the offset in a wrong or unexpected format

EDIT: confirmed, I probed valid_ref_ctr_vma, and it returns only false for all memory maps.

bpftrace -e 'kretprobe:valid_ref_ctr_vma* { printf("%d\n",  retval);}'

@dalehamel
Member Author

It definitely seems like this should be the right address for the probe I'm trying to do (method__entry probe for ruby):

objdump -dt /bin/bin/ruby | grep 34eb60
000000000034eb60 l    d  .probes        0000000000000000              .probes
000000000034eb60 l     O .probes        0000000000000002              ruby_method__entry_semaphore
000000000034eb60 l     O .probes        0000000000000000              __TMC_END__
   27d10:       48 8d 3d 49 6e 32 00    lea    0x326e49(%rip),%rdi        # 34eb60 <ruby_method__entry_semaphore>
   27d17:       48 8d 05 42 6e 32 00    lea    0x326e42(%rip),%rax        # 34eb60 <ruby_method__entry_semaphore>
   27d40:       48 8d 3d 19 6e 32 00    lea    0x326e19(%rip),%rdi        # 34eb60 <ruby_method__entry_semaphore>
   27d47:       48 8d 35 12 6e 32 00    lea    0x326e12(%rip),%rsi        # 34eb60 <ruby_method__entry_semaphore>
  1a95a4:       66 83 3d b4 55 1a 00    cmpw   $0x0,0x1a55b4(%rip)        # 34eb60 <ruby_method__entry_semaphore>
  1ac3d1:       66 83 3d 87 27 1a 00    cmpw   $0x0,0x1a2787(%rip)        # 34eb60 <ruby_method__entry_semaphore>

@dalehamel
Member Author

dalehamel commented Feb 14, 2020

I built a kernel with some debug statements, and found that the issue is the offset is in fact wrong. For my ruby example, the result ends up past the last map, which is clearly incorrect. Here are the maps for the process:

55d1f1364000-55d1f1388000 r--p 00000000 08:01 1244227                    /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
55d1f1388000-55d1f15c4000 r-xp 00024000 08:01 1244227                    /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
55d1f15c4000-55d1f16ac000 r--p 00260000 08:01 1244227                    /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
55d1f16ad000-55d1f16b2000 r--p 00348000 08:01 1244227                    /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
55d1f16b2000-55d1f16b3000 rw-p 0034d000 08:01 1244227                    /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
55d1f16b3000-55d1f16c4000 rw-p 00000000 00:00 0

...

However, the global address for the semaphore is definitely at 0x5593A6630B60, which is past the last map with the inode for ruby! In this case, it seems that the semaphore ends up in an anonymous map, directly after that last map for ruby.

I have verified this address: when I attach a USDT probe by the traditional means it becomes "1", and when I detach it goes back to "0". So BCC is definitely getting the right address, but the kernel is ignoring this map because the inode doesn't match what it expects.

I wonder if perhaps this constitutes a bug in the kernel code? I also tried a simpler USDT test program (the bpftrace test program for usdt probe with semaphore), and it cannot match any map for that either.

@liu-song-6 as you authored the original patch, do you have a test program that can be attached to in this way?

@dalehamel
Member Author

dalehamel commented Feb 14, 2020

I believe I have figured it out, it amounts to a bug in the linux kernel, and
a difference in the behavior between Linux and BCC.

Given the following memory map:

5560f0b86000-5560f0baa000 r--p 00000000 08:01 1244227                    /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
5560f0baa000-5560f0de6000 r-xp 00024000 08:01 1244227                    /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
5560f0de6000-5560f0ece000 r--p 00260000 08:01 1244227                    /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
5560f0ecf000-5560f0ed4000 r--p 00348000 08:01 1244227                    /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
5560f0ed4000-5560f0ed5000 rw-p 0034d000 08:01 1244227                    /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
5560f0ed5000-5560f0ee6000 rw-p 00000000 00:00 0
5560f2b26000-5560f2f8d000 rw-p 00000000 00:00 0                          [heap]
7fc6ba217000-7fc6bc321000 rw-p 00000000 00:00 0
7fc6bc321000-7fc6c9437000 r--p 00000000 08:01 1572875                    /usr/lib64/locale/locale-archive
7fc6c9437000-7fc6c943a000 rw-p 00000000 00:00 0

BCC happens to always find the first vm area, because it matches only by inode.

The value that it calculates:

0x5560f0b86000 + 0x34eb60 - 0x00000000 = 0x5560f0ed4b60

If we continue down, to the other vm areas, we get the same result:

0x5560f0baa000 + 0x34eb60 - 0x00024000 = 0x5560f0ed4b60

0x5560f0de6000 + 0x34eb60 - 0x00260000 = 0x5560f0ed4b60

But when we hit 0x5560f0ecf000, we get a different result!

0x5560f0ecf000 + 0x34eb60 - 0x00348000 = 0x5560f0ed5b60

Interestingly, it is off by exactly one page (0x1000, page size 4096).

Taking a closer look, we see it is because there is a gap. The previous
vm area stops at 0x5560f0ece000, but this map starts at 0x5560f0ecf000

When we get to the map that should actually contain the ref_ctr_offset, we have
this same problem and it appears to be in the next vm_area because of this:

0x5560f0ed4000 + 0x34eb60 - 0x0034d000 = 0x5560f0ed5b60

This is because that same gap of one page is offsetting it. This is why no
match can be found, and the ref_ctr_offset is never incremented.
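
The per-map arithmetic above can be reproduced mechanically. The Python sketch below only replays the calculation `start + offset - file_offset` for each ruby mapping and checks whether the result lands inside that same mapping, which is the kind of range check the current `valid_ref_ctr_vma` performs. The helper names are hypothetical, and the sketch does not settle which interpretation of the offset is the right one.

```python
# The five file-backed ruby mappings from the memory map above.
RUBY_MAPS = """\
5560f0b86000-5560f0baa000 r--p 00000000 08:01 1244227 /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
5560f0baa000-5560f0de6000 r-xp 00024000 08:01 1244227 /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
5560f0de6000-5560f0ece000 r--p 00260000 08:01 1244227 /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
5560f0ecf000-5560f0ed4000 r--p 00348000 08:01 1244227 /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
5560f0ed4000-5560f0ed5000 rw-p 0034d000 08:01 1244227 /home/dale.hamel/.rubies/ruby-2.6.5/bin/ruby
"""


def parse_maps(text):
    """Yield (start, end, perms, file_offset) for each /proc/<pid>/maps line."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue
        start, end = (int(part, 16) for part in fields[0].split("-"))
        yield start, end, fields[1], int(fields[2], 16)


def candidate_vaddr(start, file_offset, ref_ctr_offset):
    """Replay the calculation from the comment: start + offset - file_offset."""
    return start + ref_ctr_offset - file_offset


def report(text, ref_ctr_offset):
    """Return (perms, vaddr, vaddr_inside_this_map) for every mapping."""
    rows = []
    for start, end, perms, file_offset in parse_maps(text):
        vaddr = candidate_vaddr(start, file_offset, ref_ctr_offset)
        rows.append((perms, vaddr, start <= vaddr < end))
    return rows


for perms, vaddr, inside in report(RUBY_MAPS, 0x34EB60):
    print(f"{perms} {vaddr:#x} inside_own_map={inside}")
```

With these inputs, the first three mappings give 0x5560f0ed4b60 and the last two give 0x5560f0ed5b60, and none of the results falls inside the mapping it was computed from, which is consistent with no vma being accepted.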

Because this is the first memory area that has the WRITE permission and it is
at the very end, it is the most likely to be affected by a gap and be missed.

There are two possible solutions that I can think of:

  • Do what BCC does, and just take the first vm_area that matches the inode. This
    will have no chance of an offset issue.
  • Walk through the vm_areas and accumulate a "page gap" counter for the vm_areas
    leading up to this one, passing this through when calculating the offset.

I have been able to prove this concept with the first approach, though there might be some benefit to the "page gap" counting approach, as this does additional checks. However, these checks are beyond what BCC does now, and BCC seems to work fairly reliably - are these checks actually necessary?

The following patch allows me to attach to a ruby process:

From a3828da634a37bbea023953ba48b10af55086cdf Mon Sep 17 00:00:00 2001
From: Dale Hamel <dale.hamel@srvthe.net>
Date: Fri, 14 Feb 2020 00:43:39 -0500
Subject: [PATCH] Match ref_ctr_offset by inode only

---
 kernel/events/uprobes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 84fa00497c49..51fdd5e4103f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -344,9 +344,7 @@ static bool valid_ref_ctr_vma(struct uprobe *uprobe,
        return uprobe->ref_ctr_offset &&
                vma->vm_file &&
                file_inode(vma->vm_file) == uprobe->inode &&
-               (vma->vm_flags & (VM_WRITE|VM_SHARED)) == VM_WRITE &&
-               vma->vm_start <= vaddr &&
-               vma->vm_end > vaddr;
+               vma->vm_pgoff == 0;
 }

 static struct vm_area_struct *
--
2.21.0

However, it doesn't work for my USDT test program https://github.com/iovisor/bpftrace/blob/master/tests/testprogs/usdt_semaphore_test.c which has these maps:

00400000-00401000 r-xp 00000000 08:03 531342                             /u/workspace/shopify/bpftrace/build-static/tests/testprogs/usdt_semaphore_test
00600000-00601000 r--p 00000000 08:03 531342                             /u/workspace/shopify/bpftrace/build-static/tests/testprogs/usdt_semaphore_test
00601000-00602000 rw-p 00001000 08:03 531342                             /u/workspace/shopify/bpftrace/build-static/tests/testprogs/usdt_semaphore_test

In this case, the ref_ctr_offset is 0x601040, but the address that the kernel calculates to attach to is again outside of the address space, at 0xA01040. Interestingly, it is off by exactly the start address of the first vm_area: if we subtract 0x00400000, we get the correct 0x601040. In this case, the ref_ctr_offset is actually correct to begin with, as I can verify with:

dd if=/proc/$(pidof usdt_semaphore_test)/mem bs=1 count=1 skip=$(( 0x601040)) |xxd 

This is verified with a logging message:

[ 2201.782079] Using 0000000000A01040 for global ref_ctr_offset, start: 0000000000400000 pgoff: 00000000 pgoff_shift 00000000
[ 2201.782081] ref_ctr increment failed for inode: 0x81b8e offset: 0x7c9 ref_ctr_offset: 0x601040 of mm: 0x00000000c7da88ce

(first debug line is mine, from this code:)

                rc_vaddr = offset_to_vaddr(rc_vma, uprobe->ref_ctr_offset);
                printk(KERN_DEBUG "Using %016llX for global ref_ctr_offset, "
                                  "start: %016llX pgoff: %08X pgoff_shift %08X\n",
                                  rc_vaddr, rc_vma->vm_start, rc_vma->vm_pgoff,
                                  (rc_vma->vm_pgoff << PAGE_SHIFT));
                ret = __update_ref_ctr(mm, rc_vaddr, d);
                if (ret)
                        update_ref_ctr_warn(uprobe, mm, d);

It looks like it doesn't need any offset. Examining the BCC code gives us a clue:

bcc/src/cc/usdt/usdt.cc

Lines 74 to 83 in 0d0d353

bool Probe::resolve_global_address(uint64_t *global, const std::string &bin_path,
                                   const uint64_t addr) {
  if (in_shared_object(bin_path)) {
    return (pid_ &&
            !bcc_resolve_global_addr(*pid_, bin_path.c_str(), addr, mod_match_inode_only_, global));
  }
  *global = addr;
  return true;
}

Sure enough, the test executable is ET_EXEC, but the Ruby executable is ET_DYN.
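
For anyone wanting to check which kind of binary they have, the ELF type can be read straight from the file header. The sketch below is a minimal Python illustration, not what bcc does internally; `elf_type` is a hypothetical helper that reads the 2-byte `e_type` field at offset 16 of the ELF header.

```python
import struct

# Standard ELF e_type values.
ET_NAMES = {1: "ET_REL", 2: "ET_EXEC", 3: "ET_DYN", 4: "ET_CORE"}


def elf_type(path):
    """Return the ELF type name of the file at path (e.g. 'ET_EXEC' or 'ET_DYN')."""
    with open(path, "rb") as f:
        header = f.read(18)
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError(f"{path} is not an ELF file")
    # e_ident[EI_DATA] is byte 5: 1 means little-endian, 2 means big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_type follows the 16-byte e_ident array in both ELF32 and ELF64.
    (e_type,) = struct.unpack_from(endian + "H", header, 16)
    return ET_NAMES.get(e_type, hex(e_type))
```

Usage would be something like `elf_type("/path/to/binary")`, where the path is whatever executable is being probed. `readelf -h` reports the same field as "Type".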

In any case, both of these test programs seem to fail to use this kernel API for different reasons. The ruby program because there is a gap of 1 page that makes the calculations incorrect, and the usdt_semaphore_test because it shouldn't have offsets at all as it is not ET_DYN.

The above patch is a simple way to mirror BCC's behavior for the ET_DYN cases, but for ET_EXEC we will need to add handling to not calculate a global offset, taking the ref_ctr_offset passed and using it directly.

@liu-song-6 @yonghong-song does this make sense to you? What fix do you think is best here?

Thanks! Sorry for the wall of text here.

@dalehamel
Member Author

For the ET_EXEC issue, here is a quick hack that seems to do the trick:

From 8270d53ab8bf52a79a7b0821ca3a715b0772e505 Mon Sep 17 00:00:00 2001
From: Dale Hamel <dale.hamel@srvthe.net>
Date: Fri, 14 Feb 2020 09:41:16 -0500
Subject: [PATCH] Handle ET_EXEC

---
 kernel/events/uprobes.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4d70f8fe3a43..6a2a2b1f34c6 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -51,6 +51,9 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
 /* Have a copy of original instruction */
 #define UPROBE_COPY_INSN       0

+// To check if a program is ET_EXEC, NOTE this is arch dependant
+#define ET_EXEC_MASK 0x00000000FFFFFFFF
+
 struct uprobe {
        struct rb_node          rb_node;        /* node in the rb tree */
        refcount_t              ref;
@@ -420,6 +423,9 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,

        if (rc_vma) {
                rc_vaddr = offset_to_vaddr(rc_vma, uprobe->ref_ctr_offset);
+               if((rc_vma->vm_start & ET_EXEC_MASK) == rc_vma->vm_start)
+                       rc_vaddr -= rc_vma->vm_start;
+
                printk(KERN_DEBUG "Using %016llX for global ref_ctr_offset, "
                                  "start: %016llX pgoff: %08X pgoff_shift %08X\n",
                                  rc_vaddr, rc_vma->vm_start, rc_vma->vm_pgoff,
--
2.21.0

It's not architecture independent, and it looks like there is a function that could be used to determine if the uprobe->file is ET_DYN or ET_EXEC, but it is static (elf_fdpic_fetch_phdrs).

With these two patches I am able to probe both files. Curiously, it looks like the ref counter is actually added to twice for one probe attachment - I'll need to look into that as well.

@liu-song-6
Contributor

Is this because of how the USDT is defined? I found the old test program we were using, which doesn't have "__attribute__ ((visibility ("hidden")))". I haven't verified it with the latest code.

#define _SDT_HAS_SEMAPHORES 1
__extension__ unsigned short test_user_semaphore __attribute__ ((unused)) __attribute__ ((section (".probes")));
#define TEST test_user_semaphore

#include <stdio.h>
#include <unistd.h>
#include <sys/sdt.h>

int for_uprobe(int c)
{
    if (TEST)
        STAP_PROBE(test, user);
    printf("%d\n", c + 10);
    return c + 1;
}

int main(int argc, char *argv[])
{
    for_uprobe(argc);
    while (1) {
        sleep(1);
        printf("semaphore %d\n", test_user_semaphore);
    }
}

@dalehamel
Member Author

Hi @liu-song-6 thank you for your reply. I have run your test program and can confirm it suffers from the same bug, though in this case it seems to be that the file offset is the culprit?

With my patch above, the kernel is able to correctly determine the offset when it is based on the address of the first vm area, and the program works. Without my patch, on a vanilla 5.3 kernel, I have the same problem where no vma is determined to be valid.

Here is what the memory map looks like:

55ecaac26000-55ecaac27000 r--p 00000000 08:03 137791                     /u/workspace/shopify/bcc/a.out
55ecaac27000-55ecaac28000 r-xp 00001000 08:03 137791                     /u/workspace/shopify/bcc/a.out
55ecaac28000-55ecaac29000 r--p 00002000 08:03 137791                     /u/workspace/shopify/bcc/a.out
55ecaac29000-55ecaac2a000 r--p 00002000 08:03 137791                     /u/workspace/shopify/bcc/a.out
55ecaac2a000-55ecaac2b000 rw-p 00003000 08:03 137791                     /u/workspace/shopify/bcc/a.out

BCC uses the address 0x000055ECAAC2A040 which works. I can get this address with my kernel patch above, which matches on the first vm area, and gets this via the following math:

0x55ecaac26000 + 0x4040 - 0x00000000 = 0x55ecaac2a040

This is also true for:

0x55ecaac27000 + 0x4040 - 0x00001000 = 0x55ecaac2a040
0x55ecaac28000 + 0x4040 - 0x00002000 = 0x55ecaac2a040

However when we get to the next vm area:

0x55ecaac29000 + 0x4040 - 0x00002000 = 0x55ecaac2b040

It is off by a page. The error carries forward to the final memory area (the one the current code matches on because it has the WRITE flag set):

0x55ecaac2a000 + 0x4040 - 0x00003000 = 0x55ecaac2b040

Here is the exact C code used to generate a.out:

#define _SDT_HAS_SEMAPHORES 1
__extension__ unsigned short test_user_semaphore __attribute__ ((unused)) __attribute__ ((section (".probes")));
#define TEST test_user_semaphore

#include <stdio.h>
#include <unistd.h>
#include <sys/sdt.h>

int for_uprobe(int c)
{
    if (TEST)
        STAP_PROBE(test, user);
    printf("%d\n", c + 10);
    return c + 1;
}

int main(int argc, char *argv[])
{
    for_uprobe(argc);
    while (1) {
        sleep(1);
        printf("semaphore %d\n", test_user_semaphore);
    }
}

These test programs are being built with GCC 8.3, though I'm not sure that it matters.

None of these three test cases (ruby - ET_DYN, usdt_semaphore_test - ET_EXEC, and your test program - ET_DYN) work with this API, and using BCC as a reference implementation it is fairly easy to see why.

For the ET_DYN executables, BCC always uses the first vm area and this seems to never suffer from the off-by-one page error.

For the ET_EXEC executables, BCC doesn't bother to calculate any offset, and takes the ref_ctr_offset from the stap notes directly, which the kernel code doesn't do.

@liu-song-6
Contributor

I am testing the same program with uprobe_events interface:

# readelf -s a.out  | grep -e test_user -e for_uprobe
    50: 000000000040055d    55 FUNC    GLOBAL DEFAULT   14 for_uprobe
    68: 0000000000601034     2 OBJECT  GLOBAL DEFAULT   27 test_user_semaphore
# ./a.out

(in a different window)

# cd /sys/kernel/debug/tracing
# echo 'p /root/uprobe/a.out:0x55d(0x1034) ' > uprobe_events
# echo 1 > events/uprobes/p_a_0x55d/enable

And it works.

I think the key is that the offsets we use for uprobe_events are 0x55d and 0x1034 instead of 0x40055d and 0x601034.

I haven't had a chance to try it with bcc and/or to use perf_event_open() to create the uprobe.

@dalehamel
Member Author

Interesting, how are you able to make this translation? Why are you able to just discard 400000 and 600000 respectively? How could I generalize this?

Interestingly, when I build with GCC (8.3) I get:

    47: 0000000000001155    57 FUNC    GLOBAL DEFAULT   13 for_uprobe
    68: 0000000000004040     2 OBJECT  GLOBAL DEFAULT   25 test_user_semaphore

But with clang (8) I get results more similar to yours

    45: 0000000000401140    79 FUNC    GLOBAL DEFAULT   12 for_uprobe
    64: 0000000000404040     2 OBJECT  GLOBAL DEFAULT   24 test_user_semaphore

When I read the ELF notes it still gives me the full address. GCC produces an ET_DYN executable, and clang produces an ET_EXEC executable, which is also interesting.

What is really interesting is that for the executable built with clang, I am able to probe it without any modifications to the kernel!

So it would seem that the bug maybe is only specific to GCC-built ELF executables? Can you try with GCC @liu-song-6 ? Also can you confirm that you are building with Clang in your test? I will try building ruby and my other USDT test program with Clang to see if they also work.

I would still consider this a bug if the code is compiler-dependent though.

> I haven't got chance to try it with bcc and/or use perf_event_open() to create uprobe.

I am achieving this with my branch #2738 and a patch to bpftrace to use this in src/attached_probe.cpp.

@liu-song-6
Contributor

readelf -S can be used to make the address to offset translation:

# readelf -S a.out | grep -e .text -e .probe -A 1 -e Name
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
--
  [14] .text             PROGBITS         0000000000400470  00000470
       00000000000001e2  0000000000000000  AX       0     0     16
--
  [27] .probes           PROGBITS         0000000000601034  00001034
       0000000000000002  0000000000000000  WA       0     0     2
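
The translation implied by this output is "file offset = address - section address + section offset". Here is a minimal Python sketch of it, using only the two sections and the two symbol addresses shown in this thread; the function names and the `SECTIONS` table layout are hypothetical.

```python
# (address, file offset, size) per section, from the readelf -S output above.
SECTIONS = {
    ".text": (0x400470, 0x470, 0x1E2),
    ".probes": (0x601034, 0x1034, 0x2),
}


def vaddr_to_file_offset(vaddr, sections):
    """Translate a virtual address to a file offset via its containing section."""
    for address, offset, size in sections.values():
        if address <= vaddr < address + size:
            return offset + (vaddr - address)
    raise ValueError(f"{vaddr:#x} is not inside any known section")


def uprobe_events_line(path, probe_vaddr, semaphore_vaddr, sections):
    """Build a 'p PATH:OFFSET(REF_CTR_OFFSET)' line in the form used above."""
    probe_offset = vaddr_to_file_offset(probe_vaddr, sections)
    ref_ctr_offset = vaddr_to_file_offset(semaphore_vaddr, sections)
    return f"p {path}:{probe_offset:#x}({ref_ctr_offset:#x})"


# for_uprobe is at 0x40055d and test_user_semaphore at 0x601034 (readelf -s above).
print(uprobe_events_line("/root/uprobe/a.out", 0x40055D, 0x601034, SECTIONS))
```

This prints `p /root/uprobe/a.out:0x55d(0x1034)`, the same offsets used in the uprobe_events test earlier in the thread.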

I am using a very old gcc (4.8.5). I didn't mean to use it, it is just the default one on the test machine.

Could you please check the readelf -S output for gcc 8.3 and clang?

@dalehamel
Member Author

> What is really interesting is that for the executable built with clang, I am able to probe it without any modifications to the kernel!

My mistake, sorry 🤦‍♂️. I was using a different build of bpftrace that was using BCC to seek /proc/[PID]/mem rather than the kernel API. I tested again, this time definitely using the kernel API, and neither the GCC nor the Clang executable works after all. I got ahead of myself.

> Could you please check the readelf -S output for gcc 8.3 and clang?

Sure:

Clang 8 (ET_EXEC type), invoked as just clang test.c:

$ readelf -S a.out
There are 31 section headers, starting at offset 0x3a48:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .interp           PROGBITS         00000000004002a8  000002a8
       000000000000001c  0000000000000000   A       0     0     1
  [ 2] .note.ABI-tag     NOTE             00000000004002c4  000002c4
       0000000000000020  0000000000000000   A       0     0     4
  [ 3] .gnu.hash         GNU_HASH         00000000004002e8  000002e8
       000000000000001c  0000000000000000   A       4     0     8
  [ 4] .dynsym           DYNSYM           0000000000400308  00000308
       0000000000000090  0000000000000018   A       5     1     8
  [ 5] .dynstr           STRTAB           0000000000400398  00000398
       0000000000000060  0000000000000000   A       0     0     1
  [ 6] .gnu.version      VERSYM           00000000004003f8  000003f8
       000000000000000c  0000000000000002   A       4     0     2
  [ 7] .gnu.version_r    VERNEED          0000000000400408  00000408
       0000000000000030  0000000000000000   A       5     1     8
  [ 8] .rela.dyn         RELA             0000000000400438  00000438
       0000000000000030  0000000000000018   A       4     0     8
  [ 9] .rela.plt         RELA             0000000000400468  00000468
       0000000000000048  0000000000000018  AI       4    22     8
  [10] .init             PROGBITS         0000000000401000  00001000
       0000000000000017  0000000000000000  AX       0     0     4
  [11] .plt              PROGBITS         0000000000401020  00001020
       0000000000000040  0000000000000010  AX       0     0     16
  [12] .text             PROGBITS         0000000000401060  00001060
       000000000000023e  0000000000000000  AX       0     0     16
  [13] .fini             PROGBITS         00000000004012a0  000012a0
       0000000000000009  0000000000000000  AX       0     0     4
  [14] .rodata           PROGBITS         0000000000402000  00002000
       0000000000000012  0000000000000000   A       0     0     4
  [15] .stapsdt.base     PROGBITS         0000000000402012  00002012
       0000000000000001  0000000000000000   A       0     0     1
  [16] .eh_frame_hdr     PROGBITS         0000000000402014  00002014
       000000000000003c  0000000000000000   A       0     0     4
  [17] .eh_frame         PROGBITS         0000000000402050  00002050
       000000000000011c  0000000000000000   A       0     0     8
  [18] .init_array       INIT_ARRAY       0000000000403e10  00002e10
       0000000000000008  0000000000000008  WA       0     0     8
  [19] .fini_array       FINI_ARRAY       0000000000403e18  00002e18
       0000000000000008  0000000000000008  WA       0     0     8
  [20] .dynamic          DYNAMIC          0000000000403e20  00002e20
       00000000000001d0  0000000000000010  WA       5     0     8
  [21] .got              PROGBITS         0000000000403ff0  00002ff0
       0000000000000010  0000000000000008  WA       0     0     8
  [22] .got.plt          PROGBITS         0000000000404000  00003000
       0000000000000030  0000000000000008  WA       0     0     8
  [23] .data             PROGBITS         0000000000404030  00003030
       0000000000000010  0000000000000000  WA       0     0     8
  [24] .probes           PROGBITS         0000000000404040  00003040
       0000000000000002  0000000000000000  WA       0     0     2
  [25] .bss              NOBITS           0000000000404042  00003042
       0000000000000006  0000000000000000  WA       0     0     1
  [26] .comment          PROGBITS         0000000000000000  00003042
       000000000000004f  0000000000000001  MS       0     0     1
  [27] .note.stapsdt     NOTE             0000000000000000  00003094
       0000000000000038  0000000000000000           0     0     4
  [28] .symtab           SYMTAB           0000000000000000  000030d0
       0000000000000648  0000000000000018          29    45     8
  [29] .strtab           STRTAB           0000000000000000  00003718
       0000000000000217  0000000000000000           0     0     1
  [30] .shstrtab         STRTAB           0000000000000000  0000392f
       0000000000000114  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

GCC 8.3 (invoked as just gcc test.c):

$ readelf -S a.out
There are 32 section headers, starting at offset 0x3ad0:

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .interp           PROGBITS         00000000000002a8  000002a8
       000000000000001c  0000000000000000   A       0     0     1
  [ 2] .note.ABI-tag     NOTE             00000000000002c4  000002c4
       0000000000000020  0000000000000000   A       0     0     4
  [ 3] .gnu.hash         GNU_HASH         00000000000002e8  000002e8
       0000000000000024  0000000000000000   A       4     0     8
  [ 4] .dynsym           DYNSYM           0000000000000310  00000310
       00000000000000d8  0000000000000018   A       5     1     8
  [ 5] .dynstr           STRTAB           00000000000003e8  000003e8
       00000000000000a5  0000000000000000   A       0     0     1
  [ 6] .gnu.version      VERSYM           000000000000048e  0000048e
       0000000000000012  0000000000000002   A       4     0     2
  [ 7] .gnu.version_r    VERNEED          00000000000004a0  000004a0
       0000000000000030  0000000000000000   A       5     1     8
  [ 8] .rela.dyn         RELA             00000000000004d0  000004d0
       00000000000000c0  0000000000000018   A       4     0     8
  [ 9] .rela.plt         RELA             0000000000000590  00000590
       0000000000000048  0000000000000018  AI       4    23     8
  [10] .init             PROGBITS         0000000000001000  00001000
       0000000000000017  0000000000000000  AX       0     0     4
  [11] .plt              PROGBITS         0000000000001020  00001020
       0000000000000040  0000000000000010  AX       0     0     16
  [12] .plt.got          PROGBITS         0000000000001060  00001060
       0000000000000008  0000000000000008  AX       0     0     8
  [13] .text             PROGBITS         0000000000001070  00001070
       000000000000021e  0000000000000000  AX       0     0     16
  [14] .fini             PROGBITS         0000000000001290  00001290
       0000000000000009  0000000000000000  AX       0     0     4
  [15] .rodata           PROGBITS         0000000000002000  00002000
       0000000000000016  0000000000000000   A       0     0     4
  [16] .stapsdt.base     PROGBITS         0000000000002016  00002016
       0000000000000001  0000000000000000   A       0     0     1
  [17] .eh_frame_hdr     PROGBITS         0000000000002018  00002018
       0000000000000044  0000000000000000   A       0     0     4
  [18] .eh_frame         PROGBITS         0000000000002060  00002060
       0000000000000134  0000000000000000   A       0     0     8
  [19] .init_array       INIT_ARRAY       0000000000003de8  00002de8
       0000000000000008  0000000000000008  WA       0     0     8
  [20] .fini_array       FINI_ARRAY       0000000000003df0  00002df0
       0000000000000008  0000000000000008  WA       0     0     8
  [21] .dynamic          DYNAMIC          0000000000003df8  00002df8
       00000000000001e0  0000000000000010  WA       5     0     8
  [22] .got              PROGBITS         0000000000003fd8  00002fd8
       0000000000000028  0000000000000008  WA       0     0     8
  [23] .got.plt          PROGBITS         0000000000004000  00003000
       0000000000000030  0000000000000008  WA       0     0     8
  [24] .data             PROGBITS         0000000000004030  00003030
       0000000000000010  0000000000000000  WA       0     0     8
  [25] .probes           PROGBITS         0000000000004040  00003040
       0000000000000002  0000000000000000  WA       0     0     2
  [26] .bss              NOBITS           0000000000004042  00003042
       0000000000000006  0000000000000000  WA       0     0     1
  [27] .comment          PROGBITS         0000000000000000  00003042
       0000000000000022  0000000000000001  MS       0     0     1
  [28] .note.stapsdt     NOTE             0000000000000000  00003064
       0000000000000038  0000000000000000           0     0     4
  [29] .symtab           SYMTAB           0000000000000000  000030a0
       00000000000006a8  0000000000000018          30    47     8
  [30] .strtab           STRTAB           0000000000000000  00003748
       0000000000000269  0000000000000000           0     0     1
  [31] .shstrtab         STRTAB           0000000000000000  000039b1
       0000000000000118  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

@liu-song-6
Contributor

So for these two binaries, the file offset of the semaphore is 0x3040. I think the current kernel should handle that well. I am not sure about bcc and bpftrace.

@dalehamel
Member Author

So for these two binary, the offset of the semaphore is 0x3040

Those are not the addresses that I see in the systemtap notes:

GCC version (ET_DYN, going by the zero-based section addresses in the readelf -S output above) it is 0x0000000000004040, which is what I also read with bcc, which uses libelf:

$ readelf -n a.out
Displaying notes found in: .note.stapsdt
  Owner                 Data size       Description
  stapsdt              0x00000023       NT_STAPSDT (SystemTap probe descriptors)
    Provider: test
    Name: user
    Location: 0x000000000000116c, Base: 0x0000000000002016, Semaphore: 0x0000000000004040
    Arguments:

Clang version (ET_EXEC) it is:

Displaying notes found in: .note.ABI-tag
  Owner                 Data size       Description
  GNU                  0x00000010       NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 3.2.0

Displaying notes found in: .note.stapsdt
  Owner                 Data size       Description
  stapsdt              0x00000023       NT_STAPSDT (SystemTap probe descriptors)
    Provider: test
    Name: user
    Location: 0x000000000040115f, Base: 0x0000000000402012, Semaphore: 0x0000000000404040
    Arguments:

The field you show is from the Offset column, but these tools all use the Address column. What is curious is that these tools work when using the address. I don't think this can be a coincidence, because the tools checking this address are able to read it.

What really strikes me is that in the examples I've looked at so far, the two values are exactly one page apart (0x4040 versus 0x3040 for the GCC build), which is a good sign that you are correct and that there may be a bug in BCC, as the address is what it currently reads.

I'll see about patching which field BCC reads, to see whether it can use this offset instead of the address, and whether that fixes the problem.

@dalehamel
Member Author

It looks like the offset to use is the "file offset", as this is what the -S flag displays:
https://github.com/cuviper/elfutils/blob/master/src/readelf.c#L1266

And from the struct definition we have this: https://github.com/cuviper/elfutils/blob/08ed26703d658b7ae57ab60b865d05c1cde777e3/libelf/elf.h#L403-L404

  Elf64_Addr	sh_addr;		/* Section virtual addr at execution */
  Elf64_Off	sh_offset;		/* Section file offset */

(note this appears to be an older version of libelf, but it shows the point)

BCC, however, appears to parse the note description and get the virtual address. The offset that the kernel API needs appears to be the section file offset.

BCC appears to get the address from https://github.com/iovisor/bcc/blob/master/src/cc/bcc_elf.c#L61-L72, which is reached from the gelf_getnote call at https://github.com/iovisor/bcc/blob/master/src/cc/bcc_elf.c#L96 and parses the address out of the description. It appears that the sh_offset field is not exposed through the ELF note.

I'll work on an alternative approach to retrieve this value based on what readelf.c appears to do as a prototype. Hopefully this will be enough to get the tool to work, as it seems like it would be the most reliable way.

@liu-song-6 thank you for debugging this with me! I had no idea that this other offset existed, as none of the tools I have been working with show it, and it seems like the key to making this work.

@dalehamel
Member Author

@liu-song-6 yup that did it! I created a new function which reads this field from the elf file:

bcc/src/cc/bcc_elf.c

Lines 687 to 721 in 1edd94a

uint64_t bcc_elf_usdt_probe_section_offset(const char *path) {
  Elf *e = NULL;
  Elf_Scn *section = NULL;
  int fd;
  size_t stridx;
  uint64_t probe_section_offset = 0;
  if (openelf(path, &e, &fd) < 0)
    goto exit;
  if (elf_getshdrstrndx(e, &stridx) != 0)
    goto exit;
  // Find the probe section offset
  while ((section = elf_nextscn(e, section)) != 0) {
    GElf_Shdr header;
    char *name;
    if (!gelf_getshdr(section, &header))
      continue;
    name = elf_strptr(e, stridx, header.sh_name);
    if (name && !strcmp(name, ".probes")) {
      probe_section_offset = header.sh_addr - header.sh_offset;
    }
  }
exit:
  if (e)
    elf_end(e);
  if (fd >= 0)
    close(fd);
  return probe_section_offset;
}

I then detect whether support for this call exists and, if so, subtract this value from the ref_ctr_offset to get the file offset, and it works like a charm.

If you'd like to try it yourself, you can use this binary (on a system with glib 2.27+):

bpftrace.gz

And your test program, using this invocation:

bpftrace -e 'usdt::test:user {printf("hi\n");exit();}' -p $(pidof a.out)

@yonghong-song
Collaborator

I am testing the same program with uprobe_events interface:

# readelf -s a.out  | grep -e test_user -e for_uprobe
    50: 000000000040055d    55 FUNC    GLOBAL DEFAULT   14 for_uprobe
    68: 0000000000601034     2 OBJECT  GLOBAL DEFAULT   27 test_user_semaphore
# ./a.out

(in a different window)

# cd /sys/kernel/debug/tracing
# echo 'p /root/uprobe/a.out:0x55d(0x1034) ' > uprobe_events
# echo 1 > events/uprobes/p_a_0x55d/enable

And it works.

I think the key is, the offset we use for uprobe_events are 0x55d and 0x1034 instead of 0x40055d and 0x601034.

I haven't got chance to try it with bcc and/or use perf_event_open() to create uprobe.

@liu-song-6 could you explain what /root/uprobe/a.out:0x55d(0x1034) does?

echo 'p /root/uprobe/a.out:0x55d(0x1034) ' > uprobe_events

I typically just use /root/uprobe/a.out:<offset>.

@liu-song-6
Contributor

@yonghong-song 0x55d is the offset of the function to probe. 0x1034 is the offset of the USDT semaphore. The kernel will increment the semaphore when the uprobe is enabled.
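
For anyone scripting this, here is a minimal sketch in C of building the probe definition string in the form used above, p <path>:<probe offset>(<semaphore offset>). The helper name format_uprobe_def is hypothetical, and the sketch only formats the string. Writing it to uprobe_events (which needs root and a mounted tracefs) is left out.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Build a uprobe_events definition of the form shown in this thread:
 *   p <path>:0x<probe offset>(0x<semaphore offset>)
 * Both numbers are file offsets, not virtual addresses.
 * Returns the length written (excluding the NUL terminator),
 * or -1 if the buffer is too small.
 * format_uprobe_def is a hypothetical helper, not part of bcc or the kernel.
 */
int format_uprobe_def(char *buf, size_t len, const char *path,
                      uint64_t probe_off, uint64_t sem_off) {
  int n = snprintf(buf, len, "p %s:0x%" PRIx64 "(0x%" PRIx64 ")",
                   path, probe_off, sem_off);
  if (n < 0 || (size_t)n >= len)
    return -1;
  return n;
}
```

Called with "/root/uprobe/a.out", 0x55d and 0x1034, this fills the buffer with the same definition that was echoed into uprobe_events above.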

@yonghong-song
Collaborator

Thanks. I guess uprobetracer.rst probably needs an update to include this.

@yonghong-song
Collaborator

So for these two binary, the offset of the semaphore is 0x3040

Those are not the addresses that I see in the systemtap notes:

GCC version (ET_EXEC) it is 0x0000000000004040, which what I also read with bcc which uses libelf

$ readelf -n a.out
Displaying notes found in: .note.stapsdt
  Owner                 Data size       Description
  stapsdt              0x00000023       NT_STAPSDT (SystemTap probe descriptors)
    Provider: test
    Name: user
    Location: 0x000000000000116c, Base: 0x0000000000002016, Semaphore: 0x0000000000004040
    Arguments:

Clang version (ET_DYN) it is:

Displaying notes found in: .note.ABI-tag
  Owner                 Data size       Description
  GNU                  0x00000010       NT_GNU_ABI_TAG (ABI version tag)
    OS: Linux, ABI: 3.2.0

Displaying notes found in: .note.stapsdt
  Owner                 Data size       Description
  stapsdt              0x00000023       NT_STAPSDT (SystemTap probe descriptors)
    Provider: test
    Name: user
    Location: 0x000000000040115f, Base: 0x0000000000402012, Semaphore: 0x0000000000404040
    Arguments:

The field you show is what is shown for the offset tab, but these tools all use the address tab. What is curious about this is that these tools work when using the address tab. I don't think this can be a coincidence, because the tools checking this address are able to read it.

What really strikes me is that in the examples I've looked at so far, these are exactly one page off, which is a good sign that you are correct and there may be a bug in BCC then, as this is the address it is already reading.

I'll see about patching what field BCC reads, to see if it can use this offset instead of address, and if that fixes the problem.

@dalehamel I could not fully understand the above example. For an EXEC binary, we should already get the semaphore address, so we should be fine? It is with a shared library that we have an issue, as a shared library binary file does not really contain an address but rather an offset, and we need to go through the VM mapping to calculate the proper address? As you mentioned, we may have issues because we only go through the first mapping. We may need to find a writable section to calculate the actual address.

@dalehamel
Member Author

@yonghong-song no need to worry about the above, this problem is solved. Based on the wonderful example that @liu-song-6 gave me, I was able to write this function to get the correct adjustment value:

bcc/src/cc/bcc_elf.c

Lines 687 to 721 in 1edd94a

uint64_t bcc_elf_usdt_probe_section_offset(const char *path) {
  Elf *e = NULL;
  Elf_Scn *section = NULL;
  int fd;
  size_t stridx;
  uint64_t probe_section_offset = 0;
  if (openelf(path, &e, &fd) < 0)
    goto exit;
  if (elf_getshdrstrndx(e, &stridx) != 0)
    goto exit;
  // Find the probe section offset
  while ((section = elf_nextscn(e, section)) != 0) {
    GElf_Shdr header;
    char *name;
    if (!gelf_getshdr(section, &header))
      continue;
    name = elf_strptr(e, stridx, header.sh_name);
    if (name && !strcmp(name, ".probes")) {
      probe_section_offset = header.sh_addr - header.sh_offset;
    }
  }
exit:
  if (e)
    elf_end(e);
  if (fd >= 0)
    close(fd);
  return probe_section_offset;
}

Basically, the value read from the address field is relative to the start of the memory image. This is why it works in BCC: we just so happen to always match the very start of memory, as we hit the first map that has a matching inode. Since the value is absolute within that image, in calculating the global address we just add the semaphore address to the start of that mapping, and it works.

For the kernel API, we must pass a different value: the semaphore's offset within the file. To achieve this, the new function parses the ELF section headers and returns the difference between the address and the file offset of the .probes section. Subtracting that difference from the semaphore address gives the semaphore's file offset. This is the value that @liu-song-6 read from readelf -S above.
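
To make the arithmetic concrete, here is a small sketch in C of the translation described above. The function names section_bias and semaphore_file_offset are made up for illustration and are not part of bcc. The inputs are the Address and Offset of .probes from readelf -S and the Semaphore value from readelf -n.

```c
#include <stdint.h>

/*
 * Difference between the virtual address of the .probes section and its
 * position in the file (sh_addr - sh_offset). This mirrors what
 * bcc_elf_usdt_probe_section_offset returns in the snippet above.
 * Hypothetical helper name.
 */
uint64_t section_bias(uint64_t sh_addr, uint64_t sh_offset) {
  return sh_addr - sh_offset;
}

/*
 * Translate the semaphore address from the stapsdt note into the file
 * offset that the kernel interface expects. Hypothetical helper name.
 */
uint64_t semaphore_file_offset(uint64_t sem_addr, uint64_t sh_addr,
                               uint64_t sh_offset) {
  return sem_addr - section_bias(sh_addr, sh_offset);
}
```

With the numbers posted earlier in this thread: the GCC 8.3 build gives 0x4040 - (0x4040 - 0x3040) = 0x3040, the Clang build gives 0x404040 - (0x404040 - 0x3040) = 0x3040, and the gcc 4.8.5 build gives 0x601034 - (0x601034 - 0x1034) = 0x1034.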

This comment #2230 (comment)
and this comment #2230 (comment) were the key to figuring this out! Thanks again @liu-song-6

@yonghong-song I am finishing up the work on this API in #2738. I have it added for the C and python API, but haven't fully tested the C++ and Lua APIs. I plan to work on this on Tuesday, hope to submit the finished patch soon.

Regards, and thanks again for the help and instruction on this @liu-song-6 @yonghong-song

@yonghong-song
Collaborator

@dalehamel Thanks for the explanation. Yes, we do not have an API to attach a uprobe and at the same time increase the in-kernel reference count for the uprobe semaphore. A new API should be fine.
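
As a rough sketch of what such an API would have to pass down when it goes through perf_event_open(): to my understanding, kernels that support the semaphore reference counter take its file offset in the upper 32 bits of perf_event_attr.config for the uprobe PMU, with bit 0 selecting a return probe. Treat that layout as an assumption and check /sys/bus/event_source/devices/uprobe/format/ on the target kernel. The macro and function names below are made up for illustration.

```c
#include <stdint.h>

/*
 * Assumed uprobe PMU config layout (verify against the format files in
 * /sys/bus/event_source/devices/uprobe/format/ on the target kernel):
 *   retprobe       -> config bit 0
 *   ref_ctr_offset -> config bits 32-63
 * The names below are hypothetical, not taken from bcc or the kernel.
 */
#define UPROBE_RETPROBE_BIT 0
#define UPROBE_REF_CTR_OFFSET_SHIFT 32

/* Pack the semaphore file offset and the retprobe flag into attr.config. */
uint64_t uprobe_perf_config(uint32_t ref_ctr_offset, int is_retprobe) {
  uint64_t config = (uint64_t)ref_ctr_offset << UPROBE_REF_CTR_OFFSET_SHIFT;
  if (is_retprobe)
    config |= 1ULL << UPROBE_RETPROBE_BIT;
  return config;
}
```

For the semaphore file offset 0x3040 from the builds above and a regular (non-return) probe, this yields 0x304000000000. Note that the field is only 32 bits wide under this assumption, so the value passed must be the file offset, not the virtual address.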

@dalehamel
Member Author

This is fixed by #3135
