New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
usdt probes requiring semaphore cannot be used on google container OS #2230
Comments
It looks like this error isn't being propagated, even though it is pretty clearly not succeeding. We should probably at least determine why that is, and ensure that it doesn't fail silently (as is now the case). |
@dalehamel I agree with you that the error should be propagated back to user as failing to update semaphore essentially prevents usdt from working. Do you want to work with a pull request to fix the issue? BTW, Song Liu implemented a mechanism for kernel to update the semaphore so all these user space semaphore writing logic won't be needed. It is available in 4.20.
I think this mechanism should work in your environment as it won't go through |
@yonghong-song thanks for pointing me to that commit, that is very encouraging. I'll see if I can prove the concept by doing a custom build with this patch, and if I can I might try my hand at adding this mechanism (gated by kernel version) into BCC. If i'm in that code already, I may as well also look into improving the error propagation. |
That sounds great. Thanks! |
@dalehamel Thank you for providing a good use case. We had the Kernel side patch landed a bit ago but we never get the chance to support it in BCC. I will have some time in the next few weeks to add the support. Meanwhile I also agree it would be good to communicate out the |
@palmtenor thanks for weighing in! I read through the patch, and it's not obvious to me what mechanism is required to trigger the behavior of incrementing the semaphore from reading the diff as it seems to just be storing a mask, could you briefly explain what behavior is needed in bcc to trigger this if you have the time? I'm not able to see the full picture right now. |
USDT uses uprobe in the underlying logic, where we ask Kernel to attach to a (binary, address) pair, and Kernel actually does the attaching on (inode, address) pair. The patch (that one and a few that one was based on) adds the ability to provide an additional semaphore address to uprobe attaching that Kernel will increment / decrement that value upon the uprobe's lifecycle. We have working prototype using that feature but for BCC some small refactor would be needed across uprobe API as well as the USDT library to accommodate for this, as well as make it backward compatible. I'm figuring it out:) |
Thank you for explaining @palmtenor I understand the approach now and have a rough idea of the scope of the work. This is definitely an exciting way forward, as I thought we had basically hit a brick wall in our environment under the current implementation. When you have the time to create the BCC patch i would love to review it, mostly to improve my understanding. |
@palmtenor have you had any chance to look at this? I just ran into this issue again recently, and it looks like the code is the same as it was when this issue was written: Lines 97 to 120 in 5ce16e4
|
I did some digging, it looks like this function would need to be modified: Lines 980 to 986 in 992e482
As this is what is ultimately (i think) creating the uprobe from the call to I suppose there would also need to be some conditional logic in the USDT code, for it rely on the kernel to set the semaphore values if the feature is supported. In my production environment, I am being held-up in rolling out USDT tools without some workaround for this functionality. If you have any draft PR @palmtenor i can try picking it up, otherwise I can poke around and see if I can make progress on this issue when I get a chance to. |
@dalehamel The right place is at function
similar to other FD based uprobe/kprobe checking. Basically, the offset for the semaphore will occupies the top 32bit of |
Thanks @yonghong-song, my development host is kernel 5.3 and appears to have this field:
I will look into devising a-proof-of concept patch to add support for this method of semaphore incrementation to bcc. Our production kernel is still 4.14, but once we are able to upgrade to 4.20 or later, I'd like to be already have the functionality landed in bcc :) |
I started working on a patch today, i think i have this figured out, just need to test it out. |
@yonghong-song @palmtenor I have been having a hard time getting my branch to work. It would seem that passing the semaphore address as-is doesn't do the trick. In userspace, we resolve the 32 bit offset to a 64 bit global address, where we then seek in memory to increment the semaphore. I believe that I am correctly shifting the bits:
Where 34eb60 matches what I see for the semaphore address from readelf --notes. Looking at the kernel source, it appears to use So I would expect that I am passing it in the correct format. However, the semaphore is never incremented, and the probe is never enabled. Any ideas of what I might be doing wrong in my branch on #2738 ? |
I've verified that the offset is being passed with this kprobe:
|
From further kernel probing, it appears that while
This makes me think the issue is perhaps that this code: https://github.com/torvalds/linux/blob/1cc33161a83d20b5462b1e93f95d3ce6388079ee/kernel/events/uprobes.c#L433-L443 isn't finding the virtual memory address, which again leads me to believe that I might be passing in the offset in a wrong or unexpected format EDIT: confirmed, I probed
|
It definitely seems like this should be the right address for the probe I'm trying to do (method__entry probe for ruby):
|
I built a kernel with some debug statements, and found that the issue is the offset is in fact wrong. For my ruby example, the result ends up past the last map, which is clearly incorrect. Here are the maps for the process:
However, the global address for the semaphore is definitely at I have verified this address, when I attach a USDT probe with the traditional means it becomes "1", and when i detach it goes back to "0". So BCC is definitely getting the right address, but the kernel is ignoring this map because the inode doesn't match what it expects. I wonder if perhaps this constitutes a bug in the kernel code? I also tried a simpler USDT test program (the bpftrace test program for usdt probe with semaphore), and it cannot match any map for that either. @liu-song-6 as you authored the original patch, do you have a test case program that it works to attach in this way? |
I believe I have figured it out, it amounts to a bug in the linux kernel, and Given the following memory map:
BCC happens to always find the first vm area, because it matches only by inode. The value that it calculates: 0x5560f0b86000 + 0x34eb60 - 0x00000000 = 0x5560f0ed4b60 If we continue down, to the other vm areas, we get the same result: 0x5560f0baa000 + 0x34eb60 - 0x00024000 = 0x5560f0ed4b60 0x5560f0de6000 + 0x34eb60 - 0x00260000 = 0x5560f0ed4b60 But when we hit 0x5560f0ecf000, we get a different result! 0x5560f0ecf000 + 0x34eb60 - 0x00348000 = 0x5560f0ed5b60 Interestingly, it is off by exactly one page (0x1000, page size 4096). Taking a closer look, we see it is because there is a gap. The previous When we get to the map that should actually contain the ref_ctr_offset, we have 0x0x5560f0ed4000 + 0x34eb60 - 0x0034d000 = 5560f0ed5b60 This is because that same gap of one page is offsetting it. This is why no Because this is the first memory area that has the WRITE permission and it is There are two possible solutions that I can think of:
I have been able to prove this concept with the first approach, though there might be some benefit to the "page gap" counting approach, as this does additional checks. However, these checks are beyond what BCC does now, and BCC seems to work fairly reliably - are these checks actually necessary? The following patch allows me to attach to a ruby process: From a3828da634a37bbea023953ba48b10af55086cdf Mon Sep 17 00:00:00 2001
From: Dale Hamel <dale.hamel@srvthe.net>
Date: Fri, 14 Feb 2020 00:43:39 -0500
Subject: [PATCH] Match ref_ctr_offset by inode only
---
kernel/events/uprobes.c | 4 +---
1 file changed, 1 insertion(+), 3 deletions(-)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 84fa00497c49..51fdd5e4103f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -344,9 +344,7 @@ static bool valid_ref_ctr_vma(struct uprobe *uprobe,
return uprobe->ref_ctr_offset &&
vma->vm_file &&
file_inode(vma->vm_file) == uprobe->inode &&
- (vma->vm_flags & (VM_WRITE|VM_SHARED)) == VM_WRITE &&
- vma->vm_start <= vaddr &&
- vma->vm_end > vaddr;
+ vma->vm_pgoff == 0;
}
static struct vm_area_struct *
--
2.21.0
However, it doesn't work for my USDT test program https://github.com/iovisor/bpftrace/blob/master/tests/testprogs/usdt_semaphore_test.c which has these maps:
In this case, the ref_ctr_offset is 0x601040, but the address that the kernel calculates to attach to is again outside of the address space at 0xA01040. Interestingly, this is exactly the start of the first vm_area - if we subtract dd if=/proc/$(pidof usdt_semaphore_test)/mem bs=1 count=1 skip=$(( 0x601040)) |xxd This is verified with a logging message:
(first debug line is mine, from this code:) rc_vaddr = offset_to_vaddr(rc_vma, uprobe->ref_ctr_offset);
printk(KERN_DEBUG "Using %016llX for global ref_ctr_offset, "
"start: %016llX pgoff: %08X pgoff_shift %08X\n",
rc_vaddr, rc_vma->vm_start, rc_vma->vm_pgoff,
(rc_vma->vm_pgoff << PAGE_SHIFT));
ret = __update_ref_ctr(mm, rc_vaddr, d);
if (ret)
update_ref_ctr_warn(uprobe, mm, d); It looks like it doesn't need any offset. Examining the BCC code gives us a clue: Lines 74 to 83 in 0d0d353
Sure enough, the test executable is In any case, both of these test programs seem to fail to use this kernel API for different reasons. The ruby program because there is a gap of 1 page that makes the calculations incorrect, and the usdt_semaphore_test because it shouldn't have offsets at all as it is not ET_DYN. The above patch is a simple way to mirror BCCs behavior for the ET_DYN cases, but for ET_EXEC we will need to add handling to not calculate a global offset, taking the ref_ctr_offset passed and using it directly. @liu-song-6 @yonghong-song does this make sense to you? What fix do you think is best here? Thanks! Sorry for the wall of text here. |
For the ET_EXEC issue, here is a quick hack that seems to do the trick: From 8270d53ab8bf52a79a7b0821ca3a715b0772e505 Mon Sep 17 00:00:00 2001
From: Dale Hamel <dale.hamel@srvthe.net>
Date: Fri, 14 Feb 2020 09:41:16 -0500
Subject: [PATCH] Handle ET_EXEC
---
kernel/events/uprobes.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index 4d70f8fe3a43..6a2a2b1f34c6 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -51,6 +51,9 @@ DEFINE_STATIC_PERCPU_RWSEM(dup_mmap_sem);
/* Have a copy of original instruction */
#define UPROBE_COPY_INSN 0
+// To check if a program is ET_EXEC, NOTE this is arch dependant
+#define ET_EXEC_MASK 0x00000000FFFFFFFF
+
struct uprobe {
struct rb_node rb_node; /* node in the rb tree */
refcount_t ref;
@@ -420,6 +423,9 @@ static int update_ref_ctr(struct uprobe *uprobe, struct mm_struct *mm,
if (rc_vma) {
rc_vaddr = offset_to_vaddr(rc_vma, uprobe->ref_ctr_offset);
+ if((rc_vma->vm_start & ET_EXEC_MASK) == rc_vma->vm_start)
+ rc_vaddr -= rc_vma->vm_start;
+
printk(KERN_DEBUG "Using %016llX for global ref_ctr_offset, "
"start: %016llX pgoff: %08X pgoff_shift %08X\n",
rc_vaddr, rc_vma->vm_start, rc_vma->vm_pgoff,
--
2.21.0
It's not architecture independent, and it looks like there is a function that could be used to determine if the uprobe->file is ET_DYN or ET_EXEC, but it is static (elf_fdpic_fetch_phdrs). With these two patches I am able to probe both files. Curiously, it looks like the ref counter is actually added to twice for one probe attachment - I'll need to look into that as well. |
Is this because of who the USDT is defined? I found the old test program we were using, which doesn't have "attribute ((visibility ("hidden")))". I haven't verified it with latest code. #define _SDT_HAS_SEMAPHORES 1 #include <stdio.h> int for_uprobe(int c) int main(int argc, char *argv[]) |
Hi @liu-song-6 thank you for your reply. I have run your test program and can confirm it suffers from the same bug, though in this case it seems to be that the file offset is the culprit? With my patch above it is able to correctly determine the offset, if it is based on the address of the first vm area and the program works. Without my patch on a vanilla 5.3 kernel, I have the same problem where no vma is determined to be valid. Here is what the memory map looks like:
BCC uses the address 0x55ecaac26000 + 0x4040 - 0x00000000 = 0x55ecaac2a040 This is also true for : 0x55ecaac27000 + 0x4040 - 0x00001000 = 0x55ecaac2a040 However when we get to the next vm area: 0x55ecaac29000 + 0x4040 - 0x00002000 = 0x55ecaac2b040 It is off by a page. The error carries forward to the final memory area (the one the current code matches on because it has the WRITE flag set): 0x55ecaac2a000 + 0x4040 - 0x00003000 = 0x55ecaac2b040 Here is the exact C code used to generate a.out: #define _SDT_HAS_SEMAPHORES 1
__extension__ unsigned short test_user_semaphore __attribute__ ((unused)) __attribute__ ((section (".probes")));
#define TEST test_user_semaphore
#include <stdio.h>
#include <unistd.h>
#include <sys/sdt.h>
int for_uprobe(int c)
{
if (TEST)
STAP_PROBE(test, user);
printf("%d\n", c + 10);
return c + 1;
}
int main(int argc, char *argv[])
{
for_uprobe(argc);
while (1) {
sleep(1);
printf("semaphore %d\n", test_user_semaphore);
}
} These test programs are being built with GCC 8.3, though I'm not sure that it matters. None of these three test cases (ruby - ET_DYN, usdt_semaphore_test - ET_EXEC, and your test program - ET_DYN) work with this API, and using BCC as a reference implementation it is fairly easy to see why. For the ET_DYN executables, BCC always uses the first vm area and this seems to never suffer from the off-by-one page error. For the ET_EXEC executables, BCC doesn't bother to calculate any offset, and takes the ref_ctr_offset from the stap notes directly, which the kernel code doesn't do. |
I am testing the same program with uprobe_events interface:
And it works. I think the key is, the offset we use for uprobe_events are 0x55d and 0x1034 instead of 0x40055d and 0x601034. I haven't got chance to try it with bcc and/or use perf_event_open() to create uprobe. |
Interesting, how are you able to make this translation? Why are you able to just discard 400000 and 600000 respectively? How could I generalize this? Interestingly, when I build with GCC (8.3) I get:
But with clang (8) I get results more similar to yours
When I read the elf notes it still gives me the full address. GCC produces a ET_DYN executable, and clang produces an ET_EXEC executable, which is also interesting. What is really interesting is that for the executable built with clang, I am able to probe it without any modifications to the kernel! So it would seem that the bug maybe is only specific to GCC-built ELF executables? Can you try with GCC @liu-song-6 ? Also can you confirm that you are building with Clang in your test? I will try building ruby and my other USDT test program with Clang to see if they also work. I would still consider this a bug if the code is compiler-dependent though.
I am achieving this with my branch #2738 and a patch to bpftrace to use this in src/attached_probe.cpp. |
I am using a very old gcc (4.8.5). I didn't mean to use it, it is just the default one on the test machine. Could you please check the |
My mistake sorry, 🤦♂️ I was using a different build of bpftrace that was using BCC to seek /proc/[PID]/mem rather than the kernel API, I tested again using the kernel API for sure and neither the GCC or Clang executable works after all - I got ahead of myself.
Sure: Clang 8 (ET_EXEC type), invoked as just
GCC 8.3 (invoked as just
|
So for these two binary, the offset of the semaphore is 0x3040. I think current kernel should handle that well. Not sure about bcc and bpftrace. |
Those are not the addresses that I see in the systemtap notes: GCC version (ET_EXEC) it is 0x0000000000004040, which what I also read with bcc which uses libelf
Clang version (ET_DYN) it is:
The field you show is what is shown for the What really strikes me is that in the examples I've looked at so far, these are exactly one page off, which is a good sign that you are correct and there may be a bug in BCC then, as this is the address it is already reading. I'll see about patching what field BCC reads, to see if it can use this offset instead of address, and if that fixes the problem. |
It looks like offset to use is the "file offset": And from the struct definition we have this: https://github.com/cuviper/elfutils/blob/08ed26703d658b7ae57ab60b865d05c1cde777e3/libelf/elf.h#L403-L404
(note this appears to be an older version of libelf, but shows the point BCC however appears to parse the description, and gets the virtual address offset. The offset that is needed by the kernel API appears to be the section file offset. BCC appears to get the address from https://github.com/iovisor/bcc/blob/master/src/cc/bcc_elf.c#L61-L72, which earlier calls I'll work on an alternative approach to retrieve this value based on what @liu-song-6 thank you for debugging this with me! I had no idea that this other offset existed, as none of the tools I have been working with show it, and it seems like the key to making this work. |
@liu-song-6 yup that did it! I created a new function which reads this field from the elf file: Lines 687 to 721 in 1edd94a
I then detect if the support for this call exists, and if so I subtract this from the ref_ctr_offset to get the file offset, and it works ilke a charm. If you'd like to try it yourself, you can use this binary (on a system with glib 2.27+): And your test program, using this invocation:
|
@liu-song-6 could you explain what does
I typically just use |
@yonghong-song 0x55d is offset of the function to probe. 0x1034 is the offset for the USDT semaphore. The kernel will increase the semaphore when the uprobe is enabled. |
Thanks. I guess uprobetracer.rst probably needs update to include this. |
@dalehamel Could not fully understand the above example? For EXEC binary, we should already get semaphore address, and we should be fine? It is shared library we has issue as shared library binary file does not really contain address, but rather an offset and we need to go through vm mapping to calculate the address proper? As you mentioned we may have issues as we only go through the first mapping. We may need to find a writable section to calculate actual address. |
@yonghong-song no need to worry about the above, this problem is solved. Based on the wonderful example that @liu-song-6 gave me, I was able to write this function to get the correct adjustment value: Lines 687 to 721 in 1edd94a
Basically the value read by the For the kernel API, we must pass a different address - one relative to the "probe" section. To achieve this, the new function reads this by parsing the ELF section headers, and subtracting this from the semaphore address to get the relative semaphore address. This value is what @liu-song-6 used when he ran This comment #2230 (comment) @yonghong-song I am finishing up the work on this API in #2738. I have it added for the C and python API, but haven't fully tested the C++ and Lua APIs. I plan to work on this on Tuesday, hope to submit the finished patch soon. Regards, and thanks again for the help and instruction on this @liu-song-6 @yonghong-song |
@dalehamel Thanks for explanation. Yes, we do not have API to attach uprobe and at the same time increasing in-kernel reference for uprobe semaphore. A new API should be fine. |
This is fixed by #3135 |
This isn't necessarily a bug in bcc per se, more a write-up of what doesn't work and why - hopefully it will at least help others from going down the rabbit hole I did.
There may be a way for bcc to fix this by finding another way to increment the semaphore, but I don't really see how.
In the chromium kernel source code, here is code in
fs/proc/base.c
that prevents processes from writing to their own memory maps for security reasons:This variable is enabled by default in the kernel used by container OS, such as in google's GKE offering: https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/master/overlay-lakitu/sys-kernel/lakitu-kernel-4_14/files/base.config#3016
This means that anyone trying to use bcc on a chromium derived OS, and especially anyone trying to use bcc on GKE, will probably also hit this if they try to use usdt probes.
During the process of enabling a usdt probe, some probes need to be enabled by writing to a semaphore - this isn't true of all usdt probes, but is probably true of many (I ran into this with ruby's usdt probes). As the dtrace docs indicate, this is a means to avoid expensive processing around the probe, only adding this extra info/processing if the probe is enabled.
The code that handles this in bcc is here:
bcc/src/cc/usdt/usdt.cc
Lines 109 to 113 in c2e2a26
And it is essentially the same as the approach described here.
However, this leads to probes silently failing to be enabled if run against a kernel with the above hardening. Using strace, it is obvious why it fails:
Note that this only will happen for probes where
readelf --notes
indicates a value for the sempahore:As there actually is a sempahore indicated here, this USDT probe would be affected. A similar probe in libc would not be affected and can be attached to as the semaphore is not required for the "enable" mechanism:
The text was updated successfully, but these errors were encountered: