Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

libbpf-tools: add profile #3782

Merged
merged 1 commit into from
Feb 7, 2024
Merged

libbpf-tools: add profile #3782

merged 1 commit into from
Feb 7, 2024

Conversation

ekyooo
Copy link
Contributor

@ekyooo ekyooo commented Jan 2, 2022

It is based on Brendan Gregg's BCC profile and much of the code for it is borrowed from offcputime.c and ruqlen.c.
I'm not sure if libbpf-tools requires a profile. So I'm sharing this just to get some feedback before I finish developing of this patch.
Could you give me your opinion? (if necessary or not)
Thank you.

This is an example of usage.

# ./profile -iK
Sampling at 49 Hertz of all threads by kernel stack... Hit Ctrl-C to end.
^C    
    _raw_spin_unlock_irq
    finish_task_switch
    __schedule
    schedule
    schedule_timeout
    msleep
    HDMI21_Tx_EARC_MainThread
    kthread
    ret_from_fork
    -                hdmi21_earc_eng (1427)
        1

    __pi___clean_dcache_area_poc
    dma_direct_map_page
    dma_direct_map_sg
    dma_buf_unified_map
    dma_buf_map_attachment
    kbase_mem_umm_map_attachment
    kbase_mem_import
    kbase_ioctl
    __arm64_compat_sys_ioctl
    el0_svc_common.constprop.0
    el0_svc_compat_handler
    el0_svc_compat
    -                surface-manager (1911)
        4

    __pi___clean_dcache_area_poc
    dma_direct_map_page
    dma_direct_map_sg
    dma_buf_unified_map
    dma_buf_map_attachment
    kbase_mem_umm_map_attachment
    kbase_mem_import
    kbase_ioctl
    __arm64_compat_sys_ioctl
    el0_svc_common.constprop.0
    el0_svc_compat_handler
    el0_svc_compat
    -                tAiToneService (2141)
        12

    [Missed Kernel Stack]
    -                WebAppMgr (3624)
        1

...

This is the result of using the -f (folded) option.

# ./profile -fdK
...
tCMState;low_mem_notify_threshold;__alloc_pages_nodemask;__get_free_pages;__pollwait;unix_poll;sock_poll;do_sys_poll;__arm64_sys_poll;el0_svc_common.constprop.0;el0_svc_compat_handler;el0_svc_compat 2
surface-manager;sock_poll;do_sys_poll;__arm64_sys_poll;el0_svc_common.constprop.0;el0_svc_compat_handler;el0_svc_compat 1
qml-runner;sock_poll;do_sys_poll;__arm64_sys_poll;el0_svc_common.constprop.0;el0_svc_compat_handler;el0_svc_compat 96
mali-cmar-backe;_raw_spin_unlock_irqrestore;kbase_pm_update_cores_state;kbase_pm_do_poweron;kbase_pm_update_active;kbase_hwaccess_pm_gpu_active;kbase_pm_context_active_handle_suspend;kbase_js_sched;kbase_jd_submit;kbase_ioctl;__arm64_compat_sys_ioctl;el0_svc_common.constprop.0;el0_svc_compat_handler;el0_svc_compat 2
surface-manager;__fget;__fget_light;__fdget;sockfd_lookup_light;__sys_sendmsg;__arm64_compat_sys_sendmsg;el0_svc_common.constprop.0;el0_svc_compat_handler;el0_svc_compat 1
...
WARNING: 4 stack traces could not be displayed. Consider increasing --stack-storage-size.

Signed-off-by: Eunseon Lee es.lee@lge.com

@davemarchevsky
Copy link
Collaborator

I'm not sure if libbpf-tools requires a profile. So I'm sharing this just to get some feedback before I finish developing of this patch.
Could you give me your opinion? (if necessary or not)

In general a conversion of a bcc python API tool to a libbpf tool is desired and very appreciated.

@ekyooo
Copy link
Contributor Author

ekyooo commented Jan 3, 2022

In general a conversion of a bcc python API tool to a libbpf tool is desired and very appreciated.

Thank you for your reply.
I'll check the failed case and fix it.
Thank you.

@ekyooo
Copy link
Contributor Author

ekyooo commented Jan 11, 2022

This patch has basic functionality with some limitations of the profile.
This is the test result.

# ./profile -iK
Sampling at 49 Hertz of all threads by kernel stack... Hit Ctrl-C to end.
^C    
    _raw_spin_unlock_irq
    finish_task_switch
    __schedule
    schedule
    schedule_timeout
    msleep
    HDMI21_Tx_EARC_MainThread
    kthread
    ret_from_fork
    -                hdmi21_earc_eng (1427)
        1

    __pi___clean_dcache_area_poc
    dma_direct_map_page
    dma_direct_map_sg
    dma_buf_unified_map
    dma_buf_map_attachment
    kbase_mem_umm_map_attachment
    kbase_mem_import
    kbase_ioctl
    __arm64_compat_sys_ioctl
    el0_svc_common.constprop.0
    el0_svc_compat_handler
    el0_svc_compat
    -                surface-manager (1911)
        4

    __pi___clean_dcache_area_poc
    dma_direct_map_page
    dma_direct_map_sg
    dma_buf_unified_map
    dma_buf_map_attachment
    kbase_mem_umm_map_attachment
    kbase_mem_import
    kbase_ioctl
    __arm64_compat_sys_ioctl
    el0_svc_common.constprop.0
    el0_svc_compat_handler
    el0_svc_compat
    -                tAiToneService (2141)
        12

    [Missed Kernel Stack]
    -                WebAppMgr (3624)
        1

...

This is the result of using the -f (folded) option.

# ./profile -fdK
...
tCMState;low_mem_notify_threshold;__alloc_pages_nodemask;__get_free_pages;__pollwait;unix_poll;sock_poll;do_sys_poll;__arm64_sys_poll;el0_svc_common.constprop.0;el0_svc_compat_handler;el0_svc_compat 2
surface-manager;sock_poll;do_sys_poll;__arm64_sys_poll;el0_svc_common.constprop.0;el0_svc_compat_handler;el0_svc_compat 1
qml-runner;sock_poll;do_sys_poll;__arm64_sys_poll;el0_svc_common.constprop.0;el0_svc_compat_handler;el0_svc_compat 96
mali-cmar-backe;_raw_spin_unlock_irqrestore;kbase_pm_update_cores_state;kbase_pm_do_poweron;kbase_pm_update_active;kbase_hwaccess_pm_gpu_active;kbase_pm_context_active_handle_suspend;kbase_js_sched;kbase_jd_submit;kbase_ioctl;__arm64_compat_sys_ioctl;el0_svc_common.constprop.0;el0_svc_compat_handler;el0_svc_compat 2
surface-manager;__fget;__fget_light;__fdget;sockfd_lookup_light;__sys_sendmsg;__arm64_compat_sys_sendmsg;el0_svc_common.constprop.0;el0_svc_compat_handler;el0_svc_compat 1
...
WARNING: 4 stack traces could not be displayed. Consider increasing --stack-storage-size.

Current limitations are as follows.

  • kernel_ip
    The bpf code in the python profile uses some kernel macros like PAGE_OFFSET to get the kernel_ip, but as I checked it is not included in vmlinux.h or bpf_helpers.h.
    I saw the page below about it, could you please let me know if there is another way to use macros if you have?

Unfortunately, BTF (as well as DWARF) doesn’t record #define macros, so some common macros might be missing with vmlinux.h. Most commonly missing ones might be provided as part of libbpf’s bpf_helpers.h (kernel-side "library", provided by libbpf). (https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html)

Alternatively, the code to get kernel_ip is excluded in this patch. How about having this for now and including the kernel_ip code after libbpf-tools supports the macro?

  • sorting
    The printed output is not yet sorted similar to offcputime.

  • options
    some options such as --cgroupmap, --mntnsmap, -a are not yet included.

Thank you for reading this.
If you agree to the restrictions above, I'll ask detailed opinions about the code.
All comments are welcome.
Thank you.

@@ -54,6 +54,7 @@ APPS = \
tcpconnlat \
tcprtt \
vfsstat \
profile \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the list sorted and add an entry to .gitignore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fixed. Thank you.

@@ -0,0 +1,67 @@
// SPDX-License-Identifier: GPL-2.0
// Copyright (c) 2021 Wenbo Zhang
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrong copyright info. Please use C++ style comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I edited comment style.
For copyright information, I've now followed opensnoop's license-identifier.
I hope to have one chance to fix it after more code review.
Thank you.


struct {
__uint(type, BPF_MAP_TYPE_STACK_TRACE);
__uint(key_size, sizeof(u32));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use __type(key, u32); here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, but the code is throwing a runtime error like below. So I keep the original code.
Thank you.

libbpf: Error in bpf_create_map_xattr(stackmap):Invalid argument(-22). Retrying without BTF.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you update your git submodule ?

u64 *valp;
static const u64 zero;

if (include_idle == false && pid == 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

!include_idle

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fixed. Thank you.

return 0;

// create map key
struct key_t key = {.pid = tgid};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declare variables first.

This .pid = tgid is very counterintuitive. I would suggest you use pid and tid instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. It's fixed. Thank you.

@chenhengqi
Copy link
Collaborator

return 0;

// create map key
struct key_t key = {.pid = pid};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declare variable first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I missed it and fixed it now.
Thank you.


libbpf_set_print(libbpf_print_fn);

err = bump_memlock_rlimit();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please refer to other tools, we don't need this anymore.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fixed it. Thank you.

@ekyooo
Copy link
Contributor Author

ekyooo commented Jan 18, 2022

For the PAGE_OFFSET things, please refer to https://facebookmicrosites.github.io/bpf/blog/2020/02/20/bcc-to-libbpf-howto-guide.html#dealing-with-compile-time-ifs-in-bcc .

Thank you for the information and I checked the page.
I think you recommend me to use "libbpf-provided extern Kconfig variables" for PAGE_OFFSET, right? (because struct flavors solution and other helper APIs of "BPF CO-RE reference guide" is not proper for this case, as I understand.)
I'm checking the macros on different architecture,

  • arm64 : It seems it can use CONFIG_ARM64 and CONFIG_ARM64_VA_BITS from kconfig
  • arm: It seems it can use CONFIG_PAGE_OFFSET from kconfig
  • x86: It seems it can use some kconfig, but has dependency with other macros, so could be vulnerable to change...
    so I'm not sure if it's the right direction for now.

Alternatively, I'm trying to check if I can refer or use machine__get_kernel_start of perf(linux/tools/perf/util/machine.h) to get kernel start address.

I've shared my status for feedback to you, any opinions are welcome if you have for this.
Thank you.


// arm64 only supported
if (CONFIG_ARM64) {
extern __u32 CONFIG_ARM64_VA_BITS __kconfig __weak;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember the 'declare variables first.' as you commented.
However, referencing other tools in libbpf-tools, I found some exceptions in cases like this (for variables used only in if conditions).
Could you please share your opinion?

// populate extras to fix the kernel stack
extern bool CONFIG_ARM64 __kconfig __weak;

// arm64 only supported
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I've added logic to set kernel_ip only for arm64 due to some limitations and difficulty accessing kernel PAGE_OFFSET values at runtime with hope of more support for other architectures with next patches by me or more technical engineer.
Could you please share your opinion?
Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote code to apply the same logic for x86 but to be honest: I do not understand what does this logic.
I saw it was present in bcc version of this gadget:

// populate extras to fix the kernel stack

So, does someone know what do we need this code and what is the meaning of "fixing kernel stack"?

[Here is my code for information:

diff --git a/libbpf-tools/profile.bpf.c b/libbpf-tools/profile.bpf.c
index 08ac5687..9e4f5d0f 100644
--- a/libbpf-tools/profile.bpf.c
+++ b/libbpf-tools/profile.bpf.c
@@ -59,13 +59,13 @@ int do_perf_event(struct bpf_perf_event_data *ctx) {
 
        if (key.kern_stack_id >= 0) {
                // populate extras to fix the kernel stack
+               u64 ip = PT_REGS_IP(&ctx->regs);
+               u64 page_offset;
 
                // arm64 only supported
-#ifdef __TARGET_ARCH_arm64
+#if defined(__TARGET_ARCH_arm64)
                extern __u32 CONFIG_ARM64_VA_BITS __kconfig __weak;
                extern bool CONFIG_ARM64_64K_PAGES __kconfig __weak;
-               u64 ip = PT_REGS_IP(&ctx->regs);
-               u64 page_offset;
                u32 va;
 
                if (CONFIG_ARM64_VA_BITS)
@@ -79,11 +79,19 @@ int do_perf_event(struct bpf_perf_event_data *ctx) {
                        page_offset = (-(1UL << (va)));
                else
                        page_offset = (0xffffffffffffffffUL - (1UL << (va - 1)) + 1);
-
-               if (ip > page_offset) {
+#elif defined(__TARGET_ARCH_x86)
+#define __PAGE_OFFSET_BASE_L4 0xffff888000000000UL
+#define __PAGE_OFFSET_BASE_L5 0xff11000000000000UL
+               extern bool CONFIG_DYNAMIC_MEMORY_LAYOUT __kconfig __weak;
+               extern bool CONFIG_X86_5LEVEL __kconfig __weak;
+
+               if (CONFIG_DYNAMIC_MEMORY_LAYOUT && CONFIG_X86_5LEVEL)
+                       page_offset = __PAGE_OFFSET_BASE_L5;
+               else
+                       page_offset = __PAGE_OFFSET_BASE_L4;
+#endif /* __TARGET_ARCH_x86 */
+               if (ip > page_offset)
                        key.kernel_ip = ip;
-               }
-#endif
        }
 
        valp = bpf_map_lookup_or_try_init(&counts, &key, &zero);

]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, as I understand, It's to append ip address of the register to the top of the kernel stack trace to provide more detailed information to the user. page_offset is used to determine whether ip is a kernel address.

Thank you very much for the code for x86. But unfortunately I changed the code from yesterday...
because

  • If the target doesn't have Kconfig, the profile won't work at all.
  • On some versions of Linux, page_offset is not an effective value for this purpose.
    Could you please check the updated code?
    I'm really sorry and thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really sorry and thank you.

No problem, your actual version is simpler (and so better) than the previous and my proposal for x86.

Nonetheless, I do not really understand the goal of this code (or how it is printed).
Indeed, from my tests, I got the following output:

$ sudo ./profile -K 5
Sampling at 49 Hertz of all threads by kernel stack for 5 secs.
    #0  0xffffffff98ed69cb arch_local_irq_enable+0xb
    #1  0xffffffff99a00ec3 __schedule+0x773
    #2  0xffffffff99a0109f schedule+0x4f
    #3  0xffffffff98edf5f7 do_sched_yield+0x87
    #4  0xffffffff98edf62e __x64_sys_sched_yield+0xe
    #5  0xffffffff999f3b21 do_syscall_64+0x61
    #6  0xffffffff99c0007c entry_SYSCALL_64_after_hwframe+0x44
    kernel_ip#7  0xffffffff98ed69cb arch_local_irq_enable+0xb
...

You can see the entry where k->kernel_ip is not NULL begining with kernel_ip.
But this entry is the same as entry #0, so this is why I do not understand this code (maybe I am not using the tool like I should do).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, I fixed kernel_ip to be at the top of the call stack. Thank you so much for finding it.
And yes, as you said, I found a duplicate with the first entry in the call stack. I will check and update my opinion on it. I have ever seen this before and it's was produced with some conditions in my faint memory. please let me check it. Thank you.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where I see it before was a result of bcc docker. (https://github.com/iovisor/bcc/blob/master/QUICKSTART.md)
These are the test results for profile.py. And it has the same kernel_ip and call stack entry #0..

...
    do_syscall_64   // kernel_ip
    do_syscall_64   // call stack entry #0
    entry_SYSCALL_64_after_hwframe
    [unknown]
    [unknown]
    -                qemu-system-x86 (36165)
        7

    native_write_msr_safe  // kernel_ip
    native_write_msr_safe  // call stack entry #0
    kvm_set_shared_msr
    vmx_save_host_state
...

@yonghong-song
If you have the same experience, could you please share your opinion on the equality of call stack entry #0 and kernel ip in profile.py?
Thank you.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi.

This code was indeed already present in the bcc version of this tool, I just do not understand why we need this (maybe it was mandatory with some older kernels because we cannot get the full kernel stack from the BPF_MAP_TYPE_STACK_TRACE?).

Thank you for your investigation though :)

Best regards.

if (LINUX_KERNEL_VERSION >= KERNEL_VERSION(5, 4, 0))
page_offset = (-(1UL << (va)));
else
page_offset = (0xffffffffffffffffUL - (1UL << (va - 1)) + 1);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since v5.4, the definition of PAGE_OFFSET has changed and it has been divided into two cases.
If you don't need it, I'd appreciate your feedback.
Thank you.

@ekyooo
Copy link
Contributor Author

ekyooo commented Feb 15, 2022

I found and fixed a bug in the first patch. I will update soon. Sorry and thank you.

@ekyooo ekyooo force-pushed the profile branch 2 times, most recently from 8033322 to 5a07dfe Compare February 20, 2022 11:17
@ekyooo
Copy link
Contributor Author

ekyooo commented Feb 20, 2022

This is the test result of c88462c
It's sorted by count.

# profile 
Sampling at 49 Hertz of all threads by user + kernel stack... Hit Ctrl-C to end.
^C    _raw_write_unlock_irq
    ep_poll
    do_epoll_wait
    __arm64_sys_epoll_pwait
    el0_svc_common.constprop.0
    el0_svc_handler
    el0_svc
    _raw_write_unlock_irq
    epoll_pwait
    sd_event_wait
    sd_event_run
    sd_event_loop
    [unknown]
    __libc_start_main
    [unknown]
    -                systemd-oomd (825)
        2

    _raw_spin_unlock_irq
    do_timerfd_settime
    __arm64_sys_timerfd_settime
    el0_svc_common.constprop.0
    el0_svc_handler
    el0_svc
    _raw_spin_unlock_irq
    timerfd_settime
    sd_event_prepare
    sd_event_run
    sd_event_loop
    [unknown]
    __libc_start_main
    [unknown]
    -                systemd-oomd (825)
        2

    _raw_spin_unlock_irqrestore
    finish_wait
    __skb_wait_for_more_packets
    __skb_recv_datagram
    skb_recv_datagram
    raw_recvmsg
    inet_recvmsg
    sock_recvmsg
    __sys_recvfrom
    __arm64_sys_recvfrom
    el0_svc_common.constprop.0
    el0_svc_handler
    el0_svc
    _raw_spin_unlock_irqrestore
    recvfrom
    [unknown]
    __libc_start_main
    [unknown]
    -                rdisc (346)
        1

@ekyooo
Copy link
Contributor Author

ekyooo commented Feb 20, 2022

This is the test result of 5a07dfe

Before:

  # profile -d
      psiginfo
      vscanf
      __snprintf_chk
      [unknown]
      [unknown]
      [unknown]
      [unknown]
      [unknown]
      sd_event_exit
      sd_event_dispatch
      sd_event_run
      [unknown]
      __libc_start_main
      [unknown]
      -                systemd-journal (204)
          1

      xas_load
      xas_find
      filemap_map_pages
      __handle_mm_fault
      handle_mm_fault
      do_page_fault
      do_translation_fault
      do_mem_abort
      do_el0_ia_bp_hardening
      el0_ia
      xas_load
      --
  failed to get syms
      -                PmLogCtl (138757)
        1

After:

  # profile -d
      #0  0xffffffc01018b7e8 __arm64_sys_clock_nanosleep+0x0
      #1  0xffffffc01009a93c el0_svc_handler+0x34
      #2  0xffffffc010084a08 el0_svc+0x8
      #3  0xffffffc01018b7e8 __arm64_sys_clock_nanosleep+0x0
      --
      #4  0x0000007fa0bffd14 clock_nanosleep+0x94 (/usr/lib/libc-2.31.so+0x9ed14)
      #5  0x0000007fa0c0530c nanosleep+0x1c (/usr/lib/libc-2.31.so+0xa430c)
      #6  0x0000007fa0c051e4 sleep+0x34 (/usr/lib/libc-2.31.so+0xa41e4)
      #7  0x000000558a5a9608 flb_loop+0x28 (/usr/bin/fluent-bit+0x52608)
      #8  0x000000558a59f1c4 flb_main+0xa84 (/usr/bin/fluent-bit+0x481c4)
      #9  0x0000007fa0b85124 __libc_start_main+0xe4 (/usr/lib/libc-2.31.so+0x24124)
      #10 0x000000558a59d828 _start+0x34 (/usr/bin/fluent-bit+0x46828)
      -                fluent-bit (1238)
          1

      #0  0xffffffc01027daa4 generic_copy_file_checks+0x334
      #1  0xffffffc0102ba634 __handle_mm_fault+0x8dc
      #2  0xffffffc0102baa20 handle_mm_fault+0x168
      #3  0xffffffc010ad23c0 do_page_fault+0x148
      #4  0xffffffc010ad27c0 do_translation_fault+0xb0
      #5  0xffffffc0100816b0 do_mem_abort+0x50
      #6  0xffffffc0100843b0 el0_da+0x1c
      #7  0xffffffc01027daa4 generic_copy_file_checks+0x334
      --
      #8  0x0000007f8dc12648 [unknown]
      #9  0x0000007f8dc0aef8 [unknown]
      #10 0x0000007f8dc1c990 [unknown]
      #11 0x0000007f8dc08b0c [unknown]
      #12 0x0000007f8dc08e48 [unknown]
      #13 0x0000007f8dc081c8 [unknown]
      -                PmLogCtl (2412)
          1

Copy link
Contributor

@eiffel-fl eiffel-fl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi.

Thank you for translating this tool!
I tested it and it works, I have just some comments (regarding things I do not understand).
Note that, I am simple contributor here, my review will not have consequence regarding the merging.

Best regards.

if (CONFIG_ARM64_VA_BITS)
va = CONFIG_ARM64_VA_BITS;
else if (CONFIG_ARM64_64K_PAGES)
va = 42;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please indicate what this value is.
I guess it is the number of bit used for a page virtual address, but a comment (or a define) would be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

// populate extras to fix the kernel stack
extern bool CONFIG_ARM64 __kconfig __weak;

// arm64 only supported
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote code to apply the same logic for x86 but to be honest: I do not understand what does this logic.
I saw it was present in bcc version of this gadget:

// populate extras to fix the kernel stack

So, does someone know what do we need this code and what is the meaning of "fixing kernel stack"?

[Here is my code for information:

diff --git a/libbpf-tools/profile.bpf.c b/libbpf-tools/profile.bpf.c
index 08ac5687..9e4f5d0f 100644
--- a/libbpf-tools/profile.bpf.c
+++ b/libbpf-tools/profile.bpf.c
@@ -59,13 +59,13 @@ int do_perf_event(struct bpf_perf_event_data *ctx) {
 
        if (key.kern_stack_id >= 0) {
                // populate extras to fix the kernel stack
+               u64 ip = PT_REGS_IP(&ctx->regs);
+               u64 page_offset;
 
                // arm64 only supported
-#ifdef __TARGET_ARCH_arm64
+#if defined(__TARGET_ARCH_arm64)
                extern __u32 CONFIG_ARM64_VA_BITS __kconfig __weak;
                extern bool CONFIG_ARM64_64K_PAGES __kconfig __weak;
-               u64 ip = PT_REGS_IP(&ctx->regs);
-               u64 page_offset;
                u32 va;
 
                if (CONFIG_ARM64_VA_BITS)
@@ -79,11 +79,19 @@ int do_perf_event(struct bpf_perf_event_data *ctx) {
                        page_offset = (-(1UL << (va)));
                else
                        page_offset = (0xffffffffffffffffUL - (1UL << (va - 1)) + 1);
-
-               if (ip > page_offset) {
+#elif defined(__TARGET_ARCH_x86)
+#define __PAGE_OFFSET_BASE_L4 0xffff888000000000UL
+#define __PAGE_OFFSET_BASE_L5 0xff11000000000000UL
+               extern bool CONFIG_DYNAMIC_MEMORY_LAYOUT __kconfig __weak;
+               extern bool CONFIG_X86_5LEVEL __kconfig __weak;
+
+               if (CONFIG_DYNAMIC_MEMORY_LAYOUT && CONFIG_X86_5LEVEL)
+                       page_offset = __PAGE_OFFSET_BASE_L5;
+               else
+                       page_offset = __PAGE_OFFSET_BASE_L4;
+#endif /* __TARGET_ARCH_x86 */
+               if (ip > page_offset)
                        key.kernel_ip = ip;
-               }
-#endif
        }
 
        valp = bpf_map_lookup_or_try_init(&counts, &key, &zero);

]

@@ -250,13 +252,65 @@ static int cmp_counts(const void *dx, const void *dy)
return x > y ? -1 : !(x == y);
}

static bool batch_map_ops = true; /* hope for the best */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand this code.
You want to detect if bpf_map_lookup_batch() is available, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes batch_map_ops is initially set to true and then changed to false if bpf_map_lookup_batch() is not supported.
In fact, it was referenced in syscount.c. :). I couldn't figure out the meaning of the comments alone. So if you like, I'll edit the comment to be more detailed.

Copy link
Contributor

@eiffel-fl eiffel-fl Mar 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh sorry I did not see you set this to false later, this totally makes sense then!
Regarding the comment, if you think it is better to edit it, then do, otherwise I am fine with this, this is a cool feature.

Copy link
Contributor Author

@ekyooo ekyooo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for your test, review and opinion.

@@ -250,13 +252,65 @@ static int cmp_counts(const void *dx, const void *dy)
return x > y ? -1 : !(x == y);
}

static bool batch_map_ops = true; /* hope for the best */
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes batch_map_ops is initially set to true and then changed to false if bpf_map_lookup_batch() is not supported.
In fact, it was referenced in syscount.c. :). I couldn't figure out the meaning of the comments alone. So if you like, I'll edit the comment to be more detailed.

// populate extras to fix the kernel stack
extern bool CONFIG_ARM64 __kconfig __weak;

// arm64 only supported
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, as I understand, It's to append ip address of the register to the top of the kernel stack trace to provide more detailed information to the user. page_offset is used to determine whether ip is a kernel address.

Thank you very much for the code for x86. But unfortunately I changed the code from yesterday...
because

  • If the target doesn't have Kconfig, the profile won't work at all.
  • On some versions of Linux, page_offset is not an effective value for this purpose.
    Could you please check the updated code?
    I'm really sorry and thank you.

if (CONFIG_ARM64_VA_BITS)
va = CONFIG_ARM64_VA_BITS;
else if (CONFIG_ARM64_64K_PAGES)
va = 42;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above

@ekyooo
Copy link
Contributor Author

ekyooo commented Jul 20, 2022

Hi @ekyooo is it possible to sort the output like perf top like below? the top function cpu cycles percentage for processes/kernel, it is easier to understand :-)

@vincentmli, @eiffel-fl
It looks useful, I'll add it.
Thank you.

@vincentmli
Copy link

Hi @ekyooo @eiffel-fl I tested the profile, I noticed if I want to profile a process in container, profile is unable to get the process symbols in container, if I run the process in host, not in container, profile is able to get symbol for the process, the process has gcc flag -fno-omit-frame-pointer compiled that contains symbols, it appears all existing libbpf-tools tool are not name space aware, right. it would be nice to have profile support profiling process running in container with symbols resolved, maybe this is can be another follow-up PR once this PR is merged? by the way, I tested perf tool, perf is able to profile process in container and get symbol resolved.

@eiffel-fl
Copy link
Contributor

eiffel-fl commented Jul 20, 2022

it appears all existing libbpf-tools tool are not name space aware, right.

It is indeed the case but I am not sure if the goals of this tools are to be namespace aware.
In Inspektor Gadget, we modified some of the libbpf tools to be namespace aware (e.g. profile).

@vincentmli
Copy link

It is indeed the case but I am not sure if the goals of this tools are to be namespace aware. In Inspektor Gadget, we modified some of the libbpf tools to be namespace aware (e.g. profile).

right, I understand these tools in the beginning are not aimed for container environment, just thought it might be good to add it now since container environment like k8s is getting popular :-). your profile example is very good, I guess the changes you made can be directly copied/pasted/modified to existing libbpf-tools and it should work, right?

@eiffel-fl
Copy link
Contributor

I guess the changes you made can be directly copied/pasted/modified to existing libbpf-tools and it should work, right?

I think you will need to add the mntns map handling in profile.c and also to add an option to this program, like --mnt-ns-id, to then load the mount namespace ID in the corresponding map so filtering is done in eBPF.
I will take a deeper look to this to see how it translates in terms of implementation.

@ekyooo
Copy link
Contributor Author

ekyooo commented Aug 4, 2022

Could you review it? Or are you waiting for me to do something?

@ekyooo
Copy link
Contributor Author

ekyooo commented Dec 20, 2023

It's just rebased on the master branch.
I am not requesting a review now.

@ekyooo
Copy link
Contributor Author

ekyooo commented Jan 5, 2024

This patch includes the following changes:

  • Reduced the nesting of the "print_counts" functions by separating sub-functions.
  • Removed the logic related to "adding kernel_ip." It can be added again if we find it necessary.
  • Show symbol names only in stack trace for code simplicity. An expanded form of the stack trace (including module information and symbol offsets) will be reapplied after the current review is complete.
  • Fixed other issues. (sorry for missing your comment)

This version of the code looks cleaner to read. Could you please review this?
Thank you.

@yezhem
Multi-process support has been implemented in Profile.
However, in order to simplify the current code for review, I plan to apply it as an additional patch after the current review is completed.
Thank you.

Copy link
Contributor

@eiffel-fl eiffel-fl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

Thank you for rebasing and still working on this PR despite it was first opened two years ago!
I will take a deeper look and test it later, as I want to fully understand the modifications you brought to the code. For now, I have some comments.

Best regards.

Comment on lines +138 to +151
case 'U':
env.user_stacks_only = true;
break;
case 'K':
env.kernel_stacks_only = true;
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my usage of your tool, I am currently modifying this to have only one flag:
inspektor-gadget/inspektor-gadget#1417

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your good opinion.
But applying it in this patch make difference with bcc python Profile.
How about applying it to both the Python and C versions through a separate patch?

@chenhengqi
Could you please share your thoughts on this?

Comment on lines 288 to 291
items[i].k.pid = keys[i].pid;
items[i].k.user_stack_id = keys[i].user_stack_id;
items[i].k.kern_stack_id = keys[i].kern_stack_id;
strncpy(items[i].k.name, keys[i].name, TASK_COMM_LEN);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you are using the key_t both in userspace and eBPF, you can use memcpy here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it. Thank you.

Comment on lines 401 to 360
static void print_count(struct key_t *event, __u64 count, int sfd,
struct ksyms *ksyms, struct syms_cache *syms_cache)
{
unsigned long *ip;

ip = calloc(env.perf_max_stack_depth, sizeof(*ip));
if (!ip) {
fprintf(stderr, "failed to alloc ip\n");
return;
}

/* kernel stack */
if (!env.user_stacks_only && !STACK_ID_EFAULT(event->kern_stack_id)) {
if (bpf_map_lookup_elem(sfd, &event->kern_stack_id, ip) != 0)
printf(" [Missed Kernel Stack]\n");
else
print_kernel_stacktrace(ip, ksyms);
}

/* user stack */
if (!env.kernel_stacks_only && !STACK_ID_EFAULT(event->user_stack_id)) {
if (NEED_DELIMITER(env.delimiter, event->user_stack_id, event->kern_stack_id))
printf(" --\n");

if (bpf_map_lookup_elem(sfd, &event->user_stack_id, ip) != 0)
printf(" [Missed User Stack]\n");
else
print_user_stacktrace(ip, syms_cache, event->pid);
}

printf(" %-16s %s (%d)\n", "-", event->name, event->pid);
printf(" %lld\n\n", count);

free(ip);
}

static void print_count_folded(struct key_t *event, __u64 count, int sfd,
struct ksyms *ksyms, struct syms_cache *syms_cache)
{
unsigned long *ip;

ip = calloc(env.perf_max_stack_depth, sizeof(*ip));
if (!ip) {
fprintf(stderr, "failed to alloc ip\n");
return;
}

printf("%s", event->name);

/* user stack */
if (!env.kernel_stacks_only && !STACK_ID_EFAULT(event->user_stack_id)) {
if (bpf_map_lookup_elem(sfd, &event->user_stack_id, ip) != 0)
printf(";[Missed User Stack]");
else
print_user_stacktrace_folded(ip, syms_cache, event->pid);
}

/* kernel stack */
if (!env.user_stacks_only && !STACK_ID_EFAULT(event->kern_stack_id)) {
if (NEED_DELIMITER(env.delimiter, event->user_stack_id, event->kern_stack_id))
printf(";-");

if (bpf_map_lookup_elem(sfd, &event->kern_stack_id, ip) != 0)
printf(";[Missed Kernel Stack]");
else
print_kernel_stacktrace_folded(ip, ksyms);
}

printf(" %lld\n", count);

free(ip);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can factorize this code to reduce the duplicated one.
Maybe, we can check the folded argument in the print_*_stacktrace() functions?

Copy link
Contributor Author

@ekyooo ekyooo Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, we can check the folded argument in the print_*_stacktrace() functions?

It can reduce duplication slightly, but it cannot reduce meaningfully because iterating direction and format are different.

In principle, it's good to separate functions to handle data and representation. For separating this, I can try using iterator that takes addresses and syms and then return resolved symbols, or I can try returning resolved symbols through another array.
But It can make complex code in C language or need more memory usage.

Any thoughts on using iterators for this? Shall I try it?

@ekyooo ekyooo force-pushed the profile branch 2 times, most recently from 992702f to 9098638 Compare January 11, 2024 11:22
@ekyooo
Copy link
Contributor Author

ekyooo commented Jan 11, 2024

I discovered and fixed an issue where some stack traces had zero counts.
Sorry for the issue.

And I'm working on removing duplication.
Thank you.

@ekyooo
Copy link
Contributor Author

ekyooo commented Jan 18, 2024

I think you can factorize this code to reduce the duplicated one. Maybe, we can check the folded argument in the print_*_stacktrace() functions?

I reduced duplicate code and excluded the folded option for the sake of simplicity. The folded option can be added in the next step if needed.
I further applied additional modifications to enhance code simplicity and clarity.
Could you please review this?
Thank you.

@eiffel-fl
Copy link
Contributor

Hi!

I think you can factorize this code to reduce the duplicated one. Maybe, we can check the folded argument in the print_*_stacktrace() functions?

I reduced duplicate code and excluded the folded option for the sake of simplicity. The folded option can be added in the next step if needed. I further applied additional modifications to enhance code simplicity and clarity. Could you please review this? Thank you.

Thank you! I will take a look at it during this week!

Copy link
Contributor

@eiffel-fl eiffel-fl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi!

I tested it and it works perfectly:

$ yes > /dev/null &
$ sudo ./profile -p $!
...
        1

    vfs_write
    ksys_write
    __x64_sys_write
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    write
    [unknown]
    -                yes (36571)
        1
...

I just found one small nit and I have a question with regard to resource usage but in my opinion, this is ready to go.

Best regards.

Comment on lines 267 to 273
if (x > y)
return -1;

if (x < y)
return 1;

return 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (x > y)
return -1;
if (x < y)
return 1;
return 0;
return y - x;

qsort function requires than something greater, equal or less than 0 is returned, so using this should do the trick.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback.
I have applied it.

fprintf(stderr, "failed to attach perf event on cpu: "
"%d\n", i);
links[i] = NULL;
close(fd);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do not call close() in cleanup, I suppose bpf_link__destroy() also closes the underlying perf event?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I checked it again and confirmed that the link->detach called in bpf_link__destroy() closes the file descriptor. Thank you.

Signed-off-by: Eunseon Lee <es.lee@lge.com>
Copy link
Contributor Author

@ekyooo ekyooo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's been updated with your feedback along with some other minor changes.
Thank you.

fprintf(stderr, "failed to attach perf event on cpu: "
"%d\n", i);
links[i] = NULL;
close(fd);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I checked it again and confirmed that the link->detach called in bpf_link__destroy() closes the file descriptor. Thank you.

Comment on lines 267 to 273
if (x > y)
return -1;

if (x < y)
return 1;

return 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback.
I have applied it.

Copy link
Contributor

@Bojun-Seo Bojun-Seo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested and approve this pull request.
Followings are the steps I did on Ubuntu 22.04.3 LTS, x86_64 machine.

/* Build profile */
$ git clone https://github.com/ekyooo/bcc.git -b profile & cd bcc
$ mkdir build & cd build
$ cmake ..
$ cd ../libbpf-tools
$ make profile

/* Build test program */
$ cat test.c
#include <stdlib.h>

unsigned int foo(unsigned int i) {
  unsigned int tmp = i + 50;
  return tmp * 3;
}

int main(int argc, char* argv[]) {
  unsigned int i = 0;
  while (i >= 0) {
    i = foo(i);
    ++i;
  }
  return 0;
}
$ gcc test.c

/* Run test program and profile */
$ ./a.out &
$ sudo ./profile -p 134827 -U
Sampling at 49 Hertz of PID 134827 by user... Hit Ctrl-C to end.
^C    main
    [unknown]
    -                a.out (134827)
        611

    main
    [unknown]
    -                a.out (134827)
        278

    foo
    main
    [unknown]
    -                a.out (134827)
        254

    foo
    main
    [unknown]
    -                a.out (134827)
        232

    main
    [unknown]
    -                a.out (134827)
        221

    foo
    main
    [unknown]
    -                a.out (134827)
        199

    foo
    main
    [unknown]
    -                a.out (134827)
        123

    foo
    main
    [unknown]
    -                a.out (134827)
        117

    foo
    main
    [unknown]
    -                a.out (134827)
        114

    foo
    main
    [unknown]
    -                a.out (134827)
        113

    foo
    [unknown]
    -                a.out (134827)
        105

/* Run profile.py and compare the result */
$ sudo python3 ../tools/profile.py -p 134827 -U
Sampling at 49 Hertz of PID [134827] by user stack... Hit Ctrl-C to end.

^C
    foo
    main
    __libc_start_call_main
    -                a.out (134827)
        3

    foo
    main
    __libc_start_call_main
    -                a.out (134827)
        3

    foo
    main
    __libc_start_call_main
    -                a.out (134827)
        4

    foo
    main
    __libc_start_call_main
    -                a.out (134827)
        5

    foo
    __libc_start_call_main
    -                a.out (134827)
        5

    foo
    main
    __libc_start_call_main
    -                a.out (134827)
        6

    main
    __libc_start_call_main
    -                a.out (134827)
        7

    foo
    main
    __libc_start_call_main
    -                a.out (134827)
        8

    main
    __libc_start_call_main
    -                a.out (134827)
        9

    foo
    main
    __libc_start_call_main
    -                a.out (134827)
        11

    main
    __libc_start_call_main
    -                a.out (134827)
        27

It looks same except unknown printing. And unknown printing issue is related with #4849. Which means it is not directly related with this PR.

@ekyooo
Copy link
Contributor Author

ekyooo commented Feb 2, 2024

@eiffel-fl @Bojun-Seo
Thank you for the review, testing and approving!

Can anybody take a look and possibly merge this?

@yonghong-song
Copy link
Collaborator

Thanks for detailed reviews by @chenhengqi @eiffel-fl @Bojun-Seo. The patch looks good to me. I see in the current patch bpf_map_get_next_key() is used. The kernel has implemented batch lookup which should be more reliable. But this can be done as a followup.

@yonghong-song yonghong-song merged commit b14c463 into iovisor:master Feb 7, 2024
1 of 12 checks passed
@ekyooo
Copy link
Contributor Author

ekyooo commented Feb 7, 2024

The kernel has implemented batch lookup which should be more reliable. But this can be done as a followup.

I have code that uses batch lookup. I will apply it with the next patch.

@yonghong-song @chenhengqi @eiffel-fl @Bojun-Seo
Thank you for all your efforts.
It's all thanks to you.

@eiffel-fl
Copy link
Contributor

You are more than welcome! It was a pleasure to work with you!

@ekyooo
Copy link
Contributor Author

ekyooo commented Apr 16, 2024

@eiffel-fl
Are you in OSS NA? If so, I'd like to meet you for a moment.

@eiffel-fl
Copy link
Contributor

Hi!

@eiffel-fl Are you in OSS NA? If so, I'd like to meet you for a moment.

Sadly no, I will mail you on your Signed-off-by mail the next time I will attend an event (I do know when it will be nonetheless).

Best regards and enjoy OSS :D!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants