Fix systemd chart update (eBPF) #13884

Merged: thiagoftsm merged 12 commits into netdata:master from fix_ebpf_chart_creation on Nov 1, 2022

@thiagoftsm thiagoftsm commented Oct 26, 2022

Summary

This PR fixes the following error:

2022-10-23 03:18:30: netdata ERROR : PLUGINSD[ebpf] : (0258@collectors/plugins.d:pluginsd_dimens): requested a DIMENSION, without a CHART, on host 'box'. Disabling it.
2022-10-23 03:18:30: netdata IERR  : PLUGINSD[ebpf] : (0337@parser/parser.c     :parser_action  ): action_function() failed with rc = 2
2022-10-23 03:18:30: netdata IERR  : PLUGINSD[ebpf] : (0348@parser/parser.c     :parser_action  ): parser_action() failed.
2022-10-23 03:18:30: netdata ERROR : PLUGINSD[ebpf] : (0133@collectors/plugins.d:pluginsd_worker): '/usr/libexec/netdata/plugins.d/ebpf.plugin' (pid 21184) disconnected after 59773 successful data collections (ENDs).
2022-10-23 03:18:34: netdata ERROR : PLUGINSD[ebpf] : (0412@libnetdata/popen/pop:netdata_pclose ): child pid 21184 exited with code 15.
2022-10-23 03:18:34: netdata ERROR : PLUGINSD[ebpf] : (0091@collectors/plugins.d:pluginsd_worker): '/usr/libexec/netdata/plugins.d/ebpf.plugin' (pid 21184) exited with error code 15, but has given useful output in the past (59773 times). Will not start it again - it is disabled.
2022-10-23 03:18:44: netdata INFO  : PLUGINSD[ebpf] : (0120@libnetdata/threads/t:thread_cleanup ): thread with task id 21167 finished

When a service started after the eBPF plugin was already running, the plugin did not recreate the dimensions properly, which could trigger the error above. This PR fixes that.
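
One way to read the log above: the agent rejects a DIMENSION line unless a CHART line for the same chart was sent first, so the chart definition has to be sent again whenever a service reappears. The sketch below illustrates that ordering only; it is not the code changed in this PR. The struct, the function name, the chart id `services.<name>_calls` and the title and units are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-service state; names are illustrative, not from ebpf.plugin. */
struct service_chart {
    const char *id;   /* service name, e.g. "cron" */
    bool chart_sent;  /* must be reset to false when the service (re)appears */
};

/*
 * Write one collection cycle for `svc` into `out`.
 * Returns the number of bytes written, or -1 if `out` is too small.
 */
static int emit_service(struct service_chart *svc, long long value,
                        char *out, size_t len)
{
    assert(svc != NULL && out != NULL && len > 0);
    int n = 0;

    /* CHART must come before DIMENSION, and both before BEGIN/SET/END. */
    if (!svc->chart_sent) {
        n = snprintf(out, len,
                     "CHART services.%s_calls '' 'Calls' 'calls/s'\n"
                     "DIMENSION calls '' absolute 1 1\n",
                     svc->id);
        if (n < 0 || (size_t)n >= len)
            return -1;
        svc->chart_sent = true;
    }

    int m = snprintf(out + n, len - (size_t)n,
                     "BEGIN services.%s_calls\nSET calls = %lld\nEND\n",
                     svc->id, value);
    if (m < 0 || (size_t)m >= len - (size_t)n)
        return -1;
    return n + m;
}
```

In this sketch, whoever detects that a service was stopped and started again sets `chart_sent` back to `false`, so the next cycle sends CHART and DIMENSION before any values.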

Test Plan

I am really sorry for this, but you will need to use systemd to test the PR. 😄

  1. Compile this PR in an environment with systemd. I suggest compiling with the flag -DNETDATA_DEV_MODE=1.
  2. Make sure the integration between eBPF and cgroups is enabled in your /etc/netdata/ebpf.d.conf:
[global]
    apps = yes
    cgroups = yes
  3. Start netdata.
  4. Wait at least 2 minutes and stop a service.
  5. Now start the service again and wait a few minutes.
  6. After these steps, the plugin should no longer stop.
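
To double-check the configuration step before starting netdata, a small helper like the one below could confirm that both options are set to yes in the [global] section. This is only a sketch: the function name is hypothetical, it parses text already loaded into memory, and it handles only the simple `key = value` form shown above, not the full syntax Netdata accepts.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Returns true when `key = yes` appears inside the [global] section of `text`. */
static bool global_option_enabled(const char *text, const char *key)
{
    assert(text != NULL && key != NULL);
    bool in_global = false;
    const char *p = text;

    while (*p) {
        /* Copy the current line (truncated if very long) into a scratch buffer. */
        const char *eol = strchr(p, '\n');
        size_t raw = eol ? (size_t)(eol - p) : strlen(p);
        char line[256];
        size_t len = raw < sizeof line ? raw : sizeof line - 1;
        memcpy(line, p, len);
        line[len] = '\0';
        p += raw + (eol ? 1 : 0);

        char *s = line;
        while (isspace((unsigned char)*s))
            s++;
        if (*s == '\0' || *s == '#')
            continue;                       /* blank line or comment */
        if (*s == '[') {                    /* section header */
            in_global = strncmp(s, "[global]", 8) == 0;
            continue;
        }
        if (!in_global)
            continue;

        char name[64], value[64];
        if (sscanf(s, "%63[^= \t] = %63s", name, value) == 2 &&
            strcmp(name, key) == 0)
            return strcmp(value, "yes") == 0;
    }
    return false;
}
```

Calling it once with "apps" and once with "cgroups" on the contents of ebpf.d.conf covers the two settings this test plan relies on.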

Additional Information

It is not necessary to test on different kernels.

A quick update: I ran this PR for 10 hours without any issue. With current master on Arch Linux, I was not able to run for long.

For users: How does this change affect me?

- Which area of Netdata is affected by the change? eBPF.plugin
- Can they see the change, or is it under the hood? If they can see it, where? Yes, the plugin won't fail when a service is stopped.
- How is the user impacted by the change? A better plugin.
- Are there any benefits of the change? eBPF will monitor systemd without issues.

@github-actions github-actions bot added area/collectors Everything related to data collection collectors/ebpf labels Oct 26, 2022
@MrZammler
Contributor

Hi Thiago!

Not sure what I'm doing wrong here, I've compiled your branch on an Ubuntu 22.04, in system directories, etc.

Running it I get at startup:

2022-10-26 09:04:40: netdata ERROR : PLUGIN[cgroups] : (0578@collectors/cgroups.p:netdata_cgroup_): Cannot initialize shared memory used by cgroup and eBPF, integration won't happen. (errno 13, Permission denied)
2022-10-26 09:04:40: netdata INFO  : PLUGINSD[ebpf] : (0191@libnetdata/threads/t:thread_start   ): thread created with task id 3658698
2022-10-26 09:04:40: netdata INFO  : PLUGINSD[ebpf] : (0153@libnetdata/threads/t:thread_set_name): set name of thread 3658698 to PLUGINSD[ebpf]
2022-10-26 09:04:40: netdata INFO  : PLUGINSD[ebpf] : (0131@collectors/plugins.d:pluginsd_worker): connected to '/usr/libexec/netdata/plugins.d/ebpf.plugin' running on pid 3658735
2022-10-26 09:04:40:  INFO  : MAIN : (3751@collectors/ebpf.plug:parse_network_v): Name resolution is disabled, collector will not parser "hostnames" list.
2022-10-26 09:04:40:  INFO  : MAIN : (3346@collectors/ebpf.plug:parse_ip_list  ): The network value of CIDR 127.0.0.1/8 was updated for 127.0.0.0 .
2022-10-26 09:04:40:  INFO  : MAIN : (3265@collectors/ebpf.plug:fill_ip_list   ): Adding values 127.0.0.0 - 127.255.255.255 to excluded IP list "socket" used on network viewer
2022-10-26 09:04:40:  INFO  : MAIN : (3265@collectors/ebpf.plug:fill_ip_list   ): Adding values 10.0.0.0 - 10.255.255.255 to included IP list "socket" used on network viewer
2022-10-26 09:04:40:  INFO  : MAIN : (3265@collectors/ebpf.plug:fill_ip_list   ): Adding values 172.16.0.0 - 172.31.255.255 to included IP list "socket" used on network viewer
2022-10-26 09:04:40:  INFO  : MAIN : (3265@collectors/ebpf.plug:fill_ip_list   ): Adding values 192.168.0.0 - 192.168.255.255 to included IP list "socket" used on network viewer
2022-10-26 09:04:40:  INFO  : MAIN : (3272@collectors/ebpf.plug:fill_ip_list   ): Adding values fc00:: - fdff:ffff:ffff:ffff:ffff:ffff:ffff:ffff to included IP list "socket" used on network viewer
2022-10-26 09:04:40:  INFO  : MAIN : (3272@collectors/ebpf.plug:fill_ip_list   ): Adding values ::1 - ::1 to excluded IP list "socket" used on network viewer
2022-10-26 09:04:40:  INFO  : MAIN : (3802@collectors/ebpf.plug:link_dimension_): Adding values Netdata( 19999) to dimension name list used on network viewer
2022-10-26 09:04:40:  INFO  : MAIN : (1951@collectors/ebpf.plug:ebpf_parse_args): Cannot read process groups configuration file '/etc/netdata/apps_groups.conf'. Will try '/usr/lib/netdata/conf.d/apps_groups.conf'
2022-10-26 09:04:40: ebpf.plugin ERROR : MAIN : (2165@collectors/ebpf.plug:main           ): Setrlimit(RLIMIT_MEMLOCK) (errno 1, Operation not permitted)
2022-10-26 09:04:40: netdata ERROR : PLUGINSD[ebpf] : (0228@parser/parser.c     :parser_next    ): read failed: end of file (errno 9, Bad file descriptor)
2022-10-26 09:04:40: netdata ERROR : PLUGINSD[ebpf] : (0133@collectors/plugins.d:pluginsd_worker): '/usr/libexec/netdata/plugins.d/ebpf.plugin' (pid 3658735) disconnected after 0 successful data collections (ENDs).
2022-10-26 09:04:40: netdata ERROR : PLUGINSD[ebpf] : (0412@libnetdata/popen/pop:netdata_pclose ): child pid 3658735 exited with code 4.
2022-10-26 09:04:40: netdata ERROR : PLUGINSD[ebpf] : (0083@collectors/plugins.d:pluginsd_worker): '/usr/libexec/netdata/plugins.d/ebpf.plugin' (pid 3658735) exited with error code 4 and haven't collected any data. Disabling it.
2022-10-26 09:04:40: netdata INFO  : PLUGINSD[ebpf] : (0120@libnetdata/threads/t:thread_cleanup ): thread with task id 3658698 finished
2022-10-26 09:04:42: go.d ERROR: prometheus[ebpf_exporter_local] Get "http://127.0.0.1:9435/metrics": dial tcp 127.0.0.1:9435: connect: connection refused
2022-10-26 09:04:42: go.d ERROR: prometheus[ebpf_exporter_local] check failed

@thiagoftsm
Contributor Author

Hello @MrZammler ,

Are you running netdata/`eBPF.plugin` as root, or does the binary have special permissions like this:

root@hades:/home/thiago/Netdata/netdata# ls -l /usr/libexec/netdata/plugins.d/ebpf.plugin 
-rwsr-x--- 1 root netdata 3298016 Oct 26 01:29 /usr/libexec/netdata/plugins.d/ebpf.plugin*

?

The error Setrlimit(RLIMIT_MEMLOCK) (errno 1, Operation not permitted) means that, on your system, the process did not have permission to adjust its own memory limit.
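
For reference, the failing call can be reproduced outside the plugin with a few lines of C. This is a minimal sketch assuming the limit is raised to RLIM_INFINITY; the values the plugin actually requests may differ, and the function name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>

/* Try to lift the locked-memory limit; returns 0 on success or the errno value. */
static int raise_memlock_limit(void)
{
    struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };

    if (setrlimit(RLIMIT_MEMLOCK, &r) == 0)
        return 0;

    int err = errno;   /* save errno before any other library call can change it */
    assert(err != 0);
    fprintf(stderr, "setrlimit(RLIMIT_MEMLOCK) failed: %s (errno %d)\n",
            strerror(err), err);
    return err;
}
```

Without enough privileges to raise the hard limit (for example in a restricted container), this returns EPERM, which is errno 1 as in the log.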

Best regards!

@MrZammler
Contributor

As far as I see, yes, it should be running as root (e.g. apps.plugin does).

ls -l /usr/libexec/netdata/plugins.d/ebpf.plugin 
-rwsr-x--- 1 root netdata 3376352 Oct 26 09:00 /usr/libexec/netdata/plugins.d/ebpf.plugin

@thiagoftsm
Contributor Author

Do you start netdata using sudo? 🤔 , If yes, probably you need to do this

@MrZammler
Contributor

> Do you start netdata using sudo? 🤔 , If yes, probably you need to do this

Thanks! I'm starting from the systemd service 🤔 ok let me check a bit and will let you know. Will check on the same system current master to see if it behaves the same.

@MrZammler
Contributor

Yes, happens on build from master and the package, so something system related. I'll dig a bit into it if I can find something.

@MrZammler
Contributor

Most likely appears to be a problem with the container I was using to test. Sorry to bother, will test on another VM!

@thiagoftsm
Contributor Author

No problem @MrZammler, I am glad you found it. I assumed you were running directly on a host.

Contributor

@MrZammler MrZammler left a comment

Tested with stopping, starting cron via systemd, charts appear to behave correctly, no errors in log!

@Dim-P
Contributor

Dim-P commented Oct 31, 2022

Hey @thiagoftsm . When testing on Ubuntu 22.04.1 LTS (Jammy Jellyfish) (having cgroups = yes), I am seeing the following crash (not sure if related to this PR or not, I will try master too):

2022-10-31 18:24:48: netdata ERROR : PLUGIN[proc netdev] : (1023@collectors/proc.plug:do_proc_net_dev): Cannot refresh interface wlo1 speed by reading '/sys/class/net/wlo1/speed'. (errno 22, Invalid argument)
2022-10-31 18:24:50: netdata ERROR : PLUGIN[proc netdev] : (1023@collectors/proc.plug:do_proc_net_dev): Cannot refresh interface wlo1 speed by reading '/sys/class/net/wlo1/speed'. (errno 22, Invalid argument)
2022-10-31 18:24:52: netdata ERROR : PLUGIN[proc netdev] : (1023@collectors/proc.plug:do_proc_net_dev): Cannot refresh interface wlo1 speed by reading '/sys/class/net/wlo1/speed'. (errno 22, Invalid argument)
2022-10-31 18:24:54: netdata ERROR : PLUGIN[proc netdev] : (1023@collectors/proc.plug:do_proc_net_dev): Cannot refresh interface wlo1 speed by reading '/sys/class/net/wlo1/speed'. (errno 22, Invalid argument)
2022-10-31 18:24:56: netdata ERROR : PLUGIN[proc netdev] : (1023@collectors/proc.plug:do_proc_net_dev): Cannot refresh interface wlo1 speed by reading '/sys/class/net/wlo1/speed'. (errno 22, Invalid argument)
=================================================================
==63883==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x603000012190 in thread T1
2022-10-31 18:24:58: netdata ERROR : PLUGIN[proc netdev] : (1023@collectors/proc.plug:do_proc_net_dev): Cannot refresh interface wlo1 speed by reading '/sys/class/net/wlo1/speed'. (errno 22, Invalid argument)
    #0 0x5583898ed847 in __interceptor_free (/usr/libexec/netdata/plugins.d/ebpf.plugin+0xdc847)
    #1 0x5583899c6a59 in freez libnetdata/libnetdata.c:406
    #2 0x55838998bde9 in cleanup_variables_from_other_threads collectors/ebpf.plugin/ebpf_apps.c:978
    #3 0x55838998caa7 in collect_data_for_all_processes collectors/ebpf.plugin/ebpf_apps.c:1134
    #4 0x55838995b5ce in process_collector collectors/ebpf.plugin/ebpf_process.c:1143
    #5 0x55838995bc4b in ebpf_process_thread collectors/ebpf.plugin/ebpf_process.c:1353
    #6 0x5583899ee15d in thread_start libnetdata/threads/threads.c:203
    #7 0x7fca6bbafb42 in start_thread nptl/pthread_create.c:442
    #8 0x7fca6bc419ff  (/lib/x86_64-linux-gnu/libc.so.6+0x1269ff)

0x603000012190 is located 16 bytes to the right of 32-byte region [0x603000012160,0x603000012180)
allocated by thread T11 here:
    #0 0x5583898edd67 in __interceptor_calloc (/usr/libexec/netdata/plugins.d/ebpf.plugin+0xdcd67)
    #1 0x5583899c6b07 in callocz libnetdata/libnetdata.c:416
    #2 0x558389949c94 in ebpf_fd_allocate_global_vectors collectors/ebpf.plugin/ebpf_fd.c:1093
    #3 0x55838994c39b in ebpf_fd_thread collectors/ebpf.plugin/ebpf_fd.c:1155
    #4 0x5583899ee15d in thread_start libnetdata/threads/threads.c:203
    #5 0x7fca6bbafb42 in start_thread nptl/pthread_create.c:442

Thread T1 created by T0 here:
    #0 0x5583898919d5 in pthread_create (/usr/libexec/netdata/plugins.d/ebpf.plugin+0x809d5)
    #1 0x5583899ee7b8 in netdata_thread_create libnetdata/threads/threads.c:217
    #2 0x55838993940f in main collectors/ebpf.plugin/ebpf.c:2203
    #3 0x7fca6bb44d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

Thread T11 created by T0 here:
    #0 0x5583898919d5 in pthread_create (/usr/libexec/netdata/plugins.d/ebpf.plugin+0x809d5)
    #1 0x5583899ee7b8 in netdata_thread_create libnetdata/threads/threads.c:217
    #2 0x55838993940f in main collectors/ebpf.plugin/ebpf.c:2203
    #3 0x7fca6bb44d8f in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58

SUMMARY: AddressSanitizer: bad-free (/usr/libexec/netdata/plugins.d/ebpf.plugin+0xdc847) in __interceptor_free
==63883==ABORTING
2022-10-31 18:24:59: netdata ERROR : PLUGINSD[ebpf] : (0228@parser/parser.c     :parser_next    ): read failed: end of file (errno 9, Bad file descriptor)
2022-10-31 18:24:59: netdata ERROR : PLUGINSD[ebpf] : (0133@collectors/plugins.d:pluginsd_worker): '/usr/libexec/netdata/plugins.d/ebpf.plugin' (pid 63883) disconnected after 146 successful data collections (ENDs).
2022-10-31 18:25:00: netdata ERROR : PLUGINSD[ebpf] : (0412@libnetdata/popen/pop:netdata_pclose ): child pid 63883 exited with code 1.
2022-10-31 18:25:00: netdata ERROR : PLUGINSD[ebpf] : (0091@collectors/plugins.d:pluginsd_worker): '/usr/libexec/netdata/plugins.d/ebpf.plugin' (pid 63883) exited with error code 1, but has given useful output in the past (2868 times). Waiting a bit before starting it again.

I am installing with:

sudo CFLAGS="-Og -ggdb -Wall -Wextra -fsanitize=address -static-libasan -fno-omit-frame-pointer -Wformat-signedness -fstack-protector-all -DNETDATA_DEV_MODE=1 -DNETDATA_INTERNAL_CHECKS=1 -D_FORTIFY_SOURCE=2 -DNETDATA_VERIFY_LOCKS=1 -Wformat-truncation=2 -Wunused-result" ./netdata-installer.sh --dont-wait --disable-telemetry --disable-cloud --disable-ml --disable-go --disable-lto --use-system-protobuf

@thiagoftsm
Contributor Author

Hello @Dim-P ,

Thanks for your report. This issue is not related to this PR; I will try to reproduce it. Is this happening when you stop netdata?

Best regards!

@Dim-P
Contributor

Dim-P commented Oct 31, 2022

> Hello @Dim-P ,
>
> Thanks for your report. This issue is not related to this PR, I will try to recreate it. Is this happening when you stop netdata?
>
> Best regards!

No, it's happening a little while after I restart the Netdata service.

Contributor

@Dim-P Dim-P left a comment

Could not reproduce the crash that this PR fixes (on Ubuntu 22.04 with systemd), but I am approving it since I tested it and eBPF works fine (so that it goes into the next release if merged in time).

@thiagoftsm thiagoftsm merged commit d4e0f11 into netdata:master Nov 1, 2022
@thiagoftsm thiagoftsm deleted the fix_ebpf_chart_creation branch November 1, 2022 12:05