[Bug]: EBPF Plugin Kernel SEGFAULT #15103
Comments
Hello @Knot3n , Thank you for your report. I am going to install Rocky Linux and try to reproduce the issue, because unfortunately I did not see it on other Linux distributions such as Alma 8.6. Do you have a backtrace from these coredumps? Finally, it is interesting that your system is freezing: eBPF programs run inside VMs, so we do not expect a crash in the VM to freeze the system, but Rocky does not use exactly the same kernel delivered by the kernel team. Best regards!
Hi @thiagoftsm , unfortunately there is no backtrace.
To give you more insight: we updated Rocky Linux yesterday (I can also share the upgraded packages), and afterwards the systems went crazy. When one VM froze I thought it was a one-off, but then three other VMs also froze and stopped responding. Since then (3 hours) I have been searching for the cause: was it an old kernel, an old netdata version, etc.? As a hotfix we have now disabled the eBPF plugin.
Please give me the list of packages you are running; the best scenario for me is to recreate exactly the same environment you are running. We apologize for the issues! As soon as I reproduce the problem in my environment I will open a PR to fix it.
These are the upgraded packages (dnf history info 105).
On this system the following services are running:
Docker for flow stats + AS-STATS with Python -> NGINX as reverse proxy. Since it happens not only on that VM but also on others, it is definitely something with the eBPF plugin. The other system is hosting, for example:
Thanks, hopefully you will find something. If you need more information from me, feel free to ask.
@Knot3n , can you please give me the kernel version running on both VMs?
Sorry @Knot3n , I see now that the list was hidden. 🤦♂️
Hello @Knot3n , I made a VM with more or less the same packages as yours (I am only sharing the ones most relevant to eBPF here):
[root@rocky netdata]# rpm -qa| grep kernel
kernel-devel-4.18.0-477.10.1.el8_8.x86_64
kernel-4.18.0-477.10.1.el8_8.x86_64
kernel-tools-libs-4.18.0-477.10.1.el8_8.x86_64
kernel-tools-4.18.0-477.10.1.el8_8.x86_64
kernel-core-4.18.0-477.10.1.el8_8.x86_64
kernel-headers-4.18.0-477.10.1.el8_8.x86_64
kernel-modules-4.18.0-477.10.1.el8_8.x86_64
[root@rocky netdata]# rpm -qa| grep gcc
libgcc-8.5.0-18.el8.x86_64
gcc-8.5.0-18.el8.x86_64
gcc-plugin-annobin-8.5.0-18.el8.x86_64
gcc-c++-8.5.0-18.el8.x86_64
[root@rocky netdata]# ps aux | grep netdata
netdata 15637 0.5 7.3 546896 72268 ? SNsl 14:02 0:15 /usr/sbin/netdata -P /run/netdata/netdata.pid -D
netdata 15639 0.0 0.9 61396 9272 ? SNl 14:02 0:00 /usr/sbin/netdata --special-spawn-server
netdata 15792 0.5 1.0 166612 10124 ? SNl 14:02 0:14 /usr/libexec/netdata/plugins.d/apps.plugin 1
root 15793 0.0 8.7 1214384 86344 ? SNl 14:02 0:01 /usr/libexec/netdata/plugins.d/ebpf.plugin 1
netdata 15794 0.0 0.2 39984 2896 ? SN 14:02 0:00 /usr/libexec/netdata/plugins.d/debugfs.plugin 1
netdata 15796 0.1 5.7 773868 56976 ? SNl 14:02 0:03 /usr/libexec/netdata/plugins.d/go.d.plugin 1
root 16047 0.0 0.1 12144 1208 pts/0 S+ 14:48 0:00 grep --color=auto netdata
[root@rocky netdata]# ps -p 15793 -o pid,cmd,etime,uid,gid
PID CMD ELAPSED UID GID
15793 /usr/libexec/netdata/plugin 46:38 0 990
[root@rocky netdata]# systemctl stop netdata
[root@rocky netdata]# coredumpctl list
No coredumps found.
However, I could not reproduce the issues you are having. To investigate the issue I used this VM. We had a similar issue before, and some users gave me access to their VMs so that I could debug and fix it. One of the affected VMs was running on VMware, which I do not have, so I have three questions for you:
Best regards!
It doesn't seem to be kernel-related. I tried starting my machine with several kernel versions (including an older Rocky Linux 8.7 one) and I always get the same segfault. I'm running Rocky Linux in a virtual machine (bhyve).
Hello everyone, I am sorry for the delay; health issues kept me away for almost a week. I returned to work today and I will dedicate time to this tomorrow. Best regards!
Hello everyone, I opened the PR and marked it as a draft, because I am still testing it on different Linux distributions. Best regards!
In the meantime I installed v1.40.0 and the problem is solved. No more SEGFAULT.
I can confirm that this also happens in 1.42.0:
Here are the log entries; they repeat every day:
Removing, purging and reinstalling from scratch gives me:
The same happens when I restart netdata using
Hello @desrod , According to the I have not had any coredump since the latest netdata was released, so in order for me to reproduce and fix what you are reporting, please answer the following questions:
Best regards!
Bug description
We are getting eBPF kernel segfaults, which sometimes lead to the system freezing.
kernel - EBPF SEGFAULT:
May 25 12:31:49 hermes systemd[1]: systemd-hostnamed.service: Succeeded.
May 25 12:31:51 hermes kernel: warning: `/opt/netdata/usr/libexec/netdata/plugins.d/apps.plugin' has both setuid-root and effective capabilities. Therefore not raising all capabilities.
May 25 12:31:51 hermes [2282]: PROCFILE: Cannot open file '/opt/netdata/etc/netdata/apps_groups.conf'
May 25 12:31:51 hermes [2282]: Cannot read process groups configuration file '/opt/netdata/etc/netdata/apps_groups.conf'. Will try '/opt/netdata/usr/lib/netdata/conf.d/apps_groups.conf'
May 25 12:31:51 hermes [2276]: Does not have a configuration file inside `/opt/netdata/etc/netdata/ebpf.d.conf. It will try to load stock file.
May 25 12:31:51 hermes [2282]: Loaded config file '/opt/netdata/usr/lib/netdata/conf.d/apps_groups.conf'
May 25 12:31:51 hermes [2282]: started on pid 2282
May 25 12:31:51 hermes [2282]: set name of thread 2317 to APPS_READER
May 25 12:31:51 hermes [2304]: no charts enabled - nothing to do.
May 25 12:31:51 hermes [2276]: Cannot read process groups configuration file '/opt/netdata/etc/netdata/apps_groups.conf'. Will try '/opt/netdata/usr/lib/netdata/conf.d/apps_groups.conf'
May 25 12:31:51 hermes [2276]: PROCFILE: Cannot open file '/proc/10365/status'
May 25 12:31:51 hermes [2276]: Cannot open /proc/10365/status
May 25 12:31:51 hermes [2276]: thread created with task id 2397
May 25 12:31:51 hermes [2276]: set name of thread 2397 to EBPF CACHESTAT
May 25 12:31:51 hermes [2276]: thread created with task id 2395
May 25 12:31:51 hermes [2276]: set name of thread 2395 to EBPF CGROUP INT
May 25 12:31:51 hermes [2276]: thread created with task id 2396
May 25 12:31:51 hermes [2276]: set name of thread 2396 to EBPF PROCESS
May 25 12:31:51 hermes [2276]: thread created with task id 2399
May 25 12:31:51 hermes [2276]: set name of thread 2399 to EBPF SWAP
May 25 12:31:51 hermes [2276]: thread created with task id 2398
May 25 12:31:51 hermes [2276]: set name of thread 2398 to EBPF SYNC
May 25 12:31:51 hermes [2276]: thread created with task id 2400
May 25 12:31:51 hermes [2276]: set name of thread 2400 to EBPF MOUNT
May 25 12:31:51 hermes [2276]: thread created with task id 2401
May 25 12:31:51 hermes [2276]: set name of thread 2401 to EBPF FD
May 25 12:31:51 hermes [2276]: thread created with task id 2403
May 25 12:31:51 hermes [2276]: set name of thread 2403 to EBPF OOMKILL
May 25 12:31:51 hermes [2276]: thread created with task id 2404
May 25 12:31:51 hermes [2276]: set name of thread 2404 to EBPF SHM
May 25 12:31:51 hermes [2276]: thread created with task id 2402
May 25 12:31:51 hermes [2276]: set name of thread 2402 to EBPF SOFTIRQ
May 25 12:31:51 hermes kernel: EBPF SOFTIRQ[2402]: segfault at 68 ip 000000000047b91b sp 00007f9472ccc8b0 error 4
May 25 12:31:51 hermes kernel: Code: c1 e0 02 48 01 d0 48 c1 e0 05 48 01 c8 c9 c3 55 48 89 e5 48 83 ec 10 48 89 7d f8 48 89 75 f0 48 83 7d f0 00 75 0a 48 8b 45 f8 <48> 8b 40 68 eb 18 48 8b 4d f8 48 8b 45 f0 ba 01 00 00 00 48 89 ce
May 25 12:31:51 hermes kernel: EBPF PROCESS[2396]: segfault at 68 ip 000000000047b91b sp 00007f9472d9e830 error 4 in ebpf.plugin[401000+345000]
May 25 12:31:51 hermes kernel: Code: c1 e0 02 48 01 d0 48 c1 e0 05 48 01 c8 c9 c3 55 48 89 e5 48 83 ec 10 48 89 7d f8 48 89 75 f0 48 83 7d f0 00 75 0a 48 8b 45 f8 <48> 8b 40 68 eb 18 48 8b 4d f8 48 8b 45 f0 ba 01 00 00 00 48 89 ce
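Since no backtrace is available, the two kernel lines above are the main clue. As a rough sketch (the helper names `parse_segfault` and `faulting_bytes` are made up for illustration and are not part of netdata), the following Python pulls out the fields that matter: the faulting address, whether the access was a read or a write (from the x86 page-fault error code bits), the offset of the instruction pointer inside the mapped `ebpf.plugin` region, and the opcode bytes at the faulting instruction, which the kernel marks with `<..>`. For the lines above this gives a user-mode read of address 0x68, and the bytes `48 8b 40 68` decode to `mov 0x68(%rax),%rax`, which would be consistent with reading a field at offset 0x68 through a NULL pointer. That reading is an inference from the log, not a confirmed root cause.

```python
import re

# Matches the x86 kernel "segfault at ..." line. The trailing
# "in obj[base+size]" part is optional because it is not always printed
# (compare the two segfault lines in the log above).
SEGFAULT_RE = re.compile(
    r"kernel: (?P<comm>.+?)\[(?P<tid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<error>[0-9a-f]+)"
    r"(?: in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\])?"
)


def parse_segfault(line):
    """Return the interesting fields of a kernel segfault line, or None."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    error = int(m["error"], 16)
    info = {
        "thread": m["comm"],
        "tid": int(m["tid"]),
        "fault_addr": int(m["addr"], 16),
        "ip": int(m["ip"], 16),
        # x86 page-fault error code: bit 0 = page was present,
        # bit 1 = write access, bit 2 = fault happened in user mode.
        "page_present": bool(error & 1),
        "access": "write" if error & 2 else "read",
        "user_mode": bool(error & 4),
        "object": m["obj"],
        "offset_in_mapping": None,
    }
    if m["base"] is not None:
        info["offset_in_mapping"] = info["ip"] - int(m["base"], 16)
    return info


def faulting_bytes(code_line, count=4):
    """Return `count` opcode bytes starting at the <xx> marker (the faulting ip)."""
    tokens = code_line.split("Code:", 1)[1].split()
    start = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    return [tok.strip("<>") for tok in tokens[start:start + count]]


if __name__ == "__main__":
    line = ("May 25 12:31:51 hermes kernel: EBPF PROCESS[2396]: segfault at 68 "
            "ip 000000000047b91b sp 00007f9472d9e830 error 4 "
            "in ebpf.plugin[401000+345000]")
    info = parse_segfault(line)
    print(info["thread"], hex(info["fault_addr"]), info["access"],
          hex(info["offset_in_mapping"]))
```

The instruction pointer (or the offset within the mapping, depending on how the binary was linked) can then be passed to a symbolizer such as `addr2line -e` against an `ebpf.plugin` binary with debug symbols to find the function involved.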
Expected behavior
No SEGFAULT
We have now had several system freezes caused by eBPF segfaults in combination with netdata.
Rocky Linux 8 + 9
Steps to reproduce
...
Installation method
kickstart.sh
System info
Netdata build info
Additional info
systemctl status netdata reports:
systemd-coredump[3507]: [🡕] Process 3403 (ebpf.plugin) of user 992 dumped core.