
netdata hogs CPU 100% and hangs after creating next dbengine segment #6701

Closed
aldem opened this issue Aug 21, 2019 · 16 comments · Fixed by #6731


aldem commented Aug 21, 2019

Hi,

A few times already I have noticed that netdata goes wild and simply hangs with 100% CPU usage in one of its threads; the web UI stops responding, and collected data is not stored.

Unfortunately, I could not find anything that would allow me to reproduce this behavior. There is no specific pattern in system load or in processes starting/stopping/running; it may run for weeks without any issue or hang after just a couple of days.

What I know so far:

  • The first issue was with version 1.16.0 (release); the latest with 1.16.1.
  • All systems where it happens run Proxmox, though different versions (and thus different kernel versions: Debian 9/Proxmox 5/kernel 4.15 and Debian 10/Proxmox 6/kernel 5.0), on different hardware and even in different physical locations.
  • System load and starting/stopping containers/VMs do not trigger the issue; it happens right after netdata tries to create the next dbengine segment (at least, that is the last message in the log before it hangs).
  • Once hung, it does not react to signals.

My configuration is dead simple:

[global]
        history = 90000
        memory mode = dbengine
        page cache size = 128
        dbengine disk space = 512

Yes, I know that "history" is useless for dbengine, but it is there :) I also have a few hosts monitored by fping, apcd, hddtemp, SMART monitoring and sensors, but that's all (ca. 3000 metrics and 450 charts). The system itself has a few (mostly idling) LXC containers and a couple of KVMs (also mostly idling). It has plenty of RAM, only half of which is in use, and one of the KVMs also runs netdata inside (same version) without any issues.

I compiled a debug version in the hope of tracking down where it hangs if (when) this happened again, and it did, so I have collected some details.

Last log messages before it started to hog CPU:

2019-08-21 07:31:02: netdata INFO  : MAIN : Creating new data and journal files in path /opt/netdata-debug/var/cache/netdata/dbengine
2019-08-21 07:31:02: netdata INFO  : MAIN : Created data file "/opt/netdata-debug/var/cache/netdata/dbengine/datafile-1-0000000448.ndf".
2019-08-21 07:31:02: netdata INFO  : MAIN : Created journal file "/opt/netdata-debug/var/cache/netdata/dbengine/journalfile-1-0000000448.njf".

This correlates perfectly with my CPU hog detector (it reports when a process uses at least 75% CPU over a 30-second window):

2019-08-21 07:31:33 [21473 (netdata)] hog detected: 30

It reports the PID of the main process, while htop revealed the actual thread PID that was consuming CPU.
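
The detector is not included here; below is a minimal sketch of the idea in C (not my actual tool, just an illustration that assumes the usual Linux /proc/<pid>/stat layout and the 75%-over-30-seconds rule described above):

/* CPU-hog detector sketch (illustration only, not the actual tool):
 * samples utime+stime of a PID from /proc/<pid>/stat and reports when
 * the process used more than 75% of one CPU over a 30-second window. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* returns utime+stime in clock ticks, or -1 on error */
static long read_cpu_ticks(pid_t pid) {
    char path[64], buf[4096];
    snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
    FILE *f = fopen(path, "r");
    if (!f) return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    char *p = strrchr(buf, ')');    /* skip past the comm field, which may contain spaces */
    if (!p) return -1;
    unsigned long utime, stime;     /* fields 14 and 15 of /proc/<pid>/stat */
    if (sscanf(p + 2, "%*c %*d %*d %*d %*d %*d %*u %*u %*u %*u %*u %lu %lu",
               &utime, &stime) != 2) return -1;
    return (long)(utime + stime);
}

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    pid_t pid = (pid_t)atoi(argv[1]);
    long hz = sysconf(_SC_CLK_TCK);
    const int window = 30;          /* seconds */
    for (;;) {
        long before = read_cpu_ticks(pid);
        sleep(window);
        long after = read_cpu_ticks(pid);
        if (before < 0 || after < 0) break;   /* process gone or unreadable */
        double used = (double)(after - before) / (double)(hz * window);
        if (used >= 0.75)
            printf("[%d] hog detected: %d (%.0f%% CPU)\n", (int)pid, window, used * 100.0);
    }
    return 0;
}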

strace on this thread did not reveal anything useful:

# strace -p 21484
strace: Process 21484 attached
strace: [ Process PID=21484 runs in x32 mode. ]
strace: [ Process PID=21484 runs in 64 bit mode. ]
futex(0x7fed4a056de0, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7fed50864050, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x7fed51065270, FUTEX_WAKE_PRIVATE, 2147483647) = 1
...

futex() calls were generated at a rate of ca. 5-10 per second, nothing else was called, and the stack trace in /proc was empty (which probably means it was not waiting in a syscall).

Finally, I pulled in gdb:

# gdb -p 21484
Attaching to process 21484
....
0x00007fed564506c3 in __GI___pthread_mutex_lock (mutex=0x5622c8adc160) at ../nptl/pthread_mutex_lock.c:67
67      ../nptl/pthread_mutex_lock.c: No such file or directory.
(gdb) bt
#0  0x00007fed564506c3 in __GI___pthread_mutex_lock (mutex=0x5622c8adc160) at ../nptl/pthread_mutex_lock.c:67
#1  0x00007fed566f64e9 in uv_mutex_lock () from /usr/lib/x86_64-linux-gnu/libuv.so.1
#2  0x00005622c1854e0f in rrdeng_page_descr_mutex_lock (ctx=ctx@entry=0x7fed56071010, descr=descr@entry=0x5622c9692800) at database/engine/rrdenglocking.c:91
#3  0x00005622c185cb6d in do_flush_pages (wc=wc@entry=0x7fed56071010, force=force@entry=1, completion=0x7fed5386a830) at database/engine/rrdengine.c:342
#4  0x00005622c185d412 in rrdeng_worker (arg=0x7fed56071010) at database/engine/rrdengine.c:756
#5  0x00007fed5644dfa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#6  0x00007fed5637e4cf in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

At this point I am a bit stuck. Either this is a bug in the kernel (which would then span every kernel from 4.15 to 5.0), or... I have run out of ideas.
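
For anyone unfamiliar with this failure mode, the circular wait that such a backtrace can hint at has roughly the following shape. This is not netdata's code, just a self-contained C sketch with hypothetical names (page_user, flush_worker); it deliberately hangs when run. (In the real case the thread also burns CPU, which a plain blocking deadlock like this would not explain by itself.)

/* Illustration only: thread A holds a page descriptor mutex while waiting
 * for a flush "completion"; the flush worker needs that same mutex before
 * it can ever signal the completion, so both threads wait forever. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t descr_mutex = PTHREAD_MUTEX_INITIALIZER;  /* per-page descriptor lock */
static pthread_mutex_t compl_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  compl_cond  = PTHREAD_COND_INITIALIZER;
static int flush_done = 0;

static void *page_user(void *arg) {
    (void)arg;
    pthread_mutex_lock(&descr_mutex);        /* holds the descriptor lock ... */
    pthread_mutex_lock(&compl_mutex);
    while (!flush_done)                      /* ... while waiting for the flush to finish */
        pthread_cond_wait(&compl_cond, &compl_mutex);
    pthread_mutex_unlock(&compl_mutex);
    pthread_mutex_unlock(&descr_mutex);
    return NULL;
}

static void *flush_worker(void *arg) {
    (void)arg;
    sleep(1);
    pthread_mutex_lock(&descr_mutex);        /* blocks forever: page_user never lets go */
    pthread_mutex_lock(&compl_mutex);        /* never reached */
    flush_done = 1;
    pthread_cond_broadcast(&compl_cond);
    pthread_mutex_unlock(&compl_mutex);
    pthread_mutex_unlock(&descr_mutex);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, page_user, NULL);
    pthread_create(&b, NULL, flush_worker, NULL);
    pthread_join(a, NULL);                   /* never returns: circular wait */
    pthread_join(b, NULL);
    puts("not reached");
    return 0;
}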

That is exactly why I am posting this: maybe someone has a clue how to proceed to actually nail this issue down? The logs are useless (nothing unusual), the stack trace and strace output are meaningless, and the problem cannot be reliably reproduced, but it is definitely there.

Thank you!

@aldem aldem added bug needs triage Issues which need to be manually labelled labels Aug 21, 2019
Contributor

cakrit commented Aug 21, 2019

@aldem we'd like to see more log messages (or the entire log file, if possible) from before 2019-08-21 07:31:02. We suspect that there was already a different dbengine operation in progress, or that for some reason an operation did not release the lock it held. We should see some evidence of that in the log file.

Of course, remove any sensitive data (hostname/IP) from that file if you attach all of error.log.

Author

aldem commented Aug 21, 2019

Sure, the log is attached. It covers the period since the last start before the crash; the last line in this file is indeed the last line produced by that instance after I tried to stop it.

If there are any additional options I can use to get more diagnostics when compiling/starting, please let me know, though it may take days or even weeks before this happens again.

pre-crash-error.log.gz

Contributor

mfundul commented Aug 21, 2019

@aldem great detective work! It seems that the hang is a result of a failure to finish deleting a file 8 hours earlier, as shown in the logs:

2019-08-20 22:59:03: netdata INFO  : MAIN : Deleting data file "/opt/netdata-debug/var/cache/netdata/dbengine/datafile-1-0000000428.ndf".

It appears that netdata was still trying to delete this file 8 hours later, which is obviously a bug.

I will work on this as soon as I am available.
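
To illustrate why an unfinished deletion can stall everything queued behind it, here is a toy C sketch with hypothetical names (it is not the actual dbengine code). A reference-counted datafile may only be deleted once its reference count drops to zero, so a single leaked reference keeps the deleter waiting forever, and any later work ordered after the deletion (such as creating the next datafile) stalls with it:

/* Illustration only: deletion waits for the refcount to reach zero.
 * A leaked reference means this loop never terminates, matching the
 * "still trying to delete this file 8 hours later" symptom. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

struct datafile {
    atomic_int refcount;            /* readers still using pages of this file */
};

static bool can_delete(const struct datafile *df) {
    return atomic_load(&df->refcount) == 0;
}

int main(void) {
    struct datafile df = { .refcount = 1 };   /* a reference that is never released */
    while (!can_delete(&df))
        sleep(1);                   /* the real engine would retry or wait on an event */
    puts("deleted");                /* never reached with the leaked reference */
    return 0;
}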

@mfundul mfundul added area/database priority/high Super important issue and removed needs triage Issues which need to be manually labelled labels Aug 21, 2019
Contributor

mfundul commented Aug 26, 2019

@aldem I have failed to reproduce the problem in the following scenarios:

  • 2 servers that have been running for weeks
  • 2 VMs that have been running for 2 days with specially crafted code that triggers file deletions very often

I'm starting to suspect it might be something specific to the Proxmox kernel. Maybe it could be related to this or this?

I will now test the Proxmox kernel specifically and see if I have better luck.

Contributor

mfundul commented Aug 26, 2019

In any case, @aldem, can you please share with me any non-sensitive output from the dmesg command when you come across the issue again?

Author

aldem commented Aug 26, 2019

@mfundul I have checked those two cases, but it doesn't look like I have a similar issue. None of the systems where it was running had any crashes or hangs; in general they are stable.

This is the output from dmesg before and after the crash I reported:

[Sat Aug 17 21:15:38 2019] vmbr0: port 3(tap210i0) entered disabled state
[Sat Aug 17 21:15:38 2019] vmbr0: port 3(tap210i0) entered blocking state
[Sat Aug 17 21:15:38 2019] vmbr0: port 3(tap210i0) entered forwarding state
[Sun Aug 18 13:51:41 2019] Adding 8388604k swap on /dev/sda1.  Priority:-2 extents:1 across:8388604k SSFS
**** netdata hung Aug 21 at 7:30 ****
[Thu Aug 22 13:22:17 2019] audit: type=1400 audit(1566472953.906:30): apparmor="STATUS" operation="profile_load" profile="/usr/bin/lxc-start" name="lxc-111_</var/lib/lxc>" pid=26599 comm="apparmor_parser"
[Thu Aug 22 13:22:17 2019] vmbr0: port 7(veth111i0) entered blocking state
[Thu Aug 22 13:22:17 2019] vmbr0: port 7(veth111i0) entered disabled state
[Thu Aug 22 13:22:17 2019] device veth111i0 entered promiscuous mode
[Thu Aug 22 13:22:17 2019] eth0: renamed from vethQFGJQO
[Thu Aug 22 13:22:19 2019] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[Thu Aug 22 13:22:19 2019] vmbr0: port 7(veth111i0) entered blocking state
[Thu Aug 22 13:22:19 2019] vmbr0: port 7(veth111i0) entered forwarding state

So there is absolutely nothing reported by the kernel... And for the sake of completeness, this is what was recorded in the journal (non-kernel entries):

Aug 21 07:19:47 buzz smartd[839]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 150 [Raw 40 (Min/Max 21/49)] to 146 [Raw 41 (Min/Max 21/49)]
Aug 21 07:19:48 buzz influxd[855]: ts=2019-08-21T05:19:48.366278Z lvl=info msg="Retention policy deletion check (start)" log_id=0HAEAIl0000 service=retention trace_id=0HOeiAZW000 op_name=retention_delete_check o
Aug 21 07:19:48 buzz influxd[855]: ts=2019-08-21T05:19:48.366338Z lvl=info msg="Retention policy deletion check (end)" log_id=0HAEAIl0000 service=retention trace_id=0HOeiAZW000 op_name=retention_delete_check op_
... more similar messages from influxd/smartd ...
Aug 21 11:33:53 buzz systemd[1]: netdata.service: Main process exited, code=killed, status=5/TRAP
Aug 21 11:33:54 buzz systemd[1]: netdata.service: Failed with result 'signal'.

11:33 is the point when I killed it.

I don't think it is related to the Proxmox kernel, especially because this happened on both 4.15 and 5.0 (yes, both are Proxmox, but the version difference is huge).

This is a really mysterious case: on one hand I have Proxmox systems where it has been running for more than a month (4.15 and 5.0 kernels), but on the other also systems where it randomly hangs...

I will try to stress it a bit by starting/stopping containers periodically; maybe this could help trigger the issue.

But what about the bug you have mentioned before ("still trying to delete this file 8 hours later, which is obviously a bug")?

Contributor

mfundul commented Aug 26, 2019

But what about the bug you have mentioned before ("still trying to delete this file 8 hours later, which is obviously a bug")?

This is what I was trying to reproduce. It had been a problem in the early stages of development but was fixed (allegedly?), so the root cause is no longer apparent from a simple code review. Ideally I'd like to reproduce it and examine the process state with gdb.

Contributor

mfundul commented Aug 26, 2019

Another thing that has caused 100% CPU in netdata in the past, along with an unresponsive web UI (both symptoms you've seen), is a faulty URL parser PR. This could also explain what you've experienced; however, it would have to be a nightly netdata version and not the 1.16.0 release.

If that was the case, you should not see this problem again if you use the latest code.

Author

aldem commented Aug 27, 2019

My last issue was with the (unreleased) 1.16.1 (I installed it to verify whether it happens with a newer version).

Contributor

mfundul commented Aug 27, 2019

There seems to be one big problem with tag 1.16.1: it has a heap memory corruption bug that can cause random crashes. It was fixed with this commit. I would advise updating the code to the latest.

@mfundul mfundul added this to Not Planned in Core via automation Aug 27, 2019
@mfundul mfundul added this to the v1.17-Sprint4 milestone Aug 27, 2019
@mfundul mfundul added the size:5 label Aug 27, 2019
Contributor

mfundul commented Aug 27, 2019

@aldem can you test #6731? See if that resolves your problem.

Author

aldem commented Aug 27, 2019

Did that issue exist in the 1.16 release? If I remember correctly, in all cases when it hung the web GUI was open (though unfocused).

Contributor

mfundul commented Aug 27, 2019

The heap memory corruption of the URL parser did not exist in the v1.16.0 release. The #6731 Pull Request is not related to any URL parser bugs and also applies to v1.16.0 code.

Author

aldem commented Aug 27, 2019

Ok, I have applied your fix to the 1.16 release and compiled it with -O0 to be able to get more out of gdb when (and if) it hangs, but this could take quite some time...

Core automation moved this from Not Planned to Dev Done Aug 28, 2019
Contributor

mfundul commented Aug 28, 2019

The #6731 pull request has been merged to master. I'm reopening this in case you see the problem again. You can use the latest code from master instead of the closed PR.

@mfundul mfundul reopened this Aug 28, 2019
Core automation moved this from Dev Done to In progress Aug 28, 2019
Contributor

cakrit commented Sep 2, 2019

Closing since there is no additional feedback; we'll reopen if necessary.

@cakrit cakrit closed this as completed Sep 2, 2019
Core automation moved this from In progress to Dev Done Sep 2, 2019
@gpapamathND gpapamathND moved this from Dev Done to Done in Core Sep 2, 2019