netdata hogs CPU 100% and hangs after creating next dbengine segment #6701
Comments
@aldem we'd like to see more log messages (or the entire log file, if possible) from before 2019-08-21 07:31:02. We suspect that there was already a different dbengine operation in progress, or that for some reason an operation did not release the lock it held. We should see some evidence of that in the log file. Of course, remove any sensitive data (hostname/IP) from the file if you attach all of error.log.
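For reference, a log can be scrubbed before attaching with something along these lines (a minimal sketch; "myhost" and 192.0.2.10 are placeholders, not values from this report, and the path is netdata's default error log location):

```sh
# Replace a hostname and an IP address with neutral tokens before sharing.
sed -e 's/myhost/REDACTED_HOST/g' \
    -e 's/192\.0\.2\.10/REDACTED_IP/g' \
    /var/log/netdata/error.log > error.sanitized.log
```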
Sure, the log is attached. It covers the period since the last start before the crash; the last line in this file is indeed the last line produced by that instance after I tried to stop it. If there are additional options I can use to get more diagnostics when compiling/starting, please let me know, though it may take days or even weeks before this happens again.
@aldem great detective work! It seems that the hang is a result of the failure to finish deleting a file 8 hours earlier as shown in the logs:
It appears that netdata was still trying to delete this file 8 hours later, which is obviously a bug. I will work on this as soon as I am available.
@aldem I have failed to reproduce the problem in the following scenarios:
I'm starting to suspect it might be something specific to the Proxmox kernel. Maybe it could be related to this or this? I will now test specifically against the Proxmox kernel and see if I have better luck.
In any case, @aldem, can you please share with me any non-sensitive dmesg output?
@mfundul I have checked those two cases, but it doesn't look like I have a similar issue. None of the systems where it was running had any crashes or hangs; in general they are stable. This is the output from my dmesg before and after the crash I reported:
So there is absolutely nothing reported by the kernel... And for the sake of completeness, this is what was recorded in the journal (non-kernel entries):
11:33 is the point when I killed it. I don't think it is related to the Proxmox kernel, especially because this happened on both 4.15 and 5.0 (yes, both are Proxmox, but the version difference is huge). This is a really mysterious case: on one hand I have Proxmox systems where it has been running for more than a month (4.15 and 5.0 kernels), but there are also systems where it randomly hangs... I will try to stress it a bit by starting/stopping containers periodically; maybe this could help trigger the issue. But what about the bug you mentioned before ("still trying to delete this file 8 hours later, which is obviously a bug")?
This is what I was trying to reproduce. It had been a problem in the early stages of development but was fixed (allegedly?), so the root cause is no longer apparent from a simple code review. Ideally I'd like to reproduce it and examine the process state with gdb.
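Examining the state of a hung process with gdb typically looks something like the session below (illustrative commands, not output from this issue; <pid> stands for the running netdata process):

```
gdb -p <pid>                 # attach to the running process
(gdb) info threads           # list threads; look for the one spinning
(gdb) thread <n>             # switch to the suspect thread
(gdb) bt                     # print its backtrace
(gdb) thread apply all bt    # or dump backtraces of every thread at once
(gdb) detach                 # let the process continue
```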
Another thing that has caused 100% CPU usage in netdata in the past, and has also caused an unresponsive web UI (both symptoms you've seen), is a faulty URL parser PR. This could also explain what you've experienced; however, it would have to be a nightly netdata version, not the 1.16.0 release. If that was the case, you should not see this problem again once you use the latest code.
My last issue was with the (unreleased) 1.16.1 (I installed it to verify whether it happens with the newer version).
There seems to be one big problem with tag 1.16.1: it has a heap memory corruption bug that can cause random crashes. It was fixed with this commit. I would advise updating the code to the latest.
Did that issue exist in the 1.16 release? If I remember correctly, in all cases when it hung the web GUI was open (though unfocused).
The heap memory corruption of the URL parser did not exist in the v1.16.0 release. The #6731 pull request is not related to any URL parser bugs and also applies to the v1.16.0 code.
OK, I have applied your fix to the 1.16 release and compiled it with -O0 to be able to get more out of gdb when (and if) it hangs, but this could take quite some time...
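For anyone wanting to do the same, a debug build along these lines should work (a sketch under the assumption that netdata-installer.sh, the stock build script, picks up CFLAGS from the environment):

```sh
git clone https://github.com/netdata/netdata.git
cd netdata
# -O0 keeps variables and stack frames intact for gdb; -ggdb adds full debug info
env CFLAGS="-O0 -ggdb" ./netdata-installer.sh
```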
The #6731 pull request has been merged into master. I'm reopening this in case you see the problem again. You can use the latest code from master instead of the closed PR.
Closing since there is no additional feedback; we'll reopen if necessary.
Hi,
A few times already, I have noticed that netdata goes wild and simply hangs with 100% CPU usage in one of its threads; the web UI stops responding, and collected data is not stored.
Unfortunately, I could not find anything that would allow me to reproduce this behavior: there is no specific pattern in system load or in processes starting/stopping/running. It may run for weeks without any issues, or it may run for just a couple of days.
What I know so far:
My configuration is dead simple:
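A minimal netdata.conf of this shape looks roughly like the following (values are illustrative, not the reporter's actual settings):

```
[global]
    memory mode = dbengine
    # "history" only matters for the round-robin memory modes;
    # dbengine ignores it (hence the remark below)
    history = 3996
    # dbengine-specific sizing knobs (MiB)
    page cache size = 32
    dbengine disk space = 256
```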
Yes, I know that "history" is useless for dbengine, but it is there :) I also have a few hosts monitored by fping, apcd, hddtemp, SMART monitoring, and sensors, but that's all (ca. 3000 metrics and 450 charts). The system itself runs a few (mostly idling) LXC containers and a couple of KVMs (also mostly idling). It has plenty of RAM, only half of which is in use, and one of the KVMs also runs netdata inside (same version) without any issues.
I have compiled a debug version in the hope of backtracking where it hangs when (if) this happens, and it did happen, so I have collected some details.
Last log messages before it started to hog the CPU:
This perfectly correlates with my CPU hog detector (it reports when a process uses at least 75% CPU within a 30-second window):
2019-08-21 07:31:33 [21473 (netdata)] hog detected: 30
It reports the PID of the main process, while htop revealed the actual thread ID that was consuming the CPU.
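A detector of this kind can be as simple as the following sketch (an assumed reconstruction, not the reporter's actual script; it samples accumulated CPU time from /proc):

```sh
#!/bin/sh
# Report when a process burns more than THRESHOLD percent of one core
# over a WINDOW-second interval. All names here are illustrative.
PID=$1
WINDOW=30
THRESHOLD=75
TICKS=$(getconf CLK_TCK)    # clock ticks per second

cpu_time() {
    # utime + stime: fields 14 and 15 of /proc/PID/stat, in ticks
    awk '{print $14 + $15}' "/proc/$PID/stat"
}

while true; do
    t0=$(cpu_time)
    sleep "$WINDOW"
    t1=$(cpu_time)
    used=$(( (t1 - t0) * 100 / (TICKS * WINDOW) ))
    if [ "$used" -ge "$THRESHOLD" ]; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') [$PID ($(cat /proc/$PID/comm))] hog detected: $WINDOW"
    fi
done
```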
strace on this thread does not reveal anything useful:
futex() calls were generated at a rate of ca. 5-10 per second, nothing else was called, and the stack trace in /proc was empty (which probably means it was not waiting on a syscall).
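The commands involved are roughly these (illustrative invocations; <pid> and <tid> are placeholders for the main process and the busy thread):

```sh
top -H -p <pid>                        # per-thread view to spot the busy thread ID
ps -L -o tid,pcpu,comm -p <pid>        # same information, one-shot
strace -p <tid>                        # attach strace to just that thread
sudo cat /proc/<pid>/task/<tid>/stack  # kernel-side stack; empty when spinning in userspace
```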
Finally, I pulled in gdb:
At this point I am a bit stuck. Either this is a bug in the kernel (which would span every kernel from 4.15 to 5.0), or... I have run out of ideas.
That is exactly why I am posting this: maybe someone has a clue how to proceed to actually nail this issue? The logs are useless (nothing unusual), the stack trace and strace output are meaningless, and the problem cannot be reliably reproduced, but it is definitely in there.
Thank you!