Netdata stops working after some time: lots of sockets in CLOSE_WAIT from prometheus remote host #7440
Ok, this looks like way too many open database engine files.
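A quick way to count those handles, as a sketch assuming the default dbengine path (/var/cache/netdata/dbengine; adjust if your install differs):

```sh
# Count file handles the netdata daemon holds on dbengine files.
# Assumes the default cache directory for dbengine data.
lsof -p "$(pidof -s netdata)" | grep -c '/var/cache/netdata/dbengine'
```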
I see the request is …
60 sec is nothing.
Hello @cpw, I'd still like to see the logs to understand the exact configuration of netdata, to help us debug the problem. You can send the logs over; feel free to remove any sensitive information that may be in them.
I'll email that over… (sent).
Did you send it to me or to support? I don't see it in my inbox. |
I sent both documents requested to both email addresses. (support and markos). |
In the logs I see, every day at about 7:38, at least the SIGUSR1 signal being received like clockwork, usually accompanied by SIGTERM. This is probably some scheduled task. I also see some kind of memory corruption at more random points in time on some days.
@cpw can you disable log flood protection? I'm missing a lot of logs because of it. You can do that by setting the options sketched below.
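A minimal sketch of disabling flood protection, assuming the option names netdata used around v1.19; the exact setting named in the thread was not captured, so verify against your own netdata.conf:

```
# /etc/netdata/netdata.conf
# Assumed option names for netdata ~v1.19; a period of 0 disables flood protection.
[global]
    errors flood protection period = 0
```

Then restart netdata (e.g. systemctl restart netdata) for it to take effect.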
The daily activity is your netdata-updater? It's also sending SIGUSR1 to cycle the logs. Memory corruption? You have me slightly concerned now; I'd like to know where you see that. I'll make the config change.
One instance in the logs is:
Let's see if we can find something more without log flood protection.
Can you tell me what your … is set to?
Whatever the default is. It looks like it's showing 32 as the default:
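For context: one dbengine option whose default was 32 (MiB) in netdata of this era is the page cache size. This is a hypothesis about the setting under discussion, not a quote from the thread:

```
# /etc/netdata/netdata.conf: dbengine sizing defaults circa v1.19.
# Hypothetical reconstruction; the setting actually discussed was not captured.
[global]
    page cache size = 32      # MiB of RAM for the dbengine page cache
    dbengine disk space = 256 # MiB of disk per dbengine instance
```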
So it died again overnight. I wasn't able to grab an export - the browser UI was unresponsive. Anyway, I can email you the data if that's something you want. I also captured pmap and strace output for the primary pid. I tried to capture ltrace but that caused the process to crash completely and thus restart. |
Sure, please e-mail the logs and whatever else you got to markos@netdata.cloud . |
@cpw please reopen this if you see the problem again. |
Reopening this problem. The issue occurred again; however, I see very interesting new errors in the log files:
The last time the server responded to prometheus was at roughly 13:08. I hope this helps a bit.
Hi @cpw, it was reopened; I will close the other one as you requested.
Thank you. I think the new error might be enlightening for the dbengine. I have it in a broken state at present, so if someone can suggest diagnostic data that might help, I'll happily gather it.
@cpw can you send me your logs again to markos@netdata.cloud ? Let's hope I'll never have to ask again! |
Bug report summary
It seems my netdata stops working after a time; I have to restart the process for it to resume. The problem occurs intermittently, and there seems to be little pattern to when the process stops working.
When this happens I notice a lot of CLOSE_WAIT sockets from my remote prometheus host (set up as a scrape target in prometheus). Is this a cause or a symptom? I do not know.
OS / Environment
Debian Linux buster, using the kickstart install following the nightly channel (it seems to update reasonably regularly). I'm running on the bare metal of the server; there are plenty of containers, but netdata itself is not in one.
Netdata version (output of netdata -V)
netdata v1.19.0-37-nightly (this problem has been happening intermittently since around the v1.18 release)
Component Name
Everything, but probably related to prometheus somehow?
Steps To Reproduce
Wait. Though I can't rule out something specific to my environment.
Expected behavior
That netdata runs for more than around 24 hours without dying and needing a restart.
Observations
There is nothing in the logs. I tried compiling with debugging enabled, as described in the documentation, and it revealed nothing. The logs seem to continue for a while, but not meaningfully: I can see the scraper in the logs up until the explosion of connections, after which netdata becomes unresponsive to all network activity.
The process definitely seems hung: when it is working normally, the command
systemctl restart netdata
returns in a second, but when it is hung the same command takes around 60-90 seconds to complete (I'm guessing systemd is KILLing the process after the graceful stop timeout). Most of the time the netdata process is still around and doesn't indicate zombie status, but on a couple of occasions I've noticed it has completely vanished.
Attached is an example netstat output from when it was in such a wedged state. I have anonymized the public IP addresses in this trace. The prometheus host is 10.0.0.241; the host running netdata is 10.10.0.254.
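A sketch of the kind of snapshot attached, assuming a Linux netstat; the exact flags used weren't recorded:

```sh
# List TCP sockets on the netdata port that are stuck in CLOSE_WAIT.
# Hypothetical reconstruction of the attached capture.
netstat -tnp | grep ':19999' | grep CLOSE_WAIT
```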
Also attached is the output from lsof for the netdata pid at the time of failure. There seem to be quite a lot of file handles open to the dbengine files. I don't know if this is relevant.
I have streaming set up from approximately five other environments, at an interval of a few seconds.
The prometheus scrape config is:
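The config itself was not captured in this copy. As a stand-in, a typical netdata scrape job, per netdata's prometheus documentation, looks roughly like the sketch below; the job name, 60s interval, and target are illustrative assumptions, not the reporter's actual values:

```yaml
# prometheus.yml: sketch of a typical netdata scrape job.
# job_name, scrape_interval, and target are assumptions for illustration.
scrape_configs:
  - job_name: 'netdata'
    metrics_path: '/api/v1/allmetrics'
    params:
      format: [prometheus]
    scrape_interval: 60s
    static_configs:
      - targets: ['10.10.0.254:19999']
```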
This was working very well for many months, until around two months ago, when I'd guess netdata v1.18 or so hit.
This has become so symptomatic that I have implemented a simple cron task to restart netdata when the "wc -l" count of CLOSE_WAIT connections on port 19999 exceeds 50 (a sketch follows).
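A minimal sketch of such a watchdog, assuming netstat and systemd are available; the actual script was not posted:

```sh
#!/bin/sh
# Hypothetical reconstruction of the cron watchdog described above.
# Restart netdata when CLOSE_WAIT sockets on port 19999 exceed 50.
COUNT=$(netstat -tn | grep ':19999' | grep CLOSE_WAIT | wc -l)
if [ "$COUNT" -gt 50 ]; then
    systemctl restart netdata
fi
```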
I would like advice on what diagnostics I should try next to figure out what's causing the runaway connections. I'm not 100% convinced this isn't a problem with something else causing netdata to lock up, but I have nothing to go on except the dead connections at present.
netstat.txt
lsof.txt