health monitoring stops working for slave hosts after some time #10548
Comments
Hi @razielin, thank you for reporting this. A couple of questions:
Hi @stelfrag.
Most of them, yes; maybe just a few of them are not. If it is important, I can give you the full list of versions for all slave hosts.
Yes, I can. Streaming of metrics works fine.
I haven't found any correlation between netdata versions and the bug occurrence.
I have the same issue with a streaming setup and a fully functional slave node (with its own database and alarms). I fetched system.ram data and active alarms directly from the slave and from the master (IPs and names are replaced):
Compare data
Host info
Fixed in the latest releases.
@stelfrag @cpipilas
Hi @razielin! Could you please do the following on the parent node:
Can you also please share the value of …? Thank you!
Hi @MrZammler
But the
@MrZammler
Hi @razielin, you shouldn't need any special build options, or to compile from source... Will check and let you know. Thanks!
@razielin @MrZammler just a note here: after editing the netdata.conf inside the docker container, a restart of the agent is required for the new config to take effect. Also, please keep in mind that editing the netdata.conf directly requires uncommenting the line in order to override the default value. An alternative to editing the netdata.conf directly is using the netdatacli to set those values when you start netdata.
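To illustrate the "uncomment to override" point above (this fragment is my own sketch; the path and log file names are assumptions, not taken from the thread):

```
# /etc/netdata/netdata.conf (illustrative fragment)
[global]
    # error log = /var/log/netdata/error.log    <- commented out: the built-in default stays in effect
    error log = /var/log/netdata/error2.log     # uncommented: this value now overrides the default
```

Inside a docker container, the agent would then need a restart (e.g. `docker restart <container>`) for the change to apply.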
So the full command would be:
I edited the config as @MrZammler suggested (added debug2.log and error2.log) and I see the debug logs and error logs getting generated in /var/log/netdata directory inside the container.
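For reference, the edit described above presumably resembled the following [global] fragment (key names as in the netdata v1.x netdata.conf; the exact values are my assumption):

```
# netdata.conf fragment (assumed, not copied from the thread)
[global]
    debug log = /var/log/netdata/debug2.log
    error log = /var/log/netdata/error2.log
    # debug flags = 0xffffffffffffffff   # illustrative: debug output is gated on these flags
```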
@razielin could you please try with the -W options?
@dimko @MrZammler
to the end of my netdata Dockerfile. The resulting netdata start command looks OK:
But still no luck: debug2.log remains empty, just as with the config file.
Hey, @razielin 👋 Do you still have the issue with v1.37.0? There have been a lot of changes/optimizations since v1.28.0.
Bug report summary
I have one netdata master server and about 10 slave netdata servers that stream metrics to it.
All health configuration files are configured on the master server only. But after a while (about 2-7 days, though it is rather nondeterministic), the periodic checks for all alarms of a particular slave host stop being executed on the master server. Active alarms stay in their current states forever.
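For context, a master/slave (parent/child) streaming setup like the one described is typically configured through stream.conf on both sides. A minimal sketch, assuming default ports; the API key and hostname are placeholders:

```
# slave (child) stream.conf - send metrics to the master
[stream]
    enabled = yes
    destination = master.example.com:19999
    api key = 11111111-2222-3333-4444-555555555555

# master (parent) stream.conf - accept children using this API key
[11111111-2222-3333-4444-555555555555]
    enabled = yes
```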
As you can see, "last_updated" and "next_update" are two days in the past compared to "now". The "last_updated" value is the same for all alarms of the slave host. This situation eventually occurs for the rest of the slave hosts as well. Streaming of metrics to the master keeps working properly.
I haven't found anything strange in the logs. If you need any part of them, please let me know.
OS / Environment
The netdata master service runs inside a docker container from the official netdata image v1.28.0.
Configuration of the host system:
cat docker-compose.yml:
Netdata version
v1.28.0
This problem also occurred in previous versions (v1.26 and some older ones).
Component Name
health
Steps To Reproduce
I don't know. It is non-deterministic.
Expected behavior
Health monitoring keeps working properly for the streamed slave hosts.