Incorrect reachability status #3
Comments
Indeed, there is an edge case that can cause the system to report this false-positive reachability. 🙁 A quick fix is to restart the Netdata agent. Unfortunately, if this was an ephemeral agent, there is nothing you can do at the moment. Having said that, the team is currently working both on preventing this behaviour and on the ability to permanently remove an agent from your space. |
This issue has been mentioned on Netdata Community. There might be relevant details there: https://community.netdata.cloud/t/cloud-service-outage/517/2 |
OK, I have this issue and there is also what looks like a gap in my historical data. |
@harrisbz and @manos-saratsis |
We are working on this as we speak, @luisj1983. We see it got much worse after yesterday's issues and are taking some drastic actions :) |
@cakrit OK, no worries. Let me know if I can help with anything from my side! :) |
We got this working on Friday with a workaround, but we're still having issues and today we lost the connections again. It's top priority for us! |
@cakrit OK, thanks for the update! 👍 |
Reducing the priority, as the situation has improved a lot. |
@lvrach, a user mentioned that we might be sending unreachable notifications too early: https://community.netdata.cloud/t/revisit-unreachable-hosts-alarm-netdata-cloud/814/3 |
This issue has been mentioned on the Netdata Community. There might be relevant details there: https://community.netdata.cloud/t/revisit-unreachable-hosts-alarm-netdata-cloud/814/6 |
An update on this:
I expect we'll be able to close this issue very soon. |
Part of this will be resolved tomorrow, with the improved handling of the load resulting from reconnections. The refactored way of handling agent and node connections/disconnections is what will resolve this completely, but the performance/stability team is still involved in ensuring we don't drop any messages. |
Thank you for your efforts here. We recently started using Netdata Cloud, and really like it. This issue is really annoying, though. We are getting unreachable warnings a lot, while there's no downtime on the servers being monitored. This issue is still open, and the problem persists for us. |
Hi @ringe, I understand that this is frustrating. Having used the feature ourselves, we understand the noise it produces and we are racing to understand why the system believes that some nodes get disconnected. In the meantime, please do tell me more about the alarms that have improved your performance and reliability. We want to improve our default alarms so that every user will get maximum value with no configuration. Your feedback is gold to us! Cheers :) |
I opened a new bug specifically for the notifications that go out, which are really annoying. |
This has been addressed in the new architecture. Closing it. |
Describe the bug
Sometimes in Cloud the reachability status is not correct:
To Reproduce
Steps to reproduce the behavior: No clear steps to reproduce
Expected behavior
The reachability status of the nodes should always be correct
Screenshots
Incorrect reachable status
Incorrect unreachable status