Incorrect reachability status #3

manos-saratsis · 2020-11-20T11:14:24Z

Describe the bug
Sometimes in Cloud the reachability status is not correct:

the node appears reachable when it is not
the node appears unreachable when it is reachable

To Reproduce
Steps to reproduce the behavior: No clear steps to reproduce

Expected behavior
The reachability status of the nodes should always be correct

Screenshots

Incorrect reachable status

Incorrect unreachable status

ghost · 2020-11-20T19:28:57Z

Indeed there is an edge case that could cause the system to report this false-positive reachability. 🙁

A quick fix would be to restart the Netdata agent. Unfortunately, if this was an ephemeral agent then there is nothing that you can do at the moment.

Having said that, the team is currently working both on preventing this behaviour and on the ability to permanently remove an agent from your space.

netdata-community-bot · 2020-11-25T12:57:11Z

This issue has been mentioned on Netdata Community. There might be relevant details there:

https://community.netdata.cloud/t/cloud-service-outage/517/2

luisj1983 · 2020-11-25T14:47:00Z

OK, I have this issue and there is also what looks like a gap in my historical data.
This may have been since I changed the "dbengine multihost disk space" from '256' to '1200'.
I've restarted agents etc but still some nodes aren't appearing.
Tell me what you need from me.

luisj1983 · 2020-12-04T16:45:47Z

@harrisbz and @manos-saratsis
Any chance of getting some love for this issue? My cloud dashboard now shows ZERO nodes and so is completely useless and has been so for quite some time now.

cakrit · 2020-12-04T16:59:35Z

We are working on this as we speak, @luisj1983. We see it got much worse after yesterday's issues and are taking some drastic actions :)

luisj1983 · 2020-12-04T17:18:20Z

@cakrit OK, no worries. Let me know if I can help with anything from my side! :)
Good luck!

cakrit · 2020-12-07T15:46:18Z

We got this working on Friday with a workaround, but we're still having issues and today we lost the connections again. It's top priority for us!

luisj1983 · 2020-12-07T17:01:38Z

@cakrit OK, thanks for the update! 👍

manos-saratsis · 2020-12-14T17:18:58Z

Reducing the priority, as the state has improved a lot.

manos-saratsis · 2021-01-22T13:50:54Z

@lvrach a user mentioned that we might be sending unreachable notifications too early https://community.netdata.cloud/t/revisit-unreachable-hosts-alarm-netdata-cloud/814/3

netdata-community-bot · 2021-01-22T13:51:55Z

This issue has been mentioned on the Netdata Community. There might be relevant details there:

https://community.netdata.cloud/t/revisit-unreachable-hosts-alarm-netdata-cloud/814/6

cakrit · 2021-01-23T18:25:07Z

An update on this:

We identified two main sources of the discrepancies we have been seeing in reachability status, both of which result in dropped messages. Other discrepancies are also caused by the same two sources: VerneMQ and debouncing.
We are very close to delivering changes in production that will address the issues with VerneMQ.
We are also fully focused on resolving the debouncing issues, starting with the connection/disconnection messages.
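For readers unfamiliar with the term, debouncing connection/disconnection messages can be sketched roughly as follows. This is a minimal illustration only, not Netdata Cloud's actual implementation: the `Event` type, the `debounce` function, and the `grace` and `now` parameters are all hypothetical names chosen for this example. The idea is that a state change is only confirmed if it is not superseded by another message within a grace period, so a brief disconnect followed by a quick reconnect does not flip the node's status.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Event:
    """Hypothetical connection message for one node."""
    ts: float        # time the message was received, in seconds
    reachable: bool  # True for a connect, False for a disconnect


def debounce(events: Iterable[Event], grace: float, now: float) -> List[Event]:
    """Return only the state changes that lasted at least `grace` seconds.

    An event counts as confirmed if no later event arrived within `grace`
    seconds of it (the last event is measured against `now`).
    """
    confirmed: List[Event] = []
    ordered = sorted(events, key=lambda e: e.ts)
    for i, ev in enumerate(ordered):
        next_ts = ordered[i + 1].ts if i + 1 < len(ordered) else now
        if next_ts - ev.ts < grace:
            continue  # superseded too quickly: treat as a flap and ignore
        if confirmed and confirmed[-1].reachable == ev.reachable:
            continue  # confirmed state did not actually change
        confirmed.append(ev)
    return confirmed


if __name__ == "__main__":
    history = [
        Event(0, True),
        Event(100, False),  # brief disconnect...
        Event(102, True),   # ...followed by a reconnect two seconds later
        Event(300, False),  # disconnect that persists
    ]
    for ev in debounce(history, grace=30, now=400):
        print(ev)
```

With a 30-second grace period, the two-second disconnect at `ts=100` is dropped and only the initial connect and the lasting disconnect at `ts=300` are reported. The trade-off is that a longer grace period suppresses more flapping but delays genuine unreachable status changes.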

I expect we'll be able to close this issue very soon.

cakrit · 2021-02-01T20:31:05Z

Part of this will be resolved tomorrow, with the improved handling of the load resulting from reconnections. The refactored way of handling agent and node connections/disconnections is what will resolve this completely, but the performance/stability team is still involved in ensuring we don't drop any messages.

ringe · 2021-02-24T09:48:24Z

Thank you for your efforts here. We recently started using Netdata Cloud, and really like it.
There are a lot of default warning levels that we don't care about, so we adjust them. Others have really improved our performance and reliability.

This issue is really annoying, though. We are getting unreachable warnings a lot, while there's no downtime on the servers being monitored.

This issue is still open, and the problem persists for us.

odyslam · 2021-02-24T10:14:03Z

Hi @ringe,

I understand that this is frustrating. Having used the feature ourselves, we understand the noise it produces and we are racing to understand why the system believes that some nodes get disconnected.

In the meantime, please do tell me more about the alarms that have improved your performance and reliability. We want to improve our default alarms so that every user will get maximum value with no configuration. Your feedback is gold to us!

Cheers :)

cakrit · 2021-03-26T20:23:37Z

I opened a new bug specifically for the notifications that go out, which are really annoying.

dimko · 2021-11-24T11:29:32Z

Has been addressed in the new architecture. Closing it.

manos-saratsis added the bug label Nov 20, 2020

manos-saratsis self-assigned this Nov 20, 2020

manos-saratsis added the internal submit label Nov 20, 2020

manos-saratsis removed their assignment Nov 20, 2020

manos-saratsis added the cloud-backend label Nov 30, 2020

manos-saratsis added priority/high priority/medium and removed priority/high labels Dec 10, 2020

cakrit added mgmt-navigation-team-bugs and removed cloud-backend labels Feb 1, 2021

cakrit assigned lvrach and nktsitas Feb 1, 2021

cakrit added the performance-stability-team-bugs label Feb 1, 2021

cakrit assigned fracasula Feb 1, 2021

cakrit mentioned this issue Mar 26, 2021

[BUG] Reachability notifications shouldn't be sent due to cloud maintenance actions #75

Closed

davidwadddell mentioned this issue Sep 23, 2021

[BUG] Coring issues created from netdata on redhat 8.3 #137

Closed

dimko closed this as completed Nov 24, 2021

hugovalente-pm added performance-stability-team and removed performance-stability-team-bugs labels Apr 4, 2022

shyamvalsan mentioned this issue Apr 18, 2022

[BUG] in anomaly advisor, the error conditions for metric data is missing #358

Closed
