Incorrect reachability status #3

Closed
manos-saratsis opened this issue Nov 20, 2020 · 17 comments

Comments

@manos-saratsis

manos-saratsis commented Nov 20, 2020

Describe the bug
Sometimes in Cloud the reachability status is not correct:

  • the node appears reachable when it is not
  • the node appears unreachable when it is reachable

To Reproduce
Steps to reproduce the behavior: No clear steps to reproduce

Expected behavior
The reachability status of the nodes should always be correct

Screenshots

Incorrect reachable status
incorect_reachable_status

Incorrect unreachable status
incorrect_unreachable_status

@manos-saratsis manos-saratsis added the bug Something isn't working label Nov 20, 2020
@manos-saratsis manos-saratsis self-assigned this Nov 20, 2020
@manos-saratsis manos-saratsis removed their assignment Nov 20, 2020
@ghost

ghost commented Nov 20, 2020

Indeed there is an edge case that could cause the system to report this false-positive reachability. 🙁

A quick fix would be to restart the Netdata agent. Unfortunately, if this was an ephemeral agent then there is nothing that you can do at the moment.

Having said that, the team is currently working both on preventing this behaviour and on the ability to permanently remove an agent from your space.

@netdata-community-bot

This issue has been mentioned on Netdata Community. There might be relevant details there:

https://community.netdata.cloud/t/cloud-service-outage/517/2

@luisj1983

OK, I have this issue, and there is also what looks like a gap in my historical data.
This may have started when I changed the "dbengine multihost disk space" from '256' to '1200'.
I've restarted the agents etc., but some nodes still aren't appearing.
Tell me what you need from me.

@luisj1983

@harrisbz and @manos-saratsis
Any chance of getting some love for this issue? My cloud dashboard now shows ZERO nodes and so is completely useless and has been so for quite some time now.

image

@cakrit
Contributor

cakrit commented Dec 4, 2020

We are working on this as we speak, @luisj1983. We see it got much worse after yesterday's issues and are taking some drastic actions :)

@luisj1983

@cakrit OK, no worries. Let me know if I can help with anything from my side! :)
Good luck!

@cakrit
Contributor

cakrit commented Dec 7, 2020

We got this working on Friday with a workaround, but we're still having issues and today we lost the connections again. It's top priority for us!

@luisj1983

@cakrit OK, thanks for the update! 👍

@manos-saratsis
Author

manos-saratsis commented Dec 14, 2020

Reducing the priority, as the state has improved a lot.

@manos-saratsis
Author

@lvrach, a user mentioned that we might be sending unreachable notifications too early: https://community.netdata.cloud/t/revisit-unreachable-hosts-alarm-netdata-cloud/814/3

@netdata-community-bot

This issue has been mentioned on the Netdata Community. There might be relevant details there:

https://community.netdata.cloud/t/revisit-unreachable-hosts-alarm-netdata-cloud/814/6

@cakrit
Contributor

cakrit commented Jan 23, 2021

An update on this:

  • We identified the two main sources of the discrepancies we have been seeing in reachability status: VerneMQ and debouncing. Both result in dropped messages, and the same two sources also cause other discrepancies.
  • We are very close to delivering changes in production that will address the issues with VerneMQ.
  • We are also fully focused on resolving the debouncing issues, starting with the connection/disconnection messages.

I expect we'll be able to close this issue very soon.

@cakrit
Contributor

cakrit commented Feb 1, 2021

Part of this will be resolved tomorrow, with the improved handling of the load resulting from reconnections. The refactored way of handling agent and node connections/disconnections is what will resolve this completely, but the performance/stability team is still involved in ensuring we don't drop any messages.

@ringe

ringe commented Feb 24, 2021

Thank you for your efforts here. We recently started using Netdata Cloud, and really like it.
There are a lot of default warning levels that we don't care about, so we adjust them. Others have really improved our performance and reliability.

This issue is really annoying, though. We are getting unreachable warnings a lot, while there's no downtime on the servers being monitored.

This issue is still open, and the problem persists for us.

@odyslam
Contributor

odyslam commented Feb 24, 2021

Hi @ringe,

I understand that this is frustrating. Having used the feature ourselves, we understand the noise it produces and we are racing to understand why the system believes that some nodes get disconnected.

In the meantime, please do tell me more about the alarms that have improved your performance and reliability. We want to improve our default alarms so that every user will get maximum value with no configuration. Your feedback is gold to us!

Cheers :)

@cakrit
Contributor

cakrit commented Mar 26, 2021

I opened a new bug specifically for the notifications that go out, which are really annoying.

@dimko

dimko commented Nov 24, 2021

This has been addressed in the new architecture. Closing it.
