Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dmq: wait for a 2nd failed ping before deleting a node #1840

Merged
merged 1 commit into from Feb 7, 2019

Conversation

grumvalski
Copy link
Contributor

Pre-Submission Checklist

  • Commit message has the format required by CONTRIBUTING guide
  • Commits are split per component (core, individual modules, libs, utils, ...)
  • Each component has a single commit (if not, squash them into one commit)
  • No commits to README files for modules (changes must be done to docbook files
    in doc/ subfolder, the README file is autogenerated)

Type Of Change

  • Small bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would change existing functionality)

Checklist:

  • PR should be backported to stable branches
  • Tested changes locally
  • Related to issue #XXXX (replace XXXX with an open issue number)

Description

I've recently being experiencing a loop in nodes removal/addition leading to "ghost nodes".
Suppose to have three servers A,B,C.
Server C goes down not cleanly, so DMQ doesn't notify the other nodes. Server A is the first to send its ping, with a nodelist including node C. After fr_timer, the transaction for the message to node C times out and the node is removed from node A nodelist.
Then node B sends its ping with a nodelist including node C (still alive for A), node A sees node C as a new node and adds it back to its nodelist. Now node B reaching fr_timer timeout removes node C, until next node's A ping, and so on. This does not occur if the delta between node A and node B pings is less than fr_timer.
What I propose here is that, upon a failed ping, the failing node is put in disabled state and we wait a 2nd failed ping before removing it from the nodelist. This should prevent dead nodes to come back.

@charlesrchance
Copy link
Member

Looks good - thanks!

Would it be sensible to add a module parameter to configure the number of failed pings before a node is considered dead (defaulting to 1 to preserve current behaviour)?

@grumvalski
Copy link
Contributor Author

Definitely :)
Should I merge and backport this one?

@charlesrchance
Copy link
Member

I am against backporting because it changes current behaviour and is not strictly a bug fix.

For master, I would still prefer the option to preserve current behaviour (with modparam) in case it is essential for some users to fail quickly.

@grumvalski
Copy link
Contributor Author

I discovered the problem exactly because I wanted to fail quickly, having a global fr_timer of 5 s.
IMHO the patch as it is doesn't affect the capability to fail quickly since a node is taken off the nodelist for message broadcasting immediately, except for the ping. I would say that it affects the recovery speed, since a two pings cycle is needed before a node is added back in case of a temporary failure.
A modparam can be added to enable this, along with the caveat in the doc that a race condition can happen due to the combination of fr_timer and dmq ping interval.

@charlesrchance
Copy link
Member

Okay - can be merged in that case.

Thanks again!

@grumvalski
Copy link
Contributor Author

Mergin, thank you :)

@grumvalski grumvalski merged commit a537bc3 into master Feb 7, 2019
@grumvalski grumvalski deleted the grumvalski/fix_dmq_nodelist branch February 7, 2019 13:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants