dmq: wait for a 2nd failed ping before deleting a node #1840

grumvalski · 2019-02-07T10:41:29Z

Pre-Submission Checklist

Commit message has the format required by CONTRIBUTING guide
Commits are split per component (core, individual modules, libs, utils, ...)
Each component has a single commit (if not, squash them into one commit)
No commits to README files for modules (changes must be done to docbook files
in doc/ subfolder, the README file is autogenerated)

Type Of Change

Small bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds new functionality)
Breaking change (fix or feature that would change existing functionality)

Checklist:

PR should be backported to stable branches
Tested changes locally
Related to issue #XXXX (replace XXXX with an open issue number)

Description

I've recently being experiencing a loop in nodes removal/addition leading to "ghost nodes".
Suppose to have three servers A,B,C.
Server C goes down not cleanly, so DMQ doesn't notify the other nodes. Server A is the first to send its ping, with a nodelist including node C. After fr_timer, the transaction for the message to node C times out and the node is removed from node A nodelist.
Then node B sends its ping with a nodelist including node C (still alive for A), node A sees node C as a new node and adds it back to its nodelist. Now node B reaching fr_timer timeout removes node C, until next node's A ping, and so on. This does not occur if the delta between node A and node B pings is less than fr_timer.
What I propose here is that, upon a failed ping, the failing node is put in disabled state and we wait a 2nd failed ping before removing it from the nodelist. This should prevent dead nodes to come back.

charlesrchance · 2019-02-07T11:00:26Z

Looks good - thanks!

Would it be sensible to add a module parameter to configure the number of failed pings before a node is considered dead (defaulting to 1 to preserve current behaviour)?

grumvalski · 2019-02-07T11:06:51Z

Definitely :)
Should I merge and backport this one?

charlesrchance · 2019-02-07T12:21:42Z

I am against backporting because it changes current behaviour and is not strictly a bug fix.

For master, I would still prefer the option to preserve current behaviour (with modparam) in case it is essential for some users to fail quickly.

grumvalski · 2019-02-07T12:31:35Z

I discovered the problem exactly because I wanted to fail quickly, having a global fr_timer of 5 s.
IMHO the patch as it is doesn't affect the capability to fail quickly since a node is taken off the nodelist for message broadcasting immediately, except for the ping. I would say that it affects the recovery speed, since a two pings cycle is needed before a node is added back in case of a temporary failure.
A modparam can be added to enable this, along with the caveat in the doc that a race condition can happen due to the combination of fr_timer and dmq ping interval.

charlesrchance · 2019-02-07T13:14:34Z

Okay - can be merged in that case.

Thanks again!

grumvalski · 2019-02-07T13:27:02Z

Mergin, thank you :)

dmq: wait for a 2nd failed ping before deleting a node

8ffcb5f

grumvalski requested a review from charlesrchance February 7, 2019 10:41

grumvalski merged commit a537bc3 into master Feb 7, 2019

grumvalski deleted the grumvalski/fix_dmq_nodelist branch February 7, 2019 13:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dmq: wait for a 2nd failed ping before deleting a node #1840

dmq: wait for a 2nd failed ping before deleting a node #1840

grumvalski commented Feb 7, 2019

charlesrchance commented Feb 7, 2019

grumvalski commented Feb 7, 2019

charlesrchance commented Feb 7, 2019

grumvalski commented Feb 7, 2019

charlesrchance commented Feb 7, 2019

grumvalski commented Feb 7, 2019

dmq: wait for a 2nd failed ping before deleting a node #1840

dmq: wait for a 2nd failed ping before deleting a node #1840

Conversation

grumvalski commented Feb 7, 2019

Pre-Submission Checklist

Type Of Change

Checklist:

Description

charlesrchance commented Feb 7, 2019

grumvalski commented Feb 7, 2019

charlesrchance commented Feb 7, 2019

grumvalski commented Feb 7, 2019

charlesrchance commented Feb 7, 2019

grumvalski commented Feb 7, 2019