Heartbeat only removes a member if it is related to the master node #5253
I posted this on the mailing list, but I got impatient and decided to create an issue: https://groups.google.com/d/msg/hazelcast/WiA1xOYf-ys/ZAlytzRAGyoJ
While investigating #5209, I was curious why the heartbeat doesn't repair a channel that stops working.
I don't have a test case (the problem is hard to reproduce simply), but from reading the code it appears that the heartbeat only repairs a channel if the current node is the master, or if the node that a slave is checking the heartbeat against is the master. Perhaps I'm missing something?
This appears to be why my channel doesn't self-repair when it gets into a broken state like the one described in #5209.
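To make the suspected behavior concrete, here is a minimal sketch of the condition as I read it. This is not Hazelcast's actual code: the class `HeartbeatSketch`, the enum `Action`, and the method `onHeartbeatTimeout` are hypothetical names used only for illustration.

```java
// Hypothetical sketch of the behavior described above; not Hazelcast source.
final class HeartbeatSketch {
    enum Action { REMOVE_MEMBER, NONE }

    // Decide what happens when a heartbeat from another member times out.
    // Assumption: action is taken only when the master is involved on
    // either side of the connection.
    static Action onHeartbeatTimeout(boolean localIsMaster, boolean targetIsMaster) {
        if (localIsMaster || targetIsMaster) {
            return Action.REMOVE_MEMBER;
        }
        // Two non-master members: nothing is removed or reset,
        // so a broken channel between them stays broken.
        return Action.NONE;
    }

    public static void main(String[] args) {
        // A non-master node losing heartbeats from another non-master node.
        System.out.println(onHeartbeatTimeout(false, false)); // prints "NONE"
    }
}
```

If this reading is right, the non-master to non-master case falls through without any repair attempt.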
Actually I just realized my test didn't really exercise the original issue.
This is still an issue.
To reproduce the original issue, I created a Hazelcast cluster of 3 or more nodes. On one of the non-master nodes I used iptables to block communication with one of the other non-master nodes.
Notice that Hazelcast never attempts to reset the connection, or even detects that there is a problem, other than operations failing or timing out.
This does bring up a more complex question, though. In my original issue, resetting the connection fixed the problem I had run into. If resetting the connection doesn't fix the issue, what is the appropriate action for Hazelcast to take?
I think attempting a connection reset and logging the problem is at least a good first step.
This issue should have been fixed by #10137.
In summary, with that change the master node removes a non-heartbeating member directly, whereas a non-master node only suspects that member (the suspicion is used to elect a new master when the current master is suspected too) and closes/resets the connection.
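The summary above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from #10137: the class `SuspicionSketch`, its fields, and the use of plain string member IDs are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the behavior summarized above; not Hazelcast source.
final class SuspicionSketch {
    final boolean localIsMaster;
    final Set<String> members = new HashSet<>();           // current member list
    final Set<String> suspected = new HashSet<>();         // locally suspected members
    final Set<String> closedConnections = new HashSet<>(); // connections closed/reset

    SuspicionSketch(boolean localIsMaster, Set<String> initialMembers) {
        this.localIsMaster = localIsMaster;
        this.members.addAll(initialMembers);
    }

    // Called when a member has not sent a heartbeat within the timeout.
    void onHeartbeatTimeout(String member) {
        if (localIsMaster) {
            // The master removes the non-heartbeating member directly.
            members.remove(member);
        } else {
            // A non-master only suspects the member and closes/resets the
            // connection; the member stays in the member list.
            suspected.add(member);
            closedConnections.add(member);
        }
    }
}
```

In this sketch a non-master node now reacts to a missed heartbeat from any member, which is the case the original report said was left unhandled.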