Heartbeat only removes a member if it is related to the master node #5253

Closed
youngm opened this issue May 6, 2015 · 6 comments

@youngm youngm commented May 6, 2015

I posted this on the mailing list, but I got impatient and decided to create an issue: https://groups.google.com/d/msg/hazelcast/WiA1xOYf-ys/ZAlytzRAGyoJ

While investigating #5209 I was curious why the heartbeat doesn't repair a channel that stops working.

Although I don't have a test case (it is hard to reproduce simply), from reading the code it appears that the heartbeat only repairs a channel if the current node is the master [0], or if the node that a slave is checking the heartbeat against is the master [1]. Perhaps I'm missing something?

It appears that this is why my channel doesn't self-repair when it gets into a broken state like the one in #5209.

Thoughts?

Mike

[0] https://github.com/hazelcast/hazelcast/blob/master/hazelcast/src/main/java/com/hazelcast/cluster/impl/ClusterServiceImpl.java#L344

[1] https://github.com/hazelcast/hazelcast/blob/master/hazelcast/src/main/java/com/hazelcast/cluster/impl/ClusterServiceImpl.java#L391
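To make the suspected gap concrete, here is a minimal sketch of the gating as described in this comment. It is a hypothetical model, not the code in ClusterServiceImpl: the class name, the method peersActedOn, and the node names A, B, and C are all invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the heartbeat gating described above; not Hazelcast code. */
final class HeartbeatGateSketch {

    /**
     * Returns the silent peers that the described logic would act on.
     * A silent peer is handled only when the local node is the master
     * or when the silent peer itself is the master.
     */
    static List<String> peersActedOn(String local, String master, List<String> silentPeers) {
        List<String> acted = new ArrayList<>();
        boolean localIsMaster = local.equals(master);
        for (String peer : silentPeers) {
            boolean peerIsMaster = peer.equals(master);
            if (localIsMaster || peerIsMaster) {
                acted.add(peer);
            }
        }
        return acted;
    }

    public static void main(String[] args) {
        // A is the master; B and C are non-master nodes.
        System.out.println(peersActedOn("B", "A", List.of("C"))); // prints []
        System.out.println(peersActedOn("A", "A", List.of("C"))); // prints [C]
        System.out.println(peersActedOn("B", "A", List.of("A"))); // prints [A]
    }
}
```

The first call is the case this issue is about: a broken link between two non-master nodes is never acted on.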

@youngm youngm commented May 22, 2016

Ran into another issue today that working heartbeats might have resolved sooner than my manual server restart did. It would be nice if this could be addressed for 3.7. It doesn't seem like a hard fix.

@youngm youngm commented Aug 5, 2016

@pveentjer I'd like to attempt to submit a PR for this issue. Is the general assumption about the issue correct? Should the heartbeat work for connections between any two nodes, not just from node to master and vice versa?

Thanks.

@youngm youngm commented Aug 5, 2016

Hmm... Looks like this code has changed quite a bit. Perhaps it is no longer an issue. I'll investigate. Thanks.

Mike

@youngm youngm commented Aug 26, 2016

I cannot duplicate this issue with 3.7. Nice job fixing it. :)

@youngm youngm closed this Aug 26, 2016

@youngm youngm commented Aug 26, 2016

Actually I just realized my test didn't really exercise the original issue.

This is still an issue.

To duplicate the original issue again, I created a Hazelcast cluster of 3 or more nodes. On one of the non-master nodes, I used iptables to block communication with one of the other non-master nodes.

iptables -A OUTPUT -d 'ip of other non master node' -j DROP

Notice that Hazelcast never attempts to reset the connection, or even to detect that there is an issue, other than operations failing or timing out.

Though this brings up a more complex question. In my original issue resetting the connection fixed the problem I had run into. If resetting the connection doesn't fix the issue what is the appropriate action for Hazelcast to take?

I think attempting a connection reset and logging the problem is at least a good first step.
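As a rough illustration of that first step, the sketch below tracks the last heartbeat time for every peer regardless of role, then logs and resets the connection of any peer that has been silent too long. Everything here is hypothetical: PeerSilenceDetector, its methods, the 6-second timeout, and the resetConnection callback are assumptions for the example and not Hazelcast APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Hypothetical per-peer silence detector; not Hazelcast code. */
final class PeerSilenceDetector {
    private final long timeoutMillis;
    private final Consumer<String> resetConnection; // assumed callback that resets one peer's connection
    private final Map<String, Long> lastHeardMillis = new TreeMap<>();

    PeerSilenceDetector(long timeoutMillis, Consumer<String> resetConnection) {
        this.timeoutMillis = timeoutMillis;
        this.resetConnection = resetConnection;
    }

    /** Records that a heartbeat from the given peer arrived at the given time. */
    void recordHeartbeat(String peer, long nowMillis) {
        lastHeardMillis.put(peer, nowMillis);
    }

    /** Logs and resets every peer that has been silent longer than the timeout. */
    List<String> checkPeers(long nowMillis) {
        List<String> silent = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeardMillis.entrySet()) {
            long silence = nowMillis - entry.getValue();
            if (silence > timeoutMillis) {
                String peer = entry.getKey();
                System.err.println("No heartbeat from " + peer + " for " + silence + " ms; resetting connection");
                resetConnection.accept(peer);
                silent.add(peer);
            }
        }
        return silent;
    }

    public static void main(String[] args) {
        List<String> resets = new ArrayList<>();
        PeerSilenceDetector detector = new PeerSilenceDetector(6_000, resets::add);
        detector.recordHeartbeat("master", 0);
        detector.recordHeartbeat("other-non-master", 0);
        detector.recordHeartbeat("master", 5_000); // the master keeps heartbeating
        System.out.println(detector.checkPeers(10_000)); // prints [other-non-master]
    }
}
```

Timestamps are passed in explicitly so the check can be exercised without waiting on a real clock.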

@youngm youngm reopened this Aug 26, 2016
@metanet metanet self-assigned this Aug 27, 2016
@mdogan mdogan added this to the Backlog milestone Nov 23, 2016
@mdogan mdogan self-assigned this Mar 20, 2017
@mdogan mdogan modified the milestones: 3.9, Backlog Mar 20, 2017

@mdogan mdogan commented May 25, 2017

This issue should have been fixed by #10137.

In summary, the related change is this: the master node removes a non-heartbeating node directly, whereas a non-master node only suspects it (the suspicion is used to elect a new master when the current master is suspected too) and closes/resets the connection.
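A minimal sketch of that role split, for readers who want to see it as code. This is not the implementation from #10137: the class, the Action enum, and the method names are hypothetical, and the connection reset is only represented by the returned action.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical model of the role-dependent reaction summarized above; not Hazelcast code. */
final class HeartbeatTimeoutSketch {
    enum Action { REMOVE_MEMBER, SUSPECT_AND_RESET_CONNECTION }

    private final boolean localIsMaster;
    private final Set<String> members = new LinkedHashSet<>();
    private final Set<String> suspected = new LinkedHashSet<>();

    HeartbeatTimeoutSketch(boolean localIsMaster, Set<String> initialMembers) {
        this.localIsMaster = localIsMaster;
        this.members.addAll(initialMembers);
    }

    /** Reacts to a peer whose heartbeat has timed out, depending on the local role. */
    Action onHeartbeatTimeout(String peer) {
        if (localIsMaster) {
            members.remove(peer); // the master removes the member directly
            return Action.REMOVE_MEMBER;
        }
        suspected.add(peer); // a non-master only records a suspicion
        // A real implementation would also close/reset the connection to the peer here.
        return Action.SUSPECT_AND_RESET_CONNECTION;
    }

    Set<String> members() { return members; }

    Set<String> suspected() { return suspected; }

    public static void main(String[] args) {
        HeartbeatTimeoutSketch master = new HeartbeatTimeoutSketch(true, Set.of("B", "C"));
        System.out.println(master.onHeartbeatTimeout("C")); // prints REMOVE_MEMBER

        HeartbeatTimeoutSketch nonMaster = new HeartbeatTimeoutSketch(false, Set.of("A", "C"));
        System.out.println(nonMaster.onHeartbeatTimeout("C")); // prints SUSPECT_AND_RESET_CONNECTION
        System.out.println(nonMaster.suspected()); // prints [C]
    }
}
```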

@mdogan mdogan closed this May 25, 2017