Join timing out in handover scenario #168
Comments
Hello @amarouni -- can you please elaborate a bit on the exact moments of time (in what particular order they do this) your nodes perform:
And what do you mean by "node is started"? Does it mean the node has joined the cluster?
@aleksandr-vin Here's a rough timeline of what we're seeing:
That's where we noticed this race condition: sometimes node2 starts before node1 has completely shut down, and it picks up node1's address (as a seed node) from ZooKeeper.
Hope this helps.
I can also confirm the issue. My team is considering adopting ConstructR as an akka-cluster discovery solution. To this end we have examined the implementation and come up with a couple of suggestions for improvement (we will create the corresponding issues soon), which we are of course more than happy to implement and contribute back to the project.
Back to the issue at hand: the problem here is that when a node gets the list of nodes from the coordination service, it performs a
According to @hseeberger, this choice is made in order to fail fast. However, for some users (us included), it may not be acceptable to terminate the actor system unless the error is truly catastrophic. This case is about something that can easily occur, and it can be worked around by transitioning the ConstructrMachine back to the GettingNodes state. This allows the node either to eventually receive an updated list of nodes that are actually online, or to obtain a lock, add itself as the first seed node, and join itself to form the cluster.
Of course, given that some users may actually want the original behavior (i.e. terminate the actor system on timeout), the choice of how to handle the join timeout should probably be configuration driven. There could also be some extra configuration that imposes an upper limit on how many times this loop is performed in the presence of consecutive failures (i.e. get nodes -> join seed nodes -> join timeout ->
Your thoughts on the above?
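To make the proposed loop concrete, here is a minimal, library-free sketch in Python. It is a simulation only and not ConstructR code: `JoinPolicy`, its two fields, `run_join_loop`, and every state name other than GettingNodes are hypothetical names chosen for illustration. It assumes that an empty node list means the node can take the lock and join itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class JoinPolicy:
    # Hypothetical settings for illustration; not real ConstructR config keys.
    retry_on_join_timeout: bool = True  # False reproduces the fail-fast behavior
    max_join_retries: int = 3           # retries allowed after the first timeout


def run_join_loop(
    get_nodes: Callable[[], List[str]],
    try_join: Callable[[List[str]], bool],
    policy: JoinPolicy,
) -> List[str]:
    """Simulate the join loop and return the sequence of states visited."""
    trace: List[str] = []
    timeouts = 0
    while True:
        trace.append("GettingNodes")
        nodes = get_nodes()
        if not nodes:
            # Nobody is registered: take the lock and become the first seed node.
            trace.append("Locking")
            trace.append("JoinedSelf")
            return trace
        trace.append("Joining")
        if try_join(nodes):
            trace.append("Joined")
            return trace
        # Join timed out: either give up or go back to GettingNodes.
        timeouts += 1
        if not policy.retry_on_join_timeout or timeouts > policy.max_join_retries:
            trace.append("Terminated")
            return trace


# Handover scenario: the first lookup still returns the stopped node,
# the second lookup returns an empty list.
responses = iter([["node1"], []])
print(run_join_loop(lambda: next(responses), lambda nodes: False, JoinPolicy()))
```

With `retry_on_join_timeout=False` the same loop ends in the terminating state after the first timeout, which corresponds to the current fail-fast behavior.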
The problem here is that |
Obviously, I will add an integration test that showcases this scenario.
What will happen when joinSeedNodes is called a second time? (Sorry, I have no access to the source code right now.) And what if the first attempt finally succeeds concurrently, while the machine is transitioning?
No problem :) This is a glimpse of the internal joinSeedNodes implementation:
Calling the method multiple times has the effect of stopping the previous attempt and starting a new one with the new seed sequence. Now, in the case of a successful connection while transitioning, what must have happened is that the first node was down for a while but has successfully restarted and has also joined the cluster (i.e. it has joined itself and formed a cluster). If that is the case, then there is no point in "joining again" (I also don't know what effect it would have, but I will try it out since we mentioned it :P ), so we will make use of the MemberUp event received by the FSM to guard against this corner case (i.e. if a node has become up but has also previously transitioned to
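A rough sketch of the guard described above, again as a library-free Python simulation rather than actual ConstructR or Akka code. The class `JoinGuard`, its method names, and the "Joining"/"Joined" state labels are hypothetical; the assumption is that once MemberUp has been seen, a late node list should not trigger another join attempt.

```python
from typing import List


class JoinGuard:
    """Hypothetical FSM fragment: ignore retry work once the member is up."""

    def __init__(self) -> None:
        self.state = "Joining"
        self.member_up = False
        self.join_calls = 0  # how many times a (re)join would have been issued

    def on_join_timeout(self) -> None:
        # A timeout that arrives after MemberUp is stale and is ignored.
        if not self.member_up:
            self.state = "GettingNodes"

    def on_member_up(self) -> None:
        # The first join attempt succeeded, possibly while we were retrying.
        self.member_up = True
        self.state = "Joined"

    def on_nodes(self, nodes: List[str]) -> bool:
        # Returns True if a new join attempt is started for these nodes.
        if self.member_up:
            return False  # already up: do not join again
        self.join_calls += 1
        self.state = "Joining"
        return True


guard = JoinGuard()
guard.on_join_timeout()       # back to GettingNodes
guard.on_member_up()          # first attempt succeeded concurrently
print(guard.on_nodes(["node1"]), guard.state, guard.join_calls)
```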
I was working on a solution similar to @nick-nachos's but didn't have enough time to test all the corner cases. If you can link this issue to your future PR, I'll be happy to take a look.
We're experimenting with Akka Cluster and ConstructR (with ZooKeeper) with a single node.
One of our test scenarios is the handover scenario:
We start with a single node that adds itself and joins the cluster as expected. Then a second node is started, and the old one is stopped just after the first finishes starting up. This leads to a join timeout on the new node, as it tries to join the old (removed) node.
Has anyone faced issues with scenarios like this one?
Isn't the new node supposed to time out the first time, then get the new list of nodes (empty at this point), add itself to the list of nodes, and join itself later on?