Join timing out in handover scenario #168

amarouni · 2017-11-14T15:15:38Z

We're experimenting with Akka Cluster and ConstructR (with ZooKeeper) with a single node.

One of our test scenarios is the handover scenario:
We start with a single node that adds itself and joins the cluster as expected. Then a second node is started and the old one is stopped just after the first finishes starting up. This leads to a join timeout on the new node, as it tries to join the old (removed) node.

Has anyone faced issues with scenarios like this one?

Isn't the new node supposed to time out the first time, then get the new list of nodes (empty at this time), then add itself to the list of nodes and join itself later on?

aleksandr-vin · 2017-11-14T18:46:10Z

Hello @amarouni -- can you please elaborate a bit on the exact order in which your nodes perform the following:

join
start
stop

And what do you mean by "node is started" -- does it mean the node joined the cluster?

amarouni · 2017-11-15T17:41:45Z

@aleksandr-vin Here's a rough timeline of what we're seeing :

node1 is started (docker container started)
node1 adds itself as a seed node and then joins itself
Akka cluster is started with this single node, and everything works as expected
handover is started: node2 is started (docker container started) and node1 is shut down (docker container killed)

That's where we noticed this race condition: sometimes node2 starts before node1 is completely shut down, and it picks up node1's address (as a seed node) from ZooKeeper.

node2 tries to join the non-existent node1
node2 times out
Akka cluster is shut down

Hope this helps

nick-nachos · 2017-11-20T13:09:35Z

I can also confirm the issue. My team is considering adopting ConstructR as an akka-cluster discovery solution, and to this end we have examined the implementation and come up with a couple of suggestions for improvement (will create corresponding issues soon), which of course we are more than happy to implement and contribute back to the project.

Back to the issue at hand: the problem here is that when a node gets the list of nodes from the coordination service, it performs a joinSeedNodes on that list and waits until it joins the cluster. If the operation times out (specifically, if the Joining FSM state times out), then the ConstructrMachine stops itself, which leads to the termination of the actor-system. This will be the case in a scenario such as the one described by the OP, where a first node joins, then fails, and a second node tries to join before the TTL of the first node's DB entry expires. The second node will receive a list of size 1, containing the node that is "dead", and will try to join it unsuccessfully until the FSM state timeout ends up killing it as well.
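
To make the stale-entry window concrete, here is a minimal sketch of a TTL-based registry. It is a toy model only: `Registry`, `Entry`, `register` and `liveNodes` are hypothetical names, not ConstructR's or ZooKeeper's API, and the 30-second TTL used in the example is an assumption.

```scala
// Toy model of a coordination-service registry whose entries expire after a TTL.
// All names here are hypothetical and exist only for illustration.
object StaleSeedSketch {
  final case class Entry(address: String, expiresAtMillis: Long)

  final class Registry(ttlMillis: Long) {
    private var entries = Map.empty[String, Entry]

    // A node (re)registers itself; the entry stays visible until the TTL elapses.
    def register(address: String, nowMillis: Long): Unit =
      entries = entries.updated(address, Entry(address, nowMillis + ttlMillis))

    // The registry cannot tell whether a node is alive; it only knows about expiry.
    def liveNodes(nowMillis: Long): Set[String] =
      entries.values.filter(_.expiresAtMillis > nowMillis).map(_.address).toSet
  }
}
```

If node1 registers at time 0 and is killed shortly afterwards, a node2 that calls `liveNodes` before the TTL elapses still sees node1 and would try to join it; only after expiry does it get an empty list.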

According to @hseeberger, this choice was made in order to fail fast. However, for some users (us included), it may not be acceptable to terminate the actor-system unless the error is indeed catastrophic. This case is about something that can easily occur, and it can be worked around by transitioning the ConstructrMachine back to the GettingNodes state. This will allow it to either eventually receive an updated list of nodes that are actually online, or obtain a lock to add itself as the first seed node and join itself to form the cluster.

Of course, given that some users may actually want the original behavior (i.e. terminate the actor-system on timeout), the choice of how to handle the joining "timeout" should probably be configuration driven. Also, there could be some extra configuration which would impose an upper limit on how many times to perform this loop in the presence of consecutive failures (i.e. get nodes -> join seed nodes -> join timeout -> get nodes again etc.)
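
A rough sketch of how such a configuration-driven decision could look. The setting names `retryOnTimeout` and `maxRetries`, and the `Decision` values, are assumptions made up for illustration; they are not existing ConstructR configuration keys.

```scala
object JoinTimeoutPolicySketch {
  // Hypothetical settings; the proposal only says that options like these should exist.
  final case class JoinTimeoutSettings(retryOnTimeout: Boolean, maxRetries: Int)

  sealed trait Decision
  case object RetryFromGettingNodes extends Decision // go back and re-read the node list
  case object StopMachine extends Decision           // original fail-fast behavior

  // Decide what to do after the n-th consecutive join timeout (1-based count).
  def onJoinTimeout(consecutiveTimeouts: Int, settings: JoinTimeoutSettings): Decision =
    if (settings.retryOnTimeout && consecutiveTimeouts <= settings.maxRetries)
      RetryFromGettingNodes
    else
      StopMachine
}
```

With `retryOnTimeout = false` this reproduces the fail-fast behavior on the first timeout; with `retryOnTimeout = true` the loop is bounded by `maxRetries`.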

Your thoughts on the above?

hseeberger · 2017-11-20T17:20:10Z

The problem here is that joinSeedNodes repeatedly continues trying to join in the background (by default every 5 seconds or so). So transitioning back into GettingNodes leads to an inconsistent state. This would only work if we used some other (non-repeating) way to join.

nick-nachos · 2017-11-20T21:11:05Z

joinSeedNodes restarts the repeated attempts with its given input every time it is called; i.e. executing joinSeedNodes(A) and then joinSeedNodes(B) will have the effect that the actor-system stops trying to connect to A and starts trying to connect to B. I have confirmed this by going through the akka source code, and I also made a proof of concept earlier today that demonstrates the desired scenario. By transitioning back to GettingNodes, the FSM will receive the node list from the backend once again, and eventually switch to Joining, where it will re-execute joinSeedNodes. If the list is not yet refreshed (i.e. the downed node's key has not yet expired), the joining process will time out again, thus re-triggering the process from scratch. This loop will stop as soon as the downed node's key expires and the second node registers itself as the new seed.

Obviously, I will add an integration test that showcases this scenario.

hseeberger · 2017-11-20T21:31:38Z

What will happen when calling joinSeedNodes for a second time? (Sorry, I have no access to source code right now). And what if, while transitioning, the first attempt finally succeeded concurrently?

nick-nachos · 2017-11-20T21:56:47Z

No problem :) This is a glimpse of the internal joinSeedNodes implementation:

```scala
def joinSeedNodes(newSeedNodes: immutable.IndexedSeq[Address]): Unit = {
  if (newSeedNodes.nonEmpty) {
    stopSeedNodeProcess()
    seedNodes = newSeedNodes // keep them for retry
    seedNodeProcess = // ...
```

Calling the method multiple times has the effect of stopping the previous attempt and starting a new one with the new seed sequence.
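
The stop-then-restart behavior can be illustrated with a small stand-alone model. This is not Akka's code: `ToyClusterDaemon`, `currentTargets` and `generation` are hypothetical names, and the "process" is reduced to a counter so the example runs without Akka.

```scala
object SeedNodeRestartSketch {
  // Toy stand-in for the component that owns the seed-node join process.
  final class ToyClusterDaemon {
    private var seedNodes: Vector[String] = Vector.empty
    private var processGeneration: Int = 0 // bumps each time a new join process starts

    def joinSeedNodes(newSeedNodes: Vector[String]): Unit =
      if (newSeedNodes.nonEmpty) {
        // A new generation models "stop the previous process, start a fresh one".
        processGeneration += 1
        seedNodes = newSeedNodes // kept for retries of the current process only
      }

    def currentTargets: Vector[String] = seedNodes
    def generation: Int = processGeneration
  }
}
```

Calling `joinSeedNodes(Vector("A"))` and then `joinSeedNodes(Vector("B"))` on this model leaves only B as the current target, and an empty sequence is ignored, mirroring the `nonEmpty` guard in the excerpt above.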

Now, in the case of a successful connection while transitioning, what must have happened is that the first node was down for a while, but has successfully restarted and has also joined the cluster (i.e. has joined itself and formed a cluster). If that is the case, then there is no point in "joining again" (I also don't know what effect it would have, but I will try it out since we mentioned it :P ), so we will make use of the MemberUp event reception by the FSM to guard against this corner case (i.e. if a node has become up but has also previously transitioned to GettingNodes, then upon MemberUp reception it will transition immediately to AddingSelf, as it would normally do after Joining).
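
A minimal sketch of the guard as a pure transition function. The state names GettingNodes, Joining and AddingSelf and the MemberUp event come from the discussion above; `JoinTimeout` and `NodesReceived` are hypothetical event names, and the real ConstructrMachine has more states than are modeled here.

```scala
object FsmGuardSketch {
  sealed trait State
  case object GettingNodes extends State
  case object Joining extends State
  case object AddingSelf extends State

  sealed trait Event
  case object MemberUp extends Event      // this node became Up in the cluster
  case object JoinTimeout extends Event   // hypothetical: the Joining state timed out
  case object NodesReceived extends Event // hypothetical: a non-empty node list arrived

  def next(state: State, event: Event): State = (state, event) match {
    case (Joining, MemberUp)           => AddingSelf
    case (Joining, JoinTimeout)        => GettingNodes // retry instead of stopping
    case (GettingNodes, MemberUp)      => AddingSelf   // earlier attempt succeeded late
    case (GettingNodes, NodesReceived) => Joining
    case (other, _)                    => other        // ignore anything else
  }
}
```

The third case is the guard: a MemberUp that arrives after the machine has already gone back to GettingNodes moves it straight to AddingSelf rather than triggering another join.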

amarouni · 2017-11-21T16:34:25Z

I was working on a solution similar to @nick-nachos's but didn't have enough time to test all corner cases. So if you can link this issue to your future PR, I'll be happy to take a look.

nick-nachos mentioned this issue Nov 20, 2017

Delete key-value entries from backend if possible #170

Open

nick-nachos added a commit to nick-nachos/constructr that referenced this issue Nov 23, 2017

Restart FSM process on seed node failures (fixes hseeberger#168)

ddd5a23

nick-nachos mentioned this issue Nov 23, 2017

Restart FSM process on seed node failures (fixes #168) #173

Merged

nick-nachos added a commit to nick-nachos/constructr that referenced this issue Mar 11, 2018

Restart FSM process on seed node failures (fixes hseeberger#168)

477d3d5

hseeberger closed this as completed in 56e50f0 Mar 12, 2018
