
Orderly shutdown and recovery of nodes #58

Closed
tstojecki opened this issue Jul 14, 2017 · 2 comments

@tstojecki (Contributor)

Shutting down a node results in the cluster entering a state it can't recover from. This affects all of the nodes/roles, including Lighthouse. From what I can tell, this has been brought up in the Gitter chat room a number of times.

To recreate, follow the instructions in the README for starting the crawl, then kill the cluster node (or tracker). This results in errors across all of the remaining services. Lighthouse starts reporting AssociationErrors, then reports "Leader can currently not perform its duties". Starting the crawler node again doesn't recover: LH reports that the node is joining, but it keeps showing AssociationErrors, and the solution no longer works. The only way to recover is to stop and restart everything, including LH.

Shouldn't the cluster recover from failed nodes better? This seems too fragile.

I am running Win10 x64, VS 2015, IIS express. No HOCON changes, everything runs on the default ports and addresses.

I wasn't sure whether this should be filed here or in the Akka.NET GitHub repo. I am happy to move it.

Here are the screenshots taken after the crawler node was restarted:

(screenshots attached in the original issue)

@Aaronontheweb (Member)

@tstojecki currently there's an issue where the Topshelf services don't properly allow the CoordinatedShutdown to run for WebCrawler. I noticed that this week, after we did the upgrade to 1.2 earlier. One thing you have to do is press Ctrl+C instead of just clicking the X; otherwise the ServiceStop method for Topshelf never runs (it says this at startup).

@tstojecki (Contributor, Author)

Thanks for the feedback @Aaronontheweb, but it looks like this goes further than that.
I removed Topshelf from WebCrawler.CrawlService and wired up a console shutdown event that calls ClusterSystem.Terminate(). The console shutdown is captured as described in this post (ugly, but it does work, as verified in the debugger): https://stackoverflow.com/questions/474679/capture-console-exit-c-sharp?noredirect=1&lq=1
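For reference, here's a minimal sketch of that kind of shutdown hook (the ActorSystem name and setup are assumptions based on the WebCrawler sample, which actually loads its config from HOCON; note that `Console.CancelKeyPress` only covers Ctrl+C, so the window-close case still needs the `SetConsoleCtrlHandler` P/Invoke from the linked Stack Overflow answer):

```csharp
using System;
using Akka.Actor;
using Akka.Cluster;

class Program
{
    static void Main()
    {
        // Assumed setup; the real sample configures the cluster via HOCON.
        var clusterSystem = ActorSystem.Create("webcrawler");
        var cluster = Cluster.Get(clusterSystem);

        // On Ctrl+C, leave the cluster before terminating so the other
        // nodes see this member as Exiting/Removed rather than Unreachable.
        Console.CancelKeyPress += (sender, eventArgs) =>
        {
            eventArgs.Cancel = true; // keep the process alive while we shut down
            cluster.Leave(cluster.SelfAddress);
            clusterSystem.Terminate().Wait(TimeSpan.FromSeconds(10));
        };

        clusterSystem.WhenTerminated.Wait();
    }
}
```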

This is still happening. I have shut down the crawler console app both via Ctrl+C and by clicking the X. In both cases, after a few seconds it reports:

```
Akka.Cluster.ClusterCoreDaemon: Leader can currently not perform its duties, reachability status:
[Reachability(
  [akka.tcp://webcrawler@127.0.0.1:56374 -> UniqueAddress: (akka.tcp://webcrawler@127.0.0.1:16666, 689080532): Reachable [Reachable] (4)],
  [akka.tcp://webcrawler@127.0.0.1:56374 -> UniqueAddress: (akka.tcp://webcrawler@127.0.0.1:4053, 894445096): Reachable [Reachable] (5)],
  [akka.tcp://webcrawler@127.0.0.1:56374 -> UniqueAddress: (akka.tcp://webcrawler@127.0.0.1:56373, 1593674278): Unreachable [Unreachable] (3)]
  [akka.tcp://webcrawler@127.0.0.1:16666 -> UniqueAddress: (akka.tcp://webcrawler@127.0.0.1:56374, 310250411): Unreachable [Unreachable] (7)]
  [akka.tcp://webcrawler@127.0.0.1:4053 -> UniqueAddress: (akka.tcp://webcrawler@127.0.0.1:56374, 310250411): Unreachable [Unreachable] (7)]
  [akka.tcp://webcrawler@127.0.0.1:56373 -> UniqueAddress: (akka.tcp://webcrawler@127.0.0.1:56374, 310250411): Unreachable [Unreachable] (7)])],
member status: [
  $akka.tcp://webcrawler@127.0.0.1:4053 $Up seen=$True,
  $akka.tcp://webcrawler@127.0.0.1:16666 $Up seen=$True,
  $akka.tcp://webcrawler@127.0.0.1:56373 $Up seen=$False,
  $akka.tcp://webcrawler@127.0.0.1:56374 $Up seen=$False]
```

It never recovers after that. Starting a crawler again, LH reports the node joining, but then it keeps throwing association errors and nothing works anymore.

(screenshot attached in the original issue)

Had this scenario worked before the switch to DotNetty? Is this scenario (nodes being shut down or crashing) something that is tested regularly in Akka.Cluster? Sorry for asking naive questions like that, but I have been out of the loop for a bit, and I am a little surprised that it behaves this way; I would consider it a showstopper for production use.

Also, even if the node (in this case the crawler) isn't cleaning up properly, why does it have such damaging effects on the rest of the cluster? This can't be by design, can it? LH could report that there is an issue with a node, but why does it enter this "cannot perform its duties" state? Can something be done to make this more resilient?
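One detail worth noting about the "cannot perform its duties" state: in the Akka cluster model, while any member is marked Unreachable the leader is prevented from performing membership transitions, including moving a rejoining node to Up, until the dead incarnation is Downed (manually or via a downing strategy). As a hedged illustration only, the built-in auto-down setting makes this visible:

```hocon
akka.cluster {
  # CAUTION: auto-downing can split the cluster during real network
  # partitions; it is shown here only to illustrate that the dead
  # node must be Downed before the leader resumes its duties.
  auto-down-unreachable-after = 10s
}
```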

Please let me know how I can be of help in troubleshooting this further.
