
Orderly shutdown and recovery of nodes #58

Closed
tstojecki opened this issue Jul 14, 2017 · 2 comments

@tstojecki (Contributor)

Shutting down a node results in the cluster entering a state it can't recover from. This affects all of the nodes/roles, including Lighthouse. From what I can tell, this has been brought up in the Gitter chat room a number of times.

To recreate, follow the instructions in the README for starting the crawl, then kill the cluster node (or tracker). This results in errors across all of the remaining services. Lighthouse starts reporting AssociationErrors, then reports "Leader can currently not perform its duties". Starting the crawler node again doesn't recover: LH reports that the node is joining, but it keeps showing AssociationErrors, and the solution no longer works. The only way to recover is to stop and restart everything, including LH.

Shouldn't the cluster recover from failed nodes better? This seems too fragile.

I am running Win10 x64, VS 2015, IIS express. No HOCON changes, everything runs on the default ports and addresses.

I wasn't sure whether this should be filed here or in the Akka.NET GitHub repo. I am happy to move it.

Here are the screenshots taken after the crawler node was restarted:

(screenshots attached in the original issue)

@Aaronontheweb (Member)

@tstojecki currently there's an issue where the Topshelf services don't properly allow the CoordinatedShutdown to run for WebCrawler. I noticed that this week, after we did the upgrade to 1.2 earlier. One thing you have to do is press Ctrl+C instead of just clicking the X; otherwise the ServiceStop method for Topshelf never runs (it says this at startup).

@tstojecki (Contributor, Author)

Thanks for the feedback @Aaronontheweb, but it looks like this goes further than that.
I removed Topshelf from WebCrawler.CrawlService and wired up a console shutdown event that calls ClusterSystem.Terminate(). The console shutdown is captured as described in this post (ugly, but it does work, as verified in the debugger): https://stackoverflow.com/questions/474679/capture-console-exit-c-sharp?noredirect=1&lq=1
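For reference, here's a minimal sketch of that kind of shutdown hook (the ActorSystem name and setup are assumptions based on the WebCrawler sample, which actually loads its config from HOCON; note that `Console.CancelKeyPress` only covers Ctrl+C, so the window-close case still needs the `SetConsoleCtrlHandler` P/Invoke from the linked Stack Overflow answer):

```csharp
using System;
using Akka.Actor;
using Akka.Cluster;

class Program
{
    static void Main()
    {
        // Assumed setup; the real sample configures the cluster via HOCON.
        var clusterSystem = ActorSystem.Create("webcrawler");
        var cluster = Cluster.Get(clusterSystem);

        // On Ctrl+C, leave the cluster before terminating so the other
        // nodes see this member as Exiting/Removed rather than Unreachable.
        Console.CancelKeyPress += (sender, eventArgs) =>
        {
            eventArgs.Cancel = true; // keep the process alive while we shut down
            cluster.Leave(cluster.SelfAddress);
            clusterSystem.Terminate().Wait(TimeSpan.FromSeconds(10));
        };

        clusterSystem.WhenTerminated.Wait();
    }
}
```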

This is still happening. I have shut down the crawler console app both via Ctrl+C and by clicking the X. In both cases, after a few seconds it reports:

```
Akka.Cluster.ClusterCoreDaemon: Leader can currently not perform its duties, reachability status:
[Reachability(
  [akka.tcp://webcrawler@127.0.0.1:56374 -> UniqueAddress: (akka.tcp://webcrawler@127.0.0.1:16666, 689080532): Reachable [Reachable] (4)],
  [akka.tcp://webcrawler@127.0.0.1:56374 -> UniqueAddress: (akka.tcp://webcrawler@127.0.0.1:4053, 894445096): Reachable [Reachable] (5)],
  [akka.tcp://webcrawler@127.0.0.1:56374 -> UniqueAddress: (akka.tcp://webcrawler@127.0.0.1:56373, 1593674278): Unreachable [Unreachable] (3)]
  [akka.tcp://webcrawler@127.0.0.1:16666 -> UniqueAddress: (akka.tcp://webcrawler@127.0.0.1:56374, 310250411): Unreachable [Unreachable] (7)]
  [akka.tcp://webcrawler@127.0.0.1:4053 -> UniqueAddress: (akka.tcp://webcrawler@127.0.0.1:56374, 310250411): Unreachable [Unreachable] (7)]
  [akka.tcp://webcrawler@127.0.0.1:56373 -> UniqueAddress: (akka.tcp://webcrawler@127.0.0.1:56374, 310250411): Unreachable [Unreachable] (7)])],
member status: [
  $akka.tcp://webcrawler@127.0.0.1:4053 $Up seen=$True,
  $akka.tcp://webcrawler@127.0.0.1:16666 $Up seen=$True,
  $akka.tcp://webcrawler@127.0.0.1:56373 $Up seen=$False,
  $akka.tcp://webcrawler@127.0.0.1:56374 $Up seen=$False]
```

It never recovers after that. Starting a crawler again, LH reports the node joining, but then it keeps throwing association errors and nothing works anymore.

(screenshot attached in the original issue)

Had this scenario worked before the switch to DotNetty? Is this scenario (nodes being shut down or crashing) something that is tested regularly in Akka.Cluster? Sorry for asking naive questions like that, but I have been out of the loop for a bit, and I am a little surprised that it behaves this way; I would consider it a showstopper for production use.

Also, even if the node (in this case the crawler) isn't cleaning up properly, why does it have such damaging effects on the rest of the cluster? This can't be by design, can it? LH could report that there is an issue with a node, but why does it enter this "cannot perform its duties" state? Can something be done to make this more resilient?
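One detail worth noting about the "cannot perform its duties" state: in the Akka cluster model, while any member is marked Unreachable the leader is prevented from performing membership transitions, including moving a rejoining node to Up, until the dead incarnation is Downed (manually or via a downing strategy). As a hedged illustration only, the built-in auto-down setting makes this visible:

```hocon
akka.cluster {
  # CAUTION: auto-downing can split the cluster during real network
  # partitions; it is shown here only to illustrate that the dead
  # node must be Downed before the leader resumes its duties.
  auto-down-unreachable-after = 10s
}
```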

Please let me know how I can be of help in troubleshooting this further.
