Orderly shutdown and recovery of nodes #58
Comments
@tstojecki currently there's an issue where the Topshelf services don't properly allow the |
Thanks for the feedback @Aaronontheweb, but it looks like this goes further than that. This is still happening. I have shut down the crawler console app both with Ctrl+C and by clicking X. In both cases, after a few seconds it reports
It never recovers after that. When I start a crawler again, LH reports the node joining, but then it keeps throwing association errors and nothing works anymore.

Did this scenario work before the switch to DotNetty? Are scenarios like this (nodes being shut down or crashing) tested regularly in Akka.Cluster? Sorry for asking naive questions, but I have been out of the loop for a bit and I am a little surprised by this behavior, as I would consider it a show stopper for production use.

Also, even if the node (in this case the crawler) isn't doing its cleanup work properly, why does that have such a damaging effect on the rest of the cluster? This can't be by design, can it? LH could report that there is an issue with a node, but why is it entering this "cannot perform its duties" state? Can something be done to make that more resilient?

Please let me know how I can help troubleshoot this further.
Shutting down a node puts the cluster into a state it can't recover from. This affects all the nodes/roles, including Lighthouse (LH). From what I can tell, this has been brought up in the Gitter chat room a number of times.
To reproduce, follow the instructions in the readme for how to start the crawl. Then kill the cluster node (or tracker), which results in errors across all the remaining services. Lighthouse starts reporting AssociationErrors, then it reports "Leader can currently not perform its duties". Starting the crawler node again doesn't fix it: LH reports that the node is joining, but it keeps showing AssociationErrors and the solution no longer works. The only way to recover is to stop and start everything again, including LH.
Shouldn't the cluster recover from failed nodes better? This seems too fragile.
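For anyone trying to reason about the "Leader can currently not perform its duties" part, here is a toy model of one plausible reading of it. It is not Akka.NET code and not a diagnosis of this repo. It only assumes the documented cluster rule that the leader cannot act (for example, move a Joining member to Up) until gossip converges, and that an unreachable member blocks convergence until it is marked Down. `ToyCluster`, its fields, and the node names are all hypothetical, and the sketch says nothing about why the AssociationErrors persist.

```python
from dataclasses import dataclass, field

STUCK = "Leader can currently not perform its duties"


@dataclass
class ToyCluster:
    """Hypothetical, heavily simplified model of cluster membership."""
    # member name -> status: "Joining", "Up" or "Down"
    members: dict = field(default_factory=dict)
    # members that the failure detector currently cannot reach
    unreachable: set = field(default_factory=set)

    def can_converge(self) -> bool:
        # Assumption: any unreachable member that is not yet Down blocks convergence.
        return all(self.members[m] == "Down" for m in self.unreachable)

    def leader_actions(self) -> str:
        if not self.can_converge():
            return STUCK
        # Remove members that were marked Down.
        for name in [m for m, s in self.members.items() if s == "Down"]:
            del self.members[name]
            self.unreachable.discard(name)
        # Promote members that were waiting to join.
        for name, status in self.members.items():
            if status == "Joining":
                self.members[name] = "Up"
        return "ok"


if __name__ == "__main__":
    cluster = ToyCluster(members={"lighthouse": "Up", "crawler-old": "Up"})
    cluster.unreachable.add("crawler-old")      # crawler process killed
    cluster.members["crawler-new"] = "Joining"  # crawler restarted
    print(cluster.leader_actions(), cluster.members)
    cluster.members["crawler-old"] = "Down"     # old incarnation downed
    print(cluster.leader_actions(), cluster.members)
```

In this model the restarted crawler stays in Joining for as long as the killed incarnation is unreachable but not Down, which would match LH reporting the node as joining while nothing else makes progress. Whether that is what actually happens here would need to be confirmed against the logs.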
I am running Win10 x64, VS 2015, IIS express. No HOCON changes, everything runs on the default ports and addresses.
I wasn't sure if this should be filed here or under akka.net github repo. I am happy to move it.
Here are the screenshots after the crawler node has been restarted: