In the head node, the health check has some notion of handling the loss of a data or service node, but the logic is focused on replacing that failed node only. I believe that this can be safely expanded to also scale down from a larger to a smaller cluster, e.g. for DCOS.
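To make the idea concrete, here is a minimal sketch of what the extended health-check pass might look like. The function names and structure are illustrative only, not the actual head-node code:

```python
from typing import List

# Illustrative stand-ins for whatever the head node actually does when it
# replaces or removes a node; the real logic lives in the health check.
def replace_node(node: str) -> None:
    print(f"provisioning a replacement for {node}")

def retire_node(node: str) -> None:
    print(f"retiring {node} and rebalancing its data")

def reconcile(desired: int, expected: List[str], live: List[str]) -> None:
    """One health-check pass: replace failures, then honor the desired size."""
    failed = [n for n in expected if n not in live]
    for node in failed:
        # Current behavior replaces every failed node; only doing so while
        # the cluster is below its desired size makes scale-down cheap.
        if len(live) < desired:
            replace_node(node)
            live.append(node)
    # If the desired size shrank (e.g. a DCOS scale-down), retire the
    # surplus nodes instead of keeping them around.
    while len(live) > desired:
        retire_node(live.pop())

# Example: a 4-node cluster scaled down to 3 while node n2 happens to be
# dead; the failure is simply absorbed rather than replaced.
reconcile(3, ["n1", "n2", "n3", "n4"], ["n1", "n3", "n4"])
```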
FYI, I created a "chaos monkey" config that causes any node to randomly die: 1ba2d36.
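The gist is a loop along these lines; the node list and kill command below are placeholders, not what the commit actually uses:

```python
import random
import subprocess
import time

# Placeholder node list; the real config in 1ba2d36 drives this differently.
NODES = ["data-node-1", "data-node-2", "service-node-1", "service-node-2"]

def kill_random_node() -> None:
    """Pick a node at random and kill its server process over ssh."""
    node = random.choice(NODES)
    print(f"chaos monkey: killing {node}")
    # 'pkill -f server' is a stand-in for however the node process dies.
    subprocess.run(["ssh", node, "pkill", "-f", "server"], check=False)

if __name__ == "__main__":
    while True:
        # Random intervals so failures arrive unpredictably.
        time.sleep(random.uniform(60, 600))
        kill_random_node()
```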
I also added a status_check tool that just makes a request to /about and notes whether the server state is "READY" or not.
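Roughly equivalent to this sketch (the URL and the JSON field name are guesses, so treat it as an illustration rather than the actual tool):

```python
import json
import sys
import urllib.request

# Assumed head-node URL; adjust host/port for the actual deployment.
ABOUT_URL = "http://localhost:8080/about"

def status_check() -> bool:
    """Request /about and report whether the server says it is READY."""
    try:
        with urllib.request.urlopen(ABOUT_URL, timeout=5) as resp:
            about = json.load(resp)
    except OSError as err:
        print(f"NOT READY: request failed ({err})")
        return False
    # Assumes /about returns JSON with a 'state' field; the field name is
    # a guess, not confirmed from the source.
    state = about.get("state", "UNKNOWN")
    print(f"{'READY' if state == 'READY' else 'NOT READY'}: state={state}")
    return state == "READY"

if __name__ == "__main__":
    sys.exit(0 if status_check() else 1)
```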
Anyway, trying this out, I can occasionally see the server get into a non-recoverable state. I still need to investigate what's triggering this.
I think this check-in might fix the problem with recovering from a failed node: 91901e7.
I ran the server in "chaos monkey" mode for several hours and never got stuck in the INITIALIZING state.
Kubernetes will be a different matter, but I'm planning to put the head node back in the K8s deployment, so I'll do that first before working on scale-up/scale-down (and failed nodes) in K8s.