In the head node, the health check has some notion of handling the loss of a data or service node, but the logic is focused on replacing that failed node only. I believe that this can be safely expanded to also scale down from a larger to a smaller cluster, e.g. for DCOS.
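To make the idea concrete, here is a minimal sketch of what the extended health-check pass might look like. The function names and structure are illustrative only, not the actual head-node code:

```python
from typing import List

# Illustrative stand-ins for whatever the head node actually does when it
# replaces or removes a node; the real logic lives in the health check.
def replace_node(node: str) -> None:
    print(f"provisioning a replacement for {node}")

def retire_node(node: str) -> None:
    print(f"retiring {node} and rebalancing its data")

def reconcile(desired: int, expected: List[str], live: List[str]) -> None:
    """One health-check pass: replace failures, then honor the desired size."""
    failed = [n for n in expected if n not in live]
    for node in failed:
        # Current behavior replaces every failed node; only doing so while
        # the cluster is below its desired size makes scale-down cheap.
        if len(live) < desired:
            replace_node(node)
            live.append(node)
    # If the desired size shrank (e.g. a DCOS scale-down), retire the
    # surplus nodes instead of keeping them around.
    while len(live) > desired:
        retire_node(live.pop())

# Example: a 4-node cluster scaled down to 3 while node n2 happens to be
# dead; the failure is simply absorbed rather than replaced.
reconcile(3, ["n1", "n2", "n3", "n4"], ["n1", "n3", "n4"])
```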
FYI, I created a "chaos monkey" config that causes any node to randomly die: 1ba2d36.
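The gist is a loop along these lines; the node list and kill command below are placeholders, not what the commit actually uses:

```python
import random
import subprocess
import time

# Placeholder node list; the real config in 1ba2d36 drives this differently.
NODES = ["data-node-1", "data-node-2", "service-node-1", "service-node-2"]

def kill_random_node() -> None:
    """Pick a node at random and kill its server process over ssh."""
    node = random.choice(NODES)
    print(f"chaos monkey: killing {node}")
    # 'pkill -f server' is a stand-in for however the node process dies.
    subprocess.run(["ssh", node, "pkill", "-f", "server"], check=False)

if __name__ == "__main__":
    while True:
        # Random intervals so failures arrive unpredictably.
        time.sleep(random.uniform(60, 600))
        kill_random_node()
```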
I also added a status_check tool that just makes a request to /about and notes whether the server state is "READY" or not.
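Roughly equivalent to this sketch (the URL and the JSON field name are guesses, so treat it as an illustration rather than the actual tool):

```python
import json
import sys
import urllib.request

# Assumed head-node URL; adjust host/port for the actual deployment.
ABOUT_URL = "http://localhost:8080/about"

def status_check() -> bool:
    """Request /about and report whether the server says it is READY."""
    try:
        with urllib.request.urlopen(ABOUT_URL, timeout=5) as resp:
            about = json.load(resp)
    except OSError as err:
        print(f"NOT READY: request failed ({err})")
        return False
    # Assumes /about returns JSON with a 'state' field; the field name is
    # a guess, not confirmed from the source.
    state = about.get("state", "UNKNOWN")
    print(f"{'READY' if state == 'READY' else 'NOT READY'}: state={state}")
    return state == "READY"

if __name__ == "__main__":
    sys.exit(0 if status_check() else 1)
```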
Anyway, trying this out, I can occasionally see the server get into a non-recoverable state. I still need to investigate what's triggering this.
I think this check-in might fix the problem with recovering from a failed node: 91901e7.
I ran the server in "chaos monkey" mode for several hours and never got stuck in the INITIALIZING state.
Kubernetes will be a different matter, but I'm planning to put the head node back in the K8s deployment, so I'll do that first before working on scale-up/scale-down (and failed nodes) in K8s.