Implement smooth scale downs of a DCOS cluster #51

Closed
s004pmg opened this issue Apr 2, 2020 · 3 comments

s004pmg commented Apr 2, 2020

In the head node, the health check has some notion of handling the loss of a data or service node, but the logic is focused only on replacing that one failed node. I believe this can safely be expanded to also scale down from a larger to a smaller cluster, e.g. for DCOS.
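For illustration, here is a minimal sketch of the proposed behavior. All of the names below (reconcile, launch_node, retire_node, the node-list shape) are hypothetical and not taken from the HSDS head-node code; the point is just that the health check would reconcile toward a target node count instead of treating every shortfall as a failure to repair:

```python
# Hypothetical sketch only: reconcile(), launch_node(), and retire_node()
# are illustrative names, not functions from the HSDS codebase.

def launch_node(node_id):
    print(f"launching replacement node {node_id}")

def retire_node(node_id):
    print(f"draining and retiring node {node_id}")

def reconcile(active_nodes, target_count):
    """Adjust the cluster toward target_count instead of only replacing failures."""
    if len(active_nodes) < target_count:
        # Existing behavior: a data/service node was lost, so replace it.
        for node_id in range(len(active_nodes), target_count):
            launch_node(node_id)
    elif len(active_nodes) > target_count:
        # Proposed extension: the target was lowered (e.g. a DCOS scale down),
        # so retire the surplus nodes rather than treating the shrink as a failure.
        surplus = len(active_nodes) - target_count
        for node_id in sorted(active_nodes, reverse=True)[:surplus]:
            retire_node(node_id)

reconcile(active_nodes=[0, 1, 2, 3], target_count=2)
```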


jreadey commented Apr 3, 2020

FYI, I created a "chaos monkey" config that causes a random node to die: 1ba2d36.
Also a status_check tool that just makes a request to /about and notes whether the server state is "READY" or not.

Anyway, trying this out I can occasionally see the server go into a non-recoverable state. Still need to investigate what's triggering this.
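For reference, a status_check-style poller could look like the sketch below. Only the /about path and the "READY" state come from the comment above; the port, polling interval, and the "state" JSON key are assumptions, so treat this as illustrative rather than the actual tool:

```python
# Illustrative poller, not the actual status_check tool. The port and the
# "state" JSON key are assumptions; /about and "READY" are from the comment above.
import time
import requests

SERVER = "http://localhost:5101"  # assumed HSDS endpoint

def check_status():
    try:
        rsp = requests.get(f"{SERVER}/about", timeout=5)
        rsp.raise_for_status()
        state = rsp.json().get("state")
    except (requests.RequestException, ValueError) as e:
        print(f"request failed: {e}")
        return False
    print(f"server state: {state}")
    return state == "READY"

if __name__ == "__main__":
    while True:
        if not check_status():
            print("server is not READY")
        time.sleep(10)
```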


jreadey commented Apr 6, 2020

I think this checkin might fix the problem with recovering from a failed node: 91901e7.
I ran the server in "chaos monkey" mode for several hours and never got stuck in the INITIALIZING state.

Kubernetes will be a different matter, but I'm planning to put the head node back into the K8s deployment, so I'll do that first before working on scale up/scale down (and failed-node handling) in K8s.


jreadey commented Jun 12, 2020

Hopefully the checkin above resolved this issue. Please reopen if it pops up again.

jreadey closed this as completed Jun 12, 2020