This repository has been archived by the owner on Apr 30, 2020. It is now read-only.

Automatic replacement of failed nodes #20

Open
JohnStrunk opened this issue Jun 27, 2018 · 0 comments
Labels
epic (Large, multi-issue feature set), needs-subtasks (Issue needs to be sub-divided into smaller items)


JohnStrunk commented Jun 27, 2018

Describe the feature you'd like to have.
When a Gluster pod fails, kube will attempt to restart it; if the failure was a simple crash or another transient problem, that restart (plus automatic heal) should be sufficient to repair the system. However, if the node's state becomes corrupt or is lost, it may be necessary to remove the failed node from the cluster and potentially spawn a new one to take its place.
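
To make the intended flow concrete, here is a minimal, self-contained sketch of that decision (restart-and-wait for transient failures vs. full replacement after prolonged downtime). It is not the operator's implementation; every name in it (`glusterNode`, `decideAction`, the offline fields) is hypothetical:

```go
// Illustrative only: the types and thresholds below are hypothetical and
// not part of the operator or GD2.
package main

import (
	"fmt"
	"time"
)

type glusterNode struct {
	name         string
	offline      bool
	offlineSince time.Time
}

// decideAction sketches the decision described above: do nothing for a
// healthy node, wait out a transient failure, and only replace a node
// that has exceeded the permissible downtime.
func decideAction(n glusterNode, permissibleDowntime time.Duration) string {
	switch {
	case !n.offline:
		return "none" // healthy, or Kubernetes already restarted the pod
	case time.Since(n.offlineSince) < permissibleDowntime:
		return "wait" // transient failure; give the restart + heal a chance
	default:
		// State presumed lost: GD2 migrates the bricks, then the operator
		// removes the node from the TSP and spawns a replacement.
		return "replace"
	}
}

func main() {
	n := glusterNode{name: "gluster-2", offline: true, offlineSince: time.Now().Add(-2 * time.Hour)}
	fmt.Println(decideAction(n, time.Hour)) // "replace"
}
```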

What is the value to the end user? (why is it a priority?)
If a gluster node (pod) remains offline, the associated bricks will have a reduced level of availability & reliability. Being able to automatically repair failures will help increase system availability and protect users' data.

How will we know we have a good solution? (acceptance criteria)

  • Kubernetes will act as the first line of defense, restarting failed Gluster pods
  • A Gluster pod that remains offline from the Gluster cluster for an extended period of time will have its bricks moved to other Gluster nodes (by GD2). The permissible downtime should be configurable.
  • Gluster nodes that have been "abandoned" by GD2 should be removed from the TSP (trusted storage pool) and destroyed by the operator
  • Ability to mark a node via the CR such that it will not be subject to replacement (neither abandonment by GD2 nor destruction by the operator); see the sketch after this list. This is necessary in cases where a Gluster node is expected to be temporarily unavailable (e.g., scheduled downtime or other maintenance).
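
For illustration only, the configurable downtime and the per-node exemption above could be surfaced as CR spec fields. The type and field names below (`GlusterClusterSpec`, `permissibleDowntime`, `doNotReplace`, `autoReplaceNodes`) are hypothetical and not defined by this issue or by any existing CRD:

```go
// Hypothetical sketch only; the real CRD layout is not defined by this issue.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// GlusterClusterSpec sketches cluster-wide settings for automatic node
// replacement.
type GlusterClusterSpec struct {
	// PermissibleDowntime is how long a Gluster pod may remain offline
	// before GD2 starts migrating its bricks to other nodes.
	PermissibleDowntime metav1.Duration `json:"permissibleDowntime,omitempty"`

	// AutoReplaceNodes lets the operator remove abandoned nodes from the
	// TSP and spawn replacements.
	AutoReplaceNodes bool `json:"autoReplaceNodes,omitempty"`
}

// GlusterNodeSpec sketches the per-node exemption from the last bullet.
type GlusterNodeSpec struct {
	// DoNotReplace marks a node that is expected to be temporarily
	// unavailable (e.g., scheduled maintenance) so it is neither
	// abandoned by GD2 nor destroyed by the operator.
	DoNotReplace bool `json:"doNotReplace,omitempty"`
}
```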

Additional context
This relies on the node state machine (#17) and an as-yet-unimplemented GD2 automigration plugin.

@JohnStrunk JohnStrunk added the epic and needs-subtasks labels Jun 27, 2018
@JohnStrunk JohnStrunk added this to the 1.0 milestone Jun 27, 2018
@JohnStrunk JohnStrunk added this to Incoming in Planning via automation Jun 27, 2018
@JohnStrunk JohnStrunk moved this from Incoming to Epics in Planning Jun 28, 2018
@JohnStrunk JohnStrunk removed this from the 1.0 milestone Sep 24, 2018