[feasibility-research] Handle machine failure #900

gaocegege · 2018-12-20T05:04:26Z

According to the restart policy docs (https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy), once a Pod is bound to a node, it will never be rebound to another node.

Thus, when a node goes down, we should try to handle the failure.

There are some options to achieve this:

- Use descheduler
- Implement the logic in the operator.
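
A minimal sketch of what the second option could look like, only to make the idea concrete. It assumes simplified stand-in types (`Pod`, `Node`) and a hypothetical helper `podsOnFailedNodes`; none of these are the operator's or Kubernetes' actual API. It only finds Pods bound to a node that is missing or not Ready; deciding when it is safe to delete and recreate them is left open.

```go
package main

import "fmt"

// Pod and Node are simplified, hypothetical stand-ins for the
// Kubernetes API objects; only the fields needed here are modeled.
type Pod struct {
	Name     string
	NodeName string // empty if the Pod has not been scheduled yet
}

type Node struct {
	Name  string
	Ready bool
}

// podsOnFailedNodes returns the names of Pods bound to a node that is
// either unknown to the cluster or not Ready. Unscheduled Pods are skipped.
func podsOnFailedNodes(pods []Pod, nodes []Node) []string {
	ready := make(map[string]bool, len(nodes))
	for _, n := range nodes {
		ready[n.Name] = n.Ready
	}

	var stranded []string
	for _, p := range pods {
		if p.NodeName == "" {
			continue // not bound to any node yet
		}
		if !ready[p.NodeName] { // a missing node reads as false
			stranded = append(stranded, p.Name)
		}
	}
	return stranded
}

func main() {
	nodes := []Node{
		{Name: "node-a", Ready: true},
		{Name: "node-b", Ready: false},
	}
	pods := []Pod{
		{Name: "worker-0", NodeName: "node-a"},
		{Name: "worker-1", NodeName: "node-b"},
		{Name: "ps-0", NodeName: "node-c"}, // node-c no longer exists
	}
	fmt.Println(podsOnFailedNodes(pods, nodes)) // prints [worker-1 ps-0]
}
```

In a real operator the node state would come from the API server rather than from in-memory slices.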

I will try to investigate the cost of this feature.

jlewi · 2019-02-04T18:32:08Z

@gaocegege Doesn't the CR already handle this? Won't the reconcile logic detect that the pod isn't running and create a new one?

@johnugeorge @richardsliu

gaocegege · 2019-07-29T08:15:52Z

I think it cannot solve the split-brain problem.

jtfogarty · 2020-01-14T20:35:24Z

/area engprod
/priority p2

stale · 2020-04-25T21:14:29Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

gaocegege added the addition/feature label Dec 20, 2018

johnugeorge mentioned this issue Mar 12, 2019

Delete pod with unknown status in reconcilePods #956

Closed

jlewi added kind/feature and removed addition/feature labels Aug 28, 2019

jlewi added this to To Do in Needs Triage Nov 26, 2019

k8s-ci-robot added area/engprod priority/p2 labels Jan 14, 2020

jtfogarty moved this from To Do to Assigned to Area Owner For Triage in Needs Triage Jan 14, 2020

jbottum added area/tfjob and removed area/engprod labels Jan 26, 2020

stale bot added the lifecycle/stale label Apr 25, 2020

stale bot closed this as completed May 2, 2020

Needs Triage automation moved this from Assigned to Area Owner For Triage to Closed May 2, 2020
