When a node goes down, the status of any pod running on that node becomes Unknown. The training job cannot recover because the pod is never rescheduled to a healthy node.
According to kubernetes/kubernetes#51333, Kubernetes will not delete a pod on a node that is down. I think tf-operator should handle this case by deleting pods with Unknown status so that replacement pods are created and scheduled onto healthy nodes.
I think the change is to add a check for pods with Unknown status in the `reconcilePods` function: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1beta2/tensorflow/pod.go#L103
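A minimal sketch of what that check could look like, assuming the controller has access to the job's pods and a Kubernetes clientset; the helper name `deleteUnknownPods` and the direct use of client-go are illustrative assumptions, since the real controller would likely route deletion through its own pod control helper:

```go
package main

import (
	"context"
	"log"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteUnknownPods removes pods whose phase is Unknown (typically because
// their node is unreachable) so the reconcile loop can create replacements
// that the scheduler will place on healthy nodes.
func deleteUnknownPods(ctx context.Context, client kubernetes.Interface, pods []*v1.Pod) error {
	for _, pod := range pods {
		if pod.Status.Phase != v1.PodUnknown {
			continue
		}
		log.Printf("deleting pod %s/%s with unknown status", pod.Namespace, pod.Name)
		// Deleting the stuck pod frees the controller to recreate it on the
		// next reconcile pass.
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```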