When a node goes down, the status of any pod running on that node becomes Unknown. The training job cannot recover because the pod is never rescheduled to a healthy node.
According to kubernetes/kubernetes#51333, Kubernetes will not delete a pod on a node that is down. I think tf-operator should handle this case by deleting pods with Unknown status so that replacement pods are created and scheduled onto healthy nodes.
I think the change is to add a check for pods with Unknown status in the `reconcilePods` function: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1beta2/tensorflow/pod.go#L103
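A minimal sketch of what that check could look like, assuming the controller has access to the job's pods and a Kubernetes clientset; the helper name `deleteUnknownPods` and the direct use of client-go are illustrative assumptions, since the real controller would likely route deletion through its own pod control helper:

```go
package main

import (
	"context"
	"log"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// deleteUnknownPods removes pods whose phase is Unknown (typically because
// their node is unreachable) so the reconcile loop can create replacements
// that the scheduler will place on healthy nodes.
func deleteUnknownPods(ctx context.Context, client kubernetes.Interface, pods []*v1.Pod) error {
	for _, pod := range pods {
		if pod.Status.Phase != v1.PodUnknown {
			continue
		}
		log.Printf("deleting pod %s/%s with unknown status", pod.Namespace, pod.Name)
		// Deleting the stuck pod frees the controller to recreate it on the
		// next reconcile pass.
		if err := client.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```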