Delete pod with unknown status in reconcilePods #956

Closed
zionwu opened this issue Mar 11, 2019 · 3 comments

zionwu commented Mar 11, 2019

When a node goes down, the status of the pods running on that node becomes Unknown. The training job cannot recover because the affected pod is never rescheduled to a healthy node.

According to kubernetes/kubernetes#51333, Kubernetes does not delete the pods on a node that is down. I think tf-operator should handle this case by deleting pods with Unknown status, so that a new pod is created and scheduled to a healthy node.

I think the change is to add a check for the Unknown pod status in the reconcilePods function:
https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1beta2/tensorflow/pod.go#L103
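
A minimal sketch of the proposed check, not the actual tf-operator code. It assumes simplified stand-in types (`Pod`, `PodPhase`) in place of the real Kubernetes `corev1.Pod`, and `podsToDelete` is a hypothetical helper name. In the operator, the selected pods would then be deleted through its pod control so that the next reconcile creates replacements.

```go
package main

import "fmt"

// PodPhase is a stand-in for corev1.PodPhase; the string values match
// the phase names Kubernetes reports.
type PodPhase string

const (
	PodRunning PodPhase = "Running"
	PodUnknown PodPhase = "Unknown"
)

// Pod is a simplified stand-in for corev1.Pod, for illustration only.
type Pod struct {
	Name  string
	Phase PodPhase
}

// podsToDelete returns the names of pods whose phase is Unknown.
// Hypothetical helper: it is not part of tf-operator.
func podsToDelete(pods []Pod) []string {
	var names []string
	for _, p := range pods {
		if p.Phase == PodUnknown {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	pods := []Pod{
		{Name: "worker-0", Phase: PodRunning},
		{Name: "worker-1", Phase: PodUnknown},
	}
	// A real controller would issue a delete call here instead of printing.
	for _, name := range podsToDelete(pods) {
		fmt.Println("would delete", name)
	}
}
```

Keeping the selection separate from the deletion makes the check easy to unit test without a cluster.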

zionwu commented Mar 12, 2019

@johnugeorge @richardsliu @gaocegege any thoughts?

johnugeorge (Member) commented

Related: #900

stale bot commented Apr 20, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Apr 27, 2020