
How to handle long pending pods in a TF-job? #1282

Closed
merlintang opened this issue Jun 8, 2021 · 9 comments

Comments

@merlintang

In the production environment, we sometimes run into pod scheduling problems.

For example, a pod fails to mount its volume, fails to pull its image, or cannot be scheduled because the related node is broken. In these cases, the pods stay in the Pending state. As a result, we have to ask users to delete the current TFJob and start a new job.

However, this wastes resources. For example, suppose we have a TFJob with 100 workers: 99 workers have started and only one pod is pending. After a period of time, all pods except the pending one have already spent resources on training, so it is not a good idea to restart the whole job.

Therefore, we hope the TF operator can retry these long-pending pods when they match certain rules. We would like to hear your advice. What do you think? Thanks in advance.

@johnugeorge
Member

Eventually, the pod should get into the Running state, shouldn't it? Why does it remain Pending?

@merlintang
Author

Thanks for your reply. The pod would stay in the Pending state (a.k.a. hang) forever. We have to run a monitor script to find these pods and clean them up, and as a result the TFJob fails. Thus, we hope the TF operator can find these pods and restart them.
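For illustration, a minimal sketch of such a monitor script, assuming client-go and using a hypothetical namespace, label selector, and pending timeout (none of which come from this discussion), could look like this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// pendingTimeout is a hypothetical threshold; the issue does not define one.
const pendingTimeout = 10 * time.Minute

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// The namespace and label selector are assumptions for illustration.
	pods, err := client.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "job-name=my-tfjob",
	})
	if err != nil {
		panic(err)
	}

	for _, pod := range pods.Items {
		// Flag pods that have been Pending for longer than the timeout.
		if pod.Status.Phase == corev1.PodPending &&
			time.Since(pod.CreationTimestamp.Time) > pendingTimeout {
			fmt.Printf("pod %s has been Pending for more than %v\n", pod.Name, pendingTimeout)
		}
	}
}
```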

@johnugeorge
Member

Interesting. Does TensorFlow support restarting one or more workers?

/cc @gaocegege

@gaocegege
Member

gaocegege commented Jun 16, 2021

Interesting. Does TensorFlow support restarting one or more workers?

It depends on the logic of the training script.

@gaocegege
Member

I think the problem is how to find these bad pods. We cannot tell if the pod is actually pending or hanging.

@merlintang
Author

Can we have a whitelist to record the cases where these pods hang forever? Then the TF operator can find the pods to restart based on the whitelist.

@johnugeorge
Member

@merlintang Can you elaborate more on your proposal?

@merlintang
Author

For these pending pods, we can inspect the related container state.

For example, if one of a pod's containers (an init container or the job container) is in "ImagePullBackOff", "CreateContainerConfigError", or "CreateContainerError", and its restart_count > upper_bound, we can say this pod cannot resume work. In that case, we need to restart the pod, so the scheduler allocates it to a new node.

In this way, we can avoid a job pending forever.
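A sketch of this proposed rule, assuming client-go pod types, with the waiting-reason list taken from the comment above and the restart threshold chosen only for illustration:

```go
package podcheck

import corev1 "k8s.io/api/core/v1"

// badWaitingReasons lists the container waiting reasons from the comment above
// that are treated as unrecoverable.
var badWaitingReasons = map[string]bool{
	"ImagePullBackOff":           true,
	"CreateContainerConfigError": true,
	"CreateContainerError":       true,
}

// restartUpperBound is a hypothetical threshold, not a value from the operator.
const restartUpperBound = 3

// podUnrecoverable reports whether the pod should be deleted and recreated so
// that the scheduler can place it on another node.
func podUnrecoverable(pod *corev1.Pod) bool {
	// Check init containers as well as regular job containers.
	statuses := append([]corev1.ContainerStatus{}, pod.Status.InitContainerStatuses...)
	statuses = append(statuses, pod.Status.ContainerStatuses...)
	for _, cs := range statuses {
		if cs.State.Waiting != nil &&
			badWaitingReasons[cs.State.Waiting.Reason] &&
			cs.RestartCount > restartUpperBound {
			return true
		}
	}
	return false
}
```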

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 2, 2022
@stale stale bot closed this as completed Apr 16, 2022