
How to handle long pending pods in a TF-job? #1282

Closed
merlintang opened this issue Jun 8, 2021 · 9 comments

Comments

@merlintang

In the production environment, we sometimes run into pod scheduling problems.

For example, a pod fails to mount its volume, fails to pull its image, or cannot be scheduled because the related node is broken. In these cases, the pods stay in the Pending state. As a result, we have to ask users to delete the current TFJob and start a new job.

However, this wastes resources. For example, suppose we have a TFJob with 100 workers: 99 workers have started and only one pod is pending. After a period of time, all pods except the pending one have already spent resources on training, so it is not a good idea to restart the whole job.

Therefore, we hope the TF operator can retry these long-pending pods when they match certain rules. We would like to hear your advice. What do you think? Thanks in advance.

@johnugeorge
Member

Eventually, the pod should get into the Running state, shouldn't it? Why does it remain Pending?

@merlintang
Author

Thanks for your reply. The pod would stay in the Pending state (a.k.a. hang) forever. We have to run a monitor script to find these pods and clean them up, and as a result the TFJob fails. Thus, we hope the TF operator can find these pods and restart them.
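For illustration, a minimal sketch of such a monitor script, assuming client-go and using a hypothetical namespace, label selector, and pending timeout (none of which come from this discussion), could look like this:

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// pendingTimeout is a hypothetical threshold; the issue does not define one.
const pendingTimeout = 10 * time.Minute

func main() {
	// Load kubeconfig from the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// The namespace and label selector are assumptions for illustration.
	pods, err := client.CoreV1().Pods("default").List(context.TODO(), metav1.ListOptions{
		LabelSelector: "job-name=my-tfjob",
	})
	if err != nil {
		panic(err)
	}

	for _, pod := range pods.Items {
		// Flag pods that have been Pending for longer than the timeout.
		if pod.Status.Phase == corev1.PodPending &&
			time.Since(pod.CreationTimestamp.Time) > pendingTimeout {
			fmt.Printf("pod %s has been Pending for more than %v\n", pod.Name, pendingTimeout)
		}
	}
}
```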

@johnugeorge
Member

Interesting. Does TensorFlow support restarting one or more workers?

/cc @gaocegege

@gaocegege
Member

gaocegege commented Jun 16, 2021

Interesting. Does TensorFlow support restarting one or more workers?

It depends on the logic of the training script.

@gaocegege
Member

I think the problem is how to find these bad pods. We cannot tell if the pod is actually pending or hanging.

@merlintang
Author

Can we have a whitelist to record the cases where these pods hang forever? Then the TF operator can find the pods to restart based on the whitelist.

@johnugeorge
Member

@merlintang Can you elaborate more on your proposal?

@merlintang
Author

For these pending pods, we can inspect the related container state.

For example, if one of a pod's containers (an init container or the job container) is in "ImagePullBackOff", "CreateContainerConfigError", or "CreateContainerError", and its restart_count > upper_bound, we can say this pod cannot resume work. In that case, we need to restart the pod, so the scheduler allocates it to a new node.

In this way, we can avoid a job pending forever.
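A sketch of this proposed rule, assuming client-go pod types, with the waiting-reason list taken from the comment above and the restart threshold chosen only for illustration:

```go
package podcheck

import corev1 "k8s.io/api/core/v1"

// badWaitingReasons lists the container waiting reasons from the comment above
// that are treated as unrecoverable.
var badWaitingReasons = map[string]bool{
	"ImagePullBackOff":           true,
	"CreateContainerConfigError": true,
	"CreateContainerError":       true,
}

// restartUpperBound is a hypothetical threshold, not a value from the operator.
const restartUpperBound = 3

// podUnrecoverable reports whether the pod should be deleted and recreated so
// that the scheduler can place it on another node.
func podUnrecoverable(pod *corev1.Pod) bool {
	// Check init containers as well as regular job containers.
	statuses := append([]corev1.ContainerStatus{}, pod.Status.InitContainerStatuses...)
	statuses = append(statuses, pod.Status.ContainerStatuses...)
	for _, cs := range statuses {
		if cs.State.Waiting != nil &&
			badWaitingReasons[cs.State.Waiting.Reason] &&
			cs.RestartCount > restartUpperBound {
			return true
		}
	}
	return false
}
```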

@stale

stale bot commented Mar 2, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Mar 2, 2022
@stale stale bot closed this as completed Apr 16, 2022