
Distributed TensorFlow training job stuck in 'Running' state, model is done training #4186

Closed
l-baker opened this issue Sep 25, 2019 · 3 comments

Comments

@l-baker

l-baker commented Sep 25, 2019

/kind bug

What steps did you take and what happened:

I am doing distributed training using TensorFlow (on GKE), and the job never moves from the 'Running' state to a successful state because the chief and parameter server pods never stop running.

I am using the object_detection library provided in tensorflow/models v1.11 (commit 23b5b42) and the provided pets example. (I am using pets as a minimal working example; the issue is the same in my own object detection use case.) The TFJob goes through the training process (reaches the max number of steps, saves checkpoints) but does not complete: the workers reach the 'Succeeded' state, the chief and PS stay active indefinitely, and the evaluator repeatedly succeeds and then returns to an active state.
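
For reference, part of this behavior is expected in TF 1.x distributed training: each pod reads the TF_CONFIG environment variable that the tf-operator injects, and a parameter server task blocks on server.join() forever, relying on the operator to tear it down once the chief finishes. A minimal sketch (assuming TF 1.x APIs; this is not the actual object_detection entrypoint):

```python
# Minimal sketch, assuming TF 1.x APIs and the TF_CONFIG layout that
# tf-operator injects into each replica; this is not the actual
# object_detection entrypoint.
import json
import os

import tensorflow as tf

tf_config = json.loads(os.environ["TF_CONFIG"])  # set by tf-operator in each pod
cluster = tf.train.ClusterSpec(tf_config["cluster"])
task_type = tf_config["task"]["type"]    # "chief", "worker", "ps", or "evaluator"
task_index = tf_config["task"]["index"]

if task_type == "ps":
    server = tf.train.Server(cluster, job_name="ps", task_index=task_index)
    server.join()  # blocks indefinitely by design; the pod stops only when killed
```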

The steps I took:

  • Packaged the object detection code and procured the dataset as instructed by the object_detection library example. Made one change to the code, as noted in the additional-info section below.

  • Deployed a cluster on GKE following the CLI instructions.

  • Created a Docker container to run the pets example, then wrote and applied the YAML for this training job.

What did you expect to happen:

After the max number of training steps was reached and the evaluation of the last model checkpoint had finished, I expected the pods that were still active (chief, evaluator, PS) to move to the 'Succeeded' state and the job to complete.

Anything else you would like to add:

To get the pets example to run, I had to make one change to the object_detection/model_lib.py file on line 390, from 'category_index.values(),' to 'list(category_index.values()),', to fix bug #4780 in the tensorflow/models repository.
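
For clarity, the change is needed because dict.values() returns a view object in Python 3, not a list, so downstream code that indexes it fails. A sketch (the surrounding model_lib.py code is omitted; the variable name here is illustrative):

```python
# Sketch of the one-line fix; surrounding model_lib.py code omitted.
category_index = {1: {"id": 1, "name": "Abyssinian"}}  # illustrative entry

# Before (a dict view, which cannot be indexed under Python 3):
#   categories = category_index.values()
# After (materialized as a list, restoring Python 2-style behavior):
categories = list(category_index.values())
```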

One accidental discovery while I was trying to fix this: if I delete and re-apply the TFJob, but the model directory for training already contains a completed run (i.e., I forgot to empty or change the directory from a previous test, so there are model.ckpt and event files in which the max number of steps was reached), then all pods go to a 'Succeeded' state and the TFJob ends with a 'Succeeded' message.
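
That observation is consistent with the documented behavior of Estimator.train(max_steps=N) in TF 1.x: if the latest checkpoint in model_dir already has global_step >= N, train() returns immediately, so every replica can exit right away. A minimal self-contained sketch (the model, input function, and model_dir path are all hypothetical, not taken from the pets example):

```python
# Minimal sketch of the relevant Estimator semantics (TF 1.x); the model,
# input function, and model_dir path are hypothetical.
import tensorflow as tf

def my_input_fn():  # trivial stand-in input pipeline
    return tf.data.Dataset.from_tensors(({"x": [[1.0]]}, [[1.0]])).repeat()

def my_model_fn(features, labels, mode):  # trivial stand-in model
    preds = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, preds)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=my_model_fn,
                                   model_dir="/tmp/pets-previous-run")

# train() returns immediately if the latest checkpoint in model_dir already
# has global_step >= max_steps, which matches the observation above.
estimator.train(input_fn=my_input_fn, max_steps=200000)
```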

In case it is helpful, I attached the Dockerfile for the CPU image; the GPU image is the same except that its first line is 'from tensorflow/tensorflow:1.9.0-gpu-py3', and both images were built from the models/research directory.

Environment:

  • Kubeflow version: (version number can be found at the bottom left corner of the Kubeflow dashboard): Unsure; I can't view the UI (ERR_CONNECTION_CLOSED). Is there a way to get this via the CLI?

  • kfctl version: (use kfctl version): v0.6.2-0-g47a0e4c7

  • Kubernetes platform: GKE

  • Kubernetes version: (use kubectl version): Server version is v1.12.10-gke.5, client version is v1.12.9-gke.7

  • OS (e.g. from /etc/os-release): Ubuntu 16.04.4 LTS

  • TensorFlow version: 1.9

kubectl-describe-tfjob.txt
kubectl-logs-chief.txt
kubectl-logs-evaluator.txt
kubectl-logs-tfoperator.txt
Dockerfile-cpu.txt

@issue-label-bot

Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.99. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

@jlewi
Contributor

jlewi commented Sep 27, 2019

/cc @johnugeorge
/cc @richardsliu

jlewi added the area/tfjob and priority/p1 labels Sep 27, 2019
jlewi added this to To do in KF1.0 via automation Sep 27, 2019
@stale

stale bot commented Dec 26, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Jan 2, 2020
KF1.0 automation moved this from To do to Done Jan 2, 2020