
Distributed TensorFlow training job stuck in 'Running' state, model is done training #4186

Closed
l-baker opened this issue Sep 25, 2019 · 3 comments

Comments

@l-baker

l-baker commented Sep 25, 2019

/kind bug

What steps did you take and what happened:

I am doing distributed training using TensorFlow (on GKE), and the job never moves from the 'Running' state to a successful state because the chief and parameter server pods never stop running.

I am using the object_detection library provided in tensorflow/models v1.11 (commit 23b5b42) and the provided pets example. (I am using pets as a minimal working example; the issue is the same in my own object detection use case.) The TFJob goes through the training process (reaches the max number of steps, saves checkpoints) but does not complete: the workers reach the 'Succeeded' state, the chief and PS stay active indefinitely, and the evaluator repeatedly succeeds and then returns to an active state.
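
For reference, part of this behavior is expected in TF 1.x distributed training: each pod reads the TF_CONFIG environment variable that the tf-operator injects, and a parameter server task blocks on server.join() forever, relying on the operator to tear it down once the chief finishes. A minimal sketch (assuming TF 1.x APIs; this is not the actual object_detection entrypoint):

```python
# Minimal sketch, assuming TF 1.x APIs and the TF_CONFIG layout that
# tf-operator injects into each replica; this is not the actual
# object_detection entrypoint.
import json
import os

import tensorflow as tf

tf_config = json.loads(os.environ["TF_CONFIG"])  # set by tf-operator in each pod
cluster = tf.train.ClusterSpec(tf_config["cluster"])
task_type = tf_config["task"]["type"]    # "chief", "worker", "ps", or "evaluator"
task_index = tf_config["task"]["index"]

if task_type == "ps":
    server = tf.train.Server(cluster, job_name="ps", task_index=task_index)
    server.join()  # blocks indefinitely by design; the pod stops only when killed
```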

The steps I took:

  • Packaged the object detection code and procured the dataset as instructed by the object_detection library example. Made one change to the code, as noted in the additional-info section below.

  • Deployed a cluster on GKE following the CLI instructions.

  • Created a Docker container to run the pets example, then wrote and applied the YAML for this training job.

What did you expect to happen:

After the max number of training steps was reached and the evaluation of the last model checkpoint had finished, I expected the pods that were still active (chief, evaluator, PS) to move to the 'Succeeded' state and the job to complete.

Anything else you would like to add:

To get the pets example to run, I had to make one change to the object_detection/model_lib.py file on line 390, from 'category_index.values(),' to 'list(category_index.values()),', to fix bug #4780 in the tensorflow/models repository.
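
For clarity, the change is needed because dict.values() returns a view object in Python 3, not a list, so downstream code that indexes it fails. A sketch (the surrounding model_lib.py code is omitted; the variable name here is illustrative):

```python
# Sketch of the one-line fix; surrounding model_lib.py code omitted.
category_index = {1: {"id": 1, "name": "Abyssinian"}}  # illustrative entry

# Before (a dict view, which cannot be indexed under Python 3):
#   categories = category_index.values()
# After (materialized as a list, restoring Python 2-style behavior):
categories = list(category_index.values())
```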

One accidental discovery while I was trying to fix this: if I delete and re-apply the TFJob, but the model directory for training already contains a completed run (i.e., I forgot to empty or change the directory from a previous test, so there are model.ckpt and event files in which the max number of steps was reached), then all pods go to a 'Succeeded' state and the TFJob ends with a 'Succeeded' message.
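
That observation is consistent with the documented behavior of Estimator.train(max_steps=N) in TF 1.x: if the latest checkpoint in model_dir already has global_step >= N, train() returns immediately, so every replica can exit right away. A minimal self-contained sketch (the model, input function, and model_dir path are all hypothetical, not taken from the pets example):

```python
# Minimal sketch of the relevant Estimator semantics (TF 1.x); the model,
# input function, and model_dir path are hypothetical.
import tensorflow as tf

def my_input_fn():  # trivial stand-in input pipeline
    return tf.data.Dataset.from_tensors(({"x": [[1.0]]}, [[1.0]])).repeat()

def my_model_fn(features, labels, mode):  # trivial stand-in model
    preds = tf.layers.dense(features["x"], 1)
    loss = tf.losses.mean_squared_error(labels, preds)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=tf.train.get_global_step())
    return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

estimator = tf.estimator.Estimator(model_fn=my_model_fn,
                                   model_dir="/tmp/pets-previous-run")

# train() returns immediately if the latest checkpoint in model_dir already
# has global_step >= max_steps, which matches the observation above.
estimator.train(input_fn=my_input_fn, max_steps=200000)
```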

In case it is helpful, I attached the Dockerfile for the CPU image; the GPU image is the same except that its first line is 'from tensorflow/tensorflow:1.9.0-gpu-py3', and both images were built from the models/research directory.

Environment:

  • Kubeflow version: (version number can be found at the bottom left corner of the Kubeflow dashboard): Unsure; I can't view the UI (ERR_CONNECTION_CLOSED). Is there a way to get this via the CLI?

  • kfctl version: (use kfctl version): v0.6.2-0-g47a0e4c7

  • Kubernetes platform: GKE

  • Kubernetes version: (use kubectl version): Server version is v1.12.10-gke.5, client version is v1.12.9-gke.7

  • OS (e.g. from /etc/os-release): Ubuntu 16.04.4 LTS

  • TensorFlow version: 1.9

kubectl-describe-tfjob.txt
kubectl-logs-chief.txt
kubectl-logs-evaluator.txt
kubectl-logs-tfoperator.txt
Dockerfile-cpu.txt

@issue-label-bot

Issue-Label Bot is automatically applying the label kind/bug to this issue, with a confidence of 0.99. Please mark this comment with 👍 or 👎 to give our bot feedback!

Links: app homepage, dashboard and code for this bot.

@jlewi
Contributor

jlewi commented Sep 27, 2019

/cc @johnugeorge
/cc @richardsliu

jlewi added the area/tfjob and priority/p1 labels Sep 27, 2019
jlewi added this to To do in KF1.0 via automation Sep 27, 2019
@stale

stale bot commented Dec 26, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot closed this as completed Jan 2, 2020
KF1.0 automation moved this from To do to Done Jan 2, 2020