/kind bug
What steps did you take and what happened:
I am doing distributed training using TensorFlow (on GKE), and the job never goes from the Running state to the Succeeded state because the chief and parameter server pods never stop running.
I am using the object_detection library provided in tensorflow/models v1.11 (commit 23b5b42) and the provided pets example. (I am using pets as a minimal working example; the issue is the same in my own object detection use case.) The TFJob goes through the training process (reaches the max number of steps, saves checkpoints) but does not complete. The Workers reach the 'success' state, the Chief and PS stay active indefinitely, and the Evaluator repeatedly succeeds and then returns to an active state.
The steps I took:
Packaged the object detection code and obtained the dataset as instructed by the object_detection library example. I made one change to the code, as noted in the additional info section.
Deployed a cluster on GKE following the CLI instructions.
Created a Docker image for the pets example, then wrote and applied the YAML for this training job.
What did you expect to happen:
After the max number of training steps was reached and the evaluation of the last model checkpoint had finished, I expected the pods that were still Active (Chief, Evaluator, PS) to move to the Success state and the job to complete.
Anything else you would like to add:
To get the pets example to run, I had to make one change to the object_detection/model_lib.py file on line 390, from 'category_index.values(),' to 'list(category_index.values()),' to fix bug #4780 in the tensorflow/models repository.
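For context, this change matters on Python 3 because `dict.values()` returns a view object rather than a list, so code that needs a real list has to convert it explicitly. A minimal sketch of the difference, using a made-up `category_index` (the real one is built by the object_detection library from the label map, and the entries here are only placeholders):

```python
# Hypothetical stand-in for the category_index built from the label map.
category_index = {
    1: {"id": 1, "name": "Abyssinian"},
    2: {"id": 2, "name": "american_bulldog"},
}

# On Python 3 this is a dict view, not a list.
values_view = category_index.values()

# Wrapping it in list() gives an actual list, as in the one-line change above.
values_list = list(category_index.values())

print(type(values_view).__name__)
print(isinstance(values_view, list))
print(isinstance(values_list, list))
```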
One accidental discovery while I was trying to fix this: if I delete and re-apply the TFJob while the model directory already contains a completed run (i.e., I forgot to empty or change the directory from a previous test, so it holds model.ckpt and event files from a run that reached the max number of steps), then all pods go to a Success state and the TFJob ends with a 'Succeeded' message.
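To tell in advance whether a model directory is in this "already finished" state, one option is to look at the step numbers in the checkpoint file names. The sketch below is only an illustration: it assumes checkpoints are named `model.ckpt-<step>.<suffix>`, the helper names and the example file list are hypothetical, and it says nothing about how TensorFlow or the TFJob operator actually decide that a run is done.

```python
import re


def latest_checkpoint_step(filenames):
    """Return the highest step among names like 'model.ckpt-1000.index', or None."""
    pattern = re.compile(r"^model\.ckpt-(\d+)\.")
    steps = []
    for name in filenames:
        match = pattern.match(name)
        if match:
            steps.append(int(match.group(1)))
    return max(steps) if steps else None


def run_already_complete(filenames, max_steps):
    """True if some checkpoint in the listing has reached max_steps."""
    step = latest_checkpoint_step(filenames)
    return step is not None and step >= max_steps


# Hypothetical directory listing left over from a previous test run.
example_files = [
    "checkpoint",
    "model.ckpt-0.index",
    "model.ckpt-1000.index",
    "model.ckpt-1000.meta",
    "events.out.tfevents.0000000000.hypothetical-host",
]

print(run_already_complete(example_files, max_steps=1000))
print(run_already_complete(example_files, max_steps=2000))
```

In practice the file list would come from listing the model directory (for example with `os.listdir` for a local path) before re-applying the job.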
In case it is helpful, I attached the Dockerfile for the CPU image. The GPU image is the same except that the first line is 'from tensorflow/tensorflow:1.9.0-gpu-py3'. Both images were built from the models/research directory.
Environment:
Kubeflow version: (version number can be found at the bottom left corner of the Kubeflow dashboard): Unsure, I can't view the UI (ERR_CONNECTION_CLOSED). Is there a way to get this from the CLI?
kfctl version: (use kfctl version): v0.6.2-0-g47a0e4c7
Kubernetes platform: GKE
Kubernetes version: (use kubectl version): Server version is v1.12.10-gke.5, client version is v1.12.9-gke.7
OS (e.g. from /etc/os-release): Ubuntu 16.04.4 LTS
Tensorflow version: 1.9
kubectl-describe-tfjob.txt
kubectl-logs-chief.txt
kubectl-logs-evaluator.txt
kubectl-logs-tfoperator.txt
Dockerfile-cpu.txt