
TFJob controller cannot terminate job #193

Closed
bleachzk opened this issue Feb 1, 2018 · 5 comments


bleachzk commented Feb 1, 2018

I tested kubeflow/tf-controller-examples/tf-cnn/tf_job_gpu.yaml by running kubectl:

kubectl apply -f tf_job_gpu.yaml
kubectl get job
NAME                                          DESIRED   SUCCESSFUL   AGE
inception-171202-163257-gpu-1-ps-272u-0       1         0            7m
inception-171202-163257-gpu-1-worker-272u-0   1         0            7m

The TFJob controller does not terminate the job when the WORKER is done, as the TerminationPolicy specifies it should.
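
The expected behavior can be sketched as a toy model. This is an illustration only, not the controller's code: `Phase`, `Chief`, and `job_phase` are hypothetical names, and the assumption is that a termination policy names one chief replica (here WORKER 0, matching the job names above) whose completion decides the state of the whole job.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Tuple


class Phase(Enum):
    RUNNING = "Running"
    SUCCEEDED = "Succeeded"
    FAILED = "Failed"


@dataclass(frozen=True)
class Chief:
    # Hypothetical stand-in for the chief replica named by a termination policy.
    replica_name: str
    replica_index: int


def job_phase(chief: Chief, replicas: Dict[Tuple[str, int], Phase]) -> Phase:
    """Return the phase the whole job is expected to take under a chief-based policy."""
    # The job follows the chief only; other replicas, such as a parameter
    # server that never exits on its own, do not affect the result.
    return replicas[(chief.replica_name, chief.replica_index)]


# State reported in this issue: the worker has finished, the PS is still up.
replicas = {("PS", 0): Phase.RUNNING, ("WORKER", 0): Phase.SUCCEEDED}
print(job_phase(Chief("WORKER", 0), replicas).value)  # prints "Succeeded"
```

Under this model the job would be considered finished as soon as the worker succeeds, even though the PS replica is still running.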

WORKER log:

logs-from-tensorflow-in-inception-171202-163257-gpu-1-worker-272u-0-fdx5x.txt

@gaocegege
Member

@bleachzk Thanks for the issue. I am not sure whether we support TerminationPolicy yet; maybe @jlewi could give us more info.

@jlewi
Contributor

jlewi commented Feb 2, 2018

@bleachzk Can you please provide the spec/status for your TFJob? Can you also clarify what your expected behavior is and what the observed behavior is?

It's possible you're hitting kubeflow/training-operator#128

@bleachzk
Author

bleachzk commented Feb 2, 2018

@jlewi
The worker ran successfully, so the job should be terminated, but it is still running...

Pod status:
pod status

worker logs:
worker-pod-logs

Job status:
job-status

@jlewi
Contributor

jlewi commented Feb 4, 2018

That's kubeflow/training-operator#128

Originally, we let the jobs continue running until the TFJob was deleted, so that the logs stayed accessible after the job terminated.

We are working on fixing that.

@jlewi
Contributor

jlewi commented Mar 9, 2018

Duplicate of kubeflow/training-operator#128
