TFJob pods deleted on completion/failure impairing debugging #1039

Closed
cwbeitel opened this issue Jun 20, 2018 · 1 comment

It looks like the current version of TFJob has changed from earlier versions: replica pods are now deleted when the job completes or fails. This makes development and debugging difficult, since logs have to be captured while the job is running, before the pod is deleted. Solutions like StackDriver are a partial alternative (a central service collects the logs, so they don't need to be fetched from the pods directly), but they don't restore the ability to kubectl describe pod .... Furthermore, the norm established by Jobs seems to be to retain Pods after completion or failure, which keeps both the logs and kubectl describe available.
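For anyone hitting the same thing, here is a rough sketch of the "catch it before the pod is gone" workaround described above. It only wraps the standard `kubectl describe pod` and `kubectl logs --follow` commands; the helper names, the output file naming, and the example pod name are assumptions made for illustration, not anything TFJob provides.

```python
import subprocess
from pathlib import Path
from typing import List


def describe_command(pod: str, namespace: str = "default") -> List[str]:
    """Build the kubectl command that prints a pod's spec, status and events."""
    return ["kubectl", "describe", "pod", pod, "--namespace", namespace]


def logs_command(pod: str, namespace: str = "default") -> List[str]:
    """Build the kubectl command that streams a pod's logs until it exits."""
    return ["kubectl", "logs", "--follow", pod, "--namespace", namespace]


def snapshot_pod(pod: str, namespace: str = "default", out_dir: str = ".") -> None:
    """Save `describe` output, then stream logs to a file while the pod exists.

    This must be started while the pod is still alive; once the pod has been
    deleted, neither command can return anything.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Hypothetical file naming: <pod>.describe.txt and <pod>.log
    with open(out / f"{pod}.describe.txt", "w") as f:
        subprocess.run(describe_command(pod, namespace), stdout=f, check=False)
    with open(out / f"{pod}.log", "w") as f:
        # Blocks until the container exits or the pod is removed.
        subprocess.run(logs_command(pod, namespace), stdout=f, check=False)


# Example call (the pod name is hypothetical):
# snapshot_pod("my-tfjob-worker-0", namespace="kubeflow", out_dir="debug")
```

This is only a stopgap: the describe output is a single snapshot taken at start, so the final pod status and events after failure are still lost, which is the gap this issue is about.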

jlewi commented Jun 20, 2018

Duplicate of kubeflow/training-operator#536

jlewi marked this as a duplicate of kubeflow/training-operator#536 Jun 20, 2018
jlewi closed this as completed Jun 20, 2018