TFJob pods deleted on completion/failure impairing debugging #1039

cwbeitel · 2018-06-20T16:00:31Z

It looks like the current version of TFJob has changed from earlier versions: replica pods are now deleted on job completion or failure. This makes development and debugging difficult, since the logs have to be caught as they happen, before the pod is deleted. Solutions like StackDriver are a partial alternative (a central service collects the logs, so there is no need to fetch them from pods directly), but that still leaves out the ability to kubectl describe pod .... Furthermore, the norm established by Jobs seems to be to retain Pods after their completion or failure, which makes both possible.

jlewi · 2018-06-20T23:40:24Z

Duplicate of kubeflow/training-operator#536

jlewi marked this as a duplicate of kubeflow/training-operator#536 Jun 20, 2018

jlewi closed this as completed Jun 20, 2018
