Surface Pod and other Errors that Prevent TFJob from starting #1131

Closed
jlewi opened this issue Feb 5, 2020 · 19 comments

Comments

@jlewi
Contributor

jlewi commented Feb 5, 2020

See kubeflow/kubeflow#4711

We need a good way to surface errors that occur when starting or running pods to users. Right now it looks like users have to look at the operator logs.

Users should be able to do

kubectl describe tfjobs ${MYJOB}

to see the relevant errors.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
feature 0.94


@johnugeorge
Member

Good point. Will take this up in the next release

@jlewi jlewi removed the feature label Mar 20, 2020
@jlewi
Contributor Author

jlewi commented Apr 6, 2020

You might want to check out kubeflow/kubeflow#3637 to see how this was solved in the case of notebooks. I think the approach we followed was to replay events from the pod.

@jlewi
Contributor Author

jlewi commented May 18, 2020

@johnugeorge How's this coming? Do you think this will land for 1.1?

@gaocegege
Member

/cc @ChanYiLin

@ChanYiLin
Member

ChanYiLin commented May 21, 2020

@gaocegege @johnugeorge
I am wondering whether we should put this in the common library.
It seems this is a feature that every operator needs.
/cc @Jeffwan

@Jeffwan
Member

Jeffwan commented May 21, 2020

This is a reasonable request. I think the engineering story is: if pods enter a failed status, filter the events of those pods and create CR events that carry the error messages from the pod side. It should make the 1.1 timeline once we move all operators to common (also targeted for 1.1).
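
A minimal sketch of the idea in the comment above, assuming events and pods are available as plain dictionaries shaped like the Kubernetes Event JSON (`involvedObject`, `type`, `reason`, `message`). The function `replay_pod_warnings` and the sample names (`mnist`, `mnist-worker-0`) are hypothetical and not taken from the operator or the common library; a real controller would read events from the API server and emit new ones through an event recorder.

```python
from typing import Dict, List


def replay_pod_warnings(job_name: str, pod_names: List[str], events: List[Dict]) -> List[Dict]:
    """Build job-level events from Warning events emitted on the job's pods.

    Hypothetical helper: the event dictionaries only mimic the shape of
    Kubernetes Event objects; nothing here talks to a cluster.
    """
    owned = set(pod_names)
    replayed = []
    for ev in events:
        obj = ev.get("involvedObject", {})
        # Skip events that do not belong to one of this job's pods.
        if obj.get("kind") != "Pod" or obj.get("name") not in owned:
            continue
        # Only Warning events describe problems worth surfacing.
        if ev.get("type") != "Warning":
            continue
        replayed.append({
            "involvedObject": {"kind": "TFJob", "name": job_name},
            "type": "Warning",
            "reason": ev.get("reason", "Unknown"),
            "message": f"pod {obj['name']}: {ev.get('message', '')}",
        })
    return replayed


# Illustrative sample data (assumed, not taken from a real cluster).
events = [
    {"involvedObject": {"kind": "Pod", "name": "mnist-worker-0"},
     "type": "Warning", "reason": "FailedScheduling",
     "message": "0/3 nodes are available: 3 Insufficient cpu."},
    {"involvedObject": {"kind": "Pod", "name": "mnist-worker-0"},
     "type": "Normal", "reason": "Pulled",
     "message": "Container image already present on machine"},
    {"involvedObject": {"kind": "Pod", "name": "unrelated-pod"},
     "type": "Warning", "reason": "BackOff",
     "message": "Back-off restarting failed container"},
]

job_events = replay_pod_warnings("mnist", ["mnist-worker-0"], events)
for ev in job_events:
    print(ev["reason"], "-", ev["message"])
```

With events replayed onto the job object in this way, they would show up in the Events section of `kubectl describe tfjobs ${MYJOB}`, which is the behavior the issue asks for.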

@jlewi
Contributor Author

jlewi commented Jun 15, 2020

@gaocegege @johnugeorge @Jeffwan Is this on track for 1.1? What is the likelihood it lands this week? If not, should we downgrade it to P2 and remove it from KF 1.1?

@gaocegege
Member

Personally, I think it is not on track. @Jeffwan

@gaocegege
Member

/cc @whalecold

@whalecold

/assign

@whalecold

The pod error event has already been recorded in the common repo, and I can find the pod error event in my Kubernetes cluster. It seems that the user in the linked issue didn't use kubectl describe to track down the unexpected condition. What do you think? @gaocegege

@gaocegege
Member

Sometimes the pod is created successfully, but it fails to schedule.

@whalecold

whalecold commented Aug 27, 2020

Sometimes the pod is created successfully, but it fails to schedule.

OK, I have two ideas. One is to use the pod status conditions whose status is False, as the Spark operator does. The other is to collect the events generated by the abnormal pods.
I think the first is better because the second solution needs to store all the events in memory, although the pod status may not be as detailed as the events.
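
The first idea can be sketched roughly as follows, assuming a pod is given as a dictionary shaped like the Kubernetes Pod JSON, where `status.conditions` is a list of entries with `type`, `status`, `reason`, and `message`. The helper `failing_conditions` and the sample pods are hypothetical; this is not the code that was merged into the common repo.

```python
from typing import Dict, List


def failing_conditions(pod: Dict) -> List[str]:
    """Summarize every pod condition whose status is "False".

    Hypothetical helper operating on a dictionary that mimics the Pod JSON.
    """
    name = pod["metadata"]["name"]
    summaries = []
    for cond in pod.get("status", {}).get("conditions", []):
        if cond.get("status") != "False":
            continue
        reason = cond.get("reason", "")
        message = cond.get("message", "")
        summaries.append(f"{name}: {cond['type']}=False ({reason}) {message}".strip())
    return summaries


# Illustrative pod that was created but could not be scheduled (assumed data).
pending_pod = {
    "metadata": {"name": "mnist-worker-0"},
    "status": {
        "phase": "Pending",
        "conditions": [
            {"type": "PodScheduled", "status": "False",
             "reason": "Unschedulable",
             "message": "0/3 nodes are available: 3 Insufficient cpu."},
        ],
    },
}

# Illustrative healthy pod: every condition is "True", so nothing is reported.
running_pod = {
    "metadata": {"name": "mnist-worker-1"},
    "status": {
        "phase": "Running",
        "conditions": [
            {"type": "PodScheduled", "status": "True"},
            {"type": "Ready", "status": "True"},
        ],
    },
}

pending_summary = failing_conditions(pending_pod)
running_summary = failing_conditions(running_pod)
print(pending_summary)
print(running_summary)
```

This keeps no event history in memory, which matches the trade-off described above: the condition carries only one reason and message per type, so it may be less detailed than the full event stream.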

@gaocegege
Member

The first SGTM. I think it works for tf-operator and is easy to implement.

@whalecold

The first SGTM. I think it works for tf-operator and is easy to implement.

Done, PTAL

@gaocegege
Member

I think we can update the vendor to use the latest common.

@whalecold

I think we can update the vendor to use the latest common.

Since the v0.3.1 tag is not the latest, we should release a new tag first.

@stale

stale bot commented Jan 10, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot closed this as completed Jan 17, 2021
Kubeflow 1.1 automation moved this from To do to Done Jan 17, 2021