[proposal]TFJob condition for v1alpha2 #562
Comments
@gaocegege @ScorpioCPH What are your thoughts on this proposal? Do you think we can complete it in time for our 0.2 release? |
Hi @jlewi When is the deadline of 0.2? |
We want to release end of June. So we should be code complete first half of June.
|
Personally, I think it has some potential risks. WDYT @yph152 |
Do we want to include this in our API? We already have |
Do we still need this? |
yeah! I'm making it. |
OK, feel free to open a PR and assign me to review! |
OK. |
Why doesn't the Running condition depend on the workers? |
Why does success depend on PS workers being in running state? Why isn't chief exiting successfully sufficient? |
|
OK, I will modify it. |
How does this relate to support of termination policies as discussed in the original proposal? Per #634 it looks like termination policies were never implemented. |
I tried a job and it doesn't look like conditions indicating job is done are set. This is using #637
|
@gaocegege pointed out to me on Slack that Failed is not a permanent state. This doesn't seem right to me. There should be terminal conditions corresponding to success and failure. If we have an error and are retrying, I think it makes more sense to have a condition like "Error" or "Crash". |
To summarize my discussion with @gaocegege in Slack: at a minimum for 0.2, I think we need the following, and I think that's sufficient. i) The conditions Failed and Succeeded should indicate the job is completely done. Additional conditions can be added later, and the implementation can change. |
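As a sketch of the terminal-condition semantics proposed above (the condition type names and helper function here are illustrative, not necessarily the actual v1alpha2 API):

```go
package main

import "fmt"

// TFJobConditionType enumerates hypothetical condition types mirroring
// the discussion: Succeeded and Failed are terminal, while a transient
// error would surface as Restarting rather than Failed.
type TFJobConditionType string

const (
	Running    TFJobConditionType = "Running"
	Restarting TFJobConditionType = "Restarting"
	Succeeded  TFJobConditionType = "Succeeded"
	Failed     TFJobConditionType = "Failed"
)

// isTerminal reports whether the job is completely done, per the
// semantics proposed for 0.2: only Succeeded and Failed are final.
func isTerminal(t TFJobConditionType) bool {
	return t == Succeeded || t == Failed
}

func main() {
	fmt.Println(isTerminal(Succeeded))  // true
	fmt.Println(isTerminal(Restarting)) // false
}
```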
For reference, I checked what Job does. It looks like it has the conditions Complete and Failed. I think I prefer Succeeded and Failed; it's not obvious to me that Complete means succeeded. |
Filed #673 for the immediate work needed for 0.2 |
@yph152 What work remains? Specifically what work is needed in 0.3? |
I think we can close it. Does v1alpha2 work for us now? Could I remove the redundant code? |
Yes, we can close it and remove the redundant code. @jlewi @gaocegege |
TFJob condition proposal
Documents
Authors: Penghui Yan, Jingtian Peng
This document discusses how the tf-operator manages the state of distributed model training.
Background
It’s very important to clarify how the state of a TensorFlow training job is determined. TensorFlow's distributed mode is highly flexible, which makes managing the state of a TFJob much more complicated. After analyzing best practices for TensorFlow jobs, we established the condition transitions for TFJob described below.
Tf-operator condition logic
Terminology:
RestartPolicy:
Never: after execution finishes, whether successful or not, exit directly; do not restart or re-create the pod.
OnFailed: if execution fails, restart the pod; if it succeeds, exit directly and do not restart or re-create the pod.
Always: restart the pod after execution finishes, whether successful or not.
ExitCode: if the exit code is 0, exit directly and do not restart or re-create the pod; if it is 1, restart the pod; for any other exit code, exit directly and do not restart or re-create the pod.
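The restart rules above can be sketched as a single decision function. This is a minimal illustration assuming exactly the semantics described (restart on exit code 1 only, for the ExitCode policy); the type and function names are hypothetical, not the tf-operator's actual API:

```go
package main

import "fmt"

// RestartPolicy names follow the proposal text above.
type RestartPolicy string

const (
	Never    RestartPolicy = "Never"
	OnFailed RestartPolicy = "OnFailed"
	Always   RestartPolicy = "Always"
	ExitCode RestartPolicy = "ExitCode"
)

// shouldRestart applies the stated rules: Never never restarts, Always
// always restarts, OnFailed restarts only on a non-zero exit, and
// ExitCode restarts only when the container exited with code 1.
func shouldRestart(p RestartPolicy, exitCode int) bool {
	switch p {
	case Always:
		return true
	case OnFailed:
		return exitCode != 0
	case ExitCode:
		return exitCode == 1
	default: // Never, or any unknown policy
		return false
	}
}

func main() {
	fmt.Println(shouldRestart(ExitCode, 1)) // true
	fmt.Println(shouldRestart(ExitCode, 2)) // false
	fmt.Println(shouldRestart(OnFailed, 2)) // true
}
```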
@gaocegege @DjangoPeng @ddysher