tfReplicaStatuses is empty when a TFJob is in a terminated state #889
Comments
Is it possible this has already been fixed in 0.4?
@jokerwenxiao Another option is to look at the Conditions field: https://github.com/kubeflow/tf-operator/blob/v0.3-branch/pkg/apis/tensorflow/v1alpha2/types.go#L141
The last condition should be one of the following values: https://github.com/kubeflow/tf-operator/blob/v0.3-branch/pkg/apis/tensorflow/v1alpha2/types.go#L194
I will take a look at why the TF replica status is missing.
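In a succeeded case, the tail of the conditions array would look roughly like this (a sketch: the reason string and ordering are illustrative, and timestamps and messages are omitted):

```
"conditions": [
  {
    "type": "Running",
    "status": "False"
  },
  {
    "type": "Succeeded",
    "status": "True",
    "reason": "TFJobSucceeded"
  }
]
```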
Looks like we are reinitializing the TF replica status after the TFJob completes: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/tensorflow/controller.go#L374
So after a TFJob completes (whether it succeeds or fails), the replica statuses are reset. @gaocegege What is the reason for this?
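In effect, every reconcile pass does something like the sketch below (type and function names mirror the real API but are redeclared here so the sketch is self-contained; this is a paraphrase of the linked code, not the exact source):

```go
package main

import "fmt"

// Minimal stand-ins for the real tf-operator types.
type TFReplicaType string

type TFReplicaStatus struct {
	Active    int32
	Succeeded int32
	Failed    int32
}

type TFJobStatus struct {
	TFReplicaStatuses map[TFReplicaType]*TFReplicaStatus
}

// initializeTFReplicaStatuses sketches the behavior under discussion: each
// reconcile pass replaces the per-replica status with a fresh zero value.
func initializeTFReplicaStatuses(status *TFJobStatus, rtype TFReplicaType) {
	if status.TFReplicaStatuses == nil {
		status.TFReplicaStatuses = make(map[TFReplicaType]*TFReplicaStatus)
	}
	status.TFReplicaStatuses[rtype] = &TFReplicaStatus{} // counts reset to zero
}

func main() {
	status := &TFJobStatus{}

	initializeTFReplicaStatuses(status, "Worker")
	status.TFReplicaStatuses["Worker"].Succeeded = 4 // counted from live pods

	// A later reconcile, after the job finished and its pods were deleted,
	// resets the counts and there is nothing left to repopulate them from.
	initializeTFReplicaStatuses(status, "Worker")
	fmt.Printf("%+v\n", *status.TFReplicaStatuses["Worker"]) // {Active:0 Succeeded:0 Failed:0}
}
```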
I think it is a bug, but there is something we need to be careful about: how to show the result if we delete pods after the job has finished. In the current design, we get the status of all pods and then set it in the TFReplicaStatuses field. After the job is finished, we cannot get the result anymore. For example, if the job succeeded and we delete all workers and PS, then we cannot know the status of the previous workers/PS.
Do we still need to reconcile pod status after the job is done?
If there is no need, I think we could just remove the initialization code, and then it should work. WDYT @richardsliu @johnugeorge? Personally, I think there is no need, although there may still be some PS/worker pods around after the job finishes.
A few questions:
What should the status of PS be when the job is completed and the PS pods are deleted due to CleanPodPolicyRunning? For example, when I ran https://github.com/kubeflow/tf-operator/blob/master/examples/v1beta1/dist-mnist/tf_job_mnist.yaml (with the reconcile code removed after the job is done), I got a replicaStatuses result along the lines of the sketch below. This is confusing, as the PS pods are missing but the status shows them as active.
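Roughly, with illustrative counts (the exact numbers depend on the replica counts in the example):

```
"replicaStatuses": {
  "PS": {
    "active": 2
  },
  "Worker": {
    "succeeded": 4
  }
}
```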
TfReplicaStatus is being used internally to determine the status of the TFJob: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1beta1/tensorflow/status.go#L46
I think it makes sense to keep that usage internal. JobCondition should suffice for identifying the job status.
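For example, something along these lines reads the latest condition straight from the TFJob object (the job name is a placeholder, and the negative jsonpath slice assumes a reasonably recent kubectl):

```sh
kubectl get tfjob my-job -o jsonpath='{.status.conditions[-1:].type}'
```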
I think users may need to see the status of all replicas to tell whether there was any problem during the training process.
This is fixed. Now if the job succeeds (regardless of clean pod policy), the replica statuses are:
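With illustrative counts (a sketch, assuming the active/succeeded/failed counters of the ReplicaStatus type):

```
"tfReplicaStatuses": {
  "PS": {
    "succeeded": 2
  },
  "Worker": {
    "succeeded": 4
  }
}
```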
And if a pod fails:
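Again roughly, with illustrative counts:

```
"tfReplicaStatuses": {
  "PS": {
    "active": 2
  },
  "Worker": {
    "active": 3,
    "failed": 1
  }
}
```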
I usually use "tfReplicaStatuses" to judge the state of a TFJob, but after I deployed Kubeflow 0.3.3 the information in "tfReplicaStatuses" is empty, and I can't get any useful information at all. The situation is as follows:
"tfReplicaStatuses":{
"PS":{},
"Worker": {}
}
"tfReplicaStatuses":{
"Chief": {}
"Master": {}
"PS":{},
"Worker": {}
}
In addition, I would like to ask whether there is any way to get the status of a TFJob directly, instead of judging it from the information in "tfReplicaStatuses".