
The information in "tfReplicaStatuses" is empty when a TFJob is in a terminated state #889

Closed
jokerwenxiao opened this issue Nov 21, 2018 · 11 comments

Comments

@jokerwenxiao

I usually use "tfReplicaStatuses" to judge the state of a TFJob, but after I deployed Kubeflow 0.3.3 the information in "tfReplicaStatuses" is empty and I can't get any useful information at all. The situation is as follows:

  1. When the TFJob runs successfully:

"tfReplicaStatuses":{
"PS":{},
"Worker": {}
}

  2. When the TFJob fails:

"tfReplicaStatuses":{
"Chief": {}
"Master": {}
"PS":{},
"Worker": {}
}

In addition, I would like to ask whether there is any way to get the status of a TFJob directly, instead of judging it from the information in "tfReplicaStatuses".

@jlewi transferred this issue from kubeflow/kubeflow on Dec 7, 2018
@jlewi (Contributor) commented Dec 7, 2018

/cc @richardsliu @gaocegege

@jlewi (Contributor) commented Dec 7, 2018

Is it possible this has already been fixed in 0.4?

@richardsliu (Contributor)

@jokerwenxiao Another option is to look at the Conditions field: https://github.com/kubeflow/tf-operator/blob/v0.3-branch/pkg/apis/tensorflow/v1alpha2/types.go#L141

The last condition should be one of the following values: https://github.com/kubeflow/tf-operator/blob/v0.3-branch/pkg/apis/tensorflow/v1alpha2/types.go#L194

I will take a look at why the tf replica status is missing.
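
For reference, a minimal sketch of reading the last condition with the official Kubernetes Python client; the namespace ("default"), job name ("my-tfjob"), and the v1alpha2 group/version here are assumptions to adjust for your deployment:

# Sketch: read the last JobCondition of a TFJob via the CustomObjects API.
# The namespace, job name, and the v1alpha2 group/version are assumptions;
# adjust them for your cluster and Kubeflow version.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

tfjob = api.get_namespaced_custom_object(
    group="kubeflow.org",
    version="v1alpha2",
    namespace="default",
    plural="tfjobs",
    name="my-tfjob",
)

conditions = tfjob.get("status", {}).get("conditions", [])
if conditions:
    last = conditions[-1]
    # Expected condition types include Created, Running, Restarting, Succeeded, Failed.
    print(last["type"], last["status"], last.get("message", ""))
else:
    print("no conditions reported yet")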

@richardsliu (Contributor)

Looks like we are reinitializing the TF replica status after the TFJob completes: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/tensorflow/controller.go#L374

So after a TFJob completes (either success or fail), the replica statuses are reset.

@gaocegege What is the reason for this?

@gaocegege (Member) commented Dec 18, 2018

I think it is a bug, but there is something we need to consider: how to show the result if we delete the pods after the job finishes.

In the current design, we get the status of all pods and then set it in the TFReplicaStatuses field. Once the job is finished, we cannot get the result anymore.

For example, if the job succeeds and we delete all worker and PS pods, then we can no longer know the status of those previous workers/PS.
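
To make the point concrete, a rough Python sketch of the roll-up being described (an illustration only, not the tf-operator's Go implementation): the per-replica counts are rebuilt from live pod phases on each sync, so once the pods are deleted there is nothing left to aggregate.

# Illustration only (not the actual Go controller): replica status counts are
# derived from live pod phases, so deleting the pods after completion leaves
# nothing to count.
from collections import defaultdict

def aggregate_replica_statuses(pods):
    """pods: iterable of dicts like {"replica_type": "Worker", "phase": "Succeeded"}."""
    statuses = defaultdict(lambda: {"active": 0, "succeeded": 0, "failed": 0})
    for pod in pods:
        counts = statuses[pod["replica_type"]]
        phase = pod["phase"]
        if phase in ("Pending", "Running"):
            counts["active"] += 1
        elif phase == "Succeeded":
            counts["succeeded"] += 1
        elif phase == "Failed":
            counts["failed"] += 1
    return dict(statuses)

# While the pods exist, the counts are meaningful ...
print(aggregate_replica_statuses([
    {"replica_type": "PS", "phase": "Running"},
    {"replica_type": "Worker", "phase": "Succeeded"},
]))
# ... but after the pods are cleaned up, the same aggregation yields empty
# statuses, which matches the behaviour reported above.
print(aggregate_replica_statuses([]))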

@richardsliu (Contributor)

Do we still need to reconcile pod status after the job is done?

@gaocegege (Member)

If there is no need, I think we could just remove the initialization code, and then it should work. WDYT @richardsliu @johnugeorge

Personally, I think there is no need, although there may still be some PS/worker pods around after the job finishes.

@johnugeorge (Member) commented Dec 18, 2018

A few questions:

  1. What is the real significance of the tfReplicaStatus field currently? Can we treat it as an internal field that should not be used by developers?

  2. Regarding @gaocegege's last comment, what extra information do we provide with the status of the previous workers/PS? Doesn't JobCondition suffice for our needs?

  3. If we need to expose it, we have to document the expected values of the tfReplicaStatus field for the various replica types.

What should the status of PS be when the job is completed and the PS pods are deleted due to CleanPodPolicyRunning? Currently, the PS replicas always show as active.

For example, when I ran https://github.com/kubeflow/tf-operator/blob/master/examples/v1beta1/dist-mnist/tf_job_mnist.yaml (with the reconcile code for completed jobs removed), I got the following result:

replicaStatuses:
  PS:
    active: 2
  Worker:
    active: 1
    succeeded: 3

This is also confusing, as the PS pods are missing but the status shows them as active.

@richardsliu (Contributor)

TfReplicaStatus is used internally to determine the status of the TFJob: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1beta1/tensorflow/status.go#L46

I think it makes sense to keep the usage internal. JobCondition should suffice for identifying the job status.

@gaocegege (Member)

I think users may need to know the status of all replicas to tell whether there was any problem during the training process.

@richardsliu (Contributor)

This is fixed. Now if the job succeeds (regardless of clean pod policy), the replica statuses are:

  Replica Statuses:
    PS:
      Succeeded:  2
    Worker:
      Succeeded:  2

And if a pod fails:

  Replica Statuses:
    PS:
      Active:  1
      Failed:  1
    Worker:
      Active:  2
