
The information in "tfReplicaStatuses" is empty when a TFJob is in a terminated state #889

Closed
jokerwenxiao opened this issue Nov 21, 2018 · 11 comments

Comments

@jokerwenxiao

I usually use "tfReplicaStatuses" to judge the state of a TFJob, but after I deployed Kubeflow 0.3.3 the information in "tfReplicaStatuses" is empty and I can't get any useful information at all. The situation is as follows:

  1. When the TFJob runs successfully:

"tfReplicaStatuses":{
"PS":{},
"Worker": {}
}

  2. When the TFJob fails:

"tfReplicaStatuses":{
"Chief": {}
"Master": {}
"PS":{},
"Worker": {}
}

In addition, I would like to ask whether there is any way to get the status of a TFJob directly, instead of judging it from the information in "tfReplicaStatuses".

@jlewi transferred this issue from kubeflow/kubeflow on Dec 7, 2018
@jlewi (Contributor) commented Dec 7, 2018

/cc @richardsliu @gaocegege

@jlewi (Contributor) commented Dec 7, 2018

Is it possible this has already been fixed in 0.4?

@richardsliu (Contributor)

@jokerwenxiao Another option is to look at the Conditions field: https://github.com/kubeflow/tf-operator/blob/v0.3-branch/pkg/apis/tensorflow/v1alpha2/types.go#L141

The last condition should be one of the following values: https://github.com/kubeflow/tf-operator/blob/v0.3-branch/pkg/apis/tensorflow/v1alpha2/types.go#L194

I will take a look at why the tf replica status is missing.
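
For reference, a minimal sketch of reading the last condition with the official Kubernetes Python client; the namespace ("default"), job name ("my-tfjob"), and the v1alpha2 group/version here are assumptions to adjust for your deployment:

# Sketch: read the last JobCondition of a TFJob via the CustomObjects API.
# The namespace, job name, and the v1alpha2 group/version are assumptions;
# adjust them for your cluster and Kubeflow version.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

tfjob = api.get_namespaced_custom_object(
    group="kubeflow.org",
    version="v1alpha2",
    namespace="default",
    plural="tfjobs",
    name="my-tfjob",
)

conditions = tfjob.get("status", {}).get("conditions", [])
if conditions:
    last = conditions[-1]
    # Expected condition types include Created, Running, Restarting, Succeeded, Failed.
    print(last["type"], last["status"], last.get("message", ""))
else:
    print("no conditions reported yet")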

@richardsliu (Contributor)

Looks like we are reinitializing the TF replica status after the TFJob completes: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v2/tensorflow/controller.go#L374

So after a TFJob completes (either success or fail), the replica statuses are reset.

@gaocegege What is the reason for this?

@gaocegege (Member) commented Dec 18, 2018

I think it is a bug, but there is something we need to consider: how to show the result if we delete the pods after the job finishes.

In the current design, we get the status of all pods and then set it in the TFReplicaStatuses field. Once the job is finished, we cannot get the result anymore.

For example, if the job succeeds and we delete all worker and PS pods, then we can no longer know the status of those previous workers/PS.
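
To make the point concrete, a rough Python sketch of the roll-up being described (an illustration only, not the tf-operator's Go implementation): the per-replica counts are rebuilt from live pod phases on each sync, so once the pods are deleted there is nothing left to aggregate.

# Illustration only (not the actual Go controller): replica status counts are
# derived from live pod phases, so deleting the pods after completion leaves
# nothing to count.
from collections import defaultdict

def aggregate_replica_statuses(pods):
    """pods: iterable of dicts like {"replica_type": "Worker", "phase": "Succeeded"}."""
    statuses = defaultdict(lambda: {"active": 0, "succeeded": 0, "failed": 0})
    for pod in pods:
        counts = statuses[pod["replica_type"]]
        phase = pod["phase"]
        if phase in ("Pending", "Running"):
            counts["active"] += 1
        elif phase == "Succeeded":
            counts["succeeded"] += 1
        elif phase == "Failed":
            counts["failed"] += 1
    return dict(statuses)

# While the pods exist, the counts are meaningful ...
print(aggregate_replica_statuses([
    {"replica_type": "PS", "phase": "Running"},
    {"replica_type": "Worker", "phase": "Succeeded"},
]))
# ... but after the pods are cleaned up, the same aggregation yields empty
# statuses, which matches the behaviour reported above.
print(aggregate_replica_statuses([]))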

@richardsliu (Contributor)

Do we still need to reconcile pod status after the job is done?

@gaocegege (Member)

If there is no need, I think we could just remove the initialization code, and then it should work. WDYT @richardsliu @johnugeorge

Personally, I think there is no need, although there may still be some PS/worker pods around after the job finishes.

@johnugeorge (Member) commented Dec 18, 2018

A few questions:

  1. What is the real significance of the tfReplicaStatus field currently? Can we treat it as an internal field that should not be used by developers?

  2. Regarding @gaocegege's last comment, what extra information do we provide with the status of the previous workers/PS? Doesn't JobCondition suffice for our needs?

  3. If we need to expose it, we have to document the expected values of the tfReplicaStatus field for the various replica types.

What should the status of PS be when the job is completed and the PS pods are deleted due to CleanPodPolicyRunning? Currently, the PS replicas always show as active.

For example, when I ran https://github.com/kubeflow/tf-operator/blob/master/examples/v1beta1/dist-mnist/tf_job_mnist.yaml (with the reconcile code for completed jobs removed), I got the following result:

replicaStatuses:
  PS:
    active: 2
  Worker:
    active: 1
    succeeded: 3

This is also confusing, as the PS pods are missing but the status shows them as active.

@richardsliu (Contributor)

TfReplicaStatus is used internally to determine the status of the TFJob: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1beta1/tensorflow/status.go#L46

I think it makes sense to keep the usage internal. JobCondition should suffice for identifying the job status.

@gaocegege (Member)

I think users may need to know the status of all replicas to tell whether there was any problem during the training process.

@richardsliu (Contributor)

This is fixed. Now if the job succeeds (regardless of clean pod policy), the replica statuses are:

  Replica Statuses:
    PS:
      Succeeded:  2
    Worker:
      Succeeded:  2

And if a pod fails:

  Replica Statuses:
    PS:
      Active:  1
      Failed:  1
    Worker:
      Active:  2
