PS failed but tfjob status is running #881

Closed
deepanshg opened this issue Dec 2, 2018 · 1 comment

@deepanshg

The TFJob consists of 1 PS and 2 Workers. The PS has failed and the workers are still running, yet the TFJob status is reported as Running, whereas it should be Failed according to PR #690.

Status:

  Conditions:
    Last Transition Time:  2018-12-02T12:52:38Z
    Last Update Time:      2018-12-02T12:52:38Z
    Message:               TFJob dtf-103553-223 is running.
    Reason:                TFJobRunning
    Status:                True
    Type:                  Running
  Start Time:              2018-12-02T13:06:55Z
  Tf Replica Statuses:
    PS:
      Failed:  1
    Worker:
      Active:  2

The status remains the same even hours later.
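
To make the mismatch concrete, here is a minimal sketch of the consistency check described above: if any replica type reports a failed pod, the job conditions should include Failed. It is not the operator's actual logic. The function name `find_status_mismatch` is hypothetical, and the camelCase keys (`tfReplicaStatuses`, `conditions`, `failed`) are an assumption about how the fields shown in the describe output appear in the JSON form of the object.

```python
def find_status_mismatch(status):
    """Return a description of the mismatch, or None if the status looks consistent.

    `status` is a dict shaped like the TFJob status above (key names assumed).
    """
    replica_statuses = status.get("tfReplicaStatuses", {})

    # Collect replica types that report at least one failed pod.
    failed = {
        name: counts.get("failed", 0)
        for name, counts in replica_statuses.items()
        if counts.get("failed", 0) > 0
    }

    # Conditions whose status is "True" are the ones currently in effect.
    active_conditions = {
        c["type"] for c in status.get("conditions", []) if c.get("status") == "True"
    }

    # Expectation from this report: a failed replica should surface as a Failed condition.
    if failed and "Failed" not in active_conditions:
        return "replicas failed {} but conditions are {}".format(
            failed, sorted(active_conditions)
        )
    return None


# The status reported in this issue, reduced to the relevant fields.
reported = {
    "conditions": [{"type": "Running", "status": "True", "reason": "TFJobRunning"}],
    "tfReplicaStatuses": {"PS": {"failed": 1}, "Worker": {"active": 2}},
}

print(find_status_mismatch(reported))
```

For the status in this report the function returns a non-empty message, because PS has one failed pod while the only active condition is Running.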

Some parts of the YAML:

Namespace:    gpuexptazuregpu
Labels:       app=dtf-103553-223
Annotations:  <none>
API Version:  kubeflow.org/v1alpha2
Kind:         TFJob
Metadata:
  Cluster Name:        
  Creation Timestamp:  2018-12-02T12:52:35Z
  Generation:          0
  Resource Version:    64383446
  Self Link:           /apis/kubeflow.org/v1alpha2/namespaces/gpuexptazuregpu/tfjobs/dtf-103553-223
  UID:                 2436acaf-f631-11e8-8519-000d3a02fab7
Spec:
  Tf Replica Specs:
    PS:
      Replicas:        1
      Restart Policy:  Never
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec: ...
    Worker:
      Replicas:        2
      Restart Policy:  Never
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec: ...

tf-operator image: gcr.io/kubeflow-images-public/tf_operator:v0.2.0
kubeflow version: v0.2.4

@richardsliu
Contributor

This should be fixed in v0.3.0 and later.

@jlewi jlewi closed this as completed Feb 4, 2019