[proposal]TFJob condition for v1alpha2 #562

Closed
yph152 opened this issue Apr 27, 2018 · 22 comments

yph152 commented Apr 27, 2018

TFJob condition proposal

Documents

Authors: Penghui Yan, Jingtian Peng

    This document discusses how the tf-operator manages the state of distributed model training jobs.

Background

    It is important to clarify how the state of a TensorFlow training job is judged. Because TensorFlow's distributed mode is very flexible, managing the state of a TFJob becomes much more complicated. Based on an analysis of best practices for TensorFlow jobs, we propose the TFJob condition transitions below.

Tf-operator condition logic

[Diagram: TFJob condition transitions]

Noun explanation (a Go sketch of these condition types follows this list):

  • Created: the default condition set when the TFJob resource is created;
  • Running: when the chief and all PS replicas are Running;
  • Succeeded: when the chief and the PS replicas both finish successfully;
  • Failed: when the chief or a PS replica fails, the training job fails;
  • Restarting: when all of the PS replicas are Running, a pod has been demoted to Pending, and the RestartPolicy is OnFailed or ExitCode, the job is shown as Restarting. The WORKER task should not use Always: with Always the worker keeps training even after it succeeds, which is not reasonable. The PS task should use Never;
  • Chief: if the user does not explicitly set a Chief, worker-0 is used by default.
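
A minimal Go sketch of the condition types described above; the identifiers and fields are illustrative, following the usual Kubernetes condition convention, and are not necessarily the exact names used in the v1alpha2 API.

package v1alpha2

// TFJobConditionType enumerates the proposed TFJob conditions.
type TFJobConditionType string

const (
	TFJobCreated    TFJobConditionType = "Created"
	TFJobRunning    TFJobConditionType = "Running"
	TFJobRestarting TFJobConditionType = "Restarting"
	TFJobSucceeded  TFJobConditionType = "Succeeded"
	TFJobFailed     TFJobConditionType = "Failed"
)

// TFJobCondition records one observed condition of a TFJob: its type,
// whether it currently holds, and why.
type TFJobCondition struct {
	Type               TFJobConditionType
	Status             string // "True", "False", or "Unknown"
	Reason             string
	Message            string
	LastUpdateTime     string // RFC3339 timestamp; a real API would use metav1.Time
	LastTransitionTime string
}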

RestartPolicy (a decision sketch in Go follows this list):

  • Never: whether execution succeeds or fails, exit directly; do not restart or re-create the pod;

  • OnFailed: if execution fails, restart the pod; if it succeeds, exit directly and do not restart or re-create the pod;

  • Always: restart the pod whether execution succeeds or fails;

  • ExitCode: if the exit code is 0, exit directly and do not restart or re-create the pod; if it is 1, restart the pod;

    for any other status code, exit directly and do not restart or re-create the pod.
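
The restart behaviour above can be condensed into a small decision function. This is a sketch of the rules in this proposal, not the tf-operator implementation; the RestartPolicy type and shouldRestart helper are illustrative names.

package main

import "fmt"

// RestartPolicy mirrors the policies described above; the names are illustrative.
type RestartPolicy string

const (
	RestartNever    RestartPolicy = "Never"
	RestartOnFailed RestartPolicy = "OnFailed"
	RestartAlways   RestartPolicy = "Always"
	RestartExitCode RestartPolicy = "ExitCode"
)

// shouldRestart decides whether a finished pod is restarted, given its
// restart policy and container exit code, following the rules in this proposal.
func shouldRestart(policy RestartPolicy, exitCode int) bool {
	switch policy {
	case RestartNever:
		return false // never restart, whether the pod succeeded or failed
	case RestartAlways:
		return true // always restart, whether the pod succeeded or failed
	case RestartOnFailed:
		return exitCode != 0 // restart only when the pod failed
	case RestartExitCode:
		return exitCode == 1 // exit code 1 restarts; 0 and all other codes exit
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldRestart(RestartExitCode, 1))   // true: restart the pod
	fmt.Println(shouldRestart(RestartExitCode, 127)) // false: command not found, do not restart
}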

Status code description:

  • 0: Succeeded
  • 1: General unknown error
  • 2: Misuse of shell built-ins
  • 126: Command cannot be executed
  • 127: Command not found
  • 128: Invalid argument to exit
  • 128+x: Fatal error caused by Linux signal x
  • 130: Command terminated by Ctrl+C
  • 255: Exit code out of range

@gaocegege @DjangoPeng @ddysher

jlewi commented May 8, 2018

@gaocegege @ScorpioCPH What are your thoughts on this proposal? Do you think we can complete it in time for our 0.2 release?

gaocegege commented May 8, 2018

Hi @jlewi

When is the deadline for 0.2?

jlewi commented May 10, 2018 via email

@gaocegege

Personally, I think it has some potential risks

WDYT @yph152

ddysher commented May 13, 2018

Do we want to include this in our API? We already have Conditions and TFReplicaStatuses in our API. It occurs to me that the user should be responsible for deciding the status, whether that is the developer running the job or a higher-level system launching the TFJob on the developer's behalf.

@gaocegege

@yph152

Do we still need this?

yph152 commented Jun 4, 2018

Yeah! I'm working on it.

@gaocegege

OK, feel free to open a PR and assign me to review!

yph152 commented Jun 4, 2018

OK.

jlewi commented Jun 11, 2018

Why doesn't the Running condition depend on the workers?

jlewi commented Jun 11, 2018

Why does success depend on PS workers being in running state? Why isn't chief exiting successfully sufficient?

yph152 commented Jun 11, 2018

Why doesn't the Running condition depend on the workers?
@jlewi By default our users use worker-0 as the chief.

yph152 commented Jun 11, 2018

Why does success depend on PS workers being in running state? Why isn't chief exiting successfully sufficient?

OK, I will modify it.

jlewi commented Jun 11, 2018

How does this relate to support of termination policies as discussed in the original proposal?
kubeflow/community#30 (comment)

Per #634 it looks like termination policies were never implemented.

jlewi commented Jun 12, 2018

I tried a job and it doesn't look like the conditions indicating the job is done are being set.

This is using #637

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  annotations:
    ksonnet.io/managed: '{"pristine":"H4sIAAAAAAAA/+yQz8oaMRTF932Ms44OSqGQbVH6B1vR0i5E5BrvOMHJH5I4RSTv/hHnE5Vv4wO4u3cm555zfmeQ1385RO0sJA7HLdet+z90YV91I2p9Q2MIHLTdQeLP9IfbQsBwoh0lgjzDkmFIGIqJw0DHgWo014M7cXkRPSm+M0AWiJ5VuZDqBftWK1p6VrF8+frt+2RahtD/iJAjgcAxUUhz12p1gsQv7jhAILHxLSUugutN5WwibTlEyFVZjaFSYQV/So2zEKg6CpVyO64aiptL6o0Lm77I0J+wFtCG9iV2L5Lj4Zf3PpBIbKMLlzICnWuPhmfuaFPvaMo4p9RA3pxu4suW13l9lV5z2lrvZ+SfIZs/XMtZYL58BDd+gXsO3L/fi5+TxSO8zy94z8DLOX96AwAA//8BAAD//6qI3IdKBAAA"}'
  clusterName: ""
  creationTimestamp: 2018-06-12T20:38:24Z
  generation: 0
  labels:
    app.kubernetes.io/deploy-manager: ksonnet
  name: master-is-chief-v1alpha2
  namespace: kubeflow
  resourceVersion: "1080051"
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/master-is-chief-v1alpha2
  uid: 8d883088-6e80-11e8-996c-42010af000af
spec:
  tfReplicaSpecs:
    CHIEF:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - python
            - /var/code/has_chief_or_master.py
            image: python:2.7
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            volumeMounts:
            - mountPath: /var/code
              name: code
          volumes:
          - configMap:
              name: master-is-chief-v1alpha2
            name: code
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - python
            - /var/code/has_chief_or_master.py
            image: python:2.7
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            volumeMounts:
            - mountPath: /var/code
              name: code
          volumes:
          - configMap:
              name: master-is-chief-v1alpha2
            name: code
    WORKER:
      replicas: 4
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - python
            - /var/code/has_chief_or_master.py
            image: python:2.7
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            volumeMounts:
            - mountPath: /var/code
              name: code
          volumes:
          - configMap:
              name: master-is-chief-v1alpha2
            name: code
status:
  conditions: null
  tfReplicaStatuses:
    CHIEF:
      succeeded: 1
    PS:
      active: 2
    WORKER:
      active: 4

jlewi commented Jun 15, 2018

@gaocegege pointed out to me on Slack that Failed is not a permanent state. This doesn't seem right to me.

There should be terminal conditions corresponding to success and failure.

If we have an error and are retrying, I think it makes more sense to have a Condition like "Error" or "Crash".

jlewi commented Jun 15, 2018

To summarize my discussion with @gaocegege on Slack: at a minimum for 0.2, I think we need the following.

i) The conditions Failed and Succeeded should indicate the job is completely done.
ii) When the job enters the Failed or Succeeded condition, the pods should be deleted so resources are released.

Additional conditions can be added later and the implementation can change.
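
A minimal sketch of these two requirements, continuing the illustrative v1alpha2 types sketched earlier in this thread; the helper names below are illustrative and are not the actual tf-operator API.

// hasCondition reports whether the condition list contains the given
// condition type with status "True". (Continues the TFJobCondition sketch above.)
func hasCondition(conditions []TFJobCondition, condType TFJobConditionType) bool {
	for _, c := range conditions {
		if c.Type == condType && c.Status == "True" {
			return true
		}
	}
	return false
}

// isJobFinished treats Succeeded and Failed as terminal: once either is set,
// the job is completely done and its pods can be deleted to release resources.
func isJobFinished(conditions []TFJobCondition) bool {
	return hasCondition(conditions, TFJobSucceeded) ||
		hasCondition(conditions, TFJobFailed)
}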

jlewi commented Jun 15, 2018

For reference I checked what Job does
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#jobcondition-v1-batch

It looks like it has the conditions Complete and Failed. I think I prefer Succeeded and Failed. It's not obvious to me that Complete means succeeded.

jlewi commented Jun 15, 2018

Filed #673 for the immediate work needed for 0.2

jlewi commented Jul 3, 2018

@yph152 What work remains? Specifically what work is needed in 0.3?

@gaocegege

I think we can close it.

@yph152

Does v1alpha2 work for us now? Could I remove the redundant code?

yph152 commented Jul 4, 2018

Yes, we can close it and remove the redundant code. @jlewi @gaocegege
