[proposal]TFJob condition for v1alpha2 #562

Closed
yph152 opened this issue Apr 27, 2018 · 22 comments

yph152 commented Apr 27, 2018

TFJob condition proposal

Documents

Authors: Penghui Yan, Jingtian Peng

    This document discusses how the tf-operator manages the state of distributed model training jobs.

Background

    It is important to clarify how the state of a TensorFlow training job is judged. Because TensorFlow's distributed mode is very flexible, managing the state of a TFJob becomes much more complicated. Based on an analysis of best practices for TensorFlow jobs, we propose the TFJob condition transitions below.

Tf-operator condition logic

[Diagram: TFJob condition transitions]

Noun explanation (a Go sketch of these condition types follows this list):

  • Created: the default condition set when the TFJob resource is created;
  • Running: when the chief and all PS replicas are Running;
  • Succeeded: when the chief and the PS replicas both finish successfully;
  • Failed: when the chief or a PS replica fails, the training job fails;
  • Restarting: when all of the PS replicas are Running, a pod has been demoted to Pending, and the RestartPolicy is OnFailed or ExitCode, the job is shown as Restarting. The WORKER task should not use Always: with Always the worker keeps training even after it succeeds, which is not reasonable. The PS task should use Never;
  • Chief: if the user does not explicitly set a Chief, worker-0 is used by default.
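
A minimal Go sketch of the condition types described above; the identifiers and fields are illustrative, following the usual Kubernetes condition convention, and are not necessarily the exact names used in the v1alpha2 API.

package v1alpha2

// TFJobConditionType enumerates the proposed TFJob conditions.
type TFJobConditionType string

const (
	TFJobCreated    TFJobConditionType = "Created"
	TFJobRunning    TFJobConditionType = "Running"
	TFJobRestarting TFJobConditionType = "Restarting"
	TFJobSucceeded  TFJobConditionType = "Succeeded"
	TFJobFailed     TFJobConditionType = "Failed"
)

// TFJobCondition records one observed condition of a TFJob: its type,
// whether it currently holds, and why.
type TFJobCondition struct {
	Type               TFJobConditionType
	Status             string // "True", "False", or "Unknown"
	Reason             string
	Message            string
	LastUpdateTime     string // RFC3339 timestamp; a real API would use metav1.Time
	LastTransitionTime string
}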

RestartPolicy (a decision sketch in Go follows this list):

  • Never: whether execution succeeds or fails, exit directly; do not restart or re-create the pod;

  • OnFailed: if execution fails, restart the pod; if it succeeds, exit directly and do not restart or re-create the pod;

  • Always: restart the pod whether execution succeeds or fails;

  • ExitCode: if the exit code is 0, exit directly and do not restart or re-create the pod; if it is 1, restart the pod;

    for any other status code, exit directly and do not restart or re-create the pod.
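
The restart behaviour above can be condensed into a small decision function. This is a sketch of the rules in this proposal, not the tf-operator implementation; the RestartPolicy type and shouldRestart helper are illustrative names.

package main

import "fmt"

// RestartPolicy mirrors the policies described above; the names are illustrative.
type RestartPolicy string

const (
	RestartNever    RestartPolicy = "Never"
	RestartOnFailed RestartPolicy = "OnFailed"
	RestartAlways   RestartPolicy = "Always"
	RestartExitCode RestartPolicy = "ExitCode"
)

// shouldRestart decides whether a finished pod is restarted, given its
// restart policy and container exit code, following the rules in this proposal.
func shouldRestart(policy RestartPolicy, exitCode int) bool {
	switch policy {
	case RestartNever:
		return false // never restart, whether the pod succeeded or failed
	case RestartAlways:
		return true // always restart, whether the pod succeeded or failed
	case RestartOnFailed:
		return exitCode != 0 // restart only when the pod failed
	case RestartExitCode:
		return exitCode == 1 // exit code 1 restarts; 0 and all other codes exit
	default:
		return false
	}
}

func main() {
	fmt.Println(shouldRestart(RestartExitCode, 1))   // true: restart the pod
	fmt.Println(shouldRestart(RestartExitCode, 127)) // false: command not found, do not restart
}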

Status code description:

  • 0: Succeeded
  • 1: General unknown error
  • 2: Misuse of shell built-ins
  • 126: Command cannot be executed
  • 127: Command not found
  • 128: Invalid argument to exit
  • 128+x: Fatal error caused by Linux signal x
  • 130: Command terminated by Ctrl+C
  • 255: Exit code out of range

@gaocegege @DjangoPeng @ddysher

jlewi commented May 8, 2018

@gaocegege @ScorpioCPH What are your thoughts on this proposal? Do you think we can complete it in time for our 0.2 release?

gaocegege commented May 8, 2018

Hi @jlewi

When is the deadline for 0.2?

jlewi commented May 10, 2018 via email

@gaocegege

Personally, I think it has some potential risks

WDYT @yph152

ddysher commented May 13, 2018

Do we want to include this in our API? We already have Conditions and TFReplicaStatuses in our API. It occurs to me that the user should be responsible for deciding the status, whether that is the developer running the job or a higher-level system launching the TFJob on the developer's behalf.

@gaocegege

@yph152

Do we still need this?

yph152 commented Jun 4, 2018

Yeah! I'm working on it.

@gaocegege

OK, feel free to open a PR and assign me to review!

yph152 commented Jun 4, 2018

OK.

jlewi commented Jun 11, 2018

Why doesn't the Running condition depend on the workers?

jlewi commented Jun 11, 2018

Why does success depend on PS workers being in running state? Why isn't chief exiting successfully sufficient?

yph152 commented Jun 11, 2018

Why doesn't the Running condition depend on the workers?
@jlewi By default our users use worker-0 as the chief.

yph152 commented Jun 11, 2018

Why does success depend on PS workers being in running state? Why isn't chief exiting successfully sufficient?

OK, I will modify it.

jlewi commented Jun 11, 2018

How does this relate to support of termination policies as discussed in the original proposal?
kubeflow/community#30 (comment)

Per #634 it looks like termination policies were never implemented.

jlewi commented Jun 12, 2018

I tried a job and it doesn't look like the conditions indicating the job is done are being set.

This is using #637

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  annotations:
    ksonnet.io/managed: '{"pristine":"H4sIAAAAAAAA/+yQz8oaMRTF932Ms44OSqGQbVH6B1vR0i5E5BrvOMHJH5I4RSTv/hHnE5Vv4wO4u3cm555zfmeQ1385RO0sJA7HLdet+z90YV91I2p9Q2MIHLTdQeLP9IfbQsBwoh0lgjzDkmFIGIqJw0DHgWo014M7cXkRPSm+M0AWiJ5VuZDqBftWK1p6VrF8+frt+2RahtD/iJAjgcAxUUhz12p1gsQv7jhAILHxLSUugutN5WwibTlEyFVZjaFSYQV/So2zEKg6CpVyO64aiptL6o0Lm77I0J+wFtCG9iV2L5Lj4Zf3PpBIbKMLlzICnWuPhmfuaFPvaMo4p9RA3pxu4suW13l9lV5z2lrvZ+SfIZs/XMtZYL58BDd+gXsO3L/fi5+TxSO8zy94z8DLOX96AwAA//8BAAD//6qI3IdKBAAA"}'
  clusterName: ""
  creationTimestamp: 2018-06-12T20:38:24Z
  generation: 0
  labels:
    app.kubernetes.io/deploy-manager: ksonnet
  name: master-is-chief-v1alpha2
  namespace: kubeflow
  resourceVersion: "1080051"
  selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/master-is-chief-v1alpha2
  uid: 8d883088-6e80-11e8-996c-42010af000af
spec:
  tfReplicaSpecs:
    CHIEF:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - python
            - /var/code/has_chief_or_master.py
            image: python:2.7
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            volumeMounts:
            - mountPath: /var/code
              name: code
          volumes:
          - configMap:
              name: master-is-chief-v1alpha2
            name: code
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - python
            - /var/code/has_chief_or_master.py
            image: python:2.7
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            volumeMounts:
            - mountPath: /var/code
              name: code
          volumes:
          - configMap:
              name: master-is-chief-v1alpha2
            name: code
    WORKER:
      replicas: 4
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - command:
            - python
            - /var/code/has_chief_or_master.py
            image: python:2.7
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            volumeMounts:
            - mountPath: /var/code
              name: code
          volumes:
          - configMap:
              name: master-is-chief-v1alpha2
            name: code
status:
  conditions: null
  tfReplicaStatuses:
    CHIEF:
      succeeded: 1
    PS:
      active: 2
    WORKER:
      active: 4

jlewi commented Jun 15, 2018

@gaocegege pointed out to me on Slack that Failed is not a permanent state. This doesn't seem right to me.

There should be terminal conditions corresponding to success and failure.

If we have an error and are retrying, I think it makes more sense to have a Condition like "Error" or "Crash".

jlewi commented Jun 15, 2018

To summarize my discussion with @gaocegege on Slack: at a minimum for 0.2, I think we need the following.

i) The conditions Failed and Succeeded should indicate the job is completely done.
ii) When the job enters the Failed or Succeeded condition, the pods should be deleted so resources are released.

Additional conditions can be added later and the implementation can change.
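
A minimal sketch of these two requirements, continuing the illustrative v1alpha2 types sketched earlier in this thread; the helper names below are illustrative and are not the actual tf-operator API.

// hasCondition reports whether the condition list contains the given
// condition type with status "True". (Continues the TFJobCondition sketch above.)
func hasCondition(conditions []TFJobCondition, condType TFJobConditionType) bool {
	for _, c := range conditions {
		if c.Type == condType && c.Status == "True" {
			return true
		}
	}
	return false
}

// isJobFinished treats Succeeded and Failed as terminal: once either is set,
// the job is completely done and its pods can be deleted to release resources.
func isJobFinished(conditions []TFJobCondition) bool {
	return hasCondition(conditions, TFJobSucceeded) ||
		hasCondition(conditions, TFJobFailed)
}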

jlewi commented Jun 15, 2018

For reference I checked what Job does
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#jobcondition-v1-batch

It looks like it has the conditions Complete and Failed. I think I prefer Succeeded and Failed. It's not obvious to me that Complete means succeeded.

jlewi commented Jun 15, 2018

Filed #673 for the immediate work needed for 0.2

jlewi commented Jul 3, 2018

@yph152 What work remains? Specifically what work is needed in 0.3?

@gaocegege

I think we can close it.

@yph152

Does v1alpha2 work for us now? Could I remove the redundant code?

yph152 commented Jul 4, 2018

Yes, we can close it and remove the redundant code. @jlewi @gaocegege
