[v1beta2] Add ActiveDeadlineSeconds and BackoffLimit #550

u2takey · 2018-04-22T05:15:46Z

Add ActiveDeadlineSeconds to tfjob.Spec, like batch.Job's JobSpec.ActiveDeadlineSeconds, which controls the maximum active time for a job.
Add BackoffLimit to tfjob.Spec, like batch.Job's JobSpec.BackoffLimit, which controls the restart limit for a failing job.
Add CleanPolicy to tfjob.Spec to control cleanup of a tfjob's pods after the tfjob is done (success/fail). Timely cleanup is very important for a cloud user.
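
As a rough illustration of the semantics being requested, here is a minimal Python sketch (not tf-operator code, which is written in Go). The `TFJobLimits` class and both helper functions are hypothetical names, and treating an unset field as "no limit" is an assumption. The only thing taken from the proposal is that the two fields should behave like their batch.Job JobSpec counterparts.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TFJobLimits:
    """Hypothetical stand-in for the two proposed spec fields."""
    active_deadline_seconds: Optional[int] = None  # assumption: None means no deadline
    backoff_limit: Optional[int] = None            # assumption: None means no retry limit


def past_active_deadline(limits: TFJobLimits, start_time: datetime, now: datetime) -> bool:
    """True once the job has been active for at least the configured deadline."""
    if limits.active_deadline_seconds is None:
        return False
    return now - start_time >= timedelta(seconds=limits.active_deadline_seconds)


def past_backoff_limit(limits: TFJobLimits, restarts: int) -> bool:
    """True once the number of restarts exceeds the configured limit."""
    if limits.backoff_limit is None:
        return False
    return restarts > limits.backoff_limit
```

A controller could call both checks on every sync and mark the job as failed as soon as either returns true.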

gaocegege · 2018-04-22T05:23:36Z

Thanks for the issue. We have some discussions about the restartpolicy here #544

As for the clean policy, I am not sure that adding a clean policy is the solution. We have some discussions in #536

We prefer to clean up the pods ASAP, though keeping the logs is another thing that we should consider.

u2takey · 2018-04-22T06:23:17Z

OK, thanks @gaocegege, I will add a comment there.

jlewi · 2018-07-03T12:55:32Z

We added CleanPodPolicy to v1alpha2 and it should be in 0.2.0.

Do we still need/want ActiveDeadlineSeconds and BackoffLimit with the other changes to RestartPolicy?

Is this something we want to fix in 0.3?

We should try to nail down the API in order to get to v1.

gaocegege · 2018-07-13T03:31:19Z

I think we could add ActiveDeadlineSeconds and BackoffLimit, though it is not a high priority.

jlewi · 2019-02-04T18:40:37Z

@richardsliu @johnugeorge Can we make a decision about whether to include these fields as part of the API?

gaocegege · 2019-02-06T08:17:01Z

I think we should add these fields to the API; they are useful.

johnugeorge · 2019-02-06T12:10:00Z

/remove-area 0.4.0
/area 0.5.0

k8s-ci-robot · 2019-02-06T12:10:21Z

@johnugeorge: Those labels are not set on the issue: area/0.4.0

In response to this:

/remove-area 0.4.0
/area 0.5.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

richardsliu · 2019-02-08T19:39:40Z

/assign @ChanYiLin

ChanYiLin · 2019-02-08T19:39:50Z

I can take this issue
Thanks!

gaocegege · 2019-02-11T03:26:57Z

/assign @ChanYiLin

richardsliu · 2019-02-25T19:10:41Z

@ChanYiLin Any update on this?

ChanYiLin · 2019-02-25T22:03:36Z

I have finished the ActiveDeadlineSeconds feature and am now thinking about how to implement BackoffLimit.
In the Kubernetes Job controller, restarts are counted based on the restart count of the pod/container.
However, in tf-operator, when a pod restarts because of its exit code, we delete the pod and recreate it through the API, so the pod carries no restart count.
I have to try another way. Sorry for the delay.
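
The difference described above can be sketched in a few lines of Python. This is only an illustration and not what the eventual PR does: `restarts_from_pods` and `RecreateCounter` are hypothetical names, and the pod dicts merely mimic the `status.containerStatuses[].restartCount` field of a Kubernetes pod.

```python
def restarts_from_pods(pods):
    """Sum container restart counts, as seen when containers restart in place."""
    total = 0
    for pod in pods:
        for status in pod.get("status", {}).get("containerStatuses", []):
            total += status.get("restartCount", 0)
    return total


class RecreateCounter:
    """Hypothetical operator-side counter, bumped on every delete-and-recreate."""

    def __init__(self):
        self._counts = {}

    def record_recreate(self, job_key):
        self._counts[job_key] = self._counts.get(job_key, 0) + 1

    def restarts(self, job_key):
        return self._counts.get(job_key, 0)


# A container restarted in place keeps its restart history on the pod...
in_place = [{"status": {"containerStatuses": [{"restartCount": 2}]}}]
# ...but a freshly recreated pod starts again from zero.
recreated = [{"status": {"containerStatuses": [{"restartCount": 0}]}}]
```

An in-memory counter like this would be lost whenever the operator process restarts, so any real approach along these lines would need to keep the count somewhere durable.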

richardsliu · 2019-03-04T21:51:48Z

@ChanYiLin Is this still on track to be finished in 0.5.0? We are trying to reach code complete by 3/15.

ChanYiLin · 2019-03-04T23:42:56Z

Yes it can be finished this week. I am testing it.

richardsliu · 2019-03-04T23:46:29Z

Thanks!

jlewi · 2019-03-26T04:40:25Z

@ChanYiLin Any update on this?

ChanYiLin · 2019-03-26T10:06:56Z

@jlewi Almost done.
@richardsliu and I just discussed how to implement the unit test.
Also, the e2e test seems to fail intermittently.

ChanYiLin · 2019-03-26T22:37:02Z

@jlewi
the PR has been merged.

gaocegege added kind/feature api/v1alpha2 labels Apr 22, 2018

gaocegege added the priority/p3 label Apr 22, 2018

gaocegege changed the title ~~[v1alpha2] proposal for enchancing tfjob~~ [v1alpha2] Add ActiveDeadlineSeconds and BackoffLimitt Apr 27, 2018

gaocegege changed the title ~~[v1alpha2] Add ActiveDeadlineSeconds and BackoffLimitt~~ [v1alpha2] Add ActiveDeadlineSeconds and BackoffLimit Jun 5, 2018

jlewi added the area/0.3.0 label Jul 3, 2018

gaocegege mentioned this issue Jul 9, 2018

[proposal] cleanup jobs after finished #718

Closed

richardsliu added area/0.4.0 and removed area/0.3.0 labels Oct 11, 2018

carmine added this to the 0.4.0 milestone Nov 6, 2018

jlewi removed this from the 0.4.0 milestone Feb 4, 2019

jlewi added this to New in 0.5.0 via automation Feb 4, 2019

jlewi moved this from New to TFJob/PyTorch 1.0 in 0.5.0 Feb 4, 2019

johnugeorge mentioned this issue Feb 6, 2019

TF operator v1beta2 API #935

Closed

4 tasks

k8s-ci-robot removed the area/0.4.0 label Feb 6, 2019

k8s-ci-robot added the area/0.5.0 label Feb 6, 2019

k8s-ci-robot assigned ChanYiLin Feb 8, 2019

richardsliu changed the title ~~[v1alpha2] Add ActiveDeadlineSeconds and BackoffLimit~~ [v1beta2] Add ActiveDeadlineSeconds and BackoffLimit Feb 8, 2019

johnugeorge mentioned this issue Feb 22, 2019

tolerate a worker fails kubeflow/katib#390

Closed

This was referenced Mar 11, 2019

add ActiveDeadlineSeconds and BackoffLimit features #955

Closed

add ActiveDeadlineSeconds and BackoffLimit features #958

Closed

johnugeorge mentioned this issue Mar 19, 2019

Checking failed Trials by Katib Controller kubeflow/katib#433

Closed

kunmingg mentioned this issue Mar 19, 2019

Ship v0.5.0 release kubeflow/kubeflow#2716

Closed

22 tasks

ChanYiLin mentioned this issue Mar 21, 2019

add ActiveDeadlineSeconds and BackoffLimit features #963

Merged

johnugeorge mentioned this issue Mar 26, 2019

Pytorch operator v1beta2 API kubeflow/pytorch-operator#134

Closed

4 tasks

k8s-ci-robot closed this as completed in #963 Mar 26, 2019

0.5.0 automation moved this from TFJob/PyTorch 1.0 to Done Mar 26, 2019

johnugeorge mentioned this issue Mar 27, 2019

Implement ActiveDeadlineSeconds and BackoffLimit kubeflow/pytorch-operator#151

Merged
