Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v1beta2] Add ActiveDeadlineSeconds and BackoffLimit #550

Closed
u2takey opened this issue Apr 22, 2018 · 19 comments
Closed

[v1beta2] Add ActiveDeadlineSeconds and BackoffLimit #550

u2takey opened this issue Apr 22, 2018 · 19 comments

Comments

@u2takey
Copy link
Contributor

u2takey commented Apr 22, 2018

  1. add ActiveDeadlineSeconds to tfjob.Spec like batch.Job's JobSpec.ActiveDeadlineSeconds, which control the max active time for a job.
  2. add BackoffLimitt to tfjob.Spec like batch.Job's JobSpec.BackoffLimitt, which control restart limit for a failing job.
  3. add CleanPolicy to tfjob.Spec for control clean up tfjob's pod after tfjob is done(success/fail), clean up on time is very important for a cloud user.
@gaocegege
Copy link
Member

Thanks for the issue. We have some discussions about the restartpolicy here #544

As for the clean policy, I am not sure that adding a clean policy is the solution. We have some discussions in #536

We prefer to cleanup the pods ASAP, while log is another thing that we should consider.

@u2takey
Copy link
Contributor Author

u2takey commented Apr 22, 2018

ok thanks @gaocegege , i will add comment there.

@gaocegege gaocegege changed the title [v1alpha2] proposal for enchancing tfjob [v1alpha2] Add ActiveDeadlineSeconds and BackoffLimitt Apr 27, 2018
@gaocegege gaocegege changed the title [v1alpha2] Add ActiveDeadlineSeconds and BackoffLimitt [v1alpha2] Add ActiveDeadlineSeconds and BackoffLimit Jun 5, 2018
@jlewi
Copy link
Contributor

jlewi commented Jul 3, 2018

We added CleanPodPolicy to v1alpha2 and it should be in 0.2.0.

Do we still need/want ActiveDeadlineSeconds and BackoffLimit with the other changes to RestartPolicy?

Is this something we want to fix in 0.3?

We should try to nail down the API in order to get to v1.

@gaocegege
Copy link
Member

I think we could add ActiveDeadlineSeconds and BackoffLimit, while it is not in high priority

@carmine carmine added this to the 0.4.0 milestone Nov 6, 2018
@jlewi
Copy link
Contributor

jlewi commented Feb 4, 2019

@richardsliu @johnugeorge Can we make a decision about whether to include these fields as part of the API?

@jlewi jlewi removed this from the 0.4.0 milestone Feb 4, 2019
@jlewi jlewi added this to New in 0.5.0 via automation Feb 4, 2019
@jlewi jlewi moved this from New to TFJob/PyTorch 1.0 in 0.5.0 Feb 4, 2019
@johnugeorge johnugeorge mentioned this issue Feb 6, 2019
4 tasks
@gaocegege
Copy link
Member

I think we should add it as API, it is useful.

@johnugeorge
Copy link
Member

johnugeorge commented Feb 6, 2019

/remove-area 0.4.0
/area 0.5.0

@k8s-ci-robot
Copy link

@johnugeorge: Those labels are not set on the issue: area/0.4.0

In response to this:

/remove-area 0.4.0
/area 0.5.0

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@richardsliu
Copy link
Contributor

/assign @ChanYiLin

@ChanYiLin
Copy link
Member

I can take this issue
Thanks!

@richardsliu richardsliu changed the title [v1alpha2] Add ActiveDeadlineSeconds and BackoffLimit [v1beta2] Add ActiveDeadlineSeconds and BackoffLimit Feb 8, 2019
@gaocegege
Copy link
Member

/assign @ChanYiLin

@richardsliu
Copy link
Contributor

@ChanYiLin Any update on this?

@ChanYiLin
Copy link
Member

I have finished ActiveDeadlineSeconds feature and now thinking how to implement BackoffLimit.
According to the implementation of Job controller in K8s, they count the restart count based on the restart count of the pod/container.
However, in tf-operator, when the pod restart because of exit code, we delete the pod and recreate it using api, so there will no restart count in the pod.
I have to try another way, sorry for the late.

@richardsliu
Copy link
Contributor

@ChanYiLin Is this still on track to be finished in 0.5.0? We are trying to reach code complete by 3/15.

@ChanYiLin
Copy link
Member

Yes it can be finished this week. I am testing it.

@richardsliu
Copy link
Contributor

Thanks!

@jlewi
Copy link
Contributor

jlewi commented Mar 26, 2019

@ChanYiLin Any update on this?

@ChanYiLin
Copy link
Member

@jlewi almost done
I and @richardsliu just discussed how to implement the unit test.
Also, it seems that it failed the e2e test sometimes and randomly.

0.5.0 automation moved this from TFJob/PyTorch 1.0 to Done Mar 26, 2019
@ChanYiLin
Copy link
Member

@jlewi
the PR has been merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
0.5.0
  
Done
Development

No branches or pull requests

8 participants