Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set a default value for restartPolicy #55

Closed
jlewi opened this issue Oct 17, 2017 · 3 comments
Closed

Set a default value for restartPolicy #55

jlewi opened this issue Oct 17, 2017 · 3 comments

Comments

@jlewi
Copy link
Contributor

jlewi commented Oct 17, 2017

Users shouldn't need to explicitly set restart policy. We should be able to pick sensible values.

@paramgo
Copy link

paramgo commented Nov 22, 2017

Yes, not sure about the severity of the issue, however if you leave out the restartPolicy, it seems K8s 1.8 defaults to Always which results in the creation of master/workers but not ps (error below):

I  Creating Service: master-xyzz-0 
I  Service master-xyzz-0 already exists. 
I  Creating Job: master-xyzz-0 
I  master-xyzz-0 already exists. 
I  Creating Service: worker-xyzz-0 
I  Service worker-xyzz-0 already exists. 
I  Creating Job: worker-xyzz-0 
I  worker-xyzz-0 already exists. 
I  Creating Service: ps-xyzz-0 
I  Service ps-xyzz-0 already exists. 
I  Creating Job: ps-xyzz-0 
E  trainingJobCreateReplicas() error; [Creating Job ps-xyzz-0 returned error., Job.batch "ps-xyzz-0" is invalid: spec.template.spec.restartPolicy: Unsupported value: "Always": supported values: OnFailure, Never] 
  undefined
A simple TfJob configuration with one master, one worker, and one ps server would suffice where the restartPolicy for ps server is omitted (while it is present for master/worker).

@ScorpioCPH
Copy link
Member

After discuss, We will extend kubernetes built-in RestartPolicy by adding new policy ExitCode:

    RestartPolicyAlways    RestartPolicy = "Always"
    RestartPolicyOnFailure RestartPolicy = "OnFailure"
    RestartPolicyNever     RestartPolicy = "Never"
    RestartPolicyExitCode  RestartPolicy = "ExitCode"

We let users set this field according to their model code.

  • If set RestartPolicy to OnFailure/Always, user should add reloading checkpoint code by themselves.
  • Otherwise restarting will take no effect.

ExitCode policy means that user should add exit code by themselves, tf-operator will check these exit codes to determine the behavior when an error occurs:

  • 1-127: permanent error, do not restart.
  • 128-255: retryable error, will restart the pod.

@gaocegege
Copy link
Member

dup with #524

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants