Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't require a master or chief #192

Closed
jlewi opened this issue Dec 1, 2017 · 2 comments
Closed

Don't require a master or chief #192

jlewi opened this issue Dec 1, 2017 · 2 comments
Labels

Comments

@jlewi
Copy link
Contributor

jlewi commented Dec 1, 2017

We should consider removing the requirement to have a master.

A lot of TF code just uses worker 0 as the chief. An example is inception using slim. Adapting such code to work with the existing TfJob API can be awkward. Inception for example uses worker 0 as the chief. If you try to treat the master as the chief and a worker you have problems because the inception code assigns the ops to "'/job:worker/device:GPU:0" with the expectation that worker maps to localhost since no task is specified. But this won't work on the master because the TF job name is "master" not "worker". The easiest work around I found was to just spin up a standard TF gRPC server for the master; this satisfied the TfJob requirement to have a master and also ensured all TF jobs in the ClusterSpec corresponded to valid gRPC endpoints.

TfJob could probably support code like this just by dropping the requirement that there be a master for a job. We'd have to adjust the exit criterion; we'd probably want to run until all workers finished.

Related to #61

@jlewi
Copy link
Contributor Author

jlewi commented Dec 4, 2017

Proposal. Add a termination policy to the TfJob; something like

terminationPolicy:
  chief:
     replicaName: MASTER
     replicaIndex: 0 

The chief policy corresponds to waiting for a particular process which is the chief to exit. By letting the user specify the replica and replica index to use we can easily accomodate the cases where

  • Chief/Master runs as a dedicated replica type
  • Chief is just worker index 0
    • This appears to the pattern used by the TensorFlow estimator API starting with 1.4

The reason for using replicaName and not replicaType is I suspect we will eventually want to replace replicaType with a set of attributes controlling various aspects of the replica such as termination policy (This was one of the suggestions that came up in the internal review see #64).

An example where that might be useful is if we want to run model evaluation in a separate set of processes and terminate training when evaluation metrics satisfy some criterion.

For compatibility with the current use of replicaType's, the replicaName will just be the string version of the enum.

/cc @jhseu @mrry

@jlewi jlewi added the area/api label Dec 4, 2017
@jlewi
Copy link
Contributor Author

jlewi commented Dec 6, 2017

Some getting started pointers in case someone is considering taking this on

  • terminationPolicy would be defined by modifying the TfJobSpec
  • Modify SetDefaults to automatically set a default terminationPolicy if one isn't set
    • The default policy should correspond to the existing behavior.
  • The previous changes above would probably make a good first PR.

Next PR would be to add support for a policy which used the WORKER replica 0 as the chief.

  • Update GetStatus to determine the state based on the replica and index specified in the termination policy
  • Update Validate
    • master should no longer be required
    • Ensure the replica corresponding to the chief is defined; (i.e if chief is worker 0 then job should include a worker replica).

@jlewi jlewi closed this as completed in 08ec97f Dec 11, 2017
jlewi pushed a commit that referenced this issue Dec 20, 2017
Allow training jobs to work without a master by treating one of the workers as the chiefs.

* Fixes #192

* This will allow TfJobs to be used with a lot of existing TensorFlow programs without modification since using worker 0 as the chief is a common pattern. Currently to run these programs using TfJob's  you need to spin up a dummy TensorFlow gRPC server just to serve as the master.

* This is also necessary to support changes in estimator API with TF 1.4 (#61)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant