Multi-Worker Distributed Training #16

Open
sahilpatelsp opened this issue Apr 4, 2022 · 2 comments
@sahilpatelsp

I was wondering what the appropriate way is to launch multi-worker distributed training jobs with xmanager. Based on my current understanding, a Job must be created for each worker pool and all of these Jobs must be combined into a single JobGroup, which is then added to the experiment. There also seems to be an option to add Constraints to the JobGroup; however, I cannot find what specific forms these constraints can take beyond the provided example of xm_impl.SameMachine().

Furthermore, my current attempt at launching a multi-worker distributed training job raises the following error when creating the distributed strategy with strategy = tf.distribute.MultiWorkerMirroredStrategy():

RuntimeError: Collective ops must be configured at program startup.

Both the CLUSTER_SPEC and TF_CONFIG environment variables appear to be set correctly, and the distributed strategy is created at the very beginning of the main function, so I was curious whether this error might be due to not setting appropriate Constraints on the JobGroup.
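For reference, this is roughly the shape of my launcher, reduced to a minimal sketch. The packaging details are simplified, and the worker-pool names (chief/worker), entrypoint, and executor are placeholders for my actual setup; the exact spec/executor names may differ depending on the xmanager version and backend.

```python
from xmanager import xm
from xmanager import xm_local

with xm_local.create_experiment(experiment_title='multiworker-tf') as experiment:
  # Package the training code once and reuse the executable for every worker pool.
  [executable] = experiment.package([
      xm.Packageable(
          executable_spec=xm.PythonContainer(
              path='.',  # placeholder for my project directory
              entrypoint=xm.ModuleName('trainer.train'),  # placeholder module
          ),
          executor_spec=xm_local.Vertex.Spec(),
      ),
  ])

  # One Job per worker pool, combined into a single JobGroup and
  # added to the experiment as one unit.
  job_group = xm.JobGroup(
      chief=xm.Job(
          executable=executable,
          executor=xm_local.Vertex(xm.JobRequirements(T4=1)),
      ),
      worker=xm.Job(
          executable=executable,
          executor=xm_local.Vertex(xm.JobRequirements(T4=1)),
      ),
  )
  experiment.add(job_group)
```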

@andrewluchen
Collaborator

We should add a TF distributed training example to the examples.

There is a PyTorch one that you can look at:
https://github.com/deepmind/xmanager/blob/main/examples/cifar10_torch/launcher.py

@sahilpatelsp
Author

Thanks for pointing that out! I modified my code to align with the provided example, with the main change being the use of the async/await syntax from Python's asyncio library. However, I am still getting RuntimeError: Collective ops must be configured at program startup when calling strategy = tf.distribute.MultiWorkerMirroredStrategy() at the very beginning of the main function of the training file. I am unsure what exactly might be responsible for this error, given that I am creating the strategy before calling any other TensorFlow API, as per https://github.com/tensorflow/tensorflow/blob/3f878cff5b698b82eea85db2b60d65a2e320850e/tensorflow/python/distribute/collective_all_reduce_strategy.py#L155.
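For context, the entry point of my training program looks roughly like the sketch below (heavily simplified; the model and data pipeline are placeholders). The strategy is the first TensorFlow call made in the process, with TF_CONFIG expected to be set per worker by the launcher.

```python
import json
import os

import tensorflow as tf


def main():
  # TF_CONFIG is expected to be set by the launcher for each worker, e.g.:
  # {"cluster": {"worker": ["host0:port", "host1:port"]},
  #  "task": {"type": "worker", "index": 0}}
  print(json.loads(os.environ['TF_CONFIG']))

  # Created before any other TensorFlow API call in the process, since
  # collective ops must be configured at program startup.
  strategy = tf.distribute.MultiWorkerMirroredStrategy()

  with strategy.scope():
    # Placeholder model; the real one is built here under the strategy scope.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer='sgd', loss='mse')
  # ... build the dataset and call model.fit(...) here.


if __name__ == '__main__':
  main()
```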
