Multi-Worker Distributed Training #16

Open
sahilpatelsp opened this issue Apr 4, 2022 · 2 comments
@sahilpatelsp

I was wondering what the appropriate way is to launch multi-worker distributed training jobs with xmanager. Based on my current understanding, a Job must be created for each worker pool and all of these Jobs must be combined into a single JobGroup, which is then added to the experiment. There also seems to be an option to add Constraints to the JobGroup; however, I cannot find what specific forms these constraints can take beyond the provided example of xm_impl.SameMachine().

Furthermore, my current attempt at launching a multi-worker distributed training job raises the following error when creating the distributed strategy with strategy = tf.distribute.MultiWorkerMirroredStrategy():

RuntimeError: Collective ops must be configured at program startup.

Both the CLUSTER_SPEC and TF_CONFIG environment variables appear to be set correctly, and the distributed strategy is created at the very beginning of the main function, so I was curious whether this error might be due to not setting appropriate Constraints on the JobGroup.
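For reference, this is roughly the shape of my launcher, reduced to a minimal sketch. The packaging details are simplified, and the worker-pool names (chief/worker), entrypoint, and executor are placeholders for my actual setup; the exact spec/executor names may differ depending on the xmanager version and backend.

```python
from xmanager import xm
from xmanager import xm_local

with xm_local.create_experiment(experiment_title='multiworker-tf') as experiment:
  # Package the training code once and reuse the executable for every worker pool.
  [executable] = experiment.package([
      xm.Packageable(
          executable_spec=xm.PythonContainer(
              path='.',  # placeholder for my project directory
              entrypoint=xm.ModuleName('trainer.train'),  # placeholder module
          ),
          executor_spec=xm_local.Vertex.Spec(),
      ),
  ])

  # One Job per worker pool, combined into a single JobGroup and
  # added to the experiment as one unit.
  job_group = xm.JobGroup(
      chief=xm.Job(
          executable=executable,
          executor=xm_local.Vertex(xm.JobRequirements(T4=1)),
      ),
      worker=xm.Job(
          executable=executable,
          executor=xm_local.Vertex(xm.JobRequirements(T4=1)),
      ),
  )
  experiment.add(job_group)
```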

@andrewluchen
Collaborator

We should add a TF distributed training example to the examples.

There is a PyTorch one that you can look at:
https://github.com/deepmind/xmanager/blob/main/examples/cifar10_torch/launcher.py

@sahilpatelsp
Author

Thanks for pointing that out! I modified my code to align with the provided example, with the main change being the use of the async/await syntax from Python's asyncio library. However, I am still getting RuntimeError: Collective ops must be configured at program startup when calling strategy = tf.distribute.MultiWorkerMirroredStrategy() at the very beginning of the main function of the training file. I am unsure what exactly might be responsible for this error, given that I am creating the strategy before calling any other TensorFlow API, as per https://github.com/tensorflow/tensorflow/blob/3f878cff5b698b82eea85db2b60d65a2e320850e/tensorflow/python/distribute/collective_all_reduce_strategy.py#L155.
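For context, the entry point of my training program looks roughly like the sketch below (heavily simplified; the model and data pipeline are placeholders). The strategy is the first TensorFlow call made in the process, with TF_CONFIG expected to be set per worker by the launcher.

```python
import json
import os

import tensorflow as tf


def main():
  # TF_CONFIG is expected to be set by the launcher for each worker, e.g.:
  # {"cluster": {"worker": ["host0:port", "host1:port"]},
  #  "task": {"type": "worker", "index": 0}}
  print(json.loads(os.environ['TF_CONFIG']))

  # Created before any other TensorFlow API call in the process, since
  # collective ops must be configured at program startup.
  strategy = tf.distribute.MultiWorkerMirroredStrategy()

  with strategy.scope():
    # Placeholder model; the real one is built here under the strategy scope.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer='sgd', loss='mse')
  # ... build the dataset and call model.fit(...) here.


if __name__ == '__main__':
  main()
```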
