
Support multinode training on GPU #2731

Closed
py4 opened this issue Apr 15, 2020 · 18 comments · Fixed by #11803

py4 commented Apr 15, 2020

I don't have a node with 8 GPUs; I have two nodes, each with 4 GPUs. So is it possible to train a model on multiple nodes?

@mattjj added the "question (Questions for the JAX team)" label Apr 16, 2020
@hawkinsp added the "enhancement (New feature or request)" label Apr 16, 2020
@hawkinsp (Member)

This is actually something that does work right now but it's still experimental. There's also no real public-facing API for it yet; you have to type in some obscure and fairly magical things to set it all up correctly.

We should polish it off and document it!

@hawkinsp (Member)

Can you say a bit more about your model, though? Would gradient all-reductions across multiple nodes suffice?

@hawkinsp changed the title from "training on multiple nodes?" to "Support multinode training on GPU" Apr 16, 2020
py4 (Author) commented Apr 16, 2020

@hawkinsp Technically, I'm training a Reformer model using the Trax library.

@hawkinsp (Member)

And I assume you're just looking for data parallelism, i.e., partitioning a minibatch across GPUs, not partitioning in any other way (e.g., model parallelism)?

py4 (Author) commented Apr 16, 2020

@hawkinsp Yeah, my concern is data parallelism.
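(For context: in JAX, this kind of data parallelism is typically expressed with `pmap` plus a `lax.pmean` over gradients. Below is a minimal sketch with illustrative names, a toy least-squares loss, and a hard-coded SGD step; it is not code from this thread.)

```python
from functools import partial

import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Toy least-squares loss; any per-example loss works the same way.
    x, y = batch
    return jnp.mean((x @ params - y) ** 2)

@partial(jax.pmap, axis_name="batch")
def update(params, batch):
    grads = jax.grad(loss_fn)(params, batch)
    # Gradient all-reduction: average gradients across all devices
    # (and, in a multi-process run, across all nodes).
    grads = jax.lax.pmean(grads, axis_name="batch")
    return params - 0.1 * grads  # plain SGD step

# Replicate the parameters and shard the minibatch over local devices.
n_local = jax.local_device_count()
params = jax.device_put_replicated(jnp.zeros((3,)), jax.local_devices())
x = jnp.ones((n_local, 8, 3))  # [devices, per-device batch, features]
y = jnp.ones((n_local, 8))
params = update(params, (x, y))
```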

@hawkinsp removed the "question (Questions for the JAX team)" label Apr 16, 2020
@powderluv

@hawkinsp Can you please share your notes on this (we don't need a stable API)? We are trying some hybrid data/model/pipeline parallelism, so our case is a little different from @py4's, but we would love to get started with data parallelism.

@brettkoonce (Contributor)

Data parallelism would be of value to other projects that use XLA as well (e.g., https://www.tensorflow.org/swift). Exposing this functionality in a standardized way would help drive progress in the broader ecosystem!

yxd886 commented Dec 9, 2020

> I don't have a node with 8 GPUs; I have two nodes, each with 4 GPUs. So is it possible to train a model on multiple nodes?

Hello py4, I am running into the same problem. Have you found a solution?

yxd886 commented Dec 9, 2020

> This is actually something that does work right now but it's still experimental. There's also no real public-facing API for it yet; you have to type in some obscure and fairly magical things to set it all up correctly.
>
> We should polish it off and document it!

Hello hawkinsp, could you please provide more details about how to run data-parallel training across multiple GPU nodes?

@connection-on-fiber-bundles

@hawkinsp We are also interested in running JAX code on multiple nodes. Anything (hacky or not) that you can share would be appreciated. Thanks!

jramapuram commented Feb 21, 2021

I really enjoyed JAX during my DM internship and wanted to use it on my university SLURM cluster, but the lack of a clear (official) data-parallel (multi-node) solution is a huge blocker to increasing JAX adoption outside of Google, where you can't just grab a TPU pod and pmap across it. A single 8-GPU replica setup can barely train a ResNet-50 ImageNet classifier. Training SimCLR or any other large SOTA model is currently impossible without multi-node data parallelism.

@StellaAthena

I would love this feature! I enjoy JAX, but I've been largely using DeepSpeed due to its ability to distribute across clusters.

jrabary commented Sep 17, 2021

Any progress on this issue? Using JAX to train a model on multiple nodes with multiple GPUs is becoming a very important feature for us.

sudhakarsingh27 (Collaborator) commented Nov 18, 2021

@hawkinsp This is a significant bottleneck for scaling on multi-node GPU clusters. Is there any update on this issue?
Also, there was a recent pjit tutorial that explains multi-node TPU scaling but doesn't mention GPUs. Is it planned to be updated in the future?

@cloudhan (Contributor)

@sudhakarsingh27 I have been constantly monitoring the JAX releases, and there is work in progress that you might be interested in: #8364

@brettkoonce (Contributor)

See also: #9582

hawkinsp (Member) commented Mar 8, 2022

Yes indeed. We haven't advertised it that much yet, but (a) you need to initialize the cluster using that API, and (b) you need to follow the same rules of multi-host programming that also apply on TPU, documented here: https://jax.readthedocs.io/en/latest/multi_process.html

I suspect we can consider this issue closed when we've documented (a) in the document (b).
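
(A minimal sketch of step (a), assuming the `jax.distributed.initialize` API from #8364; the coordinator address, process count, and process id below are placeholders, and one copy of the script runs per process.)

```python
import jax

# (a) Initialize the cluster: every process passes the same coordinator
#     address and total process count, but its own unique process_id.
jax.distributed.initialize(
    coordinator_address="10.0.0.1:1234",  # placeholder IP:port of process 0
    num_processes=2,
    process_id=0,                         # 0 on the first node, 1 on the second
)

# (b) Then follow the usual multi-process rules documented for TPU:
#     each process only controls its local devices, but collectives
#     (e.g. inside pmap) run over the global device set.
print("global devices:", jax.devices())
print("local devices:", jax.local_devices())
```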

sudhakarsingh27 (Collaborator) commented May 20, 2022

@hawkinsp @zhangqiaorjc
Multi-node (or multi-process) execution doesn't seem to work with the following jax/jaxlib versions:

jax                           0.3.13                                                                                                                                                                       
jaxlib                        0.3.10+cuda11.cudnn82

I ran the attached minimal code on a single node with 8 V100 GPUs as follows (2 processes with 4 GPUs each):

CUDA_VISIBLE_DEVICES="0,1,2,3" python jax_multi_node_experiment.py 0 &
CUDA_VISIBLE_DEVICES="4,5,6,7" python jax_multi_node_experiment.py 1

I could verify that multi-process (host/node) execution first fails with jax[cuda]==0.3.12, installed with the following command:

pip install jax[cuda]==0.3.12 -f https://storage.googleapis.com/jax-releases/jax_releases.html

This is the output I get when I run the multi-process commands above (note that each process sees only its own 4 GPUs as global devices):

127.0.0.1:65432 2 1
I0525 00:05:16.228919 139978761119552 distributed.py:59] Connecting to JAX distributed service on 127.0.0.1:65432
I0525 00:05:16.245648 139978761119552 xla_bridge.py:330] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
I0525 00:05:16.246569 139742444975936 xla_bridge.py:330] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
I0525 00:05:18.227763 139978761119552 xla_bridge.py:330] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
I0525 00:05:18.228022 139978761119552 xla_bridge.py:330] Unable to initialize backend 'cuda': make_gpu_client() got an unexpected keyword argument 'platform_name'
I0525 00:05:18.228085 139978761119552 xla_bridge.py:330] Unable to initialize backend 'rocm': make_gpu_client() got an unexpected keyword argument 'platform_name'
global devices= [GpuDevice(id=0, process_index=0), GpuDevice(id=1, process_index=0), GpuDevice(id=2, process_index=0), GpuDevice(id=3, process_index=0)]
local devices= [GpuDevice(id=0, process_index=0), GpuDevice(id=1, process_index=0), GpuDevice(id=2, process_index=0), GpuDevice(id=3, process_index=0)]
I0525 00:05:18.246024 139742444975936 xla_bridge.py:330] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
I0525 00:05:18.246273 139742444975936 xla_bridge.py:330] Unable to initialize backend 'cuda': make_gpu_client() got an unexpected keyword argument 'platform_name'
I0525 00:05:18.246334 139742444975936 xla_bridge.py:330] Unable to initialize backend 'rocm': make_gpu_client() got an unexpected keyword argument 'platform_name'
global devices= [GpuDevice(id=0, process_index=0), GpuDevice(id=1, process_index=0), GpuDevice(id=2, process_index=0), GpuDevice(id=3, process_index=0)]
local devices= [GpuDevice(id=0, process_index=0), GpuDevice(id=1, process_index=0), GpuDevice(id=2, process_index=0), GpuDevice(id=3, process_index=0)]

For reference, here's the output from jax[cuda]==0.3.10, where multi-process execution seems to be working okay:

127.0.0.1:65432 2 1
I0525 00:09:03.394093 140366043674432 distributed.py:59] Connecting to JAX distributed service on 127.0.0.1:65432
I0525 00:09:03.410755 140366043674432 xla_bridge.py:263] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
I0525 00:09:03.410994 140588577851200 xla_bridge.py:263] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker: 
I0525 00:09:05.517608 140366043674432 xla_bridge.py:263] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
global devices= [GpuDevice(id=0, process_index=0), GpuDevice(id=1, process_index=0), GpuDevice(id=2, process_index=0), GpuDevice(id=3, process_index=0), GpuDevice(id=4, process_index=1), GpuDevice(id=5, process_index=1), GpuDevice(id=6, process_index=1), GpuDevice(id=7, process_index=1)]
I0525 00:09:05.517817 140588577851200 xla_bridge.py:263] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
local devices= [GpuDevice(id=4, process_index=1), GpuDevice(id=5, process_index=1), GpuDevice(id=6, process_index=1), GpuDevice(id=7, process_index=1)]
global devices= [GpuDevice(id=0, process_index=0), GpuDevice(id=1, process_index=0), GpuDevice(id=2, process_index=0), GpuDevice(id=3, process_index=0), GpuDevice(id=4, process_index=1), GpuDevice(id=5, process_index=1), GpuDevice(id=6, process_index=1), GpuDevice(id=7, process_index=1)]
local devices= [GpuDevice(id=0, process_index=0), GpuDevice(id=1, process_index=0), GpuDevice(id=2, process_index=0), GpuDevice(id=3, process_index=0)]

sudhakarsingh27 added a commit to sudhakarsingh27/t5x that referenced this issue Jun 23, 2022
To run T5x on multiple nodes and multiple GPUs, `jax.distributed.initialize`
needs to be called with the appropriate setup, as mentioned here:
google/jax#8364.
Added a command-line flag, `multiprocess`, to enable multi-process T5x runs
on GPUs. Also added command-line flags for the arguments to
`jax.distributed.initialize`, namely `coordinator_address`,
`num_processes`, and `process_id`.

Example usage 1 (2 processes, running on 2 separate nodes, 8 GPUs each):
On the first node:
```
python3 ${T5X_DIR}/t5x/train.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt_from_scratch.gin" \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR} \
  --multiprocess \
  --coordinator_address=i.p.ad.dr:port \
  --num_processes=2 \
  --process_id=0
```

On the second node:
```
python3 ${T5X_DIR}/t5x/train.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt_from_scratch.gin" \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR} \
  --multiprocess \
  --coordinator_address=i.p.ad.dr:port \
  --num_processes=2 \
  --process_id=1
```
Notice that `process_id` differs between the two processes. Also,
substitute the appropriate coordinator address for `i.p.ad.dr:port`.

Example usage 2 (1 node, 2 processes, 4 GPUs each):
```
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 ${T5X_DIR}/t5x/train.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt_from_scratch.gin" \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR} \
  --multiprocess \
  --coordinator_address=127.0.0.1:12345 \
  --num_processes=2 \
  --process_id=0 & \
  && CUDA_VISIBLE_DEVICES=4,5,6,7 python3 ${T5X_DIR}/t5x/train.py \
  --gin_file="t5x/examples/t5/t5_1_1/examples/base_wmt_from_scratch.gin" \
  --gin.MODEL_DIR=\"${MODEL_DIR}\" \
  --tfds_data_dir=${TFDS_DATA_DIR} \
  --multiprocess \
  --coordinator_address=127.0.0.1:12345 \
  --num_processes=2 \
  --process_id=1
```

More information about multiprocess JAX runs:
google/jax#2731

Note: T5x partitioning fix: google-research#608
complements this change.

Fixes google-research#410/google-research#89
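
(For readers who don't want to dig through the T5x change itself, the flags described above presumably end up in a call like the sketch below; the flag wiring and the helper name are illustrative, and the actual T5x code may differ.)

```python
import jax
from absl import flags

# Flag names taken from the commit message above; definitions are illustrative.
flags.DEFINE_bool("multiprocess", False, "Enable multi-process JAX on GPUs.")
flags.DEFINE_string("coordinator_address", None, "IP:port of process 0.")
flags.DEFINE_integer("num_processes", 1, "Total number of processes.")
flags.DEFINE_integer("process_id", 0, "Index of this process.")
FLAGS = flags.FLAGS

def maybe_initialize_multiprocess():
    """Hypothetical helper: call jax.distributed.initialize from the flags."""
    if FLAGS.multiprocess:
        jax.distributed.initialize(
            coordinator_address=FLAGS.coordinator_address,
            num_processes=FLAGS.num_processes,
            process_id=FLAGS.process_id,
        )
```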
hawkinsp added a commit to hawkinsp/jax that referenced this issue Aug 8, 2022