Updated torchrun instructions #2096

Merged · 21 commits · Nov 20, 2023
Changes from 1 commit
46 changes: 26 additions & 20 deletions examples/README.md
@@ -66,7 +66,7 @@ To run it in each of these various modes, use the following commands:
```
* With traditional PyTorch launcher (`torch.distributed.launch` can be used with older versions of PyTorch)
```bash
-python -m torchrun --nproc_per_node 2 --use_env ./nlp_example.py
+torchrun --nproc_per_node 2 ./nlp_example.py
```
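Side note, not part of this diff: `torchrun` exports the distributed environment variables (`RANK`, `LOCAL_RANK`, `WORLD_SIZE`, `MASTER_ADDR`, `MASTER_PORT`) to every worker process, which is why the old `--use_env` flag is no longer needed. A throwaway check, using a hypothetical `show_dist_env.py` script, could look like this:
```bash
# Throwaway sketch: print the env vars torchrun injects into each worker.
cat > show_dist_env.py <<'EOF'
import os
keys = ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT")
print({k: os.environ.get(k) for k in keys})
EOF
torchrun --nproc_per_node 2 show_dist_env.py
```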
- multi GPUs, multi node (several machines, using PyTorch distributed mode)
* With Accelerate config and launcher, on each machine:
@@ -76,16 +76,19 @@ To run it in each of these various modes, use the following commands:
```
* With PyTorch launcher only (`torch.distributed.launch` can be used in older versions of PyTorch)
```bash
-python -m torchrun --nproc_per_node 2 \
-    --use_env \
-    --node_rank 0 \
-    --master_addr master_node_ip_address \
-    ./nlp_example.py # On the first server
-python -m torchrun --nproc_per_node 2 \
-    --use_env \
-    --node_rank 1 \
-    --master_addr master_node_ip_address \
-    ./nlp_example.py # On the second server
+torchrun --nproc_per_node 2 \
+    --nnodes 2 \
+    --rdzv_id 2299 \ # A unique job id
+    --rdzv_backend c10d \
+    --rdzv_endpoint master_node_ip_address:29500 \
+    ./nlp_example.py # On the first server

+torchrun --nproc_per_node 2 \
+    --nnodes 2 \
+    --rdzv_id 2299 \ # A unique job id
+    --rdzv_backend c10d \
+    --rdzv_endpoint master_node_ip_address:29500 \
+    ./nlp_example.py # On the second server
```
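Note, not part of this diff: with the `c10d` rendezvous backend the command is identical on both servers, so one way to avoid copy-paste drift is to export the endpoint once and reuse the same invocation on every node. A sketch with a hypothetical `RDZV_ENDPOINT` variable:
```bash
# Sketch: the exact same command can be pasted on every participating node.
export RDZV_ENDPOINT=master_node_ip_address:29500
torchrun --nproc_per_node 2 \
    --nnodes 2 \
    --rdzv_id 2299 \
    --rdzv_backend c10d \
    --rdzv_endpoint $RDZV_ENDPOINT \
    ./nlp_example.py
```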
- (multi) TPUs
* With Accelerate config and launcher
@@ -154,7 +157,7 @@ To run it in each of these various modes, use the following commands:
```
* With traditional PyTorch launcher (`torch.distributed.launch` can be used with older versions of PyTorch)
```bash
-python -m torchrun --nproc_per_node 2 --use_env ./cv_example.py --data_dir path_to_data
+torchrun --nproc_per_node 2 ./cv_example.py --data_dir path_to_data
```
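Note, not part of this diff: for single-machine runs, recent `torchrun` versions also accept a `--standalone` flag that sets up a local rendezvous automatically, so no address or port needs to be chosen. A possible variant:
```bash
# Sketch: --standalone starts a local rendezvous, useful for one-machine runs.
torchrun --standalone --nproc_per_node 2 ./cv_example.py --data_dir path_to_data
```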
- multi GPUs, multi node (several machines, using PyTorch distributed mode)
* With Accelerate config and launcher, on each machine:
@@ -164,15 +167,18 @@ To run it in each of these various modes, use the following commands:
```
* With PyTorch launcher only (`torch.distributed.launch` can be used with older versions of PyTorch)
```bash
-python -m torchrun --nproc_per_node 2 \
-    --use_env \
-    --node_rank 0 \
-    --master_addr master_node_ip_address \
+torchrun --nproc_per_node 2 \
+    --nnodes 2 \
+    --rdzv_id 2299 \ # A unique job id
+    --rdzv_backend c10d \
+    --rdzv_endpoint master_node_ip_address:29500 \
     ./cv_example.py --data_dir path_to_data # On the first server
-python -m torchrun --nproc_per_node 2 \
-    --use_env \
-    --node_rank 1 \
-    --master_addr master_node_ip_address \

+torchrun --nproc_per_node 2 \
+    --nnodes 2 \
+    --rdzv_id 2299 \ # A unique job id
+    --rdzv_backend c10d \
+    --rdzv_endpoint master_node_ip_address:29500 \
     ./cv_example.py --data_dir path_to_data # On the second server
```
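Note, not part of this diff: the `accelerate launch` path referenced above can express the same multi-node setup directly from the command line. A rough sketch (values are placeholders; `--machine_rank` is 0 on the first server and 1 on the second):
```bash
# Sketch: a total of 4 processes across 2 machines with 2 GPUs each.
accelerate launch --multi_gpu \
    --num_processes 4 \
    --num_machines 2 \
    --machine_rank 0 \
    --main_process_ip master_node_ip_address \
    --main_process_port 29500 \
    ./cv_example.py --data_dir path_to_data
```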
- (multi) TPUs