This repository contains code that goes along with the Primer on Distributed Training with PyTorch.
python -m torch.distributed.launch --nnodes=1 --node_rank=0 --nproc_per_node=4 --use_env src/main.py
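For reference, below is a minimal sketch of the kind of entry point this launch command expects; the actual `src/main.py` in this repository may be structured differently (model, data, and CLI flags are omitted). With `--use_env`, the launcher exports `LOCAL_RANK` (alongside `RANK`, `WORLD_SIZE`, `MASTER_ADDR`, and `MASTER_PORT`), so the script can initialize the process group with the default `env://` method.

```python
# Minimal sketch of a launcher-compatible entry point; the real src/main.py
# in this repository may differ (model, data, and CLI flags are omitted).
import os

import torch
import torch.distributed as dist


def main():
    # With --use_env, torch.distributed.launch passes the local rank through
    # the LOCAL_RANK environment variable instead of a --local_rank argument.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT are also exported by the
    # launcher, so the default env:// init method picks them up automatically.
    dist.init_process_group(backend="nccl", init_method="env://")

    print(f"rank {dist.get_rank()}/{dist.get_world_size()} ready on GPU {local_rank}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```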
- Step 1: Choose a node as the master and find an available high port (here, in the range 49000-65535) on it for communication with the worker nodes (https://unix.stackexchange.com/a/423052; a Python alternative is sketched after the steps):

MASTER_PORT=`comm -23 <(seq 49000 65535 | sort) <(ss -tan | awk '{print $4}' | cut -d':' -f2 | grep -E '[0-9]{1,5}' | sort -u) | shuf | head -n 1`

- Step 2: Set MASTER_ADDR and MASTER_PORT on all nodes for the launch utility:

export MASTER_ADDR=<MASTER_ADDR> MASTER_PORT=$MASTER_PORT

- Step 3: Launch the master node's process:

python -m torch.distributed.launch --nnodes=<num_nodes> --node_rank=0 --nproc_per_node=<num_gpus_per_node> --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --use_env src/main.py --distributed true

- Step 4: Launch the worker nodes' processes (run on each worker node, setting the appropriate node_rank from 1 to <num_nodes> - 1):

python -m torch.distributed.launch --nnodes=<num_nodes> --node_rank=<node_rank> --nproc_per_node=<num_gpus_per_node> --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --use_env src/main.py --distributed true
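As an alternative to the shell one-liner in Step 1, a free port can also be obtained with a small Python helper. This is only a sketch (the file name `find_free_port.py` is illustrative, not part of the repository) and it asks the OS for any unused port rather than one restricted to the 49000-65535 range.

```python
# Hypothetical helper (not part of this repository): ask the kernel for a free
# TCP port by binding to port 0, instead of scanning a range in the shell.
import socket


def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 tells the OS to pick any unused port
        return s.getsockname()[1]


if __name__ == "__main__":
    # e.g. export MASTER_PORT=$(python find_free_port.py)
    print(find_free_port())
```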