Description
Hi Ross,
I tried to use two AWS machines to run distributed training with the commands you provided:

```bash
# On the first machine (node_rank=0):
python -m torch.distributed.launch --nproc_per_node=1 --master_addr=MASTER_ADDR --master_port=MASTER_PORT --nnodes=2 --node_rank=0 train.py "$@"
# On the second machine (node_rank=1):
python -m torch.distributed.launch --nproc_per_node=1 --master_addr=MASTER_ADDR --master_port=MASTER_PORT --nnodes=2 --node_rank=1 train.py "$@"
```
MASTER_ADDR and MASTER_PORT are the IP address and port of the first machine; the second machine connects using that same master address and port.
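Concretely, what I mean is something like this (172.31.0.10 and 29500 are just example values, not my real address and port):

```bash
# Example values only: 172.31.0.10 stands for the first machine's private IP,
# 29500 for a free port on it. Both machines use the same pair.
MASTER_ADDR=172.31.0.10
MASTER_PORT=29500
# node_rank=0 on the first machine; the second machine runs the same command with node_rank=1.
python -m torch.distributed.launch --nproc_per_node=1 --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT --nnodes=2 --node_rank=0 train.py "$@"
```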
When I run these commands, the first machine waits for the second one, but the second machine cannot reach the first. I guess I need to include the .pem key in the command, but I do not know how to add it.
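For reference, a minimal way to check whether the second machine can open a TCP connection to the master at all (MASTER_ADDR and MASTER_PORT are the same placeholders as above):

```bash
# Run on the second machine; substitute the real address and port.
# If this times out, the port is probably blocked, e.g. by the AWS security group rules.
nc -vz MASTER_ADDR MASTER_PORT
```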
Do you have any idea how to deal with this issue?
Best,
Yunyang