Description
🐛 Describe the bug
The current implementation in init_global_dist() is as follows:
def init_global_dist(self, rank: int, world_size: int, backend: str, host: str, port: int):
    init_method = f'tcp://{host}:{port}'
    dist.init_process_group(rank=rank, world_size=world_size, backend=backend, init_method=init_method)
However, this implementation does not support IPv6 addresses. For instance, an IPv6 address can look like this:
fdbd:dc03:9:222:4300::55. The resulting init_method is then:
tcp://fdbd:dc03:9:222:4300::55:29500.
PyTorch cannot tell which part of this string is the IP address and which part is the port number, because both use ':' as a separator. It ends up trying to parse the whole string "fdbd:dc03:9:222:4300::55:29500" as the port.
The failure occurs in the following call stack:
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 579, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 169, in _tcp_rendezvous_handler
    if not result.port:
  File "/usr/lib/python3.7/urllib/parse.py", line 169, in port
    port = int(port, 10)
ValueError: invalid literal for int() with base 10: 'fdbd:dc03:9:222:4300::55:29500'
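The parsing failure can be reproduced without PyTorch, since the call stack shows the port being read through urllib.parse. The snippet below is a minimal sketch that uses only the standard library; the helper name parse_host_port is made up for illustration and is not part of PyTorch or this project.

```python
from urllib.parse import urlparse


def parse_host_port(url: str):
    """Return (hostname, port) for a URL; hypothetical helper for illustration."""
    result = urlparse(url)
    # Accessing .port raises ValueError when the port text is not an integer.
    return result.hostname, result.port


bare = "tcp://fdbd:dc03:9:222:4300::55:29500"
bracketed = "tcp://[fdbd:dc03:9:222:4300::55]:29500"

try:
    parse_host_port(bare)
except ValueError as exc:
    # Without brackets, the text after the first ':' is treated as the port.
    print("unbracketed URL failed:", exc)

# With brackets, the IPv6 literal and the port are separated unambiguously.
print(parse_host_port(bracketed))
```

The exact error text may differ between Python versions, but the unbracketed form should raise ValueError in each case.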
This can be fixed with a one-line change: wrap the host in square brackets.
Change
init_method = f'tcp://{host}:{port}'
to
init_method = f'tcp://[{host}]:{port}'
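One possible refinement, which goes beyond the one-line change proposed here, is to add the brackets only when the host is an IPv6 literal, so that IPv4 addresses and hostnames keep their usual form. The sketch below assumes this is desirable. The function name build_tcp_init_method is hypothetical, and the check relies on the standard-library ipaddress module.

```python
import ipaddress


def build_tcp_init_method(host: str, port: int) -> str:
    """Build a tcp:// init_method string, bracketing IPv6 literals only.

    Hypothetical helper; not part of the existing code base.
    """
    try:
        is_ipv6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        # Not an IP literal (e.g. a hostname, or an already bracketed address).
        is_ipv6 = False
    if is_ipv6:
        host = f"[{host}]"
    return f"tcp://{host}:{port}"


print(build_tcp_init_method("fdbd:dc03:9:222:4300::55", 29500))
print(build_tcp_init_method("127.0.0.1", 29500))
print(build_tcp_init_method("localhost", 29500))
```

The returned string could then be passed as init_method to dist.init_process_group in init_global_dist().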
Environment
No response