[BUG]: Colossalai fails to support IPv6 in init_global_dist() #2199

@tongping

Description

🐛 Describe the bug

The current implementation in init_global_dist() is as follows:
    def init_global_dist(self, rank: int, world_size: int, backend: str, host: str, port: int):
        init_method = f'tcp://{host}:{port}'
        dist.init_process_group(rank=rank, world_size=world_size, backend=backend, init_method=init_method)

However, this implementation does not support IPv6 addresses. For instance, with an IPv6 address such as fdbd:dc03:9:222:4300::55, the init_method becomes:

    tcp://fdbd:dc03:9:222:4300::55:29500

PyTorch then fails to parse this URL, because it cannot tell which part is the IP address and which part is the port number: the address itself contains colons. It ends up trying to parse the address text as the port, reported as "fdbd:dc03:9:222:4300::55:29500" in the error below.

The following call stack shows the failure:

    File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 579, in init_process_group
      store, rank, world_size = next(rendezvous_iterator)
    File "/usr/local/lib/python3.7/dist-packages/torch/distributed/rendezvous.py", line 169, in _tcp_rendezvous_handler
      if not result.port:
    File "/usr/lib/python3.7/urllib/parse.py", line 169, in port
      port = int(port, 10)
    ValueError: invalid literal for int() with base 10: 'fdbd:dc03:9:222:4300::55:29500'
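The parsing step at the bottom of this call stack can be reproduced without PyTorch, using only the standard library's urllib.parse. The snippet below is a minimal sketch: the helper name try_port is hypothetical, and the exact error message may differ between Python versions, so only the exception type is checked.

```python
from urllib.parse import urlparse


def try_port(url):
    """Return the parsed port, or the ValueError raised while parsing.

    Hypothetical helper for illustration only; not part of ColossalAI or PyTorch.
    """
    try:
        return urlparse(url).port
    except ValueError as exc:
        return exc


# Unbracketed IPv6 literal: everything after the first ':' of the
# network location is treated as the port, so int conversion fails.
unbracketed = try_port("tcp://fdbd:dc03:9:222:4300::55:29500")

# Bracketed IPv6 literal: the brackets delimit the host, so the
# trailing ':29500' is unambiguously the port.
bracketed = try_port("tcp://[fdbd:dc03:9:222:4300::55]:29500")

print(type(unbracketed).__name__)
print(bracketed)
```

The bracketed form is the standard URL syntax for IPv6 literals, which is why the parser can separate host and port in the second case.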

This is relatively easy to fix with a one-line change. Replace

    init_method = f'tcp://{host}:{port}'

with

    init_method = f'tcp://[{host}]:{port}'
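A slightly more defensive variant of this fix is sketched below. It is an assumption on top of the proposal above, not the change that was actually merged: it adds brackets only when the host parses as an IPv6 literal, so that IPv4 addresses and hostnames keep their usual unbracketed form. The helper name build_init_method is hypothetical.

```python
import ipaddress


def build_init_method(host: str, port: int) -> str:
    """Build a tcp:// init_method URL, bracketing IPv6 literals.

    Hypothetical helper for illustration; not an existing ColossalAI API.
    """
    try:
        is_ipv6 = ipaddress.ip_address(host).version == 6
    except ValueError:
        # Not an IP literal (e.g. a hostname, or a host that is
        # already bracketed); leave it unchanged.
        is_ipv6 = False
    if is_ipv6:
        host = f"[{host}]"
    return f"tcp://{host}:{port}"


print(build_init_method("fdbd:dc03:9:222:4300::55", 29500))
print(build_init_method("127.0.0.1", 29500))
print(build_init_method("localhost", 29500))
```

The resulting string could then be passed as init_method to dist.init_process_group, as in the original function.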

Environment

No response
