Fix error report of DSElasticAgent._set_master_addr_port() (#4985)
**The error**
Fixes #4459 
```
rogpt1: Traceback (most recent call last):      
rogpt1:   File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
rogpt1:     return _run_code(code, main_globals, None,                                                                
rogpt1:   File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code       
rogpt1:     exec(code, run_globals)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/launcher/launch.py", line 355, in <module>
rogpt1:     main()
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/launcher/launch.py", line 308, in main
rogpt1:     agent.run()
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
rogpt1:     result = f(*args, **kwargs)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
rogpt1:     result = self._invoke_run(role)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/elasticity/elastic_agent.py", line 126, in _invoke_run
rogpt1:     self._initialize_workers(self._worker_group)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
rogpt1:     result = f(*args, **kwargs)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
rogpt1:     self._rendezvous(worker_group)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
rogpt1:     result = f(*args, **kwargs)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 556, in _rendezvous
rogpt1:     self._set_master_addr_port(
rogpt1: TypeError: DSElasticAgent._set_master_addr_port() takes 3 positional arguments but 4 were given
```

**The reason**
PyTorch calls
[_set_master_addr_port()](https://github.com/pytorch/pytorch/blob/e732adf0a7f31159dc827e563106279bf969144a/torch/distributed/elastic/agent/server/api.py#L513)
with four arguments, but the "_set_master_addr_port()" override in
"DSElasticAgent" accepts only three.
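
The mismatch can be reproduced in isolation. The sketch below is a minimal illustration, not the PyTorch or DeepSpeed code: `BaseAgent`, `OldOverrideAgent`, `rendezvous`, the `dict` store, and the address values are all hypothetical stand-ins. Only the method name and the argument counts come from the report above.

```python
class BaseAgent:
    """Hypothetical stand-in for the base agent that owns the call site."""

    @staticmethod
    def _set_master_addr_port(store, master_addr, master_port, local_addr=None):
        # Simplified body for illustration only.
        store["MASTER_ADDR"] = master_addr
        store["MASTER_PORT"] = master_port

    def rendezvous(self, store):
        # The base class passes four positional arguments to the hook.
        self._set_master_addr_port(store, "10.0.0.1", 29500, None)


class OldOverrideAgent(BaseAgent):
    """Hypothetical subclass whose override still takes three parameters."""

    @staticmethod
    def _set_master_addr_port(store, master_addr, master_port):
        store["MASTER_ADDR"] = master_addr
        store["MASTER_PORT"] = master_port


def reproduce() -> str:
    """Return the TypeError message raised by the stale override, if any."""
    try:
        OldOverrideAgent().rendezvous({})
    except TypeError as exc:
        return str(exc)
    return ""


if __name__ == "__main__":
    print(reproduce())
```

Because the call site lives in the base class, the subclass fails only at rendezvous time, which is why the traceback ends inside PyTorch's `api.py` rather than in DeepSpeed code.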

**The solution**
Add a trailing "local_addr" argument, matching the PyTorch signature. To
avoid changing the current behavior, give it a default value of "None".
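
A short sketch of why the default value keeps both call styles working. This is an illustration under assumptions, not the DeepSpeed implementation: the free function `set_master_addr_port`, the `dict` store, and the address values are hypothetical, and the body is simplified. It accepts `local_addr` without using it, which mirrors a change that touches only the signature.

```python
from typing import Optional


def set_master_addr_port(store: dict,
                         master_addr: Optional[str],
                         master_port: Optional[int],
                         local_addr: Optional[str] = None) -> None:
    """Hypothetical, simplified stand-in for the patched method."""
    # local_addr is accepted for signature compatibility but not used here.
    store["MASTER_ADDR"] = master_addr
    store["MASTER_PORT"] = master_port


# Older callers that pass three arguments keep working unchanged.
three_arg_store: dict = {}
set_master_addr_port(three_arg_store, "10.0.0.1", 29500)

# Newer callers that pass a fourth positional argument no longer raise.
four_arg_store: dict = {}
set_master_addr_port(four_arg_store, "10.0.0.1", 29500, "10.0.0.2")
```

Both calls leave the store in the same state, so the extra parameter changes which calls are accepted without changing what the function does.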

---------

Co-authored-by: Robin Dong <rdong@woolworths.com.au>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
4 people committed Jan 26, 2024
1 parent 62afafe commit d2e9adc
Showing 1 changed file with 4 additions and 1 deletion.
5 changes: 4 additions & 1 deletion deepspeed/elasticity/elastic_agent.py
@@ -39,7 +39,10 @@ def __init__(
         self.ds_env = env

     @staticmethod
-    def _set_master_addr_port(store: Store, master_addr: Optional[str], master_port: Optional[int]):
+    def _set_master_addr_port(store: Store,
+                              master_addr: Optional[str],
+                              master_port: Optional[int],
+                              local_addr: Optional[str] = None):
         if master_port is None:
             sock = _get_socket_with_port()
             with closing(sock):
