Commit
Fix error report of DSElasticAgent._set_master_addr_port() (#4985)
**The error**

Fixes #4459

```
rogpt1: Traceback (most recent call last):
rogpt1:   File "/opt/conda/lib/python3.10/runpy.py", line 196, in _run_module_as_main
rogpt1:     return _run_code(code, main_globals, None,
rogpt1:   File "/opt/conda/lib/python3.10/runpy.py", line 86, in _run_code
rogpt1:     exec(code, run_globals)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/launcher/launch.py", line 355, in <module>
rogpt1:     main()
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/launcher/launch.py", line 308, in main
rogpt1:     agent.run()
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
rogpt1:     result = f(*args, **kwargs)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
rogpt1:     result = self._invoke_run(role)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/deepspeed/elasticity/elastic_agent.py", line 126, in _invoke_run
rogpt1:     self._initialize_workers(self._worker_group)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
rogpt1:     result = f(*args, **kwargs)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
rogpt1:     self._rendezvous(worker_group)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
rogpt1:     result = f(*args, **kwargs)
rogpt1:   File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 556, in _rendezvous
rogpt1:     self._set_master_addr_port(
rogpt1: TypeError: DSElasticAgent._set_master_addr_port() takes 3 positional arguments but 4 were given
```

**The reason**

PyTorch calls [_set_master_addr_port()](https://github.com/pytorch/pytorch/blob/e732adf0a7f31159dc827e563106279bf969144a/torch/distributed/elastic/agent/server/api.py#L513) with four arguments, but the "_set_master_addr_port()" override in "DSElasticAgent" accepts only three.

**The solution**

Add the trailing "local_addr" argument, following PyTorch. To avoid changing the current behavior, its default value is set to "None".

---------

Co-authored-by: Robin Dong <rdong@woolworths.com.au>
Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
Co-authored-by: Logan Adams <loadams@microsoft.com>
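
The mismatch described in this commit message can be reproduced without PyTorch or DeepSpeed. The following is a minimal sketch, not the actual patch: `BaseAgent`, `OldAgent`, `FixedAgent`, and `try_rendezvous` are hypothetical stand-ins, and the method bodies are placeholders. It assumes only what the message states, namely that the base class passes four positional arguments while the override declared three.

```python
class BaseAgent:
    """Hypothetical stand-in for the PyTorch base agent (not the real class)."""

    def rendezvous(self):
        # The base class calls the hook with four positional arguments,
        # the last one being local_addr.
        return self._set_master_addr_port("store", "10.0.0.1", 29500, None)

    @staticmethod
    def _set_master_addr_port(store, master_addr, master_port, local_addr):
        return (master_addr, master_port, local_addr)


class OldAgent(BaseAgent):
    """Override with the pre-fix signature: only three parameters."""

    @staticmethod
    def _set_master_addr_port(store, master_addr, master_port):
        return (master_addr, master_port)


class FixedAgent(BaseAgent):
    """Override with the extra parameter, defaulting to None so that
    existing three-argument callers keep working."""

    @staticmethod
    def _set_master_addr_port(store, master_addr, master_port, local_addr=None):
        return (master_addr, master_port, local_addr)


def try_rendezvous(agent):
    """Run the rendezvous step and report a TypeError instead of raising."""
    try:
        return agent.rendezvous()
    except TypeError as exc:
        return f"TypeError: {exc}"


if __name__ == "__main__":
    print(try_rendezvous(OldAgent()))    # signature mismatch is reported
    print(try_rendezvous(FixedAgent()))  # four-argument call succeeds
    # Three-argument calls still work because local_addr defaults to None.
    print(FixedAgent._set_master_addr_port("store", "10.0.0.1", 29500))
```

Giving the new parameter a default value keeps the override compatible with both older callers that pass three arguments and newer ones that pass four.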