[BUG] cannot import name '_get_socket_with_port' from 'torch.distributed.elastic.agent.server.api' #5603

Closed
fahadh4ilyas opened this issue Jun 3, 2024 · 4 comments · Fixed by #5654
Labels: bug, training

@fahadh4ilyas

Describe the bug
Running ds_report fails with the following ImportError:

[2024-06-03 12:37:57,636] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0+45fff310c8), only 1.0.0 is known to be compatible
/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py:62: UserWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  warnings.warn(
/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/torch/cuda/amp/autocast_mode.py:75: UserWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  warnings.warn(
Traceback (most recent call last):
  File "/home/fahadh/anaconda3/envs/easycontext/bin/ds_report", line 3, in <module>
    from deepspeed.env_report import cli_main
  File "/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/deepspeed/__init__.py", line 26, in <module>
    from . import module_inject
  File "/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/deepspeed/module_inject/__init__.py", line 6, in <module>
    from .replace_module import replace_transformer_layer, revert_transformer_layer, ReplaceWithTensorSlicing, GroupQuantizer, generic_injection
  File "/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/deepspeed/module_inject/replace_module.py", line 607, in <module>
    from ..pipe import PipelineModule
  File "/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/deepspeed/pipe/__init__.py", line 6, in <module>
    from ..runtime.pipe import PipelineModule, LayerSpec, TiedLayerSpec
  File "/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/deepspeed/runtime/pipe/__init__.py", line 6, in <module>
    from .module import PipelineModule, LayerSpec, TiedLayerSpec
  File "/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/deepspeed/runtime/pipe/module.py", line 19, in <module>
    from ..activation_checkpointing import checkpointing
  File "/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 26, in <module>
    from deepspeed.runtime.config import DeepSpeedConfig
  File "/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/deepspeed/runtime/config.py", line 42, in <module>
    from ..elasticity import (
  File "/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/deepspeed/elasticity/__init__.py", line 10, in <module>
    from .elastic_agent import DSElasticAgent
  File "/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/deepspeed/elasticity/elastic_agent.py", line 9, in <module>
    from torch.distributed.elastic.agent.server.api import _get_socket_with_port
ImportError: cannot import name '_get_socket_with_port' from 'torch.distributed.elastic.agent.server.api' (/home/fahadh/anaconda3/envs/easycontext/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py)

To Reproduce
Steps to reproduce the behavior:

  1. Install the PyTorch nightly build: pip install --pre torch==2.4.0.dev20240602 --index-url https://download.pytorch.org/whl/nightly/cu121
  2. Install DeepSpeed: pip install deepspeed
  3. Call ds_report
  4. See error
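The failing import can also be checked in isolation, without going through DeepSpeed. The following is a minimal sketch: the helper name has_name is hypothetical, and the module path and symbol name are the ones shown in the traceback above.

```python
import importlib


def has_name(module_name: str, attr: str) -> bool:
    """Return True if `attr` exists on the importable module `module_name`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # The module itself could not be imported (e.g. torch is not installed).
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # Module path and symbol name are taken from the traceback in this issue.
    print(has_name("torch.distributed.elastic.agent.server.api",
                   "_get_socket_with_port"))
```

If this prints False on an environment where torch is installed, the symbol is missing from that torch build, which matches the ImportError reported here.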

Expected behavior
ds_report runs and prints its report.

System info (please complete the following information):

  • OS: Ubuntu 22.04
  • one machine with 1 A100
  • Python 3.10
@fahadh4ilyas fahadh4ilyas added bug Something isn't working training labels Jun 3, 2024
@fivejjs

fivejjs commented Jun 4, 2024

Got the same error when running the unit tests for the cross-encoder in Sentence Transformers.

@saurabh-kataria

Facing the same issue.

@saurabh-kataria

saurabh-kataria commented Jun 6, 2024

A quick workaround is to edit the following file:
<path-to-anaconda-environment>/lib/python3.8/site-packages/deepspeed/elasticity/elastic_agent.py

  1. Comment out the following line:
    from torch.distributed.elastic.agent.server.api import _get_socket_with_port

  2. Then add the following definition, which I lifted from an older version of PyTorch:

import logging
import socket

# Fallback logger so the snippet is self-contained;
# elastic_agent.py may already define its own `log`.
log = logging.getLogger(__name__)


def _get_socket_with_port() -> socket.socket:
    """Return a socket bound to a free port on localhost.

    The free port is "reserved" by binding a temporary socket on it.
    Close the socket before passing the port to the entity that
    requires it. Usage example::

        sock = _get_socket_with_port()
        with closing(sock):
            port = sock.getsockname()[1]
        # the socket is closed here; there is still a race condition:
        # some other process may grab this port before func() runs
        func(port)
    """
    addrs = socket.getaddrinfo(
        host="localhost", port=None, family=socket.AF_UNSPEC, type=socket.SOCK_STREAM
    )
    for addr in addrs:
        # `socktype` avoids shadowing the built-in `type`
        family, socktype, proto, _, _ = addr
        s = socket.socket(family, socktype, proto)
        try:
            # Port 0 asks the OS to pick any free port
            s.bind(("localhost", 0))
            s.listen(0)
            return s
        except OSError as e:
            s.close()
            log.info("Socket creation attempt failed.", exc_info=e)
    raise RuntimeError("Failed to create a socket")

@loadams
Contributor

loadams commented Jun 18, 2024

Hi @fahadh4ilyas - can you test with the linked PR?

mauryaavinash95 pushed a commit to DataStates/DeepSpeed that referenced this issue Oct 3, 2024