Can't pip install horovod for rocm 5.0+ #3537

Closed
xiaoyu-work opened this issue May 6, 2022 · 11 comments · Fixed by #3588
@xiaoyu-work

xiaoyu-work commented May 6, 2022

Environment:

  1. Framework: (TensorFlow, Keras, PyTorch, MXNet): PyTorch
  2. Framework version: 1.12.0.dev
  3. Horovod version: 0.24.2
  4. MPI version: 3.1
  5. Rocm version: ROCM 5.0+
  6. NCCL version:
  7. Python version: 3.8
  8. Spark / PySpark version:
  9. Ray version:
  10. OS and version: Ubuntu 20.04
  11. GCC version:
  12. CMake version:

Checklist:

  1. Did you search issues to find if somebody asked this question before?
  2. If your question is about hang, did you read this doc?
  3. If your question is about docker, did you read this doc?
  4. Did you check if your question is answered in the troubleshooting guide?

Bug report:
When I run "pip install horovod" on ROCm 5.0.1 and ROCm 5.1.1, I get this error:

Stacktrace:

[pip-requirements.txt]     Found existing installation: numpy 1.    ERROR: Command errored out with exit status 1:
     command: /opt/conda/envs/ptca/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py'"'"'; __file__='"'"'/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-h7ut125g/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/envs/ptca/include/python3.8/horovod
         cwd: /tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/
    Complete output (280 lines):
    running install
    /opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
      warnings.warn(
    running build
    running build_py
    creating build
    creating build/lib.linux-x86_64-3.8
    creating build/lib.linux-x86_64-3.8/horovod
    copying horovod/__init__.py -> build/lib.linux-x86_64-3.8/horovod
    creating build/lib.linux-x86_64-3.8/horovod/spark
    copying horovod/spark/__init__.py -> build/lib.linux-x86_64-3.8/horovod/spark
    copying horovod/spark/mpi_run.py -> build/lib.linux-x86_64-3.8/horovod/spark
    copying horovod/spark/runner.py -> build/lib.linux-x86_64-3.8/horovod/spark
    copying horovod/spark/gloo_run.py -> build/lib.linux-x86_64-3.8/horovod/spark
    copying horovod/spark/conf.py -> build/lib.linux-x86_64-3.8/horovod/spark
........ (skip copying)
    copying horovod/runner/common/util/safe_shell_exec.py -> build/lib.linux-x86_64-3.8/horovod/runner/common/util
    copying horovod/runner/common/util/settings.py -> build/lib.linux-x86_64-3.8/horovod/runner/common/util
    copying horovod/runner/common/util/secret.py -> build/lib.linux-x86_64-3.8/horovod/runner/common/util
    creating build/lib.linux-x86_64-3.8/horovod/torch/elastic
    copying horovod/torch/elastic/__init__.py -> build/lib.linux-x86_64-3.8/horovod/torch/elastic
    copying horovod/torch/elastic/sampler.py -> build/lib.linux-x86_64-3.8/horovod/torch/elastic
    copying horovod/torch/elastic/state.py -> build/lib.linux-x86_64-3.8/horovod/torch/elastic
    creating build/lib.linux-x86_64-3.8/horovod/torch/mpi_lib
    copying horovod/torch/mpi_lib/__init__.py -> build/lib.linux-x86_64-3.8/horovod/torch/mpi_lib
    creating build/lib.linux-x86_64-3.8/horovod/torch/mpi_lib_impl
    copying horovod/torch/mpi_lib_impl/__init__.py -> build/lib.linux-x86_64-3.8/horovod/torch/mpi_lib_impl
    running build_ext
    Running CMake in build/temp.linux-x86_64-3.8/RelWithDebInfo:
    cmake /tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c -DCMAKE_BUILD_TYPE=RelWithDebInfo -DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/build/lib.linux-x86_64-3.8 -DPYTHON_EXECUTABLE:FILEPATH=/opt/conda/envs/ptca/bin/python
    cmake --build . --config RelWithDebInfo -- -j8 VERBOSE=1

    -- Could not find CCache. Consider installing CCache to speed up compilation.
    -- The CXX compiler identification is GNU 9.4.0
    -- Check for working CXX compiler: /usr/bin/c++
    -- Check for working CXX compiler: /usr/bin/c++ -- works
    -- Detecting CXX compiler ABI info
    -- Detecting CXX compiler ABI info - done
    -- Detecting CXX compile features
    -- Detecting CXX compile features - done
    -- Build architecture flags: -mf16c -mavx -mfma
    -- Using command /opt/conda/envs/ptca/bin/python
    -- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so (found version "3.1")
    -- Found MPI: TRUE (found version "3.1")
    -- Looking for a CUDA compiler
    -- Looking for a CUDA compiler - NOTFOUND
    -- Looking for a CUDA host compiler - /usr/bin/c++
    -- Could not find nvcc, please set CUDAToolkit_ROOT.
    -- Could NOT find NVTX (missing: NVTX_INCLUDE_DIR)
    -- The C compiler identification is GNU 9.4.0
    -- Check for working C compiler: /usr/bin/cc
    -- Check for working C compiler: /usr/bin/cc -- works
    -- Detecting C compiler ABI info
    -- Detecting C compiler ABI info - done
    -- Detecting C compile features
    -- Detecting C compile features - done
    -- Gloo build as STATIC library
    -- Found MPI_C: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so (found version "3.1")
    -- Found MPI: TRUE (found version "3.1")
    -- MPI include path: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/usr/lib/x86_64-linux-gnu/openmpi/include
    -- MPI libraries: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'tensorflow'
    -- Could NOT find Tensorflow (missing: Tensorflow_LIBRARIES) (Required is at least version "1.15.0")
    -- Found Pytorch: 1.12.0.dev20220505+rocm5.0 (found suitable version "1.12.0.dev20220505+rocm5.0", minimum required is "1.2.0")
    Successfully preprocessed all matching files.
    Total number of unsupported CUDA function calls: 0
    
    Total number of replaced kernel launches: 0
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    ModuleNotFoundError: No module named 'mxnet'
    -- Could NOT find Mxnet (missing: Mxnet_LIBRARIES) (Required is at least version "1.4.0")
    -- Gloo build as STATIC library
    -- MPI include path: /usr/lib/x86_64-linux-gnu/openmpi/include/openmpi/usr/lib/x86_64-linux-gnu/openmpi/include
    -- MPI libraries: /usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi_cxx.so/usr/lib/x86_64-linux-gnu/openmpi/lib/libmpi.so
    -- Configuring done
    CMake Error at horovod/torch/CMakeLists.txt:81 (add_library):
      Cannot find source file:
    
        /tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/horovod/torch/ready_event_hip.cc
    
      Tried extensions .c .C .c++ .cc .cpp .cxx .cu .m .M .mm .h .hh .h++ .hm
      .hpp .hxx .in .txx
    
    
    CMake Error at horovod/torch/CMakeLists.txt:81 (add_library):
      No SOURCES given to target: pytorch
   
    CMake Generate step failed.  Build files cannot be regenerated correctly.
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py", line 209, in <module>
        setup(name='horovod',
      File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/__init__.py", line 153, in setup
        return distutils.core.setup(**attrs)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/core.py", line 148, in setup
        dist.run_commands()
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/dist.py", line 966, in run_commands
        self.run_command(cmd)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/install.py", line 68, in run
        return orig.install.run(self)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/command/install.py", line 545, in run
        self.run_command('build')
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/command/build.py", line 135, in run
        self.run_command(cmd_name)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/cmd.py", line 313, in run_command
        self.distribution.run_command(command)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/dist.py", line 985, in run_command
        cmd_obj.run()
      File "/opt/conda/envs/ptca/lib/python3.8/site-packages/setuptools/command/build_ext.py", line 79, in run
        _build_ext.run(self)
      File "/opt/conda/envs/ptca/lib/python3.8/distutils/command/build_ext.py", line 340, in run
        self.build_extensions()
      File "/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py", line 144, in build_extensions
        subprocess.check_call(command, cwd=cmake_build_dir)
      File "/opt/conda/envs/ptca/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['cmake', '/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c', '-DCMAKE_BUILD_TYPE=RelWithDebInfo', '-DCMAKE_LIBRARY_OUTPUT_DIRECTORY_RELWITHDEBINFO=/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/build/lib.linux-x86_64-3.8', '-DPYTHON_EXECUTABLE:FILEPATH=/opt/conda/envs/ptca/bin/python']' returned non-zero exit status 1.
    ----------------------------------------
ERROR: Command errored out with exit status 1: /opt/conda/envs/ptca/bin/python -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py'"'"'; __file__='"'"'/tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-h7ut125g/install-record.txt --single-version-externally-managed --compile --install-headers /opt/conda/envs/ptca/include/python3.8/horovod Check the logs for full command output.
22.3

Error as above. I tried ROCM 5.0.1 and ROCM 5.1.1, and both failed.

Can you please take a look?

Thanks
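
For readers skimming a log like the one above: the line that matters is the CMake "Cannot find source file" error, which names horovod/torch/ready_event_hip.cc. A minimal sketch of pulling such paths out of captured build output follows. The helper name `missing_sources` is hypothetical, and the pattern only assumes the message layout seen in this log; other CMake versions may word it differently.

```python
import re
from typing import List

# Matches the first path printed after "Cannot find source file:".
# The layout is taken from the log in this issue, not from CMake documentation.
_MISSING_SOURCE = re.compile(r"Cannot find source file:\s+(\S+)")


def missing_sources(cmake_output: str) -> List[str]:
    """Return every path CMake reported as a missing source file, in order."""
    return _MISSING_SOURCE.findall(cmake_output)


sample = """\
    CMake Error at horovod/torch/CMakeLists.txt:81 (add_library):
      Cannot find source file:

        /tmp/pip-install-vd9tp1oy/horovod_8df28208658e45d2bcf35f4cc2a6010c/horovod/torch/ready_event_hip.cc

      Tried extensions .c .C .c++ .cc .cpp .cxx .cu .m .M .mm .h .hh .h++ .hm
"""

for path in missing_sources(sample):
    print(path)
```

Running `pip install -v` and feeding the captured output to a helper like this makes it easier to see which generated file the build expected but did not find.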

@xiaoyu-work xiaoyu-work added the bug label May 6, 2022
@xiaoyu-work
Author

Any update on this?

@maxhgerlach
Collaborator

Hi @xiaoyu-work, there's been an AMD-compatibility commit to Horovod master recently (via PR #3486), but I'm not sure if it covers your problem. I believe you also need to set the environment variable HOROVOD_GPU=ROCM to build against ROCm instead of CUDA (see https://github.com/horovod/horovod/blob/master/CMakeLists.txt#L204).

So you could give this one a shot: HOROVOD_GPU=ROCM pip install -v git+https://github.com/horovod/horovod.git@master

@xiaoyu-work
Author

Hi @maxhgerlach, Thanks for your reply. No, it doesn't work. I tried HOROVOD_GPU=ROCM pip install -v git+https://github.com/horovod/horovod.git@master but got the same error.

@maxhgerlach
Collaborator

Thanks for checking. Maybe the mechanism at

if (Pytorch_ROCM)
does not work?

@weihanmines, is the build with PyTorch and ROCm also something you are planning to look into?
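
As a side check from the Python side: the CMake log in the report prints the PyTorch version as 1.12.0.dev20220505+rocm5.0, so the ROCm build is visible in the version string itself. Below is a minimal sketch of reading that suffix. The helper name `rocm_version_from_torch` is hypothetical, the `+rocmX.Y` label is assumed from the log rather than from any specification, and this is not how Horovod's CMake sets Pytorch_ROCM.

```python
import re
from typing import Optional


def rocm_version_from_torch(version: str) -> Optional[str]:
    """Return the ROCm version encoded in a torch version string, or None."""
    match = re.search(r"\+rocm(\d+(?:\.\d+)*)", version)
    return match.group(1) if match else None


print(rocm_version_from_torch("1.12.0.dev20220505+rocm5.0"))
print(rocm_version_from_torch("1.12.0+cu113"))
```

In a live environment the string would come from `torch.__version__`; `torch.version.hip` is another signal and, as far as I know, is None on non-ROCm builds.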

@weihanmines
Contributor

Thanks for checking. Maybe the mechanism at

if (Pytorch_ROCM)

does not work?
@weihanmines, is the build with PyTorch and ROCm also something you are planning to look into?

Hi @maxhgerlach ,
Let me look into this issue. I will get back to you as soon as I can. I mainly work on TensorFlow. If I cannot solve this problem, I will bring this issue to our PyTorch team's attention. Thank you.

@weihanmines
Contributor

Hi @xiaoyu-work, would you mind sharing your commands for installation and envs?

@xiaoyu-work
Author

Hi @weihanmines, my env is a little bit complex, but you can repro this issue by:

  1. Pull the latest officially released ROCm 5.1.1 image: docker pull rocm/dev-ubuntu-20.04:5.1.1-complete
  2. Start a container: docker run -it rocm/dev-ubuntu-20.04:5.1.1-complete
  3. Install git: sudo apt update && sudo apt install git
  4. Install the nightly ROCm build of PyTorch: pip install -f https://download.pytorch.org/whl/nightly/rocm5.1.1/torch_nightly.html --pre torch==1.13.0.dev20220609
  5. Try to install Horovod: HOROVOD_GPU=ROCM pip install -v git+https://github.com/horovod/horovod.git@master

@maxhgerlach
Collaborator

maxhgerlach commented Jun 29, 2022

@xiaoyu-work, there's a proposed fix in PR #3588. If you like, you could test if it fixes your problem. I think

HOROVOD_GPU=ROCM pip install -v git+https://github.com/horovod/horovod.git@refs/pull/3588/head

should work to install Horovod from that branch.

@ronakmal
Contributor

One thing to note: I think it's also necessary to define HOROVOD_GPU_OPERATIONS=NCCL in addition to HOROVOD_GPU, otherwise the FindROCM.cmake file won't be included.
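
Putting the suggestions in this thread together, the install needs two environment variables plus a pip command pointing at the right git ref. A minimal sketch that only assembles and prints that command line is below; the function name `rocm_install_plan` and its `ref` parameter are hypothetical, and nothing is executed.

```python
from typing import Dict, List, Tuple

HOROVOD_REPO = "git+https://github.com/horovod/horovod.git"


def rocm_install_plan(ref: str = "master") -> Tuple[Dict[str, str], List[str]]:
    """Return (extra environment variables, pip command) without running anything."""
    env = {
        "HOROVOD_GPU": "ROCM",             # build for ROCm instead of CUDA
        "HOROVOD_GPU_OPERATIONS": "NCCL",  # suggested in this thread so FindROCM.cmake is included
    }
    cmd = ["pip", "install", "-v", f"{HOROVOD_REPO}@{ref}"]
    return env, cmd


env, cmd = rocm_install_plan("refs/pull/3588/head")
# Print the equivalent one-line shell invocation.
print(" ".join(f"{key}={value}" for key, value in env.items()), " ".join(cmd))
```

To actually run the install from Python, one option is to pass `cmd` to `subprocess.run` with `env={**os.environ, **env}`; from a shell, the printed line can be pasted as-is.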

@xiaoyu-work
Author

@maxhgerlach @ronakmal Thanks for the help! That PR works on ROCm 5.0+!

@maxhgerlach
Collaborator

Great, thanks for confirming that!
