Installation error #67

Closed
youth123 opened this issue Jul 30, 2021 · 7 comments

@youth123

Describe the bug
Building fastmoe fails in the Docker images that Megatron provides:
https://ngc.nvidia.com/catalog/containers/nvidia:pytorch

To Reproduce
Steps to reproduce the behavior:
USE_NCCL=1 python setup.py install

Expected behavior
Installed successfully.

Logs
FAILED: /root/paddlejob/toyer_switch/fastmoe/build/temp.linux-x86_64-3.8/cuda/global_exchange.o
error: no matching function for call to ‘HackNCCLGroup::broadcastUniqueNCCLID(ncclUniqueId*)’
91 | broadcastUniqueNCCLID(&ncclID);
Platform

  • Device: NVIDIA V100
  • OS: Ubuntu
  • CUDA version: 11.1
  • NCCL version: 2.8.3
@youth123
Author

When I run git reset --hard b861e92, fastmoe installs successfully.
It fails at 9170835, so some commit between these two probably introduced the error.
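
One way to narrow down the offending commit is an automated bisect between the two commit ids above. The following is only a sketch: the helper name bisect_commands is hypothetical, and it assumes the build command exits with a non-zero status when the build fails, which is what git bisect run relies on. It prints the commands rather than running them.

```python
def bisect_commands(good, bad, build_cmd):
    """Return the git invocations for an automated bisect between two commits.

    `good` is a commit that builds, `bad` is one that does not, and
    `build_cmd` is a shell command whose exit status reports build success.
    """
    return [
        # Start bisecting: the bad commit comes first, then the good one.
        ["git", "bisect", "start", bad, good],
        # git marks a commit as good when the command exits with status 0.
        ["git", "bisect", "run", "sh", "-c", build_cmd],
        # Return the working tree to where it was before bisecting.
        ["git", "bisect", "reset"],
    ]


if __name__ == "__main__":
    # Commit ids and build command are the ones quoted in this thread.
    for cmd in bisect_commands(
        "b861e92", "9170835", "USE_NCCL=1 python setup.py install"
    ):
        # Printed only; run them by hand inside the fastmoe checkout.
        print(" ".join(cmd))
```

Stale files under build/ can carry over between commits, so cleaning the build directory as part of the build command may give more reliable results.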

@laekov
Owner

laekov commented Jul 30, 2021

I will look into this next week.

@laekov laekov self-assigned this Jul 30, 2021
@xptree xptree added the bug Something isn't working label Jul 30, 2021
@laekov
Owner

laekov commented Aug 2, 2021

What is your PyTorch version? @youth123

@youth123
Author

youth123 commented Aug 3, 2021

> What is your PyTorch version? @youth123

1.8.0

@laekov
Owner

laekov commented Aug 4, 2021

So, the difference is that we check PyTorch's version using its macros here, in order to stay compatible with older PyTorch versions.
Can you check whether your PyTorch header files define these macros correctly?

@youth123
Author

Sorry for the late reply. My PyTorch header files don't define TORCH_VERSION_MAJOR.

@laekov
Owner

laekov commented Aug 18, 2021

The macro is defined here in PyTorch 1.8.0. Its absence indicates that your PyTorch may have been installed incorrectly.
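
For anyone who wants to repeat this check, here is a minimal sketch that searches the installed PyTorch headers for a definition of TORCH_VERSION_MAJOR. The helper find_macro_definitions is hypothetical, not part of PyTorch or fastmoe. The script assumes that torch.utils.cpp_extension.include_paths() returns the header directories used when building extensions.

```python
import os
import re


def find_macro_definitions(include_dirs, macro):
    """Return sorted paths of header files under include_dirs that #define macro."""
    pattern = re.compile(
        r"^\s*#\s*define\s+" + re.escape(macro) + r"\b", re.MULTILINE
    )
    hits = set()  # a set, because the include directories may overlap
    for root_dir in include_dirs:
        for dirpath, _dirnames, filenames in os.walk(root_dir):
            for name in filenames:
                if not name.endswith((".h", ".hpp", ".cuh")):
                    continue
                path = os.path.join(dirpath, name)
                with open(path, encoding="utf-8", errors="ignore") as f:
                    if pattern.search(f.read()):
                        hits.add(path)
    return sorted(hits)


if __name__ == "__main__":
    try:
        import torch
        from torch.utils.cpp_extension import include_paths
    except ImportError:
        print("PyTorch is not importable in this environment")
    else:
        print("torch", torch.__version__)
        found = find_macro_definitions(include_paths(), "TORCH_VERSION_MAJOR")
        if found:
            print("\n".join(found))
        else:
            print("TORCH_VERSION_MAJOR not found in the headers")
```

If the script prints no header path while torch.__version__ reports 1.8.0, the Python package and the header files on disk are probably out of sync, which would be consistent with a broken or mixed installation.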

@laekov laekov closed this as completed Aug 26, 2021