nccl.h is not found or ncclUnhandledCudaError: Call to CUDA function failed #119

Closed
Fragile-azalea opened this issue Jun 10, 2022 · 9 comments

Comments

@Fragile-azalea
Contributor

Describe the bug
'nccl.h' file is not found or ncclUnhandledCudaError: Call to CUDA function failed

To Reproduce
Steps to reproduce the behavior:

  1. USE_NCCL=1 python setup.py install

Logs

running install
running bdist_egg
running egg_info
writing fastmoe.egg-info/PKG-INFO
writing dependency_links to fastmoe.egg-info/dependency_links.txt
writing top-level names to fastmoe.egg-info/top_level.txt
reading manifest file 'fastmoe.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'fastmoe.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'fmoe_cuda' extension
Emitting ninja build file /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/7] c++ -MMD -MF /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /home/xinglinpan/miniconda3/envs/fmoe/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o 
c++ -MMD -MF /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /home/xinglinpan/miniconda3/envs/fmoe/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/xinglinpan/fastmoe-master/cuda/global_exchange.h:1:0,
                 from /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp:1:
/home/xinglinpan/fastmoe-master/cuda/stream_manager.h:7:18: fatal error: nccl.h: No such file or directory
compilation terminated.
[2/7] /usr/local/cuda/bin/nvcc  -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/balancing.cu -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
FAILED: /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o 
/usr/local/cuda/bin/nvcc  -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/balancing.cu -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
In file included from /home/xinglinpan/fastmoe-master/cuda/balancing.cuh:1:0,
                 from /home/xinglinpan/fastmoe-master/cuda/balancing.cu:2:
/home/xinglinpan/fastmoe-master/cuda/stream_manager.h:7:18: fatal error: nccl.h: No such file or directory
compilation terminated.

Try to fix

  1. Download nccl_2.7.8-1+cuda10.2_x86_64
  2. Set environment variables as mentioned
  3. USE_NCCL=1 python setup.py install

     Installed /home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg
     Processing dependencies for fastmoe==1.0.0
     Finished processing dependencies for fastmoe==1.0.0

  4. cd test && pytest test_ddp.py
Traceback (most recent call last):
  File "/home/xinglinpan/fastmoe-master/tests/test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
    torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
Traceback (most recent call last):
  File "/home/xinglinpan/fastmoe-master/tests/test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
    torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

Platform

Additional context

>>> torch.cuda.nccl.version()
2708

Could some necessary environment variables be getting lost when the test spawns its workers via subprocess.Popen?

env=env
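
For context, Popen(env=env) replaces the child's entire environment, so anything not copied into that dict from os.environ never reaches the worker. A minimal sketch of preserving the parent environment (the command line below is illustrative, not the test's actual invocation):

# Hedged sketch: Popen(env=...) replaces the child's whole environment,
# so start from a copy of os.environ and layer extra variables on top.
import os
import subprocess
import sys

env = os.environ.copy()        # keeps CUDA_HOME, LD_LIBRARY_PATH, NCCL_* etc.
env["NCCL_DEBUG"] = "INFO"     # extra debugging discussed later in this thread

proc = subprocess.Popen([sys.executable, "test_ddp.py"], env=env)
proc.wait()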

@laekov
Owner

laekov commented Jun 10, 2022

So, there are two issues. The first is that FastMoE cannot find NCCL, which you have addressed by installing NCCL. Then, PyTorch gets into trouble with its NCCL AllGather operator. You can first check whether your PyTorch distributed.all_gather works well in a mini-reproduction without FastMoE. Also, you can use the environment variable NCCL_DEBUG=INFO to get more information about the NCCL CUDA call error and see if there is anything useful.

@Fragile-azalea
Contributor Author

Fragile-azalea commented Jun 10, 2022

I checked whether PyTorch's distributed.all_gather works in a mini-reproduction without FastMoE.

code

import torch
import torch.distributed as dist


def train_model(rank, args):
    print(f"Running mini code on rank {rank}.")
    setup(rank, args.world_size)  # setup()/cleanup() are helpers defined elsewhere (see the sketch after the log)
    torch.manual_seed(7 + rank)
    torch.cuda.set_device(rank)
    tensor_list = [torch.zeros(4, dtype=torch.int64, device=rank) for _ in range(4)]
    tensor = torch.arange(4, dtype=torch.int64, device=rank) + 1 + 4 * rank
    print(tensor)
    dist.all_gather(tensor_list, tensor)  # gather every rank's tensor onto all ranks
    print(tensor_list)
    cleanup()

log

tensor([5, 6, 7, 8], device='cuda:1')
tensor([ 9, 10, 11, 12], device='cuda:2')
tensor([1, 2, 3, 4], device='cuda:0')
tensor([13, 14, 15, 16], device='cuda:3')
[tensor([1, 2, 3, 4], device='cuda:2'), tensor([5, 6, 7, 8], device='cuda:2'), tensor([ 9, 10, 11, 12], device='cuda:2'), tensor([13, 14, 15, 16], device='cuda:2')]
[tensor([1, 2, 3, 4], device='cuda:0'), tensor([5, 6, 7, 8], device='cuda:0'), tensor([ 9, 10, 11, 12], device='cuda:0'), tensor([13, 14, 15, 16], device='cuda:0')]
[tensor([1, 2, 3, 4], device='cuda:3'), tensor([5, 6, 7, 8], device='cuda:3'), tensor([ 9, 10, 11, 12], device='cuda:3'), tensor([13, 14, 15, 16], device='cuda:3')]
[tensor([1, 2, 3, 4], device='cuda:1'), tensor([5, 6, 7, 8], device='cuda:1'), tensor([ 9, 10, 11, 12], device='cuda:1'), tensor([13, 14, 15, 16], device='cuda:1')]
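
For reference, a minimal sketch of the setup()/cleanup() helpers and the launcher that the snippet above assumes; the rendezvous address/port and the mp.spawn driver are illustrative, not code from the issue.

# Hypothetical helpers and launcher for the mini-reproduction above;
# train_model is the function from the snippet, everything else is assumed.
import os
from types import SimpleNamespace

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def setup(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)


def cleanup():
    dist.destroy_process_group()


if __name__ == "__main__":
    args = SimpleNamespace(world_size=torch.cuda.device_count())
    mp.spawn(train_model, args=(args,), nprocs=args.world_size, join=True)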

@Fragile-azalea
Contributor Author

NCCL_DEBUG=INFO doesn't seem to have any effect here.

log

(fmoe) xinglinpan@gpu9:~/fastmoe-master/tests$ NCCL_DEBUG=INFO python test_ddp.py
Traceback (most recent call last):
  File "test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
    torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
Traceback (most recent call last):
  File "test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
    torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
Traceback (most recent call last):
  File "test_ddp.py", line 142, in <module>
    test_fmoe_linear_distributed(
  File "test_ddp.py", line 65, in test_fmoe_linear_distributed
    _run_distributed(
  File "test_ddp.py", line 52, in _run_distributed
    assert retc == 0
AssertionError

@laekov
Owner

laekov commented Jun 10, 2022

To see the NCCL debug info, you are supposed to add that environment variable to the env dict you mentioned above. Besides, you have to remove stdout=subprocess.PIPE from the Popen call.
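
Concretely, the change inside the test launcher might look roughly like the sketch below; the real _run_distributed in tests/test_ddp.py has a different signature and body, so this is a paraphrase, not fastmoe's actual code.

# Rough sketch of the two suggested changes in the launcher.
import os
import subprocess
import sys


def _run_distributed(script, env_overrides=None):
    env = os.environ.copy()
    env.update(env_overrides or {})
    env["NCCL_DEBUG"] = "INFO"          # added to the env dict passed to the workers
    # note: no stdout=subprocess.PIPE, so NCCL INFO lines reach the terminal
    proc = subprocess.Popen([sys.executable, script], env=env)
    return proc.wait()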

@Fragile-azalea
Contributor Author

After removing stdout=subprocess.PIPE and setting env['NCCL_DEBUG'] = 'INFO':

log

(fmoe) xinglinpan@gpu9:~/fastmoe-master/tests$  python test_ddp.py
Traceback (most recent call last):
  File "test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 345, in _test_fmoe_local_ddp
    model_ddp = LocalDDP(deepcopy(model),
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg/fmoe/distributed.py", line 80, in __init__
    self._sync_params()
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg/fmoe/distributed.py", line 100, in _sync_params
    torch.distributed.broadcast(coalesced, 0, group=comm)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1092, in broadcast
    group_src_rank = _get_group_rank(group, src)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 252, in _get_group_rank
    raise RuntimeError(f"The global rank {rank} is not part of the group {group}") from None
RuntimeError: The global rank 0 is not part of the group <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7f5976242ef0>
gpu9:213423:213423 [0] NCCL INFO Bootstrap : Using [0]ib0:10.0.0.19<0>
gpu9:213423:213423 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gpu9:213423:213423 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.0.0.19<0>
gpu9:213423:213423 [0] NCCL INFO Using network IB
NCCL version 2.7.8+cuda10.2
gpu9:213425:213425 [0] NCCL INFO Bootstrap : Using [0]ib0:10.0.0.19<0>
gpu9:213425:213425 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gpu9:213425:213425 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.0.0.19<0>
gpu9:213425:213425 [0] NCCL INFO Using network IB
gpu9:213423:213895 [0] NCCL INFO Channel 00/02 :    0   1
gpu9:213423:213895 [0] NCCL INFO Channel 01/02 :    0   1
gpu9:213425:213897 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
gpu9:213425:213897 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
gpu9:213425:213897 [0] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
gpu9:213423:213895 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
gpu9:213423:213895 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
gpu9:213423:213895 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
gpu9:213423:213895 [0] NCCL INFO Channel 00 : 0[3d000] -> 1[b1000] via direct shared memory
gpu9:213425:213897 [0] NCCL INFO Channel 00 : 1[b1000] -> 0[3d000] via direct shared memory
gpu9:213425:213897 [0] NCCL INFO Channel 01 : 1[b1000] -> 0[3d000] via direct shared memory
gpu9:213423:213895 [0] NCCL INFO Channel 01 : 0[3d000] -> 1[b1000] via direct shared memory
gpu9:213425:213897 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gpu9:213425:213897 [0] NCCL INFO comm 0x7f9198000e00 rank 1 nranks 2 cudaDev 0 busId b1000 - Init COMPLETE
gpu9:213423:213895 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gpu9:213423:213895 [0] NCCL INFO comm 0x7f07e0000e00 rank 0 nranks 2 cudaDev 0 busId 3d000 - Init COMPLETE
gpu9:213423:213423 [0] NCCL INFO Launch mode Parallel
NCCL version 2.7.8+cuda10.2
Traceback (most recent call last):
  File "test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 345, in _test_fmoe_local_ddp
    model_ddp = LocalDDP(deepcopy(model),
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg/fmoe/distributed.py", line 80, in __init__
    self._sync_params()
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg/fmoe/distributed.py", line 100, in _sync_params
    torch.distributed.broadcast(coalesced, 0, group=comm)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1092, in broadcast
    group_src_rank = _get_group_rank(group, src)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 252, in _get_group_rank
    raise RuntimeError(f"The global rank {rank} is not part of the group {group}") from None
RuntimeError: The global rank 0 is not part of the group <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7f616ca133f0>

@laekov
Owner

laekov commented Jun 10, 2022

How many GPUs do you have? The default test_fmoe_linear_distributed test at line 142 of test_ddp.py requires at least 4 available GPUs; otherwise the sub-processes are bound to a non-existent GPU.
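
One hedged way to make that requirement explicit, assuming pytest as the runner (the actual fastmoe tests may already guard this differently):

# Illustrative guard only; fastmoe's own tests may handle this differently.
import pytest
import torch

requires_4_gpus = pytest.mark.skipif(
    torch.cuda.device_count() < 4,
    reason="test_fmoe_linear_distributed needs at least 4 visible GPUs",
)


@requires_4_gpus
def test_fmoe_linear_distributed():
    ...  # distributed test body goes here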

@Fragile-azalea
Contributor Author

device_count = torch.cuda.device_count() returns device_count = 4, i.e. four GPUs are visible here.

@Fragile-azalea
Contributor Author

Fragile-azalea commented Jun 10, 2022

We also tried using Docker to work around the problem.
Our commands:

  1. docker pull pytorch/pytorch:1.9.0-cuda10.2-cudnn7-devel
  2. docker run -it --gpus all -v /home/xinglinpan/fastmoe-master:/fastmoe --ipc=host pytorch/pytorch:1.9.0-cuda10.2-cudnn7-devel /bin/bash
  3. pip install dm-tree ninja pytest
  4. rm -rf /fastmoe/build
  5. rm -rf /fastmoe/fastmoe.egg-info
  6. USE_NCCL=1 python setup.py install
  7. python demo.py
  8. python test_ddp.py

log

// demo.py
tensor([ 9, 10, 11, 12], device='cuda:2')
tensor([5, 6, 7, 8], device='cuda:1')
tensor([1, 2, 3, 4], device='cuda:0')
tensor([13, 14, 15, 16], device='cuda:3')
[tensor([1, 2, 3, 4], device='cuda:3'), tensor([5, 6, 7, 8], device='cuda:3'), tensor([ 9, 10, 11, 12], device='cuda:3'), tensor([13, 14, 15, 16], device='cuda:3')]
[tensor([1, 2, 3, 4], device='cuda:1'), tensor([5, 6, 7, 8], device='cuda:1'), tensor([ 9, 10, 11, 12], device='cuda:1'), tensor([13, 14, 15, 16], device='cuda:1')]
[tensor([1, 2, 3, 4], device='cuda:0'), tensor([5, 6, 7, 8], device='cuda:0'), tensor([ 9, 10, 11, 12], device='cuda:0'), tensor([13, 14, 15, 16], device='cuda:0')]
[tensor([1, 2, 3, 4], device='cuda:2'), tensor([5, 6, 7, 8], device='cuda:2'), tensor([ 9, 10, 11, 12], device='cuda:2'), tensor([13, 14, 15, 16], device='cuda:2')]
// test_ddp.py
4
44a3b6d368a5:100:100 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.3<0>
44a3b6d368a5:100:100 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

44a3b6d368a5:100:100 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
44a3b6d368a5:100:100 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
44a3b6d368a5:100:100 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
Traceback (most recent call last):
  File "test_ddp.py", line 140, in <module>
    locals()[sys.argv[1]](**args)
  File "/fastmoe/tests/test_numerical.py", line 346, in _test_fmoe_local_ddp
    mp_group=mp_group, dp_group=dp_group, world_group=world_group)
  File "/opt/conda/lib/python3.7/site-packages/fastmoe-1.0.0-py3.7-linux-x86_64.egg/fmoe/distributed.py", line 80, in __init__
    self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/fastmoe-1.0.0-py3.7-linux-x86_64.egg/fmoe/distributed.py", line 100, in _sync_params
    torch.distributed.broadcast(coalesced, 0, group=comm)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1078, in broadcast
    group_src_rank = _get_group_rank(group, src)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 250, in _get_group_rank
    raise RuntimeError(f"The global rank {rank} is not part of the group {group}") from None
RuntimeError: The global rank 0 is not part of the group <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7f2b887d5e70>
44a3b6d368a5:102:102 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.3<0>
44a3b6d368a5:102:102 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

44a3b6d368a5:102:102 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
44a3b6d368a5:102:102 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
44a3b6d368a5:102:102 [0] NCCL INFO Using network Socket
44a3b6d368a5:100:146 [0] NCCL INFO Channel 00/02 :    0   1
44a3b6d368a5:102:147 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
44a3b6d368a5:100:146 [0] NCCL INFO Channel 01/02 :    0   1
44a3b6d368a5:102:147 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
44a3b6d368a5:102:147 [0] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
44a3b6d368a5:100:146 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
44a3b6d368a5:100:146 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
44a3b6d368a5:100:146 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
44a3b6d368a5:102:147 [0] NCCL INFO Channel 00 : 1[b1000] -> 0[3d000] via direct shared memory
44a3b6d368a5:100:146 [0] NCCL INFO Channel 00 : 0[3d000] -> 1[b1000] via direct shared memory
44a3b6d368a5:100:146 [0] NCCL INFO Channel 01 : 0[3d000] -> 1[b1000] via direct shared memory
44a3b6d368a5:102:147 [0] NCCL INFO Channel 01 : 1[b1000] -> 0[3d000] via direct shared memory
44a3b6d368a5:100:146 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
44a3b6d368a5:100:146 [0] NCCL INFO comm 0x7fbb78001060 rank 0 nranks 2 cudaDev 0 busId 3d000 - Init COMPLETE
44a3b6d368a5:102:147 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
44a3b6d368a5:100:100 [0] NCCL INFO Launch mode Parallel
44a3b6d368a5:102:147 [0] NCCL INFO comm 0x7f6c68001060 rank 1 nranks 2 cudaDev 0 busId b1000 - Init COMPLETE
Traceback (most recent call last):
  File "test_ddp.py", line 140, in <module>
    locals()[sys.argv[1]](**args)
  File "/fastmoe/tests/test_numerical.py", line 346, in _test_fmoe_local_ddp
    mp_group=mp_group, dp_group=dp_group, world_group=world_group)
  File "/opt/conda/lib/python3.7/site-packages/fastmoe-1.0.0-py3.7-linux-x86_64.egg/fmoe/distributed.py", line 80, in __init__
    self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/fastmoe-1.0.0-py3.7-linux-x86_64.egg/fmoe/distributed.py", line 100, in _sync_params
    torch.distributed.broadcast(coalesced, 0, group=comm)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1078, in broadcast
    group_src_rank = _get_group_rank(group, src)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 250, in _get_group_rank
    raise RuntimeError(f"The global rank {rank} is not part of the group {group}") from None
RuntimeError: The global rank 0 is not part of the group <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7ff9018b9f70>
NCCL version 2.7.8+cuda10.2

@laekov
Owner

laekov commented Jun 10, 2022

I finally begin to understand the issue.

We updated the distributed parameter initialization to broadcast parameters at https://github.com/laekov/fastmoe/blob/master/fmoe/distributed.py#L100, which is not correct. In PyTorch's distributed module, you are supposed to pass a global rank to the broadcast function, and it maps the global rank to the group-local rank itself (a very stupid design, in my view). So, when there are multiple data-parallel groups, rank 0 is not a member of many of the other comms, which raises this error.
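
For illustration, a minimal sketch of the distinction: torch.distributed.broadcast expects the source's global rank even when a sub-group is passed, so the group-local root has to be translated back first. The helper name and the ranks list below are assumptions, not the actual fastmoe fix.

# Illustrative only: translate the group-local root (rank 0 within the group)
# to its global rank before handing it to broadcast().
import torch
import torch.distributed as dist


def broadcast_from_group_root(tensor, group, group_global_ranks):
    # group_global_ranks: the global ranks used to build `group`,
    # e.g. the list that was passed to dist.new_group(...) (assumed known)
    src = group_global_ranks[0]      # global rank of the group's rank-0 process
    dist.broadcast(tensor, src=src, group=group)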

I will have that fixed later today.
