nccl.h is not found or ncclUnhandledCudaError: Call to CUDA function failed #119

Closed
Fragile-azalea opened this issue Jun 10, 2022 · 9 comments

Comments

@Fragile-azalea
Contributor

Describe the bug
'nccl.h' file is not found or ncclUnhandledCudaError: Call to CUDA function failed

To Reproduce
Steps to reproduce the behavior:

  1. USE_NCCL=1 python setup.py install

Logs

running install
running bdist_egg
running egg_info
writing fastmoe.egg-info/PKG-INFO
writing dependency_links to fastmoe.egg-info/dependency_links.txt
writing top-level names to fastmoe.egg-info/top_level.txt
reading manifest file 'fastmoe.egg-info/SOURCES.txt'
adding license file 'LICENSE'
writing manifest file 'fastmoe.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
running build_ext
building 'fmoe_cuda' extension
Emitting ninja build file /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/build.ninja...
Compiling objects...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/7] c++ -MMD -MF /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /home/xinglinpan/miniconda3/envs/fmoe/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
FAILED: /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o 
c++ -MMD -MF /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o.d -pthread -B /home/xinglinpan/miniconda3/envs/fmoe/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14
cc1plus: warning: command line option ‘-Wstrict-prototypes’ is valid for C/ObjC but not for C++
In file included from /home/xinglinpan/fastmoe-master/cuda/global_exchange.h:1:0,
                 from /home/xinglinpan/fastmoe-master/cuda/global_exchange.cpp:1:
/home/xinglinpan/fastmoe-master/cuda/stream_manager.h:7:18: fatal error: nccl.h: No such file or directory
compilation terminated.
[2/7] /usr/local/cuda/bin/nvcc  -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/balancing.cu -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
FAILED: /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o 
/usr/local/cuda/bin/nvcc  -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/torch/csrc/api/include -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/TH -I/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/xinglinpan/miniconda3/envs/fmoe/include/python3.8 -c -c /home/xinglinpan/fastmoe-master/cuda/balancing.cu -o /home/xinglinpan/fastmoe-master/build/temp.linux-x86_64-3.8/cuda/balancing.o -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr --compiler-options ''"'"'-fPIC'"'"'' -DFMOE_USE_NCCL -DUSE_C10D_NCCL -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' '-DPYBIND11_BUILD_ABI="_cxxabi1011"' -DTORCH_EXTENSION_NAME=fmoe_cuda -DTORCH_EXTENSION_NAME=fmoe_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++14 -gencode=arch=compute_70,code=compute_70 -gencode=arch=compute_70,code=sm_70
In file included from /home/xinglinpan/fastmoe-master/cuda/balancing.cuh:1:0,
                 from /home/xinglinpan/fastmoe-master/cuda/balancing.cu:2:
/home/xinglinpan/fastmoe-master/cuda/stream_manager.h:7:18: fatal error: nccl.h: No such file or directory
compilation terminated.

Try to fix

  1. Download nccl_2.7.8-1+cuda10.2_x86_64
  2. Set environment variables as mentioned
  3. USE_NCCL=1 python setup.py install

     Installed /home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg
     Processing dependencies for fastmoe==1.0.0
     Finished processing dependencies for fastmoe==1.0.0

  4. cd test && pytest test_ddp.py
Traceback (most recent call last):
  File "/home/xinglinpan/fastmoe-master/tests/test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
    torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
Traceback (most recent call last):
  File "/home/xinglinpan/fastmoe-master/tests/test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
    torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.

Platform

Additional context

>>> torch.cuda.nccl.version()
2708

Could some necessary environment variables be getting lost when the test spawns its workers via subprocess.Popen?

env=env
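
For context, Popen(env=env) replaces the child's entire environment, so anything not copied into that dict from os.environ never reaches the worker. A minimal sketch of preserving the parent environment (the command line below is illustrative, not the test's actual invocation):

# Hedged sketch: Popen(env=...) replaces the child's whole environment,
# so start from a copy of os.environ and layer extra variables on top.
import os
import subprocess
import sys

env = os.environ.copy()        # keeps CUDA_HOME, LD_LIBRARY_PATH, NCCL_* etc.
env["NCCL_DEBUG"] = "INFO"     # extra debugging discussed later in this thread

proc = subprocess.Popen([sys.executable, "test_ddp.py"], env=env)
proc.wait()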

@laekov
Owner

laekov commented Jun 10, 2022

So, there are two issues. The first is that FastMoE cannot find NCCL, which you have addressed by installing NCCL. Then, PyTorch gets into trouble with its NCCL AllGather operator. You can first check whether your PyTorch distributed.all_gather works well in a mini-reproduction without FastMoE. Also, you can use the environment variable NCCL_DEBUG=INFO to get more information about the NCCL CUDA call error and see if there is anything useful.

@Fragile-azalea
Contributor Author

Fragile-azalea commented Jun 10, 2022

I checked whether PyTorch's distributed.all_gather works in a mini-reproduction without FastMoE.

code

import torch
import torch.distributed as dist


def train_model(rank, args):
    print(f"Running mini code on rank {rank}.")
    setup(rank, args.world_size)  # setup()/cleanup() are helpers defined elsewhere (see the sketch after the log)
    torch.manual_seed(7 + rank)
    torch.cuda.set_device(rank)
    tensor_list = [torch.zeros(4, dtype=torch.int64, device=rank) for _ in range(4)]
    tensor = torch.arange(4, dtype=torch.int64, device=rank) + 1 + 4 * rank
    print(tensor)
    dist.all_gather(tensor_list, tensor)  # gather every rank's tensor onto all ranks
    print(tensor_list)
    cleanup()

log

tensor([5, 6, 7, 8], device='cuda:1')
tensor([ 9, 10, 11, 12], device='cuda:2')
tensor([1, 2, 3, 4], device='cuda:0')
tensor([13, 14, 15, 16], device='cuda:3')
[tensor([1, 2, 3, 4], device='cuda:2'), tensor([5, 6, 7, 8], device='cuda:2'), tensor([ 9, 10, 11, 12], device='cuda:2'), tensor([13, 14, 15, 16], device='cuda:2')]
[tensor([1, 2, 3, 4], device='cuda:0'), tensor([5, 6, 7, 8], device='cuda:0'), tensor([ 9, 10, 11, 12], device='cuda:0'), tensor([13, 14, 15, 16], device='cuda:0')]
[tensor([1, 2, 3, 4], device='cuda:3'), tensor([5, 6, 7, 8], device='cuda:3'), tensor([ 9, 10, 11, 12], device='cuda:3'), tensor([13, 14, 15, 16], device='cuda:3')]
[tensor([1, 2, 3, 4], device='cuda:1'), tensor([5, 6, 7, 8], device='cuda:1'), tensor([ 9, 10, 11, 12], device='cuda:1'), tensor([13, 14, 15, 16], device='cuda:1')]
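
For reference, a minimal sketch of the setup()/cleanup() helpers and the launcher that the snippet above assumes; the rendezvous address/port and the mp.spawn driver are illustrative, not code from the issue.

# Hypothetical helpers and launcher for the mini-reproduction above;
# train_model is the function from the snippet, everything else is assumed.
import os
from types import SimpleNamespace

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def setup(rank, world_size):
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)


def cleanup():
    dist.destroy_process_group()


if __name__ == "__main__":
    args = SimpleNamespace(world_size=torch.cuda.device_count())
    mp.spawn(train_model, args=(args,), nprocs=args.world_size, join=True)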

@Fragile-azalea
Contributor Author

NCCL_DEBUG=INFO doesn't seem to have any effect here.

log

(fmoe) xinglinpan@gpu9:~/fastmoe-master/tests$ NCCL_DEBUG=INFO python test_ddp.py
Traceback (most recent call last):
  File "test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
    torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
Traceback (most recent call last):
  File "test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 137, in test_fmoe_linear
    torch.distributed.all_gather(weight_htoh4_array, moe.experts.htoh4.weight.data)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1921, in all_gather
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
    work = default_pg.allgather([tensor_list], [tensor])
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
Traceback (most recent call last):
  File "test_ddp.py", line 142, in <module>
    test_fmoe_linear_distributed(
  File "test_ddp.py", line 65, in test_fmoe_linear_distributed
    _run_distributed(
  File "test_ddp.py", line 52, in _run_distributed
    assert retc == 0
AssertionError

@laekov
Owner

laekov commented Jun 10, 2022

To see the NCCL debug info, you are supposed to add that environment variable to the env dict you mentioned above. Besides, you have to remove stdout=subprocess.PIPE from the Popen call.
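
Concretely, the change inside the test launcher might look roughly like the sketch below; the real _run_distributed in tests/test_ddp.py has a different signature and body, so this is a paraphrase, not fastmoe's actual code.

# Rough sketch of the two suggested changes in the launcher.
import os
import subprocess
import sys


def _run_distributed(script, env_overrides=None):
    env = os.environ.copy()
    env.update(env_overrides or {})
    env["NCCL_DEBUG"] = "INFO"          # added to the env dict passed to the workers
    # note: no stdout=subprocess.PIPE, so NCCL INFO lines reach the terminal
    proc = subprocess.Popen([sys.executable, script], env=env)
    return proc.wait()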

@Fragile-azalea
Contributor Author

After removing stdout=subprocess.PIPE and setting env['NCCL_DEBUG'] = 'INFO':

log

(fmoe) xinglinpan@gpu9:~/fastmoe-master/tests$  python test_ddp.py
Traceback (most recent call last):
  File "test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 345, in _test_fmoe_local_ddp
    model_ddp = LocalDDP(deepcopy(model),
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg/fmoe/distributed.py", line 80, in __init__
    self._sync_params()
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg/fmoe/distributed.py", line 100, in _sync_params
    torch.distributed.broadcast(coalesced, 0, group=comm)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1092, in broadcast
    group_src_rank = _get_group_rank(group, src)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 252, in _get_group_rank
    raise RuntimeError(f"The global rank {rank} is not part of the group {group}") from None
RuntimeError: The global rank 0 is not part of the group <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7f5976242ef0>
gpu9:213423:213423 [0] NCCL INFO Bootstrap : Using [0]ib0:10.0.0.19<0>
gpu9:213423:213423 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gpu9:213423:213423 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.0.0.19<0>
gpu9:213423:213423 [0] NCCL INFO Using network IB
NCCL version 2.7.8+cuda10.2
gpu9:213425:213425 [0] NCCL INFO Bootstrap : Using [0]ib0:10.0.0.19<0>
gpu9:213425:213425 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
gpu9:213425:213425 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/IB ; OOB ib0:10.0.0.19<0>
gpu9:213425:213425 [0] NCCL INFO Using network IB
gpu9:213423:213895 [0] NCCL INFO Channel 00/02 :    0   1
gpu9:213423:213895 [0] NCCL INFO Channel 01/02 :    0   1
gpu9:213425:213897 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
gpu9:213425:213897 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
gpu9:213425:213897 [0] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
gpu9:213423:213895 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
gpu9:213423:213895 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
gpu9:213423:213895 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
gpu9:213423:213895 [0] NCCL INFO Channel 00 : 0[3d000] -> 1[b1000] via direct shared memory
gpu9:213425:213897 [0] NCCL INFO Channel 00 : 1[b1000] -> 0[3d000] via direct shared memory
gpu9:213425:213897 [0] NCCL INFO Channel 01 : 1[b1000] -> 0[3d000] via direct shared memory
gpu9:213423:213895 [0] NCCL INFO Channel 01 : 0[3d000] -> 1[b1000] via direct shared memory
gpu9:213425:213897 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gpu9:213425:213897 [0] NCCL INFO comm 0x7f9198000e00 rank 1 nranks 2 cudaDev 0 busId b1000 - Init COMPLETE
gpu9:213423:213895 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
gpu9:213423:213895 [0] NCCL INFO comm 0x7f07e0000e00 rank 0 nranks 2 cudaDev 0 busId 3d000 - Init COMPLETE
gpu9:213423:213423 [0] NCCL INFO Launch mode Parallel
NCCL version 2.7.8+cuda10.2
Traceback (most recent call last):
  File "test_ddp.py", line 139, in <module>
    locals()[sys.argv[1]](**args)
  File "/home/xinglinpan/fastmoe-master/tests/test_numerical.py", line 345, in _test_fmoe_local_ddp
    model_ddp = LocalDDP(deepcopy(model),
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg/fmoe/distributed.py", line 80, in __init__
    self._sync_params()
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/fastmoe-1.0.0-py3.8-linux-x86_64.egg/fmoe/distributed.py", line 100, in _sync_params
    torch.distributed.broadcast(coalesced, 0, group=comm)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1092, in broadcast
    group_src_rank = _get_group_rank(group, src)
  File "/home/xinglinpan/miniconda3/envs/fmoe/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 252, in _get_group_rank
    raise RuntimeError(f"The global rank {rank} is not part of the group {group}") from None
RuntimeError: The global rank 0 is not part of the group <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7f616ca133f0>

@laekov
Owner

laekov commented Jun 10, 2022

How many GPUs do you have? The default test_fmoe_linear_distributed test at line 142 of test_ddp.py requires at least 4 available GPUs; otherwise the sub-processes are bound to a non-existent GPU.
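
One hedged way to make that requirement explicit, assuming pytest as the runner (the actual fastmoe tests may already guard this differently):

# Illustrative guard only; fastmoe's own tests may handle this differently.
import pytest
import torch

requires_4_gpus = pytest.mark.skipif(
    torch.cuda.device_count() < 4,
    reason="test_fmoe_linear_distributed needs at least 4 visible GPUs",
)


@requires_4_gpus
def test_fmoe_linear_distributed():
    ...  # distributed test body goes here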

@Fragile-azalea
Contributor Author

device_count = torch.cuda.device_count() returns device_count = 4, i.e. four GPUs are visible here.

@Fragile-azalea
Contributor Author

Fragile-azalea commented Jun 10, 2022

We also tried using Docker to work around the problem.
Our commands:

  1. docker pull pytorch/pytorch:1.9.0-cuda10.2-cudnn7-devel
  2. docker run -it --gpus all -v /home/xinglinpan/fastmoe-master:/fastmoe --ipc=host pytorch/pytorch:1.9.0-cuda10.2-cudnn7-devel /bin/bash
  3. pip install dm-tree ninja pytest
  4. rm -rf /fastmoe/build
  5. rm -rf /fastmoe/fastmoe.egg-info
  6. USE_NCCL=1 python setup.py install
  7. python demo.py
  8. python test_ddp.py

log

// demo.py
tensor([ 9, 10, 11, 12], device='cuda:2')
tensor([5, 6, 7, 8], device='cuda:1')
tensor([1, 2, 3, 4], device='cuda:0')
tensor([13, 14, 15, 16], device='cuda:3')
[tensor([1, 2, 3, 4], device='cuda:3'), tensor([5, 6, 7, 8], device='cuda:3'), tensor([ 9, 10, 11, 12], device='cuda:3'), tensor([13, 14, 15, 16], device='cuda:3')]
[tensor([1, 2, 3, 4], device='cuda:1'), tensor([5, 6, 7, 8], device='cuda:1'), tensor([ 9, 10, 11, 12], device='cuda:1'), tensor([13, 14, 15, 16], device='cuda:1')]
[tensor([1, 2, 3, 4], device='cuda:0'), tensor([5, 6, 7, 8], device='cuda:0'), tensor([ 9, 10, 11, 12], device='cuda:0'), tensor([13, 14, 15, 16], device='cuda:0')]
[tensor([1, 2, 3, 4], device='cuda:2'), tensor([5, 6, 7, 8], device='cuda:2'), tensor([ 9, 10, 11, 12], device='cuda:2'), tensor([13, 14, 15, 16], device='cuda:2')]
// test_ddp.py
4
44a3b6d368a5:100:100 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.3<0>
44a3b6d368a5:100:100 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

44a3b6d368a5:100:100 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
44a3b6d368a5:100:100 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
44a3b6d368a5:100:100 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
Traceback (most recent call last):
  File "test_ddp.py", line 140, in <module>
    locals()[sys.argv[1]](**args)
  File "/fastmoe/tests/test_numerical.py", line 346, in _test_fmoe_local_ddp
    mp_group=mp_group, dp_group=dp_group, world_group=world_group)
  File "/opt/conda/lib/python3.7/site-packages/fastmoe-1.0.0-py3.7-linux-x86_64.egg/fmoe/distributed.py", line 80, in __init__
    self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/fastmoe-1.0.0-py3.7-linux-x86_64.egg/fmoe/distributed.py", line 100, in _sync_params
    torch.distributed.broadcast(coalesced, 0, group=comm)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1078, in broadcast
    group_src_rank = _get_group_rank(group, src)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 250, in _get_group_rank
    raise RuntimeError(f"The global rank {rank} is not part of the group {group}") from None
RuntimeError: The global rank 0 is not part of the group <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7f2b887d5e70>
44a3b6d368a5:102:102 [0] NCCL INFO Bootstrap : Using [0]eth0:172.17.0.3<0>
44a3b6d368a5:102:102 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

44a3b6d368a5:102:102 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
44a3b6d368a5:102:102 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.3<0>
44a3b6d368a5:102:102 [0] NCCL INFO Using network Socket
44a3b6d368a5:100:146 [0] NCCL INFO Channel 00/02 :    0   1
44a3b6d368a5:102:147 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
44a3b6d368a5:100:146 [0] NCCL INFO Channel 01/02 :    0   1
44a3b6d368a5:102:147 [0] NCCL INFO Trees [0] -1/-1/-1->1->0|0->1->-1/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
44a3b6d368a5:102:147 [0] NCCL INFO Setting affinity for GPU 2 to ffff,f00000ff,fff00000
44a3b6d368a5:100:146 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 8/8/64
44a3b6d368a5:100:146 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->-1|-1->0->1/-1/-1
44a3b6d368a5:100:146 [0] NCCL INFO Setting affinity for GPU 0 to 0fffff00,000fffff
44a3b6d368a5:102:147 [0] NCCL INFO Channel 00 : 1[b1000] -> 0[3d000] via direct shared memory
44a3b6d368a5:100:146 [0] NCCL INFO Channel 00 : 0[3d000] -> 1[b1000] via direct shared memory
44a3b6d368a5:100:146 [0] NCCL INFO Channel 01 : 0[3d000] -> 1[b1000] via direct shared memory
44a3b6d368a5:102:147 [0] NCCL INFO Channel 01 : 1[b1000] -> 0[3d000] via direct shared memory
44a3b6d368a5:100:146 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
44a3b6d368a5:100:146 [0] NCCL INFO comm 0x7fbb78001060 rank 0 nranks 2 cudaDev 0 busId 3d000 - Init COMPLETE
44a3b6d368a5:102:147 [0] NCCL INFO 2 coll channels, 2 p2p channels, 2 p2p channels per peer
44a3b6d368a5:100:100 [0] NCCL INFO Launch mode Parallel
44a3b6d368a5:102:147 [0] NCCL INFO comm 0x7f6c68001060 rank 1 nranks 2 cudaDev 0 busId b1000 - Init COMPLETE
Traceback (most recent call last):
  File "test_ddp.py", line 140, in <module>
    locals()[sys.argv[1]](**args)
  File "/fastmoe/tests/test_numerical.py", line 346, in _test_fmoe_local_ddp
    mp_group=mp_group, dp_group=dp_group, world_group=world_group)
  File "/opt/conda/lib/python3.7/site-packages/fastmoe-1.0.0-py3.7-linux-x86_64.egg/fmoe/distributed.py", line 80, in __init__
    self._sync_params()
  File "/opt/conda/lib/python3.7/site-packages/fastmoe-1.0.0-py3.7-linux-x86_64.egg/fmoe/distributed.py", line 100, in _sync_params
    torch.distributed.broadcast(coalesced, 0, group=comm)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1078, in broadcast
    group_src_rank = _get_group_rank(group, src)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 250, in _get_group_rank
    raise RuntimeError(f"The global rank {rank} is not part of the group {group}") from None
RuntimeError: The global rank 0 is not part of the group <torch._C._distributed_c10d.ProcessGroupNCCL object at 0x7ff9018b9f70>
NCCL version 2.7.8+cuda10.2

@laekov
Owner

laekov commented Jun 10, 2022

I finally begin to understand the issue.

We updated the distributed parameter initialization to broadcast parameters at https://github.com/laekov/fastmoe/blob/master/fmoe/distributed.py#L100, which is not correct. In PyTorch's distributed module, you are supposed to pass a global rank to the broadcast function, and it maps the global rank to the group-local rank itself (a very stupid design, in my view). So, when there are multiple data-parallel groups, rank 0 is not a member of many of the other comms, which raises this error.
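
For illustration, a minimal sketch of the distinction: torch.distributed.broadcast expects the source's global rank even when a sub-group is passed, so the group-local root has to be translated back first. The helper name and the ranks list below are assumptions, not the actual fastmoe fix.

# Illustrative only: translate the group-local root (rank 0 within the group)
# to its global rank before handing it to broadcast().
import torch
import torch.distributed as dist


def broadcast_from_group_root(tensor, group, group_global_ranks):
    # group_global_ranks: the global ranks used to build `group`,
    # e.g. the list that was passed to dist.new_group(...) (assumed known)
    src = group_global_ranks[0]      # global rank of the group's rank-0 process
    dist.broadcast(tensor, src=src, group=group)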

I will have that fixed later today.
