The output of nccl_all_to_all_scatter_async may be incomplete when num_local_experts>1. #172
Comments
Hi @Fragile-azalea, thanks for reporting this issue! I currently don't have 2 nodes, so I tried a 2-GPU, 1-node run instead of a 1-GPU, 2-node run, and I didn't see the missing-value phenomenon. Is the issue reproducible in the 2-GPU, 1-node setting? BTW, what is your PyTorch version?
Thank you for your quick response. I don't have a node with two GPUs. Here is the information about my platform:
To verify my idea, I performed an extra experiment on https://github.com/NVIDIA/nccl-tests/blob/master/src/alltoall.cu.
Run code with:
The output:
Then replace https://github.com/NVIDIA/nccl-tests/blob/8274cb47b6dc70ce4411e7f114b77173d3892414/src/alltoall.cu#L71-L76 so that each peer is addressed by more than one send-recv pair, as sketched below.
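The exact patch is not preserved in this thread. A minimal sketch of such a modification, assuming the variable names in scope at those lines of `alltoall.cu` (`sendbuff`, `recvbuff`, `count`, `type`, `nRanks`, `rankOffset`) and the `wordSize()` helper from nccl-tests' `common.h`, could look like this:

```cpp
// Hypothetical replacement for alltoall.cu L71-L76 (a sketch, not the
// exact patch used here): split each peer's slot into two halves and
// issue two send-recv pairs per peer inside a single NCCL group,
// mimicking what Tutel does when num_local_experts = 2.
size_t half = count / 2;                   // assumes count is even
size_t halfBytes = half * wordSize(type);  // byte offset of the 2nd half
NCCLCHECK(ncclGroupStart());
for (int r = 0; r < nRanks; r++) {
  char* sptr = ((char*)sendbuff) + r * rankOffset;
  char* rptr = ((char*)recvbuff) + r * rankOffset;
  NCCLCHECK(ncclSend(sptr, half, type, r, comm, stream));
  NCCLCHECK(ncclRecv(rptr, half, type, r, comm, stream));
  // A second pair addressed to the same peer: on a buggy NCCL build,
  // this pair could shadow the first one, leaving half of each slot
  // unwritten.
  NCCLCHECK(ncclSend(sptr + halfBytes, half, type, r, comm, stream));
  NCCLCHECK(ncclRecv(rptr + halfBytes, half, type, r, comm, stream));
}
NCCLCHECK(ncclGroupEnd());
```

This is consistent with the later observation in the thread that the modified test uses a smaller packet size and more P2P operations per all-to-all.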
Run code with:
The output:
I tried your modification on NCCL 2.7.8 and it caused nccl-tests to crash, but it works well on NCCL 2.10.3. Could you please try upgrading NCCL to 2.10.3?
I recompiled nccl-tests with the following command:
Could you please add NCCL_DEBUG=VERSION when running nccl-tests to check the actual NCCL version you're using? NCCL prints a line like `NCCL version 2.10.3+cuda10.2` at startup. Specifying NCCL_HOME at compile time may not change the library that gets loaded at runtime. BTW, it's strange that both the out-of-place and in-place results show no error; NCCL all-to-all should not support in-place operation.
Comparison from my side FYI.
Original nccl-tests:
Modified nccl-tests:
Overall all-to-all latency in the latter case is slightly higher due to the smaller packet size and the larger number of P2P operations.
After setting LD_LIBRARY_PATH=/xxx/nccl_2.10.3-1+cuda10.2_x86_64/lib, it works now.
Thanks @yzygitzh. This seems to be an old NCCL issue. I'll close this since it is solved by upgrading NCCL.
After setting LD_LIBRARY_PATH=/xxx/nccl_2.10.3-1+cuda10.2_x86_64/lib, Tutel also works now.
It's strange that the log contains both `NCCL version 2.7.8+cuda10.2` and `NCCL version 2.10.3+cuda10.2`, but it works well; presumably one component still reports the version it was built against while the 2.10.3 library is the one actually loaded at runtime.
Describe the bug
The output of nccl_all_to_all_scatter_async may be incomplete.
To Reproduce
Steps to reproduce the behavior:
On host0 (master): SKIP_EXPERT=1 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=host0 -m tutel.examples.helloworld --batch_size=4 --num_tokens=1 --model_dim=2 --hidden_size=2 --num_steps=1 --a2a_ffn_overlap_degree=1
On host1: SKIP_EXPERT=1 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=host0 -m tutel.examples.helloworld --batch_size=4 --num_tokens=1 --model_dim=2 --hidden_size=2 --num_steps=1 --a2a_ffn_overlap_degree=1
Log
The value at `tutel/tutel/impls/moe_layer.py`, line 244 (commit 2c0cad3):
tensor([[[ 1.5410, -0.2934],
[-1.0845, -1.3986]],
[[ 1.5410, -0.2934],
[ 0.4033, 0.8380]],
[[-2.1788, 0.5684],
[-1.0845, -1.3986]],
[[ 0.4033, 0.8380],
[-2.1788, 0.5684]]], device='cuda:0')
The value at `tutel/tutel/impls/moe_layer.py`, line 253 (commit 2c0cad3):
tensor([[[ 1.5410, -0.2934],
[-1.0845, -1.3986]],
[[ 1.5410, -0.2934],
[ 0.4033, 0.8380]],
[[-2.1788, 0.5684],
[-1.0845, -1.3986]],
[[ 0.4033, 0.8380],
[-2.1788, 0.5684]]], device='cuda:0')
This is the result I expect. However, when a2a_ffn_overlap_degree=2:
On host0 (master): SKIP_EXPERT=1 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=0 --master_addr=host0 -m tutel.examples.helloworld --batch_size=4 --num_tokens=1 --model_dim=2 --hidden_size=2 --num_steps=1 --a2a_ffn_overlap_degree=2
On host1: SKIP_EXPERT=1 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=2 --node_rank=1 --master_addr=host0 -m tutel.examples.helloworld --batch_size=4 --num_tokens=1 --model_dim=2 --hidden_size=2 --num_steps=1 --a2a_ffn_overlap_degree=2
The value at `tutel/tutel/impls/moe_layer.py`, line 244 (commit 2c0cad3):
tensor([[[ 1.5410, -0.2934],
[-1.0845, -1.3986]],
[[ 1.5410, -0.2934],
[ 0.4033, 0.8380]],
[[-2.1788, 0.5684],
[-1.0845, -1.3986]],
[[ 0.4033, 0.8380],
[-2.1788, 0.5684]]], device='cuda:0')
The value at `tutel/tutel/impls/moe_layer.py`, line 249 (commit 2c0cad3):
tensor([[[ 0.0000, 0.0000],
[ 0.0000, 0.0000]],
[[ 1.5410, -0.2934],
[ 0.4033, 0.8380]],
[[ 0.0000, 0.0000],
[ 0.0000, 0.0000]],
[[ 0.4033, 0.8380],
[-2.1788, 0.5684]]], device='cuda:0')
It seems incomplete: two of the four slots are all zeros instead of the expected values.
The code possibly responsible is `tutel/tutel/custom/custom_kernel.cpp`, lines 472 to 489 (commit 2c0cad3).
It looks like the NCCL group keeps only the last send-recv pair for each peer. The problem does not occur when num_local_experts=1; see the sketch below for the pattern involved.
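For context, here is a minimal sketch of the communication pattern involved. The names (`all_to_all_scatter_sketch`, `send_buf`, `recv_buf`, `chunk_bytes`, `world_size`, `num_local_experts`) are placeholders for illustration, not Tutel's actual custom_kernel.cpp code, and error checking is omitted:

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

// Illustrative sketch: with num_local_experts > 1, each peer rank is the
// target of several send-recv pairs inside one ncclGroupStart()/End()
// region -- the pattern the hypothesis above says old NCCL mishandles.
void all_to_all_scatter_sketch(char* send_buf, char* recv_buf,
                               size_t chunk_bytes, int world_size,
                               int num_local_experts,
                               ncclComm_t comm, cudaStream_t stream) {
  ncclGroupStart();
  for (int peer = 0; peer < world_size; ++peer) {
    for (int e = 0; e < num_local_experts; ++e) {
      size_t offset = (size_t)(peer * num_local_experts + e) * chunk_bytes;
      // If only the last pair per peer takes effect, every chunk except
      // e == num_local_experts - 1 is never written into recv_buf,
      // which would show up as the zero blocks in the log above.
      ncclSend(send_buf + offset, chunk_bytes, ncclInt8, peer, comm, stream);
      ncclRecv(recv_buf + offset, chunk_bytes, ncclInt8, peer, comm, stream);
    }
  }
  ncclGroupEnd();
}
```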