NCCL all_reduce performance test on 2 nodes with 10Gbps bandwidth has not any improvements after fastsocket plugin enabled #2

Closed
luoguohao opened this issue Mar 30, 2022 · 1 comment

Comments


luoguohao commented Mar 30, 2022

Environment

  • Linux kernel version: 4.18.0-193.el8.x86_64, CentOS 8
  • network bandwidth: 10Gbps
  • GPUs: two nodes, four P40s per node
  • test suite: nccl-tests

Command

mpirun --allow-run-as-root -np 8 \
       --hostfile centos8-hostfile \
       --mca orte_base_help_aggregate 0 \
       --mca btl tcp,vader,self \
       --mca plm_rsh_args "-p 8022" \
       --mca btl_tcp_if_include eth0 \
       -bind-to none -oversubscribe \
       --map-by slot \
       -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH \
       -x NCCL_SOCKET_IFNAME=eth0 \
       -x NCCL_IB_DISABLE=1 \
      nccl-tests/build/all_reduce_perf -b 8 -e 1024M -f 5 -g 1 -o all -n 500 -w 10

Performance with FastSocket plugin enabled

# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 5(factor) warmup iters: 10 iters: 500 validation: 1
#
# Using devices
#   Rank  0 Pid    418 on ml-gpu-ser423 device  0 [0x02] Tesla P40
#   Rank  1 Pid    419 on ml-gpu-ser423 device  1 [0x03] Tesla P40
#   Rank  2 Pid    420 on ml-gpu-ser423 device  2 [0x83] Tesla P40
#   Rank  3 Pid    421 on ml-gpu-ser423 device  3 [0x84] Tesla P40
#   Rank  4 Pid    488 on ml-gpu-ser604 device  0 [0x02] Tesla P40
#   Rank  5 Pid    489 on ml-gpu-ser604 device  1 [0x03] Tesla P40
#   Rank  6 Pid    490 on ml-gpu-ser604 device  2 [0x83] Tesla P40
#   Rank  7 Pid    491 on ml-gpu-ser604 device  3 [0x84] Tesla P40

#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     avg    77.44    0.00    0.00  9e-10    77.03    0.00    0.00  9e-10
          40            10     float     avg    77.25    0.00    0.00  9e-10    77.08    0.00    0.00  9e-10
         200            50     float     avg    78.30    0.00    0.00  9e-10    78.19    0.00    0.00  9e-10
        1000           250     float     avg    87.68    0.01    0.02  3e-08    87.59    0.01    0.02  3e-08
        5000          1250     float     avg    108.3    0.05    0.08  3e-08    108.4    0.05    0.08  3e-08
       25000          6250     float     avg    261.9    0.10    0.17  3e-08    271.6    0.09    0.16  3e-08
      125000         31250     float     avg    411.9    0.30    0.53  3e-08    420.6    0.30    0.52  3e-08
      625000        156250     float     avg    999.8    0.63    1.09  3e-08    977.8    0.64    1.12  3e-08
     3125000        781250     float     avg   4749.9    0.66    1.15  3e-08   4835.7    0.65    1.13  3e-08
    15625000       3906250     float     avg    15131    1.03    1.81  3e-08    15210    1.03    1.80  3e-08
    78125000      19531250     float     avg    71686    1.09    1.91  3e-08    71619    1.09    1.91  3e-08
   390625000      97656250     float     avg   336844    1.16    2.03  3e-08   337039    1.16    2.03  3e-08

Performance with FastSocket plugin disabled

# nThread 1 nGpus 1 minBytes 8 maxBytes 1073741824 step: 5(factor) warmup iters: 10 iters: 500 validation: 1
#
# Using devices
#   Rank  0 Pid    418 on ml-gpu-ser423 device  0 [0x02] Tesla P40
#   Rank  1 Pid    419 on ml-gpu-ser423 device  1 [0x03] Tesla P40
#   Rank  2 Pid    420 on ml-gpu-ser423 device  2 [0x83] Tesla P40
#   Rank  3 Pid    421 on ml-gpu-ser423 device  3 [0x84] Tesla P40
#   Rank  4 Pid    488 on ml-gpu-ser604 device  0 [0x02] Tesla P40
#   Rank  5 Pid    489 on ml-gpu-ser604 device  1 [0x03] Tesla P40
#   Rank  6 Pid    490 on ml-gpu-ser604 device  2 [0x83] Tesla P40
#   Rank  7 Pid    491 on ml-gpu-ser604 device  3 [0x84] Tesla P40


#                                                       out-of-place                       in-place
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
           8             2     float     avg    149.6    0.00    0.00  9e-10    138.9    0.00    0.00  9e-10
          40            10     float     avg    138.0    0.00    0.00  9e-10    102.3    0.00    0.00  9e-10
         200            50     float     avg    96.76    0.00    0.00  9e-10    96.16    0.00    0.00  9e-10
        1000           250     float     avg    82.18    0.01    0.02  3e-08    82.30    0.01    0.02  3e-08
        5000          1250     float     avg    103.8    0.05    0.08  3e-08    102.4    0.05    0.09  3e-08
       25000          6250     float     avg    225.3    0.11    0.19  3e-08    225.4    0.11    0.19  3e-08
      125000         31250     float     avg    346.6    0.36    0.63  3e-08    345.5    0.36    0.63  3e-08
      625000        156250     float     avg    961.4    0.65    1.14  3e-08    968.0    0.65    1.13  3e-08
     3125000        781250     float     avg   4677.0    0.67    1.17  3e-08   4684.6    0.67    1.17  3e-08
    15625000       3906250     float     avg    13943    1.12    1.96  3e-08    13941    1.12    1.96  3e-08
    78125000      19531250     float     avg    68384    1.14    2.00  3e-08    68389    1.14    2.00  3e-08
   390625000      97656250     float     avg   333850    1.17    2.05  3e-08   333890    1.17    2.05  3e-08
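As a sanity check on the two tables above, the algbw and busbw columns can be recomputed from the size and time columns. The sketch below assumes the formulas described in the nccl-tests performance notes (algbw = size / time, and for all_reduce busbw = algbw * 2 * (n - 1) / n with n ranks); the function names are hypothetical and only the largest message size from each run is used.

```python
# Recompute nccl-tests all_reduce bandwidth figures from size and time.
# Assumed formulas (from the nccl-tests performance notes):
#   algbw = size / time
#   busbw = algbw * 2 * (n - 1) / n   for all_reduce across n ranks

def algbw_gb_per_s(size_bytes: int, time_us: float) -> float:
    """Algorithm bandwidth in GB/s, with 1 GB = 1e9 bytes."""
    return size_bytes / (time_us * 1e-6) / 1e9

def busbw_gb_per_s(size_bytes: int, time_us: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s for all_reduce across n_ranks ranks."""
    return algbw_gb_per_s(size_bytes, time_us) * 2 * (n_ranks - 1) / n_ranks

N_RANKS = 8            # two nodes, four GPUs each
SIZE = 390_625_000     # largest message size in the tables, in bytes

# Out-of-place times (us) for the largest message size in each run.
runs = {"plugin enabled": 336844, "plugin disabled": 333850}

for label, time_us in runs.items():
    algbw = algbw_gb_per_s(SIZE, time_us)
    busbw = busbw_gb_per_s(SIZE, time_us, N_RANKS)
    print(f"{label}: algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

Both runs come out within about 1% of each other at this size, which matches the reported columns.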

Does anyone have any suggestions? Am I running the right performance tests?

luoguohao changed the title from "NCCL all_reduce performance test on 2 nodes with 10Gbps bandwidth has no improvements at all after fastsocket plugin enabled" to "NCCL all_reduce performance test on 2 nodes with 10Gbps bandwidth has not any improvements after fastsocket plugin enabled" on Mar 30, 2022
changlan (Collaborator) commented

The busbw in your test is about 2GB/s, which is already saturating the 10Gbps NIC bandwidth. I would recommend using 100GbE networks for more significant improvements.
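For a rough sense of scale, a 10Gbps link carries at most 1.25 GB/s in each direction before any protocol overhead. The sketch below is only unit arithmetic with hypothetical helper names; it compares the roughly 1.17 GB/s algbw reported for the largest message against that raw rate. How busbw maps onto NIC traffic depends on the algorithm NCCL selects, which the logs in this issue do not show, so treat the result as an approximation.

```python
# Unit conversion: compare a measured bandwidth (GB/s) to a NIC line rate (Gbit/s).
# Assumes 1 GB = 1e9 bytes and ignores Ethernet/TCP/IP protocol overhead.

def line_rate_gb_per_s(link_gbit_per_s: float) -> float:
    """Raw one-direction line rate in GB/s for a link given in Gbit/s."""
    return link_gbit_per_s / 8

def fraction_of_line_rate(measured_gb_per_s: float, link_gbit_per_s: float) -> float:
    """Measured bandwidth as a fraction of the raw line rate."""
    return measured_gb_per_s / line_rate_gb_per_s(link_gbit_per_s)

LINK_GBIT = 10          # NIC speed stated in the issue
MEASURED_ALGBW = 1.17   # GB/s, largest message size, plugin disabled

print(f"line rate: {line_rate_gb_per_s(LINK_GBIT):.2f} GB/s")
print(f"algbw / line rate: {fraction_of_line_rate(MEASURED_ALGBW, LINK_GBIT):.1%}")
```

With so little headroom left on the link, a faster socket transport has almost nothing to recover, which is consistent with the near-identical numbers in the two runs.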
