# Peer-to-Peer (P2P) with 2x NVIDIA RTX A5000   

- cloud provider: [runpod.io](https://www.runpod.io/)
- pod: 2 x RTX A5000 (19 vCPU 100 GB RAM)
- image: `runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04`

### nvidia-smi


In [1]:
!nvidia-smi

Wed Mar 27 20:14:25 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA RTX A5000               On  | 00000000:56:00.0 Off |                  Off |
| 30%   28C    P8              24W / 230W |      1MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A5000               On  | 00000000:57:00.0 Off |  

In [2]:
!nvidia-smi topo -m

	GPU0	GPU1	NIC0	NIC1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PIX	SYS	SYS	0-23,48-71	0		N/A
GPU1	PIX	 X 	SYS	SYS	0-23,48-71	0		N/A
NIC0	SYS	SYS	 X 	PIX				
NIC1	SYS	SYS	PIX	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1



### Building [nccl-tests](https://github.com/NVIDIA/nccl-tests/tree/master)

In [1]:
!cd ../nccl-tests/ && make

make -C src build BUILDDIR=/workspace/code/cuda-mode/p2p-perf/nccl-tests/build
make[1]: Entering directory '/workspace/code/cuda-mode/p2p-perf/nccl-tests/src'
Compiling  timer.cc                            > /workspace/code/cuda-mode/p2p-perf/nccl-tests/build/timer.o


Compiling /workspace/code/cuda-mode/p2p-perf/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /workspace/code/cuda-mode/p2p-perf/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /workspace/code/cuda-mode/p2p-perf/nccl-tests/build/common.o
Linking  /workspace/code/cuda-mode/p2p-perf/nccl-tests/build/all_reduce.o > /workspace/code/cuda-mode/p2p-perf/nccl-tests/build/all_reduce_perf
Compiling  all_gather.cu                       > /workspace/code/cuda-mode/p2p-perf/nccl-tests/build/all_gather.o
Linking  /workspace/code/cuda-mode/p2p-perf/nccl-tests/build/all_gather.o > /workspace/code/cuda-mode/p2p-perf/nccl-tests/build/all_gather_perf
Compiling  broadcast.cu                        > /workspace/code/cuda-mode/p2p-perf/nccl-tests/build/broadcast.o
Linking  /workspace/code/cuda-mode/p2p-perf/nccl-tests/build/broadcast.o > /workspace/code/cuda-mode/p2p-perf/nccl-tests/build/broadcast_perf
Compiling  reduce_scatter.cu     

### Running all_reduce_perf

In [2]:
!../nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices


#  Rank  0 Group  0 Pid 686484 on 58e5f0f169a6 device  0 [0x56] NVIDIA RTX A5000
#  Rank  1 Group  0 Pid 686484 on 58e5f0f169a6 device  1 [0x57] NVIDIA RTX A5000
#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1    18.02    0.00    0.00      0    18.24    0.00    0.00      0
          16             4     float     sum      -1    18.96    0.00    0.00      0    16.45    0.00    0.00      0
          32             8     float     sum      -1    17.12    0.00    0.00      0    16.92    0.00    0.00      0
          64            16     float     sum      -1    15.88    0.00    0.00      0    16.11    0.00    0.00      0
         128     
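The `0.00` bandwidth entries for the smallest messages are a rounding effect: at a few bytes per call the run is dominated by the roughly 16 to 19 µs latency. The sketch below follows the definitions in the nccl-tests performance notes as I understand them (algbw = bytes / time in units of 1e9 bytes per second, and for all-reduce busbw = algbw · 2(n−1)/n). The function name is illustrative.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s for one all-reduce measurement."""
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # All-reduce correction factor; equals 1 for two ranks.
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw

# First row of the table above: 8 bytes in 18.02 us on 2 GPUs.
algbw, busbw = allreduce_bandwidth(8, 18.02, 2)
print(f"{algbw:.2f} {busbw:.2f}")  # 0.00 0.00
```

With two GPUs the correction factor is exactly 1, so algbw and busbw coincide in every row of this run.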

In [1]:
import torch
import torch.utils.benchmark as benchmark

device0 = torch.device("cuda", 0)
device1 = torch.device("cuda", 1)

# 1024^3 float32 elements = 4 GiB on GPU 0; 1024^3 int64 elements = 8 GiB on GPU 1
x0 = torch.randn(1024, 1024, 1024, dtype=torch.float32, device=device0)
x1 = torch.randint(0, 100, (1024, 1024, 1024), dtype=torch.long, device=device1)

def copy_tensor(x, dest_device):
    # Device-to-device copy; with copy=False, .to() returns x itself if it is already on dest_device
    y = x.to(dest_device, non_blocking=False, copy=False)
    return y

# GPU 0 -> GPU 1
t0 = benchmark.Timer(
    stmt='copy_tensor(x0, device1)',
    setup='from __main__ import copy_tensor',
    globals={'x0': x0, 'device1': device1},
    num_threads=1)

# GPU 1 -> GPU 0
t1 = benchmark.Timer(
    stmt='copy_tensor(x1, device0)',
    setup='from __main__ import copy_tensor',
    globals={'x1': x1, 'device0': device0},
    num_threads=1)

# sanity check
s0 = x0.sum().cpu()
y1 = copy_tensor(x0, device1)
s1 = y1.sum().cpu()
assert torch.abs(s0-s1) < 1e-5

m0 = t0.timeit(100)
storage_size0 = x0.untyped_storage().size()
print(f"{device0}->{device1}: {storage_size0/m0.mean/2**30:.3f} GB/s")

m1 = t1.timeit(100)
storage_size1 = x1.untyped_storage().size()
print(f"{device1}->{device0}: {storage_size1/m1.mean/2**30:.3f} GB/s")

cuda:0->cuda:1: 24.572 GB/s
cuda:1->cuda:0: 24.535 GB/s
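The benchmark divides by `2**30`, so the printed figures are strictly GiB/s (binary), although the label says GB/s. The sketch below converts between the two conventions. The helper name is illustrative, and the copy time is back-computed from the printed result rather than measured.

```python
def copy_bandwidth(num_bytes, seconds):
    """Return (GiB/s, GB/s) for a copy of num_bytes taking the given time."""
    per_second = num_bytes / seconds
    return per_second / 2**30, per_second / 1e9

# x0 holds 1024^3 float32 values, i.e. 4 GiB.
num_bytes = 1024**3 * 4
# Mean copy time implied by the printed 24.572 figure (an inference, not a measurement).
seconds = num_bytes / (24.572 * 2**30)

gib_s, gb_s = copy_bandwidth(num_bytes, seconds)
print(f"{gib_s:.3f} GiB/s = {gb_s:.3f} GB/s")  # 24.572 GiB/s = 26.384 GB/s
```

This matters when comparing against tools that report decimal GB/s, such as nccl-tests.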


In [10]:
!OMP_NUM_THREADS=1 /usr/local/bin/torchrun --standalone --nnodes=1 --nproc-per-node 2 torch_distributed_nccl_test.py

rank=0 send N=50, elapsed_time=8.6825s, 23.035 GB/s, sum=-9481.1240234375 (cuda:0)
rank=1 recv N=50, elapsed_time=8.6485s, 23.125 GB/s, sum=-9481.1240234375 (cuda:1)
rank=0 broadcast(a, src=0) N=50, elapsed_time=9.5473s, 20.948 GB/s, sum=-9481.1240234375 (cuda:0)
rank=0 broadcast(a, src=1) N=50, elapsed_time=9.3507s, 21.389 GB/s, sum=-9481.1240234375 (cuda:0)


In [8]:
torch.__version__

'2.2.0+cu121'

### Running cuda-samples simpleP2P and p2pBandwidthLatencyTest

In [7]:
!cd ../cuda-samples/Samples/0_Introduction/simpleP2P/ && make
!../cuda-samples/Samples/0_Introduction/simpleP2P/simpleP2P

make: Nothing to be done for 'all'.


[../cuda-samples/Samples/0_Introduction/simpleP2P/simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 2

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA RTX A5000 (GPU0) -> NVIDIA RTX A5000 (GPU1) : Yes
> Peer access from NVIDIA RTX A5000 (GPU1) -> NVIDIA RTX A5000 (GPU0) : Yes
Enabling peer access between GPU0 and GPU1...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 24.47GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
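The peer-access check that simpleP2P reports can also be queried from PyTorch with `torch.cuda.can_device_access_peer`. The following is a minimal sketch. The `peer_access_matrix` helper is an illustrative name, and the guarded import is only there so the snippet also runs, returning an empty result, on a machine without PyTorch or CUDA.

```python
try:
    import torch
except ImportError:  # keep the sketch usable on machines without PyTorch
    torch = None

def peer_access_matrix():
    """Map (src, dst) GPU index pairs to whether src can directly access dst's memory."""
    if torch is None or not torch.cuda.is_available():
        return {}
    n = torch.cuda.device_count()
    return {
        (i, j): torch.cuda.can_device_access_peer(i, j)
        for i in range(n) for j in range(n) if i != j
    }

for (src, dst), ok in peer_access_matrix().items():
    print(f"GPU{src} -> GPU{dst}: {'Yes' if ok else 'No'}")
```

On the pod used here, this would be expected to agree with the two "Yes" lines in the simpleP2P output above.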


In [22]:
!cd ../cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest && make
!../cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest 

/usr/local/cuda/bin/nvcc -ccbin g++ -I../../../Common -m64 --threads 0 --std=c++11 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o p2pBandwidthLatencyTest.o -c p2pBandwidthLatencyTest.cu


/usr/local/cuda/bin/nvcc -ccbin g++ -m64 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_89,code=sm_89 -gencode arch=compute_90,code=sm_90 -gencode arch=compute_90,code=compute_90 -o p2pBandwidthLatencyTest p2pBandwidthLatencyTest.o 
mkdir -p ../../../bin/x86_64/linux/release
cp p2pBandwidthLatencyTest ../../../bin/x86_64/linux/release
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA RTX A5000, pciBusID: 56, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA RTX A5000, pciBusID: 57, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0

***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidt