# Peer-to-Peer (P2P) with 2x NVIDIA RTX 4090

- cloud provider: [runpod.io](https://www.runpod.io/)
- pod: 2 x RTX 4090 (25 vCPU 200 GB RAM)
- image: `runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04`

### nvidia-smi


In [1]:
!nvidia-smi

Wed Mar 27 13:54:37 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.154.05             Driver Version: 535.154.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:02:00.0 Off |                  Off |
| 30%   27C    P8              21W / 450W |      1MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:42:00.0 Off |  

In [2]:
!nvidia-smi topo -m

        GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     0-63            0               N/A
GPU1    SYS      X      0-63            0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
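
The matrix above can also be read programmatically. The following is a minimal sketch, assuming the whitespace-separated layout shown in this notebook; `parse_topo_matrix` and the hard-coded `SAMPLE` text are illustrative names, not part of `nvidia-smi` or any NVIDIA tooling.

```python
def parse_topo_matrix(text):
    """Return {(src, dst): link_type} for the GPU rows of `nvidia-smi topo -m`."""

    def is_gpu(token):
        # Matches GPU0, GPU1, ... but not the bare word "GPU" in "GPU NUMA ID".
        return token.startswith("GPU") and token[3:].isdigit()

    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header row: the leading GPU<n> columns name the peer devices.
    peers = [tok for tok in lines[0].split() if is_gpu(tok)]
    links = {}
    for line in lines[1:]:
        fields = line.split()
        if not is_gpu(fields[0]):
            break  # reached the legend, no more matrix rows
        for peer, link in zip(peers, fields[1:1 + len(peers)]):
            links[(fields[0], peer)] = link
    return links


# Sample copied from the output above (hypothetical hard-coded input).
SAMPLE = """\
\tGPU0\tGPU1\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID
GPU0\t X \tSYS\t0-63\t0\t\tN/A
GPU1\tSYS\t X \t0-63\t0\t\tN/A

Legend:
  X    = Self
"""

links = parse_topo_matrix(SAMPLE)
print(links[("GPU0", "GPU1")])
```

For this pod the lookup yields `SYS`. Per the legend, that means traffic between the two GPUs crosses PCIe and the SMP interconnect, with no NVLink in the path.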


### Building [nccl-tests](https://github.com/NVIDIA/nccl-tests/tree/master)

In [3]:
!git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests/ && make

Cloning into 'nccl-tests'...
remote: Enumerating objects: 337, done.
remote: Counting objects: 100% (215/215), done.
remote: Compressing objects: 100% (83/83), done.
remote: Total 337 (delta 184), reused 140 (delta 132), pack-reused 122
Receiving objects: 100% (337/337), 129.26 KiB | 1.38 MiB/s, done.
Resolving deltas: 100% (223/223), done.
make -C src build BUILDDIR=/workspace/p2p-4090/nccl-tests/build
make[1]: Entering directory '/workspace/p2p-4090/nccl-tests/src'
Compiling  timer.cc                            > /workspace/p2p-4090/nccl-tests/build/timer.o
Compiling /workspace/p2p-4090/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /workspace/p2p-4090/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /workspace/p2p-4090/nccl-tests/build/common.o
Linking  /workspace/p2p-4090/nccl-tests/build/all_reduce.o > /workspace/p2p-4090/nccl-tests/build/all_reduce_perf
Compiling  all_gather.cu                   

### Running all_reduce_perf

In [8]:
!nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  10358 on 252630d85675 device  0 [0x02] NVIDIA GeForce RTX 4090
#  Rank  1 Group  0 Pid  10358 on 252630d85675 device  1 [0x42] NVIDIA GeForce RTX 4090


#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     7.50    0.00    0.00      0     7.46    0.00    0.00      0
          16             4     float     sum      -1     7.42    0.00    0.00      0     7.55    0.00    0.00      0
          32             8     float     sum      -1     7.36    0.00    0.00      0     7.41    0.00    0.00      0
          64            16     float     sum      -1     7.36    0.01    0.01      0     7.52    0.01    0.01      0
         128            32     float     sum      -1     7.42    0.02    0.02      0     7.35    0.02    0.02      0
         256            64     float     sum      -1     7.63 
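
To make the `algbw` and `busbw` columns easier to read, here is a small sketch of how they relate. It follows the definitions in the nccl-tests performance notes: algorithm bandwidth is size divided by time, and for all-reduce the bus bandwidth scales it by `2*(n-1)/n`. The function name `allreduce_bandwidth` is a hypothetical helper, not something nccl-tests exposes.

```python
def allreduce_bandwidth(size_bytes, time_us, n_ranks):
    """Return (algbw, busbw) in GB/s (1 GB = 1e9 bytes) for one all-reduce."""
    algbw = size_bytes / (time_us * 1e-6) / 1e9
    # All-reduce correction factor; equals 1 when n_ranks == 2.
    busbw = algbw * 2 * (n_ranks - 1) / n_ranks
    return algbw, busbw


# (size in bytes, time in us) taken from the out-of-place columns above.
for size, time_us in [(8, 7.50), (64, 7.36), (128, 7.42)]:
    algbw, busbw = allreduce_bandwidth(size, time_us, n_ranks=2)
    print(f"{size:>4} B  algbw={algbw:.2f} GB/s  busbw={busbw:.2f} GB/s")
```

With two GPUs the factor is exactly 1, which is why `algbw` and `busbw` are identical in every row of this run. At these small sizes the time stays near 7.4 us regardless of size, so the figures reflect launch latency rather than link bandwidth.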

### PyTorch GPU-to-GPU copy benchmark

In [1]:
import torch
import torch.utils.benchmark as benchmark

device0 = torch.device("cuda", 0)
device1 = torch.device("cuda", 1)

# 4 GiB float32 tensor on GPU 0 and 8 GiB int64 tensor on GPU 1
x0 = torch.randn(1024, 1024, 1024, dtype=torch.float32, device=device0)
x1 = torch.randint(0, 100, (1024, 1024, 1024), dtype=torch.long, device=device1)

def copy_tensor(x, dest_device):
    # Blocking copy; copy=False only skips the copy when x is already on dest_device
    y = x.to(dest_device, non_blocking=False, copy=False)
    return y

# GPU 0 -> GPU 1
t0 = benchmark.Timer(
    stmt='copy_tensor(x0, device1)',
    setup='from __main__ import copy_tensor',
    globals={'x0': x0, 'device1': device1},
    num_threads=1)

# GPU 1 -> GPU 0
t1 = benchmark.Timer(
    stmt='copy_tensor(x1, device0)',
    setup='from __main__ import copy_tensor',
    globals={'x1': x1, 'device0': device0},
    num_threads=1)

# sanity check
s0 = x0.sum().cpu()
y1 = copy_tensor(x0, device1)
s1 = y1.sum().cpu()
assert torch.abs(s0-s1) < 1e-5

m0 = t0.timeit(100)
storage_size0 = x0.untyped_storage().size()
print(f"{device0}->{device1}: {storage_size0/m0.mean/2**30:.3f} GB/s")

m1 = t1.timeit(100)
storage_size1 = x1.untyped_storage().size()
print(f"{device1}->{device0}: {storage_size1/m1.mean/2**30:.3f} GB/s")

cuda:0->cuda:1: 20.989 GB/s
cuda:1->cuda:0: 20.932 GB/s
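
The benchmark divides by `2**30`, so the printed figures are in GiB/s even though the label says GB/s. The sketch below converts them to decimal GB/s, which is the unit nccl-tests reports; `gib_to_gb_per_s` is a hypothetical helper name and the rates are copied from the output above.

```python
GIB = 2**30   # bytes per GiB, the divisor used in the benchmark cell
GB = 10**9    # bytes per decimal GB


def gib_to_gb_per_s(rate_gib_per_s):
    """Convert a transfer rate from GiB/s to decimal GB/s."""
    return rate_gib_per_s * GIB / GB


for label, rate in [("cuda:0->cuda:1", 20.989), ("cuda:1->cuda:0", 20.932)]:
    print(f"{label}: {gib_to_gb_per_s(rate):.2f} GB/s")
```

Both directions come out at roughly 22.5 GB/s in decimal units.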


### Running torch_distributed_nccl_test.py

In [16]:
!OMP_NUM_THREADS=1 torchrun --standalone --nnodes=1 --nproc-per-node 2 torch_distributed_nccl_test.py

rank=1 recv N=50, elapsed_time=50.4978s, 3.961 GB/s, sum=-75526.25 (cuda:1)
rank=0 send N=50, elapsed_time=50.4978s, 3.961 GB/s, sum=-75526.25 (cuda:0)
rank=0 broadcast(a, src=0) N=50, elapsed_time=50.1422s, 3.989 GB/s, sum=-75526.25 (cuda:0)
rank=1 broadcast(a, src=0) N=50, elapsed_time=50.1425s, 3.989 GB/s, sum=-75526.25 (cuda:1)
rank=1 broadcast(a, src=1) N=50, elapsed_time=50.3260s, 3.974 GB/s, sum=-75526.25 (cuda:1)
rank=0 broadcast(a, src=1) N=50, elapsed_time=50.3260s, 3.974 GB/s, sum=-75526.25 (cuda:0)
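
The test script itself is not shown here, so the following is only a sketch for interpreting its log. It assumes the reported rate is total bytes moved divided by `elapsed_time`, in which case the payload per iteration is `rate * elapsed_time / N`. The regular expression and the name `implied_payload` are illustrative.

```python
import re

# Three lines copied from the output above.
LOG = """\
rank=1 recv N=50, elapsed_time=50.4978s, 3.961 GB/s, sum=-75526.25 (cuda:1)
rank=0 broadcast(a, src=0) N=50, elapsed_time=50.1422s, 3.989 GB/s, sum=-75526.25 (cuda:0)
rank=1 broadcast(a, src=1) N=50, elapsed_time=50.3260s, 3.974 GB/s, sum=-75526.25 (cuda:1)
"""

PATTERN = re.compile(
    r"rank=(\d+) (.+?) N=(\d+), elapsed_time=([\d.]+)s, ([\d.]+) GB/s"
)


def implied_payload(line):
    """Return (operation, payload per iteration) in the log's own size unit."""
    match = PATTERN.search(line)
    _rank, op, n, elapsed, rate = match.groups()
    # Assumption: rate == total_bytes / elapsed_time, so invert it per iteration.
    return op, float(rate) * float(elapsed) / int(n)


for line in LOG.splitlines():
    op, per_iter = implied_payload(line)
    print(f"{op}: {per_iter:.2f} per iteration")
```

Under that assumption every operation moves about 4.0 units per iteration and takes about 1 s, which would be consistent with a tensor the size of `x0` (4 GiB of float32). The NCCL send/recv and broadcast rates of about 4 GB/s are well below the roughly 21 GB/s measured for the direct `.to()` copy.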
