# Peer-to-Peer (P2P) with 2x RTX 4000 Ada

- cloud provider: [runpod.io](https://www.runpod.io/)
- pod: 2 x RTX 4000 Ada (19 vCPU 100 GB RAM)
- image: `runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04`

### nvidia-smi


In [1]:
!nvidia-smi

Wed Mar 27 22:31:29 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|   0  NVIDIA RTX 4000 Ada Gene...    On  | 00000000:82:00.0 Off |                  Off |
| 30%   30C    P8              11W / 130W |      2MiB / 20475MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX 4000 Ada Gene...    On  | 00000000:C1:00.0 Off |  

In [2]:
!nvidia-smi topo -m

	GPU0	GPU1	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	SYS	0-47		N/A		N/A
GPU1	SYS	 X 	0-47		N/A		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
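
The matrix above can also be read programmatically, which is handy when comparing several pods. The following is a minimal sketch, assuming the tab-separated layout shown in this notebook. `parse_topo_matrix` is a hypothetical helper (not part of any NVIDIA tool), and the sample text is the output above with the terminal formatting codes removed.

```python
def parse_topo_matrix(text):
    """Return {gpu: {peer: link_type}} from `nvidia-smi topo -m` output.

    Assumes the tab-separated layout shown in this notebook.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Header: keep only the GPU columns (GPU0, GPU1, ...), in order.
    # "GPU NUMA ID" also starts with "GPU", so require a trailing digit.
    header = [c.strip() for c in lines[0].split("\t") if c.strip()]
    gpus = [c for c in header if c.startswith("GPU") and c[3:].isdigit()]
    topo = {}
    for ln in lines[1:]:
        cells = [c.strip() for c in ln.split("\t")]
        name = cells[0]
        if name not in gpus:
            continue  # skip legend or unrelated rows
        # The first len(gpus) cells after the row name are the link types.
        topo[name] = dict(zip(gpus, cells[1:1 + len(gpus)]))
    return topo


sample = (
    "\tGPU0\tGPU1\tCPU Affinity\tNUMA Affinity\tGPU NUMA ID\n"
    "GPU0\t X \tSYS\t0-47\t\tN/A\t\tN/A\n"
    "GPU1\tSYS\t X \t0-47\t\tN/A\t\tN/A\n"
)
topo = parse_topo_matrix(sample)
print(topo["GPU0"]["GPU1"])
```

On this pod both directions report `SYS`, which per the legend means the path crosses PCIe and the interconnect between NUMA nodes rather than NVLink.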


### Building [nccl-tests](https://github.com/NVIDIA/nccl-tests/tree/master)

In [3]:
!git clone https://github.com/NVIDIA/nccl-tests.git && cd nccl-tests/ && make

Cloning into 'nccl-tests'...
remote: Enumerating objects: 337, done.
remote: Counting objects: 100% (215/215), done.
remote: Compressing objects: 100% (83/83), done.
remote: Total 337 (delta 184), reused 140 (delta 132), pack-reused 122
Receiving objects: 100% (337/337), 129.29 KiB | 1.56 MiB/s, done.
Resolving deltas: 100% (223/223), done.
make -C src build BUILDDIR=/root/p2p-perf/rtx-A4000-ada-2x/nccl-tests/build
make[1]: Entering directory '/root/p2p-perf/rtx-A4000-ada-2x/nccl-tests/src'
Compiling  timer.cc                            > /root/p2p-perf/rtx-A4000-ada-2x/nccl-tests/build/timer.o
Compiling /root/p2p-perf/rtx-A4000-ada-2x/nccl-tests/build/verifiable/verifiable.o
Compiling  all_reduce.cu                       > /root/p2p-perf/rtx-A4000-ada-2x/nccl-tests/build/all_reduce.o
Compiling  common.cu                           > /root/p2p-perf/rtx-A4000-ada-2x/nccl-tests/build/common.o
Linking  /root/p2p-perf/rtx-A4000-ada-2x/nccl-tests/build/all_reduce.o > /root/p2p-perf/rtx-A4000-ada-2x/nccl-test

### Running all_reduce_perf

In [5]:
!nccl-tests/build/all_reduce_perf -b 8 -e 128M -f 2 -g 2

# nThread 1 nGpus 2 minBytes 8 maxBytes 134217728 step: 2(factor) warmup iters: 5 iters: 20 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid   2858 on 49a30927f3aa device  0 [0x82] NVIDIA RTX 4000 Ada Generation
#  Rank  1 Group  0 Pid   2858 on 49a30927f3aa device  1 [0xc1] NVIDIA RTX 4000 Ada Generation


#
#                                                              out-of-place                       in-place          
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum      -1     9.74    0.00    0.00      0     9.51    0.00    0.00      0
          16             4     float     sum      -1     9.32    0.00    0.00      0     9.34    0.00    0.00      0
          32             8     float     sum      -1     9.39    0.00    0.00      0     9.28    0.00    0.00      0
          64            16     float     sum      -1     9.37    0.01    0.01      0     9.32    0.01    0.01      0
         128            32     float     sum      -1     9.39    0.01    0.01      0     9.30    0.01    0.01      0
         256            64     float     sum      -1     9.37 
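
As a reading aid for the table: the nccl-tests performance notes describe `algbw` as message size divided by elapsed time, and for all-reduce they scale it by 2(n-1)/n to get `busbw`, where n is the number of ranks. The sketch below reproduces one row under that definition. The function names are hypothetical and chosen only for this illustration.

```python
def algbw_gbps(size_bytes, time_us):
    """Algorithm bandwidth in GB/s (1e9 bytes per second): size / time."""
    return size_bytes / (time_us * 1e-6) / 1e9


def allreduce_busbw_gbps(size_bytes, time_us, n_ranks):
    """Bus bandwidth for all-reduce: algbw * 2 * (n - 1) / n."""
    return algbw_gbps(size_bytes, time_us) * 2 * (n_ranks - 1) / n_ranks


# 64-byte out-of-place row from the table above: 9.37 us on 2 GPUs.
alg = algbw_gbps(64, 9.37)
bus = allreduce_busbw_gbps(64, 9.37, n_ranks=2)
print(f"algbw={alg:.2f} GB/s busbw={bus:.2f} GB/s")
```

With two GPUs the scaling factor is 2(2-1)/2 = 1, which is why the `algbw` and `busbw` columns are identical in this run. At these small message sizes the time stays near 9 us regardless of size, so the rows mostly reflect latency rather than bandwidth.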

### PyTorch device-to-device copy

In [6]:
import torch
import torch.utils.benchmark as benchmark

device0 = torch.device("cuda", 0)
device1 = torch.device("cuda", 1)

# 4 GiB float32 tensor on GPU 0 and 8 GiB int64 tensor on GPU 1
x0 = torch.randn(1024, 1024, 1024, dtype=torch.float32, device=device0)
x1 = torch.randint(0, 100, (1024, 1024, 1024), dtype=torch.long, device=device1)

def copy_tensor(x, dest_device):
    # .to() performs a real copy here because the destination device differs
    y = x.to(dest_device, non_blocking=False, copy=False)
    return y

# GPU 0 -> GPU 1
t0 = benchmark.Timer(
    stmt='copy_tensor(x0, device1)',
    setup='from __main__ import copy_tensor',
    globals={'x0': x0, 'device1': device1},
    num_threads=1)

# GPU 1 -> GPU 0 (x1 lives on device1, so the destination is device0)
t1 = benchmark.Timer(
    stmt='copy_tensor(x1, device0)',
    setup='from __main__ import copy_tensor',
    globals={'x1': x1, 'device0': device0},
    num_threads=1)

# sanity check: the copy on GPU 1 must sum to the same value as the source
s0 = x0.sum().cpu()
y1 = copy_tensor(x0, device1)
s1 = y1.sum().cpu()
assert torch.abs(s0-s1) < 1e-5

# bytes / seconds / 2**30 is GiB/s, although the printed label reads GB/s
m0 = t0.timeit(100)
storage_size0 = x0.untyped_storage().size()
print(f"{device0}->{device1}: {storage_size0/m0.mean/2**30:.3f} GB/s")

m1 = t1.timeit(100)
storage_size1 = x1.untyped_storage().size()
print(f"{device1}->{device0}: {storage_size1/m1.mean/2**30:.3f} GB/s")

cuda:0->cuda:1: 24.545 GB/s
cuda:1->cuda:0: 24.488 GB/s
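
The benchmark divides by `2**30`, so the printed figures are in units of 2^30 bytes per second (GiB/s) even though they are labelled GB/s. The following sketch converts the measured rate to decimal GB/s and estimates the time per copy of the 4 GiB `x0` tensor. The helper names are hypothetical and only the numbers printed above are used.

```python
GIB = 2**30  # bytes per GiB


def gib_to_gb_per_s(gib_per_s):
    """Convert a rate in GiB/s (2**30 bytes) to decimal GB/s (1e9 bytes)."""
    return gib_per_s * GIB / 1e9


def seconds_per_copy(n_bytes, gib_per_s):
    """Time for one transfer of n_bytes at the given GiB/s rate."""
    return n_bytes / (gib_per_s * GIB)


x0_bytes = 1024**3 * 4  # float32 tensor of shape (1024, 1024, 1024)
print(f"{gib_to_gb_per_s(24.545):.2f} GB/s (decimal)")
print(f"{seconds_per_copy(x0_bytes, 24.545) * 1e3:.0f} ms per copy")
```

Keeping the unit explicit matters when comparing against nccl-tests, which reports bandwidth in decimal GB/s.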


### torch.distributed send/recv and broadcast over NCCL

In [1]:
!OMP_NUM_THREADS=1 /usr/local/bin/torchrun --standalone --nnodes=1 --nproc-per-node 2 torch_distributed_nccl_test.py

rank=0 send N=50, elapsed_time=9.6049s, 20.823 GB/s, sum=-49910.9375 (cuda:0)
rank=1 recv N=50, elapsed_time=9.6050s, 20.823 GB/s, sum=-49910.9375 (cuda:1)
rank=0 broadcast(a, src=0) N=50, elapsed_time=10.6061s, 18.857 GB/s, sum=-49910.9375 (cuda:0)
rank=0 broadcast(a, src=1) N=50, elapsed_time=10.5242s, 19.004 GB/s, sum=-49910.9375 (cuda:0)
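
The script `torch_distributed_nccl_test.py` is not shown in this notebook, but the log lines carry enough information to back out how much data each iteration moves: rate times elapsed time divided by `N`. The sketch below does that for two of the lines above. The regular expression and the helper name are assumptions made for this illustration, and the result is expressed in whatever unit the script uses for its "GB/s" figure.

```python
import re

LOG = """\
rank=0 send N=50, elapsed_time=9.6049s, 20.823 GB/s, sum=-49910.9375 (cuda:0)
rank=0 broadcast(a, src=0) N=50, elapsed_time=10.6061s, 18.857 GB/s, sum=-49910.9375 (cuda:0)
"""

# Assumed log layout, taken from the lines printed above.
PATTERN = re.compile(r"N=(\d+), elapsed_time=([\d.]+)s, ([\d.]+) GB/s")


def payload_per_iteration(line):
    """Data moved per iteration = rate * elapsed / N (same unit as the rate)."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    n, elapsed, rate = int(m.group(1)), float(m.group(2)), float(m.group(3))
    return rate * elapsed / n


payloads = [payload_per_iteration(ln) for ln in LOG.splitlines()]
for p in payloads:
    print(f"{p:.3f} per iteration")
```

Both lines come out at about 4 units per iteration. If the script uses the same `2**30` divisor as the copy benchmark (an assumption, since the script is not shown), that would correspond to a 4 GiB payload, the size of a 1024 x 1024 x 1024 float32 tensor.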


In [8]:
torch.__version__

'2.2.0+cu121'