p2pmark

GPU peer-to-peer interconnect benchmark. Measures PCIe/NVLink P2P bandwidth and latency across multi-GPU systems.

Modes

Bandwidth (default)

Measures P2P bandwidth using 64 MB cudaMemcpyPeerAsync transfers and produces two scores:

PCIe Link Score: how close each GPU's P2P bandwidth gets to the theoretical PCIe 5.0 x16 limit (63 GB/s).
Dense Interconnect Score: the ratio of actual aggregate bandwidth (all GPUs reading from all peers simultaneously) to the ideal sum of isolated per-GPU bandwidths. A score of 1.0 means a perfect full mesh with no contention; lower values indicate shared-fabric bottlenecks (e.g. multi-switch PCIe topologies).
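
The arithmetic behind the two scores can be sketched in a few lines of Python. This is an illustration only, not code from p2pmark.cu: the function names are hypothetical, and it assumes the PCIe Link Score is the average per-GPU bandwidth divided by 63 GB/s, which is consistent with the figures shown under Example Output.

```python
PCIE5_X16_GBPS = 63.0  # theoretical PCIe 5.0 x16 limit quoted in this README


def pcie_link_score(avg_gbps, limit_gbps=PCIE5_X16_GBPS):
    """Average per-GPU P2P bandwidth as a fraction of the PCIe limit (assumed formula)."""
    return avg_gbps / limit_gbps


def dense_interconnect_score(measured_gbps, ideal_gbps):
    """Measured all-to-all aggregate bandwidth as a fraction of the ideal total.

    ideal_gbps is the sum of the isolated per-GPU bandwidths.
    """
    return measured_gbps / ideal_gbps


# Figures taken from the Example Output section of this README.
print(f"{pcie_link_score(54.30):.2f}")                   # prints 0.86
print(f"{dense_interconnect_score(189.13, 434.71):.2f}")  # prints 0.44
```

Both scores are plain ratios, so a value near 1.00 means the measured bandwidth is close to its reference and a value near 0.00 means it is far below it.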

Latency (`--latency`)

Measures P2P latency using a custom CUDA kernel that performs direct remote reads (single cacheline, 128 bytes) via P2P-mapped pointers. Single-thread, single-block launches isolate the true hardware transfer latency from API overhead.

Reports the best-pair P2P latency (measured in isolation) and the loaded latency (average and maximum) when all GPUs read from all peers simultaneously.
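
As a rough sketch of how such a summary could be reduced from raw measurements, the Python below takes a matrix of isolated pair latencies and a list of per-GPU latencies under full load. The function name, the data layout, and the sample values are all hypothetical; the actual tool may collect and aggregate its timings differently.

```python
def summarize_latency(isolated_us, loaded_us):
    """Return (best-pair isolated latency, loaded average, loaded maximum) in microseconds.

    isolated_us[src][dst] is the isolated latency of GPU src reading from GPU dst;
    the diagonal (a GPU reading from itself) is ignored.
    loaded_us[i] is GPU i's latency while all GPUs read from all peers.
    """
    n = len(isolated_us)
    best_pair = min(
        isolated_us[src][dst]
        for src in range(n)
        for dst in range(n)
        if src != dst
    )
    loaded_avg = sum(loaded_us) / len(loaded_us)
    loaded_max = max(loaded_us)
    return best_pair, loaded_avg, loaded_max


# Hypothetical measurements for a 3-GPU system.
isolated = [
    [0.0, 4.1, 4.3],
    [4.2, 0.0, 4.4],
    [4.3, 4.5, 0.0],
]
loaded = [6.0, 6.5, 7.0]

best, avg, worst = summarize_latency(isolated, loaded)
print(f"best pair {best:.2f} us, loaded avg {avg:.2f} us, loaded max {worst:.2f} us")
# prints: best pair 4.10 us, loaded avg 6.50 us, loaded max 7.00 us
```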

Tests

Both modes run the same four test patterns:

Sequential NxN: baseline per-link measurement, one transfer at a time.
Topology probe: staggered reads by peer distance, which reveals the switch topology.
Single reader: one GPU reads from all peers concurrently.
All GPUs concurrent: every GPU reads from all peers at once.
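
One way to picture the four patterns is as schedules of (reader, peer) transfers, where each step lists the transfers that run at the same time. The Python sketch below is illustrative and not derived from p2pmark.cu; the function names are hypothetical. In particular, the topology probe is an assumption: it models "peer distance" as the index offset d, with every GPU i reading from GPU (i + d) % n in the same step.

```python
def sequential_nxn(n):
    """One transfer per step, covering every ordered (reader, peer) pair."""
    return [[(r, p)] for r in range(n) for p in range(n) if r != p]


def topology_probe(n):
    """ASSUMPTION: one step per distance d; every GPU i reads from GPU (i + d) % n."""
    return [[(r, (r + d) % n) for r in range(n)] for d in range(1, n)]


def single_reader(n):
    """One step per reader; that GPU reads from all of its peers concurrently."""
    return [[(r, p) for p in range(n) if p != r] for r in range(n)]


def all_concurrent(n):
    """A single step in which every GPU reads from every peer."""
    return [[(r, p) for r in range(n) for p in range(n) if r != p]]


# Step counts for a hypothetical 4-GPU system.
for name, plan in [
    ("sequential", sequential_nxn(4)),
    ("topology probe", topology_probe(4)),
    ("single reader", single_reader(4)),
    ("all concurrent", all_concurrent(4)),
]:
    print(f"{name}: {len(plan)} steps")
# prints 12, 3, 4 and 1 steps respectively
```

Read this way, the patterns run from no concurrency at all (sequential) to full all-to-all load (all concurrent), which is the progression that exposes shared-fabric contention.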

Build & Run

make
./p2pmark            # bandwidth mode
./p2pmark --latency  # latency mode

Requires the CUDA toolkit and a C++20-capable compiler (nvcc with `-std=c++20`).

Example Output

Bandwidth

===========================================================
  PCIe LINK SCORE:           0.86
  (54.30 GB/s avg  /  63.0 GB/s PCIe 5.0 x16 theoretical)

  DENSE INTERCONNECT SCORE:  0.44
  (189.13 GB/s measured  /  434.71 GB/s ideal)

  1.00 = perfect, 0.00 = none
===========================================================

Latency

===========================================================
  Min latency:             4.10 us  (best pair, isolated)
  Mean latency:            6.14 us  (per GPU under full load)

  EFFECTIVE LATENCY:       6.14 us  (all GPUs done reading all peers)
===========================================================
