@achalpandeyy achalpandeyy commented Sep 11, 2025

Use CUDA events for pmpp_v2 benchmarking to eliminate device sync overhead.

I duplicated the clear_l2_cache code from amd_distributed, but I am happy to move it to a common place if desired.
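For readers unfamiliar with the technique, here is a minimal sketch of event-based timing with an L2 flush before each run. It is not the code from this PR: the function names, the 256 MiB scratch size, and the per-run structure are assumptions made for illustration. Only `torch.cuda.Event`, `record`, `synchronize`, and `elapsed_time` are real PyTorch calls.

```python
def clear_l2_cache(size_bytes=256 * 1024 * 1024):
    """Overwrite a large scratch buffer so earlier data is evicted from L2.

    Assumption: 256 MiB is larger than the L2 cache of the target GPU.
    """
    import torch  # imported lazily so this file loads without PyTorch

    scratch = torch.empty(size_bytes, dtype=torch.int8, device="cuda")
    scratch.zero_()


def time_with_cuda_events(fn, runs=10):
    """Return the GPU duration of each fn() call in nanoseconds."""
    import torch  # imported lazily so this file loads without PyTorch

    if not torch.cuda.is_available():
        raise RuntimeError("a CUDA device is required for event timing")

    durations_ns = []
    for _ in range(runs):
        clear_l2_cache()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()   # timestamp taken on the GPU stream
        fn()
        end.record()
        end.synchronize()  # wait for completion, outside the timed interval
        # elapsed_time() returns milliseconds; convert to nanoseconds.
        durations_ns.append(start.elapsed_time(end) * 1e6)
    return durations_ns
```

Because both timestamps are recorded on the GPU stream, the host-side wait after `end.record()` is not part of the measured interval, which is how the sync overhead drops out of the result.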

Log before the change

$ PYTHONPATH=vectorsum_py POPCORN_FD=1 python3 eval.py benchmark vectorsum_py/bench_cases.txt
benchmark-count: 6
benchmark.0.spec: size: 1638400; seed: 93246
benchmark.0.runs: 100
benchmark.0.bandwidth (GBPS): 23.495783829333316
benchmark.0.mean: 519541.35
benchmark.0.std: 165647.28339166287
benchmark.0.err: 16564.728339166286
benchmark.0.best: 362325.0
benchmark.0.worst: 1164039.0
benchmark.1.spec: size: 3276800; seed: 6256
benchmark.1.runs: 100
benchmark.1.bandwidth (GBPS): 29.851347176453594
benchmark.1.mean: 817854.63
benchmark.1.std: 223377.4959560363
benchmark.1.err: 22337.74959560363
benchmark.1.best: 577446.0
benchmark.1.worst: 1851050.0
benchmark.2.spec: size: 6553600; seed: 8841
benchmark.2.runs: 100
benchmark.2.bandwidth (GBPS): 38.22532830793835
benchmark.2.mean: 1277376.21
benchmark.2.std: 346352.9819896125
benchmark.2.err: 34635.29819896125
benchmark.2.best: 919113.0
benchmark.2.worst: 2475113.0
benchmark.3.spec: size: 13107200; seed: 6252
benchmark.3.runs: 100
benchmark.3.bandwidth (GBPS): 51.416789655408145
benchmark.3.mean: 1899306.64
benchmark.3.std: 235694.93807828726
benchmark.3.err: 23569.493807828727
benchmark.3.best: 1647360.0
benchmark.3.worst: 2907078.0
benchmark.4.spec: size: 26214400; seed: 82135
benchmark.4.runs: 100
benchmark.4.bandwidth (GBPS): 57.20257865373236
benchmark.4.mean: 3414400.27
benchmark.4.std: 235066.04157666484
benchmark.4.err: 23506.604157666483
benchmark.4.best: 3191940.0
benchmark.4.worst: 4731688.0
benchmark.5.spec: size: 52428800; seed: 12345
benchmark.5.runs: 100
benchmark.5.bandwidth (GBPS): 62.49237962923411
benchmark.5.mean: 6250762.13
benchmark.5.std: 234928.7028364935
benchmark.5.err: 23492.87028364935
benchmark.5.best: 6068260.0
benchmark.5.worst: 7927425.0
check: pass

Log after the change

$ PYTHONPATH=vectorsum_py POPCORN_FD=1 python3 eval.py benchmark vectorsum_py/bench_cases.txt
benchmark-count: 6
benchmark.0.spec: size: 1638400; seed: 93246
benchmark.0.runs: 100
benchmark.0.bandwidth (GBPS): 56.515038533784214
benchmark.0.mean: 215996.15901708603
benchmark.0.std: 29035.950794373504
benchmark.0.err: 2903.5950794373503
benchmark.0.best: 208447.99280166626
benchmark.0.worst: 489504.00948524475
benchmark.1.spec: size: 3276800; seed: 6256
benchmark.1.runs: 100
benchmark.1.bandwidth (GBPS): 62.302996908461715
benchmark.1.mean: 391860.16261577606
benchmark.1.std: 32548.073898003942
benchmark.1.err: 3254.807389800394
benchmark.1.best: 383296.01287841797
benchmark.1.worst: 700640.0227546692
benchmark.2.spec: size: 6553600; seed: 8841
benchmark.2.runs: 100
benchmark.2.bandwidth (GBPS): 64.44218906047696
benchmark.2.mean: 757704.319357872
benchmark.2.std: 100145.04530299056
benchmark.2.err: 10014.504530299057
benchmark.2.best: 734879.9705505371
benchmark.2.worst: 1448832.0350646973
benchmark.3.spec: size: 13107200; seed: 6252
benchmark.3.runs: 100
benchmark.3.bandwidth (GBPS): 66.08331315440807
benchmark.3.mean: 1477774.72615242
benchmark.3.std: 127271.5885550621
benchmark.3.err: 12727.15885550621
benchmark.3.best: 1439743.995666504
benchmark.3.worst: 2391616.106033325
benchmark.4.spec: size: 26214400; seed: 82135
benchmark.4.runs: 100
benchmark.4.bandwidth (GBPS): 67.09424983580644
benchmark.4.mean: 2911016.972064972
benchmark.4.std: 183735.76795047574
benchmark.4.err: 18373.576795047575
benchmark.4.best: 2852895.975112915
benchmark.4.worst: 4468736.171722412
benchmark.5.spec: size: 52428800; seed: 12345
benchmark.5.runs: 48
benchmark.5.bandwidth (GBPS): 67.90498315241787
benchmark.5.mean: 5752523.332834244
benchmark.5.std: 39765.546906717944
benchmark.5.err: 5739.662302766574
benchmark.5.best: 5658624.172210693
benchmark.5.worst: 5799935.817718506
check: pass
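As a reading aid for both logs: the reported err values are consistent with the standard error of the mean, std / sqrt(runs). The sketch below checks this against two entries from the logs. The function name is hypothetical, and the relationship is inferred from the numbers rather than taken from eval.py.

```python
import math


def standard_error(std, runs):
    """Standard error of the mean for `runs` samples with deviation `std`."""
    return std / math.sqrt(runs)


# benchmark.0 before the change: std 165647.28..., 100 runs
print(standard_error(165647.28339166287, 100))
# benchmark.5 after the change: std 39765.54..., 48 runs
print(standard_error(39765.546906717944, 48))
```

This also explains why benchmark.5 after the change, which stopped at 48 runs, has an err that is not simply std divided by 10.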

The bandwidth figures in the vectorsum_py logs above were calculated as follows (mean is in nanoseconds, and the 1024^3 divisor means the unit is GiB/s):

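if field.name == "mean":
  # Input size in GiB: element count times bytes per float64 element.
  num_gib = (test.args["size"] * torch.float64.itemsize) / (1024.0 * 1024.0 * 1024.0)
  # The mean is in nanoseconds, so scale by 1e9 to get GiB per second.
  bandwidth = 1e9 * num_gib / getattr(result, field.name)
  logger.log(f"benchmark.{idx}.bandwidth (GBPS)", bandwidth)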
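The same arithmetic can be reproduced outside eval.py. This standalone sketch uses a hypothetical function name and hard-codes 8 bytes per element (the value of torch.float64.itemsize), then recomputes the benchmark.0 figure from each log.

```python
def bandwidth_gib_per_s(num_elements, itemsize_bytes, mean_ns):
    """Bandwidth in GiB/s for `num_elements` items processed in `mean_ns` ns."""
    gib = (num_elements * itemsize_bytes) / (1024.0 * 1024.0 * 1024.0)
    return 1e9 * gib / mean_ns


# benchmark.0: size 1638400, 8-byte elements
before = bandwidth_gib_per_s(1638400, 8, 519541.35)
after = bandwidth_gib_per_s(1638400, 8, 215996.15901708603)
print(before, after)
```

The two results should agree with the 23.50 and 56.52 values logged for benchmark.0 before and after the change.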

@msaroufim msaroufim self-requested a review September 11, 2025 17:00
@msaroufim msaroufim merged commit 2963e52 into gpu-mode:main Sep 11, 2025