Add a more comprehensive kokkos_{malloc, free} perf_test #6377

Merged: 2 commits, Aug 29, 2023

cwpearson
Contributor

  • Adds this new perf test with a few modes
  • Removes the now-redundant benchmark from ViewAlloc
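
For readers without the diff at hand, here is a rough host-only sketch of what the four modes in the benchmark names (Malloc, MallocFree, MallocTouch, MallocTouchFree) might time. It is not the code from PerfTest_MallocFree.cpp: `std::malloc`/`std::free` stand in for `Kokkos::kokkos_malloc`/`Kokkos::kokkos_free`, a serial loop stands in for the device kernel, and the `Mode` enum, `touch`, `time_one`, and the stride value are all hypothetical names and choices made for illustration. Whether the real test excludes the free from the timed region in the same way is an assumption.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>

// Hypothetical names throughout. std::malloc/std::free stand in for
// Kokkos::kokkos_malloc/Kokkos::kokkos_free so the sketch runs on the host.
enum class Mode { Malloc, MallocFree, MallocTouch, MallocTouchFree };

// Write one byte per stride so the allocation is actually backed by memory.
inline void touch(char* p, std::size_t n) {
  constexpr std::size_t stride = 4096;  // assumed value, not taken from the PR
  volatile char* q = p;                 // volatile keeps the writes from being elided
  for (std::size_t i = 0; i < n; i += stride) q[i] = 1;
}

// Returns the seconds spent in the timed region for one allocation of n bytes,
// or a negative value if the allocation failed.
inline double time_one(Mode mode, std::size_t n) {
  using clock = std::chrono::steady_clock;
  const bool do_touch =
      mode == Mode::MallocTouch || mode == Mode::MallocTouchFree;
  const bool timed_free =
      mode == Mode::MallocFree || mode == Mode::MallocTouchFree;

  const auto start = clock::now();
  char* p = static_cast<char*>(std::malloc(n));
  if (p == nullptr) return -1.0;
  if (do_touch) touch(p, n);
  if (timed_free) std::free(p);
  const auto stop = clock::now();

  if (!timed_free) std::free(p);  // cleanup outside the timed region
  return std::chrono::duration<double>(stop - start).count();
}
```

Timing the region by hand and reporting it to the benchmark library would be consistent with the `manual_time` suffix in the output, though the sketch above does not use the library at all.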

Command and sample output

$ ./Kokkos_PerformanceTest_Benchmark --benchmark_filter="Malloc"
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 8.0 on device with compute capability 8.6 , this will likely reduce potential performance.
2023-08-23T19:17:59+00:00
Running ./Kokkos_PerformanceTest_Benchmark
Run on (56 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x28)
  L1 Instruction 32 KiB (x28)
  L2 Unified 2048 KiB (x28)
  L3 Unified 76800 KiB (x1)
Load Average: 0.20, 2.89, 10.79
CPU architecture: none
Default Device: N6Kokkos4CudaE
GIT_BRANCH: perf_test/malloc-free
GIT_CLEAN_STATUS: DIRTY
GIT_COMMIT_DATE: %cI
GIT_COMMIT_DESCRIPTION: Remove duplicate ViewAllocate_Raw (present in PerfTest_MallocFree)
GIT_COMMIT_HASH: 389c560
GPU architecture: AMPERE80
KOKKOS_COMPILER_GNU: 1220
KOKKOS_COMPILER_NVCC: 1220
KOKKOS_ENABLE_ASM: yes
KOKKOS_ENABLE_CUDA: yes
KOKKOS_ENABLE_CUDA_LAMBDA: yes
KOKKOS_ENABLE_CUDA_LDG_INTRINSIC: yes
KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE: no
KOKKOS_ENABLE_CUDA_UVM: no
KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA: yes
KOKKOS_ENABLE_CXX17: yes
KOKKOS_ENABLE_CXX20: no
KOKKOS_ENABLE_CXX23: no
KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK: no
KOKKOS_ENABLE_HBWSPACE: no
KOKKOS_ENABLE_HWLOC: no
KOKKOS_ENABLE_INTEL_MM_ALLOC: no
KOKKOS_ENABLE_LIBDL: yes
KOKKOS_ENABLE_LIBRT: no
KOKKOS_ENABLE_PRAGMA_IVDEP: no
KOKKOS_ENABLE_PRAGMA_LOOPCOUNT: no
KOKKOS_ENABLE_PRAGMA_UNROLL: no
KOKKOS_ENABLE_PRAGMA_VECTOR: no
KOKKOS_ENABLE_SERIAL: yes
Kokkos: Cuda[ 0 ] NVIDIA RTX A2000 12GB capability 8.6, Total Global Memory: 11.75 G, Shared Memory per Block: 48 K : Selected
Kokkos Version: 4.1.99
macro  KOKKOS_ENABLE_CUDA: defined
platform: 64bit
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
----------------------------------------------------------------------------------------------
Benchmark                                    Time             CPU   Iterations UserCounters...
----------------------------------------------------------------------------------------------
Malloc/LOG2:0/manual_time                0.007 ms        0.015 ms        96958 FOM: GB/s=0/s MB=0
Malloc/LOG2:4/manual_time                0.005 ms        0.012 ms       123028 FOM: GB/s=0/s MB=0
Malloc/LOG2:8/manual_time                0.006 ms        0.012 ms       127470 FOM: GB/s=0/s MB=0
Malloc/LOG2:12/manual_time               0.005 ms        0.012 ms       127270 FOM: GB/s=0/s MB=0
Malloc/LOG2:16/manual_time               0.005 ms        0.012 ms       128348 FOM: GB/s=0/s MB=0
Malloc/LOG2:20/manual_time               0.006 ms        0.012 ms       128340 FOM: GB/s=180.604/s MB=1
Malloc/LOG2:24/manual_time               0.430 ms        0.469 ms         1518 FOM: GB/s=37.2177/s MB=16
Malloc/LOG2:28/manual_time                3.67 ms         4.35 ms          353 FOM: GB/s=72.9692/s MB=268
Malloc/LOG2:32/manual_time                99.3 ms         8.14 ms            8 FOM: GB/s=43.2517/s MB=4.294k
MallocFree/LOG2:0/manual_time            0.013 ms        0.013 ms        48081 FOM: GB/s=0/s MB=0
MallocFree/LOG2:4/manual_time            0.013 ms        0.013 ms        63699 FOM: GB/s=0/s MB=0
MallocFree/LOG2:8/manual_time            0.015 ms        0.015 ms        48123 FOM: GB/s=0/s MB=0
MallocFree/LOG2:12/manual_time           0.012 ms        0.012 ms        58934 FOM: GB/s=0/s MB=0
MallocFree/LOG2:16/manual_time           0.012 ms        0.012 ms        59990 FOM: GB/s=0/s MB=0
MallocFree/LOG2:20/manual_time           0.012 ms        0.012 ms        60445 FOM: GB/s=84.8584/s MB=1
MallocFree/LOG2:24/manual_time           0.469 ms        0.469 ms         1371 FOM: GB/s=34.1096/s MB=16
MallocFree/LOG2:28/manual_time            5.78 ms         4.27 ms          166 FOM: GB/s=46.3931/s MB=268
MallocFree/LOG2:32/manual_time             100 ms         7.96 ms            8 FOM: GB/s=42.7488/s MB=4.294k
MallocTouch/LOG2:0/manual_time           0.006 ms        0.012 ms       103922 FOM: GB/s=0/s MB=0
MallocTouch/LOG2:4/manual_time           0.006 ms        0.012 ms       121728 FOM: GB/s=0/s MB=0
MallocTouch/LOG2:8/manual_time           0.006 ms        0.013 ms       111126 FOM: GB/s=0/s MB=0
MallocTouch/LOG2:12/manual_time          0.012 ms        0.018 ms        60395 FOM: GB/s=0/s MB=0
MallocTouch/LOG2:16/manual_time          0.012 ms        0.018 ms        59416 FOM: GB/s=0/s MB=0
MallocTouch/LOG2:20/manual_time          0.012 ms        0.018 ms        58867 FOM: GB/s=83.8676/s MB=1
MallocTouch/LOG2:24/manual_time          0.441 ms        0.481 ms         1493 FOM: GB/s=36.2758/s MB=16
MallocTouch/LOG2:28/manual_time           3.79 ms         4.36 ms          335 FOM: GB/s=70.6594/s MB=268
MallocTouch/LOG2:32/manual_time           87.8 ms         8.97 ms            7 FOM: GB/s=48.8848/s MB=4.294k
MallocTouchFree/LOG2:0/manual_time       0.015 ms        0.016 ms        45183 FOM: GB/s=0/s MB=0
MallocTouchFree/LOG2:4/manual_time       0.015 ms        0.015 ms        45311 FOM: GB/s=0/s MB=0
MallocTouchFree/LOG2:8/manual_time       0.014 ms        0.014 ms        45324 FOM: GB/s=0/s MB=0
MallocTouchFree/LOG2:12/manual_time      0.018 ms        0.018 ms        39143 FOM: GB/s=0/s MB=0
MallocTouchFree/LOG2:16/manual_time      0.018 ms        0.018 ms        38604 FOM: GB/s=0/s MB=0
MallocTouchFree/LOG2:20/manual_time      0.018 ms        0.018 ms        38227 FOM: GB/s=55.0325/s MB=1
MallocTouchFree/LOG2:24/manual_time      0.535 ms        0.535 ms          957 FOM: GB/s=29.8856/s MB=16
MallocTouchFree/LOG2:28/manual_time       5.73 ms         4.30 ms          166 FOM: GB/s=46.7523/s MB=268
MallocTouchFree/LOG2:32/manual_time        112 ms         9.09 ms            9 FOM: GB/s=38.4532/s MB=4.294k

@simongdg

@cwpearson cwpearson requested a review from crtrott August 23, 2023 19:22
core/perf_test/PerfTest_MallocFree.cpp Outdated Show resolved Hide resolved
Member

@crtrott crtrott left a comment

Generally agree with Damien's comments. Also, let's change the FOM to a simple rate, i.e., the inverse of the time per try.
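
As a small worked illustration of that figure of merit (a sketch only; `rate_per_second` is a hypothetical helper, not something from the PR): a try that takes 6.84 us corresponds to roughly 146k tries per second.

```cpp
// Rate figure of merit: tries per second = 1 / (seconds per try).
// Hypothetical helper for illustration; the caller must pass a positive time.
inline double rate_per_second(double seconds_per_try) {
  return 1.0 / seconds_per_try;
}

// Example: rate_per_second(6.84e-6) is about 1.462e5, i.e. roughly 146k/s.
```

Unlike a GB/s figure, this stays meaningful for tiny allocations, where the byte count rounds to zero and the cost is dominated by the fixed overhead of the call.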

@cwpearson
Contributor Author

$ core/perf_test/Kokkos_PerformanceTest_Benchmark --benchmark_filter="Malloc"
Kokkos::Cuda::initialize WARNING: running kernels compiled for compute capability 8.0 on device with compute capability 8.6 , this will likely reduce potential performance.
2023-08-24T16:56:07+00:00
Running core/perf_test/Kokkos_PerformanceTest_Benchmark
Run on (56 X 4600 MHz CPU s)
CPU Caches:
  L1 Data 48 KiB (x28)
  L1 Instruction 32 KiB (x28)
  L2 Unified 2048 KiB (x28)
  L3 Unified 76800 KiB (x1)
Load Average: 0.42, 0.41, 0.45
CPU architecture: none
Default Device: N6Kokkos4CudaE
GIT_BRANCH: perf_test/malloc-free
GIT_CLEAN_STATUS: DIRTY
GIT_COMMIT_DATE: %cI
GIT_COMMIT_DESCRIPTION: let benchmark library handle argument scaling
GIT_COMMIT_HASH: 553e008
GPU architecture: AMPERE80
KOKKOS_COMPILER_GNU: 1220
KOKKOS_COMPILER_NVCC: 1220
KOKKOS_ENABLE_ASM: yes
KOKKOS_ENABLE_CUDA: yes
KOKKOS_ENABLE_CUDA_LAMBDA: yes
KOKKOS_ENABLE_CUDA_LDG_INTRINSIC: yes
KOKKOS_ENABLE_CUDA_RELOCATABLE_DEVICE_CODE: no
KOKKOS_ENABLE_CUDA_UVM: no
KOKKOS_ENABLE_CXX11_DISPATCH_LAMBDA: yes
KOKKOS_ENABLE_CXX17: yes
KOKKOS_ENABLE_CXX20: no
KOKKOS_ENABLE_CXX23: no
KOKKOS_ENABLE_DEBUG_BOUNDS_CHECK: no
KOKKOS_ENABLE_HBWSPACE: no
KOKKOS_ENABLE_HWLOC: no
KOKKOS_ENABLE_INTEL_MM_ALLOC: no
KOKKOS_ENABLE_LIBDL: yes
KOKKOS_ENABLE_LIBRT: no
KOKKOS_ENABLE_PRAGMA_IVDEP: no
KOKKOS_ENABLE_PRAGMA_LOOPCOUNT: no
KOKKOS_ENABLE_PRAGMA_UNROLL: no
KOKKOS_ENABLE_PRAGMA_VECTOR: no
KOKKOS_ENABLE_SERIAL: yes
Kokkos: Cuda[ 0 ] NVIDIA RTX A2000 12GB capability 8.6, Total Global Memory: 11.75 G, Shared Memory per Block: 48 K : Selected
Kokkos Version: 4.1.99
macro  KOKKOS_ENABLE_CUDA: defined
platform: 64bit
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
---------------------------------------------------------------------------------------------------
Benchmark                                         Time             CPU   Iterations UserCounters...
---------------------------------------------------------------------------------------------------
Malloc/N:1/manual_time                         6.84 us         14.9 us        97480 FOM: rate=146.219k/s
Malloc/N:16/manual_time                        5.60 us         12.0 us       114327 FOM: rate=178.594k/s
Malloc/N:256/manual_time                       5.48 us         11.7 us       127607 FOM: rate=182.346k/s
Malloc/N:4096/manual_time                      5.47 us         11.7 us       128791 FOM: rate=182.662k/s
Malloc/N:65536/manual_time                     5.48 us         11.7 us       126448 FOM: rate=182.637k/s
Malloc/N:1048576/manual_time                   5.48 us         11.7 us       123283 FOM: rate=182.327k/s
Malloc/N:16777216/manual_time                   426 us          465 us         1502 FOM: rate=2.34783k/s
Malloc/N:268435456/manual_time                 4831 us         2285 us          269 FOM: rate=206.977/s
Malloc/N:4294967296/manual_time              108286 us         8340 us            9 FOM: rate=9.23482/s
MallocFree/N:1/manual_time                     14.3 us         14.4 us        48526 FOM: rate=69.7024k/s
MallocFree/N:16/manual_time                    14.3 us         14.3 us        48883 FOM: rate=69.8233k/s
MallocFree/N:256/manual_time                   12.8 us         12.8 us        48437 FOM: rate=78.2881k/s
MallocFree/N:4096/manual_time                  12.5 us         12.5 us        56683 FOM: rate=80.183k/s
MallocFree/N:65536/manual_time                 11.9 us         11.9 us        53448 FOM: rate=83.9025k/s
MallocFree/N:1048576/manual_time               11.6 us         11.7 us        60146 FOM: rate=85.9233k/s
MallocFree/N:16777216/manual_time               524 us          523 us          982 FOM: rate=1.90995k/s
MallocFree/N:268435456/manual_time             6075 us         2339 us          184 FOM: rate=164.622/s
MallocFree/N:4294967296/manual_time          100821 us         7628 us            8 FOM: rate=9.91854/s
MallocTouch/N:1/manual_time                    7.33 us         15.4 us        94995 FOM: rate=136.428k/s
MallocTouch/N:16/manual_time                   6.69 us         13.6 us        96005 FOM: rate=149.401k/s
MallocTouch/N:256/manual_time                  6.26 us         12.5 us       112387 FOM: rate=159.704k/s
MallocTouch/N:4096/manual_time                 11.6 us         17.9 us        60643 FOM: rate=86.2988k/s
MallocTouch/N:65536/manual_time                11.7 us         18.0 us        59855 FOM: rate=85.266k/s
MallocTouch/N:1048576/manual_time              11.9 us         18.2 us        58581 FOM: rate=84.3525k/s
MallocTouch/N:16777216/manual_time              441 us          481 us         1508 FOM: rate=2.26732k/s
MallocTouch/N:268435456/manual_time            4968 us         2487 us          255 FOM: rate=201.28/s
MallocTouch/N:4294967296/manual_time         112439 us         9245 us            9 FOM: rate=8.89369/s
MallocTouchFree/N:1/manual_time                11.5 us         11.5 us        50779 FOM: rate=86.8893k/s
MallocTouchFree/N:16/manual_time               11.3 us         11.3 us        62298 FOM: rate=88.3836k/s
MallocTouchFree/N:256/manual_time              12.2 us         12.2 us        62236 FOM: rate=81.9834k/s
MallocTouchFree/N:4096/manual_time             17.9 us         17.9 us        39333 FOM: rate=55.9081k/s
MallocTouchFree/N:65536/manual_time            18.1 us         18.1 us        38843 FOM: rate=55.3629k/s
MallocTouchFree/N:1048576/manual_time          18.2 us         18.2 us        38457 FOM: rate=55.037k/s
MallocTouchFree/N:16777216/manual_time          481 us          481 us         1356 FOM: rate=2.07821k/s
MallocTouchFree/N:268435456/manual_time        6040 us         2405 us          177 FOM: rate=165.568/s
MallocTouchFree/N:4294967296/manual_time     103759 us         9323 us            8 FOM: rate=9.63773/s
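
The N values in this run are the same sizes as in the first run, with N = 2^LOG2 bytes (for example, LOG2:24 is N:16777216). They step by a factor of 16 from 1 to 2^32. How the test asks the benchmark library for this range is not shown in the conversation, so the following only reproduces the sequence itself; `scaled_sizes` is a hypothetical helper.

```cpp
#include <cstdint>
#include <vector>

// Geometric sequence lo, lo*mult, lo*mult^2, ... that does not exceed hi.
// Hypothetical helper; returns an empty vector for invalid arguments.
inline std::vector<std::uint64_t> scaled_sizes(std::uint64_t lo,
                                               std::uint64_t hi,
                                               std::uint64_t mult) {
  std::vector<std::uint64_t> out;
  if (lo == 0 || lo > hi || mult < 2) return out;
  for (std::uint64_t n = lo;; n *= mult) {
    out.push_back(n);
    if (n > hi / mult) break;  // the next step would exceed hi (or overflow)
  }
  return out;
}

// scaled_sizes(1, 1ull << 32, 16) yields the nine sizes seen above:
// 1, 16, 256, 4096, 65536, 1048576, 16777216, 268435456, 4294967296.
```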

Member

@dalg24 dalg24 left a comment

Looks good after you move the stride definition next to the parallel_for as Daniel suggested.

@cwpearson
Contributor Author

I made the change and rebased down to two commits.

@dalg24
Member

dalg24 commented Aug 29, 2023

Jenkins failures are unrelated (toomanyrequests: You have reached your pull rate limit).

@dalg24 dalg24 merged commit 43d3e53 into kokkos:develop Aug 29, 2023
27 of 28 checks passed
@cwpearson cwpearson mentioned this pull request Aug 29, 2023