Performance benchmarks for Julia on GPUs using the CUDA.jl and AMDGPU.jl software stacks.
Effective memory throughput T_tot, measured in GB/s, for:
- the triad 2D kernel
    A[ix,iy] = B[ix,iy] + s*C[ix,iy]
- the triad 2D kernel with power (Int)
    A[ix,iy] = B[ix,iy] + s*C[ix,iy]^pow_int
- the triad 2D kernel with power (Float)
    A[ix,iy] = B[ix,iy] + s*C[ix,iy]^pow_float
- the diffusion 2D kernel
    T2[ix,iy] = T[ix,iy] + dt*(Ci[ix,iy]*(
                  - ((-lam*(T[ix+1,iy] - T[ix,iy])*_dx) - (-lam*(T[ix,iy] - T[ix-1,iy])*_dx))*_dx
                  - ((-lam*(T[ix,iy+1] - T[ix,iy])*_dy) - (-lam*(T[ix,iy] - T[ix,iy-1])*_dy))*_dy ))
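For reference, the triad and diffusion updates listed above can be sketched on the CPU with plain Julia arrays, together with one plausible definition of the effective throughput. This is a minimal sketch and not the benchmark code itself: the names `triad2D!`, `diffusion2D!` and `T_eff` are hypothetical, and counting three full-array accesses per iteration (two reads, one write) in T_tot is an assumption.

```julia
# CPU reference sketch of two of the kernels (hypothetical helper names).

# Triad: A = B + s*C, element-wise over the full 2D array.
function triad2D!(A, B, C, s)
    for iy in axes(A, 2), ix in axes(A, 1)
        A[ix, iy] = B[ix, iy] + s * C[ix, iy]
    end
    return A
end

# Diffusion step on interior points only; _dx and _dy are the inverse grid spacings.
function diffusion2D!(T2, T, Ci, lam, dt, _dx, _dy)
    nx, ny = size(T)
    for iy in 2:ny-1, ix in 2:nx-1
        T2[ix, iy] = T[ix, iy] + dt * (Ci[ix, iy] * (
            - ((-lam * (T[ix+1, iy] - T[ix, iy]) * _dx) - (-lam * (T[ix, iy] - T[ix-1, iy]) * _dx)) * _dx
            - ((-lam * (T[ix, iy+1] - T[ix, iy]) * _dy) - (-lam * (T[ix, iy] - T[ix, iy-1]) * _dy)) * _dy))
    end
    return T2
end

# Assumed throughput definition: n_arrays full-array memory accesses per iteration,
# divided by the time per iteration t_it (in seconds), reported in GB/s.
T_eff(nx, ny, DAT, t_it; n_arrays=3) = n_arrays * nx * ny * sizeof(DAT) / t_it / 1e9
```

Under this assumed definition the metric counts only the memory traffic a kernel cannot avoid, so it can be compared directly with the peak memory bandwidth of the device.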
(JuliaGPUPerf) pkg> st
Status
[21141c5a] AMDGPU v0.2.17 `https://github.com/JuliaGPU/AMDGPU.jl.git#jps/julia-1.7`
[6e4b80f9] BenchmarkTools v1.2.2
[052768ef] CUDA v3.8.0
Hardware:
- Nvidia A100 SXM4 40GB
- Nvidia V100 SXM2 32GB
- Nvidia Titan Xm PCIe3.0 12GB
- AMD Vega 20 gfx906 - Ault
- AMD Vega 20 gfx906 - Satori
The codes are run as julia --project -O3 --check-bounds=no [amd/cuda]_bench.jl
Reported for single precision Float32:
nx, ny, DAT = 65536, 32768, Float32
T_tot triad2D = 1301.714 GB/s
T_tot triad2D pow_int = 1287.426 GB/s
T_tot triad2D pow_float = 874.5707 GB/s
T_tot diffusion 2D = 1293.076 GB/s
And for double precision Float64:
nx, ny, DAT = 32768, 32768, Float64
T_tot triad2D = 1358.021 GB/s
T_tot triad2D pow_int = 1356.478 GB/s
T_tot triad2D pow_float = 1020.444 GB/s
T_tot diffusion 2D = 1362.546 GB/s
Single precision execution performs at ~95-96% of double precision, with the exception of the Float power kernel, which performs at ~86%.
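The percentages quoted above follow directly from the reported throughputs. A minimal sketch of the arithmetic, using the A100 numbers listed above (the variable names are illustrative):

```julia
# Reported A100 throughputs in GB/s, copied from the results above.
kernels = ["triad2D", "triad2D pow_int", "triad2D pow_float", "diffusion 2D"]
f32 = [1301.714, 1287.426, 874.5707, 1293.076]
f64 = [1358.021, 1356.478, 1020.444, 1362.546]

# Single precision throughput as a percentage of double precision.
ratio = 100 .* f32 ./ f64

for (k, r) in zip(kernels, ratio)
    println(rpad(k, 18), " ", round(r; digits=1), " %")
end
```

The same computation applies to the results reported for the other devices.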
- Hardware: running on an Nvidia A100 SXM4:
julia> CUDA.versioninfo()
CUDA toolkit 11.4, local installation
NVIDIA driver 470.82.1, for CUDA 11.4
CUDA driver 11.6
Toolchain:
- Julia: 1.7.0
- LLVM: 12.0.1
Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false
8 devices:
0: NVIDIA A100-SXM4-40GB (sm_80, 15.127 GiB / 39.586 GiB available)
Using the cuda_c_bench.cu script. Results reported for single precision Float32:
65536x32768, 8.000 GB, 100 iterations.
launching (2048x4096) grid of (32x8) blocks.
Performance triad2D: 1.754 seconds, 1231.145 GB/s
Performance triad2D_pow_int: 1.978 seconds, 1091.983 GB/s (using fpow)
Performance triad2D_pow_float: 1.974 seconds, 1094.501 GB/s (using fpow)
Performance diff2D_step: 1.772 seconds, 1219.103 GB/s
And for double precision Float64:
32768x32768, 8.000 GB, 100 iterations.
launching (1024x4096) grid of (32x8) blocks.
Performance triad2D: 1.691 seconds, 1277.339 GB/s
Performance triad2D_pow_int: 2.027 seconds, 1065.666 GB/s (using pow)
Performance triad2D_pow_float: 2.058 seconds, 1049.685 GB/s (using pow)
Performance diff2D_step: 1.686 seconds, 1280.941 GB/s
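The launch lines above are consistent with one thread per array element. A small sketch of that relation, assuming the reported "8.000 GB" refers to the size of a single array counted in units of 2^30 bytes (the helper names `grid` and `array_GiB` are hypothetical):

```julia
# Number of blocks needed to cover an nx-by-ny array with (32x8)-thread blocks,
# using ceiling division so that partial blocks are included.
grid(nx, ny; threads=(32, 8)) = (cld(nx, threads[1]), cld(ny, threads[2]))

# Memory footprint of one nx-by-ny array in units of 2^30 bytes.
array_GiB(nx, ny, DAT) = nx * ny * sizeof(DAT) / 2^30

println(grid(65536, 32768), "  ", array_GiB(65536, 32768, Float32))
println(grid(32768, 32768), "  ", array_GiB(32768, 32768, Float64))
```

Halving nx for Float64 keeps the per-array footprint identical between the two precisions, so both runs move the same number of bytes.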
Reported for single precision Float32:
nx, ny, DAT = 65536, 32768, Float32
T_tot triad2D = 718.3771 GB/s
T_tot triad2D pow_int = 688.5375 GB/s
T_tot triad2D pow_float = 548.9982 GB/s
T_tot diffusion 2D = 641.0068 GB/s
And for double precision Float64:
nx, ny, DAT = 32768, 32768, Float64
T_tot triad2D = 803.2403 GB/s
T_tot triad2D pow_int = 789.4891 GB/s
T_tot triad2D pow_float = 775.9522 GB/s
T_tot diffusion 2D = 736.5969 GB/s
Single precision execution performs at 87-89% of double precision, with the exception of the Float power kernel, which performs at ~71%.
- Hardware: running on an Nvidia V100 SXM2:
julia> CUDA.versioninfo()
CUDA toolkit 11.4, local installation
NVIDIA driver 470.42.1, for CUDA 11.4
CUDA driver 11.4
Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false
8 devices:
0: Tesla V100-SXM2-32GB (sm_70, 7.408 GiB / 31.749 GiB available)
Reported for single precision Float32:
nx, ny, DAT = 32768, 16384, Float32
T_tot triad2D = 248.12 GB/s
T_tot triad2D pow_int = 241.5461 GB/s
T_tot triad2D pow_float = 171.4937 GB/s
T_tot diffusion 2D = 226.185 GB/s
And for double precision Float64:
nx, ny, DAT = 16384, 16384, Float64
T_tot triad2D = 250.0162 GB/s
T_tot triad2D pow_int = 251.8233 GB/s
T_tot triad2D pow_float = 31.56365 GB/s
T_tot diffusion 2D = 163.9904 GB/s
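Single and double precision perform on par for the triad and Int power kernels (within about 4%). Single precision outperforms double precision for the diffusion kernel and especially for the Float power (171 GB/s versus 32 GB/s).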
- Hardware: running on an Nvidia Titan Xm:
CUDA toolkit 11.4, local installation
NVIDIA driver 470.42.1, for CUDA 11.4
CUDA driver 11.4
Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false
8 devices:
0: NVIDIA GeForce GTX TITAN X (sm_52, 5.819 GiB / 11.927 GiB available)
Reported for single precision Float32:
nx, ny, DAT = 49152, 24576, Float32
T_tot triad2D = 577.557 GB/s
T_tot triad2D pow_int = 242.3805 GB/s
T_tot triad2D pow_float = 240.4102 GB/s
T_tot diffusion 2D = 504.6102 GB/s
And for double precision Float64:
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D = 728.9446 GB/s
T_tot triad2D pow_int = 721.0397 GB/s
T_tot triad2D pow_float = 275.0624 GB/s
T_tot diffusion 2D = 648.548 GB/s
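Single precision execution performs at 77-79% of double precision for the triad and diffusion kernels; the Float power performs at ~87% and the Int power at only ~34%.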
- Hardware: running on an AMD Vega 20:
julia> AMDGPU.versioninfo()
HSA Runtime (ready)
- Version: 1.1.0
- Initialized: true
ld.lld (ready)
- Path: /apps/ault/spack/opt/spack/linux-centos8-zen/gcc-8.4.1/llvm-amdgpu-4.2.0-rsmtqpi3nz4w2vj5qnvrghl5uyip5iy4/bin/ld.lld
ROCm-Device-Libs (ready)
- Downloaded: true
HIP Runtime (ready)
rocBLAS (MISSING)
rocSOLVER (MISSING)
rocFFT (MISSING)
rocRAND (MISSING)
rocSPARSE (MISSING)
rocALUTION (MISSING)
MIOpen (MISSING)
HSA Agents (2):
- CPU: AMD EPYC 7742 64-Core Processor
- GPU: Vega 20 WKS GL-XE [Radeon Pro VII] (gfx906)
Reported for single precision Float32:
nx, ny, DAT = 49152, 24576, Float32
T_tot triad2D = 701.3548 GB/s
T_tot triad2D pow_int = 244.6597 GB/s
T_tot triad2D pow_float = 242.5716 GB/s
T_tot diffusion 2D = 559.6188 GB/s
And for double precision Float64:
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D = 772.3414 GB/s
T_tot triad2D pow_int = 760.2888 GB/s
T_tot triad2D pow_float = 278.3227 GB/s
T_tot diffusion 2D = 722.7216 GB/s
Single precision execution performs at 77-91% of double precision, with the exception of the Int power kernel, which performs at only ~32%.
- Hardware: running on an AMD Vega 20:
HSA Runtime (ready)
- Version: 1.1.0
- Initialized: true
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Downloaded: true
HIP Runtime (ready)
rocBLAS (ready)
rocSOLVER (ready)
rocFFT (ready)
rocRAND (ready)
rocSPARSE (ready)
rocALUTION (ready)
MIOpen (ready)
HSA Agents (2):
- GPU: Vega 20 (gfx906)
- CPU: AMD EPYC 7642 48-Core Processor
Results reported for node gpu36-002 (8 GPU devices) and for double precision Float64
Device: 1
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D = 704.6837 GB/s
T_tot triad2D pow_int = 702.3983 GB/s
T_tot triad2D pow_float = 276.64 GB/s
T_tot diffusion 2D = 639.9492 GB/s
Device: 2
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D = 708.6794 GB/s
T_tot triad2D pow_int = 703.3594 GB/s
T_tot triad2D pow_float = 276.4491 GB/s
T_tot diffusion 2D = 637.9515 GB/s
Device: 3
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D = 698.3451 GB/s
T_tot triad2D pow_int = 689.2274 GB/s
T_tot triad2D pow_float = 275.7074 GB/s
T_tot diffusion 2D = 628.2715 GB/s
Device: 4
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D = 694.4658 GB/s
T_tot triad2D pow_int = 685.5483 GB/s
T_tot triad2D pow_float = 271.552 GB/s
T_tot diffusion 2D = 627.0023 GB/s
Device: 5
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D = 708.423 GB/s
T_tot triad2D pow_int = 696.6548 GB/s
T_tot triad2D pow_float = 273.7445 GB/s
T_tot diffusion 2D = 636.6477 GB/s
Device: 6
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D = 687.6798 GB/s
T_tot triad2D pow_int = 681.9737 GB/s
T_tot triad2D pow_float = 273.1908 GB/s
T_tot diffusion 2D = 622.9047 GB/s
Device: 7
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D = 692.7241 GB/s
T_tot triad2D pow_int = 682.6617 GB/s
T_tot triad2D pow_float = 274.2336 GB/s
T_tot diffusion 2D = 622.4565 GB/s
Device: 8
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D = 695.2322 GB/s
T_tot triad2D pow_int = 687.7494 GB/s
T_tot triad2D pow_float = 272.0593 GB/s
T_tot diffusion 2D = 626.1292 GB/s
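The device-to-device variation on this node can be summarised from the numbers above. A minimal sketch for the triad2D kernel (the variable names are illustrative):

```julia
# triad2D throughput in GB/s for devices 1 to 8, copied from the results above.
triad = [704.6837, 708.6794, 698.3451, 694.4658, 708.423, 687.6798, 692.7241, 695.2322]

# Mean throughput and max-min range expressed in percent of the mean.
avg = sum(triad) / length(triad)
spread = 100 * (maximum(triad) - minimum(triad)) / avg

println("mean = ", round(avg; digits=1), " GB/s, spread = ", round(spread; digits=1), " %")
```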
Details:
julia> AMDGPU.versioninfo()
HSA Runtime (ready)
- Path: /opt/rocm/hsa/lib/libhsa-runtime64.so
- Version: 1.1.0
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Path: /opt/rocm/amdgcn/bitcode
- Downloaded: true
HIP Runtime (ready)
- Path: /opt/rocm/lib/libamdhip64.so
rocBLAS (ready)
- Path: /opt/rocm/lib/librocblas.so
rocSOLVER (ready)
- Path: /opt/rocm/lib/librocsolver.so
rocALUTION (MISSING)
rocSPARSE (ready)
- Path: /opt/rocm/lib/librocsparse.so
rocRAND (ready)
- Path: /opt/rocm/lib/librocrand.so
rocFFT (ready)
- Path: /opt/rocm/lib/librocfft.so
MIOpen (ready)
- Path: /opt/rocm/lib/libMIOpen.so
HSA Agents (10):
- GPU-938470a172da5ee3 [Vega 20 (gfx906)]
- GPU-2d1060a172da5f17 [Vega 20 (gfx906)]
- GPU-d918788172da5f19 [Vega 20 (gfx906)]
- GPU-eaae68e172da5eb6 [Vega 20 (gfx906)]
- CPU-XX [AMD EPYC 7452 32-Core Processor]
- CPU-XX [AMD EPYC 7452 32-Core Processor]
- GPU-b28a58c172da5ee1 [Vega 20 (gfx906)]
- GPU-218c606172fd62db [Vega 20 (gfx906)]
- GPU-2182408172fd62db [Vega 20 (gfx906)]
- GPU-a5ac38e172dc768b [Vega 20 (gfx906)]