Julia GPU Perf

Performance benchmarks for Julia on GPUs using the CUDA.jl and AMDGPU.jl software stacks.

Benchmarks

Effective memory throughput T_tot, measured in GB/s, for:

the triad 2D kernel

A[ix,iy] = B[ix,iy] + s*C[ix,iy]

the triad 2D kernel with power (Int)

A[ix,iy] = B[ix,iy] + s*C[ix,iy]^pow_int

the triad 2D kernel with power (Float)

A[ix,iy] = B[ix,iy] + s*C[ix,iy]^pow_float

the diffusion 2D kernel

T2[ix,iy] = T[ix,iy] + dt*(Ci[ix,iy]*(
            - ((-lam*(T[ix+1,iy] - T[ix,iy])*_dx) - (-lam*(T[ix,iy] - T[ix-1,iy])*_dx))*_dx
            - ((-lam*(T[ix,iy+1] - T[ix,iy])*_dy) - (-lam*(T[ix,iy] - T[ix,iy-1])*_dy))*_dy ))
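As a rough illustration of how an effective throughput figure such as T_tot can be obtained, the sketch below runs the triad 2D kernel on the CPU and divides the bytes that have to be moved per iteration by the measured time. It is a CPU stand-in, not the GPU code from the benchmark scripts. The names `triad2D!`, `effective_throughput` and `n_arrays` are hypothetical, and counting three array accesses per grid point (read B, read C, write A) is an assumption about the metric.

```julia
# CPU reference for the triad 2D kernel: A = B + s*C, element by element.
function triad2D!(A, B, C, s)
    nx, ny = size(A)
    for iy in 1:ny, ix in 1:nx   # ix is the inner loop: Julia arrays are column-major
        @inbounds A[ix, iy] = B[ix, iy] + s * C[ix, iy]
    end
    return A
end

# Effective throughput in GB/s: bytes moved per iteration divided by time.
# n_arrays = 3 is an assumption (read B, read C, write A).
effective_throughput(nx, ny, DAT, t_it; n_arrays=3) =
    n_arrays * nx * ny * sizeof(DAT) / t_it / 1e9

nx, ny, DAT = 1024, 512, Float32      # small illustrative size, not the benchmark size
A = zeros(DAT, nx, ny)
B = rand(DAT, nx, ny)
C = rand(DAT, nx, ny)
s = DAT(2.0)

triad2D!(A, B, C, s)                  # warm-up call, so compilation is not timed
t_it = @elapsed triad2D!(A, B, C, s)  # time of one iteration in seconds
T_tot = effective_throughput(nx, ny, DAT, t_it)
println("T_tot triad2D = ", round(T_tot, sigdigits=7), " GB/s")
```

A single `@elapsed` sample is noisy; a package such as BenchmarkTools, which is listed under Packages, would give a more robust timing.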

Packages

(JuliaGPUPerf) pkg> st
      Status
  [21141c5a] AMDGPU v0.2.17 `https://github.com/JuliaGPU/AMDGPU.jl.git#jps/julia-1.7`
  [6e4b80f9] BenchmarkTools v1.2.2
  [052768ef] CUDA v3.8.0

Tests

Hardware:

Nvidia A100 SXM4 40GB
Nvidia V100 SXM2 32GB
Nvidia Titan Xm PCIe3.0 12GB
AMD Vega 20 gfx906 - Ault
AMD Vega 20 gfx906 - Satori

The codes are run as julia --project -O3 --check-bounds=no [amd/cuda]_bench.jl, i.e. with either amd_bench.jl or cuda_bench.jl.

Nvidia A100 SXM4 40GB

Reported for single precision Float32:

nx, ny, DAT = 65536, 32768, Float32
T_tot triad2D           = 1301.714 GB/s
T_tot triad2D pow_int   = 1287.426 GB/s
T_tot triad2D pow_float = 874.5707 GB/s
T_tot diffusion 2D      = 1293.076 GB/s

And for double precision Float64:

nx, ny, DAT = 32768, 32768, Float64
T_tot triad2D           = 1358.021 GB/s
T_tot triad2D pow_int   = 1356.478 GB/s
T_tot triad2D pow_float = 1020.444 GB/s
T_tot diffusion 2D      = 1362.546 GB/s

Single precision execution performs at ~95-96% of double precision, with the exception of the Float power kernel, which performs at ~86%.
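The single-to-double precision ratios can be recomputed from the reported T_tot values. The following is a minimal sketch using the A100 numbers; the variable names are chosen for illustration only.

```julia
# Measured T_tot in GB/s on the A100, copied from the results above.
kernels = ["triad2D", "triad2D pow_int", "triad2D pow_float", "diffusion 2D"]
f32 = [1301.714, 1287.426, 874.5707, 1293.076]
f64 = [1358.021, 1356.478, 1020.444, 1362.546]

# Single precision throughput as a percentage of double precision.
ratio_percent = 100 .* f32 ./ f64

for (name, r) in zip(kernels, ratio_percent)
    println(rpad(name, 18), round(r, digits=1), " %")
end
```

The same snippet applies to the other devices by swapping in their T_tot values.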

Hardware: running on an Nvidia A100 SXM4:

julia> CUDA.versioninfo()
CUDA toolkit 11.4, local installation
NVIDIA driver 470.82.1, for CUDA 11.4
CUDA driver 11.6

Toolchain:
- Julia: 1.7.0
- LLVM: 12.0.1

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

8 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 15.127 GiB / 39.586 GiB available)

CUDA C for comparison

Results obtained with the cuda_c_bench.cu script, reported for single precision Float32:

65536x32768, 8.000 GB, 100 iterations.
launching (2048x4096) grid of (32x8) blocks.
Performance triad2D:           1.754 seconds, 1231.145 GB/s
Performance triad2D_pow_int:   1.978 seconds, 1091.983 GB/s (using fpow)
Performance triad2D_pow_float: 1.974 seconds, 1094.501 GB/s (using fpow)
Performance diff2D_step:       1.772 seconds, 1219.103 GB/s

And for double precision Float64:

32768x32768, 8.000 GB, 100 iterations.
launching (1024x4096) grid of (32x8) blocks.
Performance triad2D:           1.691 seconds, 1277.339 GB/s
Performance triad2D_pow_int:   2.027 seconds, 1065.666 GB/s (using pow)
Performance triad2D_pow_float: 2.058 seconds, 1049.685 GB/s (using pow)
Performance diff2D_step:       1.686 seconds, 1280.941 GB/s
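To put the Julia and CUDA C numbers on the A100 side by side, one can take the ratio of the reported throughputs per kernel. The sketch below does this for Float32; the field names are illustrative. The power kernels are not necessarily like-for-like, since the CUDA C version is reported as using fpow.

```julia
# T_tot in GB/s on the A100 (Float32), copied from the results above.
julia_f32  = (triad2D=1301.714, pow_int=1287.426, pow_float=874.5707, diffusion2D=1293.076)
cuda_c_f32 = (triad2D=1231.145, pow_int=1091.983, pow_float=1094.501, diffusion2D=1219.103)

# A ratio above 1 means the Julia kernel reports a higher throughput than CUDA C.
for k in keys(julia_f32)
    ratio = julia_f32[k] / cuda_c_f32[k]
    println(rpad(string(k), 12), round(ratio, digits=2))
end
```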

Nvidia V100 SXM2 32GB

Reported for single precision Float32:

nx, ny, DAT = 65536, 32768, Float32
T_tot triad2D           = 718.3771 GB/s
T_tot triad2D pow_int   = 688.5375 GB/s
T_tot triad2D pow_float = 548.9982 GB/s
T_tot diffusion 2D      = 641.0068 GB/s

And for double precision Float64:

nx, ny, DAT = 32768, 32768, Float64
T_tot triad2D           = 803.2403 GB/s
T_tot triad2D pow_int   = 789.4891 GB/s
T_tot triad2D pow_float = 775.9522 GB/s
T_tot diffusion 2D      = 736.5969 GB/s

Single precision execution performs at 87-89% of double precision, with the exception of the Float power kernel, which performs at ~71%.

Hardware: running on an Nvidia V100 SXM2:

julia> CUDA.versioninfo()
CUDA toolkit 11.4, local installation
NVIDIA driver 470.42.1, for CUDA 11.4
CUDA driver 11.4

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

8 devices:
  0: Tesla V100-SXM2-32GB (sm_70, 7.408 GiB / 31.749 GiB available)

Nvidia Titan Xm PCIe3.0 12GB

Reported for single precision Float32:

nx, ny, DAT = 32768, 16384, Float32
T_tot triad2D           = 248.12 GB/s
T_tot triad2D pow_int   = 241.5461 GB/s
T_tot triad2D pow_float = 171.4937 GB/s
T_tot diffusion 2D      = 226.185 GB/s

And for double precision Float64:

nx, ny, DAT = 16384, 16384, Float64
T_tot triad2D           = 250.0162 GB/s
T_tot triad2D pow_int   = 251.8233 GB/s
T_tot triad2D pow_float = 31.56365 GB/s
T_tot diffusion 2D      = 163.9904 GB/s

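Single precision execution is within ~4% of double precision for the triad and Int power kernels, and outperforms double precision for the diffusion kernel (~1.4x) and especially for the Float power kernel (~5.4x).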

Hardware: running on an Nvidia Titan Xm:

CUDA toolkit 11.4, local installation
NVIDIA driver 470.42.1, for CUDA 11.4
CUDA driver 11.4

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

8 devices:
  0: NVIDIA GeForce GTX TITAN X (sm_52, 5.819 GiB / 11.927 GiB available)

AMD Vega 20 gfx906 - Ault

Reported for single precision Float32:

nx, ny, DAT = 49152, 24576, Float32
T_tot triad2D           = 577.557 GB/s
T_tot triad2D pow_int   = 242.3805 GB/s
T_tot triad2D pow_float = 240.4102 GB/s
T_tot diffusion 2D      = 504.6102 GB/s

And for double precision Float64:

nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 728.9446 GB/s
T_tot triad2D pow_int   = 721.0397 GB/s
T_tot triad2D pow_float = 275.0624 GB/s
T_tot diffusion 2D      = 648.548 GB/s

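Single precision execution performs at 77-79% of double precision for the triad and diffusion kernels. The Int power kernel drops to ~34% and the Float power kernel reaches ~87%.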

Hardware: running on an AMD Vega 20:

julia> AMDGPU.versioninfo()
HSA Runtime (ready)
- Version: 1.1.0
- Initialized: true
ld.lld (ready)
- Path: /apps/ault/spack/opt/spack/linux-centos8-zen/gcc-8.4.1/llvm-amdgpu-4.2.0-rsmtqpi3nz4w2vj5qnvrghl5uyip5iy4/bin/ld.lld
ROCm-Device-Libs (ready)
- Downloaded: true
HIP Runtime (ready)
rocBLAS (MISSING)
rocSOLVER (MISSING)
rocFFT (MISSING)
rocRAND (MISSING)
rocSPARSE (MISSING)
rocALUTION (MISSING)
MIOpen (MISSING)
HSA Agents (2):
- CPU: AMD EPYC 7742 64-Core Processor
- GPU: Vega 20 WKS GL-XE [Radeon Pro VII] (gfx906)

AMD Vega 20 gfx906 - Satori

Reported for single precision Float32:

nx, ny, DAT = 49152, 24576, Float32
T_tot triad2D           = 701.3548 GB/s
T_tot triad2D pow_int   = 244.6597 GB/s
T_tot triad2D pow_float = 242.5716 GB/s
T_tot diffusion 2D      = 559.6188 GB/s

And for double precision Float64:

nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 772.3414 GB/s
T_tot triad2D pow_int   = 760.2888 GB/s
T_tot triad2D pow_float = 278.3227 GB/s
T_tot diffusion 2D      = 722.7216 GB/s

Single precision execution performs at 77-91% of double precision for the triad, Float power and diffusion kernels. The Int power kernel drops to ~32%.

Hardware: running on an AMD Vega 20:

HSA Runtime (ready)
- Version: 1.1.0
- Initialized: true
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Downloaded: true
HIP Runtime (ready)
rocBLAS (ready)
rocSOLVER (ready)
rocFFT (ready)
rocRAND (ready)
rocSPARSE (ready)
rocALUTION (ready)
MIOpen (ready)
HSA Agents (2):
- GPU: Vega 20 (gfx906)
- CPU: AMD EPYC 7642 48-Core Processor

AMD Vega 20 gfx906 - Goethe-HLR

Results reported for node gpu36-002 (8 GPU devices) and for double precision Float64:

Device: 1
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 704.6837 GB/s
T_tot triad2D pow_int   = 702.3983 GB/s
T_tot triad2D pow_float = 276.64 GB/s
T_tot diffusion 2D      = 639.9492 GB/s
Device: 2
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 708.6794 GB/s
T_tot triad2D pow_int   = 703.3594 GB/s
T_tot triad2D pow_float = 276.4491 GB/s
T_tot diffusion 2D      = 637.9515 GB/s
Device: 3
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 698.3451 GB/s
T_tot triad2D pow_int   = 689.2274 GB/s
T_tot triad2D pow_float = 275.7074 GB/s
T_tot diffusion 2D      = 628.2715 GB/s
Device: 4
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 694.4658 GB/s
T_tot triad2D pow_int   = 685.5483 GB/s
T_tot triad2D pow_float = 271.552 GB/s
T_tot diffusion 2D      = 627.0023 GB/s
Device: 5
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 708.423 GB/s
T_tot triad2D pow_int   = 696.6548 GB/s
T_tot triad2D pow_float = 273.7445 GB/s
T_tot diffusion 2D      = 636.6477 GB/s
Device: 6
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 687.6798 GB/s
T_tot triad2D pow_int   = 681.9737 GB/s
T_tot triad2D pow_float = 273.1908 GB/s
T_tot diffusion 2D      = 622.9047 GB/s
Device: 7
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 692.7241 GB/s
T_tot triad2D pow_int   = 682.6617 GB/s
T_tot triad2D pow_float = 274.2336 GB/s
T_tot diffusion 2D      = 622.4565 GB/s
Device: 8
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 695.2322 GB/s
T_tot triad2D pow_int   = 687.7494 GB/s
T_tot triad2D pow_float = 272.0593 GB/s
T_tot diffusion 2D      = 626.1292 GB/s
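The device-to-device variation on this node can be summarised with a few lines of Julia. The sketch below computes the mean and the min-max spread for two of the kernels; the variable names are illustrative.

```julia
using Statistics  # standard library, provides mean

# T_tot in GB/s for devices 1 to 8 (Float64), copied from the results above.
triad2D     = [704.6837, 708.6794, 698.3451, 694.4658, 708.423, 687.6798, 692.7241, 695.2322]
diffusion2D = [639.9492, 637.9515, 628.2715, 627.0023, 636.6477, 622.9047, 622.4565, 626.1292]

for (name, v) in (("triad2D", triad2D), ("diffusion 2D", diffusion2D))
    # Spread between the fastest and slowest device, relative to the mean.
    spread = 100 * (maximum(v) - minimum(v)) / mean(v)
    println(rpad(name, 14), "mean = ", round(mean(v), digits=1),
            " GB/s, min-max spread = ", round(spread, digits=1), " %")
end
```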

Details:

julia> AMDGPU.versioninfo()
HSA Runtime (ready)
- Path: /opt/rocm/hsa/lib/libhsa-runtime64.so
- Version: 1.1.0
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Path: /opt/rocm/amdgcn/bitcode
- Downloaded: true
HIP Runtime (ready)
- Path: /opt/rocm/lib/libamdhip64.so
rocBLAS (ready)
- Path: /opt/rocm/lib/librocblas.so
rocSOLVER (ready)
- Path: /opt/rocm/lib/librocsolver.so
rocALUTION (MISSING)
rocSPARSE (ready)
- Path: /opt/rocm/lib/librocsparse.so
rocRAND (ready)
- Path: /opt/rocm/lib/librocrand.so
rocFFT (ready)
- Path: /opt/rocm/lib/librocfft.so
MIOpen (ready)
- Path: /opt/rocm/lib/libMIOpen.so
HSA Agents (10):
- GPU-938470a172da5ee3 [Vega 20 (gfx906)]
- GPU-2d1060a172da5f17 [Vega 20 (gfx906)]
- GPU-d918788172da5f19 [Vega 20 (gfx906)]
- GPU-eaae68e172da5eb6 [Vega 20 (gfx906)]
- CPU-XX [AMD EPYC 7452 32-Core Processor]
- CPU-XX [AMD EPYC 7452 32-Core Processor]
- GPU-b28a58c172da5ee1 [Vega 20 (gfx906)]
- GPU-218c606172fd62db [Vega 20 (gfx906)]
- GPU-2182408172fd62db [Vega 20 (gfx906)]
- GPU-a5ac38e172dc768b [Vega 20 (gfx906)]
