luraess/JuliaGPUPerf

Julia GPU Perf

Performance benchmarks for Julia on GPUs, using the CUDA.jl and AMDGPU.jl software stacks.

Benchmarks

Effective memory throughput T_tot, measured in GB/s, for:

  1. the triad 2D kernel
A[ix,iy] = B[ix,iy] + s*C[ix,iy]
  2. the triad 2D kernel with power (Int)
A[ix,iy] = B[ix,iy] + s*C[ix,iy]^pow_int
  3. the triad 2D kernel with power (Float)
A[ix,iy] = B[ix,iy] + s*C[ix,iy]^pow_float
  4. the diffusion 2D kernel
T2[ix,iy] = T[ix,iy] + dt*(Ci[ix,iy]*(
            - ((-lam*(T[ix+1,iy] - T[ix,iy])*_dx) - (-lam*(T[ix,iy] - T[ix-1,iy])*_dx))*_dx
            - ((-lam*(T[ix,iy+1] - T[ix,iy])*_dy) - (-lam*(T[ix,iy] - T[ix,iy-1])*_dy))*_dy ))
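The kernels and the throughput metric above can be sketched on the CPU as follows. This is a minimal illustration, not the code from the benchmark scripts: the function names `diffusion2D_step!` and `effective_throughput` are hypothetical, and the sketch assumes that three arrays are counted per iteration (two reads and one write) and that 1 GB = 1e9 bytes. The scripts may use a different convention.

```julia
# CPU reference of the diffusion 2D kernel, applied to interior points only
# (the stencil reads ix-1, ix+1, iy-1, iy+1, so the boundary is left untouched).
function diffusion2D_step!(T2, T, Ci, lam, dt, _dx, _dy)
    nx, ny = size(T)
    for iy in 2:ny-1, ix in 2:nx-1   # ix is the inner loop (column-major friendly)
        @inbounds T2[ix,iy] = T[ix,iy] + dt*(Ci[ix,iy]*(
            - ((-lam*(T[ix+1,iy] - T[ix,iy])*_dx) - (-lam*(T[ix,iy] - T[ix-1,iy])*_dx))*_dx
            - ((-lam*(T[ix,iy+1] - T[ix,iy])*_dy) - (-lam*(T[ix,iy] - T[ix,iy-1])*_dy))*_dy ))
    end
    return nothing
end

# Effective throughput in GB/s for one iteration taking t_it seconds.
# Assumption: n_arrays arrays of nx*ny elements are moved per iteration, 1 GB = 1e9 bytes.
function effective_throughput(nx, ny, DAT, t_it; n_arrays=3)
    return n_arrays * nx * ny * sizeof(DAT) / t_it / 1e9
end

# Small usage example: time one step on a 512 x 512 Float64 grid.
nx, ny = 512, 512
T, T2, Ci = ones(nx, ny), zeros(nx, ny), ones(nx, ny)
diffusion2D_step!(T2, T, Ci, 1.0, 1e-3, 1.0, 1.0)            # warm-up (compilation)
t_it = @elapsed diffusion2D_step!(T2, T, Ci, 1.0, 1e-3, 1.0, 1.0)
println("T_tot (CPU sketch) = ", effective_throughput(nx, ny, Float64, t_it), " GB/s")
```

A single `@elapsed` call is noisy; the repository lists BenchmarkTools among its packages, which is better suited to repeated, statistically sound timings.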

Packages

(JuliaGPUPerf) pkg> st
      Status
  [21141c5a] AMDGPU v0.2.17 `https://github.com/JuliaGPU/AMDGPU.jl.git#jps/julia-1.7`
  [6e4b80f9] BenchmarkTools v1.2.2
  [052768ef] CUDA v3.8.0

Tests

The codes are run as julia --project -O3 --check-bounds=no [amd/cuda]_bench.jl on the hardware listed below.

Nvidia A100 SXM4 40GB

Reported for single precision Float32:

nx, ny, DAT = 65536, 32768, Float32
T_tot triad2D           = 1301.714 GB/s
T_tot triad2D pow_int   = 1287.426 GB/s
T_tot triad2D pow_float = 874.5707 GB/s
T_tot diffusion 2D      = 1293.076 GB/s

And for double precision Float64:

nx, ny, DAT = 32768, 32768, Float64
T_tot triad2D           = 1358.021 GB/s
T_tot triad2D pow_int   = 1356.478 GB/s
T_tot triad2D pow_float = 1020.444 GB/s
T_tot diffusion 2D      = 1362.546 GB/s

Single precision execution performs at ~95-96% of double precision, with the exception of the Float power kernel, which performs at ~86%.
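The percentages quoted here are ratios of the Float32 to the Float64 throughput for each kernel. A short sketch of that arithmetic, using the A100 values reported above (the variable names are illustrative only):

```julia
# Throughput values in GB/s, copied from the A100 results above.
f32 = (triad=1301.714, pow_int=1287.426, pow_float=874.5707, diffusion=1293.076)
f64 = (triad=1358.021, pow_int=1356.478, pow_float=1020.444, diffusion=1362.546)

# Single precision throughput as a percentage of double precision, per kernel.
ratios = map((a, b) -> round(100a / b; digits=1), f32, f64)
println(ratios)
```

The same calculation applies to the other devices by swapping in their reported values.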

  • Hardware: running on an Nvidia A100 SXM4:
julia> CUDA.versioninfo()
CUDA toolkit 11.4, local installation
NVIDIA driver 470.82.1, for CUDA 11.4
CUDA driver 11.6

Toolchain:
- Julia: 1.7.0
- LLVM: 12.0.1

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

8 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 15.127 GiB / 39.586 GiB available)

CUDA C for comparison

Using the cuda_c_bench.cu script. Results reported for single precision Float32:

65536x32768, 8.000 GB, 100 iterations.
launching (2048x4096) grid of (32x8) blocks.
Performance triad2D:           1.754 seconds, 1231.145 GB/s
Performance triad2D_pow_int:   1.978 seconds, 1091.983 GB/s (using fpow)
Performance triad2D_pow_float: 1.974 seconds, 1094.501 GB/s (using fpow)
Performance diff2D_step:       1.772 seconds, 1219.103 GB/s

And for double precision Float64:

32768x32768, 8.000 GB, 100 iterations.
launching (1024x4096) grid of (32x8) blocks.
Performance triad2D:           1.691 seconds, 1277.339 GB/s
Performance triad2D_pow_int:   2.027 seconds, 1065.666 GB/s (using pow)
Performance triad2D_pow_float: 2.058 seconds, 1049.685 GB/s (using pow)
Performance diff2D_step:       1.686 seconds, 1280.941 GB/s
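The "launching" lines in the logs above are consistent with one thread per array element: the grid dimensions equal the array dimensions divided by the 32x8 block size. A minimal sketch of that calculation follows; `launch_grid` is a hypothetical helper and is not taken from cuda_c_bench.cu, which may compute its launch parameters differently.

```julia
# Number of blocks needed to cover an nx-by-ny array with bx-by-by thread blocks.
# cld is ceiling division, so a partial block at the edge would still be launched.
launch_grid(nx, ny; bx=32, by=8) = (cld(nx, bx), cld(ny, by))

grid_f32 = launch_grid(65536, 32768)   # Float32 case
grid_f64 = launch_grid(32768, 32768)   # Float64 case
println(grid_f32, " ", grid_f64)
```

For the two array sizes used here the division is exact, so the results match the (2048x4096) and (1024x4096) grids reported in the logs.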

Nvidia V100 SXM2 32GB

Reported for single precision Float32:

nx, ny, DAT = 65536, 32768, Float32
T_tot triad2D           = 718.3771 GB/s
T_tot triad2D pow_int   = 688.5375 GB/s
T_tot triad2D pow_float = 548.9982 GB/s
T_tot diffusion 2D      = 641.0068 GB/s

And for double precision Float64:

nx, ny, DAT = 32768, 32768, Float64
T_tot triad2D           = 803.2403 GB/s
T_tot triad2D pow_int   = 789.4891 GB/s
T_tot triad2D pow_float = 775.9522 GB/s
T_tot diffusion 2D      = 736.5969 GB/s

Single precision execution performs at 87-89% of double precision, with the exception of the Float power kernel, which performs at ~71%.

  • Hardware: running on an Nvidia V100 SXM2:
julia> CUDA.versioninfo()
CUDA toolkit 11.4, local installation
NVIDIA driver 470.42.1, for CUDA 11.4
CUDA driver 11.4

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

8 devices:
  0: Tesla V100-SXM2-32GB (sm_70, 7.408 GiB / 31.749 GiB available)

Nvidia Titan Xm PCIe3.0 12GB

Reported for single precision Float32:

nx, ny, DAT = 32768, 16384, Float32
T_tot triad2D           = 248.12 GB/s
T_tot triad2D pow_int   = 241.5461 GB/s
T_tot triad2D pow_float = 171.4937 GB/s
T_tot diffusion 2D      = 226.185 GB/s

And for double precision Float64:

nx, ny, DAT = 16384, 16384, Float64
T_tot triad2D           = 250.0162 GB/s
T_tot triad2D pow_int   = 251.8233 GB/s
T_tot triad2D pow_float = 31.56365 GB/s
T_tot diffusion 2D      = 163.9904 GB/s

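Single and double precision perform on par for the triad and Int power kernels. Single precision outperforms double precision for the diffusion kernel (226 vs 164 GB/s) and, most markedly, for the Float power kernel (171 vs 32 GB/s).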

  • Hardware: running on an Nvidia Titan Xm:
CUDA toolkit 11.4, local installation
NVIDIA driver 470.42.1, for CUDA 11.4
CUDA driver 11.4

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

8 devices:
  0: NVIDIA GeForce GTX TITAN X (sm_52, 5.819 GiB / 11.927 GiB available)

AMD Vega 20 gfx906 - Ault

Reported for single precision Float32:

nx, ny, DAT = 49152, 24576, Float32
T_tot triad2D           = 577.557 GB/s
T_tot triad2D pow_int   = 242.3805 GB/s
T_tot triad2D pow_float = 240.4102 GB/s
T_tot diffusion 2D      = 504.6102 GB/s

And for double precision Float64:

nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 728.9446 GB/s
T_tot triad2D pow_int   = 721.0397 GB/s
T_tot triad2D pow_float = 275.0624 GB/s
T_tot diffusion 2D      = 648.548 GB/s

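Single precision execution performs at ~78-79% of double precision for the triad and diffusion kernels. The Int power kernel drops to ~34% of double precision, while the Float power kernel reaches ~87%.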

  • Hardware: running on an AMD Vega 20:
julia> AMDGPU.versioninfo()
HSA Runtime (ready)
- Version: 1.1.0
- Initialized: true
ld.lld (ready)
- Path: /apps/ault/spack/opt/spack/linux-centos8-zen/gcc-8.4.1/llvm-amdgpu-4.2.0-rsmtqpi3nz4w2vj5qnvrghl5uyip5iy4/bin/ld.lld
ROCm-Device-Libs (ready)
- Downloaded: true
HIP Runtime (ready)
rocBLAS (MISSING)
rocSOLVER (MISSING)
rocFFT (MISSING)
rocRAND (MISSING)
rocSPARSE (MISSING)
rocALUTION (MISSING)
MIOpen (MISSING)
HSA Agents (2):
- CPU: AMD EPYC 7742 64-Core Processor
- GPU: Vega 20 WKS GL-XE [Radeon Pro VII] (gfx906)

AMD Vega 20 gfx906 - Satori

Reported for single precision Float32:

nx, ny, DAT = 49152, 24576, Float32
T_tot triad2D           = 701.3548 GB/s
T_tot triad2D pow_int   = 244.6597 GB/s
T_tot triad2D pow_float = 242.5716 GB/s
T_tot diffusion 2D      = 559.6188 GB/s

And for double precision Float64:

nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 772.3414 GB/s
T_tot triad2D pow_int   = 760.2888 GB/s
T_tot triad2D pow_float = 278.3227 GB/s
T_tot diffusion 2D      = 722.7216 GB/s

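Single precision execution performs at ~77-91% of double precision for the diffusion and triad kernels. The Int power kernel drops to ~32% of double precision, while the Float power kernel reaches ~87%.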

  • Hardware: running on an AMD Vega 20:
HSA Runtime (ready)
- Version: 1.1.0
- Initialized: true
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Downloaded: true
HIP Runtime (ready)
rocBLAS (ready)
rocSOLVER (ready)
rocFFT (ready)
rocRAND (ready)
rocSPARSE (ready)
rocALUTION (ready)
MIOpen (ready)
HSA Agents (2):
- GPU: Vega 20 (gfx906)
- CPU: AMD EPYC 7642 48-Core Processor

AMD Vega 20 gfx906 - Goethe-HLR

Results reported for node gpu36-002 (8 GPU devices), in double precision Float64:

Device: 1
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 704.6837 GB/s
T_tot triad2D pow_int   = 702.3983 GB/s
T_tot triad2D pow_float = 276.64 GB/s
T_tot diffusion 2D      = 639.9492 GB/s
Device: 2
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 708.6794 GB/s
T_tot triad2D pow_int   = 703.3594 GB/s
T_tot triad2D pow_float = 276.4491 GB/s
T_tot diffusion 2D      = 637.9515 GB/s
Device: 3
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 698.3451 GB/s
T_tot triad2D pow_int   = 689.2274 GB/s
T_tot triad2D pow_float = 275.7074 GB/s
T_tot diffusion 2D      = 628.2715 GB/s
Device: 4
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 694.4658 GB/s
T_tot triad2D pow_int   = 685.5483 GB/s
T_tot triad2D pow_float = 271.552 GB/s
T_tot diffusion 2D      = 627.0023 GB/s
Device: 5
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 708.423 GB/s
T_tot triad2D pow_int   = 696.6548 GB/s
T_tot triad2D pow_float = 273.7445 GB/s
T_tot diffusion 2D      = 636.6477 GB/s
Device: 6
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 687.6798 GB/s
T_tot triad2D pow_int   = 681.9737 GB/s
T_tot triad2D pow_float = 273.1908 GB/s
T_tot diffusion 2D      = 622.9047 GB/s
Device: 7
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 692.7241 GB/s
T_tot triad2D pow_int   = 682.6617 GB/s
T_tot triad2D pow_float = 274.2336 GB/s
T_tot diffusion 2D      = 622.4565 GB/s
Device: 8
nx, ny, DAT = 24576, 24576, Float64
T_tot triad2D           = 695.2322 GB/s
T_tot triad2D pow_int   = 687.7494 GB/s
T_tot triad2D pow_float = 272.0593 GB/s
T_tot diffusion 2D      = 626.1292 GB/s
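To summarise the device-to-device variation on this node, the per-device values can be reduced to a mean and a relative spread. The sketch below does this for the triad2D results listed above; the variable names are illustrative, and the choice of (max - min) / mean as the spread measure is an assumption made for this example.

```julia
using Statistics

# triad2D throughput in GB/s for devices 1 to 8, copied from the results above.
triad = [704.6837, 708.6794, 698.3451, 694.4658, 708.423, 687.6798, 692.7241, 695.2322]

avg    = mean(triad)                              # mean throughput across devices
spread = (maximum(triad) - minimum(triad)) / avg  # relative max-min spread

println("mean = ", round(avg; digits=1), " GB/s, spread = ", round(100spread; digits=1), " %")
```

The same reduction can be applied to the other three kernels by swapping in their per-device values.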

Details:

julia> AMDGPU.versioninfo()
HSA Runtime (ready)
- Path: /opt/rocm/hsa/lib/libhsa-runtime64.so
- Version: 1.1.0
ld.lld (ready)
- Path: /opt/rocm/llvm/bin/ld.lld
ROCm-Device-Libs (ready)
- Path: /opt/rocm/amdgcn/bitcode
- Downloaded: true
HIP Runtime (ready)
- Path: /opt/rocm/lib/libamdhip64.so
rocBLAS (ready)
- Path: /opt/rocm/lib/librocblas.so
rocSOLVER (ready)
- Path: /opt/rocm/lib/librocsolver.so
rocALUTION (MISSING)
rocSPARSE (ready)
- Path: /opt/rocm/lib/librocsparse.so
rocRAND (ready)
- Path: /opt/rocm/lib/librocrand.so
rocFFT (ready)
- Path: /opt/rocm/lib/librocfft.so
MIOpen (ready)
- Path: /opt/rocm/lib/libMIOpen.so
HSA Agents (10):
- GPU-938470a172da5ee3 [Vega 20 (gfx906)]
- GPU-2d1060a172da5f17 [Vega 20 (gfx906)]
- GPU-d918788172da5f19 [Vega 20 (gfx906)]
- GPU-eaae68e172da5eb6 [Vega 20 (gfx906)]
- CPU-XX [AMD EPYC 7452 32-Core Processor]
- CPU-XX [AMD EPYC 7452 32-Core Processor]
- GPU-b28a58c172da5ee1 [Vega 20 (gfx906)]
- GPU-218c606172fd62db [Vega 20 (gfx906)]
- GPU-2182408172fd62db [Vega 20 (gfx906)]
- GPU-a5ac38e172dc768b [Vega 20 (gfx906)]
