CuArray Performance #56

@Lightup1

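PencilFFTs appears to be much slower than plain CUFFT on a single GPU: the median time is about 174 ms with PencilFFTs on one GPU and about 26.6 ms on four GPUs, compared with about 2.4 ms for CUFFT alone.
I'm using CUDA 11.7, OpenMPI 4.1.4, UCX 1.13 (without gdrcopy), and Julia 1.7.3.
The data dimensions are (8192, 32, 32).
CUFFT with a single GPU: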

Dimension 8192,32,32
start fft benchmark
complete fft benchmark
FFT benchmark results
BenchmarkTools.Trial: 1521 samples with 1 evaluation.
 Range (min … max):  2.044 ms … 51.419 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     2.411 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   2.737 ms ±  2.174 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▄█▆▃                                                        
  █████▇▄▁▄▃▃▄▃▄▅▃▁▁▁▁▃▁▃▄▁▁▁▃▁▃▃▁▄▁▃▁▁▁▁▃▃▃▁▁▃▁▁▁▁▁▁▁▁▃▁▃▃▅ █
  2.04 ms      Histogram: log(frequency) by time     14.5 ms <

 Memory estimate: 2.94 KiB, allocs estimate: 49.

PencilFFTs with a single GPU:

rank:0GPU:CuDevice(0)
has-cuda:true
data size:(8192, 32, 32)
Start data allocationg
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):   93.965 ms … 300.333 ms  ┊ GC (min … max): 0.00% … 4.89%
 Time  (median):     173.977 ms               ┊ GC (median):    0.00%
 Time  (mean ± σ):   181.007 ms ±  39.066 ms  ┊ GC (mean ± σ):  0.08% ± 0.49%

                 █▆     ▂▂  ▃                                    
  ▄▁▁▁▁▁▁▁▁▁▁▅▇▄▁██▇▇▇▄███▄██▇▅█▅▅▁▅▅▁▁█▅▄▇▅▅▁▄▁▄▅▁▁▄▁▁▄▄▁▁▁▁▁▅ ▄
  94 ms            Histogram: frequency by time          288 ms <

 Memory estimate: 17.64 KiB, allocs estimate: 343.

PencilFFTs with 4 GPUs on the same node:

rank:0GPU:CuDevice(0)
rank:1GPU:CuDevice(1)
rank:2GPU:CuDevice(2)
rank:3GPU:CuDevice(3)
has-cuda:true
data size:(8192, 32, 32)
Start data allocationg
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  16.761 ms … 28.305 ms  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     26.601 ms              ┊ GC (median):    0.00%
 Time  (mean ± σ):   26.547 ms ±  1.360 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                                 ▁▄█ ▃ ▃ ▄ ▃▂  
  ▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▃▁▄▇▇████▇█▆███▃██ ▃
  16.8 ms         Histogram: frequency by time        28.2 ms <

 Memory estimate: 13.38 KiB, allocs estimate: 294.

Benchmark scripts:
CUFFT:

using CUDA
using FFTW
using BenchmarkTools

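println("Dimension 8192,32,32")
println("start fft benchmark")
# Time an in-place FFT on the GPU; CUDA.@sync waits for the GPU work to finish.
# The plan and the random input are built in `setup`, outside the timed region.
b = @benchmark (CUDA.@sync op*data) setup=(op=plan_fft!(CuArray{ComplexF64}(undef,8192,32,32));data=CUDA.rand(ComplexF64,8192,32,32))
println("complete fft benchmark")
# Render the full benchmark report as plain text and print it.
io = IOBuffer()
show(io, "text/plain", b)
s = String(take!(io))
println("FFT benchmark results")
println(s)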

PencilFFTs:

using MPI
using PencilFFTs
using PencilArrays
using BenchmarkTools
using Random
using CUDA

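MPI.Init(threadlevel=:funneled)
comm = MPI.COMM_WORLD
dims = (8192, 32, 32)

# Assign one GPU per MPI rank (round-robin over the visible devices).
rank = MPI.Comm_rank(comm)
device!(rank % length(devices()))
sleep(1*rank)  # stagger the ranks so the printed lines do not interleave
print("rank:", rank, "GPU:", device(), "\n")

# Pencil decomposition of the global array, backed by CuArray storage.
pen = Pencil(CuArray, dims, comm)
transform = Transforms.FFT!()  # in-place complex-to-complex FFT

plan = PencilFFTPlan(pen, transform)
u = allocate_input(plan)
if rank == 0
    println("has-cuda:", MPI.has_cuda())
    print("data size:", dims, "\n")
    print("Start data allocationg\n")
end
# For in-place plans, `first(u)` is the input view of the allocated array pair.
randn!(first(u))

# Each sample is one full transform; the barrier in `teardown` keeps ranks in step.
# No explicit CUDA.@sync here, unlike the CUFFT script.
b = @benchmark $plan*$u evals=1 samples=100 seconds=30 teardown=(MPI.Barrier(comm))

# Only rank 0 prints the benchmark report.
if rank == 0
    io = IOBuffer()
    show(io, "text/plain", b)
    s = String(take!(io))
    println(s)
end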
