It seems that PencilFFTs is slower than simply using a single GPU with CUFFT.
I'm using CUDA 11.7, Open MPI 4.1.4, UCX 1.13 (without gdrcopy), and Julia 1.7.3.
The data dimensions are (8192, 32, 32).
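For scale, a single ComplexF64 array of these dimensions occupies 128 MiB, so each distributed transpose in PencilFFTs moves a substantial amount of data. A quick sketch (plain arithmetic, nothing assumed beyond the stated dims):

```julia
# Memory footprint of one ComplexF64 array with dims (8192, 32, 32).
nelems = 8192 * 32 * 32                 # 8_388_608 elements
nbytes = nelems * sizeof(ComplexF64)    # 16 bytes per complex double
println(nbytes / 2^20, " MiB")          # 128.0 MiB
```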
For CUFFT with a single GPU:
Dimension 8192,32,32
start fft benchmark
complete fft benchmark
FFT benchmark results
BenchmarkTools.Trial: 1521 samples with 1 evaluation.
Range (min … max): 2.044 ms … 51.419 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.411 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.737 ms ± 2.174 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▄█▆▃
█████▇▄▁▄▃▃▄▃▄▅▃▁▁▁▁▃▁▃▄▁▁▁▃▁▃▃▁▄▁▃▁▁▁▁▃▃▃▁▁▃▁▁▁▁▁▁▁▁▃▁▃▃▅ █
2.04 ms Histogram: log(frequency) by time 14.5 ms <
Memory estimate: 2.94 KiB, allocs estimate: 49.
For PencilFFTs with a single GPU:
rank: 0, GPU: CuDevice(0)
has-cuda: true
data size: (8192, 32, 32)
Start data allocating
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 93.965 ms … 300.333 ms ┊ GC (min … max): 0.00% … 4.89%
Time (median): 173.977 ms ┊ GC (median): 0.00%
Time (mean ± σ): 181.007 ms ± 39.066 ms ┊ GC (mean ± σ): 0.08% ± 0.49%
█▆ ▂▂ ▃
▄▁▁▁▁▁▁▁▁▁▁▅▇▄▁██▇▇▇▄███▄██▇▅█▅▅▁▅▅▁▁█▅▄▇▅▅▁▄▁▄▅▁▁▄▁▁▄▄▁▁▁▁▁▅ ▄
94 ms Histogram: frequency by time 288 ms <
Memory estimate: 17.64 KiB, allocs estimate: 343.
For PencilFFTs with 4 GPUs on the same node:
rank: 0, GPU: CuDevice(0)
rank: 1, GPU: CuDevice(1)
rank: 2, GPU: CuDevice(2)
rank: 3, GPU: CuDevice(3)
has-cuda: true
data size: (8192, 32, 32)
Start data allocating
BenchmarkTools.Trial: 100 samples with 1 evaluation.
Range (min … max): 16.761 ms … 28.305 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 26.601 ms ┊ GC (median): 0.00%
Time (mean ± σ): 26.547 ms ± 1.360 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▄█ ▃ ▃ ▄ ▃▂
▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁▃▃▁▄▇▇████▇█▆███▃██ ▃
16.8 ms Histogram: frequency by time 28.2 ms <
Memory estimate: 13.38 KiB, allocs estimate: 294.
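Comparing the medians reported above (just arithmetic on the numbers already shown): PencilFFTs on one GPU is roughly 72× slower than plain CUFFT, and even with 4 GPUs it remains about 11× slower:

```julia
# Median times taken from the BenchmarkTools output above, in milliseconds.
cufft_median   = 2.411    # CUFFT, single GPU
pencil1_median = 173.977  # PencilFFTs, 1 GPU
pencil4_median = 26.601   # PencilFFTs, 4 GPUs

println(round(pencil1_median / cufft_median; digits = 1))  # 72.2
println(round(pencil4_median / cufft_median; digits = 1))  # 11.0
```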
Benchmark scripts:
CUFFT:
```julia
using CUDA
using FFTW            # provides the plan_fft! generic (extended for CuArray by CUDA.CUFFT)
using BenchmarkTools

println("Dimension 8192,32,32")
println("start fft benchmark")

# CUDA.@sync blocks until the kernel finishes, so the full GPU time is measured.
b = @benchmark (CUDA.@sync op * data) setup = (
    op = plan_fft!(CuArray{ComplexF64}(undef, 8192, 32, 32));
    data = CUDA.rand(ComplexF64, 8192, 32, 32)
)
println("complete fft benchmark")

io = IOBuffer()
show(io, "text/plain", b)
s = String(take!(io))
println("FFT benchmark results")
println(s)
```
PencilFFTs:
```julia
using MPI
using PencilFFTs
using PencilArrays
using BenchmarkTools
using Random
using CUDA

MPI.Init(threadlevel = :funneled)
comm = MPI.COMM_WORLD
dims = (8192, 32, 32)

rank = MPI.Comm_rank(comm)
device!(rank % length(devices()))  # assign one GPU per MPI rank
sleep(rank)                        # stagger output so prints don't interleave
print("rank: ", rank, ", GPU: ", device(), "\n")

pen = Pencil(CuArray, dims, comm)
transform = Transforms.FFT!()
plan = PencilFFTPlan(pen, transform)
u = allocate_input(plan)

if rank == 0
    println("has-cuda: ", MPI.has_cuda())
    print("data size: ", dims, "\n")
    print("Start data allocating\n")
end

randn!(first(u))
# Barrier in teardown keeps the ranks in lockstep between samples.
b = @benchmark $plan * $u evals=1 samples=100 seconds=30 teardown=(MPI.Barrier(comm))

if rank == 0
    io = IOBuffer()
    show(io, "text/plain", b)
    s = String(take!(io))
    println(s)
end
```