cuFINUFFT interface #58

Merged
merged 14 commits into from
Jul 5, 2024

@ludvigak ludvigak commented Jun 20, 2024

Interface to the (guru) cuFINUFFT library, using CUDA.jl for copying data to/from device.

  • Works just like current guru interfaces
  • Can be called with both host and device arrays (copies to device if needed)
  • Only for x86_64 at the moment, due to artifact build system

To-do list:

Fixes #49

@ludvigak ludvigak requested a review from ahbarnett June 27, 2024 17:29
@ludvigak ludvigak marked this pull request as ready for review June 27, 2024 17:29
@ludvigak
Owner Author

@ahbarnett I think the CUDA interface is pretty much in place now, let me know what you think before I merge it!
The cufinufft_* functions are now direct interfaces to the cuFINUFFT library, while the finufftDdN! functions are overloaded and dispatch to cuFINUFFT when called with CUDA arrays.
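
For readers unfamiliar with how one function name can serve both backends, here is a minimal sketch of dispatch on array type, plus a "copy to device if needed" helper. It is only an illustration: `MockDeviceArray`, `to_device` and `transform_backend` are hypothetical stand-ins invented for this sketch and are not names from FINUFFT.jl or CUDA.jl, and no real GPU is involved.

```julia
# Hypothetical stand-in for a GPU array type (NOT part of CUDA.jl or FINUFFT.jl).
struct MockDeviceArray{T}
    data::Vector{T}   # pretend this memory lives on the device
end

# "Copy to device if needed": host vectors are copied and wrapped,
# arrays that are already on the (mock) device pass through unchanged.
to_device(a::Vector) = MockDeviceArray(copy(a))
to_device(a::MockDeviceArray) = a

# One generic function with two methods; Julia picks the method from the
# argument types, so the caller uses the same name for both backends.
transform_backend(x::Vector, c::Vector) = :cpu
transform_backend(x::MockDeviceArray, c::MockDeviceArray) = :gpu

x = rand(4)
c = rand(ComplexF64, 4)
@show transform_backend(x, c)                        # host arrays
@show transform_backend(to_device(x), to_device(c))  # mock device arrays
```

The design point is that the calling code does not change between backends; only the types of the arrays passed in do.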

Collaborator

@ahbarnett ahbarnett left a comment

A100 GPU. Julia 1.9.3.

julia> ]add https://github.com/ludvigak/FINUFFT.jl#cufinufft
pkg> test FINUFFT
(wait several minutes for compilation of cuda etc...)

Test Summary: | Pass  Total  Time
FINUFFT       |   51     51  5.9s
┌ Warning: You are using a non-official build of Julia. This may cause issues with CUDA.jl.
│ Please consider using an official build from https://julialang.org/downloads/.
└ @ CUDA ~/.julia/packages/CUDA/75aiI/src/initialization.jl:180
Test Summary: | Pass  Total   Time
cuFINUFFT     |   34     34  24.0s
     Testing FINUFFT tests passed 

Looks good. I seem not to have needed ]add CUDA

@ahbarnett
Collaborator

I also played around with Float32 and benchmarking. The 1D type 1 transform (1d1) is about 20x faster on an A100 than on 10 threads of a top-end Xeon.

You may want to include example benchmark code. Here's mine:

## Here we demo CUDA routines using the 1D type 1 transform
# single-precision speed comparison.

using FINUFFT
using LinearAlgebra
using BenchmarkTools

dtype = Float32 # Datatype for computations
tol   = 1e-5   # requested relative tolerance

# Setup problem
nj = Int(3e8)
x = pi*(1 .- 2*rand(dtype, nj)); # nonuniform points
c = rand(Complex{dtype}, nj);    # their strengths
ms = Int(1e6)                      # output size (number of Fourier modes)

# CPU computation with preallocated array
fk = Array{Complex{dtype}}(undef, ms);

@btime nufft1d1!(x, c, 1, tol, fk)       # 6 sec on 10 threads of xeon 8358
                                        # (0.16G NUpt/s)

##############################################
## Simple GPU interface for preallocated array
using CUDA # CUDA must be loaded for cuFINUFFT to be activated

# Copy input data to GPU, "_d" suffix indicates data on device (GPU)
x_d = CuArray(x);
c_d = CuArray(c);
# Allocate CUDA array
out_d = CuArray{Complex{dtype}}(undef, ms);
# Note: identical interface as CPU, but with CUDA arrays on device

@btime nufft1d1!(x_d, c_d, 1, tol, out_d)       # 0.28 sec on A100 (3.6G NUpt/s)
# 20x the CPU speed, float32 or 64.

# Copy results back to host memory
gpu_results = Array(out_d);
magnitude = norm(fk, Inf)
@show norm(gpu_results-fk, Inf) / magnitude     # Should be on the order of tol
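
The check above compares the GPU and CPU outputs with each other. As an independent reference for small problems, here is a sketch that evaluates the 1D type 1 sum directly in double precision, assuming FINUFFT's default mode ordering (k running from -floor(ms/2) to floor((ms-1)/2)). `direct_type1_1d` is a hypothetical helper name, not part of FINUFFT.jl, and the direct sum costs O(nj*ms), so it is only practical at sizes far below the benchmark above.

```julia
# Direct evaluation of the 1D type 1 sum
#   f[k] = sum_j c[j] * exp(isign * im * k * x[j]),
#   k = -floor(ms/2), ..., floor((ms-1)/2)
# `direct_type1_1d` is a hypothetical helper for illustration only.
function direct_type1_1d(x::AbstractVector{<:Real}, c::AbstractVector{<:Complex},
                         isign::Integer, ms::Integer)
    ks = -(ms ÷ 2):((ms - 1) ÷ 2)
    # Accumulate in Float64 even if the inputs are Float32.
    return [sum(ComplexF64(c[j]) * cis(isign * k * Float64(x[j]))
                for j in eachindex(x, c)) for k in ks]
end

# Small test problem (sizes chosen arbitrarily for this sketch)
nj, ms = 1000, 32
x = pi .* (1 .- 2 .* rand(nj))   # nonuniform points in [-pi, pi)
c = rand(ComplexF64, nj)         # their strengths
f_ref = direct_type1_1d(x, c, 1, ms)

# With FINUFFT loaded, one could then compare along these lines
# (assumed usage, left commented out so the sketch runs without the package):
#   fk = nufft1d1(x, c, 1, 1e-9, ms)
#   @show maximum(abs, vec(fk) .- f_ref) / maximum(abs, f_ref)
```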

Thanks for the nice example code. Looks good to merge.

ludvigak and others added 2 commits July 5, 2024 13:02
@ludvigak ludvigak merged commit c863e3e into master Jul 5, 2024
13 checks passed
@ludvigak ludvigak deleted the cufinufft branch July 5, 2024 11:22
Development

Successfully merging this pull request may close these issues.

Interface to GPU code cufinufft