Metal FFT for powers of 2 up to 2048 #915
Conversation
python/tests/test_fft.py
Outdated
atol = 1e-4
rtol = 1e-4
np.random.seed(7)
with mx.stream(mx.gpu):
I would just run these tests on both devices rather than skipping them for the CPU and specifying the context manager. The goal is to remove the context manager for the CPU above once all the ops can run on the GPU.
(usually we don't specify the device in the tests but just run the full test suite twice with the CPU and GPU as default).
Ah that makes sense -- updated.
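For anyone following along, a minimal sketch of exercising both devices in a single test (illustrative only, not the PR's actual test code; in practice the suite is simply run twice with different default devices, as noted above):

```python
import unittest

import mlx.core as mx
import numpy as np


class TestFFT(unittest.TestCase):
    def test_fft_matches_numpy(self):
        # Illustrative: check both devices against numpy with the same
        # tolerances used in the PR's tests.
        np.random.seed(7)
        x = np.random.rand(8, 256).astype(np.complex64)
        expected = np.fft.fft(x, axis=-1)
        for device in (mx.cpu, mx.gpu):
            mx.set_default_device(device)
            out = mx.fft.fft(mx.array(x), axis=-1)
            np.testing.assert_allclose(
                np.array(out), expected, atol=1e-4, rtol=1e-4
            )
```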
size_t n = in.shape(axes_[0]);

if (!is_power_of_2(n) || n > 2048 || n < 4) { |
Just curious, how difficult would it be to support non powers of 2? (We could easily pad with zeros.. but maybe there is a clean way to do it in the kernel itself?).
It's not completely trivial since there are a couple of different algorithms that libraries typically use, depending on the prime factors of N.
VkFFT seems to do this:
- Pure radix decomposition (as currently implemented) if N factorizes into primes <= 13. This would require adding custom DFTs for 3, 5, 7, 11 and 13.
- Rader's algorithm for everything else except Sophie Germain primes
- Bluestein's algorithm for Sophie Germain primes
For N > 2048, we'll probably want to use the 4-step FFT algorithm.
I was thinking I'd first have a go at implementing the 3/5/7/11/13-radix kernels. Then we'd have about 17% of 1 < N <= 2048 covered.
I'm happy to work on Rader's/Bluestein's/4 step after too.
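To make that breakdown concrete, here's a rough sketch (not project code; the thresholds and dispatch names are assumptions) of choosing an algorithm from the prime factorization of N:

```python
def prime_factors(n):
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))


def choose_fft_algorithm(n, max_single_kernel=2048):
    # Hypothetical dispatcher mirroring the breakdown above.
    if n > max_single_kernel:
        return "four-step"        # split N into smaller FFTs
    if all(f <= 13 for f in prime_factors(n)):
        return "mixed-radix"      # radix-2..13 kernels
    if is_prime(n) and is_prime(2 * n + 1):
        return "bluestein"        # Sophie Germain prime case
    return "rader"                # other large prime factors
```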
mlx/backend/metal/fft.cpp
Outdated
// FFT dim has stride 1 so there are no gaps
flags.contiguous = true;
That doesn't seem quite right. Even if the FFT dim has stride 1, you could have gaps due to another dimension having a larger stride?
You're definitely right, there's a bug in the no_copy case, which should now be fixed. Thanks for catching it!
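To illustrate the reviewer's point with a numpy analogy (MLX tracks strides similarly, but the details differ): the FFT axis can have unit stride while the array as a whole still has gaps.

```python
import numpy as np

x = np.zeros((8, 16), dtype=np.complex64)
view = x[::2]  # every other row

# The last (FFT) axis still has unit element stride...
print(view.strides[-1] // view.itemsize)  # 1
# ...but consecutive rows of the view are 32 elements apart in memory,
# so the data the view covers is not gap-free.
print(view.strides[0] // view.itemsize)   # 32
print(view.flags["C_CONTIGUOUS"])         # False
```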
return x_copy;
}
};
const array& in_contiguous = check_input(inputs[0]);
This array, is it actually contiguous or just contiguous in the FFT dim? If it's not truly row_contiguous, then we presumably need to specify the strides to the kernel?
Good catch! I've changed it to do the GeneralGeneral copy when the input array isn't contiguous and added a test for this case. Is that alright for now?
I started working on passing the strides directly to the kernel to avoid the extra copy, but I might save that for a future PR.
@barronalex FWIW, this is indeed the place where follow-up operations on the FFT result (like .abs()) cause a fatal error. Replacing in_contiguous.flags() with empty flags is exactly what makes my experimental code work. BTW, big thank you for your work on the Metal FFT.
Thanks for catching this! Will push a fix shortly.
@barronalex would you mind adding a test or two with non-contiguous arrays (e.g. output of transpose or broadcast)? The logic here is a bit subtle so some tests would be good.
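Something along these lines (a hypothetical sketch, not the tests that were actually added) would exercise the non-contiguous paths:

```python
import mlx.core as mx
import numpy as np


def test_fft_non_contiguous_inputs():
    np.random.seed(7)
    x = np.random.rand(4, 256).astype(np.complex64)

    # Transposed input: the FFT axis is no longer the stride-1 axis.
    out = mx.fft.fft(mx.array(x).T, axis=-1)
    np.testing.assert_allclose(
        np.array(out), np.fft.fft(x.T, axis=-1), atol=1e-4, rtol=1e-4
    )

    # Broadcast input: the batch dimension is expanded from size 1.
    y = np.random.rand(1, 256).astype(np.complex64)
    out = mx.fft.fft(mx.broadcast_to(mx.array(y), (4, 256)), axis=-1)
    np.testing.assert_allclose(
        np.array(out),
        np.fft.fft(np.broadcast_to(y, (4, 256)), axis=-1),
        atol=1e-4,
        rtol=1e-4,
    )
```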
Force-pushed from ccfa3b4 to 98a9197.
If this goes in, I have a follow-up working that implements …
I wonder how pocketfft compares to Accelerate.framework's vDSP_fft in terms of performance.
mlx/backend/metal/fft.cpp
Outdated
auto check_input = [this, &copies, &s](const array& x) {
  // TODO: Pass the strides to the kernel so
  // we can avoid the copy when x is not contiguous.
  bool no_copy = x.strides()[axes_[0]] == 1 && x.flags().contiguous;
I think you want x.flags().row_contiguous || x.flags().col_contiguous here instead of x.flags().contiguous (which has a different meaning).
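For intuition, the analogous distinction in numpy (MLX's flags aren't identical, so treat this purely as an illustration): an array can occupy one gap-free buffer yet be neither row- nor column-contiguous, and the FFT axis can still look fine in isolation.

```python
import numpy as np

x = np.zeros((2, 3, 4), dtype=np.complex64)
y = x.transpose(1, 0, 2)  # permute the leading axes

# Every element of x appears exactly once, so the underlying buffer has no
# gaps, but the layout is neither row-major nor column-major:
print(y.flags["C_CONTIGUOUS"], y.flags["F_CONTIGUOUS"])  # False False
# The last axis is still unit-stride, so a check on the FFT dim alone
# would accept a layout the kernel can't index as row-contiguous data.
print(y.strides[-1] // y.itemsize)  # 1
```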
This looks great and I think we can merge it soon! Just left a couple more comments. Let me know when you've addressed them and I'll re-run the tests.
@awni Thanks for the comments! I added some contiguity tests and changed some of the logic. Let me know if they look reasonable.
This is awesome. Let's get it landed and focus on additions in #981
#include "mlx/primitives.h"

namespace mlx::core {

void FFT::eval_gpu(const std::vector<array>& inputs, array& out) {
  auto& s = out.primitive().stream();
auto& s = stream() works here ;)
I'm cool to land this and continue the discussion in #981
Sounds good to me!
Proposed changes
#399
Add a GPU FFT algorithm for the powers of 2 from 4 to 2048.
It only supports 1D, forward, complex-to-complex transforms at the moment, but I plan to follow up with more features shortly.
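A minimal usage sketch of what this enables (assuming the standard Python API; the shapes and dtype here are just an example):

```python
import mlx.core as mx

x = mx.random.normal((8, 1024)).astype(mx.complex64)  # batch of 1D signals
y = mx.fft.fft(x, axis=-1, stream=mx.gpu)             # runs on the Metal backend
mx.eval(y)
print(y.shape, y.dtype)  # (8, 1024) complex64
```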
Performance
On my M1 Max:
For 64 <= N <= 512, we're doing ~360 GB/s with a large batch size, which isn't far off the maximum memory bandwidth of an M1 Max (~400 GB/s). The other sizes are slightly slower but can be addressed in a follow-up PR.
The GPU implementation is 20 to 35 times faster than the CPU implementation for the FFT sizes it implements.
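The numbers above are the author's M1 Max measurements. For reference, a rough sketch of how such a bandwidth figure can be estimated (the batch size, iteration count, and bytes-moved model are assumptions):

```python
import time

import mlx.core as mx


def fft_bandwidth_gbs(n, batch=4096, iters=100):
    x = mx.random.normal((batch, n)).astype(mx.complex64)
    mx.eval(x)
    mx.eval(mx.fft.fft(x))  # warm-up so kernel setup isn't timed
    tic = time.perf_counter()
    for _ in range(iters):
        y = mx.fft.fft(x)
    mx.eval(y)
    elapsed = time.perf_counter() - tic
    # Assume each FFT reads the input and writes the output once.
    bytes_moved = 2 * x.nbytes * iters
    return bytes_moved / elapsed / 1e9


for n in (64, 256, 512, 2048):
    print(n, round(fft_bandwidth_gbs(n), 1), "GB/s")
```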
Checklist
Put an x in the boxes that apply.
- I have run pre-commit run --all-files to format my code / installed pre-commit prior to committing changes