
ENH: support float and longdouble in FFT using C++ pocketfft version #25711

Merged
merged 4 commits into numpy:main from mhvk:pocketfft-cpp, Feb 9, 2024

Conversation

mhvk
Contributor

@mhvk mhvk commented Jan 28, 2024

Using the C++ version of pocketfft turned out to be not that much of a hassle. With it, we can support all the dtypes that it supports: float, double, and long double.

@asmeurer, @mreineck - comments are most welcome! (My C++ is non-existent so this really is just a C file with templates thrown in, mostly using trial & error...)

fixes #17801

EDIT: some things to decide:

  • How should we add the license? Right now I just added LICENSES.md in the pocketfft directory. (The license wasn't there for the C version, but that seems to have been an oversight.)
  • Does it make most sense to include pocketfft as a git submodule? This would include the license (and a small demo programme)
  • Should we use the cache? -> No (@mreineck suggested not to given the extra memory usage; scipy does use 16)
  • Should we enable threading? (scipy exposes it but uses 1 thread by default, see its meson.build) -> leave for later (A quick check on my laptop shows an array-size-dependent speed-up of only about a factor of 2 even when using all 12 cores, so it is likely memory bound)
  • Pocketfft will just bail if it cannot allocate memory (for its twiddle factors and, I guess, a possible cache). If that is something we should catch, please advise how to intercept the throw... -> dealt with using try/catch.
  • More tests? (push to this PR welcome.)
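The "C file with templates thrown in" approach can be pictured with a minimal sketch (illustrative only, not the actual NumPy/pocketfft code; `naive_dft` is a hypothetical stand-in for the pocketfft call): one templated function body serves float, double, and long double, instantiated once per supported dtype.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical sketch: the same loop body is compiled once per precision,
// which is essentially how a single templated gufunc loop can cover the
// float, double, and long double dtypes.
template <typename T>
std::vector<std::complex<T>> naive_dft(const std::vector<std::complex<T>>& in)
{
    const std::size_t n = in.size();
    const T two_pi = static_cast<T>(6.283185307179586476925286766559L);
    std::vector<std::complex<T>> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<T> sum(0, 0);
        for (std::size_t j = 0; j < n; ++j) {
            T angle = -two_pi * static_cast<T>(j * k % n) / static_cast<T>(n);
            sum += in[j] * std::complex<T>(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}
```

The real code dispatches to pocketfft's much faster algorithms, of course; the point is only that the template instantiations replace three hand-written copies of the loop.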

@mreineck
Contributor

Ah, OK, this switches to the C++ version, but uses only its 1D transforms!
Makes a lot of sense for a quick fix of the data type issue, but it misses the chance of massive speedups (through vectorization and/or multithreading) in multi-D transforms.
If numpy 2.0 is close, that's probably the best option, though I would still urge unlocking the other features in later releases.
If only 1D FFTs are used, a lot of C++ code could be #if 0'd to reduce compile time. I can do this once the rest of the PR has converged.

@seberg
Member

seberg commented Jan 28, 2024

but it misses the chance of massive speedups (through vectorization and/or multithreading) in multi-D transforms.

That is actually an issue (small or not?) with using the gufuncs: it may be tricky to expand them to arbitrary dimensions in a nice way (which was mentioned to me, I admit, although I didn't realize it might become an issue any time soon).

And yes, so long as this looks all fine, it seems fine to not do the N-D versions.

@mhvk
Contributor Author

mhvk commented Jan 28, 2024

Indeed, I went the easy route, based on the gufunc I had just introduced (mostly to be able to allow out). Is the threading/vectorization specific to the multi-dimensional iterator, or is it something that makes the 1D transform faster? From a quick look, it appears to be the latter, so it should be relatively straightforward to use, by just calling c2c, etc. That would also remove the need to create a buffer...

@mreineck
Contributor

Is the threading/vectorization specific to the multi-dimensional iterator, or something that makes the 1D transform faster?

No, threading/vectorization only affects 2D and higher transforms. I have played with applying this to 1D as well, but only in pocketfft's successor package.

@mhvk
Contributor Author

mhvk commented Jan 28, 2024

Just to be clear, do you mean that vectorization helps for FTs over multiple axes, or FTs over a single axis in higher-D arrays? From the code, it looks like the latter, which might still work with the ufunc, as we get an extra axis passed in.

No, threading/vectorization only affects 2D and higher transforms. I have played with applying this to 1D as well, but only in pocketfft's successor package.

OK, that seems reasonable for follow-up, then. More generally, it may well make sense to use your iterator too, and/or copy what scipy does... I definitely went for what seemed the fastest route...

`numpy.fft` support for different precisions and in-place calculations
----------------------------------------------------------------------
The various FFT routines in `numpy.fft` now support calculations in float,
The various FFT routines in `numpy.fft` have gained an ``out``
Member

They have always supported single precision but have upcast the result. We should be clear that this is a breaking change.

Contributor Author

Does this sound good?

The various FFT routines in `numpy.fft` now do their calculations natively in
float, double, or long double precision, depending on the input precision,
instead of always calculating in double precision. Hence, the calculation will
now be less precise for single and more precise for long double precision.
The data type of the output array will now be adjusted accordingly.

Furthermore, all FFT routines have gained an ``out`` argument that can be used
for in-place calculations.
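The precision point in this note can be illustrated generically (this is not FFT code, just a hypothetical demonstration that doing the same arithmetic natively in float accumulates far more rounding error than doing it in double):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical illustration: sum 0.1 a million times in precision T and
// return the absolute error versus the exact result 100000. In float the
// error is large; in double it is tiny -- the same reason FFT results
// computed natively in float are less precise than the old
// always-compute-in-double behaviour.
template <typename T>
T accumulate_error()
{
    T sum = 0;
    for (int i = 0; i < 1000000; ++i)
        sum += static_cast<T>(0.1);
    return std::abs(sum - static_cast<T>(100000));
}
```

Conversely, computing natively in long double (where it is wider than double) gives more precision than before, which is the other half of the change.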

Member

Maybe it makes sense to have a dedicated change note that points out the single precision change explicitly.

Member

What's the numpy policy on using versionchanged? Should it be added to the fft docstrings?

Member

Maybe it makes sense to have a dedicated change note that points out the single precision change explicitly.

I don't know how the numpy release notes are amalgamated, but for 2.0 it would be good to highlight all the compatibility breaks somewhere in the release notes.

@mreineck
Contributor

Just to be clear, do you mean that vectorization helps for FTs over multiple axes, or FTs over a single axis in higher-D arrays? From the code, it looks like the latter, which might still work with the ufunc, as we get an extra axis passed in.

Yes, it works for transforming just a single axis in a multi-D array too. The idea is to "bundle" a few of the 1D transforms into a single one of SIMD type.
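The bundling idea can be sketched as follows (a simplified illustration, not pocketfft's actual code; `dft2_bundle` is a hypothetical name): several independent 1D transforms are stored interleaved, so each transform step touches `vlen` adjacent values in lockstep, a layout the compiler can map onto SIMD registers.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// data holds 2 samples x vlen signals, sample-major:
// data[i * vlen + lane] is sample i of signal `lane`.
// Perform the length-2 DFT (a, b) -> (a+b, a-b) for all vlen signals at
// once; the loop runs over adjacent memory, which auto-vectorizes well.
template <typename T>
void dft2_bundle(std::vector<std::complex<T>>& data, std::size_t vlen)
{
    for (std::size_t lane = 0; lane < vlen; ++lane) {
        std::complex<T> a = data[lane];
        std::complex<T> b = data[vlen + lane];
        data[lane] = a + b;
        data[vlen + lane] = a - b;
    }
}
```

In a real FFT every butterfly stage would be bundled this way, which is why transforming one axis of a multi-D array can be much faster than looping over plain 1D transforms.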

@mhvk
Contributor Author

mhvk commented Jan 28, 2024

The macos_arm64_test failure seems unrelated...

@mhvk
Contributor Author

mhvk commented Jan 28, 2024

@mreineck - OK, I'll look into using c2c and friends...

@mhvk
Contributor Author

mhvk commented Jan 29, 2024

Calling c2c, etc., using the outer dimension actually seems fairly straightforward and gives a nice speed-up. The only unfortunate thing is that pocketfft does not handle the case where the input is shorter than the output, so in that case one still needs to buffer/copy. But the speed-up is worth the complexity, so I'll push a further refactoring tomorrow.

@rgommers
Member

Nice!

Furthermore, I have not tried setting up a pocketfft cache, or done anything with threading. Scipy uses its meson.build to set a cache size of 16 and have compiler-dependent threading; happy to do the same here.

The cache size may help, assuming it's a useful setting in single-threaded mode (@mreineck?). I'd leave the threading alone; we can't do much with that. In SciPy it's controlled by a workers= keyword to opt in to multi-threading. NumPy doesn't have that, and everything is supposed to be single-threaded (the underlying BLAS/LAPACK is an exception, for [reasons]).

Should the license be added to LICENSES.txt? It wasn't there for the C version, but perhaps that was an oversight.

Probably an oversight indeed, since the SciPy licenses list has it (here). So +1 for adding it.

With it, we can support all the dtypes that it supports: float, double, and long double.

Adding float32 support is great and obviously desirable. I'm less sure about longdouble support, since that's not really all that useful and something we may have to do more work to get rid of again later. We didn't get around to revisiting this topic for 2.0, but I'm still convinced that longdouble is more harmful than helpful for NumPy as a whole, so why add more support when no one has asked for it here?

Build-time-wise, this PR in its current state looks fine - the impact of C++ is okay overall; see the quick profiling result below. Binary size is fine too: it's ~240 kb for a dev (debugoptimized) build. A note, though, that the SciPy pocketfft extension is more than 4x larger, so enabling all the functionality that SciPy uses in the NumPy build is probably not the way to go (it'd be ~25% of the size of _multiarray_umath at that point).

[image: build-time profiling result]

@mreineck
Contributor

The cache size may help, assuming it's a useful setting in single-threaded mode (@mreineck?)

It will work in single-threaded mode. However, my feeling is that the plan caching isn't really worth it: it may cause large memory consumption (see, e.g., scipy/scipy#12297) for very little performance gain (unless you call very short 1D FFTs repeatedly). Scipy maintainers tend towards keeping the cache, however.

@mhvk mhvk force-pushed the pocketfft-cpp branch 2 times, most recently from 30eeaad to 102ac99, January 30, 2024 11:38
@mhvk
Contributor Author

mhvk commented Jan 30, 2024

OK, I now have a version that calls the vectorized pocketfft version if possible. It still has the direct loops for the case that nin < nout as well as for 1D transforms (where one can avoid copies in the probably fairly common case of very long 1D transforms with contiguous output).

Some remaining questions:

  1. On the implementation: pocketfft vectorization properties are set by what is available at compile-time, not run-time. Is this a problem? Or can this be left for later optimization?
  2. Would it make sense to include pocketfft as a submodule? That would automatically include the license (and not much more except for a small demo version). I think it is included as a submodule by scipy.

@mhvk mhvk changed the title ENH: support float and longdoubt in FFT using C++ pocketfft version ENH: support float and longdouble in FFT using C++ pocketfft version Jan 30, 2024
@mhvk
Contributor Author

mhvk commented Jan 30, 2024

Adding float32 support is great and obviously desirable. I'm less sure about longdouble support, since that's not really all that useful and something we may have more work to get rid of again later.

@rgommers - I would love to have a proper float128 instead of longdouble, but on the other hand, the present behaviour is not that longdouble is not supported (in which case it would be fine to continue with that), but rather that longdouble is just converted to double without any warning. Given that, my sense is that supporting it correctly is the least disruptive change.

@rgommers
Member

the present behaviour is not that longdouble is not supported (in which case it would be fine to continue with that), but rather that longdouble is just converted to double without any warning. Given that, my sense is that supporting it correctly is the least disruptive change.

I'm not really opposed, but it is a change, with technically only increased binary size as a penalty and no upsides for Windows and macOS arm64 users (where longdouble is an alias for double). Doing nothing and continuing to convert to float64 internally as before, or, as an improvement, using the float64 loops internally but casting back to longdouble before returning the result, are therefore also options (the former is zero change from now, so it is hard to label it as more disruptive).

I'm not sure how serious the plan was to include all the pocketfft functionality that SciPy needs into NumPy - but binary size does seem to become relevant then. If it's <0.1 MB I'm not too worried, but I wouldn't really want to pay 0.25 MB or more for long double loops that no one has asked for.

@mhvk
Contributor Author

mhvk commented Jan 30, 2024

ls -lh build/numpy/fft/_pocketfft_umath.cpython-311-x86_64-linux-gnu.so
# 5.1M present PR
# 4.1M without longdouble
# 2.9M w/ longdouble but with POCKETFFT_NO_VECTORS
# (which uses low-level routines only, but halves speed;
# I checked that high-level routines are not compiled for longdouble)

For comparison, multiarray_umath is 73 MB, the simd stuff 15MB, and everything in random adds up to 43 MB.

@rgommers
Member

Thanks for the comparison, that is helpful. Note that those are debugoptimized numbers; what I had in mind was wheel sizes (I should have specified). Our wheel sizes are in the 13 - 21 MB range, and the new macOS arm64 ones even as small as 5 MB.

I did a quick test building a wheel from this branch; the pocketfft extension is 508 kb uncompressed, 208 kb compressed in it (total wheel size 7.8 MB; _multiarray_umath.so 2.6 MB compressed). So that should still be fine. If dst/dct transforms would be enabled, it may get a bit questionable. But that could still be taken care of mostly by disabling the longdouble loops for platforms where it's an alias for double. Anyway, it doesn't seem too likely to become an issue at this moment. Let's not worry about it further.

@mhvk
Contributor Author

mhvk commented Jan 31, 2024

OK, great! The other transforms may not be as bad as one would think, since they mostly use the same code. But anyway, not sure those are needed, and definitely not relevant here.

@rgommers
Member

The relevant flags are here:

# Same as NPY_CXX_FLAGS (TODO: extend for what ccompiler_opt adds)
cpp_args_common = c_args_common + [
  '-D__STDC_VERSION__=0',  # for compatibility with C headers
]
if cc.get_argument_syntax() != 'msvc'
  cpp_args_common += [
    '-fno-exceptions',  # no exception support
    '-fno-rtti',        # no runtime type information
  ]
endif

@seiko2plus
Member

The relevant flags are here:

It's limited to the core only; I thought it was a global flag.

@mhvk
Contributor Author

mhvk commented Jan 31, 2024

Are we sure those flags are seen in fft -- they don't appear in my build log... I'm not sure how exactly meson works, but does what happens in `subdir('_core')` influence `subdir('fft')`? See

numpy/meson.build, lines 398 to 401 in 2634f80:

subdir('_core')
subdir('fft')
subdir('linalg')
subdir('random')

@mhvk
Contributor Author

mhvk commented Jan 31, 2024

OK, I just empirically confirmed that that flag does not get passed on. And with that, that I can catch C++ exceptions. I'll push that up in a bit.

@seiko2plus
Member

And with that, that I can catch C++ exceptions

If it's about memory allocations, I think we should use our custom allocators over malloc or new.

@mhvk
Contributor Author

mhvk commented Jan 31, 2024

@seiko2plus - the problem is that the allocations (and possible other errors) happen inside pocketfft and ideally we do not start rewriting that...

My current solution is just to put everything in a try/catch using macros and convert C++ exceptions to python errors. I don't really know how to test memory errors, but I checked that if I purposely pass in an invalid axis number, the corresponding exception properly leads to a RuntimeError. See last commit, ab16ef4
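The interception pattern described here can be sketched roughly as follows (all names -- `WRAP_CPP_CALL`, `last_error`, `run_transform` -- are hypothetical; the real code would set a Python error with PyErr_SetString(PyExc_RuntimeError, ...) rather than store a string):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for the Python error indicator.
static std::string last_error;

// Wrap a C++ call so no exception (including std::bad_alloc thrown by
// pocketfft's allocations) can propagate through the C API boundary.
#define WRAP_CPP_CALL(body)                   \
    try {                                     \
        body;                                 \
    } catch (const std::exception& e) {       \
        last_error = e.what();                \
    } catch (...) {                           \
        last_error = "unknown C++ exception"; \
    }

// Hypothetical gufunc-loop body; `fail` stands in for, e.g., pocketfft
// throwing on an invalid axis or a failed allocation.
void run_transform(bool fail)
{
    WRAP_CPP_CALL({
        if (fail)
            throw std::runtime_error("invalid axis");
    });
}
```

Because the try/catch sits right at the interface, nothing inside pocketfft needs patching, and more detailed per-exception handling can be added later if wanted.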

@mreineck
Contributor

If it's about memory allocations, I think we should use our custom allocators over malloc or new.

I'd really recommend not doing that. Most likely these custom allocators are C-style and will therefore silently return NULL on error, which goes against fundamental assumptions C++ code makes. You'd have to patch every place in pocketfft where memory is allocated and add error-handling code.

@mhvk
Contributor Author

mhvk commented Jan 31, 2024

p.s. Note that I now use pocketfft's allocators on purpose, since those throw an exception; hence, I could remove my custom code.

@mreineck
Contributor

My current solution is just to put everything in a try/catch using macros and convert C++ exceptions to python errors.

To me this looks exactly like the right solution. It's located directly at the interface, does not require any big patching and can be extended in case more detailed error handling is required.
If memory allocations behave in the standard C++ way, they will be caught here as well.

@mreineck
Contributor

"-fno-exceptions" should certainly not be used when compiling pocketfft and the wrapper code.
To be honest, I'd consider removing -fno-exceptions -fno-rtti from your configuration altogether. This was a good idea 20 years ago, but today the overhead (both memory and runtime) should be pretty negligible.

Member

@seberg seberg left a comment

Not a deep review (although I doubt that is needed at this point!).
Just commenting on two things so they don't get lost (one bug, one suggestion to try to have a more C++ pattern for catching the exceptions).

Submodule+license seem still two things that would be nice to do as well.

void *buff = NULL;
if (step_out != sizeof(std::complex<T>)) {
    buff = pocketfft::detail::aligned_alloc(sizeof(std::complex<T>),
                                            nout * sizeof(std::complex<T>));
Member

There is a small problem here: buff is a void * and not a pocketfft::detail::arr<std::complex<T>>, which I think means that if the plan->exec fails below, no cleanup will occur.
My C++ is a bit trial and error, but I think:

/* Create a temporary array to hold a buffer if op is not contiguous */
bool contiguous_op = (step_out == sizeof(std::complex<T>));
pocketfft::detail::arr<std::complex<T>> tmp_arr(contiguous_op ? 0 : nout);

should be an option (an array of length 0 doesn't do a malloc so that is fine).

And below:

std::complex<T> *op_or_buff = (contiguous_op ? (std::complex<T> *)op : tmp_arr.data());

(Clearly, the dealloc below then becomes unnecessary, which is the whole point.)

Contributor Author

OK, that makes a lot of sense! I implemented it (though with an opposite bool, buffered).
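A minimal sketch of the resulting pattern, with `std::vector` standing in for `pocketfft::detail::arr` (both are RAII containers with a throwing allocator; `exec_contiguous` and its fill loop are hypothetical stand-ins for the real plan execution):

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstring>
#include <vector>

// If the output stride is not one complex element, execute into a temporary
// contiguous buffer and scatter back afterwards. The temporary is sized 0
// when no buffering is needed (no allocation), and because it is an RAII
// container it is freed automatically even if the transform throws.
template <typename T>
void exec_contiguous(char* op, std::size_t nout, std::size_t step_out)
{
    bool buffered = (step_out != sizeof(std::complex<T>));
    std::vector<std::complex<T>> buff(buffered ? nout : 0);
    std::complex<T>* op_or_buff =
        buffered ? buff.data() : reinterpret_cast<std::complex<T>*>(op);
    // Stand-in for plan->exec(): write into the contiguous target.
    for (std::size_t i = 0; i < nout; ++i)
        op_or_buff[i] = std::complex<T>(static_cast<T>(i), 0);
    if (buffered)  // scatter back to the strided output
        for (std::size_t i = 0; i < nout; ++i)
            std::memcpy(op + i * step_out, &buff[i], sizeof(std::complex<T>));
}
```

The `buffered` bool here is the "opposite" form mentioned above: true when a temporary is needed, so the zero-length/no-malloc trick falls out naturally.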

Note that the C++ version is header only.  We just link to upstream
via a git submodule.
@mhvk
Contributor Author

mhvk commented Feb 3, 2024

@seberg - thanks! Both suggestions were really nice, and make the code a lot cleaner. I pushed a revised version that uses them and that also replaces the addition of the C++ pocketfft header with a new submodule (which includes the license).

@mhvk
Contributor Author

mhvk commented Feb 3, 2024

p.s. In the checklist above, this just leaves the question of whether this is tested thoroughly enough. I could adjust more tests to try different dtypes, but I'm a bit wary of just increasing test run time for little benefit. Currently, the tests that I did change exercise the options to do FTs longer or shorter than the input (i.e., those that set n), which has been the main pain point so far.

@mhvk
Contributor Author

mhvk commented Feb 3, 2024

p.p.s. Well, the above about tests was not actually true, but now it is, so I ticked my "more tests" item...

@@ -381,15 +389,6 @@ def test_all_1d_norm_preserving(self):
assert_allclose(x_norm,
np.linalg.norm(tmp), atol=1e-6)

@pytest.mark.parametrize("dtype", [np.half, np.single, np.double,
Contributor Author

This one is now done much more thoroughly in test_identity_long_short and test_identity_long_short_reversed

Also adjust python side to allow multiple double types.
Also make sure every part uses copy_input and copy_output
consistently, and avoid compiling multithreading.
Also use internal pocketfft arr class to allocate the possible
temporary buffers, so we do not have to worry about deallocating
them any more.
@mhvk
Contributor Author

mhvk commented Feb 9, 2024

Is there anything more to do for this? (C++ pocketfft in np.fft)

Member

@seberg seberg left a comment

Not managing to concentrate very deeply on it, but those two C++ things were basically what I noticed and the rest looks fine.

Since nobody voiced other concerns, let's get this in. (It isn't vastly changed after all, since it is porting the C code.)

Thanks for this great new code!

@seberg seberg merged commit 0291a81 into numpy:main Feb 9, 2024
63 checks passed
@mhvk mhvk deleted the pocketfft-cpp branch February 9, 2024 17:14
Successfully merging this pull request may close these issues.

np.fft.fft always returns np.complex128 regardless of input type
6 participants