
ENH: support float and longdouble in FFT using C++ pocketfft version #25711

Merged
merged 4 commits into numpy:main from mhvk:pocketfft-cpp, Feb 9, 2024

Conversation

mhvk
Contributor

@mhvk mhvk commented Jan 28, 2024

Using the C++ version of pocketfft turned out to be not that much of a hassle. With it, we can support all the dtypes that it supports: float, double, and long double.

@asmeurer, @mreineck - comments are most welcome! (My C++ is non-existent so this really is just a C file with templates thrown in, mostly using trial & error...)

fixes #17801

EDIT: some things to decide:

  • How should we add the license? Right now I just added LICENSES.md in the pocketfft directory. (The license wasn't there for the C version, but that seems to have been an oversight.)
  • Does it make most sense to include pocketfft as a git submodule? This would include the license (and a small demo programme)
  • Should we use the cache? -> No (@mreineck suggested not to given the extra memory usage; scipy does use 16)
  • Should we enable threading? (scipy exposes it but uses 1 thread by default, see its meson.build) -> leave for later (A quick check on my laptop shows an array-size-dependent speed-up of only about a factor of 2 even when using all 12 cores, so it is likely memory bound)
  • Pocketfft will just bail if it cannot allocate memory (for its twiddle factors and, I guess, a possible cache). If that is something we should catch, please advise how to intercept the throw... -> dealt with using try/catch.
  • More tests? (push to this PR welcome.)
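The "C file with templates thrown in" approach can be pictured with a minimal sketch (illustrative only, not the actual NumPy/pocketfft code; `naive_dft` is a hypothetical stand-in for the pocketfft call): one templated function body serves float, double, and long double, instantiated once per supported dtype.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical sketch: the same loop body is compiled once per precision,
// which is essentially how a single templated gufunc loop can cover the
// float, double, and long double dtypes.
template <typename T>
std::vector<std::complex<T>> naive_dft(const std::vector<std::complex<T>>& in)
{
    const std::size_t n = in.size();
    const T two_pi = static_cast<T>(6.283185307179586476925286766559L);
    std::vector<std::complex<T>> out(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<T> sum(0, 0);
        for (std::size_t j = 0; j < n; ++j) {
            T angle = -two_pi * static_cast<T>(j * k % n) / static_cast<T>(n);
            sum += in[j] * std::complex<T>(std::cos(angle), std::sin(angle));
        }
        out[k] = sum;
    }
    return out;
}
```

The real code dispatches to pocketfft's much faster algorithms, of course; the point is only that the template instantiations replace three hand-written copies of the loop.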

@mreineck
Contributor

Ah, OK, this switches to the C++ version, but uses only its 1D transforms!
Makes a lot of sense for a quick fix of the data type issue, but it misses the chance of massive speedups (through vectorization and/or multithreading) in multi-D transforms.
If numpy 2.0 is close, that's probably the best option, though I would still urge unlocking the other features in later releases.
If only 1D FFTs are used, a lot of C++ code could be #if 0'd to reduce compile time. I can do this once the rest of the PR has converged.

@seberg
Member

seberg commented Jan 28, 2024

but it misses the chance of massive speedups (through vectorization and/or multithreading) in multi-D transforms.

That is actually an issue (small or not?) with using the gufuncs: it may be tricky to expand them to arbitrary dimensions in a nice way (which was mentioned to me, I admit, although I didn't realize it might become an issue any time soon).

And yes, so long as this looks all fine, it seems fine to not do the N-D versions.

@mhvk
Contributor Author

mhvk commented Jan 28, 2024

Indeed, I went the easy route, based on the gufunc I had just introduced (mostly to be able to allow out). Is the threading/vectorization specific to the multi-dimensional iterator, or is it something that makes the 1D transform faster? From a quick look, it appears to be the latter, so it should be relatively straightforward to use, by just calling c2c, etc. That would also remove the need to create a buffer...

@mreineck
Contributor

Is the threading/vectorization specific to the multi-dimensional iterator, or something that makes the 1D transform faster?

No, threading/vectorization only affects 2D and higher transforms. I have played with applying this to 1D as well, but only in pocketfft's successor package.

@mhvk
Contributor Author

mhvk commented Jan 28, 2024

Just to be clear, do you mean that vectorization helps for FTs over multiple axes, or FTs over a single axis in higher-D arrays? From the code, it looks like the latter, which might still work with the ufunc, as we get an extra axis passed in.

No, threading/vectorization only affects 2D and higher transforms. I have played with applying this to 1D as well, but only in pocketfft's successor package.

OK, that seems reasonable for follow-up, then. More generally, it may well make sense to use your iterator too, and/or copy what scipy does... I definitely went for what seemed the fastest route...

`numpy.fft` support for different precisions and in-place calculations
----------------------------------------------------------------------
The various FFT routines in `numpy.fft` now support calculations in float,
The various FFT routines in `numpy.fft` have gained an ``out``
Member

They have always supported single precision but have upcast the result. We should be clear that this is a breaking change.

Contributor Author

Does this sound good?

The various FFT routines in `numpy.fft` now do their calculations natively in
float, double, or long double precision, depending on the input precision,
instead of always calculating in double precision. Hence, the calculation will
now be less precise for single and more precise for long double precision.
The data type of the output array will now be adjusted accordingly.

Furthermore, all FFT routines have gained an ``out`` argument that can be used
for in-place calculations.
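The precision point in this note can be illustrated generically (this is not FFT code, just a hypothetical demonstration that doing the same arithmetic natively in float accumulates far more rounding error than doing it in double):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical illustration: sum 0.1 a million times in precision T and
// return the absolute error versus the exact result 100000. In float the
// error is large; in double it is tiny -- the same reason FFT results
// computed natively in float are less precise than the old
// always-compute-in-double behaviour.
template <typename T>
T accumulate_error()
{
    T sum = 0;
    for (int i = 0; i < 1000000; ++i)
        sum += static_cast<T>(0.1);
    return std::abs(sum - static_cast<T>(100000));
}
```

Conversely, computing natively in long double (where it is wider than double) gives more precision than before, which is the other half of the change.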

Member

Maybe it makes sense to have a dedicated change note that points out the single precision change explicitly.

Member

What's the numpy policy on using versionchanged? Should it be added to the fft docstrings?

Member

Maybe it makes sense to have a dedicated change note that points out the single precision change explicitly.

I don't know how the numpy release notes are amalgamated, but for 2.0 it would be good to highlight all the compatibility breaks somewhere in the release notes.

@mreineck
Contributor

Just to be clear, do you mean that vectorization helps for FTs over multiple axes, or FTs over a single axis in higher-D arrays? From the code, it looks like the latter, which might still work with the ufunc, as we get an extra axis passed in.

Yes, it works for transforming just a single axis in a multi-D array too. The idea is to "bundle" a few of the 1D transforms into a single one of SIMD type.
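The bundling idea can be sketched as follows (a simplified illustration, not pocketfft's actual code; `dft2_bundle` is a hypothetical name): several independent 1D transforms are stored interleaved, so each transform step touches `vlen` adjacent values in lockstep, a layout the compiler can map onto SIMD registers.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// data holds 2 samples x vlen signals, sample-major:
// data[i * vlen + lane] is sample i of signal `lane`.
// Perform the length-2 DFT (a, b) -> (a+b, a-b) for all vlen signals at
// once; the loop runs over adjacent memory, which auto-vectorizes well.
template <typename T>
void dft2_bundle(std::vector<std::complex<T>>& data, std::size_t vlen)
{
    for (std::size_t lane = 0; lane < vlen; ++lane) {
        std::complex<T> a = data[lane];
        std::complex<T> b = data[vlen + lane];
        data[lane] = a + b;
        data[vlen + lane] = a - b;
    }
}
```

In a real FFT every butterfly stage would be bundled this way, which is why transforming one axis of a multi-D array can be much faster than looping over plain 1D transforms.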

@mhvk
Contributor Author

mhvk commented Jan 28, 2024

The macos_arm64_test failure seems unrelated...

@mhvk
Contributor Author

mhvk commented Jan 28, 2024

@mreineck - OK, I'll look into using c2c and friends...

@mhvk
Contributor Author

mhvk commented Jan 29, 2024

Calling c2c, etc., using the outer dimension actually seems fairly straightforward and gives a nice speed-up. The only unfortunate thing is that pocketfft does not handle the case where the input is shorter than the output, so in that case one still needs to buffer/copy. But the speed-up is worth the complexity, so I'll push a further refactoring tomorrow.

@rgommers
Member

Nice!

Furthermore, I have not tried setting up a pocketfft cache, or done anything with threading. Scipy uses its meson.build to set a cache size of 16 and have compiler-dependent threading; happy to do the same here.

The cache size may help, assuming it's a useful setting in single-threaded mode (@mreineck?). I'd leave the threading alone; we can't do much with that. In SciPy it's controlled by a workers= keyword to opt in to multi-threading. NumPy doesn't have that, and everything is supposed to be single-threaded (the underlying BLAS/LAPACK is an exception, for [reasons]).

Should the license be added to LICENSES.txt? It wasn't there for the C version, but perhaps that was an oversight.

Probably an oversight indeed, since the SciPy licenses list has it (here). So +1 for adding it.

With it, we can support all the dtypes that it supports: float, double, and long double.

Adding float32 support is great and obviously desirable. I'm less sure about longdouble support, since that's not really all that useful and something we may have to do more work to get rid of again later. We didn't get around to revisiting this topic for 2.0, but I'm still convinced that longdouble is more harmful than helpful for NumPy as a whole, so why add more support when no one has asked for it here?

Build-time-wise, this PR in its current state looks fine - the impact of C++ is okay overall; see the quick profiling result below. Binary size is fine too: it's ~240 kb for a dev (debugoptimized) build. A note, though, that the SciPy pocketfft extension is more than 4x larger, so enabling all the functionality that SciPy uses in the NumPy build is probably not the way to go (it'd be ~25% of the size of _multiarray_umath at that point).

[image: build-time profiling result]

@mreineck
Contributor

The cache size may help, assuming it's a useful setting in single-threaded mode (@mreineck?)

It will work in single-threaded mode. However, my feeling is that the plan caching isn't really worth it: it may cause large memory consumption (see, e.g., scipy/scipy#12297) for very little performance gain (unless you call very short 1D FFTs repeatedly). Scipy maintainers tend towards keeping the cache, however.

@mhvk mhvk force-pushed the pocketfft-cpp branch 2 times, most recently from 30eeaad to 102ac99, January 30, 2024 11:38
@mhvk
Contributor Author

mhvk commented Jan 30, 2024

OK, I now have a version that calls the vectorized pocketfft version if possible. It still has the direct loops for the case that nin < nout as well as for 1D transforms (where one can avoid copies in the probably fairly common case of very long 1D transforms with contiguous output).

Some remaining questions:

  1. On the implementation: pocketfft vectorization properties are set by what is available at compile-time, not run-time. Is this a problem? Or can this be left for later optimization?
  2. Would it make sense to include pocketfft as a submodule? That would automatically include the license (and not much more except for a small demo version). I think it is included as a submodule by scipy.

@mhvk mhvk changed the title ENH: support float and longdoubt in FFT using C++ pocketfft version ENH: support float and longdouble in FFT using C++ pocketfft version Jan 30, 2024
@mhvk
Contributor Author

mhvk commented Jan 30, 2024

Adding float32 support is great and obviously desirable. I'm less sure about longdouble support, since that's not really all that useful and something we may have more work to get rid of again later.

@rgommers - I would love to have a proper float128 instead of longdouble, but on the other hand, the present behaviour is not that longdouble is not supported (in which case it would be fine to continue with that), but rather that longdouble is just converted to double without any warning. Given that, my sense is that supporting it correctly is the least disruptive change.

@rgommers
Member

the present behaviour is not that longdouble is not supported (in which case it would be fine to continue with that), but rather that longdouble is just converted to double without any warning. Given that, my sense is that supporting it correctly is the least disruptive change.

I'm not really opposed, but it is a change, with technically only increased binary size as a penalty and no upsides for Windows and macOS arm64 users (where longdouble is an alias for double). Doing nothing and continuing to convert to float64 internally as before, or, as an improvement, using the float64 loops internally but casting back to longdouble before returning the result, are therefore also options (the former is zero change from now, so it is hard to label it as more disruptive).

I'm not sure how serious the plan was to include all the pocketfft functionality that SciPy needs into NumPy - but binary size does seem to become relevant then. If it's <0.1 MB I'm not too worried, but I wouldn't really want to pay 0.25 MB or more for long double loops that no one has asked for.

@mhvk
Contributor Author

mhvk commented Jan 30, 2024

ls -lh build/numpy/fft/_pocketfft_umath.cpython-311-x86_64-linux-gnu.so
# 5.1M present PR
# 4.1M without longdouble
# 2.9M w/ longdouble but with POCKETFFT_NO_VECTORS
# (which uses low-level routines only, but halves speed;
# I checked that high-level routines are not compiled for longdouble)

For comparison, multiarray_umath is 73 MB, the simd stuff 15MB, and everything in random adds up to 43 MB.

@rgommers
Member

Thanks for the comparison, that is helpful. Note that those are debugoptimized numbers; what I had in mind was wheel sizes (I should have specified). Our wheel sizes are in the 13 - 21 MB range, and the new macOS arm64 ones even as small as 5 MB.

I did a quick test building a wheel from this branch; the pocketfft extension is 508 kb uncompressed, 208 kb compressed in it (total wheel size 7.8 MB; _multiarray_umath.so 2.6 MB compressed). So that should still be fine. If dst/dct transforms would be enabled, it may get a bit questionable. But that could still be taken care of mostly by disabling the longdouble loops for platforms where it's an alias for double. Anyway, it doesn't seem too likely to become an issue at this moment. Let's not worry about it further.

@mhvk
Contributor Author

mhvk commented Jan 31, 2024

OK, great! The other transforms may not be as bad as one would think, since they mostly use the same code. But anyway, not sure those are needed, and definitely not relevant here.

@rgommers
Member

The relevant flags are here:

# Same as NPY_CXX_FLAGS (TODO: extend for what ccompiler_opt adds)
cpp_args_common = c_args_common + [
  '-D__STDC_VERSION__=0',  # for compatibility with C headers
]
if cc.get_argument_syntax() != 'msvc'
  cpp_args_common += [
    '-fno-exceptions',  # no exception support
    '-fno-rtti',        # no runtime type information
  ]
endif

@seiko2plus
Member

The relevant flags are here:

It's limited to the core only; I thought it was a global flag.

@mhvk
Contributor Author

mhvk commented Jan 31, 2024

Are we sure those flags are seen in fft -- they don't appear in my build log... I'm not sure how exactly meson works, but does what happens in `subdir('_core')` influence `subdir('fft')`? See

numpy/meson.build, lines 398 to 401 in 2634f80:

subdir('_core')
subdir('fft')
subdir('linalg')
subdir('random')

@mhvk
Contributor Author

mhvk commented Jan 31, 2024

OK, I just empirically confirmed that that flag does not get passed on. And with that, that I can catch C++ exceptions. I'll push that up in a bit.

@seiko2plus
Member

And with that, that I can catch C++ exceptions

If it's about memory allocations, I think we should use our custom allocators over malloc or new.

@mhvk
Contributor Author

mhvk commented Jan 31, 2024

@seiko2plus - the problem is that the allocations (and possible other errors) happen inside pocketfft and ideally we do not start rewriting that...

My current solution is just to put everything in a try/catch using macros and convert C++ exceptions to python errors. I don't really know how to test memory errors, but I checked that if I purposely pass in an invalid axis number, the corresponding exception properly leads to a RuntimeError. See last commit, ab16ef4
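The interception pattern described here can be sketched roughly as follows (all names -- `WRAP_CPP_CALL`, `last_error`, `run_transform` -- are hypothetical; the real code would set a Python error with PyErr_SetString(PyExc_RuntimeError, ...) rather than store a string):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in for the Python error indicator.
static std::string last_error;

// Wrap a C++ call so no exception (including std::bad_alloc thrown by
// pocketfft's allocations) can propagate through the C API boundary.
#define WRAP_CPP_CALL(body)                   \
    try {                                     \
        body;                                 \
    } catch (const std::exception& e) {       \
        last_error = e.what();                \
    } catch (...) {                           \
        last_error = "unknown C++ exception"; \
    }

// Hypothetical gufunc-loop body; `fail` stands in for, e.g., pocketfft
// throwing on an invalid axis or a failed allocation.
void run_transform(bool fail)
{
    WRAP_CPP_CALL({
        if (fail)
            throw std::runtime_error("invalid axis");
    });
}
```

Because the try/catch sits right at the interface, nothing inside pocketfft needs patching, and more detailed per-exception handling can be added later if wanted.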

@mreineck
Contributor

If it's about memory allocations, I think we should use our custom allocators over malloc or new.

I'd really recommend not doing that. Most likely these custom allocators are C-style and will therefore silently return NULL on error, which goes against fundamental assumptions C++ code makes. You'd have to patch every place in pocketfft where memory is allocated and add error-handling code.

@mhvk
Contributor Author

mhvk commented Jan 31, 2024

p.s. Note that I now use pocketfft's allocators on purpose, since those throw an exception; hence, I could remove my custom code.

@mreineck
Contributor

My current solution is just to put everything in a try/catch using macros and convert C++ exceptions to python errors.

To me this looks exactly like the right solution. It's located directly at the interface, does not require any big patching and can be extended in case more detailed error handling is required.
If memory allocations behave in the standard C++ way, they will be caught here as well.

@mreineck
Contributor

"-fno-exceptions" should certainly not be used when compiling pocketfft and the wrapper code.
To be honest, I'd consider removing -fno-exceptions -fno-rtti from your configuration altogether. This was a good idea 20 years ago, but today the overhead (both memory and runtime) should be pretty negligible.

Member

@seberg seberg left a comment

Not a deep review (although I doubt that is needed at this point!).
Just commenting on two things so they don't get lost (one bug, one suggestion to try to have a more C++ pattern for catching the exceptions).

Submodule+license seem still two things that would be nice to do as well.

void *buff = NULL;
if (step_out != sizeof(std::complex<T>)) {
    buff = pocketfft::detail::aligned_alloc(sizeof(std::complex<T>),
                                            nout * sizeof(std::complex<T>));
Member

There is a small problem here: buff is a void * and not a pocketfft::detail::arr<std::complex<T>>, which I think means that if the plan->exec fails below, no cleanup will occur.
My C++ is a bit trial and error, but I think:

/* Create a temporary array to hold a buffer if op is not contiguous */
bool contiguous_op = (step_out == sizeof(std::complex<T>));
pocketfft::detail::arr<std::complex<T>> tmp_arr(contiguous_op ? 0 : nout);

should be an option (an array of length 0 doesn't do a malloc so that is fine).

And below:

std::complex<T> *op_or_buff = (contiguous_op ? (std::complex<T> *)op : tmp_arr.data());

(Clearly, the dealloc below then becomes unnecessary, which is the whole point.)

Contributor Author

OK, that makes a lot of sense! I implemented it (though with an opposite bool, buffered).
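A minimal sketch of the resulting pattern, with `std::vector` standing in for `pocketfft::detail::arr` (both are RAII containers with a throwing allocator; `exec_contiguous` and its fill loop are hypothetical stand-ins for the real plan execution):

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstring>
#include <vector>

// If the output stride is not one complex element, execute into a temporary
// contiguous buffer and scatter back afterwards. The temporary is sized 0
// when no buffering is needed (no allocation), and because it is an RAII
// container it is freed automatically even if the transform throws.
template <typename T>
void exec_contiguous(char* op, std::size_t nout, std::size_t step_out)
{
    bool buffered = (step_out != sizeof(std::complex<T>));
    std::vector<std::complex<T>> buff(buffered ? nout : 0);
    std::complex<T>* op_or_buff =
        buffered ? buff.data() : reinterpret_cast<std::complex<T>*>(op);
    // Stand-in for plan->exec(): write into the contiguous target.
    for (std::size_t i = 0; i < nout; ++i)
        op_or_buff[i] = std::complex<T>(static_cast<T>(i), 0);
    if (buffered)  // scatter back to the strided output
        for (std::size_t i = 0; i < nout; ++i)
            std::memcpy(op + i * step_out, &buff[i], sizeof(std::complex<T>));
}
```

The `buffered` bool here is the "opposite" form mentioned above: true when a temporary is needed, so the zero-length/no-malloc trick falls out naturally.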

Note that the C++ version is header only.  We just link to upstream
via a git submodule.
@mhvk
Contributor Author

mhvk commented Feb 3, 2024

@seberg - thanks! Both suggestions were really nice, and make the code a lot cleaner. I pushed a revised version that uses them and that also replaces the addition of the C++ pocketfft header with a new submodule (which includes the license).

@mhvk
Contributor Author

mhvk commented Feb 3, 2024

p.s. In the checklist above, this just leaves the question of whether this is tested thoroughly enough. I could adjust more tests to try different dtypes, but I'm a bit wary of just increasing test run time for little benefit. Currently, the tests that I did change exercise the options to do FTs longer or shorter than the input (i.e., those that set n), which has been the main pain point so far.

@mhvk
Contributor Author

mhvk commented Feb 3, 2024

p.p.s. Well, the above about tests was not actually true, but now it is, so I ticked my "more tests" item...

@@ -381,15 +389,6 @@ def test_all_1d_norm_preserving(self):
assert_allclose(x_norm,
np.linalg.norm(tmp), atol=1e-6)

@pytest.mark.parametrize("dtype", [np.half, np.single, np.double,
Contributor Author

This one is now done much more thoroughly in test_identity_long_short and test_identity_long_short_reversed

Also adjust python side to allow multiple double types.
Also make sure every part uses copy_input and copy_output
consistently, and avoid compiling multithreading.
Also use internal pocketfft arr class to allocate the possible
temporary buffers, so we do not have to worry about deallocating
them any more.
@mhvk
Contributor Author

mhvk commented Feb 9, 2024

Is there anything more to do for this? (C++ pocketfft in np.fft)

Member

@seberg seberg left a comment

Not managing to concentrate very deeply on it, but those two C++ things were basically what I noticed and the rest looks fine.

Since nobody voiced other concerns, let's get this in. (It isn't vastly changed after all, since it is porting the C code.)

Thanks for this great new code!

@seberg seberg merged commit 0291a81 into numpy:main Feb 9, 2024
63 checks passed
@mhvk mhvk deleted the pocketfft-cpp branch February 9, 2024 17:14
Successfully merging this pull request may close these issues.

np.fft.fft always returns np.complex128 regardless of input type
6 participants