ENH: array summation (np.add.reduce) parallelisation #25835

Open
dgrigonis opened this issue Feb 16, 2024 · 4 comments

Comments


dgrigonis commented Feb 16, 2024

Proposed new feature or change:

This is a follow-up to the discussion on the mailing list.

From my point of view, summing medium-sized arrays (50k+ elements) often becomes a bottleneck in greedy algorithms that need to compute the distance of a partial space over and over.

In short, sum does not seem to be parallelised, and a sum implemented via a dot product becomes faster beyond a certain size threshold.

  • This is using OpenBLAS

npy.sum being plain np.add.reduce, and dotsum being:

import numpy as np

def dotsum(arr):
    # sum a 100_000-element array via a BLAS matrix-vector product with ones
    a = arr.reshape(1000, 100)
    return a.dot(np.ones(100)).sum()
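
For reference, a minimal timing comparison along these lines could look like the sketch below (the array size and repetition count are illustrative, not the exact benchmark behind the numbers above):

import numpy as np
from timeit import timeit

# assumes dotsum as defined above
arr = np.random.default_rng(0).random(100_000)
t_sum = timeit(lambda: np.sum(arr), number=1_000)
t_dot = timeit(lambda: dotsum(arr), number=1_000)
print(f"np.sum: {t_sum * 1e3:.1f} µs/call, dotsum: {t_dot * 1e3:.1f} µs/call")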

So one might need to resort to constructing the sum via other functions (when the program design is single-process but takes advantage of multiprocessing within individual components; which IMO is still a fairly common approach, where the cost of implementing a multi-process design doesn't justify the benefits).

NumPy being a fairly low-level library in the Python ecosystem, I think users should be able to rely on something as basic as sum being a near-optimal solution for the operation it is designed to do.

Also note that there already seems to be a library for what I am proposing: https://github.com/quansight/pnumpy.

The sum implemented in it converges with the speed of the dotsum function above.

However, even with pnumpy enabled there is a significant size range where the dot-product-based sum (dotsum above) is more performant.

Also, shouldn't sum generally be faster than a dot product? After all, a dot product needs to do an additional operation (a multiplication) for every element compared to a vanilla sum. Or is there some clever trick in the dot product that results in effectively the same cost as a vanilla sum?

rgommers (Member) commented:

@dgrigonis thanks for sharing this idea and benchmark.

I'm afraid we can't really do much about this - at least not for np.sum alone. The design of NumPy is that it's single-threaded; the only exception is the underlying BLAS/LAPACK libraries, which we cannot control. Importantly, to avoid oversubscription issues (i.e., trying to use N^2 processes/threads on a machine with N CPU cores) when users or downstream packages use multiprocessing, dask or a similar multi-process or multi-threaded approach on top, things have to be predictable. threadpoolctl was developed to control BLAS library threading behavior, and is used by, for example, scikit-learn to make BLAS single-threaded when users set the n_jobs keyword that's present in many scikit-learn APIs. If we'd randomly start making some functions inside NumPy multi-threaded, there would not be a way to avoid oversubscription issues like that.

It's not impossible for NumPy to become multi-threaded in the future, but it'd require a coherent approach and ways to control it. See https://thomasjpfan.github.io/parallelism-python-libraries-design/ for more context.
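
For context, this is roughly how threadpoolctl is used to keep BLAS single-threaded around a parallel region (a minimal sketch; the array size and the limit of 1 are illustrative):

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.default_rng(0).random((2000, 2000))

# Cap the BLAS thread pool (OpenBLAS, MKL, ...) at one thread for the
# duration of the block, e.g. while an outer layer already parallelises
# across processes and oversubscription must be avoided.
with threadpool_limits(limits=1, user_api="blas"):
    a @ a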

dgrigonis (Author) commented:

Thank you for the info.

Also, my last point, regarding the performance of sum via dot product converging with vanilla sum (both parallelised), is still unclear to me. Theoretically, a dot product should be slower than a sum. Or is it the case that the multiplication in a dot product is so much cheaper than the summation part that it becomes negligible?

Is NumPy using OpenBLAS for the sum computation? If yes, do you think it would be sensible to raise a parallelised-sum request with OpenBLAS?

If I remember correctly, MKL provides a parallelized sum, which results in different NumPy behaviour depending on which library it was built with.

I appreciate that there are many more discrepancies between MKL and OpenBLAS, which is understandable. But maybe for something as basic as the sum operation the effort of an improvement could be justified?

rgommers (Member) commented:

sum shouldn't be using BLAS, I believe; the two lines in your graph probably converge because the speed for large arrays is limited by memory bandwidth. And the left side of your graph (really small arrays) reflects function call overhead:

>>> x1 = np.ones(3)
>>> x2 = np.ones(1000)
>>> %timeit np.sum(x1)
1.93 µs ± 8.42 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> %timeit np.sum(x2)
2.2 µs ± 98.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> x3 = np.ones(4000)
>>> %timeit np.sum(x3)
2.82 µs ± 6.38 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

And in the pnumpy benchmarks you see that performance is ~9.5x faster for int16, ~3.5x for float32 and only ~2x for float64/int64. That seems consistent with memory bandwidth becoming important for larger array sizes.
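
One rough way to sanity-check the memory-bandwidth explanation is to compute the effective throughput of a large sum and compare it with the machine's DRAM bandwidth (the array size and repeat count below are illustrative assumptions):

import numpy as np
from timeit import timeit

x = np.ones(50_000_000)  # ~400 MB of float64
n = 20
t = timeit(lambda: np.sum(x), number=n) / n
print(f"effective throughput: {x.nbytes / t / 1e9:.1f} GB/s")
# If this is already close to the DRAM bandwidth of the machine, adding
# threads to the reduction itself cannot buy much for arrays this large.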

"MKL provides a parallelized sum"

I think you mean here that Intel / Anaconda defaults shipped an optimized numpy version (maybe still does). Using MKL instead of OpenBLAS when one builds from source shouldn't change anything here.

"But maybe for something as basic as the sum operation the effort of an improvement could be justified?"

It really depends on having an overall design/strategy for multi-threading. pnumpy was/is interesting, but overall there hasn't been that much interest in it. A strategy would require threading control for numpy functions, and could probably only be opt-in (e.g. via a context manager).

Doing something only for np.sum is a nonstarter.
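
To make the "opt-in via a context manager" idea concrete, here is a purely hypothetical sketch of what such an API could look like; num_threads is an invented name, and nothing like it exists (or is planned) in NumPy today:

import contextlib

_thread_limit = 1  # default: keep NumPy single-threaded

@contextlib.contextmanager
def num_threads(n):
    # Hypothetical opt-in: a real version would have to propagate the limit
    # into NumPy's C-level reduction loops; this only shows the shape.
    global _thread_limit
    previous, _thread_limit = _thread_limit, n
    try:
        yield
    finally:
        _thread_limit = previous

# with num_threads(4):
#     np.add.reduce(big_array)  # would be allowed to use up to 4 threads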

dgrigonis (Author) commented:

That's understandable. Thank you for taking the time.

I think I will just use pnumpy for the time being and get back to this when I have gathered more information.
