ENH: array summation (np.add.reduce) parallelisation #25835

Open
dgrigonis opened this issue Feb 16, 2024 · 4 comments

Comments


dgrigonis commented Feb 16, 2024

Proposed new feature or change:

This is a follow-up to the discussion on the mailing list.

From my point of view, summing medium-sized arrays (50k+ elements) often becomes a bottleneck in greedy algorithms that need to compute the distance of a partial space over and over.

In short, sum does not seem to be parallelised, and a sum implemented via a dot product becomes faster beyond a certain size threshold.

  • This is using OpenBLAS

npy.sum being plain np.add.reduce, and dotsum being:

import numpy as np

def dotsum(arr):
    # sum a 100_000-element array via a BLAS matrix-vector product with ones
    a = arr.reshape(1000, 100)
    return a.dot(np.ones(100)).sum()
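
For reference, a minimal timing comparison along these lines could look like the sketch below (the array size and repetition count are illustrative, not the exact benchmark behind the numbers above):

import numpy as np
from timeit import timeit

# assumes dotsum as defined above
arr = np.random.default_rng(0).random(100_000)
t_sum = timeit(lambda: np.sum(arr), number=1_000)
t_dot = timeit(lambda: dotsum(arr), number=1_000)
print(f"np.sum: {t_sum * 1e3:.1f} µs/call, dotsum: {t_dot * 1e3:.1f} µs/call")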

So one might need to resort to constructing the sum via other functions (when the program design is single-process but takes advantage of multiprocessing within individual components; which IMO is still a fairly common approach, where the cost of implementing a multi-process design doesn't justify the benefits).

NumPy being a fairly low-level library in the Python ecosystem, I think users should be able to rely on something as basic as sum being a near-optimal solution for the operation it is designed to do.

Also note that there already seems to be a library for what I am proposing: https://github.com/quansight/pnumpy.

The sum implemented in it converges with the speed of the dotsum function above.

However, even with pnumpy enabled there is a significant size range where the dot-product-based sum (dotsum above) is more performant.

Also, shouldn't sum generally be faster than a dot product? After all, a dot product needs to do an additional operation (a multiplication) for every element compared to a vanilla sum. Or is there some clever trick in the dot product that results in effectively the same cost as a vanilla sum?

rgommers (Member) commented:

@dgrigonis thanks for sharing this idea and benchmark.

I'm afraid we can't really do much about this - at least not for np.sum alone. The design of NumPy is that it's single-threaded; the only exception is the underlying BLAS/LAPACK libraries, which we cannot control. Importantly, to avoid oversubscription issues (i.e., trying to use N^2 processes/threads on a machine with N CPU cores) when users or downstream packages use multiprocessing, dask or a similar multi-process or multi-threaded approach on top, things have to be predictable. threadpoolctl was developed to control BLAS library threading behavior, and is used by, for example, scikit-learn to make BLAS single-threaded when users set the n_jobs keyword that's present in many scikit-learn APIs. If we'd randomly start making some functions inside NumPy multi-threaded, there would not be a way to avoid oversubscription issues like that.

It's not impossible for NumPy to become multi-threaded in the future, but it'd require a coherent approach and ways to control it. See https://thomasjpfan.github.io/parallelism-python-libraries-design/ for more context.
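
For context, this is roughly how threadpoolctl is used to keep BLAS single-threaded around a parallel region (a minimal sketch; the array size and the limit of 1 are illustrative):

import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.default_rng(0).random((2000, 2000))

# Cap the BLAS thread pool (OpenBLAS, MKL, ...) at one thread for the
# duration of the block, e.g. while an outer layer already parallelises
# across processes and oversubscription must be avoided.
with threadpool_limits(limits=1, user_api="blas"):
    a @ a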

dgrigonis (Author) commented:

Thank you for the info.

Also, my last point, regarding the performance of sum via dot product converging with vanilla sum (both parallelised), is still unclear to me. Theoretically, a dot product should be slower than a sum. Or is it the case that the multiplication in a dot product is so much cheaper than the summation part that it becomes negligible?

Is NumPy using OpenBLAS for the sum computation? If yes, do you think it would be sensible to raise a parallelised-sum request with OpenBLAS?

If I remember correctly, MKL provides a parallelized sum, which results in different NumPy behaviour depending on which library it was built with.

I appreciate that there are many more discrepancies between MKL and OpenBLAS, which is understandable. But maybe for something as basic as the sum operation the effort of an improvement could be justified?

rgommers (Member) commented:

sum shouldn't be using BLAS, I believe; the two lines in your graph probably converge because the speed for large arrays is limited by memory bandwidth. And the left side of your graph (really small arrays) reflects function call overhead:

>>> x1 = np.ones(3)
>>> x2 = np.ones(1000)
>>> %timeit np.sum(x1)
1.93 µs ± 8.42 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> %timeit np.sum(x2)
2.2 µs ± 98.9 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>>> x3 = np.ones(4000)
>>> %timeit np.sum(x3)
2.82 µs ± 6.38 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

And in the pnumpy benchmarks you see that performance is ~9.5x faster for int16, ~3.5x for float32 and only ~2x for float64/int64. That seems consistent with memory bandwidth becoming important for larger array sizes.
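
One rough way to sanity-check the memory-bandwidth explanation is to compute the effective throughput of a large sum and compare it with the machine's DRAM bandwidth (the array size and repeat count below are illustrative assumptions):

import numpy as np
from timeit import timeit

x = np.ones(50_000_000)  # ~400 MB of float64
n = 20
t = timeit(lambda: np.sum(x), number=n) / n
print(f"effective throughput: {x.nbytes / t / 1e9:.1f} GB/s")
# If this is already close to the DRAM bandwidth of the machine, adding
# threads to the reduction itself cannot buy much for arrays this large.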

"MKL provides a parallelized sum"

I think you mean here that Intel / Anaconda defaults shipped an optimized numpy version (maybe still does). Using MKL instead of OpenBLAS when one builds from source shouldn't change anything here.

"But maybe for something as basic as the sum operation the effort of an improvement could be justified?"

It really depends on having an overall design/strategy for multi-threading. pnumpy was/is interesting, but overall there hasn't been that much interest in it. A strategy would require threading control for numpy functions, and could probably only be opt-in (e.g. via a context manager).

Doing something only for np.sum is a nonstarter.
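
To make the "opt-in via a context manager" idea concrete, here is a purely hypothetical sketch of what such an API could look like; num_threads is an invented name, and nothing like it exists (or is planned) in NumPy today:

import contextlib

_thread_limit = 1  # default: keep NumPy single-threaded

@contextlib.contextmanager
def num_threads(n):
    # Hypothetical opt-in: a real version would have to propagate the limit
    # into NumPy's C-level reduction loops; this only shows the shape.
    global _thread_limit
    previous, _thread_limit = _thread_limit, n
    try:
        yield
    finally:
        _thread_limit = previous

# with num_threads(4):
#     np.add.reduce(big_array)  # would be allowed to use up to 4 threads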

dgrigonis (Author) commented:

That's understandable. Thank you for taking the time.

I think I will just use pnumpy for the time being and get back to this when I have gathered more information.
