ENH: array summation (np.add.reduce) parallelisation #25835
@dgrigonis thanks for sharing this idea and benchmark. I'm afraid we can't really do much about this, at least not for `sum` alone. It's not impossible for NumPy to become multi-threaded in the future, but it'd require a coherent approach and ways to control it. See https://thomasjpfan.github.io/parallelism-python-libraries-design/ for more context.
Thank you for the info. Also, about my last point: is NumPy using OpenBLAS for the sum computation? If yes, do you think it would be sensible to raise a parallelized-sum request with OpenBLAS? If I remember correctly, MKL provides a parallelized sum, which results in different NumPy behaviour depending on which library NumPy was built with. I appreciate that there are many more discrepancies between MKL and OpenBLAS, which is understandable. But maybe for something as basic as `sum` it would be worth it.
> And in the …

I think you mean here that Intel / Anaconda defaults shipped an optimized NumPy build linked against MKL.
It really depends on having an overall design/strategy for multi-threading. Doing something only for `sum` doesn't seem like the right approach.
That's understandable. Thank you for taking the time. I think I will just use the dot-product workaround for now.
Proposed new feature or change:
This is a follow-up to the discussion on the mailing list.
From my POV, `sum` over medium-sized arrays (50k+ elements) often becomes a bottleneck in greedy algorithms that need to compute the distance of a partial space over and over.
In short, `sum` does not seem to be parallelized, and a sum implemented via a dot product becomes faster above a certain size threshold.
`npy.sum` being:
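The code block that originally followed here appears to have been lost in extraction. Based on the surrounding discussion, the two functions being compared presumably looked something like this (the names `npysum` and `dotsum` follow the thread's wording; the exact definitions are an assumption):

```python
import numpy as np

def npysum(a):
    # Plain NumPy sum (np.add.reduce under the hood); single-threaded.
    return np.sum(a)

def dotsum(a):
    # Sum expressed as a BLAS dot product against a vector of ones;
    # BLAS backends (OpenBLAS/MKL) may multi-thread this call.
    return np.dot(a, np.ones_like(a))
```

Both return the same value up to floating-point rounding; the difference is purely in which kernel executes the reduction.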
So one might need to resort to constructing the `sum` operation via other functions (when the program design is single-process but takes advantage of multi-threading in individual components — which IMO is still a fairly common approach, where the cost of implementing a multi-process design doesn't justify the benefits).

NumPy being a fairly low-level library in the whole Python ecosystem, I think the user should be able to rely on something as basic as `sum` being a near-optimal solution for the operation it is designed to do.

Also note, there seems to exist a library for what I am proposing: https://github.com/quansight/pnumpy.
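For completeness, there is also a user-level workaround that needs no extra library: NumPy releases the GIL inside its ufunc inner loops, so summing chunks from a thread pool can use several cores. This is only a sketch of the idea — the names `chunked_sum` and `n_threads` are illustrative, not part of any library, and whether it actually beats `np.sum` depends on array size and core count:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def chunked_sum(a, n_threads=4):
    # Split the array into roughly equal chunks and sum each chunk in
    # a worker thread; np.sum releases the GIL, so threads run in
    # parallel. Finally, reduce the per-chunk partial sums.
    chunks = np.array_split(a, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return sum(pool.map(np.sum, chunks))

a = np.random.rand(2_000_000)
# Summation order differs from np.sum(a), so compare with a tolerance.
np.isclose(chunked_sum(a), np.sum(a))
```

Note that the result can differ from `np.sum` in the last bits, since NumPy's pairwise summation visits elements in a different order.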
The sum implemented in it converges with the speed of the `dotsum` function above. However, even with `pnumpy` enabled there is a significant size range where `dotsum` is more performant.

Also, shouldn't `sum` generally be faster than a dot product? After all, a dot product needs to do one additional operation (the multiplication) compared to a vanilla sum. Or is there some clever trick in the dot product which results in the same number of operations as a vanilla sum?
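Anyone wanting to check the crossover on their own machine can time the two variants directly. A minimal sketch, with the caveat that the numbers depend entirely on the BLAS backend (OpenBLAS vs MKL), CPU, and thread settings, so they are machine-specific rather than a general result:

```python
import timeit
import numpy as np

a = np.random.rand(1_000_000)
ones = np.ones_like(a)  # allocated once, outside the timed region

# Time 100 repetitions of each reduction.
t_sum = timeit.timeit(lambda: np.sum(a), number=100)
t_dot = timeit.timeit(lambda: np.dot(a, ones), number=100)

print(f"np.sum: {t_sum:.4f}s   np.dot with ones: {t_dot:.4f}s")
```

Note that `dotsum` as benchmarked in the thread allocates the ones vector per call, which penalizes it at small sizes; hoisting the allocation out, as above, isolates the reduction cost itself.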