ENH: Port single-copy np.block implementation to C #13186
base: main
Conversation
Results from BEFORE
AFTER
Awesome stuff. I'm surprised to see such a difference in performance for large arrays. I'm wondering if there is a way for the whole copy function to be done with the C API instead of via Python objects. It would avoid the need to hold the GIL during the copy.
This makes things harder for downstream libraries wanting to implement
@eric-wieser, are you referring to the PR in general or the request to release the GIL?
@hmaarrfk I think one can interpret the results as there being a constant overhead due to views/tuples of ~15-30us for the 2x2 case and 70-100us for 4x4, for all the runs from large to small.

@eric-wieser True, and the same can be said of
The other worry I have with this is it ruins my ability to use the existing infrastructure to implement:

```python
# python
a = [b, c]
b, c = a

# numpy
a = np.block([b, c])
np.unblock([b, c], a)

# numpy with syntactic sugar:
a = np.b_[b, c]
np.b_[b, c] = a
```

One of the things I liked about @hmaarrfk's PR is it made it really easy to implement this.
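For concreteness, a minimal sketch of what such an `unblock` could look like (hypothetical name and API, not part of NumPy), restricted here to a flat list of 1-D arrays:

```python
import numpy as np

def unblock(arrays, a):
    # Hypothetical inverse of np.block for a flat list of 1-D arrays:
    # scatter consecutive slices of `a` back into each array in place.
    offset = 0
    for arr in arrays:
        n = arr.shape[0]
        arr[...] = a[offset:offset + n]
        offset += n

b, c = np.empty(2), np.empty(3)
unblock([b, c], np.arange(5.0))  # b becomes [0, 1], c becomes [2, 3, 4]
```

A full version would have to recurse over the nested-list structure exactly as `np.block` does, which is why reusing the existing index-gathering machinery would make it easy to write.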
This PR makes it trivial to implement: it's a 6-character change. (That's a cool idea, by the way.)
No need to add it now, good to know that door remains open |
This looks nice! Something perhaps for the future: might one be able to decouple the two stages? Thinking in terms of subclasses or array mimics, many would likely have the relevant shape/dtype/etc. attributes that the first stage needs, but most will fail the second stage (as for concatenate), since the copies being made cannot be overridden (would need going through

Though arguably this kind of thing is premature: we do not even have something that takes the shapes of all inputs and returns the broadcast shape... which surely many have implemented independently by now.
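As an aside on that last point: NumPy has since gained `np.broadcast_shapes` (added in NumPy 1.20), which computes the broadcast result from shapes alone:

```python
import numpy as np

# np.broadcast_shapes works purely on shape tuples, without
# creating or touching any array data.
shape = np.broadcast_shapes((8, 1, 6, 1), (7, 1, 5))  # (8, 7, 6, 5)
```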
|
```c
*dtype = NULL;
```

```c
if (PyTuple_Check(arrays)) {
```
Is there a way to make this check a bit more robust, so that more is checked than just `tuple`?
What kind of robustness are you thinking of adding? Can you point to a specific issue? This seems like the same logic as the all-Python code.
Currently, it only checks against `tuple`. Is it possible to check against other non-allowed structures in a more generic way? Something like `isinstance(x, list) and not isinstance(x, ndarray)`.
I think it is because we also allow other array-like stuff. For convenience, we allow things like scalars:

```python
# We've 'bottomed out' - arrays is either a scalar or an array
size = _nx.size(arrays)
return parent_index, _nx.ndim(arrays), size
```
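For context, the behavior being discussed can be seen from the public API (the exact error text may vary between NumPy versions):

```python
import numpy as np

# Scalars are allowed at the leaves; each is broadcast up to the
# final number of dimensions before concatenation.
m = np.block([[1, 2], [3, 4]])  # 2x2 array [[1, 2], [3, 4]]

# Tuples, however, are rejected on purpose: only lists may be used
# to denote block nesting, so ambiguous inputs fail loudly.
try:
    np.block(([1, 2], [3, 4]))
    raised = False
except TypeError:
    raised = True
```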
How does this compare to calling cythonize on the .py file as-is? Is this kind of "C rewriting" worth it?
@ahaldane in the Python implementation, the code was organized such that:

These were all very independent methods. In this new proposal, 1-3 are combined into a single recursion. I wonder if that is too much to do in a single function.
Responding to a few different comments:
Not only that, but the creation of the new nested list gives a performance hit. Currently, comparing the times of
I can't say for sure unless it gets implemented, but I suspect cython won't save you from the tuple manipulation nor the view creation, which are the main speedups here. As for "is it worth it", I can't say for sure either. It does give a nice 6x speedup for small arrays, but no-one asked for a speedup.
It seemed both faster and avoided C code duplication this way, because in C you need a lot of boilerplate for the recursion logic. I'll add that really all the hard work here was in your PR and in Eric's reviews of it, figuring out the logic and the edge cases to test for. The C port was fairly straightforward building on that. I only did this PR because I had already thought about the C port when reviewing your PR, @hmaarrfk, and I wanted to write it down before I forgot it all. If we want more time to think about alternatives, I'm fine leaving this PR alone for a while.
@ahaldane, this currently has merge conflicts. I think I will put reviewing this on my TODO list, but ping me if I don't and you want to continue.
Hey guys, generally my comment here is that moving this to dense C code will encourage others to roll their own algorithms (since Python users aren't always adept with NumPy C code). Is it possible to maybe roll

I think this would avoid much of the Python overhead of analyzing tuples and lists, while leaving the two steps of the algorithm separate:
I can see this becoming really useful to algorithms that could benefit from hardware acceleration or other distributed computation paradigms. |
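A rough sketch of that two-stage split (hypothetical helper names, not NumPy API; restricted to a depth-2 nesting of 2-D blocks for brevity). Stage 1 only inspects shapes and yields output slices; stage 2 performs the copies, so a duck-array or distributed library could reuse stage 1 and substitute its own stage 2:

```python
import numpy as np

def block_slices_2d(rows):
    # Stage 1 (shape analysis only): for a depth-2 nested list of 2-D
    # blocks, compute the output shape and, for each block, the slice
    # of the output it occupies.  No array data is touched here.
    row_heights = [row[0].shape[0] for row in rows]
    col_widths = [blk.shape[1] for blk in rows[0]]
    out_shape = (sum(row_heights), sum(col_widths))
    slices = []
    r0 = 0
    for i, row in enumerate(rows):
        c0 = 0
        for j, blk in enumerate(row):
            slices.append(((i, j), (slice(r0, r0 + blk.shape[0]),
                                    slice(c0, c0 + blk.shape[1]))))
            c0 += blk.shape[1]
        r0 += row_heights[i]
    return out_shape, slices

def block_2d(rows, out=None):
    # Stage 2: allocate the output and copy each block into its slot.
    # Only this stage moves data, so it is the part a duck-array
    # library would replace with its own copy mechanism.
    shape, slices = block_slices_2d(rows)
    if out is None:
        dtype = np.result_type(*[blk.dtype for row in rows for blk in row])
        out = np.empty(shape, dtype=dtype)
    for (i, j), sl in slices:
        out[sl] = rows[i][j]
    return out
```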
Agreed that the index-gathering stage might well be useful independently for duck arrays.
Do we want to drop the requirement for an index-gathering function? We could open an issue copy-pasting the code. I'm not sure if anybody else has used it.
A cursory search on GitHub showed that most references of
Hmm, weird that the search does not uncover that astropy uses it. In both places, we are using it to help override
I did not thoroughly read through things; thank you for revealing those. Generally speaking, at scikit-image we learned the hard way not to use functions that start with an underscore. I believe you can vendor the specific functions you need.

I think it may be useful to make these a public API. The array interface seems to have matured since 2019, and likely others would like to use this blocking algorithm. It was mostly "theoretical" in 2019, but I am very glad to see that other libraries are using this code! If this were to be made public, I would suggest that we remove the word
@hmaarrfk - yes, completely agreed that one should not rely on private functions. We test against numpy-dev to know if things break and obviously will copy the code if needed. Here, my goal was just to note that the function is actually useful!
This is a port from python to C of the new single-copy block algorithm developed by @hmaarrfk in #11971.
This port gets a speedup by 1. avoiding the shape-tuple manipulation needed in Python, and 2. avoiding the creation of many temporary view objects of the output buffer. It does this by following the example of the `np.concatenate` code, which solves 1 by using mutable ints in C, and solves 2 by temporarily clobbering the data-pointer and shape of the `out` array instead of using views.

The main speedup will be for small arrays, where the optimization in #11971 didn't apply. I'll paste the before/after of `python runtests.py --bench Block` below; the results are pretty good.

Now there are no longer two different code paths to maintain: all arrays go through the same single-copy code path. This version also adds an `out` kwd argument to `np.block`, and has a slightly better error message when there is a shape mismatch.
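To illustrate the two costs this port avoids, here is a small pure-Python comparison (illustrative only; the C code in this PR creates no Python-level view objects and instead writes through the output's data pointer at computed offsets):

```python
import numpy as np

a, b = np.ones((2, 2)), np.zeros((2, 2))
out = np.empty((2, 4))

# View-based assembly: each assignment first materializes a Python-level
# view object (e.g. out[:, 0:2]) and shape tuples before copying.
out[:, 0:2] = a
out[:, 2:4] = b

# The result matches assembling the blocks with np.block directly.
assert np.array_equal(out, np.block([a, b]))
```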