
ENH: Port single-copy np.block implementation to C #13186

Open · wants to merge 1 commit into base: main
Conversation

ahaldane (Member)

This is a port from Python to C of the new single-copy block algorithm developed by @hmaarrfk in #11971.

This port gets its speedup by (1) avoiding the shape-tuple manipulation needed in Python, and (2) avoiding the creation of many temporary view objects of the output buffer. It does this by following the example of the np.concatenate code, which solves (1) by using mutable ints in C, and solves (2) by temporarily clobbering the data pointer and shape of the out array instead of using views.

The main speedup will be for small arrays, where the optimization in #11971 didn't apply. I'll paste the before/after of python runtests.py --bench Block below; the results are pretty good.

Now there are no longer two different code paths to maintain: all arrays go through the same single-copy code path. This version also adds an out keyword argument to np.block, and gives a slightly better error message when there is a shape mismatch.
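
For context, here is a minimal pure-Python sketch of the single-copy strategy from #11971 that this PR ports, restricted to a 2-D list of 2-D arrays (block2d is an illustrative name, not NumPy API). The C port does the equivalent of pass 1 with mutable ints instead of shape tuples, and the equivalent of pass 2 without creating view objects:

    import numpy as np

    def block2d(rows, out=None):
        """Single-copy sketch of np.block for a 2-D list of 2-D arrays.

        Pass 1 gathers shapes and each leaf's destination without copying
        anything; pass 2 allocates the output once and copies every leaf
        directly into place. The real implementation also handles arbitrary
        nesting depth and error checking.
        """
        heights = [row[0].shape[0] for row in rows]
        widths = [a.shape[1] for a in rows[0]]
        dtype = np.result_type(*[a.dtype for row in rows for a in row])
        if out is None:
            out = np.empty((sum(heights), sum(widths)), dtype=dtype)

        i = 0
        for row, h in zip(rows, heights):
            j = 0
            for a, w in zip(row, widths):
                out[i:i+h, j:j+w] = a   # the single copy of each leaf
                j += w
            i += h
        return out

    a, b = np.eye(2), np.zeros((2, 3))
    expected = np.block([[a, b], [b.T, np.ones((3, 3))]])
    assert np.array_equal(block2d([[a, b], [b.T, np.ones((3, 3))]]), expected)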

@ahaldane (Member, Author)

Results from python runtests.py --bench Block:

BEFORE
[100.00%] ··· ============== ======== ============= =============
              --                                n_chunks         
              ----------------------- ---------------------------
                  shape       dtype       (2, 2)        (4, 4)   
              ============== ======== ============= =============
                 (16, 16)     uint8     36.4±0.6μs     90.5±1μs  
                 (16, 16)     uint16    36.6±0.4μs     89.9±1μs  
                 (16, 16)     uint32    36.4±0.1μs    90.7±0.5μs 
                 (16, 16)     uint64    37.3±0.7μs     91.7±1μs  
                 (32, 32)     uint8     36.7±0.4μs    92.0±0.6μs 
                 (32, 32)     uint16    37.5±0.3μs    92.7±0.7μs 
                 (32, 32)     uint32    37.9±0.5μs    92.3±0.6μs 
                 (32, 32)     uint64    38.7±0.6μs    93.4±0.2μs 
                 (64, 64)     uint8     38.1±0.3μs     94.2±2μs  
                 (64, 64)     uint16    39.3±0.5μs    95.0±0.6μs 
                 (64, 64)     uint32    39.9±0.6μs    95.1±0.7μs 
                 (64, 64)     uint64    42.2±0.4μs     98.1±1μs  
                (128, 128)    uint8     40.1±0.4μs     96.2±1μs  
                (128, 128)    uint16    42.8±0.5μs    101±0.7μs  
                (128, 128)    uint32    49.0±0.5μs     106±1μs   
                (128, 128)    uint64    60.1±0.8μs     116±1μs   
                (256, 256)    uint8     49.5±0.2μs     109±1μs   
                (256, 256)    uint16     61.7±1μs      120±1μs   
                (256, 256)    uint32     82.5±2μs      140±4μs   
                (256, 256)    uint64     123±3μs       176±9μs   
                (512, 512)    uint8      83.3±3μs      145±3μs   
                (512, 512)    uint16     124±3μs       183±6μs   
                (512, 512)    uint32     197±10μs      274±20μs  
                (512, 512)    uint64    1.88±0.6ms     616±40μs  
               (1024, 1024)   uint8      159±2μs       258±10μs  
               (1024, 1024)   uint16     291±5μs       409±8μs   
               (1024, 1024)   uint32     899±20μs    1.04±0.03ms 
               (1024, 1024)   uint64   1.87±0.09ms   2.08±0.04ms 
              ============== ======== ============= =============
AFTER
[100.00%] ··· ============== ======== ============= =============
              --                                n_chunks         
              ----------------------- ---------------------------
                  shape       dtype       (2, 2)        (4, 4)   
              ============== ======== ============= =============
                 (16, 16)     uint8    7.37±0.08μs    16.5±0.1μs 
                 (16, 16)     uint16    7.41±0.1μs   17.2±0.07μs 
                 (16, 16)     uint32    7.62±0.1μs    17.0±0.3μs 
                 (16, 16)     uint64   7.84±0.09μs    17.4±0.4μs 
                 (32, 32)     uint8    7.80±0.06μs    17.2±0.2μs 
                 (32, 32)     uint16    7.81±0.2μs    17.6±0.3μs 
                 (32, 32)     uint32    8.14±0.1μs    17.6±0.2μs 
                 (32, 32)     uint64    8.24±0.1μs    17.9±0.2μs 
                 (64, 64)     uint8     8.16±0.1μs    18.3±0.3μs 
                 (64, 64)     uint16    8.72±0.1μs    18.4±0.3μs 
                 (64, 64)     uint32    8.97±0.2μs    19.0±0.2μs 
                 (64, 64)     uint64    9.83±0.1μs    19.4±0.3μs 
                (128, 128)    uint8     9.68±0.2μs    20.0±0.2μs 
                (128, 128)    uint16    10.7±0.1μs    21.5±0.4μs 
                (128, 128)    uint32    13.3±0.3μs    23.4±0.3μs 
                (128, 128)    uint64    18.4±0.4μs    30.2±0.8μs 
                (256, 256)    uint8     14.3±0.2μs    25.9±0.1μs 
                (256, 256)    uint16    20.5±0.8μs    32.1±0.4μs 
                (256, 256)    uint32    31.5±0.7μs     44.7±2μs  
                (256, 256)    uint64     49.2±1μs     63.4±0.8μs 
                (512, 512)    uint8      35.0±2μs     47.7±0.5μs 
                (512, 512)    uint16     56.8±2μs      71.0±4μs  
                (512, 512)    uint32     90.6±1μs      112±2μs   
                (512, 512)    uint64     207±4μs       224±5μs   
               (1024, 1024)   uint8      107±4μs       133±10μs  
               (1024, 1024)   uint16     222±5μs       245±6μs   
               (1024, 1024)   uint32     810±20μs      791±10μs  
               (1024, 1024)   uint64   1.78±0.04ms   1.79±0.04ms 
              ============== ======== ============= =============

@hmaarrfk (Contributor)

Awesome stuff.

I'm surprised to see such a difference in performance for large arrays.

I'm wondering if there is a way for the whole copy function to be done with the C API instead of the Python objects. It would avoid the need to hold the GIL during the copy.

@eric-wieser (Member)

This makes things harder for downstream libraries wanting to implement block on their duck types, since it removes their reference implementation.

@hmaarrfk (Contributor)

@eric-wieser, are you referring to the PR in general or the request to release the GIL?

@ahaldane (Member, Author)

@hmaarrfk I think one can interpret the results as there being a constant overhead due to views/tuples of ~15-30μs for the 2x2 case and 70-100μs for the 4x4 case, across all the runs from large to small.

@eric-wieser True, and the same can be said of np.concatenate. Should we keep pure-Python implementations somewhere in the numpy tree to help duck-type developers?

@eric-wieser (Member) commented Mar 26, 2019

The other worry I have with this is that it ruins my ability to use the existing infrastructure to implement np.unblock, which works in reverse:

    # python
    a = [b, c]
    b, c = a

    # numpy
    a = np.block([b, c])
    np.unblock([b, c], a)

    # numpy with syntactic sugar:
    a = np.b_[b, c]
    np.b_[b, c] = a

One of the things I liked about @hmaarrfk's PR is that it made it really easy to implement this.

@ahaldane (Member, Author) commented Mar 26, 2019

This PR makes it trivial to implement unblock as well; I can easily add it if desired.

It's a six-character change: replace PyArray_AssignArray(out, arr, NULL, NPY_SAME_KIND_CASTING) with PyArray_AssignArray(arr, out, NULL, NPY_SAME_KIND_CASTING).

(That's a cool idea by the way)

@eric-wieser (Member)

No need to add it now; good to know that door remains open.

@mhvk (Contributor) commented Mar 26, 2019

This looks nice!

Something perhaps for the future: might one be able to decouple the two stages? Thinking in terms of subclasses or array mimics, many would likely have the relevant shape/dtype/etc. attributes that the first stage needs, but most will fail the second stage (as for concatenate), since the copies being made cannot be overridden (that would require going through __setitem__ instead of using PyArray_AssignArray).

Though arguably this kind of thing is premature - we do not even have something that takes the shapes of all inputs and returns the broadcast shape... which surely many have implemented independently by now.
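
As a sketch of that decoupling, here is what a caller-driven second stage could look like, assuming a hypothetical stage-1 helper block_slices that returns the final shape plus (index, leaf) pairs without touching any data (no such public API exists; the 1-D case keeps it short):

    import numpy as np

    def block_slices(leaves):
        # hypothetical stage 1: metadata only, no copies
        pairs, offset = [], 0
        for leaf in leaves:
            n = np.size(leaf)
            pairs.append((slice(offset, offset + n), leaf))
            offset += n
        return (offset,), pairs

    # stage 2, run by the caller: the allocation and the copies remain
    # overridable, because plain indexing dispatches to __setitem__
    shape, pairs = block_slices([np.ones(3), np.zeros(2)])
    out = np.empty(shape)   # a Masked/Quantity class would allocate here
    for index, leaf in pairs:
        out[index] = leaf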


    *dtype = NULL;

    if (PyTuple_Check(arrays)) {
Contributor

Is there a way to make this check a bit more robust, so that more is checked than just tuple?

Contributor

What kind of robustness are you thinking of adding? Can you point to a specific issue? This seems like the same logic as the all-Python code.

Contributor

Currently, it only checks against tuple. Is it possible to check against other non-allowed structures in a more generic way? Something like isinstance(x, list) and not isinstance(x, ndarray).

Contributor

I think it is because we also allow other array-like stuff. For convenience, we allow things like scalars:

        # We've 'bottomed out' - arrays is either a scalar or an array
        size = _nx.size(arrays)
        return parent_index, _nx.ndim(arrays), size

@hmaarrfk (Contributor)

How does this compare to calling cythonize on the .py file as-is? Is this kind of "C rewriting" worth it?

@hmaarrfk (Contributor)

@ahaldane, in the Python implementation the code was organized so that:

  1. Sanitization + individual array attribute extraction
  2. Shape concatenation
  3. dtype determination
  4. Allocation
  5. Assignment

These were all very independent methods.

In this new proposal, steps 1-3 are combined into a single recursion. I wonder if that is too much to do in a single function.

@ahaldane (Member, Author)

Responding to a few different comments:

I'm wondering if there is a way for the whole copy function to be done with the C API instead of the Python objects. It would avoid the need to hold the GIL during the copy.

Not only that, but the creation of the new nested list also costs performance. Currently, comparing the times of np.concatenate([np.ones(3)]*100) and np.block([np.ones(3)]*100), the latter is slower, and I think the only slowdown must be due to the construction of a new nested list, which np.concatenate does not need to do, since the performance of all the other code is nearly identical in the C implementations. Because it needed to be a variable-length nested structure, using Python lists seemed easier than something in C. But if we want a further speed improvement, avoiding the nested-list creation would be at the top of the list.
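
For reference, that comparison can be reproduced with something like:

    import numpy as np
    from timeit import timeit

    arrs = [np.ones(3)] * 100
    # np.concatenate consumes the flat list directly; np.block must
    # first walk it and rebuild a nested-list structure before copying
    print(timeit(lambda: np.concatenate(arrs), number=10_000))
    print(timeit(lambda: np.block(arrs), number=10_000))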

How does this compare to calling cythonize on the .py file as-is? Is this kind of "C rewriting" worth it?

I can't say for sure unless it gets implemented, but I suspect Cython won't save you from the tuple manipulation or the view creation, which are the main speedups here. As for whether it's worth it, I can't say for sure either. It does give a nice 6x speedup for small arrays, but no one asked for a speedup.

In this new proposal, steps 1-3 are combined into a single recursion. I wonder if that is too much to do in a single function.

It seemed both faster and avoided C code duplication this way, because in C you need a lot of boilerplate for the recursion logic.

I'll add that really all the hard work here was in your PR and in Eric's reviews of it: figuring out the logic and the edge cases to test for. The C port was fairly straightforward building on that. I only did this PR because I had already thought about the C port when reviewing your PR, @hmaarrfk, and I wanted to write it down before I forgot it all. If we want more time to think about alternatives, I'm fine leaving this PR alone for a while.

@seberg seberg self-requested a review May 21, 2019 23:11
@seberg (Member) commented May 21, 2019

@ahaldane this currently has merge conflicts. I think I will put reviewing this on my TODO list, but ping me if I don't get to it and you want to continue.

@hmaarrfk (Contributor) commented May 21, 2019

Hey guys, my general comment here is that moving this to dense C code will encourage others to roll their own algorithms (since Python users aren't always adept with NumPy C code).

Is it possible to maybe roll _block_info_recursion in C and expose it as an experimental NumPy API? I'm not sure if there is an established path forward for that.

I think this would avoid much of the Python overhead of analyzing tuples and lists, while keeping the two steps of the algorithm separate:

  1. Identify the slices that we need to index using metadata
  2. Slice in the arrays.

I can see this becoming really useful for algorithms that could benefit from hardware acceleration or other distributed-computation paradigms.

@mhvk (Contributor) commented May 21, 2019

Agreed that the index-gathering stage might well be useful independently for duck arrays.

@seberg seberg removed their request for review August 22, 2019 13:14
Base automatically changed from master to main March 4, 2021 02:04
@hmaarrfk (Contributor) commented Apr 5, 2022

Do we want to drop the requirement for an index-gathering function?

We could open an issue copy-pasting the code of _block_setup and leave it for a separate PR.

I'm not sure if anybody else has used it.

@hmaarrfk (Contributor) commented Apr 5, 2022

A cursory search on GitHub showed that most references to _block_setup that show up are from people vendoring numpy code.

@mhvk (Contributor) commented Apr 5, 2022

Hmm, weird that the search does not uncover that astropy uses _block_setup in two places... See https://github.com/astropy/astropy/blob/f54df76430eac0f67efb1e61be5a31f4c3029b9d/astropy/utils/masked/function_helpers.py#L455-L476
and https://github.com/astropy/astropy/blob/f54df76430eac0f67efb1e61be5a31f4c3029b9d/astropy/units/quantity_helper/function_helpers.py#L388-L411

In both places, we are using it to help override np.block for use with Masked and Quantity instances. While we can copy & paste, it would seem nicer to keep (and use) it in numpy...
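
For reference, the pattern in those links is roughly the following (condensed; it relies on private helpers, so the exact signatures here are an assumption based on the linked astropy code and may change):

    import numpy as np
    from numpy.core.shape_base import _block_setup, _block_info_recursion

    def block(arrays):
        # stage 1, metadata only: nesting depth, result ndim, final slices
        arrays, list_ndim, result_ndim, final_size = _block_setup(arrays)
        shape, slices, arrays = _block_info_recursion(arrays, list_ndim,
                                                      result_ndim)
        # stage 2: subclass-aware allocation plus __setitem__ copies
        out = np.empty(shape)   # Masked/Quantity allocate their own class
        for index, arr in zip(slices, arrays):
            out[index] = arr
        return out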

@hmaarrfk (Contributor) commented Apr 5, 2022

I did not thoroughly read through things. Thank you for revealing those.

Generally speaking, at scikit-image we learned the hard way not to use functions that start with _, due to numpy's decision to remove certain functions.

I believe you can vendor the specific functions you need.

I think it may be useful to make these a public API. The array interface seems to have matured since 2019, and likely others would like to use this blocking algorithm. It was mostly "theoretical" in 2019, but I am very glad to see that other libraries are using this code!

If this were to be made public, I would suggest that we remove the word recursion from the function name.

@mhvk (Contributor) commented Apr 5, 2022

@hmaarrfk - yes, completely agreed that one should not rely on private functions. We test against numpy-dev to know if things break and obviously will copy the code if needed. Here, my goal was just to note that the function is actually useful!
