
ENH: Port single-copy np.block implementation to C #13186

Open · wants to merge 1 commit into base: main
Conversation

ahaldane (Member)

This is a port from Python to C of the new single-copy block algorithm developed by @hmaarrfk in #11971.

This port gets its speedup by (1) avoiding the shape-tuple manipulation needed in Python, and (2) avoiding the creation of many temporary view objects of the output buffer. It does this by following the example of the np.concatenate code, which solves (1) by using mutable ints in C, and solves (2) by temporarily clobbering the data pointer and shape of the out array instead of using views.

The main speedup will be for small arrays, where the optimization in #11971 didn't apply. I'll paste the before/after of python runtests.py --bench Block below; the results are pretty good.

Now there are no longer two different code paths to maintain: all arrays go through the same single-copy code path. This version also adds an out keyword argument to np.block, and gives a slightly better error message when there is a shape mismatch.
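
For context, here is a minimal pure-Python sketch of the single-copy strategy from #11971 that this PR ports, restricted to a 2-D list of 2-D arrays (block2d is an illustrative name, not NumPy API). The C port does the equivalent of pass 1 with mutable ints instead of shape tuples, and the equivalent of pass 2 without creating view objects:

    import numpy as np

    def block2d(rows, out=None):
        """Single-copy sketch of np.block for a 2-D list of 2-D arrays.

        Pass 1 gathers shapes and each leaf's destination without copying
        anything; pass 2 allocates the output once and copies every leaf
        directly into place. The real implementation also handles arbitrary
        nesting depth and error checking.
        """
        heights = [row[0].shape[0] for row in rows]
        widths = [a.shape[1] for a in rows[0]]
        dtype = np.result_type(*[a.dtype for row in rows for a in row])
        if out is None:
            out = np.empty((sum(heights), sum(widths)), dtype=dtype)

        i = 0
        for row, h in zip(rows, heights):
            j = 0
            for a, w in zip(row, widths):
                out[i:i+h, j:j+w] = a   # the single copy of each leaf
                j += w
            i += h
        return out

    a, b = np.eye(2), np.zeros((2, 3))
    expected = np.block([[a, b], [b.T, np.ones((3, 3))]])
    assert np.array_equal(block2d([[a, b], [b.T, np.ones((3, 3))]]), expected)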

@ahaldane (Member, Author)

Results from python runtests.py --bench Block:

BEFORE
[100.00%] ··· ============== ======== ============= =============
              --                                n_chunks         
              ----------------------- ---------------------------
                  shape       dtype       (2, 2)        (4, 4)   
              ============== ======== ============= =============
                 (16, 16)     uint8     36.4±0.6μs     90.5±1μs  
                 (16, 16)     uint16    36.6±0.4μs     89.9±1μs  
                 (16, 16)     uint32    36.4±0.1μs    90.7±0.5μs 
                 (16, 16)     uint64    37.3±0.7μs     91.7±1μs  
                 (32, 32)     uint8     36.7±0.4μs    92.0±0.6μs 
                 (32, 32)     uint16    37.5±0.3μs    92.7±0.7μs 
                 (32, 32)     uint32    37.9±0.5μs    92.3±0.6μs 
                 (32, 32)     uint64    38.7±0.6μs    93.4±0.2μs 
                 (64, 64)     uint8     38.1±0.3μs     94.2±2μs  
                 (64, 64)     uint16    39.3±0.5μs    95.0±0.6μs 
                 (64, 64)     uint32    39.9±0.6μs    95.1±0.7μs 
                 (64, 64)     uint64    42.2±0.4μs     98.1±1μs  
                (128, 128)    uint8     40.1±0.4μs     96.2±1μs  
                (128, 128)    uint16    42.8±0.5μs    101±0.7μs  
                (128, 128)    uint32    49.0±0.5μs     106±1μs   
                (128, 128)    uint64    60.1±0.8μs     116±1μs   
                (256, 256)    uint8     49.5±0.2μs     109±1μs   
                (256, 256)    uint16     61.7±1μs      120±1μs   
                (256, 256)    uint32     82.5±2μs      140±4μs   
                (256, 256)    uint64     123±3μs       176±9μs   
                (512, 512)    uint8      83.3±3μs      145±3μs   
                (512, 512)    uint16     124±3μs       183±6μs   
                (512, 512)    uint32     197±10μs      274±20μs  
                (512, 512)    uint64    1.88±0.6ms     616±40μs  
               (1024, 1024)   uint8      159±2μs       258±10μs  
               (1024, 1024)   uint16     291±5μs       409±8μs   
               (1024, 1024)   uint32     899±20μs    1.04±0.03ms 
               (1024, 1024)   uint64   1.87±0.09ms   2.08±0.04ms 
              ============== ======== ============= =============
AFTER
[100.00%] ··· ============== ======== ============= =============
              --                                n_chunks         
              ----------------------- ---------------------------
                  shape       dtype       (2, 2)        (4, 4)   
              ============== ======== ============= =============
                 (16, 16)     uint8    7.37±0.08μs    16.5±0.1μs 
                 (16, 16)     uint16    7.41±0.1μs   17.2±0.07μs 
                 (16, 16)     uint32    7.62±0.1μs    17.0±0.3μs 
                 (16, 16)     uint64   7.84±0.09μs    17.4±0.4μs 
                 (32, 32)     uint8    7.80±0.06μs    17.2±0.2μs 
                 (32, 32)     uint16    7.81±0.2μs    17.6±0.3μs 
                 (32, 32)     uint32    8.14±0.1μs    17.6±0.2μs 
                 (32, 32)     uint64    8.24±0.1μs    17.9±0.2μs 
                 (64, 64)     uint8     8.16±0.1μs    18.3±0.3μs 
                 (64, 64)     uint16    8.72±0.1μs    18.4±0.3μs 
                 (64, 64)     uint32    8.97±0.2μs    19.0±0.2μs 
                 (64, 64)     uint64    9.83±0.1μs    19.4±0.3μs 
                (128, 128)    uint8     9.68±0.2μs    20.0±0.2μs 
                (128, 128)    uint16    10.7±0.1μs    21.5±0.4μs 
                (128, 128)    uint32    13.3±0.3μs    23.4±0.3μs 
                (128, 128)    uint64    18.4±0.4μs    30.2±0.8μs 
                (256, 256)    uint8     14.3±0.2μs    25.9±0.1μs 
                (256, 256)    uint16    20.5±0.8μs    32.1±0.4μs 
                (256, 256)    uint32    31.5±0.7μs     44.7±2μs  
                (256, 256)    uint64     49.2±1μs     63.4±0.8μs 
                (512, 512)    uint8      35.0±2μs     47.7±0.5μs 
                (512, 512)    uint16     56.8±2μs      71.0±4μs  
                (512, 512)    uint32     90.6±1μs      112±2μs   
                (512, 512)    uint64     207±4μs       224±5μs   
               (1024, 1024)   uint8      107±4μs       133±10μs  
               (1024, 1024)   uint16     222±5μs       245±6μs   
               (1024, 1024)   uint32     810±20μs      791±10μs  
               (1024, 1024)   uint64   1.78±0.04ms   1.79±0.04ms 
              ============== ======== ============= =============

@hmaarrfk (Contributor)

Awesome stuff.

I'm surprised to see such a difference in performance for large arrays.

I'm wondering if there is a way for the whole copy function to be done with the C API instead of the Python objects. It would avoid the need to hold the GIL during the copy.

@eric-wieser (Member)

This makes things harder for downstream libraries wanting to implement block on their duck types, since it removes their reference implementation.

@hmaarrfk (Contributor)

@eric-wieser, are you referring to the PR in general or the request to release the GIL?

@ahaldane (Member, Author)

@hmaarrfk I think one can interpret the results as there being a constant overhead due to views/tuples of ~15-30μs for the 2x2 case and 70-100μs for the 4x4 case, across all the runs from large to small.

@eric-wieser True, and the same can be said of np.concatenate. Should we keep pure-Python implementations somewhere in the numpy tree to help duck-type developers?

@eric-wieser (Member) commented Mar 26, 2019

The other worry I have with this is that it ruins my ability to use the existing infrastructure to implement np.unblock, which works in reverse:

    # python
    a = [b, c]
    b, c = a

    # numpy
    a = np.block([b, c])
    np.unblock([b, c], a)

    # numpy with syntactic sugar:
    a = np.b_[b, c]
    np.b_[b, c] = a

One of the things I liked about @hmaarrfk's PR is that it made it really easy to implement this.

@ahaldane (Member, Author) commented Mar 26, 2019

This PR makes it trivial to implement unblock as well; I can easily add it if desired.

It's a six-character change: replace PyArray_AssignArray(out, arr, NULL, NPY_SAME_KIND_CASTING) with PyArray_AssignArray(arr, out, NULL, NPY_SAME_KIND_CASTING).

(That's a cool idea by the way)

@eric-wieser (Member)

No need to add it now; good to know that door remains open.

@mhvk (Contributor) commented Mar 26, 2019

This looks nice!

Something perhaps for the future: might one be able to decouple the two stages? Thinking in terms of subclasses or array mimics, many would likely have the relevant shape/dtype/etc. attributes that the first stage needs, but most will fail the second stage (as for concatenate), since the copies being made cannot be overridden (that would require going through __setitem__ instead of using PyArray_AssignArray).

Though arguably this kind of thing is premature - we do not even have something that takes the shapes of all inputs and returns the broadcast shape... which surely many have implemented independently by now.
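
As a sketch of that decoupling, here is what a caller-driven second stage could look like, assuming a hypothetical stage-1 helper block_slices that returns the final shape plus (index, leaf) pairs without touching any data (no such public API exists; the 1-D case keeps it short):

    import numpy as np

    def block_slices(leaves):
        # hypothetical stage 1: metadata only, no copies
        pairs, offset = [], 0
        for leaf in leaves:
            n = np.size(leaf)
            pairs.append((slice(offset, offset + n), leaf))
            offset += n
        return (offset,), pairs

    # stage 2, run by the caller: the allocation and the copies remain
    # overridable, because plain indexing dispatches to __setitem__
    shape, pairs = block_slices([np.ones(3), np.zeros(2)])
    out = np.empty(shape)   # a Masked/Quantity class would allocate here
    for index, leaf in pairs:
        out[index] = leaf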


    *dtype = NULL;

    if (PyTuple_Check(arrays)) {
Contributor

Is there a way to make this check a bit more robust, so that more is checked than just tuple?

Contributor

What kind of robustness are you thinking of adding? Can you point to a specific issue? This seems like the same logic as the all-Python code.

Contributor

Currently, it only checks against tuple. Is it possible to check against other non-allowed structures in a more generic way? Something like isinstance(x, list) and not isinstance(x, ndarray).

Contributor

I think it is because we also allow other array-like stuff. For convenience, we allow things like scalars:

        # We've 'bottomed out' - arrays is either a scalar or an array
        size = _nx.size(arrays)
        return parent_index, _nx.ndim(arrays), size

@hmaarrfk (Contributor)

How does this compare to calling cythonize on the .py file as-is? Is this kind of "C rewriting" worth it?

@hmaarrfk (Contributor)

@ahaldane, in the Python implementation the code was organized so that:

  1. Sanitization + individual array attribute extraction
  2. Shape concatenation
  3. dtype determination
  4. Allocation
  5. Assignment

These were all very independent methods.

In this new proposal, steps 1-3 are combined into a single recursion. I wonder if that is too much to do in a single function.

@ahaldane (Member, Author)

Responding to a few different comments:

I'm wondering if there is a way for the whole copy function to be done with the C API instead of the Python objects. It would avoid the need to hold the GIL during the copy.

Not only that, but the creation of the new nested list also costs performance. Currently, comparing the times of np.concatenate([np.ones(3)]*100) and np.block([np.ones(3)]*100), the latter is slower, and I think the only slowdown must be due to the construction of a new nested list, which np.concatenate does not need to do, since the performance of all the other code is nearly identical in the C implementations. Because it needed to be a variable-length nested structure, using Python lists seemed easier than something in C. But if we want a further speed improvement, avoiding the nested-list creation would be at the top of the list.
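
For reference, that comparison can be reproduced with something like:

    import numpy as np
    from timeit import timeit

    arrs = [np.ones(3)] * 100
    # np.concatenate consumes the flat list directly; np.block must
    # first walk it and rebuild a nested-list structure before copying
    print(timeit(lambda: np.concatenate(arrs), number=10_000))
    print(timeit(lambda: np.block(arrs), number=10_000))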

How does this compare to calling cythonize on the .py file as-is? Is this kind of "C rewriting" worth it?

I can't say for sure unless it gets implemented, but I suspect Cython won't save you from the tuple manipulation or the view creation, which are the main speedups here. As for whether it's worth it, I can't say for sure either. It does give a nice 6x speedup for small arrays, but no one asked for a speedup.

In this new proposal, steps 1-3 are combined into a single recursion. I wonder if that is too much to do in a single function.

It seemed both faster and avoided C code duplication this way, because in C you need a lot of boilerplate for the recursion logic.

I'll add that really all the hard work here was in your PR and in Eric's reviews of it: figuring out the logic and the edge cases to test for. The C port was fairly straightforward building on that. I only did this PR because I had already thought about the C port when reviewing your PR, @hmaarrfk, and I wanted to write it down before I forgot it all. If we want more time to think about alternatives, I'm fine leaving this PR alone for a while.

@seberg seberg self-requested a review May 21, 2019 23:11
@seberg (Member) commented May 21, 2019

@ahaldane this currently has merge conflicts. I think I will put reviewing this on my TODO list, but ping me if I don't get to it and you want to continue.

@hmaarrfk (Contributor) commented May 21, 2019

Hey guys, my general comment here is that moving this to dense C code will encourage others to roll their own algorithms (since Python users aren't always adept with NumPy C code).

Is it possible to maybe roll _block_info_recursion in C and expose it as an experimental NumPy API? I'm not sure if there is an established path forward for that.

I think this would avoid much of the Python overhead of analyzing tuples and lists, while keeping the two steps of the algorithm separate:

  1. Identify the slices that we need to index using metadata
  2. Slice in the arrays.

I can see this becoming really useful for algorithms that could benefit from hardware acceleration or other distributed-computation paradigms.

@mhvk (Contributor) commented May 21, 2019

Agreed that the index-gathering stage might well be useful independently for duck arrays.

@seberg seberg removed their request for review August 22, 2019 13:14
Base automatically changed from master to main March 4, 2021 02:04
@hmaarrfk (Contributor) commented Apr 5, 2022

Do we want to drop the requirement for an index-gathering function?

We could open an issue copy-pasting the code of _block_setup and leave it for a separate PR.

I'm not sure if anybody else has used it.

@hmaarrfk (Contributor) commented Apr 5, 2022

A cursory search on GitHub showed that most references to _block_setup that show up are from people vendoring numpy code.

@mhvk (Contributor) commented Apr 5, 2022

Hmm, weird that the search does not uncover that astropy uses _block_setup in two places... See https://github.com/astropy/astropy/blob/f54df76430eac0f67efb1e61be5a31f4c3029b9d/astropy/utils/masked/function_helpers.py#L455-L476
and https://github.com/astropy/astropy/blob/f54df76430eac0f67efb1e61be5a31f4c3029b9d/astropy/units/quantity_helper/function_helpers.py#L388-L411

In both places, we are using it to help override np.block for use with Masked and Quantity instances. While we can copy & paste, it would seem nicer to keep (and use) it in numpy...
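
For reference, the pattern in those links is roughly the following (condensed; it relies on private helpers, so the exact signatures here are an assumption based on the linked astropy code and may change):

    import numpy as np
    from numpy.core.shape_base import _block_setup, _block_info_recursion

    def block(arrays):
        # stage 1, metadata only: nesting depth, result ndim, final slices
        arrays, list_ndim, result_ndim, final_size = _block_setup(arrays)
        shape, slices, arrays = _block_info_recursion(arrays, list_ndim,
                                                      result_ndim)
        # stage 2: subclass-aware allocation plus __setitem__ copies
        out = np.empty(shape)   # Masked/Quantity allocate their own class
        for index, arr in zip(slices, arrays):
            out[index] = arr
        return out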

@hmaarrfk (Contributor) commented Apr 5, 2022

I did not thoroughly read through things. Thank you for revealing those.

Generally speaking, at scikit-image we learned the hard way not to use functions that start with _, due to numpy's decision to remove certain functions.

I believe you can vendor the specific functions you need.

I think it may be useful to make these a public API. The array interface seems to have matured since 2019, and likely others would like to use this blocking algorithm. It was mostly "theoretical" in 2019, but I am very glad to see that other libraries are using this code!

If this were to be made public, I would suggest that we remove the word recursion from the function name.

@mhvk (Contributor) commented Apr 5, 2022

@hmaarrfk - yes, completely agreed that one should not rely on private functions. We test against numpy-dev to know if things break and obviously will copy the code if needed. Here, my goal was just to note that the function is actually useful!
