
Conversation

@mhvk (Contributor) commented Oct 5, 2025

In ufuncs in particular, a bottleneck for dealing with scalars is turning them into arrays. This PR adds a short-cut for those classes in `PyArray_FromAny`: if the array is not specifically requested to be writeable, a readable view of the scalar is returned. For Python scalars, only `float` is supported; `int` and `complex` are more tricky (strings and bytes should presumably be possible, but I didn't add them yet).
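
For illustration, a minimal sketch of the intended user-visible semantics; the `False` result below is what the PR's view would produce, not current main:

```python
import numpy as np

a = np.float64(1.5)

# With the short-cut, a conversion that does not request writeability can
# return a read-only view of the scalar's data instead of a fresh copy.
v = np.asarray(a)
print(v.flags.writeable)  # False for the view this PR returns (True on main)

# Explicitly requesting a copy still yields an independent, writeable array.
w = np.array(a)
print(w.flags.writeable)  # True either way
```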

Timings:
```
a = np.float64(1.5)
%timeit np.array(a, copy=False)  # Comparing with copy=True on main.
167->110 ns for float64, 94 ns for float
%timeit np.add(a, a)
512->431 ns for float64

a = 1.5
%timeit np.array(a, copy=False)  # Comparing with copy=True on main.
167->110 ns for float64, 94 ns for float
%timeit np.add(a, a)
526->414 ns
```

An annoyance is that quite a bit of code relies on `np.asanyarray(scalar)` to return a writeable array. As implemented, this is ensured by checking the `PyArray_FORCECAST` flag, which can be safely unset for `copy=False`. As a result, however, some of the error messages change in nature. This could be resolved with a new flag, if needed. Alternatively, this whole machinery could just be moved to `convert_ufunc_arguments`, since ufuncs arguably benefit most from this.
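
A minimal sketch of the kind of pattern that breaks (the helper `increment` is hypothetical, just illustrating the reliance on writeability):

```python
import numpy as np

def increment(x):
    # Common pattern: convert the input to an array and mutate it in place,
    # implicitly assuming np.asanyarray always returns something writeable.
    y = np.asanyarray(x)
    y += 1  # raises ValueError if y is a read-only view of a scalar
    return y

increment(np.float64(1.5))  # fine on main; would fail with a read-only view
```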

@seberg (Member) commented Oct 5, 2025

Hmmmm, returning a read-only view seems totally fine to me. But I am a bit surprised that it makes a 100 ns difference.
The allocation for the data should be cached here, so I don't think saving the allocation is actually meaningful, which would mean that `PyArray_Pack` is slow? But I don't really see why it should be meaningfully slow, since all dtype assignment functions should have a fast check for their scalar types (I can see that adding 10-20 ns, but 100?).

I am still curious whether we can't do a fast path for scalar array creation deeper down, but I guess I should try that myself a bit.

I am also worried that this very much oversimplifies things, to the point of being incorrect. One thing might be to check that `in_descr` is identical to the actual descriptor.
(One final surprise to me: this doesn't break `arr[indx]["field"] = val`?! Maybe I am missing it?)
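
For reference, the case in that last question is structured-array field assignment through a void scalar, which must keep writing through to the parent array; a minimal demonstration:

```python
import numpy as np

# arr[indx] on a structured array returns a np.void scalar whose buffer
# aliases the parent array's memory, so assigning to a field writes through.
arr = np.zeros(3, dtype=[("a", "f8"), ("b", "i4")])
arr[0]["a"] = 1.5
print(arr["a"][0])  # 1.5 -- the write reached the parent array
```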

@seberg (Member) commented Oct 5, 2025

Prodding it a bit, `PyArray_Pack` actually is rather slow in the sum of it. Some parts are clearly fixable; one is that `PyArray_Pack` deals with the possibility that the scalar dtype may not match the actual one, but that is impossible in the scalar case when `in_descr == NULL`. So for that case at least there definitely is some extra work (because I still think the allocation itself isn't the biggest issue here).

@mhvk (Contributor, Author) commented Oct 5, 2025

@seberg - I think the problem is real, but I am not sure this is the best way forward; in particular, I don't quite like the hack I did to make `asanyarray(scalar)` work. I think it would be fair game in `ufunc_object`, or perhaps with a flag that explicitly signals that any type of read-only view is OK.

The reason `void` still works is that its check is done before the one I added; if I swap them, it indeed doesn't work (found out the hard way, of course!).

Note that I did do timings at various places, and the memory allocation is definitely a problem - though not so much the allocation itself, but rather getting `mem_handler` and tracking the allocation in `PyDataMem_UserNEW`. Specifically, I tried `np.empty(1)`:

- 35 ns for entering `array_empty`
- ! 64 ns (+30) after parser
- 76 ns before `tp_alloc`
- ! 104 ns (+30) after `tp_alloc` (unavoidable?)
- 122 ns before `mem_handler`
- 139 ns just before data allocation
- !! 201 ns (+60) after data alloc (maybe excess?)
- 202 ns after `updateflags`
- 221 ns prior to exit

By replacing that part with a simple malloc for small sizes, this went down to 138 ns.

With my laptop in performance mode to try to make the timings more stable (so everything is faster):

```
%timeit np.empty(1)
122->82 ns
%timeit np.array(1.0)
133->92 ns???  (no tracing via change in alloc.c: 117 ns)
%timeit np.add(1.0, 2.0)
526->429 ns
```

I think on second thought this may be the better route. The main annoyance is that one has to deal with different options in dealloc, where a warning is raised if no `mem_handler` is available (that warning says one should go through `base`, which is how I ended up with the current PR...). Maybe another option for small allocations (or at least for scalars) is to use `tp_alloc(type, extra_items)`.

@seberg (Member) commented Oct 5, 2025

I was prodding it a bit myself; @ngoldbaum once suggested samply (which is pretty nice, as it drops you into a browser immediately with just `samply -r 10000 -- spin python perf.py`, and you can remove the stuff you don't want).
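
The `perf.py` here would just be a small driver exercising the hot path; a hypothetical example:

```python
# perf.py -- hypothetical micro-benchmark driver to profile with samply;
# it simply hammers the scalar ufunc path discussed above.
import numpy as np

a = np.float64(1.5)
for _ in range(1_000_000):
    np.add(a, a)
```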

But I got a bit side-tracked in `setitem`; that won't make a big difference, but I think I may make a PR anyway, in a bit.

> By replacing that part with a simple malloc for small sizes, this went down to 138 ns.

I wouldn't even try to use malloc; if anything it is probably worse. When it comes to array creation speed, I can see two things:

1. If the handler is the default one, we could merge the allocations for small or 0-D arrays.
2. We could consider creating a free-list for small arrays, even to the point of adding space for `npy_clongdouble` to every array object, so that for small allocations (without a custom handler, unfortunately) the allocation is always included. Or at least always included for specially marked 0-D arrays that are put on the free-list (see the sketch below).
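
To illustrate the free-list idea, a hypothetical sketch in Python (for illustration only; the real version would be C code in NumPy's array allocation/deallocation paths, and all names here are made up):

```python
# Keep a small cache of fixed-size buffers and recycle them instead of
# asking the allocator every time.
_FREE_LIST = []
_MAX_CACHED = 128
_BUF_SIZE = 32  # assume enough for one npy_clongdouble element

def small_alloc():
    # Reuse a cached buffer when one is available.
    if _FREE_LIST:
        return _FREE_LIST.pop()
    return bytearray(_BUF_SIZE)

def small_free(buf):
    # Return the buffer to the cache rather than freeing it outright.
    if len(_FREE_LIST) < _MAX_CACHED:
        _FREE_LIST.append(buf)
```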

> !! 201 ns (+60) after data alloc (maybe excess?)

Yes, we can't avoid a bit of it at least, but looking at the profile there is a silly amount spent in `PyCapsule_GetPointer`, which we can skip for the default allocator (and this overhead is repeated in the deallocation; it's not much, but it may be 5% of the total time -- if the rest gets significantly faster then this may become visible).

EDIT:

> 133->92 ns??? (no tracing via change in alloc.c: 117 ns)

Hmmmm, more significant than I would have thought. Maybe a reason for merging allocations for small arrays.

@mhvk (Contributor, Author) commented Oct 7, 2025

> I was prodding it a bit myself; @ngoldbaum once suggested samply (which is pretty nice, as it drops you into a browser immediately with just `samply -r 10000 -- spin python perf.py`, and you can remove the stuff you don't want).

Where can I find this samply? (Searching gives me audio samplers/streamers or some distribution sampler...)

@seberg (Member) commented Oct 7, 2025

Hah, true, it's a bit hidden; conda seems to have it: https://github.com/mstange/samply

@mhvk (Contributor, Author) commented Oct 13, 2025

Moving this to draft since I think it is not a bad idea, but it would definitely need to be "on request" -- and it is not clear there is a need for it if storing the data on the array instance works well enough.

mhvk marked this pull request as draft October 13, 2025 20:27
seberg pushed a commit that referenced this pull request Oct 28, 2025
With gh-28576, it has become possible to ensure that the output of a ufunc call is an array.

This PR uses that to replace `np.asanyarray` calls in some functions. I'm not sure this is complete -- these were just the parts of the code for which tests failed if `np.asanyarray(scalar)` is made to return a read-only array (see gh-29876).

I do think these are all small improvements, though.
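
A hypothetical before/after illustrating the kind of replacement described (not the exact code touched by the commit):

```python
import numpy as np

x = np.float64(1.5)

# Before: go through asanyarray to get an array, then operate on it.
# This implicitly relied on the result being a writeable array.
y = np.asanyarray(x)
y = y * 2

# After: call the ufunc directly; per the description above, gh-28576
# makes it possible to ensure the ufunc's output is an array, so the
# asanyarray round-trip can be dropped.
y = np.multiply(x, 2)
```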
