
Conversation

@mhvk (Contributor) commented Oct 5, 2025

In ufuncs in particular, a bottleneck for dealing with scalars is turning them into arrays. This PR adds a short-cut for those classes in `PyArray_FromAny`: if the array is not specifically requested to be writeable, a readable view of the scalar is returned. For Python scalars, only `float` is supported; `int` and `complex` are more tricky (strings and bytes should presumably be possible, but I didn't add them yet).
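
For illustration, a minimal sketch of the intended user-visible semantics; the `False` result below is what the PR's view would produce, not current main:

```python
import numpy as np

a = np.float64(1.5)

# With the short-cut, a conversion that does not request writeability can
# return a read-only view of the scalar's data instead of a fresh copy.
v = np.asarray(a)
print(v.flags.writeable)  # False for the view this PR returns (True on main)

# Explicitly requesting a copy still yields an independent, writeable array.
w = np.array(a)
print(w.flags.writeable)  # True either way
```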

Timings:
```
a = np.float64(1.5)
%timeit np.array(a, copy=False)  # Comparing with copy=True on main.
167->110 ns for float64, 94 ns for float
%timeit np.add(a, a)
512->431 ns for float64

a = 1.5
%timeit np.array(a, copy=False)  # Comparing with copy=True on main.
167->110 ns for float64, 94 ns for float
%timeit np.add(a, a)
526->414 ns
```

An annoyance is that quite a bit of code relies on `np.asanyarray(scalar)` to return a writeable array. As implemented, this is ensured by checking the `PyArray_FORCECAST` flag, which can be safely unset for `copy=False`. As a result, however, some of the error messages change in nature. This could be resolved with a new flag, if needed. Alternatively, this whole machinery could just be moved to `convert_ufunc_arguments`, since ufuncs arguably benefit most from this.
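
A minimal sketch of the kind of pattern that breaks (the helper `increment` is hypothetical, just illustrating the reliance on writeability):

```python
import numpy as np

def increment(x):
    # Common pattern: convert the input to an array and mutate it in place,
    # implicitly assuming np.asanyarray always returns something writeable.
    y = np.asanyarray(x)
    y += 1  # raises ValueError if y is a read-only view of a scalar
    return y

increment(np.float64(1.5))  # fine on main; would fail with a read-only view
```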

@seberg (Member) commented Oct 5, 2025

Hmmmm, returning a read-only view seems totally fine to me. But I am a bit surprised that it makes a 100 ns difference.
The allocation for the data should be cached here, so I don't think saving the allocation is actually meaningful, which would mean that `PyArray_Pack` is slow? But I don't really see why it should be meaningfully slow, since all dtype assignment functions should have a fast check for their scalar types (I can see that adding 10-20 ns, but 100?).

I am still curious whether we can't do a fast path for scalar array creation deeper down, but I guess I should try that myself a bit.

I am also worried that this very much oversimplifies things, to the point of being incorrect. One thing might be to check that `in_descr` is identical to the actual descriptor.
(One final surprise to me: this doesn't break `arr[indx]["field"] = val`?! Maybe I am missing it?)
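
For reference, the case in that last question is structured-array field assignment through a void scalar, which must keep writing through to the parent array; a minimal demonstration:

```python
import numpy as np

# arr[indx] on a structured array returns a np.void scalar whose buffer
# aliases the parent array's memory, so assigning to a field writes through.
arr = np.zeros(3, dtype=[("a", "f8"), ("b", "i4")])
arr[0]["a"] = 1.5
print(arr["a"][0])  # 1.5 -- the write reached the parent array
```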

@seberg (Member) commented Oct 5, 2025

Prodding it a bit, `PyArray_Pack` actually is rather slow in the sum of it. Some parts are clearly fixable; one is that `PyArray_Pack` deals with the possibility that the scalar dtype may not match the actual one, but that is impossible in the scalar case when `in_descr == NULL`. So for that case at least there definitely is some extra work (because I still think the allocation itself isn't the biggest issue here).

@mhvk (Contributor, Author) commented Oct 5, 2025

@seberg - I think the problem is real, but I am not sure this is the best way forward; in particular, I don't quite like the hack I did to make `asanyarray(scalar)` work. I think it would be fair game in `ufunc_object`, or perhaps with a flag that explicitly signals that any type of read-only view is OK.

The reason `void` still works is that its check is done before the one I added; if I swap them, it indeed doesn't work (found out the hard way, of course!).

Note that I did do timings at various places, and the memory allocation is definitely a problem - though not so much the allocation itself, but rather getting `mem_handler` and tracking the allocation in `PyDataMem_UserNEW`. Specifically, I tried `np.empty(1)`:

- 35 ns for entering `array_empty`
- ! 64 ns (+30) after parser
- 76 ns before `tp_alloc`
- ! 104 ns (+30) after `tp_alloc` (unavoidable?)
- 122 ns before `mem_handler`
- 139 ns just before data allocation
- !! 201 ns (+60) after data alloc (maybe excess?)
- 202 ns after `updateflags`
- 221 ns prior to exit

By replacing that part with a simple malloc for small sizes, this went down to 138 ns.

With my laptop in performance mode to try to make the timings more stable (so everything is faster):

```
%timeit np.empty(1)
122->82 ns
%timeit np.array(1.0)
133->92 ns???  (no tracing via change in alloc.c: 117 ns)
%timeit np.add(1.0, 2.0)
526->429 ns
```

I think on second thought this may be the better route. The main annoyance is that one has to deal with different options in dealloc, where a warning is raised if no `mem_handler` is available (that warning says one should go through `base`, which is how I ended up with the current PR...). Maybe another option for small allocations (or at least for scalars) is to use `tp_alloc(type, extra_items)`.

@seberg (Member) commented Oct 5, 2025

I was prodding it a bit myself; @ngoldbaum once suggested samply (which is pretty nice, as it drops you into a browser immediately with just `samply -r 10000 -- spin python perf.py`, and you can remove the stuff you don't want).
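
The `perf.py` here would just be a small driver exercising the hot path; a hypothetical example:

```python
# perf.py -- hypothetical micro-benchmark driver to profile with samply;
# it simply hammers the scalar ufunc path discussed above.
import numpy as np

a = np.float64(1.5)
for _ in range(1_000_000):
    np.add(a, a)
```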

But I got a bit side-tracked in `setitem`; that won't make a big difference, but I think I may make a PR anyway, in a bit.

> By replacing that part with a simple malloc for small sizes, this went down to 138 ns.

I wouldn't even try to use malloc; if anything it is probably worse. When it comes to array creation speed, I can see two things:

1. If the handler is the default one, we could merge the allocations for small or 0-D arrays.
2. We could consider creating a free-list for small arrays, even to the point of adding space for `npy_clongdouble` to every array object, so that for small allocations (without a custom handler, unfortunately) the allocation is always included. Or at least always included for specially marked 0-D arrays that are put on the free-list (see the sketch below).
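
To illustrate the free-list idea, a hypothetical sketch in Python (for illustration only; the real version would be C code in NumPy's array allocation/deallocation paths, and all names here are made up):

```python
# Keep a small cache of fixed-size buffers and recycle them instead of
# asking the allocator every time.
_FREE_LIST = []
_MAX_CACHED = 128
_BUF_SIZE = 32  # assume enough for one npy_clongdouble element

def small_alloc():
    # Reuse a cached buffer when one is available.
    if _FREE_LIST:
        return _FREE_LIST.pop()
    return bytearray(_BUF_SIZE)

def small_free(buf):
    # Return the buffer to the cache rather than freeing it outright.
    if len(_FREE_LIST) < _MAX_CACHED:
        _FREE_LIST.append(buf)
```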

> !! 201 ns (+60) after data alloc (maybe excess?)

Yes, we can't avoid a bit of it at least, but looking at the profile there is a silly amount spent in `PyCapsule_GetPointer`, which we can skip for the default allocator (and this overhead is repeated in the deallocation; it's not much, but it may be 5% of the total time -- if the rest gets significantly faster then this may become visible).

EDIT:

> 133->92 ns??? (no tracing via change in alloc.c: 117 ns)

Hmmmm, more significant than I would have thought. Maybe a reason for merging allocations for small arrays.

@mhvk (Contributor, Author) commented Oct 7, 2025

> I was prodding it a bit myself; @ngoldbaum once suggested samply (which is pretty nice, as it drops you into a browser immediately with just `samply -r 10000 -- spin python perf.py`, and you can remove the stuff you don't want).

Where can I find this samply? (Searching gives me audio samplers/streamers or some distribution sampler...)

@seberg (Member) commented Oct 7, 2025

Hah, true, it's a bit hidden; conda seems to have it: https://github.com/mstange/samply

@mhvk (Contributor, Author) commented Oct 13, 2025

Moving this to draft since I think it is not a bad idea, but it would definitely need to be "on request" -- and it is not clear there is a need for it if storing the data on the array instance works well enough.

mhvk marked this pull request as draft October 13, 2025 20:27
seberg pushed a commit that referenced this pull request Oct 28, 2025
With gh-28576, it has become possible to ensure that the output of a ufunc call is an array.

This PR uses that to replace `np.asanyarray` calls in some functions. I'm not sure this is complete -- these were just the parts of the code for which tests failed if `np.asanyarray(scalar)` is made to return a read-only array (see gh-29876).

I do think these are all small improvements, though.
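
A hypothetical before/after illustrating the kind of replacement described (not the exact code touched by the commit):

```python
import numpy as np

x = np.float64(1.5)

# Before: go through asanyarray to get an array, then operate on it.
# This implicitly relied on the result being a writeable array.
y = np.asanyarray(x)
y = y * 2

# After: call the ufunc directly; per the description above, gh-28576
# makes it possible to ensure the ufunc's output is an array, so the
# asanyarray round-trip can be dropped.
y = np.multiply(x, 2)
```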
