ENH: structured_to_unstructured: view more often #23652

aurivus-ph · 2023-04-24T09:12:29Z

Converting with structured_to_unstructured() now returns a view, if all the fields have a constant stride.

This is useful, if a structured array has attributes with multiple types, but all the attributes with the same type are contiguous, e.g. a vertex with x, y, z as float32, and r, g, b as uint8.

This will allow the following now (all the np.shares_memory() calls now return true):

import numpy as np
import numpy.lib.recfunctions as rfn

a = np.array([(1, 2, 3, 50, 127, 255), (4, 5, 6, 80, 100, 120)],
             dtype=[('x', 'f4'), ('y', 'f4'), ('z', 'f4'), ('red', 'u1'), ('green', 'u1'), ('blue', 'u1')])


# We can extract individual fields of a structured array, if they have a constant stride, without needing a copy:
xyz = rfn.structured_to_unstructured(a[['x', 'y', 'z']])
print(np.shares_memory(xyz, a))  # Output: True
print(xyz)
# Output:
# [[1. 2. 3.]
#  [4. 5. 6.]]

rgb = rfn.structured_to_unstructured(a[['red', 'green', 'blue']])
print(np.shares_memory(rgb, a))  # Output: True
print(rgb)
# Output:
# [[ 50 127 255]
#  [ 80 100 120]]


# Changing the order of fields is allowed too, as long as the stride remains constant:
rgb = rfn.structured_to_unstructured(a[['blue', 'red']])
print(np.shares_memory(rgb, a))  # Output: True
print(rgb)
# Output:
# [[255  50]
#  [120  80]]


# Arrays with sub-arrays are supported too:
a2 = np.array([((1, 2, 3), ((4, 5), (6, 7)), 999)], dtype=[('a', ('f4', 3)), ('b', (('f4', 2), 2)), ('unused', 'f4')])
print(a2)

ab = rfn.structured_to_unstructured(a2[['a', 'b']])
print(np.shares_memory(a2, ab))  # Output: True
print(ab)
# Output:
# [[1. 2. 3. 4. 5. 6. 7.]]

Observable changes / issues:

the .bases attribute of the returned array is more deeply nested now
darray subclasses are not preserved anymore, if copy is False
- some subclasses would have strange behavior, e.g. matrix always keeps 2 dimensions, but we want to add a new dimension during unstructuring
- as_strided() does not ensure all dimensions are kept on array subclasses either
- Is this an issue? The only I can think of to fix this is to go through the array_interface.

Converting with structured_to_unstructured() now returns a view, if all the fields have a constant stride.

mattip · 2023-04-25T07:24:16Z

This is a behaviour change. Could you propose it to the mailing list to see if there might be people who do not expect views, or wish to preserve the array class across calls to structured_to_unstructured?

aurivus-ph · 2023-04-25T15:56:57Z

I have written a post on the mailing list: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/Y2RKZHRAI5Q5SPVTG35ISKIASPFZ7J47/

mattip · 2023-04-27T15:23:37Z

Hmm. No one responded on the mailing list, so I guess this is up to the reviewers. Does anyone have concerns about this change in behavior? I am a little concerned that someone may modify the view without realizing they are modifying the original data. If no-one responds in a few days, I would say we should go ahead but

Please document the changed behaviour in a release note, following the upcoming changes README
The docstring for structured_to_unstructured should mention that the returned value can be a view into the original data

mhvk

I like this! I think if one passes in copy=False, one should not get a copy unless absolutely necessary.

But I don't think there should be a behaviour change in subclasses suddenly turning into ndarray -- e.g., I'm fairly sure that would break astropy's tests for its Quantity class.

I think for matrix, the attempt at a view should just not be made (matrix is deprecated anyway, but might as well keep the behaviour like it was).

For other subclasses there would seem to be a choice: either just not take the view path for those either, or assume that they do the right thing -- and really that is up to them! (And if they override __array_function__, they can ensure it will work.) My tendency would be to go for assuming they behave properly.

One thing that might help inform this is to explicitly test this on np.ma.MaskedArray. Though to be honest I somewhat doubt the code actually works on those currently...

mhvk · 2023-04-27T16:05:04Z

numpy/lib/recfunctions.py

+        # stride, we can just return a view
+        common_stride = _common_stride(offsets, counts, out_dtype.itemsize)
+        if common_stride is not None:
+            # ensure that we have a real ndarray; other types (e.g. matrix)


I think we should trust that subclasses do this right, and instead explicitly exclude matrix - this is done elsewhere in the code too, and matrix is deprecated anyway.

I only allow the types ndarray, recarray and memmap, at least for now.

mhvk · 2023-04-27T16:05:48Z

numpy/lib/recfunctions.py

+            new_shape = arr.shape + (sum(counts), out_dtype.itemsize)
+            new_strides = arr.strides + (abs(common_stride), 1)
+
+            arr = arr[..., None].view(np.uint8)  # view as bytes


I'd use np.newaxis - it is an alias of None, but makes clear what the intent is.

mhvk · 2023-04-27T16:07:32Z

numpy/lib/recfunctions.py

+            # have strange slicing behavior
+            arr = arr.view(type=np.ndarray)
+            new_shape = arr.shape + (sum(counts), out_dtype.itemsize)
+            new_strides = arr.strides + (abs(common_stride), 1)


Why not use a negative common_stride? It does mean you have to be careful with the offset below, but there is nothing wrong with a negative stride (right now, your [::-1] at the end does the same thing).

Can you give a suggestion of how you would do that cleanly?

I can of course do something like this:

# first_field_offset is the first byte of the first field ar1 = arr[..., first_field_offset:] # new_strides may be negative here arr2 = np.lib.stride_tricks.as_strided(arr, new_shape, new_strides)

If the stride is negative, all but the last element will be cut off by the slicing operation.
The as_strided() call then alters the strides, so that all elements can be accessed again.

I don't like that, because arr2 is derived from arr1, but it is accessing memory that is out-of-bounds for arr1.

Looking at it again, I think things should just work with a negative stride, as long as you make sure that the offset is guaranteed to be for the first item in the list - for a negative stride, that will be largest offset, so the slice below will not go out of memory.

It may well mean that you have to adjust _common_stide - but I would not be surprised if it actually simplified it!

Looking at it again, I think things should just work with a negative stride, as long as you make sure that the offset is guaranteed to be for the first item in the list - for a negative stride, that will be largest offset, so the slice below will not go out of memory.

You mean the code that I showed in my previous comment, but with first_field_offset == offsets[0]? That is what I meant, and I do not like that approach (see below).

It may well mean that you have to adjust _common_stide - but I would not be surprised if it actually simplified it!

I don't see how that simplifies that function. It would stay identical as far as I can see.
Can you point me to what you would simplify or give an example?

Let me illustrate why I don't like using a negative stride here:

Let's say we have a structured array with two fields 'a' and 'b'. The first field is 'a', but 'b' is the first field in memory.
This would look like this: (b0 is the first byte of b, b1 the second, etc. The array indices are written on top. They are equal to byte offsets at this point, because we are viewing the array as uint8.)

0 1 2 3 4 5 6 7 [b0 b1 b2 b3 a0 a1 a2 a3]

Now we perform the slicing operation to set the correct offset. first_field_offset is 4 in this case.

> arr = arr[..., first_field_offset:]

Now the memory looks like this:
All bytes of the field b are out of bounds.

- - - - 0 1 2 3 [b0 b1 b2 b3 a0 a1 a2 a3]

Now we set the correct offset.
new_shape would be something like (..., 2, 4).
new_strides would be something like (..., -4, 1).

This is the part I do not like, because we are effectively accessing memory, that was (temporarily) out-of-bounds.

> arr = np.lib.stride_tricks.as_strided(arr, new_shape, new_strides)

After that, it looks like this:

(1, 0) (1, 1) (1, 2) (1, 3) (0, 0) (0, 1) (0, 2) (0, 3) [b0 b1 b2 b3 a0 a1 a2 a3 ]

seberg · 2023-04-27T16:37:54Z

In general, I am happy with the change/improvement, should maybe have a release note, but the function already says it tries return views (IIRC). I admit, it's a bit unfortunate complex, but it's not terrible. (I have not looked super closely, yet)

aurivus-ph · 2023-04-28T12:55:46Z

I think for matrix, the attempt at a view should just not be made (matrix is deprecated anyway, but might as well keep the behaviour like it was).

For other subclasses there would seem to be a choice: either just not take the view path for those either, or assume that they do the right thing -- and really that is up to them! (And if they override __array_function__, they can ensure it will work.) My tendency would be to go for assuming they behave properly.

One thing that might help inform this is to explicitly test this on np.ma.MaskedArray. Though to be honest I somewhat doubt the code actually works on those currently...

I had not considered masked arrays before; yes, they are not handled correctly. I honestly don't see how I could write a generic approach that would work with most subclasses. E.g. I don't see a way to handle np.ma.MaskedArray without special-casing it here.

Therefore, I would rather stay on the safe side, and only return views for types where we know that this works (i.e. if the original array is a plain ndarray). I.e. we would whitelist only ndarray, instead of blacklisting matrix and np.ma.MaskedArray.

I don't want to implement a special case for masked arrays for now. That can be implemented in a later PR if needed.

mattip · 2023-04-28T14:12:20Z

we would accept only ndarray

Makes sense to err on the side of caution.

mhvk · 2023-04-28T16:29:29Z

As for when to do this: one possibility is to check whether array.__class__.__array_function__ is ndarray.__array_function__ - if so, the subclass has not overridden functions, so one should probably be conservative and not use the view (this will be the case for Matrix and MaskedArray).

But if array.__class__.__array_function__ is not ndarray.__array_function__, the subclass is all set up to handle possible problems, so one might as well aim to try to let them benefit. Certainly for astropy, this would work - Quantity and all its subclasses would benefit automatically, as would our Masked classes. Indeed, for our Masked classes, this would ensure we do not have to change anything, as its mask and data get passed through independently. Since the mask is guaranteed to be ndarray, while the data can be Quantity, it would be nice if by default they get treated the same.

aurivus-ph · 2023-05-02T07:49:02Z

As for when to do this: one possibility is to check whether array.__class__.__array_function__ is ndarray.__array_function__ - if so, the subclass has not overridden functions, so one should probably be conservative and not use the view (this will be the case for Matrix and MaskedArray).

But if array.__class__.__array_function__ is not ndarray.__array_function__, the subclass is all set up to handle possible problems, so one might as well aim to try to let them benefit. Certainly for astropy, this would work - Quantity and all its subclasses would benefit automatically, as would our Masked classes. Indeed, for our Masked classes, this would ensure we do not have to change anything, as its mask and data get passed through independently. Since the mask is guaranteed to be ndarray, while the data can be Quantity, it would be nice if by default they get treated the same.

Which Masked classes do you mean? Do you mean MaskedArray in both cases?

As I read your first paragraph, you don't want me to return a view for MaskedArray.
But your second paragraph sounds like you are saying that it would just work (i.e. you're saying it would benefit). That is contradictory to me.

I cannot see MaskedArrays working without custom code. It fails at the line where we do a view(np.uint8), because that would require a longer mask.

Apart from that, I have tried it with astropy and I don't think we need to do anything about it. It already works correctly, even if I only whitelist ndarray in our code (and fall back to a copy for all other types), because astropy already overrides structured_to_unstructured.

So in summary, I will go for the simple approach, just whitelisting a few classes that are known to work. All others will fall back to the old implementation.
This already works for astropy, because it overrides structured_to_unstructured.

aurivus-ph · 2023-05-02T11:02:35Z

Updated.

Now the new code is only used on ndarray, memmap and recarray. Other subclasses may be implemented later. Other projects can implement these themselves.

I have had to add a call to __array_wrap__ before returning. I hope it is correct this way.
I think it would be fine for recarray to be viewed as ndarray (because we are unstructuring after all), but this does not match the behavior of the function with copy=True.

mhvk · 2023-05-02T12:43:59Z

Which Masked classes do you mean? Do you mean MaskedArray in both cases?

Sorry that my earlier message was not clear: astropy has its own Masked class, which can interact with ndarray subclasses forming MaskedNDArray, MaskedQuantity, etc. It was needed as numpy's MaskedArray just does not interact well with subclasses at all.

mhvk · 2023-05-02T12:46:15Z

A more philosophical point: I really don't think code in numpy should start deciding for subclasses what's best for them, and thus exclude them from improvements. I think my logic of checking whether a subclass is able to work around things if possible is about as kind as it should get; a blacklist for Matrix and MaskedArray would be fine too.

mhvk · 2023-05-02T12:48:07Z

numpy/lib/recfunctions.py

+
+            if common_stride < 0:
+                arr = arr[..., ::-1]  # reverse, if the stride was negative
+            if type(arr) is not type(wrap.__self__):


There is no need to do this -- by giving subok=True, the wrapping is already done in as_strided (and off course the various views are already OK). If you want to be sure, I suggest adding a test.

I have added a test, that's how I discovered that this is necessary.

It seems the recarray turns into a ndarray once I do the view(np.uint8). I don't think that's usually a problem, because a recarray isn't very useful for a non-structured dtype.
But the old behavior is to keep the subclass anyway (which still happens with copy=True) and I don't want to change that without a good reason.

Ah, fair enough, another class that is showing its age (and informally deprecated...), though I guess it is logical. It is a pity as_strided doesn't take an offsets argument!

But for here, maybe add a note in the code for why we do the wrap?

aurivus-ph · 2023-05-08T06:56:48Z

@seberg @mhvk @mattip
Is there anything left to do? This is finished now from my point of view.

The failed "checks" seem to be unrelated to my changes.

ahaldane · 2023-05-19T20:13:42Z

Original author of this function here... this is a great idea!

Since this function already sometimes returned a view and had an existing copy-argument, it seems like the risk of someone's code getting screwed up by now getting a view instead of a copy is low, so I don't think back-compat is too much of an issue (though I've been wrong before...)

I did not check the logic in as fine detail as @mhvk, but reading through it looked OK.

I suggest updating the docstring for the "copy" argument, and also adding a version note since there is a very tiny risk of back-compat issues. Maybe something like:

copy : bool, optional
        If true, always return a copy. If false, a view is returned if possible, which is when 
        the `dtype`  and strides of the fields are suitable and the array subtype is one of 
        `np.ndarray`, `np.recarray` or `np.memmap`.

        .. versionchanged:: 1.25.0
            A view can now be returned if the fields are separated by a uniform stride.

aurivus-ph · 2023-05-22T07:11:33Z

@ahaldane Thanks! I have updated the docstring with your suggestion.

Note that if there is no padding in between the fields (i.e. stride = itemsize), a view can still be returned for some additional array subtypes, such as matrix, which is the old behavior. However, I could not come up with a concise description that includes this exception for the docstring.

ahaldane · 2023-05-22T16:50:19Z

numpy/lib/recfunctions.py

+                                                  subok=True)
+
+            # cast and drop the last dimension again
+            arr = arr.view(out_dtype)[..., 0]


Just noticed something else: Is there a reason I'm missing for not using np.ndarray instead of the last few lines:

new_shape = arr.shape + (sum(counts),) new_strides = arr.strides + (abs(common_stride),) arr = np.ndarray(buffer=arr, dtype=out_dtype, shape=new_shape, strides=new_strides, offset=min(offsets)).view(type(a))

I was not aware that constructor existed. I will try that out.

I'm facing two issues with that:

The np.ndarray() throws "ValueError: ndarray is not contiguous" in some cases. At least it does when the input array is reversed; maybe in other cases too.

I couldn't find a way to preserve the memmap subclass yet. Just using view() to get it doesn't seem very useful, because that would lose its attributes like filename (if it even works). And its __array_wrap__() seems to deliberately return a plain ndarray instead.

Hmm, thanks for checking. Then the code is fine as-is.

Somehow it seems like an oversight in the API that the np.ndarray(buffer=xxx) can't handle non-contiguous inputs. The memmap issue could probably be fixed by appropriate call to __new__, but that's a moot point.

ahaldane · 2023-05-22T16:54:45Z

For the docstring, you're right it is not totally accurate. How about making it technically accurate by being more open-ended about when exactly a view is returned by changing "which is" to "such as".

        If true, always return a copy. If false, a view is returned if
        possible, such as when the `dtype` and strides of the fields are
        suitable and the array subtype is one of `np.ndarray`, `np.recarray`
        or `np.memmap`.

The other case it can return a view is if the "flattened" field structure is the same as the "flattened" structure with all fields converted to the same dtype, which is a bit too clunky and unclear to say.

aurivus-ph · 2023-05-23T08:35:31Z

[...] How about making it technically accurate by being more open-ended about when exactly a view is returned by changing "which is" to "such as".

I have applied that.

ahaldane · 2023-05-23T16:05:00Z

Thanks @aurivus-ph ! The PR looks ready to merge to me.

I don't have commit rights right now, but anyone else please merge.

mattip · 2023-05-23T20:27:33Z

Thanks @aurivus-ph and the reviewers

aurivus-ph · 2023-05-24T06:44:23Z

Thank you very much for the review and getting this merged!

github-actions bot added the 01 - Enhancement label Apr 24, 2023

ENH: structured_to_unstructured: view more often

5497568

Converting with structured_to_unstructured() now returns a view, if all the fields have a constant stride.

aurivus-ph force-pushed the faster-structured-to-unstructured branch from 582d6b4 to 5497568 Compare April 24, 2023 09:14

mhvk reviewed Apr 27, 2023

View reviewed changes

aurivus-ph added 2 commits April 28, 2023 15:22

only unstructure a few array types

326f3bb

use np.newaxis instead of None

a304021

aurivus-ph added 3 commits May 2, 2023 10:46

whitespace. remove redundant if

6501fc9

changelog

d40a01a

ensure that ndarray subclasses are passed through

ed29bda

lint

fe4543b

aurivus-ph force-pushed the faster-structured-to-unstructured branch from a315043 to fe4543b Compare May 2, 2023 11:06

mhvk reviewed May 2, 2023

View reviewed changes

add note for the wrap call

3fa6295

mattip requested a review from mhvk May 9, 2023 13:30

updated docstring

a8752d2

ahaldane reviewed May 22, 2023

View reviewed changes

updated docstring again

836735e

mattip merged commit 2540554 into numpy:main May 23, 2023
59 checks passed

aurivus-ph deleted the faster-structured-to-unstructured branch May 24, 2023 06:43

ENH: structured_to_unstructured: view more often #23652

ENH: structured_to_unstructured: view more often #23652

Conversation

aurivus-ph commented Apr 24, 2023

mattip commented Apr 25, 2023

aurivus-ph commented Apr 25, 2023

mattip commented Apr 27, 2023

mhvk left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

seberg commented Apr 27, 2023

aurivus-ph commented Apr 28, 2023

mattip commented Apr 28, 2023

mhvk commented Apr 28, 2023

aurivus-ph commented May 2, 2023

aurivus-ph commented May 2, 2023

mhvk commented May 2, 2023

mhvk commented May 2, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

aurivus-ph commented May 8, 2023

ahaldane commented May 19, 2023

aurivus-ph commented May 22, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ahaldane commented May 22, 2023

aurivus-ph commented May 23, 2023

ahaldane commented May 23, 2023

mattip commented May 23, 2023

aurivus-ph commented May 24, 2023