Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: structured_to_unstructured: view more often #23652

Merged
merged 10 commits into from May 23, 2023

Conversation

aurivus-ph
Copy link
Contributor

Converting with structured_to_unstructured() now returns a view, if all the fields have a constant stride.

This is useful, if a structured array has attributes with multiple types, but all the attributes with the same type are contiguous, e.g. a vertex with x, y, z as float32, and r, g, b as uint8.

This will allow the following now (all the np.shares_memory() calls now return true):

import numpy as np
import numpy.lib.recfunctions as rfn

a = np.array([(1, 2, 3, 50, 127, 255), (4, 5, 6, 80, 100, 120)],
             dtype=[('x', 'f4'), ('y', 'f4'), ('z', 'f4'), ('red', 'u1'), ('green', 'u1'), ('blue', 'u1')])


# We can extract individual fields of a structured array, if they have a constant stride, without needing a copy:
xyz = rfn.structured_to_unstructured(a[['x', 'y', 'z']])
print(np.shares_memory(xyz, a))  # Output: True
print(xyz)
# Output:
# [[1. 2. 3.]
#  [4. 5. 6.]]

rgb = rfn.structured_to_unstructured(a[['red', 'green', 'blue']])
print(np.shares_memory(rgb, a))  # Output: True
print(rgb)
# Output:
# [[ 50 127 255]
#  [ 80 100 120]]


# Changing the order of fields is allowed too, as long as the stride remains constant:
rgb = rfn.structured_to_unstructured(a[['blue', 'red']])
print(np.shares_memory(rgb, a))  # Output: True
print(rgb)
# Output:
# [[255  50]
#  [120  80]]


# Arrays with sub-arrays are supported too:
a2 = np.array([((1, 2, 3), ((4, 5), (6, 7)), 999)], dtype=[('a', ('f4', 3)), ('b', (('f4', 2), 2)), ('unused', 'f4')])
print(a2)

ab = rfn.structured_to_unstructured(a2[['a', 'b']])
print(np.shares_memory(a2, ab))  # Output: True
print(ab)
# Output:
# [[1. 2. 3. 4. 5. 6. 7.]]

Observable changes / issues:

  • the .bases attribute of the returned array is more deeply nested now
  • darray subclasses are not preserved anymore, if copy is False
    • some subclasses would have strange behavior, e.g. matrix always keeps 2 dimensions, but we want to add a new dimension during unstructuring
    • as_strided() does not ensure all dimensions are kept on array subclasses either
    • Is this an issue? The only I can think of to fix this is to go through the array_interface.

Converting with structured_to_unstructured() now returns a view, if all
the fields have a constant stride.
@aurivus-ph aurivus-ph force-pushed the faster-structured-to-unstructured branch from 582d6b4 to 5497568 Compare April 24, 2023 09:14
@mattip
Copy link
Member

mattip commented Apr 25, 2023

This is a behaviour change. Could you propose it to the mailing list to see if there might be people who do not expect views, or wish to preserve the array class across calls to structured_to_unstructured?

@aurivus-ph
Copy link
Contributor Author

@mattip
Copy link
Member

mattip commented Apr 27, 2023

Hmm. No one responded on the mailing list, so I guess this is up to the reviewers. Does anyone have concerns about this change in behavior? I am a little concerned that someone may modify the view without realizing they are modifying the original data. If no-one responds in a few days, I would say we should go ahead but

  • Please document the changed behaviour in a release note, following the upcoming changes README
  • The docstring for structured_to_unstructured should mention that the returned value can be a view into the original data

Copy link
Contributor

@mhvk mhvk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this! I think if one passes in copy=False, one should not get a copy unless absolutely necessary.

But I don't think there should be a behaviour change in subclasses suddenly turning into ndarray -- e.g., I'm fairly sure that would break astropy's tests for its Quantity class.

I think for matrix, the attempt at a view should just not be made (matrix is deprecated anyway, but might as well keep the behaviour like it was).

For other subclasses there would seem to be a choice: either just not take the view path for those either, or assume that they do the right thing -- and really that is up to them! (And if they override __array_function__, they can ensure it will work.) My tendency would be to go for assuming they behave properly.

One thing that might help inform this is to explicitly test this on np.ma.MaskedArray. Though to be honest I somewhat doubt the code actually works on those currently...

# stride, we can just return a view
common_stride = _common_stride(offsets, counts, out_dtype.itemsize)
if common_stride is not None:
# ensure that we have a real ndarray; other types (e.g. matrix)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should trust that subclasses do this right, and instead explicitly exclude matrix - this is done elsewhere in the code too, and matrix is deprecated anyway.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only allow the types ndarray, recarray and memmap, at least for now.

new_shape = arr.shape + (sum(counts), out_dtype.itemsize)
new_strides = arr.strides + (abs(common_stride), 1)

arr = arr[..., None].view(np.uint8) # view as bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd use np.newaxis - it is an alias of None, but makes clear what the intent is.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

# have strange slicing behavior
arr = arr.view(type=np.ndarray)
new_shape = arr.shape + (sum(counts), out_dtype.itemsize)
new_strides = arr.strides + (abs(common_stride), 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use a negative common_stride? It does mean you have to be careful with the offset below, but there is nothing wrong with a negative stride (right now, your [::-1] at the end does the same thing).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give a suggestion of how you would do that cleanly?

I can of course do something like this:

# first_field_offset is the first byte of the first field
ar1 = arr[..., first_field_offset:]
# new_strides may be negative here
arr2 = np.lib.stride_tricks.as_strided(arr, new_shape, new_strides)

If the stride is negative, all but the last element will be cut off by the slicing operation.
The as_strided() call then alters the strides, so that all elements can be accessed again.

I don't like that, because arr2 is derived from arr1, but it is accessing memory that is out-of-bounds for arr1.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at it again, I think things should just work with a negative stride, as long as you make sure that the offset is guaranteed to be for the first item in the list - for a negative stride, that will be largest offset, so the slice below will not go out of memory.

It may well mean that you have to adjust _common_stide - but I would not be surprised if it actually simplified it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at it again, I think things should just work with a negative stride, as long as you make sure that the offset is guaranteed to be for the first item in the list - for a negative stride, that will be largest offset, so the slice below will not go out of memory.

You mean the code that I showed in my previous comment, but with first_field_offset == offsets[0]? That is what I meant, and I do not like that approach (see below).

It may well mean that you have to adjust _common_stide - but I would not be surprised if it actually simplified it!

I don't see how that simplifies that function. It would stay identical as far as I can see.
Can you point me to what you would simplify or give an example?


Let me illustrate why I don't like using a negative stride here:

Let's say we have a structured array with two fields 'a' and 'b'. The first field is 'a', but 'b' is the first field in memory.
This would look like this: (b0 is the first byte of b, b1 the second, etc. The array indices are written on top. They are equal to byte offsets at this point, because we are viewing the array as uint8.)

 0  1  2  3  4  5  6  7
[b0 b1 b2 b3 a0 a1 a2 a3]

Now we perform the slicing operation to set the correct offset. first_field_offset is 4 in this case.

> arr = arr[..., first_field_offset:]

Now the memory looks like this:
All bytes of the field b are out of bounds.

 -  -  -  -  0  1  2  3  
[b0 b1 b2 b3 a0 a1 a2 a3]

Now we set the correct offset.
new_shape would be something like (..., 2, 4).
new_strides would be something like (..., -4, 1).

This is the part I do not like, because we are effectively accessing memory, that was (temporarily) out-of-bounds.

> arr = np.lib.stride_tricks.as_strided(arr,
                                        new_shape,
                                        new_strides)

After that, it looks like this:

 (1, 0) (1, 1) (1, 2) (1, 3) (0, 0) (0, 1) (0, 2) (0, 3)
[b0     b1     b2     b3     a0     a1     a2     a3    ]

@seberg
Copy link
Member

seberg commented Apr 27, 2023

In general, I am happy with the change/improvement, should maybe have a release note, but the function already says it tries return views (IIRC). I admit, it's a bit unfortunate complex, but it's not terrible. (I have not looked super closely, yet)

@aurivus-ph
Copy link
Contributor Author

I think for matrix, the attempt at a view should just not be made (matrix is deprecated anyway, but might as well keep the behaviour like it was).

For other subclasses there would seem to be a choice: either just not take the view path for those either, or assume that they do the right thing -- and really that is up to them! (And if they override __array_function__, they can ensure it will work.) My tendency would be to go for assuming they behave properly.

One thing that might help inform this is to explicitly test this on np.ma.MaskedArray. Though to be honest I somewhat doubt the code actually works on those currently...

I had not considered masked arrays before; yes, they are not handled correctly. I honestly don't see how I could write a generic approach that would work with most subclasses. E.g. I don't see a way to handle np.ma.MaskedArray without special-casing it here.

Therefore, I would rather stay on the safe side, and only return views for types where we know that this works (i.e. if the original array is a plain ndarray). I.e. we would whitelist only ndarray, instead of blacklisting matrix and np.ma.MaskedArray.

I don't want to implement a special case for masked arrays for now. That can be implemented in a later PR if needed.

@mattip
Copy link
Member

mattip commented Apr 28, 2023

we would accept only ndarray

Makes sense to err on the side of caution.

@mhvk
Copy link
Contributor

mhvk commented Apr 28, 2023

As for when to do this: one possibility is to check whether array.__class__.__array_function__ is ndarray.__array_function__ - if so, the subclass has not overridden functions, so one should probably be conservative and not use the view (this will be the case for Matrix and MaskedArray).

But if array.__class__.__array_function__ is not ndarray.__array_function__, the subclass is all set up to handle possible problems, so one might as well aim to try to let them benefit. Certainly for astropy, this would work - Quantity and all its subclasses would benefit automatically, as would our Masked classes. Indeed, for our Masked classes, this would ensure we do not have to change anything, as its mask and data get passed through independently. Since the mask is guaranteed to be ndarray, while the data can be Quantity, it would be nice if by default they get treated the same.

@aurivus-ph
Copy link
Contributor Author

As for when to do this: one possibility is to check whether array.__class__.__array_function__ is ndarray.__array_function__ - if so, the subclass has not overridden functions, so one should probably be conservative and not use the view (this will be the case for Matrix and MaskedArray).

But if array.__class__.__array_function__ is not ndarray.__array_function__, the subclass is all set up to handle possible problems, so one might as well aim to try to let them benefit. Certainly for astropy, this would work - Quantity and all its subclasses would benefit automatically, as would our Masked classes. Indeed, for our Masked classes, this would ensure we do not have to change anything, as its mask and data get passed through independently. Since the mask is guaranteed to be ndarray, while the data can be Quantity, it would be nice if by default they get treated the same.

Which Masked classes do you mean? Do you mean MaskedArray in both cases?

As I read your first paragraph, you don't want me to return a view for MaskedArray.
But your second paragraph sounds like you are saying that it would just work (i.e. you're saying it would benefit). That is contradictory to me.

I cannot see MaskedArrays working without custom code. It fails at the line where we do a view(np.uint8), because that would require a longer mask.

Apart from that, I have tried it with astropy and I don't think we need to do anything about it. It already works correctly, even if I only whitelist ndarray in our code (and fall back to a copy for all other types), because astropy already overrides structured_to_unstructured.


So in summary, I will go for the simple approach, just whitelisting a few classes that are known to work. All others will fall back to the old implementation.
This already works for astropy, because it overrides structured_to_unstructured.

@aurivus-ph
Copy link
Contributor Author

Updated.

Now the new code is only used on ndarray, memmap and recarray. Other subclasses may be implemented later. Other projects can implement these themselves.

I have had to add a call to __array_wrap__ before returning. I hope it is correct this way.
I think it would be fine for recarray to be viewed as ndarray (because we are unstructuring after all), but this does not match the behavior of the function with copy=True.

@aurivus-ph aurivus-ph force-pushed the faster-structured-to-unstructured branch from a315043 to fe4543b Compare May 2, 2023 11:06
@mhvk
Copy link
Contributor

mhvk commented May 2, 2023

Which Masked classes do you mean? Do you mean MaskedArray in both cases?

Sorry that my earlier message was not clear: astropy has its own Masked class, which can interact with ndarray subclasses forming MaskedNDArray, MaskedQuantity, etc. It was needed as numpy's MaskedArray just does not interact well with subclasses at all.

@mhvk
Copy link
Contributor

mhvk commented May 2, 2023

A more philosophical point: I really don't think code in numpy should start deciding for subclasses what's best for them, and thus exclude them from improvements. I think my logic of checking whether a subclass is able to work around things if possible is about as kind as it should get; a blacklist for Matrix and MaskedArray would be fine too.


if common_stride < 0:
arr = arr[..., ::-1] # reverse, if the stride was negative
if type(arr) is not type(wrap.__self__):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no need to do this -- by giving subok=True, the wrapping is already done in as_strided (and off course the various views are already OK). If you want to be sure, I suggest adding a test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a test, that's how I discovered that this is necessary.

It seems the recarray turns into a ndarray once I do the view(np.uint8). I don't think that's usually a problem, because a recarray isn't very useful for a non-structured dtype.
But the old behavior is to keep the subclass anyway (which still happens with copy=True) and I don't want to change that without a good reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, fair enough, another class that is showing its age (and informally deprecated...), though I guess it is logical. It is a pity as_strided doesn't take an offsets argument!

But for here, maybe add a note in the code for why we do the wrap?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@aurivus-ph
Copy link
Contributor Author

@seberg @mhvk @mattip
Is there anything left to do? This is finished now from my point of view.

The failed "checks" seem to be unrelated to my changes.

@mattip mattip requested a review from mhvk May 9, 2023 13:30
@ahaldane
Copy link
Member

Original author of this function here... this is a great idea!

Since this function already sometimes returned a view and had an existing copy-argument, it seems like the risk of someone's code getting screwed up by now getting a view instead of a copy is low, so I don't think back-compat is too much of an issue (though I've been wrong before...)

I did not check the logic in as fine detail as @mhvk, but reading through it looked OK.

I suggest updating the docstring for the "copy" argument, and also adding a version note since there is a very tiny risk of back-compat issues. Maybe something like:

copy : bool, optional
        If true, always return a copy. If false, a view is returned if possible, which is when 
        the `dtype`  and strides of the fields are suitable and the array subtype is one of 
        `np.ndarray`, `np.recarray` or `np.memmap`.

        .. versionchanged:: 1.25.0
            A view can now be returned if the fields are separated by a uniform stride.

@aurivus-ph
Copy link
Contributor Author

@ahaldane Thanks! I have updated the docstring with your suggestion.

Note that if there is no padding in between the fields (i.e. stride = itemsize), a view can still be returned for some additional array subtypes, such as matrix, which is the old behavior. However, I could not come up with a concise description that includes this exception for the docstring.

subok=True)

# cast and drop the last dimension again
arr = arr.view(out_dtype)[..., 0]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed something else: Is there a reason I'm missing for not using np.ndarray instead of the last few lines:

new_shape = arr.shape + (sum(counts),)
new_strides = arr.strides + (abs(common_stride),)
arr = np.ndarray(buffer=arr, dtype=out_dtype, shape=new_shape, strides=new_strides, offset=min(offsets)).view(type(a))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not aware that constructor existed. I will try that out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm facing two issues with that:

  1. The np.ndarray() throws "ValueError: ndarray is not contiguous" in some cases. At least it does when the input array is reversed; maybe in other cases too.
  2. I couldn't find a way to preserve the memmap subclass yet. Just using view() to get it doesn't seem very useful, because that would lose its attributes like filename (if it even works). And its __array_wrap__() seems to deliberately return a plain ndarray instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, thanks for checking. Then the code is fine as-is.

Somehow it seems like an oversight in the API that the np.ndarray(buffer=xxx) can't handle non-contiguous inputs. The memmap issue could probably be fixed by appropriate call to __new__, but that's a moot point.

@ahaldane
Copy link
Member

For the docstring, you're right it is not totally accurate. How about making it technically accurate by being more open-ended about when exactly a view is returned by changing "which is" to "such as".

        If true, always return a copy. If false, a view is returned if
        possible, such as when the `dtype` and strides of the fields are
        suitable and the array subtype is one of `np.ndarray`, `np.recarray`
        or `np.memmap`.

The other case it can return a view is if the "flattened" field structure is the same as the "flattened" structure with all fields converted to the same dtype, which is a bit too clunky and unclear to say.

@aurivus-ph
Copy link
Contributor Author

[...] How about making it technically accurate by being more open-ended about when exactly a view is returned by changing "which is" to "such as".

I have applied that.

@ahaldane
Copy link
Member

Thanks @aurivus-ph ! The PR looks ready to merge to me.

I don't have commit rights right now, but anyone else please merge.

@mattip mattip merged commit 2540554 into numpy:main May 23, 2023
59 checks passed
@mattip
Copy link
Member

mattip commented May 23, 2023

Thanks @aurivus-ph and the reviewers

@aurivus-ph aurivus-ph deleted the faster-structured-to-unstructured branch May 24, 2023 06:43
@aurivus-ph
Copy link
Contributor Author

Thank you very much for the review and getting this merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

None yet

5 participants