ENH: structured_to_unstructured: view more often #23652
Converting with structured_to_unstructured() now returns a view, if all the fields have a constant stride.
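To make "constant stride" concrete, here is a minimal sketch of the check. The helper `constant_stride` is hypothetical and much simpler than the PR's `_common_stride`: it assumes every field holds a single element of the same dtype and ignores subarray fields.

```python
def constant_stride(offsets):
    """Return the common byte distance between consecutive field offsets, or None.

    Hypothetical, simplified helper (not the PR's _common_stride): every field
    is assumed to hold one element of the same dtype, with no subarray fields.
    """
    if len(offsets) < 2:
        return None  # a single field gives nothing to compare
    steps = {b - a for a, b in zip(offsets, offsets[1:])}
    return steps.pop() if len(steps) == 1 else None


print(constant_stride([0, 4, 8]))   # three packed float32 fields
print(constant_stride([4, 0]))      # second field stored first in memory
print(constant_stride([0, 4, 16]))  # irregular spacing
```

A negative result corresponds to fields that are listed in the opposite order to their memory layout, which is the case discussed later in the review.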
This is a behaviour change. Could you propose it to the mailing list to see if there might be people who do not expect views, or wish to preserve the array class across calls to |
I have written a post on the mailing list: https://mail.python.org/archives/list/numpy-discussion@python.org/thread/Y2RKZHRAI5Q5SPVTG35ISKIASPFZ7J47/
Hmm. No one responded on the mailing list, so I guess this is up to the reviewers. Does anyone have concerns about this change in behavior? I am a little concerned that someone may modify the view without realizing they are modifying the original data. If no-one responds in a few days, I would say we should go ahead but
|
I like this! I think if one passes in copy=False, one should not get a copy unless absolutely necessary.
But I don't think there should be a behaviour change in subclasses suddenly turning into ndarray -- e.g., I'm fairly sure that would break astropy's tests for its Quantity class.
I think for matrix, the attempt at a view should just not be made (matrix is deprecated anyway, but might as well keep the behaviour like it was).
For other subclasses there would seem to be a choice: either just not take the view path for those either, or assume that they do the right thing -- and really that is up to them! (And if they override __array_function__, they can ensure it will work.) My tendency would be to go for assuming they behave properly.
One thing that might help inform this is to explicitly test this on np.ma.MaskedArray. Though to be honest I somewhat doubt the code actually works on those currently...
numpy/lib/recfunctions.py
# stride, we can just return a view
common_stride = _common_stride(offsets, counts, out_dtype.itemsize)
if common_stride is not None:
    # ensure that we have a real ndarray; other types (e.g. matrix)
I think we should trust that subclasses do this right, and instead explicitly exclude matrix - this is done elsewhere in the code too, and matrix is deprecated anyway.
I only allow the types ndarray, recarray and memmap, at least for now.
numpy/lib/recfunctions.py
new_shape = arr.shape + (sum(counts), out_dtype.itemsize)
new_strides = arr.strides + (abs(common_stride), 1)

arr = arr[..., None].view(np.uint8) # view as bytes
I'd use np.newaxis - it is an alias of None, but makes clear what the intent is.
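A quick check of that equivalence, as a small sketch on an arbitrary example array (the array itself is made up for illustration):

```python
import numpy as np

arr = np.arange(6).reshape(2, 3)

# np.newaxis is literally the same object as None
assert np.newaxis is None

with_none = arr[..., None]
with_newaxis = arr[..., np.newaxis]

# Both spellings append a length-1 axis
print(with_none.shape, with_newaxis.shape)
```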
Done
# have strange slicing behavior
arr = arr.view(type=np.ndarray)
new_shape = arr.shape + (sum(counts), out_dtype.itemsize)
new_strides = arr.strides + (abs(common_stride), 1)
Why not use a negative common_stride? It does mean you have to be careful with the offset below, but there is nothing wrong with a negative stride (right now, your [::-1] at the end does the same thing).
Can you give a suggestion of how you would do that cleanly?
I can of course do something like this:
# first_field_offset is the first byte of the first field
arr1 = arr[..., first_field_offset:]
# new_strides may be negative here
arr2 = np.lib.stride_tricks.as_strided(arr1, new_shape, new_strides)
If the stride is negative, all but the last element will be cut off by the slicing operation.
The as_strided() call then alters the strides, so that all elements can be accessed again.
I don't like that, because arr2 is derived from arr1, but it is accessing memory that is out-of-bounds for arr1.
Looking at it again, I think things should just work with a negative stride, as long as you make sure that the offset is guaranteed to be for the first item in the list - for a negative stride, that will be the largest offset, so the slice below will not go out of memory.
It may well mean that you have to adjust _common_stride - but I would not be surprised if it actually simplified it!
Looking at it again, I think things should just work with a negative stride, as long as you make sure that the offset is guaranteed to be for the first item in the list - for a negative stride, that will be largest offset, so the slice below will not go out of memory.
You mean the code that I showed in my previous comment, but with first_field_offset == offsets[0]? That is what I meant, and I do not like that approach (see below).
It may well mean that you have to adjust _common_stride - but I would not be surprised if it actually simplified it!
I don't see how that simplifies that function. It would stay identical as far as I can see.
Can you point me to what you would simplify or give an example?
Let me illustrate why I don't like using a negative stride here:
Let's say we have a structured array with two fields 'a' and 'b'. The first field is 'a', but 'b' is the first field in memory.
This would look like this: (b0 is the first byte of b, b1 the second, etc. The array indices are written on top. They are equal to byte offsets at this point, because we are viewing the array as uint8.)
0 1 2 3 4 5 6 7
[b0 b1 b2 b3 a0 a1 a2 a3]
Now we perform the slicing operation to set the correct offset. first_field_offset is 4 in this case.
> arr = arr[..., first_field_offset:]
Now the memory looks like this:
All bytes of the field b are out of bounds.
- - - - 0 1 2 3
[b0 b1 b2 b3 a0 a1 a2 a3]
Now we set the correct strides.
new_shape would be something like (..., 2, 4).
new_strides would be something like (..., -4, 1).
This is the part I do not like, because we are effectively accessing memory that was (temporarily) out-of-bounds.
> arr = np.lib.stride_tricks.as_strided(arr, new_shape, new_strides)
After that, it looks like this:
(1, 0) (1, 1) (1, 2) (1, 3) (0, 0) (0, 1) (0, 2) (0, 3)
[b0 b1 b2 b3 a0 a1 a2 a3 ]
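For comparison, the route shown in the diff snippets in this thread (a positive stride followed by a final [..., ::-1]) never steps outside the sliced array. The following is a minimal sketch of that route for the two-field layout above; the float32 field type, the array length of 3, and the sample values are assumptions chosen for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided

# Field 'a' is listed first, but 'b' comes first in memory (offsets 4 and 0).
dt = np.dtype({"names": ["a", "b"],
               "formats": ["<f4", "<f4"],
               "offsets": [4, 0],
               "itemsize": 8})
rec = np.zeros(3, dtype=dt)
rec["a"] = [1, 2, 3]
rec["b"] = [10, 20, 30]

raw = rec[..., np.newaxis].view(np.uint8)  # view as bytes, shape (3, 8)
raw = raw[..., 0:]                         # start at the smallest offset (0 here)

# Positive stride of 4 bytes between fields: memory order is (b, a).
fields = as_strided(raw, shape=(3, 2, 4), strides=(8, 4, 1))
out = fields.view("<f4")[..., 0]           # cast and drop the byte axis
out = out[..., ::-1]                       # reverse to get the declared order (a, b)

print(out)
print(np.shares_memory(out, rec))
```

Every intermediate array here stays within the bounds of the array it was derived from; the reversal only flips the sign of the field stride at the very end.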
In general, I am happy with the change/improvement. It should maybe have a release note, but the function already says it tries to return views (IIRC). I admit, it's a bit unfortunately complex, but it's not terrible. (I have not looked super closely yet.)
I had not considered masked arrays before; yes, they are not handled correctly. I honestly don't see how I could write a generic approach that would work with most subclasses. E.g. I don't see a way to handle Therefore, I would rather stay on the safe side, and only return views for types where we know that this works (i.e. if the original array is a plain ndarray). I.e. we would whitelist only I don't want to implement a special case for masked arrays for now. That can be implemented in a later PR if needed. |
Makes sense to err on the side of caution.
As for when to do this: one possibility is to check whether But if |
Which As I read your first paragraph, you don't want me to return a view for I cannot see MaskedArrays working without custom code. It fails at the line where we do a Apart from that, I have tried it with astropy and I don't think we need to do anything about it. It already works correctly, even if I only whitelist So in summary, I will go for the simple approach, just whitelisting a few classes that are known to work. All others will fall back to the old implementation. |
Updated. Now the new code is only used on I have had to add a call to |
Sorry that my earlier message was not clear: astropy has its own |
A more philosophical point: I really don't think code in numpy should start deciding for subclasses what's best for them, and thus exclude them from improvements. I think my logic of checking whether a subclass is able to work around things if possible is about as kind as it should get; a blacklist for |
if common_stride < 0:
    arr = arr[..., ::-1] # reverse, if the stride was negative
if type(arr) is not type(wrap.__self__):
There is no need to do this -- by giving subok=True, the wrapping is already done in as_strided (and of course the various views are already OK). If you want to be sure, I suggest adding a test.
I have added a test, that's how I discovered that this is necessary.
It seems the recarray turns into an ndarray once I do the view(np.uint8). I don't think that's usually a problem, because a recarray isn't very useful for a non-structured dtype.
But the old behavior is to keep the subclass anyway (which still happens with copy=True) and I don't want to change that without a good reason.
Ah, fair enough, another class that is showing its age (and informally deprecated...), though I guess it is logical. It is a pity as_strided doesn't take an offsets argument!
But for here, maybe add a note in the code for why we do the wrap?
Done
Original author of this function here... this is a great idea! Since this function already sometimes returned a view and had an existing copy-argument, it seems like the risk of someone's code getting screwed up by now getting a view instead of a copy is low, so I don't think back-compat is too much of an issue (though I've been wrong before...) I did not check the logic in as fine detail as @mhvk, but reading through it looked OK. I suggest updating the docstring for the "copy" argument, and also adding a version note since there is a very tiny risk of back-compat issues. Maybe something like:
|
@ahaldane Thanks! I have updated the docstring with your suggestion. Note that if there is no padding in between the fields (i.e. stride = itemsize), a view can still be returned for some additional array subtypes, such as |
subok=True)

# cast and drop the last dimension again
arr = arr.view(out_dtype)[..., 0]
Just noticed something else: Is there a reason I'm missing for not using np.ndarray instead of the last few lines:
new_shape = arr.shape + (sum(counts),)
new_strides = arr.strides + (abs(common_stride),)
arr = np.ndarray(buffer=arr, dtype=out_dtype, shape=new_shape, strides=new_strides, offset=min(offsets)).view(type(a))
I was not aware that constructor existed. I will try that out.
I'm facing two issues with that:
- The np.ndarray() throws "ValueError: ndarray is not contiguous" in some cases. At least it does when the input array is reversed; maybe in other cases too.
- I couldn't find a way to preserve the memmap subclass yet. Just using view() to get it doesn't seem very useful, because that would lose its attributes like filename (if it even works). And its __array_wrap__() seems to deliberately return a plain ndarray instead.
Hmm, thanks for checking. Then the code is fine as-is.
Somehow it seems like an oversight in the API that the np.ndarray(buffer=xxx) can't handle non-contiguous inputs. The memmap issue could probably be fixed by appropriate call to __new__, but that's a moot point.
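As an illustration of the constructor discussed here, the sketch below builds a strided float32 view directly with np.ndarray(buffer=...) on a contiguous structured array, then tries the same call on a reversed input. The field names, the padding field "w", and the sample values are assumptions for the example; the exact exception type and message for the reversed case may differ between NumPy versions, so it is only printed.

```python
import numpy as np

# Three float32 fields followed by 4 bytes that we want to skip (itemsize 16).
rec = np.zeros(3, dtype=[("x", "<f4"), ("y", "<f4"), ("z", "<f4"), ("w", "<u4")])
rec["x"] = [1, 2, 3]
rec["y"] = [4, 5, 6]
rec["z"] = [7, 8, 9]

# Contiguous input: shape, strides and offset are given explicitly.
xyz = np.ndarray(shape=(3, 3), dtype="<f4", buffer=rec,
                 offset=0, strides=(rec.dtype.itemsize, 4))
print(xyz)
print(np.shares_memory(xyz, rec))

# Reversed (non-contiguous) input: the thread reports a ValueError here.
try:
    np.ndarray(shape=(3, 3), dtype="<f4", buffer=rec[::-1],
               offset=0, strides=(rec.dtype.itemsize, 4))
except Exception as exc:  # broad on purpose: the error type is not guaranteed
    print(type(exc).__name__, exc)
```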
For the docstring, you're right it is not totally accurate. How about making it technically accurate by being more open-ended about when exactly a view is returned by changing "which is" to "such as".
The other case it can return a view is if the "flattened" field structure is the same as the "flattened" structure with all fields converted to the same dtype, which is a bit too clunky and unclear to say.
I have applied that.
Thanks @aurivus-ph ! The PR looks ready to merge to me. I don't have commit rights right now, but anyone else please merge.
Thanks @aurivus-ph and the reviewers
Thank you very much for the review and getting this merged!
Converting with structured_to_unstructured() now returns a view, if all the fields have a constant stride.
This is useful, if a structured array has attributes with multiple types, but all the attributes with the same type are contiguous, e.g. a vertex with x, y, z as float32, and r, g, b as uint8.
This will allow the following now (all the np.shares_memory() calls now return true):
Observable changes / issues:
as_strided() does not ensure all dimensions are kept on array subclasses either
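A minimal sketch of the vertex use case described above. This is not the snippet from the PR description, which is not preserved here; the array length and sample values are made up. On NumPy versions that include this change, the two shares_memory() calls are expected to report that the results are views; on older versions the function may return copies instead.

```python
import numpy as np
from numpy.lib.recfunctions import structured_to_unstructured

# A vertex with x, y, z as float32 and r, g, b as uint8 (packed, 15 bytes).
vertex = np.dtype([("x", "f4"), ("y", "f4"), ("z", "f4"),
                   ("r", "u1"), ("g", "u1"), ("b", "u1")])
verts = np.zeros(4, dtype=vertex)
verts["x"] = [0, 1, 2, 3]
verts["r"] = [255, 128, 64, 0]

# Select the fields of one type, then convert them to a plain 2-D array.
xyz = structured_to_unstructured(verts[["x", "y", "z"]])
rgb = structured_to_unstructured(verts[["r", "g", "b"]])

print(xyz.shape, xyz.dtype)
print(rgb.shape, rgb.dtype)
print(np.shares_memory(xyz, verts), np.shares_memory(rgb, verts))
```

If a view is returned, writing to xyz or rgb modifies verts as well, which is the behaviour change the reviewers discuss in this thread.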