ENH: structured_to_unstructured: view more often #23652

Merged (10 commits, May 23, 2023)
72 changes: 71 additions & 1 deletion numpy/lib/recfunctions.py
@@ -885,6 +885,52 @@ def count_elem(dt):
fields.extend([(d, c, o + i*size) for d, c, o in subfields])
return fields

def _common_stride(offsets, counts, itemsize):
    """
    Returns the stride between the fields, or None if the stride is not
    constant. The values in "counts" designate the lengths of
    sub-arrays. Sub-arrays are treated as many contiguous fields, with
    always positive stride.
    """

    if len(offsets) <= 1:
        return itemsize

    negative = offsets[1] < offsets[0]  # negative stride
    if negative:
        # reverse, so offsets will be ascending
        it = zip(reversed(offsets), reversed(counts))
    else:
        it = zip(offsets, counts)

    prev_offset = None
    stride = None
    for offset, count in it:
        if count != 1:  # sub array: always c-contiguous
            if negative:
                return None  # sub-arrays can never have a negative stride
            if stride is None:
                stride = itemsize
            if stride != itemsize:
                return None
            end_offset = offset + (count - 1) * itemsize
        else:
            end_offset = offset

        if prev_offset is not None:
            new_stride = offset - prev_offset
            if stride is None:
                stride = new_stride
            if stride != new_stride:
                return None

        prev_offset = end_offset

    if stride is not None:
        if negative:
            return -stride
    return stride
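
For intuition, a few illustrative calls showing what this helper computes (an editorial sketch, not part of the diff; _common_stride is a private helper, so these calls are for illustration only):

from numpy.lib import recfunctions as rfn

# three 4-byte fields laid out back to back -> constant stride of 4
assert rfn._common_stride([0, 4, 8], [1, 1, 1], 4) == 4

# two fields selected in reverse memory order -> negative stride
assert rfn._common_stride([4, 0], [1, 1], 4) == -4

# irregular spacing between the fields -> no common stride
assert rfn._common_stride([0, 4, 12], [1, 1, 1], 4) is None

# a length-3 sub-array at offset 0 followed by a scalar field at offset 12,
# itemsize 4 everywhere and fully contiguous -> stride 4
assert rfn._common_stride([0, 12], [3, 1], 4) == 4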


def _structured_to_unstructured_dispatcher(arr, dtype=None, copy=None,
                                           casting=None):
@@ -960,7 +1006,7 @@ def structured_to_unstructured(arr, dtype=None, copy=False, casting='unsafe'):
    if dtype is None:
        out_dtype = np.result_type(*[dt.base for dt in dts])
    else:
        out_dtype = np.dtype(dtype)  # was: out_dtype = dtype

    # Use a series of views and casts to convert to an unstructured array:

@@ -972,6 +1018,30 @@ def structured_to_unstructured(arr, dtype=None, copy=False, casting='unsafe'):
                                 'itemsize': arr.dtype.itemsize})
    arr = arr.view(flattened_fields)

    if (not copy) and all(dt.base == out_dtype for dt in dts):
        # all elements have the right dtype already; if they have a common
        # stride, we can just return a view
        common_stride = _common_stride(offsets, counts, out_dtype.itemsize)
        if common_stride is not None:
            # ensure that we have a real ndarray; other types (e.g. matrix)
Contributor:
I think we should trust that subclasses do this right, and instead explicitly exclude matrix - this is done elsewhere in the code too, and matrix is deprecated anyway.

Contributor Author:
I only allow the types ndarray, recarray and memmap, at least for now.

            # have strange slicing behavior
            arr = arr.view(type=np.ndarray)
            new_shape = arr.shape + (sum(counts), out_dtype.itemsize)
            new_strides = arr.strides + (abs(common_stride), 1)
Contributor:
Why not use a negative common_stride? It does mean you have to be careful with the offset below, but there is nothing wrong with a negative stride (right now, your [::-1] at the end does the same thing).

Contributor Author:
Can you give a suggestion of how you would do that cleanly?

I can of course do something like this:

# first_field_offset is the first byte of the first field
arr1 = arr[..., first_field_offset:]
# new_strides may be negative here
arr2 = np.lib.stride_tricks.as_strided(arr1, new_shape, new_strides)

If the stride is negative, all but the last element will be cut off by the slicing operation.
The as_strided() call then alters the strides, so that all elements can be accessed again.

I don't like that, because arr2 is derived from arr1, but it is accessing memory that is out-of-bounds for arr1.

Contributor:
Looking at it again, I think things should just work with a negative stride, as long as you make sure that the offset is guaranteed to be for the first item in the list - for a negative stride, that will be the largest offset, so the slice below will not go out of memory.

It may well mean that you have to adjust _common_stride - but I would not be surprised if it actually simplified it!

Contributor Author:
> Looking at it again, I think things should just work with a negative stride, as long as you make sure that the offset is guaranteed to be for the first item in the list - for a negative stride, that will be the largest offset, so the slice below will not go out of memory.

You mean the code that I showed in my previous comment, but with first_field_offset == offsets[0]? That is what I meant, and I do not like that approach (see below).

> It may well mean that you have to adjust _common_stride - but I would not be surprised if it actually simplified it!

I don't see how that simplifies that function. It would stay identical as far as I can see.
Can you point me to what you would simplify or give an example?


Let me illustrate why I don't like using a negative stride here:

Let's say we have a structured array with two fields 'a' and 'b'. The first field is 'a', but 'b' is the first field in memory.
This would look like this: (b0 is the first byte of b, b1 the second, etc. The array indices are written on top. They are equal to byte offsets at this point, because we are viewing the array as uint8.)

 0  1  2  3  4  5  6  7
[b0 b1 b2 b3 a0 a1 a2 a3]

Now we perform the slicing operation to set the correct offset. first_field_offset is 4 in this case.

> arr = arr[..., first_field_offset:]

Now the memory looks like this:
All bytes of the field b are out of bounds.

 -  -  -  -  0  1  2  3  
[b0 b1 b2 b3 a0 a1 a2 a3]

Now we apply the new shape and strides.
new_shape would be something like (..., 2, 4).
new_strides would be something like (..., -4, 1).

This is the part I do not like, because we are effectively accessing memory that was (temporarily) out-of-bounds.

> arr = np.lib.stride_tricks.as_strided(arr,
                                        new_shape,
                                        new_strides)

After that, it looks like this:

 (1, 0) (1, 1) (1, 2) (1, 3) (0, 0) (0, 1) (0, 2) (0, 3)
[b0     b1     b2     b3     a0     a1     a2     a3    ]
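
As an editorial aside, the user-visible behavior being discussed can be shown with a small sketch (assuming this PR is applied, i.e. NumPy 1.25+): selecting fields in reverse memory order yields a reversed view rather than a copy.

import numpy as np
from numpy.lib import recfunctions as rfn

d = np.array([(1, 2), (3, 4)], dtype=[('a', 'i4'), ('b', 'i4')])

# 'b' comes after 'a' in memory, so requesting ['b', 'a'] walks backwards
u = rfn.structured_to_unstructured(d[['b', 'a']])

assert u.tolist() == [[2, 1], [4, 3]]
assert np.shares_memory(u, d)  # a view, not a copy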


            arr = arr[..., None].view(np.uint8)  # view as bytes
Contributor:
I'd use np.newaxis - it is an alias of None, but makes clear what the intent is.
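
(A quick editorial check of that equivalence:)

import numpy as np
assert np.newaxis is None  # np.newaxis is literally an alias for None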

Contributor Author:
Done

            arr = arr[..., min(offsets):]  # remove the leading unused data
            arr = np.lib.stride_tricks.as_strided(arr,
                                                  new_shape,
                                                  new_strides)

            # cast and drop the last dimension again
            arr = arr.view(out_dtype)[..., 0]
Member:
Just noticed something else: Is there a reason I'm missing for not using np.ndarray instead of the last few lines:

new_shape = arr.shape + (sum(counts),)
new_strides = arr.strides + (abs(common_stride),)
arr = np.ndarray(buffer=arr, dtype=out_dtype, shape=new_shape, strides=new_strides, offset=min(offsets)).view(type(a))

Contributor Author:
I was not aware that constructor existed. I will try that out.

Contributor Author:
I'm facing two issues with that:

  1. The np.ndarray() call throws "ValueError: ndarray is not contiguous" in some cases. At least it does when the input array is reversed; maybe in other cases too (see the sketch below).
  2. I couldn't find a way to preserve the memmap subclass yet. Just using view() to get it doesn't seem very useful, because that would lose its attributes like filename (if it even works). And its __array_wrap__() seems to deliberately return a plain ndarray instead.
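
A minimal sketch of the first issue (an assumed repro, not taken from the PR): np.ndarray(buffer=...) needs a buffer it can treat as a single contiguous block, which a reversed array cannot provide.

import numpy as np

a = np.arange(16, dtype=np.uint8)[::-1]  # reversed view -> not contiguous
np.ndarray(buffer=a, dtype=np.uint8, shape=(16,))
# raises ValueError: ndarray is not contiguous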

Member:
Hmm, thanks for checking. Then the code is fine as-is.

Somehow it seems like an oversight in the API that np.ndarray(buffer=xxx) can't handle non-contiguous inputs. The memmap issue could probably be fixed by an appropriate call to __new__, but that's a moot point.


            if common_stride < 0:
                arr = arr[..., ::-1]  # reverse, if the stride was negative
            return arr

    # next cast to a packed format with all fields converted to new dtype
    packed_fields = np.dtype({'names': names,
                              'formats': [(out_dtype, dt.shape) for dt in dts]})
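
To summarize the effect of the new branch, a short editorial sketch (assuming this PR is applied; with copy=False, the default, and uniform field dtypes the result is now a writable view):

import numpy as np
from numpy.lib import recfunctions as rfn

d = np.array([(1, 2, 5), (4, 5, 7)],
             dtype=[('x', 'i4'), ('y', 'i4'), ('z', 'i4')])
u = rfn.structured_to_unstructured(d)  # all fields are i4 -> common stride -> view

assert np.shares_memory(u, d)
u[0, 0] = 99
assert d['x'][0] == 99  # writes are visible in the original structured array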
35 changes: 31 additions & 4 deletions numpy/lib/tests/test_recfunctions.py
@@ -263,8 +263,13 @@ def test_structured_to_unstructured(self):
                     dtype=[('x', 'i4'), ('y', 'i4'), ('z', 'i4')])
        dd = structured_to_unstructured(d)
        ddd = unstructured_to_structured(dd, d.dtype)
        assert_(np.shares_memory(dd, d))    # was: assert_(dd.base is d)
        assert_(np.shares_memory(ddd, d))   # was: assert_(ddd.base is d)

        # check that reversing the order of attributes works
        dd_attrib_rev = structured_to_unstructured(d[['z', 'x']])
        assert_equal(dd_attrib_rev, [[5, 1], [7, 4], [11, 7], [12, 10]])
        assert_(np.shares_memory(dd_attrib_rev, d))

        # including uniform fields with subarrays unpacked
        d = np.array([(1, [2, 3], [[ 4, 5], [ 6, 7]]),
@@ -273,8 +278,30 @@ def test_structured_to_unstructured(self):
                            ('x2', ('i4', (2, 2)))])
        dd = structured_to_unstructured(d)
        ddd = unstructured_to_structured(dd, d.dtype)
        assert_(np.shares_memory(dd, d))    # was: assert_(dd.base is d)
        assert_(np.shares_memory(ddd, d))   # was: assert_(ddd.base is d)

        # check that reversing with sub-arrays works as expected
        d_rev = d[::-1]
        dd_rev = structured_to_unstructured(d_rev)
        assert_equal(dd_rev, [[8, 9, 10, 11, 12, 13, 14],
                              [1, 2, 3, 4, 5, 6, 7]])

        # check that sub-arrays keep the order of their values
        d_attrib_rev = d[['x2', 'x1', 'x0']]
        dd_attrib_rev = structured_to_unstructured(d_attrib_rev)
        assert_equal(dd_attrib_rev, [[4, 5, 6, 7, 2, 3, 1],
                                     [11, 12, 13, 14, 9, 10, 8]])

        # with ignored field at the end
        d = np.array([(1, [2, 3], [[4, 5], [6, 7]], 32),
                      (8, [9, 10], [[11, 12], [13, 14]], 64)],
                     dtype=[('x0', 'i4'), ('x1', ('i4', 2)),
                            ('x2', ('i4', (2, 2))), ('ignored', 'u1')])
        dd = structured_to_unstructured(d[['x0', 'x1', 'x2']])
        assert_(np.shares_memory(dd, d))
        assert_equal(dd, [[1, 2, 3, 4, 5, 6, 7],
                          [8, 9, 10, 11, 12, 13, 14]])

        # test that nested fields with identical names don't break anything
        point = np.dtype([('x', int), ('y', int)])