Initializing arrays with subarray dtype #17511

mhvk · 2020-10-08T21:21:47Z

#17419, which deprecated the use of structured arrays composed of just a subarray, is causing a few problems in astropy (see astropy/astropy#10815). While these can be fixed with the suggestion made by @seberg, one of the fixes is not very nice: In our case, it turns out all problems are with np.array([], dtype=subarray), which means we have to special-case an empty list.

But I wonder if it doesn't make sense to try to move towards the case where

np.array(data, dtype)

always returns an array with the requested dtype, and for data=[] it would be equivalent to np.empty(0, dtype). Certainly, it is strange that the first of the following works but the second does not:

>>> import numpy as np
>>> np.array(([1, 2], 3), dtype='(2,)i,i')
array(([1, 2], 3), dtype=[('f0', '<i4', (2,)), ('f1', '<i4')])
>>> np.array(([1, 2],), dtype='(2,)i')
TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.

Of course, I realize it is not easy to move from arguably mistaken behaviour to more logical one, but looking at our problem cases at least, I realize that in those we basically assumed the more logical one (i.e., they were bugs that somehow failed to get exposed)...

Concretely, I wonder if one could warn and move to more logical behaviour in one go (or give the option to use the more logical behaviour or so).

The text was updated successfully, but these errors were encountered:

mhvk · 2020-10-08T21:42:28Z

Never mind. This behaviour is more ingrained than I thought, so will be hard to change, I think. Let me just close this.

seberg · 2020-10-08T21:54:05Z

Well, one issue is that arr.astype("(2)i") or the equivalent np.array(arr, dtype="(2)i") misbehave. Now it may be possible to retain the current behaviour for nested list of tuples (not the second case you are listed above though!).

In that case, I think we would have to tag on a FutureWarning on the array-like cases above, but that may be fine. Those are normally errors currently.

mhvk · 2020-10-08T23:04:42Z

Actually, let me re-open, as I got only more and more confused: Why is it OK that np.empty returns a dtype inconsistent with that requested, but not for np.array([], dtype) to just do the same thing?

np.empty(0, dtype='(2,)f4')
Out[20]: array([], shape=(0, 2), dtype=float32)

mhvk · 2020-10-08T23:06:07Z

I guess if the semantics are really that the shape of a single subarray gets appended to the shape of the data, then maybe it is best to just try to be more consistent in that?

seberg · 2020-10-08T23:35:55Z

Yes, the semantics of appending the dimension is well defined. The problem is the sematics of assigning the data to the array. If we assign after "appending" the dtype shape into the array shape, you basically have incorrect broadcasting.

This currently works fine if you include tuples in a nested list (the new code has more difficulties with it). But when there is an array-like (or nested array-likes), the problem is that:

dtype = np.dtype("(2)i")
res = np.array(arr)

becomes:

res = np.empty(arr.shape + dtype.shape, dtype=dtype.base)
res[...] = arr

Which means that arr suddenly gets broadcast against dimensions that were not there previously.

For arbitrary input, the shape can be discovered with from numpy.core._multiarray_umath import _discover_array_parameters on master. And this function stops when it finds the first tuple(), so it does not discover tuples as dimensions.

The new code currently has slight issues with that. The old code basically just worked in that case, as long as the tuples had the right "depth" (number of dimensions), since discovered_shape + dtype.shape works out right.

Now for the desired behaviour: Arguably, the correct behaviour is this:

res_with_subarray = np.empty(shape, dtype=dtype, do_not_append_dimensions=True)
res_with_subarray[...] = arr
res = res_with_subarray.append_subarray_dimensions()

Which I can do for the tuple case, but behaves different from the (arguably broken) case array case. I am happy to try that, we just have to make a choice to:

Add a bit of hacks to try and give a warning for the currently broken array case
Rip of the band-aid on it and hope that nobody has working code relying on the arguably buggy behaviour.

In either case, we internally need the do_not_append_dimensions=True, but that is a pretty simple view on the C-side.

(Hopefully this made sense.)

mhvk · 2020-10-09T00:10:15Z

From the purely selfish astropy perspective, it would be nice if np.array([], dtype) continued to be equivalent to np.empty(0, dtype) - though if that means special-casing yet more perhaps better not!

More generally, it would seem better to move further towards the correct behaviour you describe. I think warning on any array-like input might make sense - it would seem better to restrict for those rathre than include warning for cases like the example in #17173, which seems unambiguous. But I obviously haven't really thought this through...

Separately, it would be nice if there were an option to get the dtype one actually asked for, even if it were a subarray (though I guess one can always fake it with things like '(2,2)f4,V0'; it seems remarkably difficult to do otherwise!).

seberg · 2020-10-09T01:48:31Z

OK, still good to know that this affects projects like astropy, have to put this on my pre 1.20 todo list...

Giving a warning may for array-likes may be slightly annoying, but we could gamble that it is fine to give it pretty indiscriminately and a bit imprecise (i.e. no matter if the result is currently an error and will start working, or will change behaviour). The main issue is that I do not see how to opt in to the new behaviour.

I agree the list of tuples case is clear, and (hopefully) well defined. The way it works in 1.19, is not the way I say above, but that probably ends up doing the same thing (or so I hope).

I don't really feel like going into the depth of actually allowing such dtypes to be attached, you can get far with a structure with a single element, but specifically for coercion from lists of tuples that has different behaviour (expects one extra nesting level in the tuple).
In the end, that would probably require some type of flag. I am not convinced that the current behaviour of appending dimensions to the array is useful outside of indexing (where it could be special cased). But, honestly, I don't see how to change that -- aside from creating a new structured (and maybe subarray) DType to replace void, which probably makes sense in any case.

mhvk · 2020-10-10T00:47:36Z

A warning just for sub-array dtype and just array-likes or non-simple tuples would at least be less intrusive.

Agreed that it is hard to see an opt-in path forward apart from a special dtype. And perhaps it also matters less than I thought: I realize that I was wrong about it being so hard to get the right dtype - it really should have been a dtype like [('field', '(2,2)f4')], in which case the data in the np.array call has to be different too, but that is fine. I guess I've always been a bit confused that something like 2f4,2f4 becomes a structured array, but 2f4 does not.

This currently appends the subarray dtype dimensions first and then tries to assign to the result array which uses incorrect broadcasting (broadcasting against the subarray dimensions instead of repeating each element according to the subarray dimensions). This also fixes the python scalar pathway `np.array(2, dtype="(2)f4,")` which previously only filled the first value. I consider that a clear bug fix. Closes numpygh-17511

mhvk added component: numpy._core component: numpy.dtype labels Oct 8, 2020

This was referenced Oct 8, 2020

numpy-dev using a dtype with a subarray field is deprecated astropy/astropy#10815

Closed

Numpy dev fixes for subarray dtype astropy/astropy#10818

Closed

mhvk closed this as completed Oct 8, 2020

mhvk reopened this Oct 8, 2020

seberg added this to the 1.20.0 release milestone Oct 9, 2020

seberg mentioned this issue Oct 20, 2020

DEP,BUG: Coercion/cast of array to a subarray dtype will be fixed #17596

Merged

mattip closed this as completed in #17596 Oct 29, 2020

takluyver mentioned this issue Feb 16, 2021

Unable to modify dataset attribute if it is a nested array h5py/h5py#1814

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Initializing arrays with subarray dtype #17511

Initializing arrays with subarray dtype #17511

mhvk commented Oct 8, 2020

mhvk commented Oct 8, 2020

seberg commented Oct 8, 2020

mhvk commented Oct 8, 2020

mhvk commented Oct 8, 2020

seberg commented Oct 8, 2020

mhvk commented Oct 9, 2020

seberg commented Oct 9, 2020

mhvk commented Oct 10, 2020

Initializing arrays with subarray dtype #17511

Initializing arrays with subarray dtype #17511

Comments

mhvk commented Oct 8, 2020

mhvk commented Oct 8, 2020

seberg commented Oct 8, 2020

mhvk commented Oct 8, 2020

mhvk commented Oct 8, 2020

seberg commented Oct 8, 2020

mhvk commented Oct 9, 2020

seberg commented Oct 9, 2020

mhvk commented Oct 10, 2020