Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initializing arrays with subarray dtype #17511

Closed
mhvk opened this issue Oct 8, 2020 · 8 comments · Fixed by #17596
Closed

Initializing arrays with subarray dtype #17511

mhvk opened this issue Oct 8, 2020 · 8 comments · Fixed by #17596

Comments

@mhvk
Copy link
Contributor

mhvk commented Oct 8, 2020

#17419, which deprecated the use of structured arrays composed of just a subarray, is causing a few problems in astropy (see astropy/astropy#10815). While these can be fixed with the suggestion made by @seberg, one of the fixes is not very nice: In our case, it turns out all problems are with np.array([], dtype=subarray), which means we have to special-case an empty list.

But I wonder if it doesn't make sense to try to move towards the case where

np.array(data, dtype)

always returns an array with the requested dtype, and for data=[] it would be equivalent to np.empty(0, dtype). Certainly, it is strange that the first of the following works but the second does not:

>>> import numpy as np
>>> np.array(([1, 2], 3), dtype='(2,)i,i')
array(([1, 2], 3), dtype=[('f0', '<i4', (2,)), ('f1', '<i4')])
>>> np.array(([1, 2],), dtype='(2,)i')
TypeError: int() argument must be a string, a bytes-like object or a number, not 'tuple'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: setting an array element with a sequence.

Of course, I realize it is not easy to move from arguably mistaken behaviour to more logical one, but looking at our problem cases at least, I realize that in those we basically assumed the more logical one (i.e., they were bugs that somehow failed to get exposed)...

Concretely, I wonder if one could warn and move to more logical behaviour in one go (or give the option to use the more logical behaviour or so).

@mhvk
Copy link
Contributor Author

mhvk commented Oct 8, 2020

Never mind. This behaviour is more ingrained than I thought, so will be hard to change, I think. Let me just close this.

@mhvk mhvk closed this as completed Oct 8, 2020
@seberg
Copy link
Member

seberg commented Oct 8, 2020

Well, one issue is that arr.astype("(2)i") or the equivalent np.array(arr, dtype="(2)i") misbehave. Now it may be possible to retain the current behaviour for nested list of tuples (not the second case you are listed above though!).

In that case, I think we would have to tag on a FutureWarning on the array-like cases above, but that may be fine. Those are normally errors currently.

@mhvk
Copy link
Contributor Author

mhvk commented Oct 8, 2020

Actually, let me re-open, as I got only more and more confused: Why is it OK that np.empty returns a dtype inconsistent with that requested, but not for np.array([], dtype) to just do the same thing?

np.empty(0, dtype='(2,)f4')
Out[20]: array([], shape=(0, 2), dtype=float32)

@mhvk mhvk reopened this Oct 8, 2020
@mhvk
Copy link
Contributor Author

mhvk commented Oct 8, 2020

I guess if the semantics are really that the shape of a single subarray gets appended to the shape of the data, then maybe it is best to just try to be more consistent in that?

@seberg
Copy link
Member

seberg commented Oct 8, 2020

Yes, the semantics of appending the dimension is well defined. The problem is the sematics of assigning the data to the array. If we assign after "appending" the dtype shape into the array shape, you basically have incorrect broadcasting.

This currently works fine if you include tuples in a nested list (the new code has more difficulties with it). But when there is an array-like (or nested array-likes), the problem is that:

dtype = np.dtype("(2)i")
res = np.array(arr)

becomes:

res = np.empty(arr.shape + dtype.shape, dtype=dtype.base)
res[...] = arr

Which means that arr suddenly gets broadcast against dimensions that were not there previously.

For arbitrary input, the shape can be discovered with from numpy.core._multiarray_umath import _discover_array_parameters on master. And this function stops when it finds the first tuple(), so it does not discover tuples as dimensions.

The new code currently has slight issues with that. The old code basically just worked in that case, as long as the tuples had the right "depth" (number of dimensions), since discovered_shape + dtype.shape works out right.


Now for the desired behaviour: Arguably, the correct behaviour is this:

res_with_subarray = np.empty(shape, dtype=dtype, do_not_append_dimensions=True)
res_with_subarray[...] = arr
res = res_with_subarray.append_subarray_dimensions()

Which I can do for the tuple case, but behaves different from the (arguably broken) case array case. I am happy to try that, we just have to make a choice to:

  1. Add a bit of hacks to try and give a warning for the currently broken array case
  2. Rip of the band-aid on it and hope that nobody has working code relying on the arguably buggy behaviour.

In either case, we internally need the do_not_append_dimensions=True, but that is a pretty simple view on the C-side.

(Hopefully this made sense.)

@mhvk
Copy link
Contributor Author

mhvk commented Oct 9, 2020

From the purely selfish astropy perspective, it would be nice if np.array([], dtype) continued to be equivalent to np.empty(0, dtype) - though if that means special-casing yet more perhaps better not!

More generally, it would seem better to move further towards the correct behaviour you describe. I think warning on any array-like input might make sense - it would seem better to restrict for those rathre than include warning for cases like the example in #17173, which seems unambiguous. But I obviously haven't really thought this through...

Separately, it would be nice if there were an option to get the dtype one actually asked for, even if it were a subarray (though I guess one can always fake it with things like '(2,2)f4,V0'; it seems remarkably difficult to do otherwise!).

@seberg seberg added this to the 1.20.0 release milestone Oct 9, 2020
@seberg
Copy link
Member

seberg commented Oct 9, 2020

OK, still good to know that this affects projects like astropy, have to put this on my pre 1.20 todo list...

Giving a warning may for array-likes may be slightly annoying, but we could gamble that it is fine to give it pretty indiscriminately and a bit imprecise (i.e. no matter if the result is currently an error and will start working, or will change behaviour). The main issue is that I do not see how to opt in to the new behaviour.

I agree the list of tuples case is clear, and (hopefully) well defined. The way it works in 1.19, is not the way I say above, but that probably ends up doing the same thing (or so I hope).

I don't really feel like going into the depth of actually allowing such dtypes to be attached, you can get far with a structure with a single element, but specifically for coercion from lists of tuples that has different behaviour (expects one extra nesting level in the tuple).
In the end, that would probably require some type of flag. I am not convinced that the current behaviour of appending dimensions to the array is useful outside of indexing (where it could be special cased). But, honestly, I don't see how to change that -- aside from creating a new structured (and maybe subarray) DType to replace void, which probably makes sense in any case.

@mhvk
Copy link
Contributor Author

mhvk commented Oct 10, 2020

A warning just for sub-array dtype and just array-likes or non-simple tuples would at least be less intrusive.

Agreed that it is hard to see an opt-in path forward apart from a special dtype. And perhaps it also matters less than I thought: I realize that I was wrong about it being so hard to get the right dtype - it really should have been a dtype like [('field', '(2,2)f4')], in which case the data in the np.array call has to be different too, but that is fine. I guess I've always been a bit confused that something like 2f4,2f4 becomes a structured array, but 2f4 does not.

seberg added a commit to seberg/numpy that referenced this issue Oct 21, 2020
This currently appends the subarray dtype dimensions first and then
tries to assign to the result array which uses incorrect broadcasting
(broadcasting against the subarray dimensions instead of repeating
each element according to the subarray dimensions).

This also fixes the python scalar pathway `np.array(2, dtype="(2)f4,")`
which previously only filled the first value.  I consider that a clear
bug fix.

Closes numpygh-17511
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants