Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected compound datatype construction #15638

Open
cjw85 opened this issue Feb 24, 2020 · 9 comments
Open

Unexpected compound datatype construction #15638

cjw85 opened this issue Feb 24, 2020 · 9 comments

Comments

@cjw85
Copy link

cjw85 commented Feb 24, 2020

When constructing compound types giving a tuple of tuples, instead of a list of tuples as is common in docs, leads to unexpected results.

Reproducing code example:

import numpy as np
print(np.__version__)
print()

dts = [
    [('int',int), ('float',float)],
    (('int',int), ('float',float))
]

for dt in dts:
    print(dt)
    print(np.dtype(dt))
    print(np.full(5, 10, dtype=dt))
    print()

On my system this is producing:

1.18.1 3.6.5 (default, Jun 17 2018, 12:13:06)
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)]

[('int', <class 'int'>), ('float', <class 'float'>)]
[('int', '<i8'), ('float', '<f8')]
[(10, 10.) (10, 10.) (10, 10.) (10, 10.) (10, 10.)]

(('int', <class 'int'>), ('float', <class 'float'>))
int64
[10 10 10 10 10]```

@rossbar
Copy link
Contributor

rossbar commented Feb 25, 2020

The listing of how the dtype constructor interprets input objects is here. From the docs, the accepted inputs to the dtype constructor with type tuple are supposed to have either of the following formats: (flexible_dtype, itemsize) or (fixed_dtype, shape). Your input clearly doesn't conform to either and should fail, but you may have found a corner case - note that the field names you chose (i.e. 'int' and 'float') are also valid type strings. If you take the relevant component of your example and replace the field names with invalid type strings, you get an error:

>>> my_type_specification = (('one', int), ('two', float)) 
>>> np.type(my_type_specification
TypeError                                 Traceback (most recent call last)
<ipython-input-6-dc4df5e4645d> in <module>
----> 1 dt = np.dtype(my_type_specification)

TypeError: data type 'one' not understood

This warrants a closer look - thanks for reporting!

@cjw85
Copy link
Author

cjw85 commented Feb 25, 2020

There are similar examples of my second form in the docs when talking about union types, it is noted though:

This usage is discouraged, however, and the union mechanism is preferred.

@rossbar
Copy link
Contributor

rossbar commented Feb 25, 2020

Good point, that's another tuple-based input that I failed to list above (though the union types are unrelated to the original example).

Note that tuple and list inputs are not interchangeable when it comes to specifying dtypes. For example:

# Specifying a structured dtype with a list
>>> type_specifier = [('time', float), ('temp', float), ('alt', int)]
>>> my_dtype = np.dtype(type_specifier)
>>> my_dtype
dtype([('time', '<f8'), ('temp', '<f8'), ('alt', '<i8')])
>>> my_dtype = np.dtype(tuple(type_specifier))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-61-380fa230aff7> in <module>
----> 1 my_dtype = np.dtype(tuple(my_type))

TypeError: Tuple must have size 2, but has size 3

This is the expected behavior.

The problem arises when tuples with len == 2 are used, as in your original example:

bad_type_specifier = (('int',int), ('float',float))
# This *should* raise a TypeError as the type specifier doesn't
# match one of the accepted input specifications for tuples
my_dtype = np.dtype(bad_type_specifier)

I believe the bug originates in the input checking here:

_convert_from_tuple(PyObject *obj, int align)

@seberg
Copy link
Member

seberg commented Feb 25, 2020

Maybe something like this could work to make things more strict:

diff --git a/numpy/core/src/multiarray/descriptor.c b/numpy/core/src/multiarray/descriptor.c
index eb4f68959..faa416231 100644
--- a/numpy/core/src/multiarray/descriptor.c
+++ b/numpy/core/src/multiarray/descriptor.c
@@ -857,16 +857,27 @@ _try_convert_from_inherit_tuple(PyArray_Descr *type, PyObject *newobj)
     if (new == NULL) {
         goto fail;
     }
+    if (_validate_union_object_dtype(new, conv) < 0) {
+        Py_DECREF(new);
+        goto fail;
+    }
+    else if (!PyDataType_HASFIELDS(conv)) {
+        PyErr_SetString(PyExc_TypeError,
+                "Only structured dtypes can be used as a union base.");
+        goto fail;
+    }
+    else if (PyDataType_HASFIELDS(new) || PyDataType_ISUSERDEF(new)) {
+        PyErr_SetString(PyExc_TypeError,
+                "Can only create union dtype for basic NumPy dtypes and not "
+                "structured ones.");
+        goto fail;
+    }
     if (PyDataType_ISUNSIZED(new)) {
         new->elsize = conv->elsize;
     }
     else if (new->elsize != conv->elsize) {
         PyErr_SetString(PyExc_ValueError,
-                "mismatch in size of old and new data-descriptor");
-        Py_DECREF(new);
-        goto fail;
-    }
-    else if (_validate_union_object_dtype(new, conv) < 0) {
+                        "mismatch in size of old and new data-descriptor");
         Py_DECREF(new);
         goto fail;
     }

@rossbar
Copy link
Contributor

rossbar commented Feb 25, 2020

Thanks for the time spent chasing this down @seberg. An approach along the lines you proposed + additional tests seems like a good approach (+ some extra time for fully grokking the change to the logic :) ).

I'd also advocate adding an additional example to the doc string to highlight the "special" behavior of len-2 tuples for specifying dtypes.

@seberg
Copy link
Member

seberg commented Feb 25, 2020

I think we should do that. Not limiting user datatypes would be possible, but I am currently not sure what the point of an int64 + fields is. You are going to lose that extra informtion in ufuncs, etc. anyway typically. So it seems to me that basic view or a maybe a new union dtype/structured dtype with overlapping fields makes more sense.
In other words: Rather than not restricting user dtypes as I did above, I would prefer to think about deprecating the whole thing unless someone finds a good use-case. It seems way too smart to me.

@seberg
Copy link
Member

seberg commented Jun 27, 2023

A deprecation might have been nice, but overall, I think that we should remove any union/compound dtype construction (except void based ones):

  • There are probably not really any relevant users.
  • In most cases it should be reasonable to work-around with views even if you find a good use-case. Plus, it should lead to (relatively) clean errors and no bad results.

Since this is weird, and removing it should allow simplifications and just make code paths less brittle, we should disallow this for NumPy 2.0. (TBH, I would be surprised if this isn't buggy in places.)

@seberg seberg self-assigned this Oct 25, 2023
@seberg
Copy link
Member

seberg commented Nov 17, 2023

@rossbar do you remember this a bit more? I dug a bit into this, and it sounds like h5py at least at some point was using these. Maybe they stopped, maybe not but...

It makes me think though that it would maybe still be good to force more sanity here. I could imagine mainly:

  1. enforce that if we support a tuple, the tuple consists of (..., dtype_instance), i.e. don't convert the second one to a dtype
  2. Create a dedicated function or kwarg(?) to explicitly create these abominations (sure, you will have to write a hack to fall back if you want to support both old and new).

Aunion_fields=[] kwarg seems like it might be the "clearest" way to allow this in some (assuming i don't hear back from h5py that they don't care). Considering the hot mess it is, I would just remove the old paths here TBH, and force the lone users to put a version switch in place...

@cjw85
Copy link
Author

cjw85 commented Nov 17, 2023

For context, I was using h5py a lot three years ago which is probably how I came across this; though I really don't recall!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Descoped
Development

No branches or pull requests

3 participants