
ENH: better handle dtype creation with metadata #15962

Closed
wants to merge 5 commits into from

Conversation

Contributor

@dmbelov dmbelov commented Apr 13, 2020

This change fixes issue #15488: inconsistent behavior of dtype.descr with metadata. The problem with handling metadata actually appeared in 2016 in the pull request gh-8235 (it was even mentioned there that it breaks creation of a dtype with metadata). I fixed the code in such a way that:

@@ -238,6 +238,20 @@ is_datetime_typestr(char const *type, Py_ssize_t len)
return 0;
}

// This function checks if val == {'name': any tuple}.
// See gh-2865, issue #8235 for details
inline npy_bool _check_if_size_1_and_contains_name_and_tuple(PyObject* val) {
Member

fix formatting and add static

Suggested change
- inline npy_bool _check_if_size_1_and_contains_name_and_tuple(PyObject* val) {
+ static inline npy_bool
+ _check_if_size_1_and_contains_name_and_tuple(PyObject* val) {

Member

Also change to int

# recursive: metadata on the field of a dtype
(np.dtype({'names': ['a', 'b'], 'formats': [
float, np.dtype({'names': ['c'], 'formats': [np.dtype(int, metadata={})]})
- ]}), False)
+ ]}), False, False)
Member

It seems we are now missing cases where fail==True?

Contributor Author

Correct. It no longer fails on np.load. Should I remove the fail option completely?

@mattip mattip changed the title from "Fixed code that creates dtype with metadata, see issue #15488." to "ENH: better handle dtype creation with metadata" on Apr 13, 2020
npy_bool is_ok = PyDict_Size(val) == 1;
if (is_ok) {
PyObject* name = PyUnicode_FromString("name");
if (!name) return 1;
Member

Don't silence MemoryError here:

Suggested change
- if (!name) return 1;
+ if (!name) {
+     return -1;
+ }

The caller will need to handle this specially too.

@@ -238,6 +238,20 @@ is_datetime_typestr(char const *type, Py_ssize_t len)
return 0;
}

// This function checks if val == {'name': any tuple}.
// See gh-2865, issue #8235 for details
Member

For reviewers, gh-2865

@@ -238,6 +238,20 @@ is_datetime_typestr(char const *type, Py_ssize_t len)
return 0;
}

// This function checks if val == {'name': any tuple}.
Member

I don't follow what's special about the word "name"

Contributor Author

I agree, I made things more complicated than they should be. I misunderstood how an "inherited" dtype is initialized. I made a change; please check again.

Contributor Author

dmbelov commented Apr 19, 2020

@eric-wieser, @mattip I have addressed your comments. Can you take a look again? Will this patch be included in the next NumPy release?

@@ -238,6 +238,22 @@ is_datetime_typestr(char const *type, Py_ssize_t len)
return 0;
}

// This function checks if val is mistyped dict of inherited dtype,
// e.g val = {key: tuple of size >= 2}; see gh-2865, issue #8235 for details
Member

I am a bit confused about what actually happens here. Could you give an example of when this function does not return True for is_mistyped? And how does such a descriptor differ from a mistyped one? What about the possibility that the metadata actually includes a correctly typed dtype?
So, is it possible to create dtypes, e.g. with odd metadata, where this logic is broken?

Contributor Author
@dmbelov dmbelov Apr 25, 2020

There are two mutually contradictory uses of a dictionary in the tuple that defines a dtype. Consider the expression np.dtype(('i4', adict)) ('i4' can be replaced by any other valid type):

  1. The dictionary adict here should be interpreted as metadata:
    dt = np.dtype(('i4', {'comment': 'qty'}))
    dt.metadata
    # mappingproxy({'comment': 'qty'})
  2. But if we do this we will break historically working code, convert_from_inherit_tuple (see gh-8235, "BUG: add checks for some invalid structured dtypes. Fixes #2865."), that allows one to split a given type into parts and use up to 2 names to access them. For example:
    dt = np.dtype(('i4', {'part_1': ('i2', 0, 'Part 1'), 'part_2': ('i2', 2, 'Part 2')}))
    dt.descr
    # [(('Part 1', 'part_1'), '<i2'), (('Part 2', 'part_2'), '<i2')]
    In this format, the tuple inside the dictionary must be of size 2 or 3.

Thus, given a dictionary adict, one has to decide whether it is of type 1 or 2. I've just submitted a change that makes this choice very clear. Specifically, adict is a mistyped dict of type 2 if all of the following are true:

  1. all adict values are tuples of size >= 2;
  2. any of the following is true for at least one tuple val:
    a. val[1] is not an integer (this is required to pass existing regression tests);
    b. val[1] is an integer but val[1] < 0;
    c. len(val) >= 4.

I added a description of this algorithm to the docstring of the function numpy.core._internal._is_mistyped_inherited_type_dict.
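
For illustration, here is a minimal Python sketch of that heuristic; the function name and details follow the description above, not the actual code in this PR:

def is_mistyped_inherited_type_dict(adict):
    # Sketch of the rule described above (illustrative only, not the PR's code).
    values = adict.values()
    # 1. all values must be tuples of size >= 2
    if not all(isinstance(val, tuple) and len(val) >= 2 for val in values):
        return False
    # 2. at least one tuple must satisfy (a), (b) or (c)
    for val in values:
        if not isinstance(val[1], int):  # (a) second element is not an integer
            return True
        if val[1] < 0:                   # (b) an integer, but negative
            return True
        if len(val) >= 4:                # (c) four or more elements
            return True
    return False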

Does this resolve your confusion?

Member

Yes, so what I need to figure out is whether we can deprecate that whole other meaning?

Because otherwise we are guessing what the user wants and guessing is bad. E.g.:

In [13]: dt = np.dtype(np.float64, metadata={"original_type": (np.int8, 8)})                                            
In [14]: np.dtype([("f1", dt)]).descr                                                                                   
Out[14]: [('f1', ('<f8', {'original_type': (numpy.int8, 8)}))]

In which case your code will misinterpret the metadata for a dtype when converting the .descr.
Sure, it's a corner case, but if you want to build reliable software based on this guess, I feel the inconsistency will bite you at some point.

Contributor Author

I agree with you: reliability and consistency of software (especially widely used software) are very important. It is a hard decision, and only one of you can make it.

Note that we are already in that inconsistent territory. Recall my original concern: the descr of a dtype cannot be evaluated back into a dtype when metadata is provided.
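
For concreteness, a sketch of the round trip that fails (based on gh-15488; the exact error depends on the NumPy version):

import numpy as np

dt = np.dtype('i4', metadata={'comment': 'qty'})
struct = np.dtype([('a', dt)])
struct.descr
# [('a', ('<i4', {'comment': 'qty'}))]
np.dtype(struct.descr)
# fails on affected versions: the metadata dict is misread as an inherited-dtype dict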

Member

So my feeling is that we should only do this if it is reasonable to put a FutureWarning on the other branch (i.e. remove it). Even in that case, it would be nicer if we could put in the FutureWarning at least one version before going here.
But I am not sure how valid the existing use-case is. It only works with offsets, which I assume are not super-common, but if you use offsets, then it actually seems like a fairly inconsistent and nice way to spell this out.

I would still be a bit more curious about either dropping metadata entirely for .descr or just creating a new way to do this... maybe exposing the buffer-protocol representation (which has its own problems, though).

Contributor Author
@dmbelov dmbelov May 9, 2020

Another idea is to extend the format of the tuple that defines a dtype: a tuple with metadata must have size 3.

  1. np.dtype((type, dict)) will be parsed as an inherited dtype.
  2. np.dtype((type, None or dict_1, metadata)) will be parsed with metadata.
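
A hypothetical illustration of this proposal (this syntax is not implemented anywhere; it only shows the intended shapes):

np.dtype(('i4', {'part_1': ('i2', 0), 'part_2': ('i2', 2)}))    # 1. inherited dtype, as before
np.dtype(('i4', None, {'comment': 'qty'}))                      # 2. metadata only
np.dtype(('i4', {'part_1': ('i2', 0)}, {'comment': 'qty'}))     # 2. inherited dtype plus metadata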

Contributor Author

I have a better idea that will keep the old code (and users of inherited dtypes) working: require metadata to be stored in the dictionary under the key 'metadata'. For example:

np.dtype(('i4', {'part_1': ('i2', 0), 'part_2': ('i2', 2), 'metadata': {"x": 1}}))
np.dtype(('i8', {'metadata': {'x': 1}}))
np.dtype({'names': ['a', 'b'], 'formats': ['i2', 'i2'], 'metadata': {'x': 1}})

In this way we can unambiguously infer the dtype from the dict: the dict is either an inherited dtype or a dict with keys 'names', 'formats', 'offsets', ...

What do you think?

Member
@eric-wieser eric-wieser May 9, 2020

Why do we need that when we have np.dtype(type, metadata=...) though? (Apologies if I've forgotten something, I've not looked at this in a while).

Contributor Author

As far as I understand, we want to be able to create a dtype from a list and to specify metadata for individual fields, e.g.

np.dtype([('a', ('i8', {'x': 1})), ('b', ('f8', {'y': 1}))], metadata={'z': 1})

The metadata keyword only allows one to specify metadata for the final (outer) dtype.
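
For example (illustrative; field dtypes created from plain type strings carry no metadata):

dt = np.dtype([('a', 'i8'), ('b', 'f8')], metadata={'z': 1})
dt.metadata       # mappingproxy({'z': 1})
dt['a'].metadata  # None -- the keyword cannot attach metadata to individual fields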

This whole discussion arose because adding metadata to a dtype breaks construction of the dtype from its descr.

Base automatically changed from master to main March 4, 2021 02:04
@dmbelov dmbelov closed this by deleting the head repository Dec 30, 2022