
ENH: better handle dtype creation with metadata #15962

Closed
wants to merge 5 commits into from

Conversation

Contributor

@dmbelov dmbelov commented Apr 13, 2020

This change fixes issue #15488: inconsistent behavior of dtype.descr with metadata. The problem with handling metadata actually appeared in 2016 in the pull request gh-8235 (it was even mentioned there that it breaks creation of a dtype with metadata). I fixed the code in such a way that:

@@ -238,6 +238,20 @@ is_datetime_typestr(char const *type, Py_ssize_t len)
return 0;
}

// This function checks if val == {'name': any tuple}.
// See gh-2865, issue #8235 for details
inline npy_bool _check_if_size_1_and_contains_name_and_tuple(PyObject* val) {
Member

fix formatting and add static

Suggested change
- inline npy_bool _check_if_size_1_and_contains_name_and_tuple(PyObject* val) {
+ static inline npy_bool
+ _check_if_size_1_and_contains_name_and_tuple(PyObject* val) {

Member

Also change to int

# recursive: metadata on the field of a dtype
(np.dtype({'names': ['a', 'b'], 'formats': [
float, np.dtype({'names': ['c'], 'formats': [np.dtype(int, metadata={})]})
- ]}), False)
+ ]}), False, False)
Member

It seems we are now missing cases where fail==True?

Contributor Author

Correct. It no longer fails on np.load. Should I remove the fail option completely?

@mattip mattip changed the title from "Fixed code that creates dtype with metadata, see issue #15488." to "ENH: better handle dtype creation with metadata" on Apr 13, 2020
npy_bool is_ok = PyDict_Size(val) == 1;
if (is_ok) {
PyObject* name = PyUnicode_FromString("name");
if (!name) return 1;
Member

Don't silence MemoryError here:

Suggested change
- if (!name) return 1;
+ if (!name) {
+     return -1;
+ }

The caller will need to handle this specially too.

@@ -238,6 +238,20 @@ is_datetime_typestr(char const *type, Py_ssize_t len)
return 0;
}

// This function checks if val == {'name': any tuple}.
// See gh-2865, issue #8235 for details
Member

For reviewers, gh-2865

@@ -238,6 +238,20 @@ is_datetime_typestr(char const *type, Py_ssize_t len)
return 0;
}

// This function checks if val == {'name': any tuple}.
Member

I don't follow what's special about the word "name"

Contributor Author

I agree, I made things more complicated than they should be. I misunderstood how an "inherited" dtype is initialized. I made a change; please check again.

Contributor Author

dmbelov commented Apr 19, 2020

@eric-wieser, @mattip I have addressed your comments. Can you take a look again? Will this patch be included in the next NumPy release?

@@ -238,6 +238,22 @@ is_datetime_typestr(char const *type, Py_ssize_t len)
return 0;
}

// This function checks if val is mistyped dict of inherited dtype,
// e.g val = {key: tuple of size >= 2}; see gh-2865, issue #8235 for details
Member

I am a bit confused about what actually happens here. Could you give an example of when this function does not return True for is_mistyped? And how does such a descriptor differ from a mistyped one? What about the possibility that the metadata actually includes a correctly typed dtype?
So, is it possible to create dtypes, e.g. with odd metadata, where this logic is broken?

Contributor Author
@dmbelov dmbelov Apr 25, 2020

There are two mutually contradictory uses of a dictionary in the tuple that defines a dtype. Consider the expression np.dtype(('i4', adict)) ('i4' can be replaced by any other valid type):

  1. The dictionary adict here should be interpreted as metadata:
    dt = np.dtype(('i4', {'comment': 'qty'}))
    dt.metadata
    # mappingproxy({'comment': 'qty'})
  2. But if we do this we will break historically working code, convert_from_inherit_tuple (see gh-8235, "BUG: add checks for some invalid structured dtypes. Fixes #2865."), that allows one to split a given type into parts and use up to 2 names to access them. For example:
    dt = np.dtype(('i4', {'part_1': ('i2', 0, 'Part 1'), 'part_2': ('i2', 2, 'Part 2')}))
    dt.descr
    # [(('Part 1', 'part_1'), '<i2'), (('Part 2', 'part_2'), '<i2')]
    In this format, the tuple inside the dictionary must be of size 2 or 3.

Thus, given a dictionary adict, one has to decide whether it is of type 1 or 2. I've just submitted a change that makes this choice very clear. Specifically, adict is a mistyped dict of type 2 if all of the following are true:

  1. all adict values are tuples of size >= 2;
  2. any of the following is true for at least one tuple val:
    a. val[1] is not an integer (this is required to pass existing regression tests);
    b. val[1] is an integer but val[1] < 0;
    c. len(val) >= 4.

I added a description of this algorithm to the docstring of the function numpy.core._internal._is_mistyped_inherited_type_dict.
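
For illustration, here is a minimal Python sketch of that heuristic; the function name and details follow the description above, not the actual code in this PR:

def is_mistyped_inherited_type_dict(adict):
    # Sketch of the rule described above (illustrative only, not the PR's code).
    values = adict.values()
    # 1. all values must be tuples of size >= 2
    if not all(isinstance(val, tuple) and len(val) >= 2 for val in values):
        return False
    # 2. at least one tuple must satisfy (a), (b) or (c)
    for val in values:
        if not isinstance(val[1], int):  # (a) second element is not an integer
            return True
        if val[1] < 0:                   # (b) an integer, but negative
            return True
        if len(val) >= 4:                # (c) four or more elements
            return True
    return False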

Does this resolve your confusion?

Member

Yes, so what I need to figure out is whether we can deprecate that whole other meaning?

Because otherwise we are guessing what the user wants and guessing is bad. E.g.:

In [13]: dt = np.dtype(np.float64, metadata={"original_type": (np.int8, 8)})                                            
In [14]: np.dtype([("f1", dt)]).descr                                                                                   
Out[14]: [('f1', ('<f8', {'original_type': (numpy.int8, 8)}))]

In which case your code will misinterpret the metadata for a dtype when converting the .descr.
Sure, it's a corner case, but if you want to build reliable software based on this guess, I feel the inconsistency will bite you at some point.

Contributor Author

I agree with you: reliability and consistency of software (especially widely used software) are very important. It is a hard decision, and only one of you can make it.

Note that we are already in that inconsistent territory. Recall my original concern: the descr of a dtype cannot be evaluated back into a dtype when metadata is provided.
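
For concreteness, a sketch of the round trip that fails (based on gh-15488; the exact error depends on the NumPy version):

import numpy as np

dt = np.dtype('i4', metadata={'comment': 'qty'})
struct = np.dtype([('a', dt)])
struct.descr
# [('a', ('<i4', {'comment': 'qty'}))]
np.dtype(struct.descr)
# fails on affected versions: the metadata dict is misread as an inherited-dtype dict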

Member

So my feeling is that we should only do this if it is reasonable to put a FutureWarning on the other branch (i.e. remove it). Even in that case, it would be nicer if we could put in the FutureWarning at least one version before going here.
But I am not sure how valid the existing use-case is. It only works with offsets, which I assume are not super-common, but if you use offsets, then it actually seems like a fairly inconsistent and nice way to spell this out.

I would still be a bit more curious about either dropping metadata entirely for .descr or just creating a new way to do this... maybe exposing the buffer-protocol representation (which has its own problems, though).

Contributor Author
@dmbelov dmbelov May 9, 2020

Another idea is to extend the format of the tuple that defines a dtype: a tuple with metadata must have size 3.

  1. np.dtype((type, dict)) will be parsed as an inherited dtype.
  2. np.dtype((type, None or dict_1, metadata)) will be parsed with metadata.
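
A hypothetical illustration of this proposal (this syntax is not implemented anywhere; it only shows the intended shapes):

np.dtype(('i4', {'part_1': ('i2', 0), 'part_2': ('i2', 2)}))    # 1. inherited dtype, as before
np.dtype(('i4', None, {'comment': 'qty'}))                      # 2. metadata only
np.dtype(('i4', {'part_1': ('i2', 0)}, {'comment': 'qty'}))     # 2. inherited dtype plus metadata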

Contributor Author

I have a better idea that will keep the old code (and users of inherited dtypes) working: require metadata to be stored in the dictionary under the key 'metadata'. For example:

np.dtype(('i4', {'part_1': ('i2', 0), 'part_2': ('i2', 2), 'metadata': {"x": 1}}))
np.dtype(('i8', {'metadata': {'x': 1}}))
np.dtype({'names': ['a', 'b'], 'formats': ['i2', 'i2'], 'metadata': {'x': 1}})

In this way we can unambiguously infer the dtype from the dict: the dict is either an inherited dtype or a dict with keys 'names', 'formats', 'offsets', ...

What do you think?

Member
@eric-wieser eric-wieser May 9, 2020

Why do we need that when we have np.dtype(type, metadata=...) though? (Apologies if I've forgotten something, I've not looked at this in a while).

Contributor Author

As far as I understand, we want to be able to create a dtype from a list and to specify metadata for individual fields, e.g.

np.dtype([('a', ('i8', {'x': 1})), ('b', ('f8', {'y': 1}))], metadata={'z': 1})

The metadata keyword only allows one to specify metadata for the final (outer) dtype.
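
For example (illustrative; field dtypes created from plain type strings carry no metadata):

dt = np.dtype([('a', 'i8'), ('b', 'f8')], metadata={'z': 1})
dt.metadata       # mappingproxy({'z': 1})
dt['a'].metadata  # None -- the keyword cannot attach metadata to individual fields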

This whole discussion arose because adding metadata to a dtype breaks construction of the dtype from its descr.

Base automatically changed from master to main March 4, 2021 02:04
@dmbelov dmbelov closed this by deleting the head repository Dec 30, 2022