Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: add checks for some invalid structured dtypes. Fixes #2865. #8235

Merged
merged 1 commit into from
Nov 25, 2016

Conversation

J-Sand
Copy link
Contributor

@J-Sand J-Sand commented Nov 4, 2016

This also fixes a similar bug I found while investigating the issue: with structured dtypes of the form ('i', {'name': ('i', offset, 'optional title')}), if the inner tuple has the wrong number or types of items, a bad python API call is made, resulting in a SystemError.

I'm completely new to the numpy source, so I may have made some mistakes. In particular, I wasn't certain how to deal with a dtype like ('O', [('name', 'O')]). As far as I can tell, it's the only structured dtype of this form containing an object dtype that currently works, so I've allowed it as a special case. On the other hand, it seems pretty useless.

@charris
Copy link
Member

charris commented Nov 5, 2016

@ahaldane Heads up.

Copy link
Member

@ahaldane ahaldane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, nice catch and solution!

@@ -287,7 +287,7 @@ _convert_from_tuple(PyObject *obj)
type->elsize = itemsize;
}
}
else if (PyDict_Check(val) || PyDictProxy_Check(val)) {
else if (type->metadata && (PyDict_Check(val) || PyDictProxy_Check(val))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is fine, but I just want to note for posterity that this whole if-block is weird undocumented behavior. (as are a number of things related to metadata).

(Actually, conceivably the "correct" behavior here if type->metadata == NULL might be to do type->metadata = val, eg compare to the end of convert_from_dict. But that's not clear.).

In the future we might consider removing the block anyway, so I'm fine with this as-is.

goto fail;
}
}
else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this whole else block be replaced by

if (PyDataType_REFCHK(conv)) {
    goto fail;
}

?

That would also catch the case there are objects in nested structures.

assert_raises(ValueError, np.dtype,
('i', {'name': ('i', 'wrongtype', 'title')}))
# this one is allowed as a special case though
a = np.ones(1, dtype=('O', [('name', 'O')]))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it important that this example works? If not, I would be tempted to altogether disallow the (base_dtype, new_dtype) form of specification if either base_dtype or new_dtype have objects (tested with PyDataType_REFCHK).

Note that similar operations are already disallowed:

>>> np.zeros(3, dtype='O').view(dtype([('name', 'O')]))
TypeError: Cannot change data-type for object array.

That wouldn't allow your special case though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the feedback. I noticed that test_dtype.py already has a test that requires that this case works:

def test_base_dtype_with_object_type(self):
    # Issue gh-2798, should not error.
    np.array(['a'], dtype="O").astype(("O", [("name", "O")]))

There is a discussion at #2798, but I don't really understand the rationale. Hmm, I also notice that you can currently have a dtype with object fields in the first element of the tuple, like this:

>>> np.zeros(1, dtype=('O,O', 'O,i8'))
array([(0, 0)], 
      dtype=[('f0', 'O'), ('f1', '<i8')])

or even multiple fields in the first element and just 'O' in the second one, like this:

>>> np.zeros(1, dtype=('i4,i4', 'O'))
array([(0, 0)], 
      dtype=[('f0', '<i4'), ('f1', '<i4')])

but this seems especially useless, since one of the two dtypes in the tuple is just ignored (as long as it has the correct size), even if it doesn't make sense to map it to the other one.

So should we allow all of the dtypes that currently work just in case somebody is using them for some reason, or just ones that kind of make sense (e.g. ('O,O', 'O,O')), or just the one that is already tested for, or none of them?

Copy link
Member

@ahaldane ahaldane Nov 23, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, #2798 suggests that some people (eg h5py) have wanted to be able to create dtype(("O", [("name", "O")])), in order to add a field name. I would have liked to simply disallow that, but maybe we can't.

Its probably safe to assume no-one is doing anything like dtype(('O,O','O,i8')), so my suggestion is to disallow all cases involving objects (by RECHCK on both elements of the tuple) except the special case you already have for a single field.

I am a little tempted to add a deprecation warning for a release cycle, but it's probably such a rare problem that I would rather just go ahead with the change. But since it can break code, we should clearly document it. So please add something to doc/release/1.13.0-notes.rst decsribing what will break.

By the way, the example causing problems in h5py is in h5py/h5py#217. I tried the example, but it looks like it no longer involves the special case dtype. Also, incidentally, after #6053 gets in they will be able to write np.zeros(3, dtype='O').astype([('name', 'O')]) instead (this is currently disallowed). So there will be a better syntax for adding a field.

Also, by the way, I once wrote a lot of code in #5548 (in _check_field_overlap) to carefully check whether views like 'O,O' to 'O,i8' would clobber pointers or not, by searching out the positions of all the pointers. But it was a huge complicated affair we got rid of later because it was too slow. That's why I favor the stupid solution of simply disallowing views involving objects.

@charris
Copy link
Member

charris commented Nov 23, 2016

As future reference, it is preferable to put the Fixes ... comment in the commit message body. Maybe need to update some documentation somewhere...

@J-Sand J-Sand force-pushed the invalid-structured-dtypes-fix branch from a91be86 to cb65c8a Compare November 24, 2016 00:17
@J-Sand
Copy link
Contributor Author

J-Sand commented Nov 24, 2016

OK, I think I've covered everything, though I was slightly unsure where to put the release note.

Copy link
Member

@ahaldane ahaldane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you address the two comments and squash I think it's good to go.

* people have been using to add a field to an object array without fields
*/
static int
validate_structured_object_dtype(PyArray_Descr *new, PyArray_Descr *conv)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might like a less general function name. Maybe invalid_union_object_dtype?

if (!PyDataType_REFCHK(new) && !PyDataType_REFCHK(conv)) {
return 0;
}
if (PyDataType_HASFIELDS(new)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think by organizing the tests this way you are trying to allow cases like np.dtype(('i4', 'O')). But I think we should rule that out too: Consider the strange fact that np.dtype(('i8', 'O')).hasobject is True.

Maybe reorganize the if-statements here a little, eg start with if (new->kind != 'O'), then if (!PyDataType_HASFIELDS(conv)) , then if (PyTuple_GET_SIZE(names) != 1) an so on, an fail if any is true.

@J-Sand J-Sand force-pushed the invalid-structured-dtypes-fix branch from 452061b to 9fe73dd Compare November 24, 2016 21:02
@ahaldane
Copy link
Member

Looks good. I'll merge in a few minutes.

Thanks @J-Sand !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants