Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: dtype comparison coerces generously leading to hash problems #7242

Open
NeilGirdhar opened this issue Feb 13, 2016 · 23 comments
Open

BUG: dtype comparison coerces generously leading to hash problems #7242

NeilGirdhar opened this issue Feb 13, 2016 · 23 comments
Labels
00 - Bug 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. component: numpy.array_api component: numpy.dtype

Comments

@NeilGirdhar
Copy link
Contributor

NeilGirdhar commented Feb 13, 2016

Summary

Make DType objects, their corresponding types, and their string names all compare unequal.

Historical issue

For some reason, dtype objects and the numpy types they are based off of compare equal. They don't hash equal of course, which goes against Python's docs that say

The only required property is that objects which compare equal have the same hash value…

Can we make them compare unequal? Is there any reason for them to compare equal?

In [6]: np.dtype(np.float) == np.float
Out[6]: True
@shoyer
Copy link
Member

shoyer commented Feb 13, 2016

Can we make them compare unequal? Is there any reason for them to compare equal?

Even worse, dtypes compare equal to strings!

>>> np.dtype(float) == 'float'
True

This a long standing API issue in NumPy. If we were starting over, dtypes definitely should not compare equal to types/strings, but it's too late to change that without a major compatibility break.

@NeilGirdhar
Copy link
Contributor Author

Out of curiosity, where do people use this functionality?

@shoyer
Copy link
Member

shoyer commented Feb 13, 2016

All over the place, in my experience. I don't have any particular examples handy, though :)

@NeilGirdhar
Copy link
Contributor Author

@shoyer Okay, thanks :)

@insertinterestingnamehere
Copy link
Contributor

This would matter any time you want to use both numpy dtypes and numpy types as keys in a dictionary. I've been wanting to do something like this to map between numpy types and dynd types. Fortunately, in most cases where you would want to build a mapping from numpy types to some other set of types, numpy's types and dtypes will have the same output value, which ought to hide the bug. It'd be great to get this fixed though.

@njsmith
Copy link
Member

njsmith commented Feb 13, 2016

Unfortunately, the only way we could possibly fix this bug is to first deprecate auto-coercion-to-dtype in ==, and then wait something like 1-2 years for people to clean up their code :-( Without removing auto-coercion it's impossible to make dtype hash values work correctly, because hash(float) and hash("float") are different (notice that this is a fact about CPython itself, nothing to do with numpy), but they're both == to np.dtype(float).

@NeilGirdhar
Copy link
Contributor Author

@njsmith Could we start with the deprecation? At least in the docs?

@njsmith
Copy link
Member

njsmith commented Feb 14, 2016

Deprecations start on the mailing list, so you'd have to bring it up there.

There is a ton of code that does if arr.dtype == float: ..., or other implicit coercions-to-dtype like if issubdtype(arr.dtype, bool): or array(..., dtype=float), including stuff I've personally written with in the last year -- it's the quickest simplest way to do these kinds of checks in numpy, so you're going to want to think carefully about which ones you want to deprecate and if there are ways the transition could can be made less painful, and you'll have an uphill battle ahead in general. Which doesn't necessarily mean it's a bad idea -- dtypes do seriously flaunt Python's normal rules for == -- but it at least gives me pause :-)

@NeilGirdhar
Copy link
Contributor Author

@njsmith I understand. Thanks for explaining. It might be a good idea to at least have easy ways to write that code that does not rely on the weird ==. How would you write the above lines more "cleanly"? The last one is obviously array(…, dtype=np.float)

@njsmith
Copy link
Member

njsmith commented Feb 14, 2016

np.float is float, and neither is a dtype object. (Nor is np.float64.)

In all of these cases the obvious way to fix the implicit coercion would be to make it explicit: arr.dtype == np.dtype(float), np.issubdtype(arr.dtype, np.dtype(bool)), etc. This isn't necessarily very user friendly though... So figuring out if there's a better way to do this would be part of the challenge :-)

@NeilGirdhar
Copy link
Contributor Author

@njsmith Ah, of course. Right, so that's probably the first step.

I don't understand why is np.float is float and not np.float64? When I do np.zeros(10, dtype=np.float).dtype I get np.float64. That is confusing.

@rkern
Copy link
Member

rkern commented Feb 14, 2016

Early on, numpy.float did indeed used to be an alias for the numpy.float64. However, that shadowed the builtin float type when people did from numpy import * at the interactive prompt. Since it had been released and people were using dtype=numpy.float, we could not just remove the name, so we just changed it to alias the builtin float object instead. dtype=float creates float64 arrays because Python float objects are C doubles underneath (i.e. virtually always a float64).

@rkern
Copy link
Member

rkern commented Feb 14, 2016

The upshot is: don't use numpy.float. It's a legacy, superfluous, deprecated alias.

@NeilGirdhar
Copy link
Contributor Author

@rkern Wow, yeah that's really unfortunate.

@NeilGirdhar
Copy link
Contributor Author

@rkern Also I don't think that Python promises that float will be 64 bits. I thought it was a CPython design decision?

@rkern
Copy link
Member

rkern commented Feb 14, 2016

Not in so many words, but numpy will map it to whichever dtype is appropriate (i.e. virtually always a float64).

@mattip
Copy link
Member

mattip commented May 7, 2019

@shoyer did you reopen this intentionally?

@shoyer
Copy link
Member

shoyer commented May 7, 2019

Yes, I think so? I imagine I was contemplating this as something to consider for an eventual API cleanup.

@shoyer shoyer added the 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. label May 10, 2019
@seberg seberg changed the title dtype comparison should match its hash BUG: dtype comparison should match its hash Feb 24, 2021
@seberg seberg changed the title BUG: dtype comparison should match its hash BUG: dtype comparison coerces generously leading to problems hashes Feb 24, 2021
@NeilGirdhar
Copy link
Contributor Author

NeilGirdhar commented Dec 23, 2022

Would it be possible to repair this comparison in numpy.array_api? That is, make np.array_api.bool != 'bool' != np.array_api.bool.type?

And perhaps also edit the Array API standard to make this invariant guaranteed. (Added data-apis/array-api#582 to that effect.)

@NeilGirdhar NeilGirdhar changed the title BUG: dtype comparison coerces generously leading to problems hashes BUG: DType objects, their corresponding types, and their string names all compare equal, which violates Python's hashing invariant Dec 23, 2022
@mattip
Copy link
Member

mattip commented Dec 23, 2022

@asmeurer for the Array API standard change

@NeilGirdhar NeilGirdhar changed the title BUG: DType objects, their corresponding types, and their string names all compare equal, which violates Python's hashing invariant BUG: dtype comparison coerces generously leading to hash problems Dec 23, 2022
@asmeurer
Copy link
Member

asmeurer commented Jan 3, 2023

I got a little lazy and decided not to wrap the dtype objects in numpy.array_api (except for making it so that a.dtype is always the same object as np.array_api.<dtype>, i.e., removing the distinction between dtype classes and instances). That makes it not as strict as it could be since NumPy dtype objects do a lot more than is required by the spec. If this lack of strictness causes issues for people I'm OK with making basic dtype wrapper classes in that submodule.

Regarding the standard, see the cross-referenced issue right above this comment.

@NeilGirdhar
Copy link
Contributor Author

NeilGirdhar commented Jan 3, 2023

I got a little lazy and decided not to wrap the dtype objects in numpy.array_api

Right, I understand the simplicity there, but unfortunately that means that the Array API inherited the unpythonic behaviour of dtypes in this issue.

If this lack of strictness causes issues for people I'm OK with making basic dtype wrapper classes in that submodule.

I wouldn't say it's an issue for me. However, if you see this comment data-apis/array-api#582 (comment), it appears that the Array API will forbid that behaviour. If the language in the linked comment is accepted, the original dtype objects won't be compliant. (You would need new objects that don't compare equal to strings or the numpy types they contain.)

Also, your implementation of numpy's array API is so elegant. I love how ndarray has only a few properties (data-apis/array-api#583 (comment)). So if the new dtype objects end up being created it would be really cool to pare down their interfaces too. This is just my personal opinion. I know you're allowed to expose whatever methods you want.

(Also, numpy's array API is very exciting!)

@rhshadrach
Copy link

This was also the cause of nondeterministic behavior in pandas DataFrame.isin (pandas-dev/pandas#32443). In short: because the hash of e.g. np.dtype("int64") is based on a string, and this is nondeterministic by default in Python, you get nondeterministic hash collisions. This happens frequently when your hash table is small.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
00 - Bug 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. component: numpy.array_api component: numpy.dtype
Projects
None yet
Development

No branches or pull requests

9 participants