BUG: dtype comparison coerces generously leading to hash problems #7242

NeilGirdhar · 2016-02-13T18:46:14Z

Summary

Make DType objects, their corresponding types, and their string names all compare unequal.

Historical issue

For some reason, dtype objects and the numpy types they are based off of compare equal. They don't hash equal of course, which goes against Python's docs that say

The only required property is that objects which compare equal have the same hash value…

Can we make them compare unequal? Is there any reason for them to compare equal?

In [6]: np.dtype(np.float) == np.float
Out[6]: True

The text was updated successfully, but these errors were encountered:

shoyer · 2016-02-13T19:27:49Z

Can we make them compare unequal? Is there any reason for them to compare equal?

Even worse, dtypes compare equal to strings!

>>> np.dtype(float) == 'float'
True

This a long standing API issue in NumPy. If we were starting over, dtypes definitely should not compare equal to types/strings, but it's too late to change that without a major compatibility break.

NeilGirdhar · 2016-02-13T19:31:33Z

Out of curiosity, where do people use this functionality?

shoyer · 2016-02-13T20:04:37Z

All over the place, in my experience. I don't have any particular examples handy, though :)

NeilGirdhar · 2016-02-13T20:05:15Z

@shoyer Okay, thanks :)

insertinterestingnamehere · 2016-02-13T20:55:55Z

This would matter any time you want to use both numpy dtypes and numpy types as keys in a dictionary. I've been wanting to do something like this to map between numpy types and dynd types. Fortunately, in most cases where you would want to build a mapping from numpy types to some other set of types, numpy's types and dtypes will have the same output value, which ought to hide the bug. It'd be great to get this fixed though.

njsmith · 2016-02-13T21:01:47Z

Unfortunately, the only way we could possibly fix this bug is to first deprecate auto-coercion-to-dtype in ==, and then wait something like 1-2 years for people to clean up their code :-( Without removing auto-coercion it's impossible to make dtype hash values work correctly, because hash(float) and hash("float") are different (notice that this is a fact about CPython itself, nothing to do with numpy), but they're both == to np.dtype(float).

NeilGirdhar · 2016-02-14T03:37:44Z

@njsmith Could we start with the deprecation? At least in the docs?

njsmith · 2016-02-14T06:19:43Z

Deprecations start on the mailing list, so you'd have to bring it up there.

There is a ton of code that does if arr.dtype == float: ..., or other implicit coercions-to-dtype like if issubdtype(arr.dtype, bool): or array(..., dtype=float), including stuff I've personally written with in the last year -- it's the quickest simplest way to do these kinds of checks in numpy, so you're going to want to think carefully about which ones you want to deprecate and if there are ways the transition could can be made less painful, and you'll have an uphill battle ahead in general. Which doesn't necessarily mean it's a bad idea -- dtypes do seriously flaunt Python's normal rules for == -- but it at least gives me pause :-)

NeilGirdhar · 2016-02-14T06:36:16Z

@njsmith I understand. Thanks for explaining. It might be a good idea to at least have easy ways to write that code that does not rely on the weird ==. How would you write the above lines more "cleanly"? The last one is obviously array(…, dtype=np.float)

njsmith · 2016-02-14T22:44:29Z

np.float is float, and neither is a dtype object. (Nor is np.float64.)

In all of these cases the obvious way to fix the implicit coercion would be to make it explicit: arr.dtype == np.dtype(float), np.issubdtype(arr.dtype, np.dtype(bool)), etc. This isn't necessarily very user friendly though... So figuring out if there's a better way to do this would be part of the challenge :-)

NeilGirdhar · 2016-02-14T22:49:36Z

@njsmith Ah, of course. Right, so that's probably the first step.

I don't understand why is np.float is float and not np.float64? When I do np.zeros(10, dtype=np.float).dtype I get np.float64. That is confusing.

rkern · 2016-02-14T22:53:58Z

Early on, numpy.float did indeed used to be an alias for the numpy.float64. However, that shadowed the builtin float type when people did from numpy import * at the interactive prompt. Since it had been released and people were using dtype=numpy.float, we could not just remove the name, so we just changed it to alias the builtin float object instead. dtype=float creates float64 arrays because Python float objects are C doubles underneath (i.e. virtually always a float64).

rkern · 2016-02-14T22:55:26Z

The upshot is: don't use numpy.float. It's a legacy, superfluous, deprecated alias.

NeilGirdhar · 2016-02-14T22:56:01Z

@rkern Wow, yeah that's really unfortunate.

NeilGirdhar · 2016-02-14T22:58:36Z

@rkern Also I don't think that Python promises that float will be 64 bits. I thought it was a CPython design decision?

rkern · 2016-02-14T23:03:20Z

Not in so many words, but numpy will map it to whichever dtype is appropriate (i.e. virtually always a float64).

mattip · 2019-05-07T05:29:27Z

@shoyer did you reopen this intentionally?

shoyer · 2019-05-07T07:09:00Z

Yes, I think so? I imagine I was contemplating this as something to consider for an eventual API cleanup.

NeilGirdhar · 2022-12-23T12:00:47Z

Would it be possible to repair this comparison in numpy.array_api? That is, make np.array_api.bool != 'bool' != np.array_api.bool.type?

And perhaps also edit the Array API standard to make this invariant guaranteed. (Added data-apis/array-api#582 to that effect.)

mattip · 2022-12-23T12:42:55Z

@asmeurer for the Array API standard change

asmeurer · 2023-01-03T04:04:02Z

I got a little lazy and decided not to wrap the dtype objects in numpy.array_api (except for making it so that a.dtype is always the same object as np.array_api.<dtype>, i.e., removing the distinction between dtype classes and instances). That makes it not as strict as it could be since NumPy dtype objects do a lot more than is required by the spec. If this lack of strictness causes issues for people I'm OK with making basic dtype wrapper classes in that submodule.

Regarding the standard, see the cross-referenced issue right above this comment.

NeilGirdhar · 2023-01-03T04:32:25Z

I got a little lazy and decided not to wrap the dtype objects in numpy.array_api

Right, I understand the simplicity there, but unfortunately that means that the Array API inherited the unpythonic behaviour of dtypes in this issue.

If this lack of strictness causes issues for people I'm OK with making basic dtype wrapper classes in that submodule.

I wouldn't say it's an issue for me. However, if you see this comment data-apis/array-api#582 (comment), it appears that the Array API will forbid that behaviour. If the language in the linked comment is accepted, the original dtype objects won't be compliant. (You would need new objects that don't compare equal to strings or the numpy types they contain.)

Also, your implementation of numpy's array API is so elegant. I love how ndarray has only a few properties (data-apis/array-api#583 (comment)). So if the new dtype objects end up being created it would be really cool to pare down their interfaces too. This is just my personal opinion. I know you're allowed to expose whatever methods you want.

(Also, numpy's array API is very exciting!)

rhshadrach · 2023-10-31T00:54:01Z

This was also the cause of nondeterministic behavior in pandas DataFrame.isin (pandas-dev/pandas#32443). In short: because the hash of e.g. np.dtype("int64") is based on a string, and this is nondeterministic by default in Python, you get nondeterministic hash collisions. This happens frequently when your hash table is small.

NeilGirdhar closed this as completed Feb 13, 2016

shoyer mentioned this issue May 29, 2018

Fix dtype=S1 encoding in to_netcdf() pydata/xarray#2158

Merged

4 tasks

shoyer reopened this May 29, 2018

seberg added 00 - Bug component: numpy.dtype labels May 7, 2019

shoyer added the 62 - Python API Changes or additions to the Python API. Mailing list should usually be notified. label May 10, 2019

NeilGirdhar mentioned this issue May 7, 2020

issubdtype errors when using dtype constructor google/jax#3001

Closed

seberg mentioned this issue Feb 24, 2021

dtype richcompare doesn't respect protocol #5345

Closed

seberg changed the title ~~dtype comparison should match its hash~~ BUG: dtype comparison should match its hash Feb 24, 2021

seberg mentioned this issue Feb 24, 2021

in operator with dtype and set does not work as expected #16208

Closed

seberg changed the title ~~BUG: dtype comparison should match its hash~~ BUG: dtype comparison coerces generously leading to problems hashes Feb 24, 2021

brandon-b-miller mentioned this issue Jul 12, 2021

[FEA] Use extension dtype for numerical columns rapidsai/cudf#8707

Closed

NeilGirdhar changed the title ~~BUG: dtype comparison coerces generously leading to problems hashes~~ BUG: DType objects, their corresponding types, and their string names all compare equal, which violates Python's hashing invariant Dec 23, 2022

mattip added the component: numpy.array_api label Dec 23, 2022

NeilGirdhar mentioned this issue Dec 23, 2022

RFC: require that dtypes obey Python hashing rules data-apis/array-api#582

Open

NeilGirdhar changed the title ~~BUG: DType objects, their corresponding types, and their string names all compare equal, which violates Python's hashing invariant~~ BUG: dtype comparison coerces generously leading to hash problems Dec 23, 2022

rhshadrach mentioned this issue Oct 31, 2023

incosistencies in dataframe.dtype.isin() - I am getting inconsistent with different runs on same code. pandas-dev/pandas#32443

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: dtype comparison coerces generously leading to hash problems #7242

BUG: dtype comparison coerces generously leading to hash problems #7242

NeilGirdhar commented Feb 13, 2016 •

edited

shoyer commented Feb 13, 2016

NeilGirdhar commented Feb 13, 2016

shoyer commented Feb 13, 2016

NeilGirdhar commented Feb 13, 2016

insertinterestingnamehere commented Feb 13, 2016

njsmith commented Feb 13, 2016

NeilGirdhar commented Feb 14, 2016

njsmith commented Feb 14, 2016

NeilGirdhar commented Feb 14, 2016

njsmith commented Feb 14, 2016

NeilGirdhar commented Feb 14, 2016

rkern commented Feb 14, 2016

rkern commented Feb 14, 2016

NeilGirdhar commented Feb 14, 2016

NeilGirdhar commented Feb 14, 2016

rkern commented Feb 14, 2016

mattip commented May 7, 2019

shoyer commented May 7, 2019

NeilGirdhar commented Dec 23, 2022 •

edited

mattip commented Dec 23, 2022

asmeurer commented Jan 3, 2023

NeilGirdhar commented Jan 3, 2023 •

edited

rhshadrach commented Oct 31, 2023

BUG: dtype comparison coerces generously leading to hash problems #7242

BUG: dtype comparison coerces generously leading to hash problems #7242

Comments

NeilGirdhar commented Feb 13, 2016 • edited

Summary

Historical issue

shoyer commented Feb 13, 2016

NeilGirdhar commented Feb 13, 2016

shoyer commented Feb 13, 2016

NeilGirdhar commented Feb 13, 2016

insertinterestingnamehere commented Feb 13, 2016

njsmith commented Feb 13, 2016

NeilGirdhar commented Feb 14, 2016

njsmith commented Feb 14, 2016

NeilGirdhar commented Feb 14, 2016

njsmith commented Feb 14, 2016

NeilGirdhar commented Feb 14, 2016

rkern commented Feb 14, 2016

rkern commented Feb 14, 2016

NeilGirdhar commented Feb 14, 2016

NeilGirdhar commented Feb 14, 2016

rkern commented Feb 14, 2016

mattip commented May 7, 2019

shoyer commented May 7, 2019

NeilGirdhar commented Dec 23, 2022 • edited

mattip commented Dec 23, 2022

asmeurer commented Jan 3, 2023

NeilGirdhar commented Jan 3, 2023 • edited

rhshadrach commented Oct 31, 2023

NeilGirdhar commented Feb 13, 2016 •

edited

NeilGirdhar commented Dec 23, 2022 •

edited

NeilGirdhar commented Jan 3, 2023 •

edited