Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

datetime64 breaks equality and hash invariant. #3836

Open
mdickinson opened this issue Sep 30, 2013 · 10 comments
Open

datetime64 breaks equality and hash invariant. #3836

mdickinson opened this issue Sep 30, 2013 · 10 comments

Comments

@mdickinson
Copy link
Contributor

It looks as though the datetime64 dtype breaks the Python rule that x == y should imply hash(x) == hash(y). This broke a Pandas application that was grouping on dates, and then doing a dictionary lookup to find the lines of a DataFrame associated to a particular date.

Python 3.3.2 (default, May 21 2013, 11:48:51) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> numpy.__version__
'1.7.1'
>>> import datetime
>>> x = numpy.datetime64(datetime.datetime.now())
>>> y = numpy.datetime64(x, 'ns')
>>> x == y  # gives the expected 'True'
True
>>> hash(x) == hash(y)  # expected True
False
@juliantaylor
Copy link
Contributor

very likely related to #3800 which fixes integer hashing. Probably a similar fix can be applied.

@juliantaylor
Copy link
Contributor

currently numpy just hashes the int64, which breaks the invariant if two datetime represent the same time in different units.
preserving this invariant is actually pretty tricky as one has to convert all datetimes to a common unit which is not always possible with int64.
Best I can think of is doing the conversion in PyLong for arbitrary precision, it could possibly be improved in speed by using a int128 type if available (e.g. gcc >= 4.6)

can timezone based times (Y/M/W/D) ever compare equal to non zone units (s/ms...)?

@njsmith
Copy link
Member

njsmith commented Oct 2, 2013

I don't think there's any real concern about speed here, e.g. just blowing
everything up into two 64 bit integers would be totally fine?

@juliantaylor
Copy link
Contributor

I don't know anything about datetime (for what is this undocumented num variable in the metadata(?!)), but I think this should work:

def hash(input)
  basemeta = smallest_unit // attoseconds
  PyLong num, PyLong denom = get_base_conversion(input, basemeta)
  if timezone:
     base = get_timezone_to_D_base_conversion() // 356 12, 7
     num, denom = mix(base, num, denom)

  PyLong result = convert_to_base_unit(num, denom, input.asPyLong)
  return hash(result)

which basically means reimplementing a bunch of the existing conversion stuff (for casting) in PyLong.

@juliantaylor
Copy link
Contributor

do we still want to do this for 1.8 or defer it to a 1.8.1?

@njsmith
Copy link
Member

njsmith commented Oct 5, 2013

This bug is already there in 1.7, right? And has no PR yet? I think we just leave it for 1.8.1/1.9.

@knutae
Copy link

knutae commented May 24, 2019

Ah, this explains some unexpected behavior I've been seeing.

Python 3.6.8 (default, Apr  9 2019, 04:59:38) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import numpy
>>> x = numpy.datetime64(1512345678910, "ms")
>>> y = numpy.datetime64(x, "ns")
>>> x == y
True
>>> [x] == [y]
True
>>> {x} == {y}
False
>>> y in [x]
True
>>> y in {x}
False

It means that datetime64 objects, or tuples containing them, cannot consistently be used in sets, or as keys in dictionaries.

@seberg seberg self-assigned this May 24, 2019
@seberg seberg removed their assignment Oct 1, 2019
@eric-wieser
Copy link
Member

Another way in which this fails today:

now = datetime.datetime.now()
np_now = np.datetime64(now)
assert now == np_now
assert hash(now) == hash(np_now)  # fails

@seberg
Copy link
Member

seberg commented Jan 24, 2023

Pandas has a PR open now about this here: pandas-dev/pandas#50960 which may make sense to upstream as is or similar. (Sorry had the wrong link first)

@seberg
Copy link
Member

seberg commented Feb 15, 2023

For (a bit) of visibility: The PR currently converts to datetime and if that is out of range does something similar to a tuple hash: hash((year, month, day, minute, second, us, ps, as)).
I don't think this is very fixed, I am relatively happy with that scheme, although I might consider combining multiple values (i.e. hash((..., minute * 60 + second))) when we upstream.

But, if anyone has a thought on a better scheme that would also be good. We do need to extract the year anyway to decide whether we need to use the builtin python hashing.
(The builtin datetime hash, hashes the raw bytes of the datetime struct, so isn't nice to steal.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants