BUG: pd.Timedelta breaks hash invariant #44504

harahu · 2021-11-17T22:06:04Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

pd_td = pd.Timedelta(0)
np_td = np.timedelta64(0)

# One of these should be true
assert (pd_td == np_td and hash(pd_td) == hash(np_td)) or pd_td != np_td

Issue Description

Ref: https://bugs.python.org/issue45832

I was prompted by @rhettinger to bring this up with you.

pd.Timedelta and np.timedelta64 are in violation of the Python hash invariant.

Specifically:

Hashable objects which compare equal must have the same hash value.
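For reference, the invariant can be written as a small predicate. This is only a sketch: the helper name `upholds_hash_invariant` is made up for illustration and is not part of pandas, numpy, or the standard library. It uses built-in numeric types to show what compliance across different types looks like.

```python
from decimal import Decimal
from fractions import Fraction


def upholds_hash_invariant(a, b):
    """Return True unless a == b while hash(a) != hash(b)."""
    return a != b or hash(a) == hash(b)


# Python's built-in numeric types keep the invariant across types.
assert upholds_hash_invariant(1, 1.0)
assert upholds_hash_invariant(1, Fraction(1, 1))
assert upholds_hash_invariant(1, Decimal(1))

# Unequal values satisfy it trivially, whatever their hashes are.
assert upholds_hash_invariant(1, 2)
```

The assertion in the reproducible example is the same predicate applied to `pd_td` and `np_td`.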

Expected Behavior

# Should pass
assert (pd_td == np_td and hash(pd_td) == hash(np_td)) or pd_td != np_td

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.10.0.final.0
python-bits : 64
OS : Darwin
OS-release : 21.1.0
Version : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.21.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.29.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 6.0.0
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None


jorisvandenbossche · 2021-11-19T11:41:01Z

@harahu Thanks for the report.

The problem is that pd.Timedelta can also evaluate equal with the standard library datetime.timedelta. So adding one more to your example:

import datetime

pd_0_dt = pd.Timedelta(0)
np_0_dt = np.timedelta64(0)
stdlib_0_dt = datetime.timedelta(0)

we get:

In [30]: pd_0_dt == stdlib_0_dt
Out[30]: True

In [31]: np_0_dt == stdlib_0_dt
Out[31]: False

In [32]: hash(stdlib_0_dt)
Out[32]: 3010437511937009226

In [33]: hash(pd_0_dt)
Out[33]: 3010437511937009226

In [34]: hash(np_0_dt)
Out[34]: 0

So the hash invariance holds between pd.Timedelta and datetime.timedelta. But since the numpy timedelta doesn't evaluate equal with datetime.timedelta (and has a different hash), it's impossible in the current situation to comply with hash invariance in both cases at the same time.

(in fact in this case the __hash__ being used for pd.Timedelta is the one inherited from datetime.timedelta)
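The conflict described above can be reproduced without pandas or numpy. The sketch below uses two hypothetical stand-in classes: `FakeNpTimedelta` (assumed to hash like its integer value, as the `hash(np_0_dt)` output of 0 suggests) and `BridgeTimedelta` (a `datetime.timedelta` subclass that reuses the stdlib hash). Neither is real library code. They only illustrate why a type that is equal to both sides cannot match both hashes.

```python
import datetime


class FakeNpTimedelta:
    """Hypothetical stand-in for np.timedelta64 (not the real type)."""

    def __init__(self, ns):
        self.ns = ns

    def __eq__(self, other):
        if isinstance(other, FakeNpTimedelta):
            return self.ns == other.ns
        return NotImplemented  # never equal to datetime.timedelta

    def __hash__(self):
        return hash(self.ns)  # assumption: hashes like the integer, so 0 -> 0


class BridgeTimedelta(datetime.timedelta):
    """Hypothetical stand-in for pd.Timedelta: equal to both other types."""

    def __eq__(self, other):
        if isinstance(other, FakeNpTimedelta):
            # Compare in nanoseconds: whole microseconds times 1000.
            return (self // datetime.timedelta(microseconds=1)) * 1000 == other.ns
        return super().__eq__(other)

    # Defining __eq__ resets __hash__, so reuse the stdlib one explicitly.
    __hash__ = datetime.timedelta.__hash__


bridge = BridgeTimedelta(0)
std = datetime.timedelta(0)
fake_np = FakeNpTimedelta(0)

assert bridge == std and hash(bridge) == hash(std)  # invariant holds here
assert bridge == fake_np                            # equal ...
assert hash(bridge) != hash(fake_np)                # ... but hashes differ
assert fake_np != std                               # the gap being bridged
```

As long as the two outer types are unequal and hash differently, whichever hash the bridge type picks breaks the invariant against the other side.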

harahu · 2021-11-19T15:57:33Z

@jorisvandenbossche

I understand that it is a desirable property to have both pd_0_dt == stdlib_0_dt and pd_0_dt == np_0_dt. But serving as a "bridge" across the np_0_dt != stdlib_0_dt gap puts pandas in a dangerous position. I don't know how much thought the numpy devs have put into the choice of being unequal to datetime.timedelta. But if it is a conscious choice, then the smart thing to do would probably be for pandas to decide which side it wants to land on. The other sensible alternative, I guess, is to lobby for the datetime and numpy hashes to be aligned.

Having pd.Timedelta not be considered hashable (in the strict sense) seems like the worst option, to me at least, and if it remains true for a prolonged period, it should probably be well documented as a potential pitfall for users. I'll give a (very simplified) example of how this bit me:

import pandas as pd

s = pd.Series([pd.Timedelta(value=i, unit="D") for i in range(5)])

uniques = s.unique()
reprs = dict(zip(uniques, (repr(u) for u in uniques)))
reprs_keys = list(reprs.keys())
for td in s:
    assert td in reprs_keys
    # Assuming all is good and I have no worries at this point
    
    # Say hello to unexpected KeyError: Timedelta('0 days 00:00:00')
    print(reprs[td])

Although this example might seem contrived, that is only the case because it is, as mentioned, simplified. When objects get processed in complex library code, sometimes out of your control, it is just a question of time before the lack of hash invariance leads to subtle or not-so-subtle errors. Looking at the example above, the only reasonable outcomes are one of:

Process finished with exit code 0
AssertionError
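The reason the example above passes the `assert` and then fails on lookup is that list membership relies on `==` alone, while dict lookup compares hashes before it ever calls `==`. A minimal sketch of that mechanism follows. The `Key` class is a hypothetical type invented for illustration, which compares equal to ints but deliberately hashes differently.

```python
class Key:
    """Hypothetical type: equal to ints, but with a different hash."""

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        if isinstance(other, Key):
            return self.value == other.value
        if isinstance(other, int):
            return self.value == other
        return NotImplemented

    def __hash__(self):
        return hash(self.value) + 1  # deliberately breaks the invariant


table = {0: "zero"}
probe = Key(0)

in_list = probe in list(table.keys())  # membership by == only
in_dict = probe in table               # hash is compared first, then ==

assert in_list
assert not in_dict
```

The same split between equality-based and hash-based containers is what turns a passing membership check into a KeyError.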

jbrockmendel · 2021-11-22T18:07:59Z

Would the particular example with s.unique be solved by #42741?

Some observations:

we actually match the timedelta64 hash for Timedeltas with nonzero nanoseconds
np.timedelta64.__hash__ is faster than Timedelta.__hash__ (about 58 ns vs 127 ns)
for Timestamp we also match the stdlib (unless we have nanos)

harahu · 2021-11-22T21:24:49Z

> Would the particular example with s.unique be solved by #42741?

Seems that way. If I understand it correctly, one would then have the members of s.unique() be of type pd.Timedelta, which should be ok.

jbrockmendel · 2021-12-28T01:38:58Z

xref numpy/numpy#3836

harahu · 2021-12-29T14:45:06Z

> xref numpy/numpy#3836

Good find!

harahu added the Bug and Needs Triage labels Nov 17, 2021

mroeschke added the Needs Discussion and Timedelta labels and removed the Needs Triage label Nov 21, 2021

jbrockmendel mentioned this issue Apr 10, 2022: ENH: implement non-nano Timedelta scalar #46688 (Merged)

jbrockmendel mentioned this issue Oct 1, 2022: BUG: DataFrame from dict with non-nano Timedelta #48901 (Merged)

jbrockmendel mentioned this issue Jan 16, 2023: API/DES: Non-Nanosecond Tracker #46587 (Open)

harahu mentioned this issue Jul 10, 2023: TST: Add test for Timedelta hash invariance #54035 (Merged)
