Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: pd.Timedelta breaks hash invariant #44504

Open
3 tasks done
Tracked by #46587
harahu opened this issue Nov 17, 2021 · 6 comments
Open
3 tasks done
Tracked by #46587

BUG: pd.Timedelta breaks hash invariant #44504

harahu opened this issue Nov 17, 2021 · 6 comments
Labels
Bug Needs Discussion Requires discussion from core team before further action Timedelta Timedelta data type

Comments

@harahu
Copy link
Contributor

harahu commented Nov 17, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the master branch of pandas.

Reproducible Example

import pandas as pd
import numpy as np

pd_td = pd.Timedelta(0)
np_td = np.timedelta64(0)

# One of these should be true
assert (pd_td == np_td and hash(pd_td) == hash(np_td)) or pd_td != np_td

Issue Description

Ref: https://bugs.python.org/issue45832

I was prompted by @rhettinger to bring this up with you.

pd.Timedelta and np.timedelta64 are in violation of the python hash invariant.

Specifically:

Hashable objects which compare equal must have the same hash value.

Expected Behavior

# Should pass
assert (pd_td == np_td and hash(pd_td) == hash(np_td)) or pd_td != np_td

Installed Versions

INSTALLED VERSIONS

commit : 945c9ed
python : 3.10.0.final.0
python-bits : 64
OS : Darwin
OS-release : 21.1.0
Version : Darwin Kernel Version 21.1.0: Wed Oct 13 17:33:23 PDT 2021; root:xnu-8019.41.5~1/RELEASE_X86_64
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : None
LOCALE : en_US.UTF-8

pandas : 1.3.4
numpy : 1.21.1
pytz : 2021.3
dateutil : 2.8.2
pip : 21.3.1
setuptools : 58.3.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 3.0.3
IPython : 7.29.0
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 6.0.0
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None

@harahu harahu added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 17, 2021
@jorisvandenbossche
Copy link
Member

@harahu Thanks for the report.

The problem is that pd.Timedelta can also evaluate equal with the standard library datetime.timedelta. So adding one more to your example:

pd_0_dt = pd.Timedelta(0)
np_0_dt = np.timedelta64(0)
stdlib_0_dt= datetime.timedelta(0)

we get:

In [30]: pd_0_dt == stdlib_0_dt
Out[30]: True

In [31]: np_0_dt == stdlib_0_dt
Out[31]: False

In [32]: hash(stdlib_0_dt)
Out[32]: 3010437511937009226

In [33]: hash(pd_0_dt)
Out[33]: 3010437511937009226

In [34]: hash(np_0_dt)
Out[34]: 0

So the hash invariance holds between pd.Timedelta and datetime.timedelta. But since the numpy timedelta doesn't evaluate equal with datetime.timedelta (and has a different hash), it's impossible in the current situation to comply with hash invariance in both cases at the same time.

(in fact in this case the __hash__ being used for pd.Timedelta is the one inherited from datetime.timedelta)

@harahu
Copy link
Contributor Author

harahu commented Nov 19, 2021

@jorisvandenbossche

I understand that is is a desirable property to have both of pd_0_dt == stdlib_0_dt and pd_0_dt == np_0_dt. But serving as a "bridge", crossing the np_0_dt != stdlib_0_dt gap puts pandas in a dangerous position. I don't know how much thought the numpy devs have put into the choice of being unequal to the datetime.timedelta. But if if is a conscious choice, then the smart thing to do would probably be for pandas to make a decision as to which side it wants to land on. The other sensible alternative, I guess, is to try to lobby for the datetime and numpy hashes to be aligned.

Having pd.Timedelta not being considered hashable (in the strict sense), seems like the worst option to me at least, and if it remains true for a prolonged period, it should probably be well documented as a potential pitfall for users. I'll give a (very simplified) example of how this bit me:

import pandas as pd

s = pd.Series([pd.Timedelta(value=i, unit="D") for i in range(5)])

uniques = s.unique()
reprs = dict(zip(uniques, (repr(u) for u in uniques)))
reprs_keys = list(reprs.keys())
for td in s:
    assert td in reprs_keys
    # Assuming all is good and I have no worries at this point
    
    # Say hello to unexpected KeyError: Timedelta('0 days 00:00:00')
    print(reprs[td])

Although this example might seem contrived, that is only the case because it is, as mentioned, simplified. When objects get processed in complex library code, sometimes out of your control, it is just a question of time before the lack of hash invariance leads to subtle or not-so-subtle errors. Looking at the example above, the only reasonable outcomes are one of:

  • Process finished with exit code 0
  • AssertionError

@mroeschke mroeschke added Needs Discussion Requires discussion from core team before further action Timedelta Timedelta data type and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 21, 2021
@jbrockmendel
Copy link
Member

Would the particular example with s.unique be solved by #42741?

Some observations:

  • we actually match the timedelta64 hash for Timedeltas with nonzero nanoseconds
  • np.timedelta64.__hash__ is faster than Timedelta.__hash__ (about 58ns vs 127 ns)
  • for Timestamp we also match the stdlib (unless we have nanos)

@harahu
Copy link
Contributor Author

harahu commented Nov 22, 2021

Would the particular example with s.unique be solved by #42741?

Seems that way. If I understand it correctly, one would then have the members of s.unique() be of type pd.Timedelta, which should be ok.

@jbrockmendel
Copy link
Member

xref numpy/numpy#3836

@harahu
Copy link
Contributor Author

harahu commented Dec 29, 2021

xref numpy/numpy#3836

Good find!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Needs Discussion Requires discussion from core team before further action Timedelta Timedelta data type
Projects
None yet
Development

No branches or pull requests

4 participants