Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Index.get_indexer with mixed-reso datetime64s #50690

Open
Tracked by #46587
jbrockmendel opened this issue Jan 12, 2023 · 9 comments
Open
Tracked by #46587

BUG: Index.get_indexer with mixed-reso datetime64s #50690

jbrockmendel opened this issue Jan 12, 2023 · 9 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Non-Nano datetime64/timedelta64 with non-nanosecond resolution Upstream issue Issue related to pandas dependency

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Jan 12, 2023

import numpy as np
import pandas as pd

ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")

left = pd.Index([ms], dtype=object)
right = pd.Index([us], dtype=object)

assert left[0] == right[0]
assert (left == right).all()

>>> left[0] in right  # <- wrong
False
>>> right[0] in left  # <- wrong
False

>>> left.get_loc(right[0])  # <- raises, incorrectly
>>> right.get_loc(left[0])  # <- raises, incorrectly

>>> left.get_indexer(right)  # works correctly AFAICT bc it doesnt use hashtable

# But in a non-monotonic case...
sec = np.datetime64("9999-01-01", "s")
day = np.datetime64("2016-01-01", "D")
left2 = pd.Index([ms, sec, day], dtype=object)

>>> left2[:1].get_indexer(right)
array([0])
>>> left2.get_indexer(right)  # <- wrong
array([-1])

IIUC the issue is in the hashing of the datetime64 objects, which do not follow the invariance x == y \Rightarrow hash(x) == hash(y) (xref numpy/numpy#3836)

When implementing non-nanosecond support for Timestamp/Timedelta, we implemented __hash__ to retain this invariance (at the cost of performance).

Unless numpy changes its behavior, I think to fix this we need to patch how we treat datetime64 objects in our khash code, likely mirroring Timestamp.__hash__. cc @realead thoughts?

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jan 12, 2023
@realead
Copy link
Contributor

realead commented Jan 17, 2023

The correct place is to fix the issue is the __hash__ function of the np.datetime64. Treating datetime64 would have cost for all Python-objects, even if no datetime64-objects are used. Costs for Float and Complex aren't very high because PyFloat_CheckExact and PyComplex_CheckExact are very fast. Not sure there is something similarly fast for np.datetime64.

@jbrockmendel
Copy link
Member Author

The correct place is to fix the issue is the hash function of the np.datetime64

Agreed. Unfortunately numpy/numpy#3836 doesn't seem to be a high priority for numpy.

Not sure there is something similarly fast for np.datetime64

PyObject_TypeCheck(obj, &PyDatetimeArrType_Type). Since we can make this check performant, the cost in dt64-free cases shouldn't be too bad right?

@realead
Copy link
Contributor

realead commented Jan 19, 2023

@jbrockmendel you probably mean PyDatetimeScalarObject (https://github.com/numpy/numpy/blob/548bc6826b597ab79b9c1451b79ec8d23db9d444/numpy/core/include/numpy/arrayscalars.h#L119-L123)?

Really, it doesn't look too bad for performance (I wrongly assumed we would need to go via isinstance)...

@jbrockmendel
Copy link
Member Author

you probably mean PyDatetimeScalarObject

I wish. PyObject_TypeCheck(obj, &PyDatetimeArrType_Type) is how we check for np.datetime64 objects in our cython code (grep is_datetime64_object). Maybe we can do something in C that we can't in cython?

@realead
Copy link
Contributor

realead commented Jan 20, 2023

@jbrockmendel Oh, you are correct.

I guess we could do something for "sane" cases. np.datatime64 is quite a strange class, see for example:

import numpy as np
x=np.datetime64(2_000_000_000_000, "Y")
y=np.datetime64(7_773_671_778_871_345_152, "s")
x,y, x==y
# (numpy.datetime64('2000000001970'), numpy.datetime64('246337854208-06-08T02:59:12'), True)

As you can see the objects represent quite different time, but are equal (probably due to an overflow in the compare-operator)... to ensure, that the hash in this case is the same would be quite a challenge.

A simply idea would to transform all times in attosecond ('as'), which will be in range [ -2^63*31536000*10^18 ... (2^63-1)*31536000*10^18]) and then to calculate the hash-value.

@jbrockmendel
Copy link
Member Author

I'd want to do roughly the same thing we did in Timestamp.__hash__ when implementing non-nano:

def hash_dt64_obj(obj):
    if can_be_cast_losslessly_to_pydatetime(obj):
        pydt = as_pydatetime(obj)
        return hash(pydt)

    if obj_reso == attoseconds:
        return obj.view("i8")

    try:
        obj2 = astype_overflowsafe(obj, one_reso_down)
    except OverflowError:
        return obj.view("i8")
    else:
        return hash(obj2)

astype_overflowsafe is implemented in cython, might need to move it to C to make it accessible for the hashing code

@jbrockmendel
Copy link
Member Author

@WillAyd im trying to implement this in khash_python.h and struggling to make #include <../../tslibs/src/datetime/np_datetime.h> work the way I expect (to get pandas_datetime_to_datetimestruct). At import time I get ImportError: dlopen(/Users/brock/Desktop/pd/bug-hash-dt64/pandas/_libs/hashtable.cpython-310-darwin.so, 0x0002): symbol not found in flat namespace '_pandas_datetime_to_datetimestruct'. Main thing I've tried is adding tseries_depends to the _libs.hashtable and _libs.interval entries in setup.py.

Thoughts?

@WillAyd
Copy link
Member

WillAyd commented Jan 24, 2023

Given your issue is with dlopen its likely not an issue with any of your header includes (and similarly the depends argument) but rather with the arguments to the linker. I'm not entirely clear on how setuptools handles things but you might have luck trying to add the sources argument of "pandas/_libs/tslibs/src/datetime/np_datetime.c" that you see on some other extensions.

If that doesn't work feel free to push up a PR and can take a look. Also might help if you post the gcc command that setuptools is generating.

@jbrockmendel
Copy link
Member Author

add the sources argument

bingo, thanks.

@jbrockmendel jbrockmendel added Indexing Related to indexing on series/frames, not to indexes themselves Upstream issue Issue related to pandas dependency Non-Nano datetime64/timedelta64 with non-nanosecond resolution and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Nov 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Non-Nano datetime64/timedelta64 with non-nanosecond resolution Upstream issue Issue related to pandas dependency
Projects
None yet
Development

No branches or pull requests

3 participants