BUG: hashing datetime64 objects #50960

Closed · wants to merge 64 commits

Commits (64):
7761ecd  BUG: hashing datetime64 objects (jbrockmendel, Jan 24, 2023)
610b0c6  handle cases out of pydatetime bounds (jbrockmendel, Jan 24, 2023)
0e140ae  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Jan 24, 2023)
919383c  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Jan 24, 2023)
ae8c0bb  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Jan 25, 2023)
92a39eb  troubleshoot CI builds (jbrockmendel, Jan 25, 2023)
2f67805  troubleshoot CI builds (jbrockmendel, Jan 25, 2023)
0635f86  troubleshoot CI builds (jbrockmendel, Jan 25, 2023)
229ab72  troubleshoot CI builds (jbrockmendel, Jan 25, 2023)
6e96805  troubleshoot CI builds (jbrockmendel, Jan 25, 2023)
24fda82  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Jan 26, 2023)
058b666  suggested edits (jbrockmendel, Jan 27, 2023)
7398991  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Jan 31, 2023)
3fdf564  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 8, 2023)
6e4836e  use sebergs suggestion (jbrockmendel, Feb 8, 2023)
f55337a  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 9, 2023)
a97dfc9  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 9, 2023)
1338ca2  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 9, 2023)
9fb1987  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 10, 2023)
818682c  suggested edits (jbrockmendel, Feb 10, 2023)
74ab540  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 10, 2023)
037ba05  remove unnecessary casts (jbrockmendel, Feb 11, 2023)
7a4b1ab  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 11, 2023)
47e5247  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 11, 2023)
dcd09dd  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 11, 2023)
32d479b  Merge branch 'bug-hash-dt64' of github.com:jbrockmendel/pandas into b… (jbrockmendel, Feb 11, 2023)
d47cfd8  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 12, 2023)
c091317  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 12, 2023)
1ce791e  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 13, 2023)
704fb69  suggested edit for PyDateTime_IMPORT (jbrockmendel, Feb 13, 2023)
6d962b0  Merge branch 'bug-hash-dt64' of github.com:jbrockmendel/pandas into b… (jbrockmendel, Feb 13, 2023)
f838953  revert delay (jbrockmendel, Feb 13, 2023)
0d8500a  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 13, 2023)
58a29b6  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 14, 2023)
95069e0  restore check (jbrockmendel, Feb 14, 2023)
7362f3e  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 14, 2023)
dd08670  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 15, 2023)
998a4cc  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 15, 2023)
b75730b  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 17, 2023)
5c57a5e  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 20, 2023)
6b4460f  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 21, 2023)
c94609b  add test (jbrockmendel, Feb 22, 2023)
afe9493  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Feb 26, 2023)
4fecc97  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Mar 3, 2023)
b4cc41e  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Mar 8, 2023)
c620339  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Mar 9, 2023)
3633653  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Mar 11, 2023)
c55f182  shot in the dark (jbrockmendel, Mar 11, 2023)
6e2bbf0  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Mar 12, 2023)
23c2826  capsule stuff (jbrockmendel, Mar 12, 2023)
143b3a3  guessing (jbrockmendel, Mar 13, 2023)
ffb8365  still tryin (jbrockmendel, Mar 13, 2023)
1fdfd64  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Mar 13, 2023)
5513721  macro (jbrockmendel, Mar 13, 2023)
875d6af  revert sources (jbrockmendel, Mar 13, 2023)
a29a56a  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Mar 13, 2023)
40e6e17  Move np_datetime64_object_hash to np_datetime.c (jbrockmendel, Mar 13, 2023)
15a701c  import_pandas_datetime more (jbrockmendel, Mar 13, 2023)
af25f40  troubleshoot (jbrockmendel, Mar 13, 2023)
9d5cb46  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Mar 14, 2023)
d5a031d  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Mar 14, 2023)
bd7d432  post-merge merges (jbrockmendel, Mar 14, 2023)
394d86e  frickin guess (jbrockmendel, Mar 14, 2023)
1766bc3  Merge branch 'main' into bug-hash-dt64 (jbrockmendel, Mar 20, 2023)
41 changes: 41 additions & 0 deletions pandas/_libs/src/klib/khash_python.h
@@ -7,7 +7,10 @@
typedef npy_complex64 khcomplex64_t;
typedef npy_complex128 khcomplex128_t;

// get pandas_datetime_to_datetimestruct
#include <../../tslibs/src/datetime/np_datetime.h>

#include "datetime.h"

// khash should report usage to tracemalloc
#if PY_VERSION_HEX >= 0x03060000
@@ -330,6 +333,40 @@ Py_hash_t PANDAS_INLINE tupleobject_hash(PyTupleObject* key) {
}


// TODO: same thing for timedelta64 objects
Py_hash_t PANDAS_INLINE np_datetime64_object_hash(PyObject* key) {
// GH#50690 numpy's hash implementation does not preserve comparability
// either across resolutions or with standard library objects.
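The comparability problem this comment describes can be reproduced from Python. A minimal sketch (the hash-mismatch behavior is numpy's pre-fix scalar hashing, which feeds the raw int64 payload into the hash, so it may differ on other numpy versions):

```python
import numpy as np

# Two representations of the same instant at different resolutions.
ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")

# They compare equal, so Python's invariant (a == b implies
# hash(a) == hash(b)) requires their hashes to agree...
assert ms == us

# ...but numpy's scalar hash is derived from the underlying int64
# payload (1 vs 1000), so the hashes can disagree, breaking dict/set
# and object-Index lookups that mix resolutions (GH#50690).
print(ms.astype(np.int64), us.astype(np.int64))
```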
NPY_DATETIMEUNIT unit = (NPY_DATETIMEUNIT)((PyDatetimeScalarObject*)key)->obmeta.base;
Member: I think we should tighten up the error handling instead of just doing a blind cast. From a quick glance in the numpy source I think they have functions like PyArray_IsScalar(key, Datetime) and PyArray_IsScalar(key, Timedelta).

Member Author: This is copy-pasted from our get_datetime64_unit and get_datetime64_value. I haven't found a way to do a C/Cython declaration for a np.datetime64 object other than PyObject*/object. Open to suggestions.
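For reference, the unit that get_datetime64_unit reads from the scalar's obmeta.base is also exposed at the Python level through the scalar's dtype. A quick illustration:

```python
import numpy as np

val = np.datetime64(1000, "us")

# np.datetime_data returns the (unit, count) pair stored in the dtype;
# the C helper reads the same information from obmeta.base on the scalar.
unit, count = np.datetime_data(val.dtype)
assert (unit, count) == ("us", 1)
```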

Member: Yea, there's nothing wrong with the PyObject* type, but we are missing the runtime type introspection here and where you copied it from. Does the suggestion above not compile?

Member Author: So start out with something like this?

    if (!PyArray_IsScalar(key, Datetime)) {
        PyErr_Format(PyExc_ValueError, "Expected np.datetime64 object");
        return NULL;
    }

Member: Yea, that's definitely headed in the right direction.

Contributor: You return NULL here, but -1 would work if you want to have runtime checking here (which I consider up to you; you could add an assert if you don't want it).

Member: Yea, we still want to make sure we have a runtime check; internally our error handling is a mess with extensions, so refactoring is a huge effort. Hoping to have a more consistent standard going forward.

Member: The PyDatetimeScalarObject cast here shouldn't be needed any more since you refactored the function signature. Does numpy provide a macro to access the unit? It would be even better if we could avoid any casting on our end.

Member: On review of some numpy internals I don't think you even need a macro. You can blow away the casts altogether; if that doesn't work let me know and I can take a closer look:

https://github.com/numpy/numpy/blob/6dadb8c40451e934075904f6acdfe341d3b8762e/numpy/core/src/multiarray/scalartypes.c.src#L545

npy_datetime value = ((PyDatetimeScalarObject*)key)->obval;
Member: Can remove this cast too.

npy_datetimestruct dts;
PyObject* dt;

pandas_datetime_to_datetimestruct(value, unit, &dts);

if ((dts.year > 0) && (dts.year <= 9999) && (dts.ps == 0) && (dts.as == 0)) {
// we CAN cast to pydatetime, so use that hash to ensure we compare
// as matching standard library datetimes (and pd.Timestamps)
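The (year, ps, as) guard above mirrors the range the stdlib datetime type can represent: years 1 through 9999 with microsecond precision. A small sanity check of those bounds (illustrative values only):

```python
from datetime import MAXYEAR, MINYEAR, datetime

import numpy as np

# datetime.datetime covers years 1..9999 with microsecond precision,
# which is exactly what the guard in the C code tests before casting.
assert (MINYEAR, MAXYEAR) == (1, 9999)

# A microsecond-resolution value inside that range round-trips cleanly
# to a stdlib datetime via numpy's own conversion.
d = np.datetime64("2016-01-01T12:30:00.000001", "us")
assert d.astype(datetime) == datetime(2016, 1, 1, 12, 30, 0, 1)
```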
PyDateTime_IMPORT;
Member: Is there a way to move this out of the function body? Maybe it can go into the global hashtable code instead?

Member Author: That causes a complaint at build time. There is another use of PyDateTime_IMPORT in the json code that is also at runtime.

Contributor: That does not seem clean, to be honest. You are effectively doing a lookup of datetime.datetime_CAPI and unpacking the capsule.

This being in a header makes things a bit tricky maybe. The import can be on the top level of the cython file, but not of a C-file and thus even less of a header. If this is only used in Cython, then you could just put it there (if it gets used without the import you get a segfault, which should be relatively clean to read in gdb/lldb, but of course not super easy, maybe a comment on the PyDateTime_FromDateAndTime so someone who finds a crash there knows why?).

If this was in a C file, you would need to add a datetime_init() function and make sure it gets called on import.

Long story short: This is used once, so probably move PyDateTime_IMPORT to cython top-level.

Member: So does this not work to move to the global hashtable code in the Cython file? Also agree with @seberg; would like to avoid using the macro here.

Member Author: I'll defer to you on this, but I'm uncomfortable with having this C file carry an implicit dependency on the higher-level Cython file.

Member Author: Note there's also a comment in np_datetime.c saying it would be helpful to get PyDateTime_IMPORT to work in that file.

Contributor: Yes, the implicit dependency is not nice, but I think that is just something to live with. You need some initialization, which needs to be a function call, and that cannot happen in a header file. I would just say: if you forget, you get a clean segfault pointing at the first use of datetime code in gdb/lldb, and we put a comment there explaining things.

Member: Just some (hopefully helpful) context: the header file never gets "executed" in any sense, so putting the macro in a header file doesn't by itself run the code you've inserted.

The entry point for C extensions is a module initialization function like you see in the Python documentation - this gets executed when the Python interpreter loads the extension:

https://docs.python.org/3/extending/extending.html#the-module-s-method-table-and-initialization-function

I am assuming that the global namespace of the Cython file is akin to the module initialization; if that is true then it would be natural for the import to go there rather than the header.

With that said the workaround suggested here is fine, but I don't think the dependency problem you were worried about is too big of a deal

Contributor (@seberg, Feb 13, 2023):

Suggested change:
-    PyDateTime_IMPORT;
+    if (PyDateTimeAPI == NULL) {
+        /* delayed import, may be nice to move to import time */
+        PyDateTime_IMPORT;
+        if (PyDateTimeAPI == NULL) {
+            return -1;
+        }
+    }

I think this is decent.

Contributor: Oops, added an error check, although PyDateTime_IMPORT probably can't fail here anyway (I am sure datetime has been imported somewhere by this point).

Member Author: Pushed a commit with this (but not the NULL check) and it failed on the CI, so reverted it for now.

Contributor: I had misspelled PyDateTimeAPI, with a lowercase t instead of a capital one.

dt = PyDateTime_FromDateAndTime(
dts.year, dts.month, dts.day, dts.hour, dts.min, dts.sec, dts.us
);
return PyObject_Hash(dt);
Contributor: Missing a Py_DECREF(dt) here; probably best to write it as:

hash = PyObject_Hash(dt);
Py_DECREF(dt);
return hash;

I would not add explicit error handling in this case, as we are at the end of the function: the error value we return from the function is -1, which is the value you would guard against.

Contributor: Actually, I now doubt that -1 is even reachable as a value here, since anything that might come out as -1 would be handled by the Python hasher. So if you want to be pedantic and find out whether this should be there, you would have to return -1 explicitly, I guess.
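The point about -1 being practically unreachable comes from CPython's hashing convention: tp_hash uses -1 as an in-band error sentinel, so the interpreter never lets an object's hash surface as -1. A quick demonstration:

```python
# CPython reserves -1 as the error return from hash functions, so no
# object's hash is ever -1; for ints, -1 is folded to -2 instead.
assert hash(-1) == -2
assert hash(-2) == -2  # ...which means -1 and -2 collide, by design.

# Even huge values are reduced (mod 2**61 - 1 on 64-bit builds),
# and the result is still never -1.
assert hash(2**64 - 1) != -1
```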

}

if (unit == NPY_FR_as) {
// nothing higher to cast to, so use value. Lower-resolution
// cases are responsible for matching this.
return value;
}

// TODO: see if we can cast to the next-highest unit without overflow.
// If so, return the hash of _that_ reso. Otherwise, return value.
// See also Timestamp.__hash__
return value;
}
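The NPY_FR_as branch above cannot defer to the pydatetime hash, since nothing in the stdlib represents sub-microsecond ticks; the raw int64 payload becomes the hash input instead. An illustration of why that branch exists:

```python
import numpy as np

fine = np.datetime64(1, "as")  # one attosecond past the epoch

# The unit is finer than anything datetime.datetime can express, so
# pandas_datetime_to_datetimestruct would report nonzero ps/as fields,
# the pydatetime branch is skipped, and the raw value is used.
assert np.datetime_data(fine.dtype)[0] == "as"
assert fine.astype(np.int64) == 1
```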


khuint32_t PANDAS_INLINE kh_python_hash_func(PyObject* key) {
Py_hash_t hash;
// For PyObject_Hash holds:
@@ -351,6 +388,10 @@ khuint32_t PANDAS_INLINE kh_python_hash_func(PyObject* key) {
else if (PyTuple_CheckExact(key)) {
hash = tupleobject_hash((PyTupleObject*)key);
}
else if (PyObject_TypeCheck(key, &PyDatetimeArrType_Type)) {
// GH#50690
hash = np_datetime64_object_hash(key);
}
else {
hash = PyObject_Hash(key);
}
26 changes: 26 additions & 0 deletions pandas/tests/indexes/object/test_indexing.py
@@ -187,3 +187,29 @@ def test_slice_locs_dup(self):
assert index2.slice_locs(end="a") == (0, 6)
assert index2.slice_locs("d", "b") == (0, 4)
assert index2.slice_locs("c", "a") == (2, 6)


def test_np_datetime64_objects():
# GH#50690
ms = np.datetime64(1, "ms")
us = np.datetime64(1000, "us")

left = Index([ms], dtype=object)
right = Index([us], dtype=object)

assert left[0] in right
assert right[0] in left

assert left.get_loc(right[0]) == 0
assert right.get_loc(left[0]) == 0

# non-monotonic cases go through different paths in cython code
sec = np.datetime64("9999-01-01", "s")
day = np.datetime64("2016-01-01", "D")
left2 = Index([ms, sec, day], dtype=object)

expected = np.array([0], dtype=np.intp)
res = left2[:1].get_indexer(right)
tm.assert_numpy_array_equal(res, expected)
res = left2.get_indexer(right)
tm.assert_numpy_array_equal(res, expected)
2 changes: 2 additions & 0 deletions setup.py
@@ -453,7 +453,9 @@ def srcpath(name=None, suffix=".pyx", subdir="src"):
"depends": (
["pandas/_libs/src/klib/khash_python.h", "pandas/_libs/src/klib/khash.h"]
+ _pxi_dep["hashtable"]
+ tseries_depends
),
"sources": ["pandas/_libs/tslibs/src/datetime/np_datetime.c"],
},
"_libs.index": {
"pyxfile": "_libs/index",