@matiaslindgren
Contributor

matiaslindgren commented Oct 28, 2025

This is a simple workaround. For a proper solution, khash_python.h would require some refactoring to handle exceptions gracefully. It's a bit tricky, though, because kh_python_hash_{equal,func} are exposed to the vendored khash implementation, which calls them in a loop.

khash_python.h was silently suppressing all exceptions thrown when calling custom __hash__ and __eq__ methods. This PR implements a new layer for pymap that catches all exceptions thrown during khash computation and raises them properly.
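
For comparison, Python's built-in `dict` propagates exceptions raised in a key's custom `__hash__` or `__eq__`, which is the behaviour callers would reasonably expect from a hash table. A minimal sketch, assuming a hypothetical `FailingKey` class and `try_insert` helper that are not part of pandas:

```python
class FailingKey:
    """Hypothetical key whose __hash__ or __eq__ can be made to raise."""

    def __init__(self, value, fail_hash=False, fail_eq=False):
        self.value = value
        self.fail_hash = fail_hash
        self.fail_eq = fail_eq

    def __hash__(self):
        if self.fail_hash:
            raise RuntimeError("hash failed")
        return hash(self.value)

    def __eq__(self, other):
        # Raise if either side of the comparison is flagged as failing.
        if self.fail_eq or getattr(other, "fail_eq", False):
            raise RuntimeError("eq failed")
        return self.value == other.value


def try_insert(table, key):
    """Insert key into table; return the exception type name, or None."""
    try:
        table[key] = True
    except Exception as exc:
        return type(exc).__name__
    return None
```

With a plain `dict`, a failing `__hash__` surfaces immediately, and a failing `__eq__` surfaces as soon as the new key collides with an existing key of equal hash.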

^^^^^
- Bug in :class:`DataFrame` when passing a ``dict`` with a NA scalar and ``columns`` that would always return ``np.nan`` (:issue:`57205`)
- Bug in :class:`Series` ignoring errors when trying to convert :class:`Series` input data to the given ``dtype`` (:issue:`60728`)
- Bug in :class:`PyObjectHashTable` that would silently suppress exceptions thrown from custom ``__hash__`` and ``__eq__`` methods during hashing (:issue:`57052`)
Member

Are you able to add a test that uses a public API that would be fixed by your changes?

Contributor Author

done

@jbrockmendel
Member

doing this inside the khash code is definitely more difficult, but probably the Right Way To Do It. Does this entail a perf hit?

BTW #62888 is probably going to have to entail digging into that same bit of khash code.

@matiaslindgren
Contributor Author

I tried adapting the suggestion from #57052 (comment) (pandas._libs.parsers.raise_parser_error) but there are quite a few failing tests.

"SystemError: ... returned a result with an exception set" is the catch-all error the interpreter raises when a C-level function returns a result while an exception is still set, i.e. the exception was left unhandled in the C API layer. So it seems there are some code paths outside PyObjectHashTable that call kh_python_hash_{equal,func}. I need to do some more digging here.

> doing this inside the khash code is definitely more difficult, but probably the Right Way To Do It. Does this entail a perf hit?

I'll set up a tiny benchmark for PyObjectHashTable to compare these changes with main.

matiaslindgren changed the title from "BUG: try triggering exceptions from custom methods in Cython before entering the khash loop" to "BUG: Catch all exceptions raised while calling PyObjectHashTable methods" on Oct 31, 2025
@matiaslindgren
Contributor Author

I implemented a new layer called pymap_checked in pandas/_libs/khash for the PyObject hash table. It will catch every exception thrown during khash computation for PyObjects.
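
The idea of a checked layer can be illustrated in pure Python. This is only a sketch of the pattern, not the code in this PR: `UncheckedTable` is a hypothetical stand-in for a C-level table whose callbacks cannot propagate exceptions (it records the error and carries on with a dummy hash, loosely mirroring how a pending exception stays set in the interpreter), and `CheckedTable` is a hypothetical wrapper that inspects the recorded error after every call, similar in spirit to a `PyErr_Occurred()` check.

```python
class UncheckedTable:
    """Toy stand-in for a C hash table whose callbacks cannot raise."""

    def __init__(self):
        self._buckets = {}
        # Plays the role of the interpreter's pending-error indicator.
        self.pending_error = None

    def _safe_hash(self, key):
        try:
            return hash(key)
        except Exception as exc:
            # Record the error and continue with a dummy hash value.
            self.pending_error = exc
            return 0

    def set_item(self, key, value):
        bucket = self._buckets.setdefault(self._safe_hash(key), [])
        bucket.append((key, value))


class CheckedTable:
    """Wrapper that re-raises any error recorded during the inner call."""

    def __init__(self):
        self._inner = UncheckedTable()

    def set_item(self, key, value):
        self._inner.set_item(key, value)
        self._raise_pending()

    def _raise_pending(self):
        err = self._inner.pending_error
        if err is not None:
            # Clear the indicator before raising so later calls start clean.
            self._inner.pending_error = None
            raise err
```

Calling `UncheckedTable().set_item({}, 1)` returns normally and leaves the `TypeError` parked in `pending_error`, whereas `CheckedTable().set_item({}, 1)` raises it to the caller.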

The next problem is fixing the dozens of exceptions that were previously silently suppressed. Most of them seem to be either TypeError: boolean value of NA is ambiguous or TypeError: unhashable type: 'dict' but there are a few others too.
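
Both error types are easy to reproduce without pandas. In the sketch below, `AmbiguousNA` is a hypothetical stand-in for a missing-value scalar (an assumption for illustration, not pandas' actual NA implementation): comparisons return the sentinel itself and truth-testing it raises, so any hash table that has to decide whether two keys are equal runs into a `TypeError`.

```python
class AmbiguousNA:
    """Hypothetical NA-like scalar: equality yields NA, which has no truth value."""

    def __hash__(self):
        return 0

    def __eq__(self, other):
        # The comparison propagates "missing" instead of returning True/False.
        return self

    def __bool__(self):
        raise TypeError("boolean value of NA is ambiguous")


def error_message(func):
    """Call func and return the TypeError message, or None if nothing was raised."""
    try:
        func()
    except TypeError as exc:
        return str(exc)
    return None


def lookup_second_na():
    # Two distinct instances share a hash, so the lookup must compare them.
    table = {AmbiguousNA(): 1}
    return table[AmbiguousNA()]


def hash_a_dict():
    # dict defines no usable __hash__, so hashing it raises TypeError.
    return hash({})
```

Here `error_message(lookup_second_na)` reports the ambiguous-boolean error from the equality check, and `error_message(hash_a_dict)` reports the unhashable-type error from the hash call.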

@matiaslindgren
Contributor Author

> doing this inside the khash code is definitely more difficult, but probably the Right Way To Do It. Does this entail a perf hit?
>
> BTW #62888 is probably going to have to entail digging into that same bit of khash code.

FYI @jbrockmendel this small benchmark I did for PyObjectHashTable.{set,get}_item suggests the if PyErr_Occurred() check does not affect performance, even when it runs on every kh_*_pymap call.

setup

from pandas._libs import hashtable as ht
from random import shuffle


class testkey:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        return self.value == other.value


def test_pymap_set_get(indexes: list[int]):
    table = ht.PyObjectHashTable()

    keys = [testkey(f"key{i}") for i in indexes]

    shuffle(indexes)
    for i in indexes:
        table.set_item(keys[i], i)

    shuffle(indexes)
    for i in indexes:
        assert table.get_item(keys[i]) == i


def test_pymap_set_get_no_shuffle(indexes: list[int]):
    table = ht.PyObjectHashTable()

    keys = [testkey(f"key{i}") for i in indexes]

    for i in indexes:
        table.set_item(keys[i], i)

    for i in indexes:
        assert table.get_item(keys[i]) == i

main branch (d597079)

with shuffle

In [1]: %timeit test_pymap_set_get(list(range(100)))
55.9 μs ± 390 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [2]: %timeit test_pymap_set_get(list(range(1000)))
606 μs ± 847 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit test_pymap_set_get(list(range(10000)))
6.49 ms ± 18.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit test_pymap_set_get(list(range(100000)))
81.1 ms ± 329 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

without shuffle

In [1]: %timeit test_pymap_set_get_no_shuffle(list(range(100)))
36.9 μs ± 401 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [2]: %timeit test_pymap_set_get_no_shuffle(list(range(1000)))
372 μs ± 2.38 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit test_pymap_set_get_no_shuffle(list(range(10000)))
4.1 ms ± 24 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit test_pymap_set_get_no_shuffle(list(range(100000)))
46.6 ms ± 248 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

this PR (0a4cba8)

with shuffle

In [1]: %timeit test_pymap_set_get(list(range(100)))
55.9 μs ± 151 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [2]: %timeit test_pymap_set_get(list(range(1000)))
604 μs ± 1.13 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit test_pymap_set_get(list(range(10000)))
6.51 ms ± 10.8 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit test_pymap_set_get(list(range(100000)))
79.6 ms ± 268 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

without shuffle

In [1]: %timeit test_pymap_set_get_no_shuffle(list(range(100)))
37.2 μs ± 106 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [2]: %timeit test_pymap_set_get_no_shuffle(list(range(1000)))
373 μs ± 926 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [3]: %timeit test_pymap_set_get_no_shuffle(list(range(10000)))
4.03 ms ± 8.16 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit test_pymap_set_get_no_shuffle(list(range(100000)))
45.3 ms ± 190 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
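
To make the comparison easier to read, the relative change of each mean can be computed directly from the figures above. This is a small sketch that only re-uses the reported means (converted to microseconds) and ignores the reported standard deviations; the names `TIMINGS_US`, `relative_change` and `CHANGES` are introduced here for illustration.

```python
# Mean timings copied from the %timeit output above, in microseconds.
# Keys are (variant, number_of_keys); values are (main, this_pr).
TIMINGS_US = {
    ("shuffle", 100): (55.9, 55.9),
    ("shuffle", 1_000): (606.0, 604.0),
    ("shuffle", 10_000): (6490.0, 6510.0),
    ("shuffle", 100_000): (81100.0, 79600.0),
    ("no shuffle", 100): (36.9, 37.2),
    ("no shuffle", 1_000): (372.0, 373.0),
    ("no shuffle", 10_000): (4100.0, 4030.0),
    ("no shuffle", 100_000): (46600.0, 45300.0),
}


def relative_change(main, pr):
    """Percentage change of the PR timing relative to main (negative = faster)."""
    return (pr - main) / main * 100.0


CHANGES = {case: relative_change(*pair) for case, pair in TIMINGS_US.items()}

for (variant, n), pct in CHANGES.items():
    print(f"{variant:>10} n={n:<7} {pct:+.2f}%")
```

On these means, every case differs by less than 3% in either direction, with no consistent slowdown on the PR branch.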

Development

Successfully merging this pull request may close these issues.

BUG: ht.PyObjectHashTable swallows exception
