hash_pandas_object on ExtensionArray-backed Series fails with TypeError #23066

TomAugspurger · 2018-10-09T22:16:37Z

In [1]: import pandas as pd

In [2]: pd.Series
Out[2]: pandas.core.series.Series

In [4]: s = pd.Series(pd.interval_range(0, periods=10))

In [5]: pd.util.hash_pandas_object(s)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-1b7247db4f16> in <module>
----> 1 pd.util.hash_pandas_object(s)

~/sandbox/pandas/pandas/core/util/hashing.py in hash_pandas_object(obj, index, encoding, hash_key, categorize)
     88     elif isinstance(obj, ABCSeries):
     89         h = hash_array(obj.values, encoding, hash_key,
---> 90                        categorize).astype('uint64', copy=False)
     91         if index:
     92             index_iter = (hash_pandas_object(obj.index,

~/sandbox/pandas/pandas/core/util/hashing.py in hash_array(vals, encoding, hash_key, categorize)
    269     # we'll be working with everything as 64-bit values, so handle this
    270     # 128-bit value early
--> 271     elif np.issubdtype(dtype, np.complex128):
    272         return hash_array(vals.real) + 23 * hash_array(vals.imag)
    273

~/Envs/pandas-dev/lib/python3.7/site-packages/numpy/core/numerictypes.py in issubdtype(arg1, arg2)
    712     """
    713     if not issubclass_(arg1, generic):
--> 714         arg1 = dtype(arg1).type
    715     if not issubclass_(arg2, generic):
    716         arg2_orig = arg2

TypeError: data type not understood

In [6]: s
Out[6]:
0     (0, 1]
1     (1, 2]
2     (2, 3]
3     (3, 4]
4     (4, 5]
5     (5, 6]
6     (6, 7]
7     (7, 8]
8     (8, 9]
9    (9, 10]
dtype: interval

Options

1. Convert to object before hashing.
2. Add some kind of _hash_values to the interface. But how do we prevent hash collisions between similar, but different, EAs? For example, the fastest hash for a PeriodArray would be to just hash the ordinals. But we wouldn't want the following two to hash identically (using my PeriodArray branch):

In [35]: pd.core.arrays.PeriodArray._from_ordinals([10, 20], freq='H')
Out[35]:
<pandas PeriodArray>
['1970-01-01 10:00', '1970-01-01 20:00']
Length: 2, dtype: period[H]

In [36]: pd.core.arrays.PeriodArray._from_ordinals([10, 20], freq='D')
Out[36]:
<pandas PeriodArray>
['1970-01-11', '1970-01-21']
Length: 2, dtype: period[D]

So we need to mix the dtype information in too.
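
To make the "mix the dtype information in" idea concrete, here is a minimal pure-Python sketch. It is not how pandas hashes anything: `hash_ordinals` is a hypothetical helper, and using `hashlib.blake2b` with the dtype name as a byte prefix is just one assumed way to salt the hash.

```python
import hashlib
import struct


def hash_ordinals(ordinals, dtype_name=""):
    """Hash int64 ordinals to 64-bit ints, salted with a dtype name.

    Hypothetical helper for illustration only; not part of pandas.
    """
    # The dtype name (plus a separator byte) is prepended to every payload,
    # so identical ordinals under different dtypes get different hashes.
    salt = dtype_name.encode("utf-8") + b"\x00"
    hashes = []
    for ordinal in ordinals:
        payload = salt + struct.pack("<q", ordinal)
        digest = hashlib.blake2b(payload, digest_size=8).digest()
        hashes.append(int.from_bytes(digest, "little"))
    return hashes


same_ordinals = [10, 20]
hourly = hash_ordinals(same_ordinals, "period[H]")
daily = hash_ordinals(same_ordinals, "period[D]")
unsalted_a = hash_ordinals(same_ordinals)
unsalted_b = hash_ordinals(same_ordinals)
```

With no salt the two ordinal lists hash identically (`unsalted_a == unsalted_b`), which is the collision described above. With the dtype name mixed in, `hourly` and `daily` differ even though the ordinals are the same.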


TomAugspurger · 2018-10-09T22:22:04Z

Slightly related, this is also an issue with datetime / datetimetz (the naive and tz-aware Series hash identically, so the assertion below passes):

In [86]: ns = np.array([946706400000000000, 946792800000000000, 946879200000000000,
    ...:        946965600000000000])
    ...:

In [87]: a = pd.Series(pd.DatetimeIndex(ns))

In [88]: b = pd.Series(pd.DatetimeIndex(ns, tz='UTC'))

In [89]: pd.util.testing.assert_series_equal(pd.util.hash_pandas_object(a), pd.util.hash_pandas_object(b))

TomAugspurger · 2018-10-09T22:28:08Z

Hmm, maybe we just don't care about different dtypes having the same hashed values (which is perfectly fair)

In [92]: a = pd.Series(pd.Categorical([0, 0, 1, 2]))

In [93]: b = pd.Series([0, 0, 1, 2])

In [94]: pd.util.hash_pandas_object(a)
Out[94]:
0    3713087409444908179
1    7478705303072568462
2    3975671353655200382
3    3563156779521628949
dtype: uint64

In [95]: pd.util.hash_pandas_object(b)
Out[95]:
0    3713087409444908179
1    7478705303072568462
2    3975671353655200382
3    3563156779521628949
dtype: uint64
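
One plausible reading of why the two outputs match: if a categorical is hashed by hashing its categories once and then looking the results up by code, the result is the same as hashing the underlying values directly. The sketch below illustrates that under stated assumptions. `hash_scalar`, `hash_plain`, and `hash_categorical` are hypothetical names, the hash function is a stand-in, and missing values (code -1) are ignored.

```python
import hashlib


def hash_scalar(value):
    """Hypothetical 64-bit hash of one scalar; a stand-in, not pandas' hash."""
    digest = hashlib.blake2b(repr(value).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def hash_plain(values):
    # Hash every element independently.
    return [hash_scalar(v) for v in values]


def hash_categorical(categories, codes):
    # Hash each category once, then look the hashes up by integer code.
    category_hashes = [hash_scalar(c) for c in categories]
    return [category_hashes[code] for code in codes]


plain = hash_plain([0, 0, 1, 2])
categorical = hash_categorical(categories=[0, 1, 2], codes=[0, 0, 1, 2])
```

Here `plain == categorical`, because the dtype never enters the hash, only the values do.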

jreback · 2018-10-09T22:49:50Z

IIRC there is an issue about the datetime hashing from a while ago.

TomAugspurger · 2018-10-10T11:55:24Z

Looks like #16372

So we have two issues

1. An extension point for EAs to determine how they're hashed.
2. How to avoid "collisions" between different dtypes (#16372, "API/BUG: hashing of datetimes is based on UTC values").

Let's focus on the first issue here. I think EAs need some kind of way to say what values are hashed. Performance seems too critical to just .astype(object) here. So two options

1. A new _values_for_hashing method
2. Overload _values_for_factorize and use that

Right now I'm leaning toward 2.

jorisvandenbossche · 2018-10-10T22:12:57Z

> Performance seems too critical to just .astype(object) here. So two options

Also, astype(object) does not necessarily give you hashable values (e.g. that won't be the case for geometries).

What would be the reason not to use _values_for_factorize? They already need to be hashable, and should be unique to the original data they represent (since they are round-trippable)

TomAugspurger · 2018-10-11T01:40:29Z

re-using _values_for_factorize should be fine.
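
For readers following along, a rough sketch of what "re-use _values_for_factorize" could look like. It assumes only that the hook returns a `(values, na_value)` pair of hashable values. `ToyIntervalArray` and `hash_values` are hypothetical and written in pure Python; this is not the code that went into pandas.

```python
import hashlib


class ToyIntervalArray:
    """Hypothetical stand-in for an ExtensionArray; not a pandas class."""

    def __init__(self, left, right):
        self.left = list(left)
        self.right = list(right)

    def __len__(self):
        return len(self.left)

    def _values_for_factorize(self):
        # Same shape as the ExtensionArray hook: (hashable values, NA sentinel).
        return list(zip(self.left, self.right)), None


def hash_values(arr):
    """Hash an array, preferring its _values_for_factorize output if present."""
    if hasattr(arr, "_values_for_factorize"):
        values, _na_value = arr._values_for_factorize()
    else:
        values = list(arr)
    return [
        int.from_bytes(
            hashlib.blake2b(repr(v).encode("utf-8"), digest_size=8).digest(),
            "little",
        )
        for v in values
    ]


intervals = ToyIntervalArray(range(0, 10), range(1, 11))
interval_hashes = hash_values(intervals)
```

The hashing function never needs to know about the array type itself, only about the values it hands back. That is the appeal of the approach, and also why the dtype-collision question (#16372) stays a separate issue.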

jbrockmendel · 2019-10-24T23:57:54Z

It looks like hash_pandas_object isn't used outside of tests. Do we still need it? (It's exposed in pd.util, which is kind of weird.)

jreback · 2019-10-25T01:33:26Z

this is exposed as an api for other libraries (dask)

TomAugspurger added the ExtensionArray label Oct 9, 2018

TomAugspurger added this to the 0.24.0 milestone Oct 9, 2018

TomAugspurger added the Algos and Dtype Conversions labels Oct 9, 2018

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Oct 11, 2018

Support ExtensionArray in hash_pandas_object

1e1aa5c

Closes pandas-dev#23066

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Oct 11, 2018

Support ExtensionArray in hash_pandas_object

5935feb

Closes pandas-dev#23066

TomAugspurger mentioned this issue Oct 11, 2018

Support ExtensionArray in hash_pandas_object #23082

Merged

jreback closed this as completed in #23082 Oct 11, 2018

jreback pushed a commit that referenced this issue Oct 11, 2018

Support ExtensionArray in hash_pandas_object (#23082)

4001252

Closes #23066

tm9k1 pushed a commit to tm9k1/pandas that referenced this issue Nov 19, 2018

Support ExtensionArray in hash_pandas_object (pandas-dev#23082)

d6ae3a0

Closes pandas-dev#23066

rhshadrach mentioned this issue Sep 6, 2023

API: Is pandas.util public? #55023

Open
