hash_pandas_object on ExtensionArray-backed Series fails with TypeError #23066

Closed
TomAugspurger opened this issue Oct 9, 2018 · 8 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Dtype Conversions Unexpected or buggy dtype conversions ExtensionArray Extending pandas with custom dtypes or arrays.

Comments

@TomAugspurger
Contributor

In [1]: import pandas as pd

In [2]: pd.Series
Out[2]: pandas.core.series.Series

In [4]: s = pd.Series(pd.interval_range(0, periods=10))

In [5]: pd.util.hash_pandas_object(s)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-5-1b7247db4f16> in <module>
----> 1 pd.util.hash_pandas_object(s)

~/sandbox/pandas/pandas/core/util/hashing.py in hash_pandas_object(obj, index, encoding, hash_key, categorize)
     88     elif isinstance(obj, ABCSeries):
     89         h = hash_array(obj.values, encoding, hash_key,
---> 90                        categorize).astype('uint64', copy=False)
     91         if index:
     92             index_iter = (hash_pandas_object(obj.index,

~/sandbox/pandas/pandas/core/util/hashing.py in hash_array(vals, encoding, hash_key, categorize)
    269     # we'll be working with everything as 64-bit values, so handle this
    270     # 128-bit value early
--> 271     elif np.issubdtype(dtype, np.complex128):
    272         return hash_array(vals.real) + 23 * hash_array(vals.imag)
    273

~/Envs/pandas-dev/lib/python3.7/site-packages/numpy/core/numerictypes.py in issubdtype(arg1, arg2)
    712     """
    713     if not issubclass_(arg1, generic):
--> 714         arg1 = dtype(arg1).type
    715     if not issubclass_(arg2, generic):
    716         arg2_orig = arg2

TypeError: data type not understood

In [6]: s
Out[6]:
0     (0, 1]
1     (1, 2]
2     (2, 3]
3     (3, 4]
4     (4, 5]
5     (5, 6]
6     (6, 7]
7     (7, 8]
8     (8, 9]
9    (9, 10]
dtype: interval

Options

  1. convert to object before hashing
  2. add some kind of _hash_values to the interface. But, how do we prevent hash collisions between similar, but different EAs? For example, the fastest hash for a PeriodArray would be to just hash the ordinals. But we wouldn't want the following two to hash identically (using my PeriodArray branch)
In [35]: pd.core.arrays.PeriodArray._from_ordinals([10, 20], freq='H')
Out[35]:
<pandas PeriodArray>
['1970-01-01 10:00', '1970-01-01 20:00']
Length: 2, dtype: period[H]

In [36]: pd.core.arrays.PeriodArray._from_ordinals([10, 20], freq='D')
Out[36]:
<pandas PeriodArray>
['1970-01-11', '1970-01-21']
Length: 2, dtype: period[D]

So we need to mix the dtype information in too.
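One way to picture "mixing the dtype information in" is sketched below. This is only an illustration under stated assumptions, not what pandas does: `hash_with_dtype` is a hypothetical helper, the ordinals are plain integers standing in for a PeriodArray's ordinals, and the dtype is represented by its string name. The only pandas call used is the public `pd.util.hash_array`.

```python
import numpy as np
import pandas as pd


def hash_with_dtype(ordinals, dtype_name):
    """Hypothetical helper: hash integer ordinals, then mix in the dtype name."""
    value_hashes = pd.util.hash_array(np.asarray(ordinals, dtype="int64"))
    # Hash the dtype name once; indexing [0] yields a single uint64 scalar.
    dtype_hash = pd.util.hash_array(np.array([dtype_name], dtype=object))[0]
    # XOR keeps the result uint64 and makes it depend on both inputs.
    return value_hashes ^ dtype_hash


# Same ordinals, different (assumed) dtype names.
hourly = hash_with_dtype([10, 20], "period[H]")
daily = hash_with_dtype([10, 20], "period[D]")
print((hourly == daily).any())
```

XOR is used here only because it is simple; any combination that depends on both the per-element hash and the dtype hash would serve the same purpose.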

@TomAugspurger TomAugspurger added the ExtensionArray Extending pandas with custom dtypes or arrays. label Oct 9, 2018
@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Oct 9, 2018
@TomAugspurger TomAugspurger added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Dtype Conversions Unexpected or buggy dtype conversions labels Oct 9, 2018
@TomAugspurger
Contributor Author

Slightly related, this is also an issue with datetime / datetimetz

In [86]: ns = np.array([946706400000000000, 946792800000000000, 946879200000000000,
    ...:        946965600000000000])
    ...:

In [87]: a = pd.Series(pd.DatetimeIndex(ns))

In [88]: b = pd.Series(pd.DatetimeIndex(ns, tz='UTC'))

In [89]: pd.util.testing.assert_series_equal(pd.util.hash_pandas_object(a), pd.util.hash_pandas_object(b))
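The check above passes silently, meaning the naive and UTC-aware series hash to the same values. A rough way to rerun the comparison using only `pd.util.hash_pandas_object` and NumPy is sketched below; the variable names are illustrative, and whether the two results still match may depend on the pandas version, so the sketch prints the outcome rather than assuming it.

```python
import numpy as np
import pandas as pd

ns = np.array([946706400000000000, 946792800000000000,
               946879200000000000, 946965600000000000])

naive = pd.Series(pd.DatetimeIndex(ns))            # no timezone
aware = pd.Series(pd.DatetimeIndex(ns, tz="UTC"))  # same instants, tz-aware

naive_hash = pd.util.hash_pandas_object(naive)
aware_hash = pd.util.hash_pandas_object(aware)

# Compare the raw uint64 values element by element.
same = bool((naive_hash.to_numpy() == aware_hash.to_numpy()).all())
print("identical hashes:", same)
```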

@TomAugspurger
Contributor Author

Hmm, maybe we just don't care about different dtypes having the same hashed values (which is perfectly fair)

In [92]: a = pd.Series(pd.Categorical([0, 0, 1, 2]))

In [93]: b = pd.Series([0, 0, 1, 2])

In [94]: pd.util.hash_pandas_object(a)
Out[94]:
0    3713087409444908179
1    7478705303072568462
2    3975671353655200382
3    3563156779521628949
dtype: uint64

In [95]: pd.util.hash_pandas_object(b)
Out[95]:
0    3713087409444908179
1    7478705303072568462
2    3975671353655200382
3    3563156779521628949
dtype: uint64

@jreback
Contributor

jreback commented Oct 9, 2018

iirc there is an issue about the datetime hashing from a while ago

@TomAugspurger
Contributor Author

Looks like #16372

So we have two issues

  1. An extension point for EAs to determine how they're hashed.
  2. How to avoid "collisions" between different dtypes (API/BUG: hashing of datetimes is based on UTC values #16372)

Let's focus on the first issue here. I think EAs need some kind of way to say what values are hashed. Performance seems too critical to just .astype(object) here. So two options

  1. A new _values_for_hashing method
  2. Overload _values_for_factorize and use that

Right now I'm leaning toward 2.

@jorisvandenbossche
Member

> Performance seems too critical to just .astype(object) here. So two options

Also, astype(object) does not necessarily give you hashable values (eg won't be the case for geometries)

What would be the reason not to use _values_for_factorize? They already need to be hashable, and should be unique to the original data they represent (since they are round-trippable)

@TomAugspurger
Contributor Author

re-using _values_for_factorize should be fine.
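A minimal sketch of what re-using `_values_for_factorize` could look like, assuming the array returns a NumPy array as the first element of that method's result. `hash_via_factorize_values` is a hypothetical helper, not the pandas implementation, and `Categorical` is used only because its factorize values are plain integer codes.

```python
import numpy as np
import pandas as pd


def hash_via_factorize_values(arr):
    """Hypothetical helper: hash whatever the array exposes for factorizing."""
    # _values_for_factorize returns (values, na_value); only values are hashed.
    values, _ = arr._values_for_factorize()
    return pd.util.hash_array(np.asarray(values))


cat = pd.Categorical(["a", "b", "a"])
hashes = hash_via_factorize_values(cat)
print(hashes[0] == hashes[2], hashes[0] == hashes[1])
```

Note that this hashes only the factorize values (here the integer codes), so two arrays with the same codes but different categories would hash alike, which is the dtype-collision question raised earlier in the thread.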

@jbrockmendel
Member

It looks like hash_pandas_object isn't used outside of tests. Do we still need it? (It's exposed in pd.util, which is kind of weird.)

@jreback
Contributor

jreback commented Oct 25, 2019

this is exposed as an api for other libraries (dask)

4 participants