ENH: EA._hash_pandas_object #51319

jbrockmendel · 2023-02-11T00:10:29Z

closes ENH: hash_pandas_object hook for interface? #51108
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

jbrockmendel · 2023-02-18T02:43:23Z

@MarcoGorelli any thoughts here?

MarcoGorelli · 2023-02-20T12:05:00Z

I'll take a look this week, thanks for the ping

jbrockmendel · 2023-02-26T17:57:12Z

@phofl thoughts?

phofl

lgtm

MarcoGorelli · 2023-02-27T09:21:51Z

pandas/tests/extension/json/test_json.py

+    @pytest.mark.xfail(reason="ValueError: setting an array element with a sequence")
+    def test_hash_pandas_object(self, data):
+        super().test_hash_pandas_object(data)


could you expand on this please? why are we adding it, why does it fail, is it expected that it should pass?

JSONArray is kind of semi-abandoned, about 20% of the existing tests in this file are xfailed. So it is not "expected" in that I do not expect it to pass ever, but it is "expected" in that if anyone ever wanted to make it usable they would want to make this test pass.

MarcoGorelli · 2023-02-27T10:21:28Z

Also, does it need adding here

pandas/pandas/core/arrays/base.py

Lines 117 to 146 in 089aa0c

    
    Methods
    -------
    argsort
    astype
    copy
    dropna
    factorize
    fillna
    equals
    insert
    isin
    isna
    ravel
    repeat
    searchsorted
    shift
    take
    tolist
    unique
    view
    _accumulate
    _concat_same_type
    _formatter
    _from_factorized
    _from_sequence
    _from_sequence_of_strings
    _reduce
    _values_for_argsort
    _values_for_factorize

?

jbrockmendel · 2023-02-27T16:35:09Z

Also, does it need adding here

Yes, will update

MarcoGorelli · 2023-02-28T14:53:19Z

the docs job is failing, showing

2023-02-27T22:46:43.2184483Z /home/runner/work/pandas/pandas/pandas/core/arrays/base.py:docstring of pandas.core.arrays.base.ExtensionArray:116: WARNING: autosummary: stub file not found 'pandas.api.extensions.ExtensionArray._hash_pandas_object'. Check your autosummary_generate setting.

seems related?

jbrockmendel · 2023-02-28T19:35:06Z

it's complaining about "No Examples section found"? how do I tell it to ignore that?

phofl · 2023-02-28T20:46:47Z

the ignore list is in code_checks.sh

phofl · 2023-03-01T16:25:07Z

thx @jbrockmendel

jorisvandenbossche · 2023-05-31T18:06:47Z

pandas/core/arrays/base.py

+        """
+        from pandas.core.util.hashing import hash_array
+
+        values = self.to_numpy(copy=False)


@jbrockmendel is there a reason you didn't use self._values_for_factorize() as the values to hash (as default), which would preserve the current behaviour?

In addition, if we want to allow EAs to override this, we should either document what the keyword arguments mean, and/or we should expose hash_array publicly (e.g. in pandas.api.extensions)?

is there a reason you didn't use self._values_for_factorize()

IIRC it was some combination of _values_for_factorize not being required and trying to respect the "_vfc shouldn't be used for anything but factorize" policy.

In addition, if we want to allow EAs to override this, we should either document what the keyword arguments mean, and/or we should expose hash_array publicly (e.g. in pandas.api.extensions)?

Documenting makes sense.

IIRC it was some combination of _values_for_factorize not being required and trying to respect the "_vfc shouldn't be used for anything but factorize" policy.

I have said that before, but I don't see any problem with using that for both, as we have always done. They are very related, and typically values suitable for factorization will also be suitable for hashing, given that factorization is hash-based. And now with this new method, EAs can still override it if they want to use something else than _values_for_factorize.

We actually have a bunch of internal doc comments indicating that _values_for_factorize is also used for hash_pandas_object. And we can add this to the actual _values_for_factorize docstring in the base class to make this use case explicit.
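To make the proposal concrete, here is a toy model of the dispatch under discussion, written in plain Python rather than against pandas. `ToyArray`, `toy_hash_values`, `PointArray` and `IntArray` are hypothetical names; only `_values_for_factorize` and `_hash_pandas_object` come from this thread, and the real method's keyword arguments are left out for brevity.

```python
import hashlib


def toy_hash_values(values):
    """Hypothetical stand-in for hash_array: one 64-bit integer per element."""
    return [
        int.from_bytes(
            hashlib.blake2b(repr(v).encode("utf8"), digest_size=8).digest(),
            "little",
        )
        for v in values
    ]


class ToyArray:
    """Toy stand-in for the ExtensionArray base class (not pandas code)."""

    def __init__(self, data):
        self._data = list(data)

    def _values_for_factorize(self):
        # Same return shape as the real hook: (values, na_value).
        return self._data, None

    def _hash_pandas_object(self):
        # Proposed default: hash the values already exposed for factorization.
        values, _ = self._values_for_factorize()
        return toy_hash_values(values)


class PointArray(ToyArray):
    """Hypothetical external array of (x, y) pairs that relies on the default."""

    def _values_for_factorize(self):
        # Encode each point as bytes; the default hashing picks this up.
        encoded = [repr((float(x), float(y))).encode("utf8") for x, y in self._data]
        return encoded, None


class IntArray(ToyArray):
    """Hypothetical array that overrides hashing with its own fast path."""

    def _hash_pandas_object(self):
        # Mask to 64 bits so the result resembles a uint64 hash.
        return [v & 0xFFFFFFFFFFFFFFFF for v in self._data]


points = PointArray([(1, 2), (1, 2), (3, 4)])
point_hashes = points._hash_pandas_object()
int_hashes = IntArray([5, 5, 7])._hash_pandas_object()
```

In this sketch `PointArray` gets sensible hashes without implementing anything new, while `IntArray` shows that an override remains possible when the factorize values are not what an array wants to hash.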

For external EAs that relied on this, it is a regression that we no longer use it by default (e.g. for geopandas, hashing will now fail in hash_object_array and fall back to casting to string before trying again. But the string representation is not that faithful, and thus can give wrong hashes).
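A small illustration of why hashing a string representation can go wrong, assuming a display format that drops precision. `ToyPoint` and `hash64` are hypothetical and make no claim about how geopandas actually formats its geometries; the point is only that two distinct values with the same string form collide.

```python
import hashlib
import struct


def hash64(payload: bytes) -> int:
    """One 64-bit integer derived from raw bytes (hypothetical helper)."""
    return int.from_bytes(hashlib.blake2b(payload, digest_size=8).digest(), "little")


class ToyPoint:
    """Hypothetical geometry-like object with a lossy string form."""

    def __init__(self, x, y):
        self.x, self.y = float(x), float(y)

    def __str__(self):
        # Lossy display: coordinates rounded to one decimal place.
        return f"POINT ({self.x:.1f} {self.y:.1f})"

    def to_bytes(self):
        # Faithful encoding: the exact float64 bit patterns.
        return struct.pack("<dd", self.x, self.y)


p = ToyPoint(1.01, 2.0)
q = ToyPoint(1.02, 2.0)

# String fallback: distinct points share one string, so they share one hash.
hash_via_str = [hash64(str(pt).encode("utf8")) for pt in (p, q)]

# Faithful values: distinct points stay distinct.
hash_via_bytes = [hash64(pt.to_bytes()) for pt in (p, q)]
```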

And we can add this to the actual _values_for_factorize docstring in the base class to make this use case explicit.

It was actually documented, but that was removed here.

jorisvandenbossche · 2023-05-31T19:34:03Z

pandas/core/arrays/base.py

@@ -1452,6 +1448,31 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs):
    # Non-Optimized Default Methods; in the case of the private methods here,
    #  these are not guaranteed to be stable across pandas versions.

+    def _hash_pandas_object(


Something else: you added this method in the "Non-Optimized Default Methods" section, which says that private methods in this section are not guaranteed to be stable (in contrast to well established methods with an underscore that are meant to be overridden, like _values_for_factorize, _from_sequence, etc., which are defined higher up).

Is that indeed the intention? (in that case it should not be mentioned in the documentation that this is a method to override? Currently it is added to the docstring higher up in this file)
(If we don't want external EAs to override this, that's another reason we should certainly keep using _values_for_factorize)

in contrast to well established methods with an underscore

Is that what underscores are supposed to mean? I'm not eager to revive the EA-namespace-naming-scheme discussion, but that would fit well there.

IIRC the main motivation was allowing for a performant implementation for ArrowExtensionArray (without special-casing ArrowEA internally)

in contrast to well established methods with an underscore

Is that what underscores are supposed to mean? I'm not eager to revive the EA-namespace-naming-scheme discussion, but that would fit well there.

It's what the documented methods (whether or not they happen to have an underscore) mean: we currently document which methods can (or have to / are meant to) be overridden by EA implementations. And you added _hash_pandas_object to the documentation.

For example _from_sequence or _reduce have an underscore, but for sure external EAs can rely on implementing those, regardless of the underscore.

I am not sure which discussion needs to be revived (unless you don't agree that some of the methods with an underscore can be overridden by external EAs?)

IIRC the main motivation was allowing for a performant implementation for ArrowExtensionArray (without special-casing ArrowEA internally)

Yes, and that's fine I think. But in general when adding methods to the EA base class, we need to be conscious about whether we see such a method as something external EA authors can also override or not. Because if we explicitly are OK with external EA authors using it, we need to give it some higher level of stability.

We could also consider adding methods to our own classes and not to the base class, or have some pandas BaseArray(ExtensionArray) class that does a bunch of extra things we only want to do for our own ones, where we can be more flexible and don't have to worry about backwards compatibility.
That of course has the disadvantage of creating a split (and having to check, whenever we call it, whether the method exists, and have some fallback), but could be easier for certain optimizations where we don't directly want to promise stability / expose this to external EAs.
Speaking as such an external EA author, I think I would be fine with this (and when there is a request, we could "promote" a method to the official public EA interface). As a pandas maintainer, it might add some complexity to know which category a given method is in.
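The "check if the method exists and have some fallback" cost mentioned above could look roughly like this. It is a plain-Python sketch, not pandas internals: `hash_array_like`, `hash64`, `InternalArray` and `ExternalArray` are hypothetical names, and the hook is called without the keyword arguments the real method takes.

```python
import hashlib


def hash64(text):
    """One 64-bit integer derived from a string (hypothetical helper)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")


def hash_array_like(arr):
    """Use the array's own hashing hook if it has one, else a generic path."""
    hook = getattr(arr, "_hash_pandas_object", None)
    if callable(hook):
        return hook()
    # Fallback for arrays that do not implement the optional hook.
    return [hash64(repr(v)) for v in arr]


class InternalArray:
    """Hypothetical pandas-owned array that implements the optional hook."""

    def __init__(self, data):
        self._data = list(data)

    def __iter__(self):
        return iter(self._data)

    def _hash_pandas_object(self):
        # Fast path: mask the integers to 64 bits.
        return [v & 0xFFFFFFFFFFFFFFFF for v in self._data]


class ExternalArray:
    """Hypothetical third-party array that only supports iteration."""

    def __init__(self, data):
        self._data = list(data)

    def __iter__(self):
        return iter(self._data)


fast = hash_array_like(InternalArray([1, 2]))
generic = hash_array_like(ExternalArray([1, 2]))
```

Every call site that wants the optimization needs this kind of branch, which is the maintenance cost weighed against not having to promise stability for the hook.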

I am not sure which discussion needs to be revived (unless you don't agree that some of the methods with an underscore can be overridden by external EAs?)

I'm specifically not interested in reviving the naming convention discussion.

I assume anything in the EA namespace is liable to be overridden by subclass authors.

I assume anything in the EA namespace is liable to be overridden by subclass authors.

The reason for my original comment here is that we have the following comment in the base EA class (one that you added I think):

pandas/pandas/core/arrays/base.py

Lines 1447 to 1450 in d06f2d3

    # ------------------------------------------------------------------------
    # Non-Optimized Default Methods; in the case of the private methods here,
    #  these are not guaranteed to be stable across pandas versions.

So while you say that you think anything in the EA namespace can be overridden by EA authors, the above comment at least seems to indicate there are some methods for which we guarantee stability, while for others it's at your own risk (to potentially keep it compatible with every new minor version).

So essentially all that my original comment wanted to say is: if we want EA authors to implement _hash_pandas_object (as you also suggest in #53501), we should move that method a bit higher up in this file, above that comment about private methods not being stable.

ENH: EA._hash_pandas_object

089aa0c

MarcoGorelli self-requested a review February 20, 2023 12:04

simonjayhawkins added the ExtensionArray Extending pandas with custom dtypes or arrays. label Feb 22, 2023

jbrockmendel added the hashing hash_pandas_object label Feb 25, 2023

phofl approved these changes Feb 26, 2023


phofl added this to the 2.1 milestone Feb 26, 2023

MarcoGorelli reviewed Feb 27, 2023


jbrockmendel added 2 commits February 27, 2023 14:29

Merge branch 'main' into enh-ea-hash_pandas_object

aa7729a

update docstring

aad544c

jbrockmendel added 2 commits February 28, 2023 08:43

troubleshoot docbuild

318adcf

troubleshoot code check build

b024c7d

jbrockmendel added 2 commits February 28, 2023 15:20

ignore in code_checks

9a7d669

Merge branch 'main' into enh-ea-hash_pandas_object

ae3f0cb

phofl merged commit 02adb3d into pandas-dev:main Mar 1, 2023

jbrockmendel deleted the enh-ea-hash_pandas_object branch March 1, 2023 16:31

jorisvandenbossche reviewed May 31, 2023


jorisvandenbossche mentioned this pull request May 31, 2023

Use _values_for_factorize by default for hashing ExtensionArrays #53475

Merged

jorisvandenbossche reviewed May 31, 2023



