Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: EA._hash_pandas_object #51319

Merged
merged 7 commits into from
Mar 1, 2023

Conversation

jbrockmendel
Copy link
Member

@jbrockmendel
Copy link
Member Author

@MarcoGorelli any thoughts here?

@MarcoGorelli MarcoGorelli self-requested a review February 20, 2023 12:04
@MarcoGorelli
Copy link
Member

I'll take a look this week, thanks for the ping

@simonjayhawkins simonjayhawkins added the ExtensionArray Extending pandas with custom dtypes or arrays. label Feb 22, 2023
@jbrockmendel jbrockmendel added the hashing hash_pandas_object label Feb 25, 2023
@jbrockmendel
Copy link
Member Author

@phofl thoughts?

Copy link
Member

@phofl phofl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@phofl phofl added this to the 2.1 milestone Feb 26, 2023
Comment on lines +243 to +245
@pytest.mark.xfail(reason="ValueError: setting an array element with a sequence")
def test_hash_pandas_object(self, data):
super().test_hash_pandas_object(data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you expand on this please? why are we adding it, why does it fail, is it expected that it should pass?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSONArray is kind of semi-abandoned, about 20% of the existing tests in this file are xfailed. So it is not "expected" in that I do not expect it to pass ever, but it is "expected" in that if anyone ever wanted to make it usable they would want to make this test pass.

@MarcoGorelli
Copy link
Member

Also, does it need adding here

Methods
-------
argsort
astype
copy
dropna
factorize
fillna
equals
insert
isin
isna
ravel
repeat
searchsorted
shift
take
tolist
unique
view
_accumulate
_concat_same_type
_formatter
_from_factorized
_from_sequence
_from_sequence_of_strings
_reduce
_values_for_argsort
_values_for_factorize

?

@jbrockmendel
Copy link
Member Author

Also, does it need adding here

Yes, will update

@MarcoGorelli
Copy link
Member

the docs job is failing , showing

2023-02-27T22:46:43.2184483Z /home/runner/work/pandas/pandas/pandas/core/arrays/base.py:docstring of pandas.core.arrays.base.ExtensionArray:116: WARNING: autosummary: stub file not found 'pandas.api.extensions.ExtensionArray._hash_pandas_object'. Check your autosummary_generate setting.

seems related?

@jbrockmendel
Copy link
Member Author

its complaining about No Examples section found? how do i tell it to ignore that?

@phofl
Copy link
Member

phofl commented Feb 28, 2023

the ignore list is in code_checks.sh

@phofl phofl merged commit 02adb3d into pandas-dev:main Mar 1, 2023
@phofl
Copy link
Member

phofl commented Mar 1, 2023

thx @jbrockmendel

@jbrockmendel jbrockmendel deleted the enh-ea-hash_pandas_object branch March 1, 2023 16:31
"""
from pandas.core.util.hashing import hash_array

values = self.to_numpy(copy=False)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel is there a reason you didn't use self._values_for_factorize() as the values to hash (as default), which would preserve the current behaviour?

In addition, if we want to allow EAs to override this, we should either document what the keyword arguments mean, and/or we should expose hash_array publicly (eg in pandas.api.extensions) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason you didn't use self._values_for_factorize()

IIRC it was some combination of _values_for_factorize not being required and trying to respect the _vfc shouldn't be used for anything but factorize policy.

In addition, if we want to allow EAs to override this, we should either document what the keyword arguments mean, and/or we should expose hash_array publicly (eg in pandas.api.extensions) ?

Documenting makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC it was some combination of _values_for_factorize not being required and trying to respect the _vfc shouldn't be used for anything but factorize policy.

I have said that before, but I don't see any problem with using that for both, as we have always done. They are very related, and typically values suitable for factorization will also be suitable for hashing, given that factorization is hash-based. And now with this new method, EAs can still override it if they want to use something else than _values_for_factorize.

We actually have a bunch of internal doc comments indicating that _values_for_factorize is also used for hash_pandas_object. And we can add this to the actual _values_for_factorize docstring in the base class to make this use case explicit.

For external EAs that relied on this, this is a regression that we no longer use this by default (e.g. for geopandas this will now fail in hash_object_array, falling back to casting to string before trying it again. But the string representation is not that faithful, and thus can give wrong hashes)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And we can add this to the actual _values_for_factorize docstring in the base class to make this use case explicit.

It was actually documented, which was removed here.

@@ -1452,6 +1448,31 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs):
# Non-Optimized Default Methods; in the case of the private methods here,
# these are not guaranteed to be stable across pandas versions.

def _hash_pandas_object(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something else: you added this method in the "Non-Optimized Default Methods" section, which says that private methods in this section are not guaranteed to be stable (in contrast to well established methods with an underscore that are meant to be overridden like _values_for_factorize, _from_sequence, etc, which are defined more above).

Is that indeed the intention? (in that case it should not be mentioned in the documentation that this is a method to override? Currently it is added to the docstring higher up in this file)
(If we don't want that external EAs override this, that's a another reason we should certainly keep using _values_for_factorize)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in contrast to well established methods with an underscore

Is that what underscores are supposed to mean? I'm not eager to revive the EA-namespace-naming-scheme discussion, but that would fit well there.

IIRC the main motivation was allowing for a performant implementation for ArrowExtensionArray (without special-casing internally ArrowEA internally)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in contrast to well established methods with an underscore

Is that what underscores are supposed to mean? I'm not eager to revive the EA-namespace-naming-scheme discussion, but that would fit well there.

It's what the documented methods (that happen to have a underscore or not) mean: we currently document which methods can (or have to / are meant to) be overriden by EA implementations. And you added _hash_pandas_object to the documentation.

For example _from_sequence or _reduce have an underscore, but for sure external EAs can rely on implementing those, regardless of the underscore.

I am not sure which discussion needs to be revived (unless you don't agree that some of the methods with an underscore can be overridden by external EAs?)

IIRC the main motivation was allowing for a performant implementation for ArrowExtensionArray (without special-casing internally ArrowEA internally)

Yes, and that's fine I think. But in general when adding methods to the EA base class, we need to be conscious about whether we see such a method as something external EA authors can also override or not. Because if we explicitly are OK with external EA authors using it, we need to give it some higher level of stability.

We could also consider adding methods to our own classes and not to the base class, or have a some pandas BaseArray(ExtensionArray) class that does a bunch of extra things we only want to do for our own ones, where we can be more flexible and don't have to worry about backwards compatibility.
That of course has the disadvantage of creating a split (and having to check, whenever we call that, if the methods exists and have some fallback), but could be easier for certain optimizations where we don't directly want to promise stability / expose this to external EAs.
Speaking as such an external EA author, I think I would be fine with this (and when there is request, we could "promote" a method to the official public EA interface). As a pandas maintainer, it might give some more complexity to know if some method is in which category.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure which discussion needs to be revived (unless you don't agree that some of the methods with an underscore can be overridden by external EAs?)

I'm specifically not interested in reviving the naming convention discussion.

I assume anything in the EA namespace is liable to be overridden by subclass authors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume anything in the EA namespace is liable to be overridden by subclass authors.

The reason of my original comment here is because we have the following comment in the base EA class (one that you added I think):

# ------------------------------------------------------------------------
# Non-Optimized Default Methods; in the case of the private methods here,
# these are not guaranteed to be stable across pandas versions.

So while you say that you think anything in the EA namespace can be overridden by EA authors, the above comment at least seems to indicate there are some methods for which we guarantee stability, while for others it's at your own risk (to potentially keep it compatible with every new minor version).

So essentially all that my original comment wanted to say is: if we want that EA authors need to implement _hash_object_array (as you also suggest in #53501), we should move that method a bit higher up in this file to be above that comment about private methods not being stable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays. hashing hash_pandas_object
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: hash_pandas_object hook for interface?
5 participants