Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtensionDtype should be hashable #22476

Closed
TomAugspurger opened this issue Aug 23, 2018 · 6 comments

Comments

Projects
None yet
2 participants
@TomAugspurger
Copy link
Contributor

commented Aug 23, 2018

Using this for sparse in #22325, it'd be good to have generally.

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Aug 23, 2018

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Oct 4, 2018

I'll also use this for #22994.

It looks like our default ExtensionDtype.eq is pretty dangerous. It's basically wrong for any kind of parametrized type. I'll do some reading about how to handle this.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Oct 4, 2018

So, a safe and low-tech option is to have EA authors declare the meaningful attributes of their ExtensionDtype. For PeriodDtype, this is freq. For CategorcialDtype its the categories and order.

class PeriodDtype(ExtensionDtype):
    _attributes = ['freq']

    def __eq__(self, other):
        if isinstance(other, compat.string_types):
            try:
                other = self._construct_from_string(other)
            except TypeError:
                return False
        if isinstance(other, type(self)):
            return all(
                getattr(self, attr) == getattr(other, attr)
                for attr in self._attributes
            )
        return False

    def __hash__(self):
        return hash(tuple(getattr(self, attr) for attr in self._attributes))

(in reality, __hash__ and __eq__ would go on the base ExtensionDtype class, not PeriodDtype.

That requires a couple things from all the _attributes.

  1. Their __eq__ returns a boolean
  2. They're hashable

CategoricalDtype.categories fails on both those counts, so the default implementation would work for it, but that's fine. I think that's a less common case.

The downside is a tiny bit of duplication. We could do fancier things with metaclasses probably, but that seems like overkill.

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Oct 4, 2018

TomAugspurger added a commit that referenced this issue Oct 8, 2018

API: ExtensionDtype Equality and Hashability (#22996)
* API: ExtensionDtype Equality and Hashability

Closes #22476

brute4s99 added a commit to brute4s99/pandas that referenced this issue Nov 19, 2018

API: ExtensionDtype Equality and Hashability (pandas-dev#22996)
* API: ExtensionDtype Equality and Hashability

Closes pandas-dev#22476
@h-vetinari

This comment has been minimized.

Copy link
Contributor

commented Nov 20, 2018

I guess I missed this somehow - my attempt at a SetArray (#22382) has been dormant for some time while other issues were more pressing. In any case, having hashability as a general requirement makes it impossible to implement an extension array for sets, unless there is a way to circumvent that requirement somehow.

While implementation is not around the corner, I'd like for this to be a possibility further down the line. In the same spirit, @jreback commented on my first attempt at sets (without using an extension array) in #21547 (comment):

EA is an extension array & a dtype, both of which are first class. That's how I'd like to see sets (and lists and dicts)

EDIT: If need be, I guess the backend could use an array of frozenset or even tuple, but that wouldn't be my first choice, as it makes implementing the methods much more cumbersome.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Nov 20, 2018

@h-vetinari

This comment has been minimized.

Copy link
Contributor

commented Nov 20, 2018

@TomAugspurger
Thanks for the quick response. This was my misunderstanding - I read "Pandas now requires that extension dtypes be hashable" in https://pandas-docs.github.io/pandas-docs-travis/whatsnew/v0.24.0.html#extensiontype-changes, and got the wrong idea.

@TomAugspurger

This comment has been minimized.

Copy link
Contributor Author

commented Nov 20, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.