Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DEPR: values_for_argsort, values_for_factorize, from_factorized #53501

Open
jbrockmendel opened this issue Jun 2, 2023 · 4 comments
Open

DEPR: values_for_argsort, values_for_factorize, from_factorized #53501

jbrockmendel opened this issue Jun 2, 2023 · 4 comments
Labels
Deprecate Functionality to remove in pandas ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Jun 2, 2023

History: when originally designing EAs there was a hope/thought that many methods could be implemented in terms of a small number of core methods, of which values_for_factorize (vff) and values_for_argsort (vfa) were two of the main ones. Over time we found that many of the places we used these other than factorize/argsort were causing problems and they got pruned.

At this point we are down to only a few internal uses of each. _from_factorized is used only in EA.factorize. vfa is used in EA.argsort, EA.rank, and nargminmax (which in turn is used in EA.argmin/argmax). vff is used in EA.factorize and merge._factorize_keys. #53475 will restore it as being used in hash_pandas_object.

We should deprecate these patterns entirely.

  1. The status quo regarding whether these are required/encouraged is confusing. The solution is to have less stuff.
  2. Because the default vff casts to object, any place we use it on a EA that doesn't override it is slow.
    2b) In factorize_keys we special-case MaskedDtype and ArrowDtype to avoid this performance hit. That special-casing is a code smell.
  3. The merge._factorize_keys usage means authors cannot override a cast to numpy. This would be a huge pain point for potential GPU/distributed EAs.

Implementation-wise, a deprecation could look like:

  1. Deprecate EA.factorize saying in the future it will raise AbstractMethodError.
  2. Move nanargminmax to an EA method _nanargminmax.
  3. Deprecate EA._nanargminmax, EA.argsort, EA.rank saying in the future they will raise AbstractMethodError. Can suggest the vfa pattern for interested authors.
  4. Not sure exactly about merge._factorize_keys, will look into it.
@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 2, 2023
@lithomas1 lithomas1 added Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action ExtensionArray Extending pandas with custom dtypes or arrays. and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 2, 2023
@jorisvandenbossche
Copy link
Member

1. ... The solution is to have less stuff.

So what's the exact alternative that you are proposing? If we remove _values_for_factorize and _values_for_argsort, you intend that EA authors have to implement all of EA.factorize, EA._hash_pandas_object, EA.argsort, EA.argmin, EA.argmax, EA._rank themselves? (and we would also need some new mechanism to deal with the merge use case, although this is something we might want anyway at some point)

That doesn't really sound as "less stuff" for EA authors. But also not necessarily for pandas, as we would need to expose some of the lower level utility functions publicly to help EA authors to implement those methods, so also increasing the API surface on our side.

2. Because the default vff casts to object, any place we use it on a EA that doesn't override it is slow.
2b) In factorize_keys we special-case MaskedDtype and ArrowDtype to avoid this performance hit. That special-casing is a code smell.

I think this could easily be solved by allowing _values_for_factorize return a values, mask in addition to values, na_sentinel (this could unify those different code paths in _factorize_keys, only needing a single case for ExtensionArrays)? That's something we already agreed upon in the past that it would be good to do this (#33276)

@jbrockmendel
Copy link
Member Author

jbrockmendel commented Jun 19, 2023

So what's the exact alternative that you are proposing?

Let's focus on values_for_factorize (vff) for now. Let's also assume that something like #53696 gets rid of the merge usage of vff.

I propose adding a deprecation to the base class factorize and _hash_pandas_object saying that they are deprecated and EA authors should implement it directly. The message/docs could include a suggested pattern for implementing them. In the case of hash_pandas_object we could use the now-reverted to_numpy() implementation.

@jorisvandenbossche
Copy link
Member

The message/docs could include a suggested pattern for implementing them.

That would mean that we have to expose the pandas.core.algorithms.factorize_array and pandas.core.util.hashing.hash_array functions (which are currently used to implement those methods) publicly, right? (eg in pandas.api.extensions)
There is already a top-level pandas.factorize, but that doesn't have all functionality of factorize_array (na_sentinel and mask support)

To be clear, I have no problems with doing that. I am mostly not fully convinced we have much to gain here for the trouble going through this deprecation. If _values_for_factorize is only used for factorize and hash_object_array, it's doesn't keep much code complexity to just keep this convenience for EA authors?

@jbrockmendel
Copy link
Member Author

That would mean that we have to expose the pandas.core.algorithms.factorize_array and pandas.core.util.hashing.hash_array functions (which are currently used to implement those methods) publicly, right? (eg in pandas.api.extensions)

An alternative would be for the suggested implementation to wrap the numpy values in PandasArray and call that object's factorize/_hash_pandas_object.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Deprecate Functionality to remove in pandas ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

3 participants