DEPR: values_for_argsort, values_for_factorize, from_factorized #53501

jbrockmendel · 2023-06-02T16:36:28Z

History: when originally designing EAs there was a hope/thought that many methods could be implemented in terms of a small number of core methods, of which values_for_factorize (vff) and values_for_argsort (vfa) were two of the main ones. Over time we found that many of the places we used these other than factorize/argsort were causing problems and they got pruned.

At this point we are down to only a few internal uses of each. _from_factorized is used only in EA.factorize. vfa is used in EA.argsort, EA.rank, and nargminmax (which in turn is used in EA.argmin/argmax). vff is used in EA.factorize and merge._factorize_keys. #53475 will restore it as being used in hash_pandas_object.

We should deprecate these patterns entirely.

The status quo regarding whether these are required/encouraged is confusing. The solution is to have less stuff.
Because the default vff casts to object, any place we use it on a EA that doesn't override it is slow.
2b) In factorize_keys we special-case MaskedDtype and ArrowDtype to avoid this performance hit. That special-casing is a code smell.
The merge._factorize_keys usage means authors cannot override a cast to numpy. This would be a huge pain point for potential GPU/distributed EAs.

Implementation-wise, a deprecation could look like:

Deprecate EA.factorize saying in the future it will raise AbstractMethodError.
Move nanargminmax to an EA method _nanargminmax.
Deprecate EA._nanargminmax, EA.argsort, EA.rank saying in the future they will raise AbstractMethodError. Can suggest the vfa pattern for interested authors.
Not sure exactly about merge._factorize_keys, will look into it.

jorisvandenbossche · 2023-06-19T12:55:32Z

1. ... The solution is to have less stuff.

So what's the exact alternative that you are proposing? If we remove _values_for_factorize and _values_for_argsort, you intend that EA authors have to implement all of EA.factorize, EA._hash_pandas_object, EA.argsort, EA.argmin, EA.argmax, EA._rank themselves? (and we would also need some new mechanism to deal with the merge use case, although this is something we might want anyway at some point)

That doesn't really sound as "less stuff" for EA authors. But also not necessarily for pandas, as we would need to expose some of the lower level utility functions publicly to help EA authors to implement those methods, so also increasing the API surface on our side.

2. Because the default vff casts to object, any place we use it on a EA that doesn't override it is slow.
2b) In factorize_keys we special-case MaskedDtype and ArrowDtype to avoid this performance hit. That special-casing is a code smell.

I think this could easily be solved by allowing _values_for_factorize return a values, mask in addition to values, na_sentinel (this could unify those different code paths in _factorize_keys, only needing a single case for ExtensionArrays)? That's something we already agreed upon in the past that it would be good to do this (#33276)

jbrockmendel · 2023-06-19T22:12:36Z

So what's the exact alternative that you are proposing?

Let's focus on values_for_factorize (vff) for now. Let's also assume that something like #53696 gets rid of the merge usage of vff.

I propose adding a deprecation to the base class factorize and _hash_pandas_object saying that they are deprecated and EA authors should implement it directly. The message/docs could include a suggested pattern for implementing them. In the case of hash_pandas_object we could use the now-reverted to_numpy() implementation.

jorisvandenbossche · 2023-06-20T07:14:26Z

The message/docs could include a suggested pattern for implementing them.

That would mean that we have to expose the pandas.core.algorithms.factorize_array and pandas.core.util.hashing.hash_array functions (which are currently used to implement those methods) publicly, right? (eg in pandas.api.extensions)
There is already a top-level pandas.factorize, but that doesn't have all functionality of factorize_array (na_sentinel and mask support)

To be clear, I have no problems with doing that. I am mostly not fully convinced we have much to gain here for the trouble going through this deprecation. If _values_for_factorize is only used for factorize and hash_object_array, it's doesn't keep much code complexity to just keep this convenience for EA authors?

jbrockmendel · 2023-06-21T23:08:02Z

That would mean that we have to expose the pandas.core.algorithms.factorize_array and pandas.core.util.hashing.hash_array functions (which are currently used to implement those methods) publicly, right? (eg in pandas.api.extensions)

An alternative would be for the suggested implementation to wrap the numpy values in PandasArray and call that object's factorize/_hash_pandas_object.

jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 2, 2023

lithomas1 added Deprecate Functionality to remove in pandas Needs Discussion Requires discussion from core team before further action ExtensionArray Extending pandas with custom dtypes or arrays. and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 2, 2023

jorisvandenbossche mentioned this issue Jun 20, 2023

ENH: EA._hash_pandas_object #51319

Merged

5 tasks

jbrockmendel mentioned this issue Jul 4, 2023

ENH: allow EAs to override MergeOperation._get_join_indexers #53696

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DEPR: values_for_argsort, values_for_factorize, from_factorized #53501

DEPR: values_for_argsort, values_for_factorize, from_factorized #53501

jbrockmendel commented Jun 2, 2023 •

edited

jorisvandenbossche commented Jun 19, 2023

jbrockmendel commented Jun 19, 2023 •

edited

jorisvandenbossche commented Jun 20, 2023

jbrockmendel commented Jun 21, 2023

DEPR: values_for_argsort, values_for_factorize, from_factorized #53501

DEPR: values_for_argsort, values_for_factorize, from_factorized #53501

Comments

jbrockmendel commented Jun 2, 2023 • edited

jorisvandenbossche commented Jun 19, 2023

jbrockmendel commented Jun 19, 2023 • edited

jorisvandenbossche commented Jun 20, 2023

jbrockmendel commented Jun 21, 2023

jbrockmendel commented Jun 2, 2023 •

edited

jbrockmendel commented Jun 19, 2023 •

edited