ENH: Sorting of ExtensionArrays #19957

TomAugspurger · 2018-03-01T20:48:41Z

This enables {Series,DataFrame}.sort_values and {Series,DataFrame}.argsort. These are required for factorize, which is required for groupby (my end goal).

I haven't implemented ExtensionArray.sort_values yet because it hasn't become necessary. But I can if needed / desired.

jorisvandenbossche · 2018-03-01T21:46:07Z

pandas/core/arrays/base.py

+        kind : {'quicksort', 'mergesort', 'heapsort'}, optional
+            Sorting algorithm.
+        order : str or list of str, optional
+            Included for NumPy compatibility.


Is this compatibility needed because in the code we use np.argsort(values) which passes those keywords to the method?
(it is a bit unfortunate ..)

It's not necessary, and I can remove it.

I see now that Categorical.argsort has a different signature. I suppose we should match that.
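
For illustration, a minimal sketch of why the extra keywords matter: when `np.argsort` is handed a non-ndarray object that has an `argsort` method, NumPy forwards its own arguments to that method. The `Recorder` class is hypothetical (not part of pandas), and the exact arguments forwarded depend on the NumPy version, so the sketch only records them.

```python
import numpy as np


class Recorder:
    """Hypothetical array-like that records how np.argsort calls it."""

    def __init__(self, data):
        self.data = np.asarray(data)
        self.seen = None

    def argsort(self, *args, **kwargs):
        # Record whatever NumPy forwarded (the exact arguments vary by
        # NumPy version), then sort the underlying ndarray.
        self.seen = (args, kwargs)
        return np.argsort(self.data)


rec = Recorder([3, 1, 2])
order = np.argsort(rec)  # dispatches to Recorder.argsort
print(order, rec.seen)
```

A method with a narrower signature has to accept (or validate and discard) these NumPy-compatibility arguments, which is the reason `kind` and `order` appear in the docstring above.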

codecov · 2018-03-02T16:01:50Z

Codecov Report

Merging #19957 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #19957      +/-   ##
==========================================
+ Coverage   91.77%   91.78%   +<.01%     
==========================================
  Files         152      152              
  Lines       49205    49223      +18     
==========================================
+ Hits        45159    45177      +18     
  Misses       4046     4046

Flag	Coverage Δ
#multiple	`90.16% <100%> (ø)`	⬆️
#single	`41.84% <35.71%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/arrays/base.py	`83.33% <100%> (+2.68%)`	⬆️
pandas/core/arrays/categorical.py	`96.2% <100%> (-0.02%)`	⬇️
pandas/core/window.py	`96.26% <0%> (-0.01%)`	⬇️
pandas/plotting/_core.py	`82.27% <0%> (ø)`	⬆️
pandas/io/json/normalize.py	`96.93% <0%> (+0.06%)`	⬆️
pandas/util/testing.py	`84.11% <0%> (+0.16%)`	⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7273ea0...4885245. Read the comment docs.

TomAugspurger · 2018-03-02T16:09:25Z

Changed how this is organized a bit, to reflect a pattern I noticed here and elsewhere.

In several places (here, factorize, unique), a method like .argsort is composed of three parts

1.) Data coercion / prep
2.) The actual algorithm
3.) postprocessing

In the case of argsort, it's

1.) Just the array for most types, the codes for Categorical
2.) np.argsort
3.) Maybe reversing for ascending=False

So I split the method in two: EA.argsort and EA._values_for_argsort. For the common case of "I just want to pick which array gets sent to the algo", you just have to override _values_for_argsort. If you need total control (e.g. if you aren't using np.argsort to do the actual work), then you'll need to override EA.argsort and do everything.

I don't know how useful this will prove to be, but wanted to hear others' thoughts.
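
A minimal sketch of the three-part split described above, assuming hypothetical stand-in classes (`SketchArray`, `SketchCategorical`) rather than the actual pandas implementation. The base class owns the algorithm and the post-processing; a subclass only chooses which ndarray is sorted.

```python
import numpy as np


class SketchArray:
    """Hypothetical stand-in for an ExtensionArray (not the pandas class)."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def _values_for_argsort(self):
        # Part 1: data coercion / prep. Default: sort the data itself.
        return self._data

    def argsort(self, ascending=True, kind="quicksort"):
        values = self._values_for_argsort()
        result = np.argsort(values, kind=kind)  # Part 2: the algorithm
        if not ascending:
            result = result[::-1]  # Part 3: post-processing
        return result


class SketchCategorical(SketchArray):
    """Hypothetical categorical: sorts by integer codes, not by labels."""

    def __init__(self, values, categories):
        self.categories = list(categories)
        self._codes = np.array([self.categories.index(v) for v in values])

    def _values_for_argsort(self):
        # Only the array handed to np.argsort changes.
        return self._codes


cat = SketchCategorical(["b", "c", "a"], categories=["c", "b", "a"])
print(cat.argsort())                 # sorted by category order c < b < a
print(cat.argsort(ascending=False))
```

In this sketch the subclass never touches keyword validation or the `ascending` handling, which is the duplication the split is meant to avoid.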

jorisvandenbossche · 2018-03-02T16:12:04Z

Do you foresee similar patterns for other algos? Like _values_for_factorize (not sure if that makes sense). Just to think about if we would get a proliferation of such methods

TomAugspurger · 2018-03-02T18:00:56Z

Yes factorize would be another. Though it would be a bit more complicated. I'll probably remove it for now. That means a bit more duplication, but fewer levels of indirection.


shoyer · 2018-03-02T19:55:37Z

What is the use-case for writing your own sorting algorithm? Maybe radix sort when your data falls into pre-known categories?

My inclination would be to only include _values_for_argsort, since that means minimal work for extension array authors.

TomAugspurger · 2018-03-02T21:16:48Z

my inclination would be to only include _values_for_argsort

Do you mean not having an argsort method then? From pandas' point of view, having argsort is useful, as we don't have to check the array type in sort_values and other places where we use arr.argsort.

TomAugspurger · 2018-03-02T22:33:37Z

pandas/core/arrays/categorical.py

-        based on matching category values. Thus, this function can be
-        called on an unordered Categorical instance unlike the functions
-        'Categorical.min' and 'Categorical.max'.
+    def argsort(self, *args, **kwargs):


Not sure if this changes our opinion on _values_for_argsort, but apparently Python 2 has issues with passing the arguments through correctly to the super() call.

____________________ TestCategoricalSort.test_numpy_argsort ____________________

self = <pandas.tests.categorical.test_sorting.TestCategoricalSort object at 0x7efcb391f950>

    def test_numpy_argsort(self):
        c = Categorical([5, 3, 1, 4, 2], ordered=True)

        expected = np.array([2, 4, 1, 3, 0])
>       tm.assert_numpy_array_equal(np.argsort(c), expected, check_dtype=False)

pandas/tests/categorical/test_sorting.py:26:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../miniconda3/envs/pandas/lib/python2.7/site-packages/numpy/core/fromnumeric.py:886: in argsort
    return argsort(axis, kind, order)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = [5, 3, 1, 4, 2]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
ascending = -1, kind = 'quicksort', args = (None,), kwargs = {}

    def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
        """
        Returns the indices that would sort the Categorical instance if
        'sort_values' was called. This function is implemented to provide
        compatibility with numpy ndarray objects.

        While an ordering is applied to the category values, arg-sorting
        in this context refers more to organizing and grouping together
        based on matching category values. Thus, this function can be
        called on an unordered Categorical instance unlike the functions
        'Categorical.min' and 'Categorical.max'.

        Returns
        -------
        argsorted : numpy array

        See also
        --------
        numpy.ndarray.argsort
        """
        # Keep the implementation here just for the docstring.
        return super(Categorical, self).argsort(ascending=ascending, kind=kind,
>                                               *args, **kwargs)
E       TypeError: argsort() got multiple values for keyword argument 'ascending'

Changing the Categorical.argsort to accept just *args, **kwargs fixes things, since ExtensionArray does the argument validation, but it's a bit unfortunate.
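
The collision in the traceback can be sketched without pandas or NumPy. The classes below are hypothetical; the call mimics the positional `(axis, kind, order)` arguments visible in the traceback (`ascending = -1, kind = 'quicksort', args = (None,)`). Leftover positionals in `*args` are bound to the parent's parameters before the explicit keywords, so `ascending` ends up supplied twice.

```python
class Base:
    def argsort(self, ascending=True, kind="quicksort", *args, **kwargs):
        return ("base", ascending, kind, args, kwargs)


class ForwardsKeywords(Base):
    def argsort(self, ascending=True, kind="quicksort", *args, **kwargs):
        # The leftover positional in *args lands on the parent's first
        # parameter (ascending), which then clashes with ascending=...
        return super().argsort(ascending=ascending, kind=kind,
                               *args, **kwargs)


class ForwardsEverything(Base):
    def argsort(self, *args, **kwargs):
        # Pass everything through untouched; the parent sorts it out.
        return super().argsort(*args, **kwargs)


try:
    ForwardsKeywords().argsort(-1, "quicksort", None)
    collided = False
except TypeError as exc:
    collided = True
    print("TypeError:", exc)

passthrough = ForwardsEverything().argsort(-1, "quicksort", None)
print(passthrough)
```

The second class corresponds to the `*args, **kwargs` workaround mentioned above: it avoids the clash at the cost of a less informative signature.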

jreback · 2018-03-04T20:21:44Z

pandas/core/arrays/base.py

@@ -236,6 +237,52 @@ def isna(self):
        """
        raise AbstractMethodError(self)

+    def _values_for_argsort(self):
+        # type: () -> ndarray
+        """Get the ndarray to be passed to np.argsort.


why is this needed? shouldn't this be one of our myriad of _values methods/properties here?

_ndarray_values seems like more of an internal thing, no? I don't know enough to say whether the _ndarray_values appropriate for our current uses (mostly indexing IIRC) are also appropriate for argsort.

my point is what is the point of overriding this specific one? why is a general-purpose EA method/property not used here? The proliferation of methods/properties is really troublesome.

if someone wants to override argsort, great. but providing an indirect mechanism is really kludgey.

my point is what is the point of overriding this specific one?

This PR is implementing argsort. I could see a similar pattern for factorize.

The proliferation of methods/properties is really troublesome.

How so?

but providing an indirect mechanism is really kludgey.

What's kludgey about it? I think the common case will be overriding the values provided to np.argsort, not overriding the sorting algorithm itself. This is true for Categorical and IPArray, and will be true for Period and probably others.

Without _values_for_argsort we, and 3rd party libraries, will have duplicate code for validating keyword arguments and the ascending kwarg.

We don't necessarily want to convert extension arrays into the same NumPy array used for np.asarray().

For example, the IP Address extension array probably wants to convert into a NumPy object array of IPAddress objects. But for sorting, it could just return a NumPy structured array with a few integer fields, which will be much faster for comparisons than Python IPAddress objects.

Indeed. In the IPArray case, there's a numpy array, so _values_for_argsort is just that array.
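
A rough sketch of the structured-array idea, assuming IPv6 addresses split into two unsigned 64-bit fields. The field names `hi` and `lo` and the sample addresses are made up for illustration; this is not how any particular IP array library stores its data. `np.argsort` on a structured array compares the fields in order, so the result matches sorting the Python address objects.

```python
import ipaddress

import numpy as np

addresses = [
    ipaddress.ip_address(a)
    for a in ["2001:db8::2", "::1", "2001:db8::1", "2001:db7::5"]
]

# Split each 128-bit integer into high and low 64-bit halves.
mask = 0xFFFFFFFFFFFFFFFF
values = np.array(
    [(int(a) >> 64, int(a) & mask) for a in addresses],
    dtype=[("hi", "u8"), ("lo", "u8")],
)

# Fields are compared in dtype order: "hi" first, then "lo".
order = np.argsort(values)

# Reference result using comparisons on the Python objects.
expected = sorted(range(len(addresses)), key=lambda i: addresses[i])
print(order.tolist(), expected)
```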

OR simply call np.asarray() if you actually need an ndarray.

Then we're back in the duplicate code situation. That's OK for the base class, but Categorical, Period, and Interval will end up re-implementing argsort from scratch. With _values_for_argsort, it's just a matter of constructing that array (codes, ordinals, concat left & right).

On the name _values_for_argsort, it's possible that we'll find other uses for it, in which case we just change the name.

Do you know if a simple array appropriate for arg-sorting is also appropriate for factorize, joins, indexing, etc? I'm not sure ahead of time.

Do you know if a simple array appropriate for arg-sorting is also appropriate for factorize, joins, indexing, etc?

Well, we can scratch factorize / groupby off. Categorical defines _codes_for_groupby separately from its ndarray_values (codes)

pandas/pandas/core/arrays/categorical.py

Line 638 in 3783ccc

def _codes_for_groupby(self, sort):

So we can't just use one array for everything.

ok, so then let's pick a name and change _codes_for_groupby. in this refactor we want to find other use cases and fix our code now rather than later.

something like: _int_mapping_for_values

jreback · 2018-03-07T14:13:15Z

pandas/core/arrays/base.py

@@ -236,6 +237,52 @@ def isna(self):
        """
        raise AbstractMethodError(self)

+    def _values_for_argsort(self):
+        # type: () -> ndarray
+        """Get the ndarray to be passed to np.argsort.


my point is what is the point of overriding this specific one? why is a general-purpose EA method/property not used here? The proliferation of methods/properties is really troublesome.

jreback · 2018-03-07T14:13:51Z

pandas/core/arrays/base.py

@@ -236,6 +237,52 @@ def isna(self):
        """
        raise AbstractMethodError(self)

+    def _values_for_argsort(self):
+        # type: () -> ndarray
+        """Get the ndarray to be passed to np.argsort.


if someone wants to override argsort, great. but providing an indirect mechanism is really kludgey.

jreback · 2018-03-10T02:10:46Z

pandas/core/arrays/base.py

@@ -236,6 +237,52 @@ def isna(self):
        """
        raise AbstractMethodError(self)

+    def _values_for_argsort(self):
+        # type: () -> ndarray
+        """Get the ndarray to be passed to np.argsort.


why is this different from _ndarray_values, which is implemented on EA arrays? having 2 is just plain confusing, not to mention you have a `_formatting_values`.

TomAugspurger · 2018-03-13T15:30:49Z

The linting failure is fixed in #20330

jorisvandenbossche · 2018-03-20T13:33:59Z

Many of the things we are discussing here, we will only encounter in practice (like the question of whether we will get many such _values_for_argsort interfaces). So I think it is more important to get something minimally working merged, so people (and we in pandas) can start using and experimenting with it.

I am fine with the current PR (both _values_for_argsort and argsort), although I personally don't really care much (from a user perspective) for having argsort.
So a "minimal" conservative route to go forward (minimal from user point of view), would also be to only have _values_for_argsort for extension authors.
We can always later add more public methods if there is demand for it. And I think it is also no problem to still change things in the extension interface (for the extension author, not user) later on (I suppose we will label this experimental for some time).

Although having both _values_for_argsort and argsort methods on ExtensionArray might actually be kind of a compromise between the viewpoints of Jeff and Stephan :-)

shoyer · 2018-03-20T15:37:39Z

So I think it is more important to get something minimally working merged, so people (and we in pandas) can start using and experimenting with it.

👍

TomAugspurger · 2018-03-20T19:02:40Z

I think (hope?) that we're OK proceeding with this as is then.

c776133 unskips the sorting tests for JSONArray. The strategy is to sort an array of the dictionary items converted to tuples, which is maybe the most sensible way to sort a dictionary. Anyway, we have an example of sorting something that isn't regularly sortable.
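
One way the tuple strategy could look, as a sketch only: each dictionary is frozen into a tuple of its items, the tuples are stored in an object-dtype array, and `np.argsort` then falls back to Python's lexicographic tuple comparison. The sample data is invented, and the result depends on each dictionary's insertion order.

```python
import numpy as np

data = [{"b": 2}, {"a": 1}, {"a": 1, "b": 0}]

# Freeze each dict into a tuple of (key, value) pairs.
frozen = [tuple(d.items()) for d in data]

# Fill an object array element by element so NumPy keeps each tuple
# as a single element instead of trying to build a 2-D array.
values = np.empty(len(frozen), dtype=object)
for i, item in enumerate(frozen):
    values[i] = item

order = np.argsort(values)
print(order.tolist())
```

Because the tuples follow dictionary insertion order, two dictionaries with the same items inserted in a different order would sort differently.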

The remaining skip is for df.sort_values(['A', 'B']), which factorizes the column. That should be fixed by #20361, which this is blocking.

Then we'll have groupby, which is shaping up to be a surprisingly small change.

Update: #20361 won't quite enable df.sort_values(['A', 'B']). Categorical would need a bit of work to allow non-hashable items in the categories (not a high priority).

TomAugspurger · 2018-03-20T21:39:42Z

CI all passed.

TomAugspurger · 2018-03-21T11:59:14Z

@jreback are you +1 here, given the discussion above? Since the last time around we discovered that _ndarray_values can't hope to be used for both factorize and argsorting.

TomAugspurger · 2018-03-22T12:33:36Z

@jreback thoughts?

jreback · 2018-03-22T14:14:54Z

will look a bit later

jreback · 2018-03-22T23:22:23Z

thanks @TomAugspurger

I am still greatly concerned about the expansion of the private API. Since it's completely private I guess that's ok, but we should definitely try to be as minimal as possible and remove things that are duplicative / unnecessary.

More to the point, there is a fair amount of divergence between naming inside pandas internals and EA. (and numpy for that matter). This needs to be addressed in the short term. Otherwise we end up with a complex & confusing API that no-one will be able to contribute to in the future.

Let's make a pro-active effort to minimize these frictions. (IOW create a new issue).

jorisvandenbossche · 2018-03-23T09:17:38Z

Since its completely private

Note that those APIs are not private. They are part of the extension array interface, so 'public' for extension authors.

there is a fair amount of divergence between naming inside pandas internals and EA

First, I am not sure this would be a bad thing. EAs are new. If we go the route of Series and Index being composed of an ExtensionArray (and not subclassing them), then they are something different, and I think it is good to use different names for things that do something differently.

But, I currently also don't see any divergence.
The specific methods we now have for ExtensionArrays:

_constructor_from_sequence -> the same concept does not exist for Series/Index, so new name is normal
_values_for_argsort -> this is not relevant for Series/Index, something new for EA
_formatting_values -> this is consistent with the Block method
_concat_same_type -> this is consistent with the Block method (and there is also no equivalent for Series). There is an Index._concat_same_dtype, which is maybe something to investigate to see if we can get rid of it.
_can_hold_na -> this is consistent with both Series and Block method
_ndarray_values -> consistent with naming of equivalent attribute on Series/Index

It is of course true that there is divergence with numpy arrays, but if that was not needed, we wouldn't have needed to make such an ExtensionArray in the first place.

jreback · 2018-03-23T10:05:41Z

https://circleci.com/gh/pandas-dev/pandas/12793#tests/containers/3

is breaking. I think some json EA tests are depending on a certain ordering.

@TomAugspurger

jreback · 2018-03-23T10:08:21Z

@jorisvandenbossche

First, I am not sure this would be a bad thing. EAs are new. If we go the route of Series and Index being composed of an ExtensionArray (and not subclassing them), then they are something different, and I think it is good to use different names for things that do something differently.

My point is there are now 2 ways / names of doing things (in some cases). This is not great and should be addressed by re-aligning naming in the Series.

A new reader to the codebase will be very confused on what to use when. Sure, EAs are not exactly like Series/Index. But over the years we have gone to great lengths to make these look and feel and be implemented in a very similar way. So the divergence between Index and EAs is disturbing. My point is that before we move more on EA, Indexes need to be subclassed and become real EAs.

jorisvandenbossche · 2018-03-23T10:13:52Z

My point is there are now 2 ways / names of doing things (in some cases).

Which cases? Please be specific. I gave a list above, and I don't see any with 2 ways of naming.

jreback · 2018-03-23T10:16:07Z

Index is not an EA subclass.

jorisvandenbossche · 2018-03-23T10:25:12Z

Sorry, what is the relation with my previous comment?
Yes, Index is not an EA subclass, and that is on purpose because 1) that's a much bigger undertaking and 2) I don't think we already agree on whether it should become one or if we should do composition (as we do with normal arrays)

jreback · 2018-03-23T10:39:09Z

my point is that when Index becomes a subclass that all of the implementation details of it should follow a single convention (for construction / indexing) etc. These would naturally use the EA conventions. The problem now is they are different. This is just technical debt.

jorisvandenbossche · 2018-03-23T12:43:22Z

@jreback I am not sure that Index will become a subclass of ExtensionArray, but let's not discuss that here, but rather in #19696 (comment)

ENH: Sorting of ExtensionArrays

0ec3600

This enables {Series,DataFrame}.sort_values and {Series,DataFrame}.argsort

TomAugspurger added the Algos (Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff) and ExtensionArray (Extending pandas with custom dtypes or arrays) labels Mar 1, 2018

TomAugspurger added this to the 0.23.0 milestone Mar 1, 2018

jorisvandenbossche reviewed Mar 1, 2018

TomAugspurger changed the title ~~ENH: Sorting of ExtensionArrays~~ [WIP]ENH: Sorting of ExtensionArrays Mar 1, 2018

REF: Split argsort into two parts

4707273

TomAugspurger changed the title ~~[WIP]ENH: Sorting of ExtensionArrays~~ ENH: Sorting of ExtensionArrays Mar 2, 2018

Fixed docstring

b61fb8d

Remove _values_for_argsort

44b6d72

Revert "Remove _values_for_argsort"

5be3917

This reverts commit 44b6d72.

TomAugspurger added 3 commits March 2, 2018 15:54

Merge remote-tracking branch 'upstream/master' into fu1+sort

e474c20

Workaround Py2

c2578c3

Indexer as array

b73e303

TomAugspurger commented Mar 2, 2018

Fixed dtypes

0db9e97

jreback requested changes Mar 4, 2018

jreback requested changes Mar 7, 2018

jreback requested changes Mar 10, 2018

TomAugspurger added 4 commits March 12, 2018 09:05

Fixed docstring

baf624c

Merge remote-tracking branch 'upstream/master' into fu1+sort

ce92f7b

Merge remote-tracking branch 'upstream/master' into fu1+sort

8cbfc36

Merge remote-tracking branch 'upstream/master' into fu1+sort

425fb2a

Unskip most JSON tests

c776133

Skip tests on 3.5 and lower

4885245

Require stable dictionary order

TomAugspurger added a commit to TomAugspurger/cyberpandas that referenced this pull request Mar 20, 2018

Fixed factorize for MACArray

fa420ad

Relies on pandas-dev/pandas#19957

TomAugspurger mentioned this pull request Mar 20, 2018

Fixed factorize for MACArray ContinuumIO/cyberpandas#13

Merged

TomAugspurger mentioned this pull request Mar 22, 2018

ENH/API: ExtensionArray.factorize #20361

Merged

jreback approved these changes Mar 22, 2018

jreback merged commit 85817a7 into pandas-dev:master Mar 22, 2018

TomAugspurger deleted the fu1+sort branch March 23, 2018 11:05

TomAugspurger added a commit to ContinuumIO/cyberpandas that referenced this pull request Mar 27, 2018

Fixed factorize for MACArray (#13)

b3152b0

* Fixed factorize for MACArray Relies on pandas-dev/pandas#19957 * Build on na_value * Include groupby patch

javadnoorb pushed a commit to javadnoorb/pandas that referenced this pull request Mar 29, 2018

ENH: Sorting of ExtensionArrays (pandas-dev#19957)

c302b04

dworvos pushed a commit to dworvos/pandas that referenced this pull request Apr 2, 2018

ENH: Sorting of ExtensionArrays (pandas-dev#19957)

e02c7de

kornilova203 pushed a commit to kornilova203/pandas that referenced this pull request Apr 23, 2018

ENH: Sorting of ExtensionArrays (pandas-dev#19957)

9b326cd

jorisvandenbossche mentioned this pull request Apr 3, 2020

EA interface - requirements for "hashable, value+order-preserving ndarray" #33276

Open

jorisvandenbossche mentioned this pull request Nov 17, 2021

REF: Series.argsort should dispatch to EA.argsort #43840

Closed

3 tasks
