Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Sorting of ExtensionArrays #19957

Merged
merged 20 commits into from
Mar 22, 2018
Merged

Conversation

TomAugspurger
Copy link
Contributor

This enables {Series,DataFrame}.sort_values and {Series,DataFrame}.argsort. These are required for factorize, which is required for groupby (my end goal).

I haven't implemented ExtensionArray.sort_values yet because it hasn't become necessary. But I can if needed / desired.

This enables {Series,DataFrame}.sort_values and {Series,DataFrame}.argsort
@TomAugspurger TomAugspurger added Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff ExtensionArray Extending pandas with custom dtypes or arrays. labels Mar 1, 2018
@TomAugspurger TomAugspurger added this to the 0.23.0 milestone Mar 1, 2018
kind : {'quicksort', 'mergesort', 'heapsort'}, optional
Sorting algorithm.
order : str or list of str, optional
Included for NumPy compatibility.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this compatibility needed because in the code we use np.argsort(values) which passes those keywords to the method?
(it is a bit unfortunate ..)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not necessary, and I can remove it.

I see now that Categorical.argsort has a different signature. I suppose we should match that.

@TomAugspurger TomAugspurger changed the title ENH: Sorting of ExtensionArrays [WIP]ENH: Sorting of ExtensionArrays Mar 1, 2018
@TomAugspurger TomAugspurger changed the title [WIP]ENH: Sorting of ExtensionArrays ENH: Sorting of ExtensionArrays Mar 2, 2018
@codecov
Copy link

codecov bot commented Mar 2, 2018

Codecov Report

Merging #19957 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #19957      +/-   ##
==========================================
+ Coverage   91.77%   91.78%   +<.01%     
==========================================
  Files         152      152              
  Lines       49205    49223      +18     
==========================================
+ Hits        45159    45177      +18     
  Misses       4046     4046
Flag Coverage Δ
#multiple 90.16% <100%> (ø) ⬆️
#single 41.84% <35.71%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/arrays/base.py 83.33% <100%> (+2.68%) ⬆️
pandas/core/arrays/categorical.py 96.2% <100%> (-0.02%) ⬇️
pandas/core/window.py 96.26% <0%> (-0.01%) ⬇️
pandas/plotting/_core.py 82.27% <0%> (ø) ⬆️
pandas/io/json/normalize.py 96.93% <0%> (+0.06%) ⬆️
pandas/util/testing.py 84.11% <0%> (+0.16%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7273ea0...4885245. Read the comment docs.

@TomAugspurger
Copy link
Contributor Author

Changed how this is organized a bit, to reflect a pattern I noticed here an elsewhere.

In several places (here, factorize, unique), a method like .argsort is composed of three parts

1.) Data coercion / prep
2.) The actual algorithm
3.) postprocessing

In the case of argsort, it's

1.) Just the array for most types, the codes for Categorical
2.) np.argsort
3.) Maybe reversiong for ascending=False

So I split the method in two EA.argsort and EA._values_for_argsort. For the common case of "I just want to pick which array gets sent to the algo", you just have to overrride _values_for_argort. If you need total control (e.g. if you aren't using np.argsort to do the actual work), then you'll need to override EA.argsort and do everything.

I don't know how useful this will prove to be, but wanted to hear other's thoughts.

@jorisvandenbossche
Copy link
Member

Do you foresee similar patterns for other algos? Like _values_for_factorize (not sure if that makes sense). Just to think about if we would get a proliferation of such methods

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Mar 2, 2018 via email

@shoyer
Copy link
Member

shoyer commented Mar 2, 2018

What is the use-case for writing your own sorting algorithm? Maybe radix sort when your data falls into pre-known categories?

My inclination would be to only cinlude _values_for_argsort, since that means minimal work for extension array authors.

@TomAugspurger
Copy link
Contributor Author

my inclination would be to only cinlude _values_for_argsort

Do you mean not having an argsort method then? From pandas point of view, having argsort, as we don't have to check the array time in sort_values, and other places where we use arr.argsort.

based on matching category values. Thus, this function can be
called on an unordered Categorical instance unlike the functions
'Categorical.min' and 'Categorical.max'.
def argsort(self, *args, **kwargs):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if this changes our opinion on _values_for_argsort, but the apparently Python2 has issues with passing through the arguments correctly to the super() call.

____________________ TestCategoricalSort.test_numpy_argsort ____________________

self = <pandas.tests.categorical.test_sorting.TestCategoricalSort object at 0x7efcb391f950>

    def test_numpy_argsort(self):
        c = Categorical([5, 3, 1, 4, 2], ordered=True)
    
        expected = np.array([2, 4, 1, 3, 0])
>       tm.assert_numpy_array_equal(np.argsort(c), expected,
                                    check_dtype=False)

pandas/tests/categorical/test_sorting.py:26: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../miniconda3/envs/pandas/lib/python2.7/site-packages/numpy/core/fromnumeric.py:886: in argsort
    return argsort(axis, kind, order)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = [5, 3, 1, 4, 2]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
ascending = -1, kind = 'quicksort', args = (None,), kwargs = {}

    def argsort(self, ascending=True, kind='quicksort', *args, **kwargs):
        """
            Returns the indices that would sort the Categorical instance if
            'sort_values' was called. This function is implemented to provide
            compatibility with numpy ndarray objects.
    
            While an ordering is applied to the category values, arg-sorting
            in this context refers more to organizing and grouping together
            based on matching category values. Thus, this function can be
            called on an unordered Categorical instance unlike the functions
            'Categorical.min' and 'Categorical.max'.
    
            Returns
            -------
            argsorted : numpy array
    
            See also
            --------
            numpy.ndarray.argsort
            """
        # Keep the implementation here just for the docstring.
        return super(Categorical, self).argsort(ascending=ascending, kind=kind,
>                                               *args, **kwargs)
E       TypeError: argsort() got multiple values for keyword argument 'ascending'

Changing the Categorical.argsort to accept just *args, **kwargs fixes things, since ExtensionArray does the argument validation, but it's a bit unfortunate.

@@ -236,6 +237,52 @@ def isna(self):
"""
raise AbstractMethodError(self)

def _values_for_argsort(self):
# type: () -> ndarray
"""Get the ndarray to be passed to np.argsort.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed? shouldn't this one of our myriad of _values methods/properties here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_ndarray_valaues seems like more of an internal thing, no? I don't know enough to say whether the _ndarray_values appropriate for our current uses (mostly indexing IIRC) are also appropriate for argsort.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is what is the point of overriding this specific one? why is not a general purpose EA method/property used here. The proliferation of methods properties is really troublesome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if someone wants to override argsort great. but providing an indirect mechism is really kludgey.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is what is the point of overriding this specific one?

This PR is implementing argsort. I could see a similar pattern for factorize.

The proliferation of methods properties is really troublesome.

How so?

but providing an indirect mechism is really kludgey.

What's kudgey about it? I think the common case will be overriding the values provided to np.argsort, not overriding the sorting algorithm itself. This is true for Categorical and IPArray, and will be true for Period and probably others.

Without _values_for_argsort we, and 3rd party libraries, will have duplicate code for validating keyword arguments and the ascending kwarg.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't necessarily want to convert into extension arrays into the same NumPy array used for np.asarray().

For example, the IP Address extension array probably wants to convert into numpy object array of IPAddress object. But for sorting, it could just return numpy structured array with a few integer fields, which will be much faster for comparisons than Python IPAddress objects.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed. In the IPArray case, there's a numpy array, so _values_for_argsort is just that array.

OR simply call np.asarray() if you actually need an ndarray.

Then we're back in the duplicate code situation. That's OK for the base class, but Categorical, Period, and Interval will end up re-implementing argsort from scratch. With _values_for_argsort, it's just a matter of constructing that array (codes, ordinals, concat left & right).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On the name _values_for_argsort, it's possible that we'll find other uses for it, in which case we just change the name.

Do you know if a simple array appropriate for arg-sorting is also appropriate for factorize, joins, indexing, etc? I'm not sure ahead of time.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if a simple array appropriate for arg-sorting is also appropriate for factorize, joins, indexing, etc?

Well, we can scratch factorize / groupby off. Categorical defines _codes_for_groupby separately from its ndarray_values (codes)

def _codes_for_groupby(self, sort):

So we can't just use one array for everything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, so then let's pick a name change _codes_for_groupby. in this refactor we want to find other usecases and fix our code now rather than later.

something like: _int_mapping_for_values

@@ -236,6 +237,52 @@ def isna(self):
"""
raise AbstractMethodError(self)

def _values_for_argsort(self):
# type: () -> ndarray
"""Get the ndarray to be passed to np.argsort.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my point is what is the point of overriding this specific one? why is not a general purpose EA method/property used here. The proliferation of methods properties is really troublesome.

@@ -236,6 +237,52 @@ def isna(self):
"""
raise AbstractMethodError(self)

def _values_for_argsort(self):
# type: () -> ndarray
"""Get the ndarray to be passed to np.argsort.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if someone wants to override argsort great. but providing an indirect mechism is really kludgey.

@@ -236,6 +237,52 @@ def isna(self):
"""
raise AbstractMethodError(self)

def _values_for_argsort(self):
# type: () -> ndarray
"""Get the ndarray to be passed to np.argsort.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this different from _ndarray_values, which is implemented on EA arrays? having 2 is just plain confusing, not to mention you have a ``_formatting_values```.

@TomAugspurger
Copy link
Contributor Author

The linting failure is fixed in #20330

@jorisvandenbossche
Copy link
Member

Many of things we are discussing here, we will only encounter in practice (like the question if we will get many of such _values_for_argsort interfaces). So I think it is more important to get something minimally working merged, so people (and we in pandas) can start using and experimenting with it.

I am fine with the current PR (both _values_for_argsort and argsort), although I personally don't really care much (from a user perspective) for having argsort.
So a "minimal" conservative route to go forward (minimal from user point of view), would also be to only have _values_for_argsort for extension authors.
We can always later add more public methods if there is demand for it. And I think it is also no problem to still change things in the extension interface (for the extension author, not user) later on (I suppose we will label this experimental for some time).

Although having both _values_for_argsort and argsort methods on ExtensionArray might actually be kind of a compromise between the viewpoints of Jeff and Stephan :-)

@shoyer
Copy link
Member

shoyer commented Mar 20, 2018

So I think it is more important to get something minimally working merged, so people (and we in pandas) can start using and experimenting with it.

👍

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Mar 20, 2018

I think (hope?) then that we're OK proceeding with this as is then.

c776133 unskips the sorting tests for JSONArray. The strategy is to sort an array of the dictionary items converted to tuples, which is maybe the most sensible way to sort a dictionary. Anyway, we have an example of sorting something that isn't regularly sortable.

The remaining skip is for df.sort_values(['A', 'B']), which factorizes the column. That should be fixed by #20361, which this is blocking.

Then we'll have groupby, which is shaping up to be a surprisingly small change.

Update: #20361 won't quite enable df.sort_values(['A', 'B']). Categorical would need a bit of work to allow non-hashable items in the categories (not a high priority).

Require stable dictionary order
TomAugspurger added a commit to TomAugspurger/cyberpandas that referenced this pull request Mar 20, 2018
@TomAugspurger
Copy link
Contributor Author

CI all passed.

@TomAugspurger
Copy link
Contributor Author

@jreback are you +1 here, given the discussion above? Since the last time around we discovered that _ndarray_values can't hope to be used for both factorize and argsorting.

@TomAugspurger
Copy link
Contributor Author

@jreback thoughts?

@jreback
Copy link
Contributor

jreback commented Mar 22, 2018

will look a bit later

@jreback jreback merged commit 85817a7 into pandas-dev:master Mar 22, 2018
@jreback
Copy link
Contributor

jreback commented Mar 22, 2018

thanks @TomAugspurger

I am still greatly concerned about the expansion of the private API. Since its completely private I guesss that's ok, but we should definitely try to be as minimal as possible and remove things that are duplicative / unecessary.

More to the point, there is a fair amount of divergence between naming inside pandas internals and EA. (and numpy for that matter). This needs to be addressed in the short term. Otherwise we end up with a complex & confusing API that no-one will be able to contribute to in the future.

Let's make a pro-active effort to minimize these frictions. (IOW create a new issue).

@jorisvandenbossche
Copy link
Member

Since its completely private

Note that those APIs are not private. They are part of the extension array interface, so 'public' for extension authors.

there is a fair amount of divergence between naming inside pandas internals and EA

First, I am not sure this would be a bad thing. EAs are new. If we go the route of Series and Index being composed of an ExtensionArray (and not subclassing them), then they are something different, and I think it is good to use different names for things that do something differently.

But, I currently also don't see any divergence.
The specific methods we now have for ExtensionArrays:

  • _constructor_from_sequence -> the same concept does not exist for Series/Index, so new name is normal
  • _values_for_argsort -> this is not relevant for Series/Index, something new for EA
  • _formatting_values -> this is consistent with the Block method
  • _concat_same_type -> this is consistent with the Block method (and there is also no equivalent for Series). There is a Index._concat_same_dtype, which is maybe something to investigate if we can get rid of that.
  • _can_hold_na -> this is consistent with both Series and Block method
  • _ndarray_values -> consistent with naming of equivalent attribute on Series/Index

It is of course true that there is divergence with numpy arrays, but if that was not needed, we wouldn't have needed to make such an ExtensionArray in the first place.

@jreback
Copy link
Contributor

jreback commented Mar 23, 2018

https://circleci.com/gh/pandas-dev/pandas/12793#tests/containers/3

is breaking. I think some json EA tests are depending on a certain ordering.

@TomAugspurger

@jreback
Copy link
Contributor

jreback commented Mar 23, 2018

@jorisvandenbossche

First, I am not sure this would be a bad thing. EAs are new. If we go the route of Series and Index being composed of an ExtensionArray (and not subclassing them), then they are something different, and I think it is good to use different names for things that do something differently.

My point is there are now 2 ways / names of doing things (in some cases). This is not great and should be addressed by re-aligning naming in the Series.

A new reader to the codebase will be very confused on what to use when. Sure EA's are not exactly like Series/Index. But over the years we have gone to great lengths to make these look and feel and be implemented in a very similar way. So the divergence between Index / EA's is disturbing. My point is that before we move more on EA, Index's need to be subclassed and become real EA's.

@jorisvandenbossche
Copy link
Member

My point is there are now 2 ways / names of doing things (in some cases).

Which cases? Please be specific. I gave a list above, and I don't see any with 2 ways of naming.

@jreback
Copy link
Contributor

jreback commented Mar 23, 2018

Index is not an EA subclass.

@jorisvandenbossche
Copy link
Member

Sorry, what is the relation with my previous comment?
Yes, Index is not a EA subclass, and that is on purpose because 1) that's a much bigger undertaking 2) I don't think we already agree on whether it should become one or if we should do composition (as we do with normal arrays)

@jreback
Copy link
Contributor

jreback commented Mar 23, 2018

my point is that when Index becomes a subclass that all of the implementation details of it should follow a single convention (for construction / indexing) etc. These would naturally use the EA conventions. The problem now is they are different. This is just technical debt.

@TomAugspurger TomAugspurger deleted the fu1+sort branch March 23, 2018 11:05
@jorisvandenbossche
Copy link
Member

@jreback I am not sure that Index will become a subclass of ExtensionArray, but let's not discuss that here, but rather in #19696 (comment)

TomAugspurger added a commit to ContinuumIO/cyberpandas that referenced this pull request Mar 27, 2018
* Fixed factorize for MACArray

Relies on pandas-dev/pandas#19957

* Build on na_value

* Include groupby patch
javadnoorb pushed a commit to javadnoorb/pandas that referenced this pull request Mar 29, 2018
dworvos pushed a commit to dworvos/pandas that referenced this pull request Apr 2, 2018
kornilova203 pushed a commit to kornilova203/pandas that referenced this pull request Apr 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants