Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Added key option to df/series.sort_values(key=...) and df/series.sort_index(key=...) sorting #27237

Merged
merged 64 commits into from
Apr 27, 2020

Conversation

jacobaustin123
Copy link
Contributor

Added a key parameter to the DataFrame/Series.sort_values and sort_index functions matching Python sorted semantics for allowing custom sorting orders. Address open issue #3942.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice PR! Can you add support for Index to go with DataFrame / Series as well?

pandas/core/frame.py Outdated Show resolved Hide resolved
pandas/core/frame.py Outdated Show resolved Hide resolved
@WillAyd
Copy link
Member

WillAyd commented Jul 6, 2019 via email

@pep8speaks
Copy link

pep8speaks commented Jul 6, 2019

Hello @jacobaustin123! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-04-27 02:12:02 UTC

@jacobaustin123
Copy link
Contributor Author

Just pushed a new commit that adds support for Index.sort_values(key=...) and adds a key flag to nargsort(...) and lexsort_indexer(...). I've also added static type hints. I still use Index.map to apply the key function for sort_index functions, because it seems simpler and semantically better to apply the key to the values and not the codes.

doc/source/whatsnew/v0.25.1.rst Outdated Show resolved Hide resolved
pandas/core/frame.py Outdated Show resolved Hide resolved
pandas/core/indexes/base.py Outdated Show resolved Hide resolved
pandas/core/series.py Outdated Show resolved Hide resolved
@WillAyd
Copy link
Member

WillAyd commented Sep 5, 2019

@ja3067 is this still active?

@jacobaustin123
Copy link
Contributor Author

I will make the requested changes this weekend. The only decision I think needs to be made is how sort_values(key=...) and sort_index(key=...) should behave for multiple levels or columns. Should the same function be applied to every level/column, or should the key function be expected to operate on a tuple? Likewise, how should the key behave for sort_remaining? Should it be applied to everything?

@WillAyd
Copy link
Member

WillAyd commented Sep 20, 2019

The only decision I think needs to be made is how sort_values(key=...) and sort_index(key=...) should behave for multiple levels or columns.

Without a key is just operates on the levels individually, no? If not mistaken something should be the same here

@jacobaustin123
Copy link
Contributor Author

jacobaustin123 commented Sep 20, 2019

Without a key is just operates on the levels individually, no? If not mistaken something should be the same here

@WillAyd Agreed. In the case, say, of one numeric column and one string column, should we for instance support a dictionary of {column_name : Callable} so that different keys can be passed, or the key can be passed only to the primary column. Same question for sort_index.

@WillAyd
Copy link
Member

WillAyd commented Sep 22, 2019

I would leave that enhancement to a separate PR if requested

@WillAyd
Copy link
Member

WillAyd commented Nov 7, 2019

@ja3067 is this still active?

@jacobaustin123
Copy link
Contributor Author

jacobaustin123 commented Nov 13, 2019

@WillAyd I'll make the necessary updates this weekend. Sorry for the delay.

@jacobaustin123
Copy link
Contributor Author

jacobaustin123 commented Nov 19, 2019

@WillAyd Ok. The final PR is almost done. Last question is about the key for sortindex for a MultiIndex when level is specified. There are two options:

  1. the key function is applied to the entire MultiIndex tuple like the map function does currently. You could do something like key=lambda x : x[0].lower()
  2. the key function is applied to each level in the MultiIindex separately. The easiest way to do this would be to add a method to the MultiIndex called idx.maplevel(self, mapper, level=0) that applies a map function to a set of levels. However, this would probably require modifying the Index API. Would this be OK/worthwhile? This would be a little less flexible, but maybe more reasonable.

What do you think?

@jacobaustin123
Copy link
Contributor Author

@WillAyd I've incorporated the review changes, and it passes tests now. Should be good.

pandas/core/frame.py Outdated Show resolved Hide resolved
pandas/core/indexes/datetimelike.py Outdated Show resolved Hide resolved
pandas/core/indexes/datetimelike.py Outdated Show resolved Hide resolved
@WillAyd
Copy link
Member

WillAyd commented Nov 29, 2019

Can you merge master and repush? Not sure what 37 error was may be resolved

@jorisvandenbossche
Copy link
Member

@jorisvandenbossche the only big change is how keys are applied to MultiIindex objects. Keys are vectorized, and they're applied per column to DataFrames and now per level to MultiIndex. So if you do df.sort_index(key=...) the key is passed each level of the index separately. Otherwise things are the same.

Thanks, that sounds good!

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few small nitpicks. For the rest looks good!

doc/source/user_guide/basics.rst Outdated Show resolved Hide resolved
doc/source/whatsnew/v1.1.0.rst Outdated Show resolved Hide resolved
We've added a ``key`` argument to the DataFrame and Series sorting methods, including
:meth:`DataFrame.sort_values`, :meth:`DataFrame.sort_index`, :meth:`Series.sort_values`,
and :meth:`Series.sort_index`. The ``key`` can be any callable function which is applied
to the each column of a DataFrame before sorting is performed (:issue:`27237`). See
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
to the each column of a DataFrame before sorting is performed (:issue:`27237`). See
to each column of a DataFrame before sorting is performed (:issue:`27237`). See

Apart from the typo, I find "each column" a bit confusing, as it is of course only applied to those columns that are used for sorting?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See if the new version is clearer.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New version looks good!

doc/source/whatsnew/v1.1.0.rst Outdated Show resolved Hide resolved
pandas/core/generic.py Outdated Show resolved Hide resolved
pandas/core/indexes/datetimelike.py Show resolved Hide resolved
pandas/core/sorting.py Outdated Show resolved Hide resolved
@jacobaustin123
Copy link
Contributor Author

@WillAyd @jorisvandenbossche @jreback This is badly timed since I know we want to finish this, but I made some changes today that add support for dictionary keys, i.e.

df = DataFrame({0: ["Hello", "goodbye"], 1: [0, 1]})
result = df.sort_values([0, 1], key={0: lambda col: col.str.lower()})

So you can specify a different key function for different levels/columns. I have a commit ready I can push now, but if people prefer, we can finalize this and have that as a separate commit, although as part of this addition I refactored the code significantly so everything is pulled into ensure_key_mapped. @jreback might prefer that, since it means there's no more key in lexsort_indexer or nargsort. I just didn't want to push it if people wanted to merge this and be done.

@jreback
Copy link
Contributor

jreback commented Apr 14, 2020

let’s not add anything more

_as = self.argsort()
if not ascending:
_as = _as[::-1]
sorted_index = self.take(_as)
return sorted_index, _as
else:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback do you know what this second branch is here for? I'm tempted to make the argsort method the default since it's necessary for key sorting, unless it's an important optimization.

if axis == 0:
new_df = DataFrame._from_arrays(new_levels, df.columns, df.index)
else:
new_df = DataFrame._from_arrays(new_levels, df.index, df.columns).transpose()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to do this (construct a DataFrame along either row or column axis) that's better than transposing it like this?

@jacobaustin123
Copy link
Contributor Author

@jreback Ok. I just reverted those changes but kept some of the refactoring so e.g. lexort_indexer no longer has a key function, and ensure_key_mapped handles all the logic for different datatypes. There are two questions I point out in a review above. Otherwise I'm happier with this version and I think it achieves your goal for having key mapping totally encapsulated in ensure_key_mapped.

Dict keys can be an easy PR on top.

@jacobaustin123
Copy link
Contributor Author

@jreback gentle bump. would love to finish this up.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jacobaustin123 I would really like to merge this. but you keep changing an enormous amount of code. Please got back to when I make comments I think 10 or 12 days ago.

changed all of the ensure_key_mapped again, this is a mess. Please just go back to the much simpler version.

doc/source/user_guide/basics.rst Show resolved Hide resolved
doc/source/user_guide/basics.rst Show resolved Hide resolved
doc/source/user_guide/basics.rst Show resolved Hide resolved
doc/source/user_guide/basics.rst Show resolved Hide resolved
doc/source/whatsnew/v1.1.0.rst Show resolved Hide resolved
pandas/core/sorting.py Outdated Show resolved Hide resolved
pandas/core/sorting.py Outdated Show resolved Hide resolved
pandas/core/sorting.py Outdated Show resolved Hide resolved
pandas/core/sorting.py Outdated Show resolved Hide resolved
pandas/core/sorting.py Show resolved Hide resolved
@jacobaustin123
Copy link
Contributor Author

@jreback ok. I can force revert to the previous version, or just address your new comments here. Let me know what you prefer. The motivation for this was to do as you asked and move all logic to ensure_key_mapped.

@jreback
Copy link
Contributor

jreback commented Apr 26, 2020

@jreback ok. I can force revert to the previous version, or just address your new comments here. Let me know what you prefer. The motivation for this was to do as you asked and move all logic to ensure_key_mapped.

well I'ld lke to make the doc changes I have above, and not have 3 different ensure_key_mapped functions. you had 1 or maybe 2 before. Please simplify. I am happy to take changes as a followup, but this is too much now.

@jacobaustin123
Copy link
Contributor Author

@jreback. Understood. I'll address doc comments and revert.

@jreback
Copy link
Contributor

jreback commented Apr 26, 2020

@jreback. Understood. I'll address doc comments and revert.

thanks.

this is just a very large PR and hard to grok what has changed.

@jacobaustin123
Copy link
Contributor Author

@jreback ok I just reverted to the prior version and updated documentation. I had to force push, but I just reverted to the commit before the major changes and then added two commits on top.

@jreback
Copy link
Contributor

jreback commented Apr 27, 2020

kk will look soon thanks @jacobaustin123

@@ -2986,6 +3039,9 @@ def sort_values(
)

def _try_kind_sort(arr):
arr = ensure_key_mapped(arr, key)
arr = getattr(arr, "_values", arr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in followon you can use extract_array here

@@ -204,15 +218,14 @@ def lexsort_indexer(keys, orders=None, na_position: str = "last"):
elif orders is None:
orders = [True] * len(keys)

for key, order in zip(keys, orders):

for k, order in zip(keys, orders):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you do this

We've added a ``key`` argument to the DataFrame and Series sorting methods, including
:meth:`DataFrame.sort_values`, :meth:`DataFrame.sort_index`, :meth:`Series.sort_values`,
and :meth:`Series.sort_index`. The ``key`` can be any callable function which is applied
column-by-column to each column used for sorting, before sorting is performed (:issue:`27237`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add the orginal issue number here (or rather replace this issue number )

@jreback jreback merged commit dec736f into pandas-dev:master Apr 27, 2020
@jreback
Copy link
Contributor

jreback commented Apr 27, 2020

thanks @jacobaustin123

3 followon comments if you would. Also happy to take a refactoring PR in sorting generally. Targeted PRs are best.

@jacobaustin123
Copy link
Contributor Author

@jreback awesome thank you! I'll address these and maybe do a small refactoring PR. And maybe a second PR to add dictionary key sorting.

rhshadrach pushed a commit to rhshadrach/pandas that referenced this pull request May 10, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add key to sorting functions
9 participants