EHN/FIX: Add na_last parameter to DataFrame.sort. Fixes GH3917 #5231

unutbu · 2013-10-15T17:15:17Z

This is an attempt to fix the Nested Sort with NaN bug (#3917).

I've added tests to test_frame.py and test_hashtable.py to demonstrate the problem.

hashtable.Factorizer.factorize has been modified to map nan to na_sentinel. Before it was mapping nan to a label which was already being used. This, I believe is the origin of the bug.

My first idea was to mimick/reuse code from Series.order, since this method already handles nans nicely, and allows the user to choose if nans should be placed at the beginning or the end of the sort via the na_last parameter. Although I found a solution using code from Series.order, I eventually abandoned this when I realized this patches the problem at too high a level and that it could be handled more generally with a modification of factorize. I retained the idea that df.sort should have a na_last parameter, however.

To that end, groupby._lexsort_indexer has been modified to handle all possible combinations of na_last and orders settings. There are four tests (assertions) in test_frame.py to exercise the possibilities, one of which demonstrates that df.sort(['A','B']) now behaves correctly for the DataFrame shown in GH3917.

jreback · 2013-10-15T17:33:51Z

interesting..this may go pretty far towards solving #5190

jtratner · 2013-10-15T18:38:19Z

I'd like time to review this - please don't merge for now.

jreback · 2013-10-15T18:42:18Z

agreed..

jtratner · 2013-10-15T20:09:36Z

First, you need to put the new argument in last position everywhere so
you don't break anything (and don't accidentally have users entering
something wrong - esp because usually no difference between 1 and True)

On external API - how about calling it handle_na, and then you can pass
either "rm" (remove them), "fill" (pass fill value), "last" (end up as
last), could also do "first" (end up as first) etc. this reminds me that I
need to add na-handling on apply and groupby as well. internal methods can
handle however they want.

unutbu · 2013-10-15T20:28:08Z

Okay, I'll put the new argument at the end.

I'm a little hesitant about adding an rm option. Should a sort really be able to remove items?

Instead of handle_na='fill', how would you feel about a key=func parameter? Then instead of replacing NaNs with fill_value, we could use func to provide proxy values for NaNs: key=lambda x: fill_value if np.isnan(x) else x.

The change to the code may be minimal: something like

k = func(self[x].values)

jreback · 2013-10-15T20:30:35Z

@jtratner

I don't think we need to mess with the na_last (as far as having complicated behavior)
why is the current not adequate ?

jtratner · 2013-10-15T20:43:58Z

It seems awkward to me (na_position=last seems better or na_order=last). And if we're starting from the premise that we want to unify the API, would this apply flexibly to all the other sort methods? what about Series and Index sorting? And to confirm, na_last=False makes them first? Removes them? And I'm assuming it's also supposed to maintain relative position?

jtratner · 2013-10-15T20:45:16Z

Index doesn't have sort - so just Series and Panel. (groupby?) Assuming it applies to all.

jreback · 2013-10-15T20:47:41Z

this is just about sort ordering; you can't do a remove here; there are only 2 ways to do it, na_first or na_last (that said, you could do ti with a 'how' argument). Not sure we need to break the API here though

jtratner · 2013-10-15T21:00:05Z

Okay. na_last doesn't sound intuitive to me, but it seems like I'm in the
minority.
On Oct 15, 2013 4:47 PM, "jreback" notifications@github.com wrote:

this is just about sort ordering; you can't do a remove here; there are
only 2 ways to do it, na_first or na_last (that said, you could do ti with
a 'how' argument). Not sure we need to break the API here though

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/5231#issuecomment-26370801
.

unutbu · 2013-10-15T21:03:18Z

I like na_position better than na_last since I find na_last=False a bit confusing. But whatever you decide is fine with me. I'm going to hold off on adding a key=func parameter to a different pull request :)

jtratner · 2013-10-15T22:09:14Z

And you can just check first letter if you want (so accept last, l, first,
f etc)
On Oct 15, 2013 5:03 PM, "unutbu" notifications@github.com wrote:

I like na_position better than na_last since I find na_last=False a bit
confusing. I'm going to hold off on adding a key=func parameter to a
different pull request :)

—
Reply to this email directly or view it on GitHubhttps://github.com//pull/5231#issuecomment-26371995
.

unutbu · 2013-10-15T22:12:49Z

While I'm at it, would it be okay to change orders to ascending in groupby _lexsort_indexer?
Or maybe I shouldn't break the API...

jtratner · 2013-10-15T22:19:08Z

Also, I neglected to say earlier - thanks for submitting this PR!

jtratner · 2013-10-15T22:27:11Z

Not totally clear on what you mean. That said, we can't change existing
behavior without a warning (but can add keyword arguments with defaults to
change behavior)

unutbu · 2013-10-15T22:32:02Z

Not at all, my pleasure.

It seems to me the keyword ascending as used in DataFrame.sort has the same meaning as orders as used in groupby's _lexsort_indexer. If that is true, it would be nice if only one name were used for the concept.

jtratner · 2013-10-15T22:35:26Z

If you change the parameter, nice to change name of PR too :)

jreback · 2013-10-17T21:00:11Z

@unutbu can you add a release notes/v0.13.0 entry (1-liner or small example in 0.13 if you want)

under API changes

unutbu · 2013-10-17T21:24:19Z

I've found a bug in my code. With df defined as in GH3917, df.sort(['A'], na_position='first') does not place NaN at the beginning of the result.

jtratner · 2013-10-17T21:30:46Z

Glad you caught that! Still have some time to get this in for 0.13.

unutbu · 2013-10-17T21:49:29Z

pandas/core/frame.py

@@ -2632,34 +2638,22 @@ def sort_index(self, axis=0, by=None, ascending=True, inplace=False,
                raise ValueError('Length of ascending (%d) != length of by'
                                 ' (%d)' % (len(ascending), len(by)))

-            if len(by) > 1:


I removed the if len(by) > 1. Now, whenever the by parameter is used, the sorting is done by _lexsort_indexer. Before, argsort was being used when len(by) == 1. But argsort always puts NaNs at the end, so if we want to be able to place NaNs at the beginning, it seems better now to use _lexsort_indexer.

unutbu · 2013-10-17T22:08:19Z

Warning: My PR only affects how DataFrame.sort works when sorting columns with the by parameter. It doesn't currently address sorting by label. If NaN is a label, na_position does not affect the way it gets sorted...

unutbu · 2013-10-17T22:15:17Z

As I shift code away from argsort and onto _lexsort_indexer, the kind parameter is being used less and less. With my changes, kind is only used when sorting by label.

jtratner · 2013-10-17T22:15:37Z

okay, then you need to make the parameter None for now and then raise (NotImplementedError) if it is specified without a 'by' arg I guess. But that said - better if you supported that.

jreback · 2013-10-18T12:43:35Z

I like na_position...so in 0.14, then we will deprecate na_last for Series.order and change to na_position, yes? (aside from other fixups to Series.sort/order)

maybe want to add that warning now?

unutbu · 2013-10-19T08:55:55Z

@jreback I've added code to core/series.py to warn users that na_last is deprecated, and allow them to use the na_position parameter instead.

@jtratner I've tried to extend np.argsort by adding ascending and na_postion parameters. The extended function, _nargsort is in core/groupby. I used _nargsort in core/frame.py to allow DataFrame to sort a single column or label while respecting na_position.

unutbu · 2013-10-19T10:18:43Z

Hm. Travis-CI does not like the changes I've made. I'm setting up new virtualenvs to debug this...

jtratner · 2013-10-19T12:46:33Z

I think there's some issue with mergesort not being fully implemented in numpy 1.6 - maybe you changed a default without realizing it?

jtratner · 2013-10-19T12:48:05Z

This error here - https://travis-ci.org/unutbu/pandas/jobs/12751402 - "TypeError: unorderable types: str() > float()", is likely occurring because you didn't mask out the nan values before the sort.

jtratner · 2013-10-19T12:48:20Z

(and by you here I mean "pandas with this PR")

jreback · 2014-03-27T22:24:03Z

@unutbu awesome!

jreback · 2014-03-27T22:25:54Z

The deprecation removal notice is in #6581

jreback · 2014-03-27T22:34:40Z

@unutbu their may be a couple of FutureWarnings from na_last in the current test suite in test_groupby.py, can you do a followup to change the paremeter name (maybe elsewhere...just happened to notice that one)

jreback · 2014-03-27T22:46:32Z

@unutbu its Series.sort is passing na_last to order...i'll just make the change...pretty trivial

jreback · 2014-03-27T22:50:45Z

sorted 43baf9e

jreback · 2014-04-09T14:44:57Z

@unutbu realized that Series.sort does not propogate the na_position argument, no biggie...

bigger question I have is that sort defaults to quicksort and order defaults to mergesort probably because sort is a 'pass-thru' (or was) to np.sort which defaults that.

practically evey other sort is defaulting to quicksort any thoughts on this?

unutbu · 2014-04-09T18:34:02Z

@jreback: Thanks for fixing Series.sort. Making quicksort the default for all sorting methods makes sense to me.

jreback · 2014-04-09T18:42:16Z

ok.. makes sense

…andas-dev#8239 DEPR: remove of na_last from Series.order/Series.sort, xref pandas-dev#5231

unutbu reviewed Oct 17, 2013
View reviewed changes

jreback merged commit 87e1212 into pandas-dev:master Mar 27, 2014

jsexauer mentioned this pull request Mar 27, 2014

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

This was referenced Apr 3, 2014

DEPR: create issues for the current FutureWarnings in pandas #6641

Closed

Remove number of deprecated parameters/functions/classes [fix #6641] #6813

Merged

jreback mentioned this pull request Apr 9, 2014

API: default sort kind #6847

Closed

patricktokeeffe mentioned this pull request Sep 10, 2014

API: unify sort API #5190

Closed

jreback added a commit to jreback/pandas that referenced this pull request Aug 18, 2015

ENH/DEPR: add .sorted() method for API consistency, pandas-dev#9816, p…

13d2d71

…andas-dev#8239 DEPR: remove of na_last from Series.order/Series.sort, xref pandas-dev#5231

jreback mentioned this pull request Jul 24, 2016

DEPR: deprecations log for removed issues #13777

Closed

EHN/FIX: Add na_last parameter to DataFrame.sort. Fixes GH3917 #5231

EHN/FIX: Add na_last parameter to DataFrame.sort. Fixes GH3917 #5231

Conversation

unutbu commented Oct 15, 2013

jreback commented Oct 15, 2013

jtratner commented Oct 15, 2013

jreback commented Oct 15, 2013

jtratner commented Oct 15, 2013

unutbu commented Oct 15, 2013

jreback commented Oct 15, 2013

jtratner commented Oct 15, 2013

jtratner commented Oct 15, 2013

jreback commented Oct 15, 2013

jtratner commented Oct 15, 2013

unutbu commented Oct 15, 2013

jtratner commented Oct 15, 2013

unutbu commented Oct 15, 2013

jtratner commented Oct 15, 2013

jtratner commented Oct 15, 2013

unutbu commented Oct 15, 2013

jtratner commented Oct 15, 2013

jreback commented Oct 17, 2013

unutbu commented Oct 17, 2013

jtratner commented Oct 17, 2013

unutbu Oct 17, 2013

Choose a reason for hiding this comment

unutbu commented Oct 17, 2013

unutbu commented Oct 17, 2013

jtratner commented Oct 17, 2013

jreback commented Oct 18, 2013

unutbu commented Oct 19, 2013

unutbu commented Oct 19, 2013

jtratner commented Oct 19, 2013

jtratner commented Oct 19, 2013

jtratner commented Oct 19, 2013

jreback commented Mar 27, 2014

jreback commented Mar 27, 2014

jreback commented Mar 27, 2014

jreback commented Mar 27, 2014

jreback commented Mar 27, 2014

jreback commented Apr 9, 2014

unutbu commented Apr 9, 2014

jreback commented Apr 9, 2014