API: unify sort API #5190

jreback · 2013-10-12T13:33:57Z

related is #2094
related is #6847 (fixes kind and some arg ordering)
related is #7121 (make sortlevel a part of sort_index by adding level arg)

the sorting API is currently inconsistent and confusing. here is what exists:

Series:

sort: calls Series.order, in-place, defaults quicksort
order: do the sort on values, return a new object, defaults mergesort
sort_index: sort by labels, returns new object

Frame:

sort: calls sort_index
sort_index: sorts by the index with no args, otherwise a nested sort of the passed columns

The semantics are different between Series and DataFrame. In Series, sort mean in-place, order returns a new object. sort/order sort on the values, while sort_index sorts on the index. For a DataFrame, sort and sort_index are the same and sort on a column/list of columns; inplace is a keyword.

Proposed signature of combined methods. We need to break a Series API here. because sort is an in-place method which is quite inconsistent with everything else.

def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False,
                   kind='quicksort', na_last=True):

This is what I think we should do:

make Series.sort/order be the same.
by can take a column/list of columns (as it can now), or an index name / index to provide index sorting (which means sort by the specifiied axis)
default is inplace=False (which is the same as now, except for Series.sort).
Series.sort_index does s.sort('index')
DataFrame.sort_index does df.sort('index')
eventually deprecate Series.order
add DataFrame.sort_columns to perform axis=1 sorting

This does switch the argument to the current sort_index, (e.g. axis is currently first), but I think then allows more natural syntax

df.sort() or df.sort_index() or df.sort_index('index') sort on the index labels
df.sort(['A','B'],axis=1) sort on these columns (allow 'index' here as well to sort on the index too)
df.sort_columns() or df.sort('columns') sort on the column labels
df.sort_columns() defaults axis=1, so df.sort_columns(['A','B']) is equiv of - - df.sort(['A','B'],axis=1)
s.sort() sort on the values
s.sort('index') or s.sort_index() sort on the series index

The text was updated successfully, but these errors were encountered:

jtratner · 2013-10-12T13:53:09Z

What happens if you have columns named 'index' or 'columns'?

cancan101 · 2013-10-13T14:14:39Z

Is the version of mergesort used stable? Assuming yes there are times then to prefer merge over quick.

jreback · 2013-10-13T14:16:16Z

mergesort is stable. the point is that right now sort and order have different defaults for the sortkind.

cancan101 · 2013-10-13T14:20:08Z

Fwiw java uses different sorts for primitives vs objects : http://docs.oracle.com/javase/7/docs/api/java/util/Arrays.html#sort(int[], int, int)

Stability for objects speed for primitives.

cancan101 · 2013-10-13T14:20:55Z

My point being there might actually be a valid reason to gave different defaults.

jreback · 2013-10-13T15:02:06Z

There IS often a valid reason to have different sorts available, that's why there is a keyword, but its incorrect to have a routine which does the same thing have different defaults.

aside from the fact that java is NOT a good example of coding :)

these are all primitives which are being sorted

jtratner · 2013-10-28T01:31:02Z

For 0.13 - could we change Series.sort() to return a value or add NDFrame.order() that just passes to NDFrame.sort()? It would mean that you could do the same thing no matter what kind of NDFrame you had.

jreback · 2013-10-28T01:33:02Z

I think we ought to fix the API - but let's think thru
prob just leave 0.13 as is

jtratner · 2013-10-28T01:34:57Z

okay

ghost · 2014-01-06T03:45:20Z

You can't make quicksort the default, sort stability is a contract and you can't just break it.
if you want consistency you have to default to the stricter guarantee.

numpy arrays and series currently have compatible signatures for sort:

>>> a=np.ndarray([1,1])
>>> b=pd.Series([1,2])
>>> a.sort??
a.sort(axis=-1, kind='quicksort', order=None)
In [9]: b.sort??
Definition: b.sort(self, axis=0, kind='quicksort', order=None, ascending=True)

I find it hard to believe that's a coincidence or just an act of cargo-cult when the API was designed.
You may break more then you bargained for.

In general, more "consistency" achieved by breaking existing code is not automatically a good move.
You strive for consistency when designing an API but once 10,000 people develop against that API,
consisteny has to compete with stability.

Once an API is established, the useability argument is moot since more people are used to it then not.
That just leaves aesthetic considerations, easily trumped by stability I think.

There have been too many breaking changes in recent releases and we seem to treat
it more lightly as time passes. That's a bad trend IMO.
It's not convenient or sexy to keep an API stable, but it's responsible.

TL;DR: if a 1000 people need to modify working code to accomodate this change, is it still worth doing?

jtratner · 2014-01-06T03:48:55Z

@y-p okay, so let's use that signature everywhere instead

ghost · 2014-01-06T03:51:11Z

You think we should drop the axis argument from df.sort()? ok, sure.

jreback · 2014-01-06T03:53:03Z

quick sort is not the default, not sure where u see that

I don't think anything actually needs to change here except allowing inplace arguments

(which sort has inplace=True and order has inplace=False)

and IMHO just because numpy had setup an API doesn't mean it's what pandas should necessarily do - most of the time yes but by definition pandas does provide a richer environment

ghost · 2014-01-06T03:53:53Z

The suggested unified signature:

def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False,
                   kind='quicksort', na_last=True):

jreback · 2014-01-06T03:57:05Z

sorry I wrote what the default are!

I think they happen to be weird that sorting in place vs a copy has 2 different defaults

guess numpy defaults quick sort

ghost · 2014-01-06T03:57:27Z

Like I said, I don't think the reason was cargo-cult there must have been a stronger reason
for keeping series compatible with numpy, you may not discover it until you try to change it.
Or the users will for you.

ghost · 2014-01-06T03:59:25Z

The more detailed the argument the less effective it is I guess.

Will the proposed changes break a large chunk of existing code and if so, is it worth it?

patricktokeeffe · 2014-09-08T19:29:00Z

The sort options are a little confusing so I'm glad to see a unified API in the works. As of 0.14.1, the series sort_index method has a single keyword argument ascending while the data frame sort_index method additionally has inplace, na_position, and kind (all of which could apply to a series). ~~Since discovering this discrepancy, I've transitioned to plain sort and haven't looked back to sort_index.~~ (Spoke too soon.)

jreback · 2014-09-08T19:38:12Z

@patricktokeeffe unfortunately this won't be address in 0.15.0 (unless someone steps up).

Its not difficult, just a bit of testing / deprecation. Interested?

patricktokeeffe · 2014-09-08T19:45:26Z

Sounds interesting. Is there a branch in progress? Most of the discussion seems to be in this issue.

jreback · 2014-09-08T20:21:31Z

nope just this issue

feel free to poke around and see what you would change to make things consistent

would ideally be done in 0.15 if we are to break anything

patricktokeeffe · 2014-09-10T00:22:49Z

I do think Series.sort should be able to sort on the index but I don't think extending the column list argument is the right solution. @jtratner points out there could be a column named 'index'. And how would multi-indexes be handled?

How about this unified signature? It has:

new on_index keyword which accepts a boolean, index name or list of index names
new on_cols keyword instead of columns (existing df.sort keyword) or by (existing df.sort_index keyword, not intuitive IMO) which accepts column/list of columns
change from na_last to na_position (happened in EHN/FIX: Add na_last parameter to DataFrame.sort. Fixes GH3917 #5231)

def sort(self, on_cols=None, on_index=None, axis=0, level=None, ascending=True, 
         inplace=False, kind='quicksort', na_position='last'):

The current default behavior for Series.sort is on values but for DataFrame.sort it is on index. I think one of these should change for a unified API and I think it makes more sense to sort on index by default. However, I recognize compatibility needs so in the proposed signature, on_cols and on_index are both None, which would result in the current object-dependent behavior.

So the revised proposed changes are:

make Series.sort/order equivalent and deprecate Series.order
keep argument to take a column/list of columns but unify keyword to on_cols
add on_index argument which can take a boolean/index name/list of index names
make default inplace=False (which only changes Series.sort)
create DataFrame.sort_columns to perform df.sort(axis=1)

This still swaps the first two arguments of df.sort_index, but I don't understand how df.sort differs from df.sort_index right now anyway.

Regarding syntax:

for dataframes
- sort on index labels: df.sort() == df.sort(on_index=True) == df.sort_index()
- sort on columns: df.sort_columns() == df.sort(axis=1) == df.sort(axis=1,on_index=False)
- sort on columns A, B: df.sort(['A','B'], axis=1) == df.sort_columns(['A','B'])
- sort on index & columns A, B: df.sort(['A','B'], axis=1, on_index=True) == df.sort_columns(['A','B'], on_index=True)
for series
- sort on values: s.sort() == s.sort(on_index=False)
- sort on index labels: s.sort(on_index=True) == s.sort_index()

There is still refinement to do here. I haven't used MultiIndexes, for example, so I don't know quite they should be handled. Probably like they are in the columns/by keywords of df.sort/sort_index but passed to on_index instead?

patricktokeeffe · 2014-09-10T18:54:20Z

Per discussion in #7121, level sorting can be folded into sort with the new level keyword. It's already in the proposed unified signature with default value None.

proposed changes
- add level argument which accepts integer or level name (improves on df.sortlevel which docs say only takes integers) -- perhaps on_level is more consistent with proposed signature
- add sort_remaining keyword (not in proposed signature but present in df.sortlevel)
- create sort_level for naming convention consistency and deprecate sortlevel?
syntax:
- df.sort(level=0) == df.sortlevel(0) == df.sort_level(0)
- s.sort(level=0 == s.sortlevel(0) == s.sort_level(0)

jreback · 2014-09-10T18:55:45Z

can you post signature for new sort?

patricktokeeffe · 2014-09-10T18:57:17Z

current proposal:

def sort(self, on_cols=None, on_index=None, axis=0, level=None, ascending=True, 
         inplace=False, kind='quicksort', na_position='last', sort_remaining=None):

jreback · 2014-09-10T19:03:34Z

why do you need on_index/on_cols isn' the combination of by,level,axis enough?

by=columns_to_sort,axis=1 -> sort by the 'by' columns
level=0 sort by the index (e.g. sort_index)
level=0,axis=1 -> sort the columns (e.g. sort_columns)

patricktokeeffe · 2014-09-10T20:03:28Z

I introduced on_index because of the ambiguity of sort(by='index'). IIUC, sort(level=0) would be an equivalent expression.

Would users ever sort on columns and index? Like: sort(by=cols_to_sort, level=0)?

jreback · 2014-09-10T20:07:31Z

That last is possible if you wanted to sort by a particular index value (and reorder the columns based on that since you are specifying level=0 and axis=0)

jreback · 2014-09-10T20:08:36Z

@patricktokeeffe can you post a complete proposal (e.g. signature / doc string), functions to add/change, and and API changes to expect. I will upate the top of this PR with it.

patricktokeeffe · 2014-09-10T20:30:35Z

OK. revised proposed signature (complete proposal forthcoming...):

def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False,
         kind='quicksort', na_position='last', sort_remaining=None):

The sort_remaining keyword comes from Series/DataFrame.sortlevel.

patricktokeeffe · 2014-09-11T00:30:56Z

For reference, the 0.14.1 signatures are:

Series.sort(axis=0, ascending=True, kind='quicksort', na_position='last', inplace=True)
Series.sort_index(ascending=True)
Series.sortlevel(level=0, ascending=True, sort_remaining=True)

DataFrame.sort(columns=None, axis=0, ascending=True, inplace=False, kind='quicksort', 
               na_position='last')
DataFrame.sort_index(axis=0, by=None, ascending=True, inplace=False, kind='quicksort', 
                     na_position='last')
DataFrame.sortlevel(level=0, axis=0, ascending=True, inplace=False, sort_remaining=True)

Proposed unified signature for Series.sort and DataFrame.sort:

# maybe axis first, then by?
def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False, 
         kind='quicksort', na_position='last', sort_remaining=True):
         """Sort by labels (along either axis), by the values in column(s) or both.

         If both, labels take precedence over columns. If neither is specified,
         behavior is object-dependent: series = on values, dataframe = on index.

         Parameters
         ----------
         by : column name or list of column names
             if not None, sort on values in specified column name; perform nested
             sort if list of column names specified. this argument ignored by series
         axis : {0, 1}
             sort index/rows (0) or columns (1); for Series, only 0 can be specified
         level : int or level name or list of ints or list of column names
             if not None, sort on values in specified index level(s)
         ascending : bool or list of bool
             Sort ascending vs. descending. Specify list for multiple sort orders.
         inplace : bool
             if True, perform operation in-place (without creating new instance)
         kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}
             Choice of sorting algorithm. See np.sort for more information. 
             ‘mergesort’ is the only stable algorithm. For data frames, this option is 
             only applied when sorting on a single column or label.
         na_position : {'first', 'last'}
             ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end
         sort_remaining : bool
             if true and sorting by level and index is multilevel, sort by other levels
             too (in order) after sorting by specified level
         """

The sort_index signatures change too and sort_columns is created:

Series.sort_index(level=0, ascending=True, inplace=False, kind='quicksort', 
                  na_position='last', sort_remaining=True)
DataFrame.sort_index(level=0, axis=0, by=None, ascending=True, inplace=False, 
                     kind='quicksort', na_position='last', sort_remaining=True) 
                     # by is DEPRECATED, see change 7

DataFrame.sort_columns(by=None, level=0, ascending=True, inplace=False, 
                       kind='quicksort', na_position='last', sort_remaining=True)
                       # or maybe level=None

Proposed changes:

make inplace=False default (changes Series.sort)
new by argument to accept column-name/list-of-column-names in first (or second?) position
- deprecate columns keyword of DataFrame.sort, replaced with by (df.sort signature would need to retain columns keyword until finally removed but it's not shown in proposal)
- don't allow tuples to access levels of multi-index (columns arg of DataFrame.sort allows tuples); use new level argument instead
- don't swap order of by/axis in DataFrame.sort_index (see change 7)
- this argument is ignored by series which is odd considering it's in first position. perhaps it should go in second position with axis first
new level argument to accept integer/level-name/list-of-ints/list-of-level-names for sorting (multi)index by particular level(s)
- replaces tuple behavior of columns arg of DataFrame.sort
- add level argument to sort_index in first position so level(s) of multilevel index can be specified; this makes sort_index==sortlevel (see change 8)
- also adds sort_remaining arg to handle multi-level indexes
new method DataFrame.sort_columns==sort(axis=1) (see syntax below)
deprecate Series.order since change 1 makes Series.sort equivalent
add inplace, kind, and na_position arguments to Series.sort_index (to match DataFrame.sort_index); by and axis args are not added since they don't make sense for series
deprecate and eventually remove by argument from DataFrame.sort_index since it makes sort_index equivalent to sort
deprecate sortlevel since change 3b makes sort_index equivalent

Notes:

default behavior of sort is still object-dependent: for series, sorts by values and for data frames, sorts by index
new level arg makes sort_index and sortlevel equivalent. if sortlevel is retained:
- should rename sortlevel to sort_level for naming conventions
- Series.sortlevel should have inplace argument added
- maybe don't add level and sort_remaining args to sort_index so it's not equivalent to sort_level (intentionally limiting sort_index seems like a bad idea though)
it's unclear if default should be level=None for sort_columns. probably not since level=None falls back to level=0 anyway
since Series.sort ignores by argument, maybe axis should go in first position

Syntax:

dataframes
- sort() == sort(level=0) == sort_index() == sortlevel()
  - without columns or level specified, defaults to current behavior of sort on index
- sort(['A','B'])
  - since columns are specified, default index sort should not occur; sorting only happens using columns 'A' and 'B'
- sort(level='spam') == sort_index('spam') == sortlevel('spam')
  - sort occurs on row index named 'spam' or level of multi-index named 'spam'
- sort(['A','B'], level='spam')
  - level controls here even though columns are specified so sort happens along row index named 'spam' first, then nested sort occurs using columns 'A' and 'B'
- sort(axis=1) == sort(axis=1, level=0) == sort_columns()
  - since data frames default to sort on index, leaving level=None is the same as level=0
- sort(['A','B'], axis=1) == sort_columns(['A','B'])
  - as with preceding example, level=None becomes level=0 in sort_columns
- sort(['A','B'], axis=1, level='spam') == sort_columns(['A','B'], level='spam')
  - axis controls level so sort will be on columns named 'A' and 'B' in column index named 'spam'
series:
- sort() == order() -- sorts on values
- with level specified, sorts on index/named index/level of multi-index:
  - sort(level=0) == sort_index() == sortlevel()
  - sort(level='spam') == sort_index('spam') == sortlevel('spam')

This is getting pretty hairy... Please point out contradictions/logic flaws/etc.

jreback · 2014-09-11T00:33:14Z

I am going to close this issue, and have you open a new one (with your text from above, just copy-past).
SO that you can edit it!.

This was referenced Oct 14, 2013

CLN: 0.14 NDFrame cleaning #5044

Closed

EHN/FIX: Add na_last parameter to DataFrame.sort. Fixes GH3917 #5231

Merged

jreback mentioned this issue Jan 5, 2014

Should series have an is_view method? #5856

Closed

This was referenced Jan 29, 2014

ENH partial sorting for mi in sortlevel #6135

Merged

Sorting by column, API improvements possible with SO example? #2094

Closed

jreback modified the milestones: 0.15.0, 0.14.0 Apr 9, 2014

jreback mentioned this issue May 13, 2014

df.sortlevel() vs df.sort_index() #7121

Closed

patricktokeeffe mentioned this issue Sep 8, 2014

View returned when using squeeze=True with read_csv? #8217

Closed

jreback closed this as completed Sep 11, 2014

patricktokeeffe mentioned this issue Sep 11, 2014

API: unified sorting #8239

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: unify sort API #5190

API: unify sort API #5190

jreback commented Oct 12, 2013

jtratner commented Oct 12, 2013

cancan101 commented Oct 13, 2013

jreback commented Oct 13, 2013

cancan101 commented Oct 13, 2013

cancan101 commented Oct 13, 2013

jreback commented Oct 13, 2013

jtratner commented Oct 28, 2013

jreback commented Oct 28, 2013

jtratner commented Oct 28, 2013

ghost commented Jan 6, 2014

jtratner commented Jan 6, 2014

ghost commented Jan 6, 2014

jreback commented Jan 6, 2014

ghost commented Jan 6, 2014

jreback commented Jan 6, 2014

ghost commented Jan 6, 2014

ghost commented Jan 6, 2014

patricktokeeffe commented Sep 8, 2014

jreback commented Sep 8, 2014

patricktokeeffe commented Sep 8, 2014

jreback commented Sep 8, 2014

patricktokeeffe commented Sep 10, 2014

patricktokeeffe commented Sep 10, 2014

jreback commented Sep 10, 2014

patricktokeeffe commented Sep 10, 2014

jreback commented Sep 10, 2014

patricktokeeffe commented Sep 10, 2014

jreback commented Sep 10, 2014

jreback commented Sep 10, 2014

patricktokeeffe commented Sep 10, 2014

patricktokeeffe commented Sep 11, 2014

jreback commented Sep 11, 2014

API: unify sort API #5190

API: unify sort API #5190

Comments

jreback commented Oct 12, 2013

jtratner commented Oct 12, 2013

cancan101 commented Oct 13, 2013

jreback commented Oct 13, 2013

cancan101 commented Oct 13, 2013

cancan101 commented Oct 13, 2013

jreback commented Oct 13, 2013

jtratner commented Oct 28, 2013

jreback commented Oct 28, 2013

jtratner commented Oct 28, 2013

ghost commented Jan 6, 2014

jtratner commented Jan 6, 2014

ghost commented Jan 6, 2014

jreback commented Jan 6, 2014

ghost commented Jan 6, 2014

jreback commented Jan 6, 2014

ghost commented Jan 6, 2014

ghost commented Jan 6, 2014

patricktokeeffe commented Sep 8, 2014

jreback commented Sep 8, 2014

patricktokeeffe commented Sep 8, 2014

jreback commented Sep 8, 2014

patricktokeeffe commented Sep 10, 2014

patricktokeeffe commented Sep 10, 2014

jreback commented Sep 10, 2014

patricktokeeffe commented Sep 10, 2014

jreback commented Sep 10, 2014

patricktokeeffe commented Sep 10, 2014

jreback commented Sep 10, 2014

jreback commented Sep 10, 2014

patricktokeeffe commented Sep 10, 2014

patricktokeeffe commented Sep 11, 2014

jreback commented Sep 11, 2014