Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: unify sort API #5190

Closed
jreback opened this issue Oct 12, 2013 · 32 comments
Closed

API: unify sort API #5190

jreback opened this issue Oct 12, 2013 · 32 comments
Labels
API Design Numeric Operations Arithmetic, Comparison, and Logical operations
Milestone

Comments

@jreback
Copy link
Contributor

jreback commented Oct 12, 2013

related is #2094
related is #6847 (fixes kind and some arg ordering)
related is #7121 (make sortlevel a part of sort_index by adding level arg)

the sorting API is currently inconsistent and confusing. here is what exists:

Series:

  • sort: calls Series.order, in-place, defaults quicksort
  • order: do the sort on values, return a new object, defaults mergesort
  • sort_index: sort by labels, returns new object

Frame:

  • sort: calls sort_index
  • sort_index: sorts by the index with no args, otherwise a nested sort of the passed columns

The semantics are different between Series and DataFrame. In Series, sort mean in-place, order returns a new object. sort/order sort on the values, while sort_index sorts on the index. For a DataFrame, sort and sort_index are the same and sort on a column/list of columns; inplace is a keyword.

Proposed signature of combined methods. We need to break a Series API here. because sort is an in-place method which is quite inconsistent with everything else.

def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False,
                   kind='quicksort', na_last=True):

This is what I think we should do:

  • make Series.sort/order be the same.
  • by can take a column/list of columns (as it can now), or an index name / index to provide index sorting (which means sort by the specifiied axis)
  • default is inplace=False (which is the same as now, except for Series.sort).
  • Series.sort_index does s.sort('index')
  • DataFrame.sort_index does df.sort('index')
  • eventually deprecate Series.order
  • add DataFrame.sort_columns to perform axis=1 sorting

This does switch the argument to the current sort_index, (e.g. axis is currently first), but I think then allows more natural syntax

  • df.sort() or df.sort_index() or df.sort_index('index') sort on the index labels
  • df.sort(['A','B'],axis=1) sort on these columns (allow 'index' here as well to sort on the index too)
  • df.sort_columns() or df.sort('columns') sort on the column labels
  • df.sort_columns() defaults axis=1, so df.sort_columns(['A','B']) is equiv of - - df.sort(['A','B'],axis=1)
  • s.sort() sort on the values
  • s.sort('index') or s.sort_index() sort on the series index
@jtratner
Copy link
Contributor

What happens if you have columns named 'index' or 'columns'?

@cancan101
Copy link
Contributor

Is the version of mergesort used stable? Assuming yes there are times then to prefer merge over quick.

@jreback
Copy link
Contributor Author

jreback commented Oct 13, 2013

mergesort is stable. the point is that right now sort and order have different defaults for the sortkind.

@cancan101
Copy link
Contributor

Fwiw java uses different sorts for primitives vs objects : http://docs.oracle.com/javase/7/docs/api/java/util/Arrays.html#sort(int[], int, int)

Stability for objects speed for primitives.

@cancan101
Copy link
Contributor

My point being there might actually be a valid reason to gave different defaults.

@jreback
Copy link
Contributor Author

jreback commented Oct 13, 2013

There IS often a valid reason to have different sorts available, that's why there is a keyword, but its incorrect to have a routine which does the same thing have different defaults.

aside from the fact that java is NOT a good example of coding :)

these are all primitives which are being sorted

@jtratner
Copy link
Contributor

For 0.13 - could we change Series.sort() to return a value or add NDFrame.order() that just passes to NDFrame.sort()? It would mean that you could do the same thing no matter what kind of NDFrame you had.

@jreback
Copy link
Contributor Author

jreback commented Oct 28, 2013

I think we ought to fix the API - but let's think thru
prob just leave 0.13 as is

@jtratner
Copy link
Contributor

okay

@ghost
Copy link

ghost commented Jan 6, 2014

You can't make quicksort the default, sort stability is a contract and you can't just break it.
if you want consistency you have to default to the stricter guarantee.

numpy arrays and series currently have compatible signatures for sort:

>>> a=np.ndarray([1,1])
>>> b=pd.Series([1,2])
>>> a.sort??
a.sort(axis=-1, kind='quicksort', order=None)
In [9]: b.sort??
Definition: b.sort(self, axis=0, kind='quicksort', order=None, ascending=True)

I find it hard to believe that's a coincidence or just an act of cargo-cult when the API was designed.
You may break more then you bargained for.

In general, more "consistency" achieved by breaking existing code is not automatically a good move.
You strive for consistency when designing an API but once 10,000 people develop against that API,
consisteny has to compete with stability.

Once an API is established, the useability argument is moot since more people are used to it then not.
That just leaves aesthetic considerations, easily trumped by stability I think.

There have been too many breaking changes in recent releases and we seem to treat
it more lightly as time passes. That's a bad trend IMO.
It's not convenient or sexy to keep an API stable, but it's responsible.

TL;DR: if a 1000 people need to modify working code to accomodate this change, is it still worth doing?

@jtratner
Copy link
Contributor

jtratner commented Jan 6, 2014

@y-p okay, so let's use that signature everywhere instead

@ghost
Copy link

ghost commented Jan 6, 2014

You think we should drop the axis argument from df.sort()? ok, sure.

@jreback
Copy link
Contributor Author

jreback commented Jan 6, 2014

quick sort is not the default, not sure where u see that

I don't think anything actually needs to change here except allowing inplace arguments

(which sort has inplace=True and order has inplace=False)

and IMHO just because numpy had setup an API doesn't mean it's what pandas should necessarily do - most of the time yes but by definition pandas does provide a richer environment

@ghost
Copy link

ghost commented Jan 6, 2014

The suggested unified signature:

def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False,
                   kind='quicksort', na_last=True):

@jreback
Copy link
Contributor Author

jreback commented Jan 6, 2014

sorry I wrote what the default are!

I think they happen to be weird that sorting in place vs a copy has 2 different defaults

guess numpy defaults quick sort

@ghost
Copy link

ghost commented Jan 6, 2014

Like I said, I don't think the reason was cargo-cult there must have been a stronger reason
for keeping series compatible with numpy, you may not discover it until you try to change it.
Or the users will for you.

@ghost
Copy link

ghost commented Jan 6, 2014

The more detailed the argument the less effective it is I guess.

Will the proposed changes break a large chunk of existing code and if so, is it worth it?

@patricktokeeffe
Copy link
Contributor

The sort options are a little confusing so I'm glad to see a unified API in the works. As of 0.14.1, the series sort_index method has a single keyword argument ascending while the data frame sort_index method additionally has inplace, na_position, and kind (all of which could apply to a series). Since discovering this discrepancy, I've transitioned to plain sort and haven't looked back to sort_index. (Spoke too soon.)

@jreback
Copy link
Contributor Author

jreback commented Sep 8, 2014

@patricktokeeffe unfortunately this won't be address in 0.15.0 (unless someone steps up).

Its not difficult, just a bit of testing / deprecation. Interested?

@patricktokeeffe
Copy link
Contributor

Sounds interesting. Is there a branch in progress? Most of the discussion seems to be in this issue.

@jreback
Copy link
Contributor Author

jreback commented Sep 8, 2014

nope just this issue

feel free to poke around and see what you would change to make things consistent

would ideally be done in 0.15 if we are to break anything

@patricktokeeffe
Copy link
Contributor

I do think Series.sort should be able to sort on the index but I don't think extending the column list argument is the right solution. @jtratner points out there could be a column named 'index'. And how would multi-indexes be handled?

How about this unified signature? It has:

  • new on_index keyword which accepts a boolean, index name or list of index names
  • new on_cols keyword instead of columns (existing df.sort keyword) or by (existing df.sort_index keyword, not intuitive IMO) which accepts column/list of columns
  • change from na_last to na_position (happened in EHN/FIX: Add na_last parameter to DataFrame.sort. Fixes GH3917 #5231)
def sort(self, on_cols=None, on_index=None, axis=0, level=None, ascending=True, 
         inplace=False, kind='quicksort', na_position='last'):

The current default behavior for Series.sort is on values but for DataFrame.sort it is on index. I think one of these should change for a unified API and I think it makes more sense to sort on index by default. However, I recognize compatibility needs so in the proposed signature, on_cols and on_index are both None, which would result in the current object-dependent behavior.

So the revised proposed changes are:

  • make Series.sort/order equivalent and deprecate Series.order
  • keep argument to take a column/list of columns but unify keyword to on_cols
  • add on_index argument which can take a boolean/index name/list of index names
  • make default inplace=False (which only changes Series.sort)
  • create DataFrame.sort_columns to perform df.sort(axis=1)

This still swaps the first two arguments of df.sort_index, but I don't understand how df.sort differs from df.sort_index right now anyway.

Regarding syntax:

  • for dataframes
    • sort on index labels: df.sort() == df.sort(on_index=True) == df.sort_index()
    • sort on columns: df.sort_columns() == df.sort(axis=1) == df.sort(axis=1,on_index=False)
    • sort on columns A, B: df.sort(['A','B'], axis=1) == df.sort_columns(['A','B'])
    • sort on index & columns A, B: df.sort(['A','B'], axis=1, on_index=True) == df.sort_columns(['A','B'], on_index=True)
  • for series
    • sort on values: s.sort() == s.sort(on_index=False)
    • sort on index labels: s.sort(on_index=True) == s.sort_index()

There is still refinement to do here. I haven't used MultiIndexes, for example, so I don't know quite they should be handled. Probably like they are in the columns/by keywords of df.sort/sort_index but passed to on_index instead?

@patricktokeeffe
Copy link
Contributor

Per discussion in #7121, level sorting can be folded into sort with the new level keyword. It's already in the proposed unified signature with default value None.

  • proposed changes
    • add level argument which accepts integer or level name (improves on df.sortlevel which docs say only takes integers) -- perhaps on_level is more consistent with proposed signature
    • add sort_remaining keyword (not in proposed signature but present in df.sortlevel)
    • create sort_level for naming convention consistency and deprecate sortlevel?
  • syntax:
    • df.sort(level=0) == df.sortlevel(0) == df.sort_level(0)
    • s.sort(level=0 == s.sortlevel(0) == s.sort_level(0)

@jreback
Copy link
Contributor Author

jreback commented Sep 10, 2014

can you post signature for new sort?

@patricktokeeffe
Copy link
Contributor

current proposal:

def sort(self, on_cols=None, on_index=None, axis=0, level=None, ascending=True, 
         inplace=False, kind='quicksort', na_position='last', sort_remaining=None):

@jreback
Copy link
Contributor Author

jreback commented Sep 10, 2014

why do you need on_index/on_cols isn' the combination of by,level,axis enough?

by=columns_to_sort,axis=1 -> sort by the 'by' columns
level=0 sort by the index (e.g. sort_index)
level=0,axis=1 -> sort the columns (e.g. sort_columns)

@patricktokeeffe
Copy link
Contributor

I introduced on_index because of the ambiguity of sort(by='index'). IIUC, sort(level=0) would be an equivalent expression.

Would users ever sort on columns and index? Like: sort(by=cols_to_sort, level=0)?

@jreback
Copy link
Contributor Author

jreback commented Sep 10, 2014

That last is possible if you wanted to sort by a particular index value (and reorder the columns based on that since you are specifying level=0 and axis=0)

@jreback
Copy link
Contributor Author

jreback commented Sep 10, 2014

@patricktokeeffe can you post a complete proposal (e.g. signature / doc string), functions to add/change, and and API changes to expect. I will upate the top of this PR with it.

@patricktokeeffe
Copy link
Contributor

OK. revised proposed signature (complete proposal forthcoming...):

def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False,
         kind='quicksort', na_position='last', sort_remaining=None):

The sort_remaining keyword comes from Series/DataFrame.sortlevel.

@patricktokeeffe
Copy link
Contributor

For reference, the 0.14.1 signatures are:

Series.sort(axis=0, ascending=True, kind='quicksort', na_position='last', inplace=True)
Series.sort_index(ascending=True)
Series.sortlevel(level=0, ascending=True, sort_remaining=True)

DataFrame.sort(columns=None, axis=0, ascending=True, inplace=False, kind='quicksort', 
               na_position='last')
DataFrame.sort_index(axis=0, by=None, ascending=True, inplace=False, kind='quicksort', 
                     na_position='last')
DataFrame.sortlevel(level=0, axis=0, ascending=True, inplace=False, sort_remaining=True)

Proposed unified signature for Series.sort and DataFrame.sort:

# maybe axis first, then by?
def sort(self, by=None, axis=0, level=None, ascending=True, inplace=False, 
         kind='quicksort', na_position='last', sort_remaining=True):
         """Sort by labels (along either axis), by the values in column(s) or both.

         If both, labels take precedence over columns. If neither is specified,
         behavior is object-dependent: series = on values, dataframe = on index.

         Parameters
         ----------
         by : column name or list of column names
             if not None, sort on values in specified column name; perform nested
             sort if list of column names specified. this argument ignored by series
         axis : {0, 1}
             sort index/rows (0) or columns (1); for Series, only 0 can be specified
         level : int or level name or list of ints or list of column names
             if not None, sort on values in specified index level(s)
         ascending : bool or list of bool
             Sort ascending vs. descending. Specify list for multiple sort orders.
         inplace : bool
             if True, perform operation in-place (without creating new instance)
         kind : {‘quicksort’, ‘mergesort’, ‘heapsort’}
             Choice of sorting algorithm. See np.sort for more information. 
             ‘mergesort’ is the only stable algorithm. For data frames, this option is 
             only applied when sorting on a single column or label.
         na_position : {'first', 'last'}
             ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end
         sort_remaining : bool
             if true and sorting by level and index is multilevel, sort by other levels
             too (in order) after sorting by specified level
         """

The sort_index signatures change too and sort_columns is created:

Series.sort_index(level=0, ascending=True, inplace=False, kind='quicksort', 
                  na_position='last', sort_remaining=True)
DataFrame.sort_index(level=0, axis=0, by=None, ascending=True, inplace=False, 
                     kind='quicksort', na_position='last', sort_remaining=True) 
                     # by is DEPRECATED, see change 7

DataFrame.sort_columns(by=None, level=0, ascending=True, inplace=False, 
                       kind='quicksort', na_position='last', sort_remaining=True)
                       # or maybe level=None

Proposed changes:

  1. make inplace=False default (changes Series.sort)
  2. new by argument to accept column-name/list-of-column-names in first (or second?) position
    • deprecate columns keyword of DataFrame.sort, replaced with by (df.sort signature would need to retain columns keyword until finally removed but it's not shown in proposal)
    • don't allow tuples to access levels of multi-index (columns arg of DataFrame.sort allows tuples); use new level argument instead
    • don't swap order of by/axis in DataFrame.sort_index (see change 7)
    • this argument is ignored by series which is odd considering it's in first position. perhaps it should go in second position with axis first
  3. new level argument to accept integer/level-name/list-of-ints/list-of-level-names for sorting (multi)index by particular level(s)
    • replaces tuple behavior of columns arg of DataFrame.sort
    • add level argument to sort_index in first position so level(s) of multilevel index can be specified; this makes sort_index==sortlevel (see change 8)
    • also adds sort_remaining arg to handle multi-level indexes
  4. new method DataFrame.sort_columns==sort(axis=1) (see syntax below)
  5. deprecate Series.order since change 1 makes Series.sort equivalent
  6. add inplace, kind, and na_position arguments to Series.sort_index (to match DataFrame.sort_index); by and axis args are not added since they don't make sense for series
  7. deprecate and eventually remove by argument from DataFrame.sort_index since it makes sort_index equivalent to sort
  8. deprecate sortlevel since change 3b makes sort_index equivalent

Notes:

  • default behavior of sort is still object-dependent: for series, sorts by values and for data frames, sorts by index
  • new level arg makes sort_index and sortlevel equivalent. if sortlevel is retained:
    • should rename sortlevel to sort_level for naming conventions
    • Series.sortlevel should have inplace argument added
    • maybe don't add level and sort_remaining args to sort_index so it's not equivalent to sort_level (intentionally limiting sort_index seems like a bad idea though)
  • it's unclear if default should be level=None for sort_columns. probably not since level=None falls back to level=0 anyway
  • since Series.sort ignores by argument, maybe axis should go in first position

Syntax:

  • dataframes
    • sort() == sort(level=0) == sort_index() == sortlevel()
      • without columns or level specified, defaults to current behavior of sort on index
    • sort(['A','B'])
      • since columns are specified, default index sort should not occur; sorting only happens using columns 'A' and 'B'
    • sort(level='spam') == sort_index('spam') == sortlevel('spam')
      • sort occurs on row index named 'spam' or level of multi-index named 'spam'
    • sort(['A','B'], level='spam')
      • level controls here even though columns are specified so sort happens along row index named 'spam' first, then nested sort occurs using columns 'A' and 'B'
    • sort(axis=1) == sort(axis=1, level=0) == sort_columns()
      • since data frames default to sort on index, leaving level=None is the same as level=0
    • sort(['A','B'], axis=1) == sort_columns(['A','B'])
      • as with preceding example, level=None becomes level=0 in sort_columns
    • sort(['A','B'], axis=1, level='spam') == sort_columns(['A','B'], level='spam')
      • axis controls level so sort will be on columns named 'A' and 'B' in column index named 'spam'
  • series:
    • sort() == order() -- sorts on values
    • with level specified, sorts on index/named index/level of multi-index:
      • sort(level=0) == sort_index() == sortlevel()
      • sort(level='spam') == sort_index('spam') == sortlevel('spam')

This is getting pretty hairy... Please point out contradictions/logic flaws/etc.

@jreback
Copy link
Contributor Author

jreback commented Sep 11, 2014

I am going to close this issue, and have you open a new one (with your text from above, just copy-past).
SO that you can edit it!.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Numeric Operations Arithmetic, Comparison, and Logical operations
Projects
None yet
Development

No branches or pull requests

4 participants