New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add key to sorting functions #3942

Open
hayd opened this Issue Jun 18, 2013 · 17 comments

Comments

Projects
None yet
6 participants
@hayd
Contributor

hayd commented Jun 18, 2013

Many python functions (sorting, max/min) accept a key argument, perhaps they could in pandas too.

.

The terrible motivating example was this awful hack from this question.... for which maybe one could do

df.sort_index(key=lambda t: literal_eval(t[1:-1]))

This would still be an awful awful hack, but a slightly less awful one.

@SethMMorton

This comment has been minimized.

Show comment
Hide comment
@SethMMorton

SethMMorton Apr 11, 2015

Here's a specific use case that came up on StackOverflow recently.

from pandas import DataFrame
df = DataFrame({'a':range(5)},index=['0hr', '128hr', '72hr', '48hr', '96hr'])
print(df.sort())

This returns

       a
0hr    0
128hr  1
48hr   3
72hr   2
96hr   4

The user would have liked to sort the index "naturally" using natsort, i.e.:

from natsort import natsort_keygen
print(df.sort(key=natsort_geygen()))

which would have returned

       a
0hr    0
48hr   3
72hr   2
96hr   4
128hr  1

SethMMorton commented Apr 11, 2015

Here's a specific use case that came up on StackOverflow recently.

from pandas import DataFrame
df = DataFrame({'a':range(5)},index=['0hr', '128hr', '72hr', '48hr', '96hr'])
print(df.sort())

This returns

       a
0hr    0
128hr  1
48hr   3
72hr   2
96hr   4

The user would have liked to sort the index "naturally" using natsort, i.e.:

from natsort import natsort_keygen
print(df.sort(key=natsort_geygen()))

which would have returned

       a
0hr    0
48hr   3
72hr   2
96hr   4
128hr  1
@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Apr 11, 2015

Contributor

Here's the way to do this (further you could use a CateogricalIndex to have custom sorting).

In [67]: df = DataFrame({'a':range(5)},index=pd.to_timedelta(['0h', '128h', '72h', '48h', '96h']))

In [68]: df
Out[68]: 
                 a
0 days 00:00:00  0
5 days 08:00:00  1
3 days 00:00:00  2
2 days 00:00:00  3
4 days 00:00:00  4

In [69]: df.sort_index()
Out[69]: 
                 a
0 days 00:00:00  0
2 days 00:00:00  3
3 days 00:00:00  2
4 days 00:00:00  4
5 days 08:00:00  1
Contributor

jreback commented Apr 11, 2015

Here's the way to do this (further you could use a CateogricalIndex to have custom sorting).

In [67]: df = DataFrame({'a':range(5)},index=pd.to_timedelta(['0h', '128h', '72h', '48h', '96h']))

In [68]: df
Out[68]: 
                 a
0 days 00:00:00  0
5 days 08:00:00  1
3 days 00:00:00  2
2 days 00:00:00  3
4 days 00:00:00  4

In [69]: df.sort_index()
Out[69]: 
                 a
0 days 00:00:00  0
2 days 00:00:00  3
3 days 00:00:00  2
4 days 00:00:00  4
5 days 08:00:00  1
@SethMMorton

This comment has been minimized.

Show comment
Hide comment
@SethMMorton

SethMMorton Apr 11, 2015

Agreed, in this specific example they could use timedeltas. But what if the index labels were something like "version1", "version2" ... "version10", etc. or something else that does not directly correspond with a builtin, sortable datatype? I think I'm looking for a more general solution.

I'm not really sure how to use CategoricalIndex to do custom sorting... I'm getting a bit lost in the docs. Is that considered by the pandas community to be the correct way to do custom sorting?

SethMMorton commented Apr 11, 2015

Agreed, in this specific example they could use timedeltas. But what if the index labels were something like "version1", "version2" ... "version10", etc. or something else that does not directly correspond with a builtin, sortable datatype? I think I'm looking for a more general solution.

I'm not really sure how to use CategoricalIndex to do custom sorting... I'm getting a bit lost in the docs. Is that considered by the pandas community to be the correct way to do custom sorting?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Apr 11, 2015

Contributor

see #9741 (big, long), but short answer is yes.

Contributor

jreback commented Apr 11, 2015

see #9741 (big, long), but short answer is yes.

@SethMMorton

This comment has been minimized.

Show comment
Hide comment
@SethMMorton

SethMMorton Apr 11, 2015

If I understand properly, in order for Categorical to work properly the sorting order for all elements must be defined at the time of creation of the DataFrame. What if additional elements are added that were not present in the initial definition?

Based on your suggestions, I am getting the impression that key will not be added to sort. Am I reading between the lines correctly?

SethMMorton commented Apr 11, 2015

If I understand properly, in order for Categorical to work properly the sorting order for all elements must be defined at the time of creation of the DataFrame. What if additional elements are added that were not present in the initial definition?

Based on your suggestions, I am getting the impression that key will not be added to sort. Am I reading between the lines correctly?

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Apr 11, 2015

Contributor

@SethMMorton I think we could add key to .sorted (pro the name of the new method to handle all sorting). Just takes some work.

In categoricals, you can add things later if you want.(and reorder to change the sorting order).

Contributor

jreback commented Apr 11, 2015

@SethMMorton I think we could add key to .sorted (pro the name of the new method to handle all sorting). Just takes some work.

In categoricals, you can add things later if you want.(and reorder to change the sorting order).

@SethMMorton

This comment has been minimized.

Show comment
Hide comment
@SethMMorton

SethMMorton Apr 11, 2015

Ok. Thanks for your time. I'll take a good, deep look at categoricals.

SethMMorton commented Apr 11, 2015

Ok. Thanks for your time. I'll take a good, deep look at categoricals.

@shoyer

This comment has been minimized.

Show comment
Hide comment
@shoyer

shoyer Apr 11, 2015

Member

An issue with the key argument is that it encourages writing non-vectorized code. I would rather support supplying an array to sort by instead, e.g., something like df.sort(order=[natsort_geygen(x) for x in df.index]). Categoricals are also a nice way to handle this.

Member

shoyer commented Apr 11, 2015

An issue with the key argument is that it encourages writing non-vectorized code. I would rather support supplying an array to sort by instead, e.g., something like df.sort(order=[natsort_geygen(x) for x in df.index]). Categoricals are also a nice way to handle this.

@SethMMorton

This comment has been minimized.

Show comment
Hide comment
@SethMMorton

SethMMorton Apr 12, 2015

True. Maybe it would be nice if the Categoricals could provide a key argument that would be an alternative to the categories argument, to define a functional way that the categories would be sorted rather than a pre-defined way. Unless I am mistaken, this would still keep the power of the Categoricals while allowing the flexibility provided by the key argument.

SethMMorton commented Apr 12, 2015

True. Maybe it would be nice if the Categoricals could provide a key argument that would be an alternative to the categories argument, to define a functional way that the categories would be sorted rather than a pre-defined way. Unless I am mistaken, this would still keep the power of the Categoricals while allowing the flexibility provided by the key argument.

@kay1793

This comment has been minimized.

Show comment
Hide comment
@kay1793

kay1793 Apr 12, 2015

first, sort by key is a universally known operation, why drag categories into it?

An issue with the key argument is that it encourages writing non-vectorized code.
<...> I would rather support supplying an array to sort by instead

adding sorted was suggested as a way to get a consistent API for sorting within pandas and with. it looks like the first thing you'd like to do is to have its key behave differently from both python (which accepts lambdas) and pandas (which accepts either lambda/ufuncs/arrays in such cases).

pandas already usefully accepts lambdas in many many places, you can always optimize later if you need to. why decide for every user and every problem that this specific boilerplate/performance tradeoff is worth it? the array you'd like to require is just as likely to be computed with the lambda you'd like to ban only with added inconvenience for users.

kay1793 commented Apr 12, 2015

first, sort by key is a universally known operation, why drag categories into it?

An issue with the key argument is that it encourages writing non-vectorized code.
<...> I would rather support supplying an array to sort by instead

adding sorted was suggested as a way to get a consistent API for sorting within pandas and with. it looks like the first thing you'd like to do is to have its key behave differently from both python (which accepts lambdas) and pandas (which accepts either lambda/ufuncs/arrays in such cases).

pandas already usefully accepts lambdas in many many places, you can always optimize later if you need to. why decide for every user and every problem that this specific boilerplate/performance tradeoff is worth it? the array you'd like to require is just as likely to be computed with the lambda you'd like to ban only with added inconvenience for users.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Apr 12, 2015

Contributor

With Categoricals you get custom sorting for free. Thinking about if there is utility in allowing categories= to accept a callable to provide a progamatic way to compute the ordering rather than passing in the values. Would be useful I suppose if you .set_categories with NEW categories (which would preserve the ordering).

Contributor

jreback commented Apr 12, 2015

With Categoricals you get custom sorting for free. Thinking about if there is utility in allowing categories= to accept a callable to provide a progamatic way to compute the ordering rather than passing in the values. Would be useful I suppose if you .set_categories with NEW categories (which would preserve the ordering).

@kay1793

This comment has been minimized.

Show comment
Hide comment
@kay1793

kay1793 Apr 12, 2015

With Categoricals you get custom sorting for free.

you wouldn't tell users to use category ordering when they want to sort the index lexically, and keyed sorting is no different. wanting to sort by a key function doesn't imply that the index is or should be categorical. categories are irrelevant unless the index data is categorical in nature to begin with.

kay1793 commented Apr 12, 2015

With Categoricals you get custom sorting for free.

you wouldn't tell users to use category ordering when they want to sort the index lexically, and keyed sorting is no different. wanting to sort by a key function doesn't imply that the index is or should be categorical. categories are irrelevant unless the index data is categorical in nature to begin with.

@SethMMorton

This comment has been minimized.

Show comment
Hide comment
@SethMMorton

SethMMorton Apr 12, 2015

Let me add a more concrete example of when having a key option to sort would be easier to use from a user's perspective, and may possibly be more efficient than Categoricals.

Suppose that a user had data in text files, and one of the columns contains distances with associated units, i.e. "45.232m" or "0.59472km". Let's say there are ~500,000 rows, and each has a different distance. Now, suppose the user wanted to sort based the data in this column. Obviously, they will have to do some sort of transformation of this data to make it sortable, since a purely ordinal sort will not work. As far as I can tell, currently the two most obvious results are to a) make a new column of the transformation result and use that column for sorting, or b) make the column a category, and then sort the data in the list, and make the categories the sorted data.

import re
from pandas import read_csv

def transform_distance(x):
    """Convert string of value and unit to a float.
    Since this is just an example, skip error checking."""
    m = re.match(r'(.*)([mkc]?m)', x)
    units = {'m': 1, 'cm': 0.01, 'mm': 0.001, 'km': 1000}
    return float(m.group(1)) * units[m.group(2)]

df = read_csv('myfile.data')

# Sort method 1: Make a new column and sort on that.
df['distances_sort'] = df.distances.map(transform_distance)
df.sort('distances_sort')

# Sort method 2: Use categoricals
df.distances = df.distances.astype('category')
df.distances.cat.reorder_categories(sorted(df.distances, key=transform_distance), inplace=True, ordered=True)
df.sort('distances')

To me, neither seem entirely preferable because method 1 adds extra data to the DataFrame, which will take up space and require me to filter out later if I want to write out to file, and method 2 requires sorting all the data in my column before I can sort the data in my DataFrame, which unless I am mistaken is not incredibly efficient.

Things would be made worse if I then wanted to read in a second file and append that data to the DataFrame I already had, or if I wanted to modify the existing data in the "distances" column. I would then need to re-update my "distances_sort" column, or re-perform the reorder_categories call before I could sort again.

If a key method were added to sort, all the boilerplate goes away as well as the extra processing. Sorting would just become

# Proposed sort method: Use a key argument
df.sort('distances', key=transform_distances)

Now, no matter how I update or modify my distances column, I do not need to do any additional pre-processing before sorting.

The key argument could be flexible and support either a function, or a dict of functions. This second input type would be used if you wanted to provide a key for only a few columns, or different keys for different columns; for example:

# Supporting multi-column sorting with a key.
# In this case, only columns 'a' and 'c' would use the key for sorting,
# and 'b' and 'd' would sort in the standard way.
df.sort(['a', 'b', 'c', 'd'], key={'a': lambda x: x.lower(), 'c': transform_distances})

SethMMorton commented Apr 12, 2015

Let me add a more concrete example of when having a key option to sort would be easier to use from a user's perspective, and may possibly be more efficient than Categoricals.

Suppose that a user had data in text files, and one of the columns contains distances with associated units, i.e. "45.232m" or "0.59472km". Let's say there are ~500,000 rows, and each has a different distance. Now, suppose the user wanted to sort based the data in this column. Obviously, they will have to do some sort of transformation of this data to make it sortable, since a purely ordinal sort will not work. As far as I can tell, currently the two most obvious results are to a) make a new column of the transformation result and use that column for sorting, or b) make the column a category, and then sort the data in the list, and make the categories the sorted data.

import re
from pandas import read_csv

def transform_distance(x):
    """Convert string of value and unit to a float.
    Since this is just an example, skip error checking."""
    m = re.match(r'(.*)([mkc]?m)', x)
    units = {'m': 1, 'cm': 0.01, 'mm': 0.001, 'km': 1000}
    return float(m.group(1)) * units[m.group(2)]

df = read_csv('myfile.data')

# Sort method 1: Make a new column and sort on that.
df['distances_sort'] = df.distances.map(transform_distance)
df.sort('distances_sort')

# Sort method 2: Use categoricals
df.distances = df.distances.astype('category')
df.distances.cat.reorder_categories(sorted(df.distances, key=transform_distance), inplace=True, ordered=True)
df.sort('distances')

To me, neither seem entirely preferable because method 1 adds extra data to the DataFrame, which will take up space and require me to filter out later if I want to write out to file, and method 2 requires sorting all the data in my column before I can sort the data in my DataFrame, which unless I am mistaken is not incredibly efficient.

Things would be made worse if I then wanted to read in a second file and append that data to the DataFrame I already had, or if I wanted to modify the existing data in the "distances" column. I would then need to re-update my "distances_sort" column, or re-perform the reorder_categories call before I could sort again.

If a key method were added to sort, all the boilerplate goes away as well as the extra processing. Sorting would just become

# Proposed sort method: Use a key argument
df.sort('distances', key=transform_distances)

Now, no matter how I update or modify my distances column, I do not need to do any additional pre-processing before sorting.

The key argument could be flexible and support either a function, or a dict of functions. This second input type would be used if you wanted to provide a key for only a few columns, or different keys for different columns; for example:

# Supporting multi-column sorting with a key.
# In this case, only columns 'a' and 'c' would use the key for sorting,
# and 'b' and 'd' would sort in the standard way.
df.sort(['a', 'b', 'c', 'd'], key={'a': lambda x: x.lower(), 'c': transform_distances})
@shoyer

This comment has been minimized.

Show comment
Hide comment
@shoyer

shoyer Apr 12, 2015

Member

OK, I can see key as a useful option.

Instead of allow key to accept dicts, what about sorting multiple columns -> the key function gets a tuple?

Member

shoyer commented Apr 12, 2015

OK, I can see key as a useful option.

Instead of allow key to accept dicts, what about sorting multiple columns -> the key function gets a tuple?

@SethMMorton

This comment has been minimized.

Show comment
Hide comment
@SethMMorton

SethMMorton Apr 12, 2015

As long as it is well documented how to use the key on multiple columns, I don't much care. Just having the key option would be a huge step in the right direction.

SethMMorton commented Apr 12, 2015

As long as it is well documented how to use the key on multiple columns, I don't much care. Just having the key option would be a huge step in the right direction.

@gepcel

This comment has been minimized.

Show comment
Hide comment
@gepcel

gepcel Jun 15, 2017

Contributor

I found this while searching these kinds of usages. I really like @SethMMorton 's idea. Really wish this will happen. Under many circumstances, this makes more sense than catogeries.

Contributor

gepcel commented Jun 15, 2017

I found this while searching these kinds of usages. I really like @SethMMorton 's idea. Really wish this will happen. Under many circumstances, this makes more sense than catogeries.

@jreback

This comment has been minimized.

Show comment
Hide comment
@jreback

jreback Jun 15, 2017

Contributor

as indicated above, if someone wants to step up and add this functionality in a proper / tested way, then it is likely to be accepted.

Contributor

jreback commented Jun 15, 2017

as indicated above, if someone wants to step up and add this functionality in a proper / tested way, then it is likely to be accepted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment