
ENH: nlargest for DataFrame #3960

Closed
hayd opened this issue Jun 19, 2013 · 34 comments · Fixed by #7113
Labels
API Design · Enhancement · Groupby · Numeric Operations (Arithmetic, Comparison, and Logical operations)

@hayd
Contributor

hayd commented Jun 19, 2013

I don't think there is a way to get the n largest elements of a DataFrame without sorting.

In ordinary Python you'd use heapq's nlargest (and we can hack a bit to use it for a DataFrame):

In [10]: df
Out[10]:
                IP                                              Agent  Count
0    74.86.158.106  Mozilla/5.0+(compatible; UptimeRobot/2.0; http...    369
1   203.81.107.103  Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20...    388
2  173.199.120.155  Mozilla/5.0 (compatible; AhrefsBot/4.0; +http:...    417
3    124.43.84.242  Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.3...    448
4  112.135.196.223  Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.3...    454
5   124.43.155.138  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) G...    461
6   124.43.104.198  Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20...    467

In [11]: df.sort('Count', ascending=False).head(3)
Out[11]:
                IP                                              Agent  Count
6   124.43.104.198  Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20...    467
5   124.43.155.138  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) G...    461
4  112.135.196.223  Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.3...    454
In [21]: from heapq import nlargest

In [22]: top_3 = nlargest(3, df.iterrows(), key=lambda x: x[1]['Count'])

In [23]: pd.DataFrame.from_items(top_3).T
Out[23]:
                IP                                              Agent Count
6   124.43.104.198  Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20...   467
5   124.43.155.138  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) G...   461
4  112.135.196.223  Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.3...   454

This is much slower than sorting, presumably from the overhead, but I thought I'd throw it out as a feature idea anyway.

see http://stackoverflow.com/a/17194717/1240268
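For readers on current pandas, where DataFrame.sort and DataFrame.from_items have since been removed, the same heapq hack can be sketched like this (toy data standing in for the log frame above):

```python
import heapq
import pandas as pd

# Toy frame standing in for the log data above (values are made up).
df = pd.DataFrame({
    "IP": ["74.86.158.106", "203.81.107.103", "124.43.104.198"],
    "Count": [369, 388, 467],
})

# heapq.nlargest over iterrows(), keyed on the Count column,
# then rebuilt into a frame (from_items is gone in modern pandas).
top_2 = heapq.nlargest(2, df.iterrows(), key=lambda x: x[1]["Count"])
result = pd.DataFrame({idx: row for idx, row in top_2}).T
```

heapq.nlargest never compares the row Series themselves (ties are broken internally), so the lambda key is the only per-row Python cost — which is exactly the overhead complained about above.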

@hayd
Contributor Author

hayd commented Jun 19, 2013

Which is shockingly slow... but I guess there is a lot going on there.

@jreback
Contributor

jreback commented Jun 19, 2013

here's a comparison (using kth smallest)

In [44]: s = Series(np.arange(1000000)[::-1])

heap

In [45]: from heapq import nsmallest
In [46]: %timeit nsmallest(3, s.values)
1 loops, best of 3: 485 ms per loop

sort

In [50]: %timeit s.order().head(3)
10 loops, best of 3: 58.3 ms per loop

and the winner

In [47]: def f(x):
   ....:     v = pd.algos.kth_smallest(x.values.astype(float), 3)
   ....:     return x[x <= v]
In [48]: %timeit f(s)
100 loops, best of 3: 11.7 ms per loop
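pd.algos.kth_smallest was a pandas-internal Cython helper; the same O(n) selection is available today through NumPy's np.partition, so the winning trick can be sketched without touching internals:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1000000)[::-1])

def nsmallest_sketch(s, k):
    # np.partition does an O(n) selection: after the call, the k
    # smallest values occupy the first k slots (in arbitrary order).
    smallest = np.partition(s.values, k - 1)[:k]
    smallest.sort()  # only k values left to sort
    return smallest

result = nsmallest_sketch(s, 3)
```

This avoids sorting the remaining n - k values entirely, which is where the speedup over s.order().head(3) comes from.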

@hayd
Contributor Author

hayd commented Jun 19, 2013

Ah ha! so there is a kth_smallest already... ahem!

(Was going to suggest bottleneck http://stackoverflow.com/a/10463648/1240268)

@jreback
Contributor

jreback commented Jun 19, 2013

but it might be useful to wrap these up into a Series method (as this is a cython method), and there's no kth_largest... so we should probably just write that one (I don't think there is an easy inversion)

@hayd
Contributor Author

hayd commented Jun 19, 2013

I would propose using nlargest and nsmallest... I could have a crack at some cython for kth_largest.

Similarly, you could have a key (like for the heapq versions), e.g. #3942. Is the issue here that looking up a key (from Python) for each value makes it incredibly slow?

@jreback
Contributor

jreback commented Jun 19, 2013

Instead of a key, just make a column of the values of the key; then it's all vectorized. So yes, the 'key' argument is a problem.
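A minimal illustration of the "materialize the key as a column" idea, with hypothetical data (sort_values is the modern spelling of sort):

```python
import pandas as pd

df = pd.DataFrame({"x": [-5, 2, -1, 4]})

# Rather than passing key=abs, compute the key values once as a
# column; the sort then stays fully vectorized.
df["key"] = df["x"].abs()
ordered = df.sort_values("key")
```

The key function runs once as a vectorized operation instead of once per element from Python.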

@jtratner
Contributor

Just a thought - what if you followed some of the conventions of agg/apply and allowed passing of 'sum', etc. and could do this with a set of columns, like:

df.orderby([a, b, c], sum)

(but with better syntax :P)
And if you can't cythonize it, fall back to using a python function.

@jtratner
Contributor

And then you could use the n* or other functions on that

@jtratner
Contributor

Last thing - maybe better in reverse order: pass the function and then the cols. (Or maybe this already exists?)

@cpcloud
Member

cpcloud commented Jun 19, 2013

cols, function is, I think, the way R works. I like the function, cols syntax more, but that's just me.

@hayd
Contributor Author

hayd commented Jun 19, 2013

@jreback I guess you could just create the column to use as the key on the fly.

function, cols doesn't make sense if you're not passing a key though (most cases?)...

@jreback
Contributor

jreback commented Jun 19, 2013

@jtratner

what would df.orderby([a,b,c],sum) actually do?

@hayd
Contributor Author

hayd commented Jun 19, 2013

Presumably ordering by the "key_column":

df.orderby([a, b, c], key=lambda row: row[a] + row[b] + row[c])

without actually creating the column:

df['key'] = df[[a, b, c]].sum(axis=1)
df.sort('key').head(3)
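Made runnable with hypothetical column names (note the row-wise sum needs axis=1, and sort_values replaces sort in current pandas):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [4, 0, 2], "c": [2, 1, 9]})

# Row-wise key column, then an ordinary sort -- the vectorized
# stand-in for a per-row key function.
df["key"] = df[["a", "b", "c"]].sum(axis=1)
smallest = df.sort_values("key").head(2)
```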

@cpcloud
Member

cpcloud commented Jun 19, 2013

yep that's what i was thinking too

@hayd
Contributor Author

hayd commented Jun 19, 2013

I'm not sure I'm entirely sold on using the cols in this way, actually; usually when passing columns you order by a, then b, then c.

Really I was thinking of this more as

df.orderby(key=lambda row: row[a] + row[b] + row[c])
df.orderby(key=lambda row: row[[a, b, c]].sum())

:s

@jtratner
Contributor

Yeah, I guess that fits the model of other functions better if using (cols, key).
On Jun 19, 2013 1:19 PM, "Phillip Cloud" notifications@github.com wrote:

which is the same as np.sort(df[[a, b, c]].sum()), but orderby would
allow an arbitrary function.


Reply to this email directly or view it on GitHub.

which is the same as np.sort(df[[a, b, c]].sum()), but orderby would allow
an arbitrary function.


Reply to this email directly or view it on
GitHubhttps://github.com//issues/3960#issuecomment-19699033
.

@jreback
Contributor

jreback commented Jun 19, 2013

I think this is the same thing

df[df[['a','b','c']].sum().order().index]

why do we need a separate function?

@jtratner
Contributor

Well, similar to groupby, you could intelligently handle it s.t. if no func, it orders by a, b, c, and if func, it orders by the output (vectorizing if possible). I was fancifully thinking it could accept a series of tuples of ('cols', agg or key func), but maybe that's more complicated than it needs to be.

@hayd
Contributor Author

hayd commented Jun 19, 2013

So the key would actually be applied column wise (and then sort by these) ?

keys = df[cols].applymap(key)

I guess these examples make more sense when key != sum.

@jtratner
Contributor

@jreback good point.

@hayd
Contributor Author

hayd commented Jun 19, 2013

@jreback so kth_smallest just pulls out the kth smallest (obviously!), which isn't quite the same as .sort().head(3) (since it only pulls out one value). The algorithm just doesn't sort the smaller or larger elements (it's pretty darn clever).

Maybe I'll just look at how heapq.nlargest is implemented, or anyone know a better one?

@jreback
Contributor

jreback commented Jun 19, 2013

you can prob just copy it and reverse the signs

and you only need the kth smallest: if v = kth_smallest(k), then s[s <= v].head(k) gives you the smallest k values (and this is very fast);
this is what I do in my function
@hayd
Contributor Author

hayd commented Jun 19, 2013

Does this mean it doesn't sort those k values?

And also... if there are duplicates this might not include the largest item?

Isn't kth_largest(n-k) = kth_smallest(k)?

@jreback
Contributor

jreback commented Jun 19, 2013

Yes, I think you would need to sort the k smallest... s[s <= v].order().head(k) I think will do it (as, if there are dups, that expression would have len > k).

I think you are right about kth_largest... prob just as fast too (but not sure)
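The kth_largest relationship can be checked with plain NumPy: negating the values turns a largest-selection into a smallest-selection, and on an array of length n the kth largest (1-based) sits at 0-based sorted position n - k:

```python
import numpy as np

x = np.array([4.0, 1.0, 7.0, 3.0, 9.0])
k = 2  # want the 2nd largest

# Route 1: reverse the signs and take the kth smallest of -x.
via_negation = -np.partition(-x, k - 1)[k - 1]

# Route 2: kth largest == (n - k + 1)th smallest (1-based),
# i.e. 0-based partition index n - k.
n = len(x)
via_smallest = np.partition(x, n - k)[n - k]
```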

@jreback
Contributor

jreback commented Jun 19, 2013

FYI, I believe the current kth algo and bottleneck are pretty similar; as I recall there was a discussion, maybe last year, between Wes and Ken about how to compute a fast median (which is of course kth = n/2)

@hayd
Contributor Author

hayd commented Jun 19, 2013

If there are dupes this isn't well defined so I don't think it matters.

@jreback
Contributor

jreback commented Jun 19, 2013

2 issues to consider:

non-numeric dtypes (though you can convert to view('i8') for datetime-likes); there is a function, needs_i8_conversion, which detects this

NaNs I would always exclude, maybe even do a dropna first (then you don't have to deal with them)
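needs_i8_conversion is a pandas internal, but the idea can be sketched by hand: drop NaT first, view the datetime values as int64, and select on that (toy dates; np.partition stands in for the Cython kth helper):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.to_datetime(
    ["2013-06-19", "2013-01-01", None, "2013-03-15"]))

# dropna first so NaT never reaches the selection, then select on
# the int64 view of the datetime64 values.
clean = s.dropna()
i8 = clean.values.view("i8")
k = 2
v = np.partition(i8, k - 1)[k - 1]
smallest = clean[i8 <= v].sort_values()
```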

@jreback
Contributor

jreback commented Jun 19, 2013

also I would only make this a method for Series; it can always be applied to a frame if needed (e.g. this is much like, say, argsort)
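For the record, the API that eventually landed (via #7113) is the Series method pair nlargest/nsmallest, and DataFrame later gained its own nlargest(n, columns):

```python
import pandas as pd

s = pd.Series([369, 388, 467, 454, 461])
top3 = s.nlargest(3)  # three largest values, sorted descending

df = pd.DataFrame({"IP": ["a", "b", "c"], "Count": [369, 467, 388]})
top2 = df.nlargest(2, "Count")  # rows with the two largest Counts
```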

@ghost ghost assigned hayd Jun 21, 2013
@jreback
Contributor

jreback commented Sep 22, 2013

this is a nice little function, should do in 0.14

@jreback
Contributor

jreback commented Mar 22, 2014

@hayd this looks fine... were you waiting on something?

@hayd
Contributor Author

hayd commented Mar 24, 2014

No, just sulking about perf, will have a look again this week. Should def put it in 0.14.

@jreback
Contributor

jreback commented May 5, 2014

@hayd ping!!!!

@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 12, 2014
@jreback jreback modified the milestones: 0.14.0, 0.14.1 May 13, 2014
@cpcloud cpcloud assigned cpcloud and unassigned hayd May 13, 2014