
ENH: nlargest for DataFrame #3960

Closed
hayd opened this issue Jun 19, 2013 · 34 comments · Fixed by #7113
Labels
API Design · Enhancement · Groupby · Numeric Operations (Arithmetic, Comparison, and Logical operations)

@hayd
Contributor

hayd commented Jun 19, 2013

I don't think there is a way to get the n largest elements of a DataFrame without sorting.

In ordinary Python you'd use heapq's nlargest (and we can hack a bit to use it for a DataFrame):

In [10]: df
Out[10]:
                IP                                              Agent  Count
0    74.86.158.106  Mozilla/5.0+(compatible; UptimeRobot/2.0; http...    369
1   203.81.107.103  Mozilla/5.0 (Windows NT 6.1; rv:21.0) Gecko/20...    388
2  173.199.120.155  Mozilla/5.0 (compatible; AhrefsBot/4.0; +http:...    417
3    124.43.84.242  Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.3...    448
4  112.135.196.223  Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.3...    454
5   124.43.155.138  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) G...    461
6   124.43.104.198  Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20...    467

In [11]: df.sort('Count', ascending=False).head(3)
Out[11]:
                IP                                              Agent  Count
6   124.43.104.198  Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20...    467
5   124.43.155.138  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) G...    461
4  112.135.196.223  Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.3...    454
In [21]: from heapq import nlargest

In [22]: top_3 = nlargest(3, df.iterrows(), key=lambda x: x[1]['Count'])

In [23]: pd.DataFrame.from_items(top_3).T
Out[23]:
                IP                                              Agent Count
6   124.43.104.198  Mozilla/5.0 (Windows NT 5.1; rv:21.0) Gecko/20...   467
5   124.43.155.138  Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) G...   461
4  112.135.196.223  Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.3...   454

This is much slower than sorting, presumably from the overhead, but I thought I'd throw it out as a feature idea anyway.

see http://stackoverflow.com/a/17194717/1240268
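For readers on current pandas, where DataFrame.sort and DataFrame.from_items have since been removed, the same heapq hack can be sketched like this (toy data standing in for the log frame above):

```python
import heapq
import pandas as pd

# Toy frame standing in for the log data above (values are made up).
df = pd.DataFrame({
    "IP": ["74.86.158.106", "203.81.107.103", "124.43.104.198"],
    "Count": [369, 388, 467],
})

# heapq.nlargest over iterrows(), keyed on the Count column,
# then rebuilt into a frame (from_items is gone in modern pandas).
top_2 = heapq.nlargest(2, df.iterrows(), key=lambda x: x[1]["Count"])
result = pd.DataFrame({idx: row for idx, row in top_2}).T
```

heapq.nlargest never compares the row Series themselves (ties are broken internally), so the lambda key is the only per-row Python cost — which is exactly the overhead complained about above.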

@hayd
Contributor Author

hayd commented Jun 19, 2013

Which is shockingly slow... but I guess there is a lot going on there.

@jreback
Contributor

jreback commented Jun 19, 2013

here's a comparison (using kth smallest)

In [44]: s = Series(np.arange(1000000)[::-1])

heap

In [45]: from heapq import nsmallest
In [46]: %timeit nsmallest(3, s.values)
1 loops, best of 3: 485 ms per loop

sort

In [50]: %timeit s.order().head(3)
10 loops, best of 3: 58.3 ms per loop

and the winner

In [47]: def f(x):
   ....:     v = pd.algos.kth_smallest(x.values.astype(float), 3)
   ....:     return x[x <= v]
In [48]: %timeit f(s)
100 loops, best of 3: 11.7 ms per loop
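pd.algos.kth_smallest was a pandas-internal Cython helper; the same O(n) selection is available today through NumPy's np.partition, so the winning trick can be sketched without touching internals:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1000000)[::-1])

def nsmallest_sketch(s, k):
    # np.partition does an O(n) selection: after the call, the k
    # smallest values occupy the first k slots (in arbitrary order).
    smallest = np.partition(s.values, k - 1)[:k]
    smallest.sort()  # only k values left to sort
    return smallest

result = nsmallest_sketch(s, 3)
```

This avoids sorting the remaining n - k values entirely, which is where the speedup over s.order().head(3) comes from.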

@hayd
Contributor Author

hayd commented Jun 19, 2013

Ah ha! so there is a kth_smallest already... ahem!

(Was going to suggest bottleneck http://stackoverflow.com/a/10463648/1240268)

@jreback
Contributor

jreback commented Jun 19, 2013

but it might be useful to wrap these up into a Series method (as this is a cython method), and there's no kth_largest... so we should probably just write that one (I don't think there is an easy inversion)

@hayd
Contributor Author

hayd commented Jun 19, 2013

I would propose using nlargest and nsmallest... I could have a crack at some cython for kth_largest.

Similarly, you could have a key (like for the heapq versions), e.g. #3942. Is the issue here that looking up a key (from Python) for each value makes it incredibly slow?

@jreback
Contributor

jreback commented Jun 19, 2013

Instead of a key, just make a column of the values of the key; then it's all vectorized. So yes, the 'key' argument is a problem.
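A minimal illustration of the "materialize the key as a column" idea, with hypothetical data (sort_values is the modern spelling of sort):

```python
import pandas as pd

df = pd.DataFrame({"x": [-5, 2, -1, 4]})

# Rather than passing key=abs, compute the key values once as a
# column; the sort then stays fully vectorized.
df["key"] = df["x"].abs()
ordered = df.sort_values("key")
```

The key function runs once as a vectorized operation instead of once per element from Python.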

@jtratner
Contributor

Just a thought - what if you followed some of the conventions of agg/apply and allowed passing of 'sum', etc. and could do this with a set of columns, like:

df.orderby([a, b, c], sum)

(but with better syntax :P)
And if you can't cythonize it, fall back to using a python function.

@jtratner
Contributor

And then you could use the n* or other functions on that

@jtratner
Contributor

Last thing - maybe better in reverse order: pass the function and then the cols. (Or maybe this already exists?)

@cpcloud
Member

cpcloud commented Jun 19, 2013

cols, function is, I think, the way R works. I like the function, cols syntax more, but that's just me.

@hayd
Contributor Author

hayd commented Jun 19, 2013

@jreback I guess you could just create the column to use as the key on the fly.

function, cols doesn't make sense if you're not passing a key though (most cases?)...

@jreback
Contributor

jreback commented Jun 19, 2013

@jtratner

what would df.orderby([a,b,c],sum) actually do?

@hayd
Contributor Author

hayd commented Jun 19, 2013

Presumably ordering by the "key_column":

df.orderby([a, b, c], key=lambda row: row[a] + row[b] + row[c])

without actually creating the column:

df['key'] = df[[a, b, c]].sum(axis=1)
df.sort('key').head(3)
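Made runnable with hypothetical column names (note the row-wise sum needs axis=1, and sort_values replaces sort in current pandas):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [4, 0, 2], "c": [2, 1, 9]})

# Row-wise key column, then an ordinary sort -- the vectorized
# stand-in for a per-row key function.
df["key"] = df[["a", "b", "c"]].sum(axis=1)
smallest = df.sort_values("key").head(2)
```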

@cpcloud
Member

cpcloud commented Jun 19, 2013

yep that's what i was thinking too

@hayd
Contributor Author

hayd commented Jun 19, 2013

I'm not sure I'm entirely sold on using the cols in this way, actually; usually when passing columns you order by a, then b, then c.

Really I was thinking of this more as

df.orderby(key=lambda row: row[a] + row[b] + row[c])
df.orderby(key=lambda row: row[[a, b, c]].sum())

:s

@jtratner
Contributor

Yeah, I guess that fits the model of other functions better if using (cols, key).
On Jun 19, 2013 1:19 PM, "Phillip Cloud" notifications@github.com wrote:

which is the same as np.sort(df[[a, b, c]].sum()), but orderby would
allow an arbitrary function.


Reply to this email directly or view it on GitHub.

which is the same as np.sort(df[[a, b, c]].sum()), but orderby would allow
an arbitrary function.


Reply to this email directly or view it on
GitHubhttps://github.com//issues/3960#issuecomment-19699033
.

@jreback
Contributor

jreback commented Jun 19, 2013

I think this is the same thing

df[df[['a','b','c']].sum().order().index]

why do we need a separate function?

@jtratner
Contributor

Well, similar to groupby, you could intelligently handle it s.t. if no func, it orders by a, b, c, and if func, it orders by the output (vectorizing if possible). I was fancifully thinking it could accept a series of tuples of ('cols', agg or key func), but maybe that's more complicated than it needs to be.

@hayd
Contributor Author

hayd commented Jun 19, 2013

So the key would actually be applied column wise (and then sort by these) ?

keys = df[cols].applymap(key)

I guess these examples make more sense when key != sum.

@jtratner
Contributor

@jreback good point.

@hayd
Contributor Author

hayd commented Jun 19, 2013

@jreback so kth_smallest just pulls out the kth smallest (obviously!), which isn't quite the same as .sort().head(3) (since it only pulls out one value). The algorithm just doesn't sort the smaller or larger elements (it's pretty darn clever).

Maybe I'll just look at how heapq.nlargest is implemented, or anyone know a better one?

@jreback
Contributor

jreback commented Jun 19, 2013

you can prob just copy it and reverse the signs

and you only need the kth smallest: if v = kth_smallest(k), then s[s <= v].head(k) gives you the smallest k values (and this is very fast);
this is what I do in my function
@hayd
Contributor Author

hayd commented Jun 19, 2013

Does this mean it doesn't sort those k values?

And also... if there are duplicates this might not include the largest item?

Isn't kth_largest(n-k) = kth_smallest(k)?

@jreback
Contributor

jreback commented Jun 19, 2013

Yes, I think you would need to sort the k smallest... s[s <= v].order().head(k) I think will do it (as, if there are dups, that expression would have len > k).

I think you are right about kth_largest... prob just as fast too (but not sure)
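The kth_largest relationship can be checked with plain NumPy: negating the values turns a largest-selection into a smallest-selection, and on an array of length n the kth largest (1-based) sits at 0-based sorted position n - k:

```python
import numpy as np

x = np.array([4.0, 1.0, 7.0, 3.0, 9.0])
k = 2  # want the 2nd largest

# Route 1: reverse the signs and take the kth smallest of -x.
via_negation = -np.partition(-x, k - 1)[k - 1]

# Route 2: kth largest == (n - k + 1)th smallest (1-based),
# i.e. 0-based partition index n - k.
n = len(x)
via_smallest = np.partition(x, n - k)[n - k]
```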

@jreback
Contributor

jreback commented Jun 19, 2013

FYI, I believe the current kth algo and bottleneck are pretty similar; as I recall there was a discussion, maybe last year, between Wes and Ken about how to compute a fast median (which is of course kth = n/2)

@hayd
Contributor Author

hayd commented Jun 19, 2013

If there are dupes this isn't well defined so I don't think it matters.

@jreback
Contributor

jreback commented Jun 19, 2013

2 issues to consider:

non-numeric dtypes (though you can convert to view('i8') for datetime-likes); there is a function, needs_i8_conversion, which detects this

NaNs I would always exclude, maybe even do a dropna first (then you don't have to deal with them)
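needs_i8_conversion is a pandas internal, but the idea can be sketched by hand: drop NaT first, view the datetime values as int64, and select on that (toy dates; np.partition stands in for the Cython kth helper):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.to_datetime(
    ["2013-06-19", "2013-01-01", None, "2013-03-15"]))

# dropna first so NaT never reaches the selection, then select on
# the int64 view of the datetime64 values.
clean = s.dropna()
i8 = clean.values.view("i8")
k = 2
v = np.partition(i8, k - 1)[k - 1]
smallest = clean[i8 <= v].sort_values()
```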

@jreback
Contributor

jreback commented Jun 19, 2013

also I would only make this a method for Series; it can always be applied to a frame if needed (e.g. this is much like, say, argsort)
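For the record, the API that eventually landed (via #7113) is the Series method pair nlargest/nsmallest, and DataFrame later gained its own nlargest(n, columns):

```python
import pandas as pd

s = pd.Series([369, 388, 467, 454, 461])
top3 = s.nlargest(3)  # three largest values, sorted descending

df = pd.DataFrame({"IP": ["a", "b", "c"], "Count": [369, 467, 388]})
top2 = df.nlargest(2, "Count")  # rows with the two largest Counts
```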

@ghost ghost assigned hayd Jun 21, 2013
@jreback
Contributor

jreback commented Sep 22, 2013

this is a nice little function, should do in 0.14

@jreback
Contributor

jreback commented Mar 22, 2014

@hayd this looks fine... were you waiting on something?

@hayd
Contributor Author

hayd commented Mar 24, 2014

No, just sulking about perf, will have a look again this week. Should def put it in 0.14.

@jreback
Contributor

jreback commented May 5, 2014

@hayd ping!!!!

@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 12, 2014
@jreback jreback modified the milestones: 0.14.0, 0.14.1 May 13, 2014
@cpcloud cpcloud assigned cpcloud and unassigned hayd May 13, 2014