PERF: DataFrame.groupby.nunique is non-performant #15197

Closed
jreback opened this Issue Jan 23, 2017 · 1 comment

Contributor

jreback commented Jan 23, 2017 edited

xref #14376

# from the asv
In [9]: import pandas as pd
   ...: from pandas import DataFrame
   ...: from numpy.random import randint

In [10]: n = 10000
    ...: df = DataFrame({'key1': randint(0, 500, size=n),
    ...:                 'key2': randint(0, 100, size=n),
    ...:                 'ints': randint(0, 1000, size=n),
    ...:                 'ints2': randint(0, 1000, size=n)})
    ...:

In [11]: %timeit df.groupby(['key1', 'key2']).nunique()
1 loop, best of 3: 4.25 s per loop

In [12]: result = df.groupby(['key1', 'key2']).nunique()

In [13]: g = df.groupby(['key1', 'key2'])

In [14]: expected = pd.concat([getattr(g, col).nunique() for col in g._selected_obj.columns], axis=1)

In [15]: result.equals(expected)
Out[15]: True

In [16]: %timeit pd.concat([getattr(g, col).nunique() for col in g._selected_obj.columns], axis=1)
100 loops, best of 3: 6.94 ms per loop

Series.groupby.nunique has a very performant implementation, but DataFrame.groupby.nunique is implemented via .apply, so it ends up in a Python loop over the groups, which nullifies this (4.25 s vs. 6.94 ms in the timings above).

This should be straightforward to fix. We need to make sure to test with both as_index=True and as_index=False.
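For illustration, here is a minimal sketch of the column-wise approach described above: call the fast Series groupby nunique once per column and concatenate the results. This is not necessarily how the eventual fix was written. The helper name `nunique_by_column` is hypothetical, and treating as_index=False as a plain `reset_index()` is an assumption. Unlike the snippet in the session above, it skips the key columns and selects each column with `g[col]` instead of `getattr`.

```python
import numpy as np
import pandas as pd


def nunique_by_column(df, keys, as_index=True):
    """Hypothetical helper: per-column nunique, combined with concat."""
    g = df.groupby(keys)
    # Count only the non-key columns. g[col] is used instead of getattr so
    # that column names clashing with groupby attributes still work.
    value_cols = [c for c in df.columns if c not in keys]
    result = pd.concat([g[c].nunique() for c in value_cols], axis=1)
    # Assumption: as_index=False just turns the keys into ordinary columns.
    return result if as_index else result.reset_index()


# Same shape of data as the asv benchmark, with a seeded generator.
rng = np.random.default_rng(0)
n = 10000
df = pd.DataFrame({'key1': rng.integers(0, 500, size=n),
                   'key2': rng.integers(0, 100, size=n),
                   'ints': rng.integers(0, 1000, size=n),
                   'ints2': rng.integers(0, 1000, size=n)})

keys = ['key1', 'key2']
by_index = nunique_by_column(df, keys)               # keys in a MultiIndex
flat = nunique_by_column(df, keys, as_index=False)   # keys as columns
```

Both variants avoid the Python-level loop over groups; the only loop left is over the columns, which is usually far shorter.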

jreback added this to the 0.20.0 milestone Jan 23, 2017

Contributor

jreback commented Jan 23, 2017

cc @xflr6

jreback added a commit to jreback/pandas that referenced this issue Jan 23, 2017

PERF: DataFrame.groupby.nunique
closes #15197
6d02616

jreback closed this in dc40058 Jan 24, 2017

AnkurDedania added a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017

jreback + AnkurDedania: PERF: DataFrame.groupby.nunique
closes #15197

Author: Jeff Reback <jeff@reback.net>

Closes #15201 from jreback/nunique and squashes the following commits:

6d02616 [Jeff Reback] PERF: DataFrame.groupby.nunique
983fdd2