PERF: speed up multi-key groupby #8128

Merged 1 commit on Aug 29, 2014
Contributor

behzadnouri commented Aug 28, 2014

Improves multi-key groupby speed. On master:

In [3]: pd.__version__
Out[3]: '0.14.1-276-g995f91c'

In [4]: np.random.seed(2718281)

In [5]: n = 20000

In [6]: df = pd.DataFrame(np.random.randint(1, n, (n, 3)),
   ...:         columns=['jim', 'joe', 'jolie'])

In [7]: %timeit df.groupby(['jim', 'joe'])['jolie'].transform('max')
1 loops, best of 3: 1.09 s per loop

In [8]: df['joe'] = df['jim']

In [9]: %timeit df.groupby(['jim', 'joe'])['jolie'].transform('max')
1 loops, best of 3: 1.02 s per loop

Note that on master the timing barely responds to the reduction in the number of groups. With this patch:

In [9]: %timeit df.groupby(['jim', 'joe'])['jolie'].transform('max')
10 loops, best of 3: 122 ms per loop

In [10]: df['joe'] = df['jim']

In [11]: %timeit df.groupby(['jim', 'joe'])['jolie'].transform('max')
10 loops, best of 3: 82.3 ms per loop
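For context on why `df['joe'] = df['jim']` matters: it collapses the two keys onto one, so the group count drops sharply. A minimal sketch reproducing the setup above (same seed and sizes as in the post):

```python
import numpy as np
import pandas as pd

# Reproduce the setup from the timings above.
np.random.seed(2718281)
n = 20000
df = pd.DataFrame(np.random.randint(1, n, (n, 3)),
                  columns=['jim', 'joe', 'jolie'])

pairs = df.groupby(['jim', 'joe']).ngroups    # distinct (jim, joe) pairs
df['joe'] = df['jim']                         # collapse joe onto jim
singles = df.groupby(['jim', 'joe']).ngroups  # now one group per unique jim
print(pairs, singles)
```

On master the timings above stay near 1 s either way, which is the tell that the cost is not driven by the number of groups actually present.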

benching:

-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms]  |  ratio   |
-------------------------------------------------------------------------------
groupby_transform_multi_key2                 |  48.2820 |  810.2150 |   0.0596 |
groupby_transform_multi_key4                 | 137.5050 | 1986.8030 |   0.0692 |
groupby_transform_multi_key1                 |  70.1749 |  841.8140 |   0.0834 |
groupby_transform_multi_key3                 | 713.8393 | 2613.6247 |   0.2731 |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [11cc057] : Merge branch 'groupby-speed-up' of https://github.com/behzadnouri/pandas into behzadnouri-groupby-speed-up
Base   [3bb0803] : Merge pull request #8103 from sinhrks/pivot_dt

BUG: pivot_table raises KeyError with nameless index and columns
Contributor

jreback commented Aug 28, 2014

  • need a vbench (or 2) (and then show the results)
  • release note

jreback added this to the 0.15.0 milestone Aug 28, 2014

behzadnouri changed the title from ENH: speed up multi-key groupby to PERF: speed up multi-key groupby Aug 28, 2014

Contributor

behzadnouri commented Aug 28, 2014

I had a dependency issue, so I ran the benchmarks manually.
On master:

>>> groupby_transform_multi_key1.run()
{'loops': 1, 'timing': 1880.3930282592773, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key2.run()
{'loops': 1, 'timing': 1886.4881992340088, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key3.run()
{'loops': 1, 'timing': 5588.175058364868, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key4.run()
{'loops': 1, 'timing': 4688.298940658569, 'repeat': 3, 'succeeded': True, 'units': 'ms'}

on branch:

>>> groupby_transform_multi_key1.run()
{'loops': 1, 'timing': 106.5061092376709, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key2.run()
{'loops': 10, 'timing': 73.46320152282715, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key3.run()
{'loops': 1, 'timing': 1123.687982559204, 'repeat': 3, 'succeeded': True, 'units': 'ms'}
>>> groupby_transform_multi_key4.run()
{'loops': 1, 'timing': 200.47903060913086, 'repeat': 3, 'succeeded': True, 'units': 'ms'}

groupby_transform_multi_key[3|4] need 6.5 GB of memory to run on master.

Contributor

jreback commented Aug 29, 2014

why is the memory usage so high? the Cartesian product of the groups is not represented here (it's only the compressed space)

Contributor

behzadnouri commented Aug 29, 2014

The master branch calls into groupsort_indexer with ngroups = np.prod(shape). np.prod(shape) is the size of the Cartesian product space of the unique values across the keys, and groupsort_indexer allocates a counts array of that size.

Elsewhere in the code as well, it falls back on argsort to avoid a memory error.
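To make the blowup concrete, here is an illustrative comparison (not the pandas internals, just the arithmetic) between the Cartesian product space that `np.prod(shape)` sizes the counts array by and the number of groups actually observed:

```python
import numpy as np
import pandas as pd

# Illustrative only: Cartesian product space vs. observed groups.
np.random.seed(2718281)
n = 20000
df = pd.DataFrame(np.random.randint(1, n, (n, 2)), columns=['jim', 'joe'])

shape = [df[k].nunique() for k in ('jim', 'joe')]
cartesian = int(np.prod(shape))                # what ngroups = np.prod(shape) allocates for
observed = df.groupby(['jim', 'joe']).ngroups  # groups actually present (at most n rows)
print(cartesian, observed)
```

With ~12.6k unique values per key, the product space is on the order of 10^8 entries while at most 20,000 groups can actually occur, which is where the multi-gigabyte counts allocation comes from.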

Contributor

jreback commented Aug 29, 2014

argsort is O(n**2) worst case (as the default is quicksort)... but it is only used on the smaller arrays anyhow.
thanks.. this is great! (and the memory issue is avoided!)

@jreback jreback merged commit c5a3514 into pandas-dev:master Aug 29, 2014

1 check passed

continuous-integration/travis-ci The Travis CI build passed
Contributor

jreback commented Aug 29, 2014

thanks! this was great!

Contributor

behzadnouri commented Aug 29, 2014

@jreback After further testing, it seems to me that we need a stable sort for group_index. See for example

I need to change the code to .argsort(kind='mergesort'). Should I make a new pull-request, or can it work within this pull-request?

I did some tests with merge-sort and the benchmarks still look good. It is also in line with the fact that Wes uses merge sort here.
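A minimal NumPy sketch of why the sort kind matters (illustrative; `kind='mergesort'` is guaranteed stable, while the default quicksort makes no stability promise):

```python
import numpy as np

# Ties in the sort key: a stable kind preserves the original row order,
# which is what sorting rows by group labels relies on.
keys = np.array([1, 0, 1, 0, 1, 0])
order = np.argsort(keys, kind='mergesort')  # stable
print(order)  # ties keep input order: zeros at 1, 3, 5, then ones at 0, 2, 4
```

With an unstable kind, rows within the same group could come out in an arbitrary order, which is invisible in benchmarks but wrong for order-sensitive operations.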

Contributor

jreback commented Aug 29, 2014

ok make a new pr

hmm, no tests break.
can you show a test case?

behzadnouri deleted the behzadnouri:groupby-speed-up branch Aug 30, 2014
