GroupBy shifting performance issue #2162

Closed
wesm opened this Issue Nov 2, 2012 · 2 comments


wesm commented Nov 2, 2012

import numpy as np
import pandas as pd

# 48000 pools, each with a random number of monthly rows (at least 1)
ids = np.arange(48000)
lens = np.maximum(np.round(15+9.5*np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
# month counter that restarts at 0 for every pool
lens_shift = np.concatenate(([0], lens[:-1]))
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
# shift every column by one row within each pool (IPython %time magic)
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))

xref http://stackoverflow.com/questions/13180499/most-efficient-way-to-shift-multiindex-time-series
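For anyone landing here from the linked question, one way around the per-group overhead is to shift the whole frame once and then blank out the rows where a shift would have pulled data across a pool boundary. The following is only a sketch under assumptions: `shift_back_within_level` is a hypothetical helper name, it handles only the `shift(-1)` case from this issue, and it assumes the rows of each pool are contiguous (as they are in the frame built above).

```python
import numpy as np
import pandas as pd


def shift_back_within_level(frame, level):
    """Equivalent of a per-group shift(-1), assuming each group's rows are contiguous.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Shift the whole frame once, ignoring group boundaries.
    shifted = frame.shift(-1)
    # A row is the last of its group when the next row belongs to another group.
    keys = frame.index.get_level_values(level).to_numpy()
    is_last = np.r_[keys[1:] != keys[:-1], True]
    # Those rows received a value from the next group, so blank them out.
    shifted.loc[is_last, :] = np.nan
    return shifted


# Tiny stand-in for the frame above: pools with 2, 3 and 1 months.
index = pd.MultiIndex.from_arrays(
    [[0, 0, 1, 1, 1, 2], [0, 1, 0, 1, 2, 0]], names=["pool", "month"]
)
tiny = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]}, index=index)
tiny_shift = shift_back_within_level(tiny, "pool")
print(tiny_shift)
```

The cost is one whole-frame shift plus one boolean mask, independent of the number of pools.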

wesm added a commit that referenced this issue Nov 2, 2012

ENH: revert Index mutability change. improve performance of dropna by using take. related to #2162
669c606

jreback commented Mar 27, 2013

This is obviously affected by the issues fixed in pydata#3145.
We should still probably add it to the vbenchs and see what we can do.

In [11]: %time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
CPU times: user 10.13 s, sys: 0.18 s, total: 10.31 s
Wall time: 10.35 s

In [12]: pd.__version__
Out[12]: '0.11.0.dev-e6140e9'

In [11]: %time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
CPU times: user 51.07 s, sys: 0.25 s, total: 51.32 s
Wall time: 51.48 s

In [13]: pd.__version__
Out[13]: '0.10.1'

  Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    96005    1.092    0.000    2.360    0.000 common.py:456(take_nd)
    48000    0.560    0.000    3.790    0.000 index.py:1536(values)
    96006    0.486    0.000    0.895    0.000 internals.py:762(make_block)
   144007    0.474    0.000    0.764    0.000 common.py:690(_maybe_promote)
    96006    0.450    0.000    1.504    0.000 internals.py:818(__init__)
    48000    0.445    0.000    4.000    0.000 frame.py:3975(shift)
    48000    0.426    0.000    0.958    0.000 index.py:1817(__getitem__)
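A listing in this shape can be produced with the standard-library `cProfile` and `pstats` modules, sorted by `tottime`. The sketch below runs against a scaled-down frame (200 pools of 5 months, an arbitrary size chosen for illustration), so the call counts and timings will not match the numbers above.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Scaled-down stand-in for the benchmark frame: 200 pools x 5 months.
index = pd.MultiIndex.from_product(
    [range(200), range(5)], names=["pool", "month"]
)
small = pd.DataFrame(np.random.rand(len(index), 2), index=index, columns=list("ab"))

# Profile only the grouped transform.
profiler = cProfile.Profile()
profiler.enable()
result = small.groupby(level=0).transform(lambda x: x.shift(-1))
profiler.disable()

# Sort by internal time (tottime) and keep the top 7 entries.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("tottime").print_stats(7)
report = buf.getvalue()
print(report)
```

Per-group calls showing up tens of thousands of times, as in the table above, are the sign that the cost comes from invoking the function once per pool rather than from the shift itself.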

We should probably have a shift-like operator built into apply/transform rather than going through a generic apply;
that would obviously be much faster.
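For reference, pandas versions that expose `shift` directly on the groupby object let you skip the lambda entirely. The comparison below is a minimal sketch on a tiny hand-made frame (`df_small` is an illustrative name); it only checks that the two spellings agree and makes no claim about how the built-in is implemented or about timings.

```python
import pandas as pd

# Tiny frame with the same (pool, month) index layout as the benchmark.
index = pd.MultiIndex.from_arrays(
    [[0, 0, 1, 1, 1, 2], [0, 1, 0, 1, 2, 0]], names=["pool", "month"]
)
df_small = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]}, index=index)

# Generic path: the lambda is called on each pool separately.
generic = df_small.groupby(level=0).transform(lambda x: x.shift(-1))

# Built-in path: shift is a groupby method, no Python-level function per pool.
builtin = df_small.groupby(level=0).shift(-1)

# Raises AssertionError if the two results differ.
pd.testing.assert_frame_equal(generic, builtin)
```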


jreback commented Sep 21, 2013

fixed as indicated above

jreback closed this Sep 21, 2013
