PERF: transform speedup #6496

Closed
jreback opened this Issue Feb 27, 2014 · 4 comments

Contributor

jreback commented Feb 27, 2014

http://stackoverflow.com/questions/22072943/faster-way-to-transform-group-with-mean-value-in-pandas/22073449#22073449

I think this will actually work in the general case, but it probably only makes sense when you have a cythonized work function (or a ufunc) that is much faster than calling the function on each group iteratively.

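import numpy as np
import pandas as pd

np.random.seed(0)

N = 120000
N_TRANSITIONS = 1400

# generate groups: mark random transition points, then cumsum to get
# a sorted, contiguous group label for every row
transition_points = np.random.permutation(np.arange(N))[:N_TRANSITIONS]
transition_points.sort()
transitions = np.zeros((N,), dtype=bool)  # np.bool is removed in recent numpy
transitions[transition_points] = True
g = transitions.cumsum()

df = pd.DataFrame({"signal": np.random.rand(N)})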
In [44]: grp = df["signal"].groupby(g)

In [45]: result2 = df["signal"].groupby(g).transform(np.mean)

In [47]: %timeit df["signal"].groupby(g).transform(np.mean)
1 loops, best of 3: 535 ms per loop

Using broadcasting

In [43]: result = pd.concat([ pd.Series([r]*len(grp.groups[i])) for i, r in enumerate(grp.mean().values) ],ignore_index=True)

In [42]: %timeit pd.concat([ pd.Series([r]*len(grp.groups[i])) for i, r in enumerate(grp.mean().values) ],ignore_index=True)
10 loops, best of 3: 119 ms per loop

In [46]: result.equals(result2)
Out[46]: True

I think you might need to set the index on the broadcast result (it happens to work here because it's a default index):

result = pd.concat([ pd.Series([r]*len(grp.groups[i])) for i, r in enumerate(grp.mean().values) ],ignore_index=True)
result.index = df.index

The final version is the best:

pd.Series(np.repeat(grp.mean().values, grp.count().values))
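
A minimal, self-contained sketch of the same idea on toy data, for anyone who wants to check it against transform. The data, the names `labels`, `signal` and `fast`, and the use of `size()` instead of `count()` are assumptions for illustration, not part of the benchmark above. It relies on the group labels being sorted and contiguous, as they are for the cumsum-generated `g`; for interleaved groups the repeated values would land on the wrong rows.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Hypothetical toy data: sorted, contiguous group labels and a non-default index.
labels = np.repeat(np.arange(5), [3, 1, 4, 2, 6])
signal = pd.Series(rng.rand(len(labels)),
                   index=np.arange(100, 100 + len(labels)))

grp = signal.groupby(labels)

# Reference result: the per-group transform.
expected = grp.transform("mean")

# Aggregate once per group, then repeat each mean by its group size.
# Only valid because the labels are sorted and contiguous.
means = grp.mean()
sizes = grp.size()
fast = pd.Series(np.repeat(means.values, sizes.values), index=signal.index)

print(np.allclose(fast.values, expected.values))
```

Note that `size()` counts every row in a group while `count()` skips NaN, so the two only agree when the data has no missing values. Passing `index=signal.index` covers the index caveat mentioned earlier.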

jreback added this to the 0.14.0 milestone Feb 27, 2014

Contributor

dsm054 commented Feb 27, 2014

That last one doesn't actually work for me for a familiar reason:

>>> pd.Series(np.repeat(grp.mean().values, grp.count().values))
Traceback (most recent call last):
  File "<ipython-input-12-c5646446d59d>", line 1, in <module>
    pd.Series(np.repeat(grp.mean().values, grp.count().values))
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 391, in repeat
    return repeat(repeats, axis)
TypeError: Cannot cast array data from dtype('int64') to dtype('int32') according to the rule 'safe'

with versions

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.3.final.0
python-bits: 32
OS: Linux
OS-release: 3.2.0-58-generic-pae
machine: i686
processor: i686
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8

pandas: 0.13.1-328-gd0e2a9f
Cython: 0.20
numpy: 1.9.0.dev-3a2f048
scipy: 0.14.0.dev-432f16b
statsmodels: 0.6.0.dev-2bc4041
IPython: 1.2.0
sphinx: 1.2
patsy: 0.1.0
scikits.timeseries: None
dateutil: 1.5
pytz: 2013.9
bottleneck: 0.6.0
tables: 2.4.0
numexpr: 2.0.1
matplotlib: 1.4.x
openpyxl: 1.7.0
xlrd: 0.9.0
xlwt: 0.7.4
xlsxwriter: 0.5.2
lxml: 3.1.0
bs4: 4.1.0
html5lib: 0.95-dev
bq: None
apiclient: 1.2
rpy2: 2.3.9
sqlalchemy: 0.6.6
pymysql: None
psycopg2: None

I guess the numpy fixes from the other day didn't address this case.
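
One possible workaround, sketched under the assumption that the failure comes from np.repeat refusing to safely cast int64 repeat counts to the 32-bit platform index type: cast the counts explicitly to np.intp before calling repeat. The arrays below are made-up stand-ins for `grp.mean().values` and `grp.count().values`, and this has not been verified on the 32-bit build shown above.

```python
import numpy as np

# Stand-ins for grp.mean().values and grp.count().values (hypothetical data).
values = np.array([0.5, 1.5, 2.5])
counts = np.array([2, 1, 3], dtype=np.int64)

# np.intp is the platform index type: int32 on 32-bit builds, int64 on 64-bit.
# astype performs the conversion explicitly, so no implicit "safe" cast is needed.
repeated = np.repeat(values, counts.astype(np.intp))

print(repeated)
```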

Contributor

jreback commented Feb 27, 2014

Actually, you don't need to compute the counts that way; the group lengths can be built like this

(and they may already be available in the grp.grouper object)

lens = {k: len(idx) for k, idx in grp.grouper.groups.items()}
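
For comparison, a small sketch of two ways to get the group lengths through the public groupby API. The toy `keys` array and the names `lens_dict` and `lens` are assumptions for illustration; `grp.groups` and `grp.size()` are used here rather than the `grouper` attribute, which may not be available in every pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy grouping: three groups of lengths 2, 1 and 3.
keys = np.array([0, 0, 1, 2, 2, 2])
grp = pd.Series(np.arange(6.0)).groupby(keys)

# Dict of group key -> number of rows, built from the groups mapping.
lens_dict = {k: len(idx) for k, idx in grp.groups.items()}

# The same information as a Series, computed by pandas in one call.
lens = grp.size()

print(lens_dict)
print(lens)
```

`size()` returns the lengths in sorted key order, which is the order `grp.mean()` uses, so the two line up for a repeat-style broadcast.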
Contributor

jreback commented Apr 9, 2014

@dsm054 want to implement this?

jreback modified the milestone: 0.14.1, 0.14.0 Apr 28, 2014

Contributor

jreback commented Jun 10, 2014

@dsm054 interested?

jreback closed this in #7421 Jun 11, 2014
