performance of DataFrame.apply between 0.12 and 0.13rc1 #5654

Closed
rosnfeld opened this issue Dec 6, 2013 · 1 comment · Fixed by #5656

Labels
Performance Memory or execution speed performance

rosnfeld commented Dec 6, 2013

Here is a small example of a performance regression I've noticed between 0.12 and 0.13rc1:

import numpy as np
import pandas as pd

s = pd.Series(np.arange(4096.))
df = pd.DataFrame({i:s for i in range(4096)})

# under 0.12
%timeit df.apply((lambda x: np.corrcoef(x, s)[0, 1]), axis=1)
1 loops, best of 3: 792 ms per loop

# under 0.13rc1
%timeit df.apply((lambda x: np.corrcoef(x, s)[0, 1]), axis=1)
1 loops, best of 3: 1.7 s per loop

Both timings were run on the same machine. The pip requirements below were identical between the two setups, apart from pandas itself (0.12.0 vs 0.13.0rc1):
Cython==0.19.2
Jinja2==2.7.1
MarkupSafe==0.18
Pygments==1.6
Sphinx==1.1.3
argparse==1.2.1
docutils==0.11
ipython==1.0.0
matplotlib==1.3.0
nose==1.3.0
numpy==1.7.1
pyparsing==2.0.1
python-dateutil==2.2
pytz==2013.8
pyzmq==14.0.1
scipy==0.12.0
six==1.4.1
tornado==3.1.1
wsgiref==0.1.2

jreback commented Dec 6, 2013

This was a missed case, thanks!
0.13 underwent a large internal refactoring, so it was easy to miss some things.

Note that in general you don't want to use apply if you can vectorize the calculation (in this case, of course, you can't).

Best way (unchanged by this PR):

In [6]: %timeit df.corrwith(df,axis=0)
1 loops, best of 3: 703 ms per loop
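
As an illustration of the "vectorize instead of apply" advice, here is a minimal sketch that computes the Pearson correlation of every column against a single series in one NumPy pass. It is not what `corrwith` or the PR does internally; the helper name `corr_columns_with` is hypothetical, and the sketch assumes a recent pandas (for `to_numpy`), identical indexes on both inputs, and no missing values.

```python
import numpy as np
import pandas as pd


def corr_columns_with(df, s):
    """Pearson correlation of each column of df with s (hypothetical helper).

    Assumes df and s share the same index and contain no NaNs.
    """
    x = df.to_numpy(dtype=float)   # shape (n_rows, n_cols)
    y = s.to_numpy(dtype=float)    # shape (n_rows,)

    # Center both inputs so the dot products below are covariances (up to 1/n).
    xc = x - x.mean(axis=0)
    yc = y - y.mean()

    # One matrix-vector product replaces the per-column Python-level loop.
    num = xc.T @ yc
    den = np.sqrt((xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return pd.Series(num / den, index=df.columns)
```

The result should agree with `df.apply(lambda x: np.corrcoef(x, s)[0, 1])` up to floating-point error, while avoiding the construction of one Series per column.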

With #5656:

In [1]: s = Series(np.arange(4096.))

In [2]: df = DataFrame({ i:s for i in range(4096) })

In [3]: %timeit -n 3 df.apply(lambda x: np.corrcoef(x,s)[0,1])
3 loops, best of 3: 1.13 s per loop

In [4]: %timeit -n 3 df.apply(lambda x: np.corrcoef(x.values,s.values)[0,1])
3 loops, best of 3: 938 ms per loop

Before this PR:

In [3]: %timeit -n 3 df.apply(lambda x: np.corrcoef(x,s)[0,1])
3 loops, best of 3: 1.53 s per loop

Under 0.12:

In [3]: %timeit -n 3 df.apply(lambda x: np.corrcoef(x,s)[0,1])
3 loops, best of 3: 793 ms per loop

In [4]: %timeit -n 3 df.apply(lambda x: np.corrcoef(x.values,s.values)[0,1])
3 loops, best of 3: 812 ms per loop
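
For anyone who wants to rerun the Series-versus-`.values` comparison outside IPython, the following is a rough sketch using the standard-library `timeit` module. The function name `time_apply_variants` and the smaller default size are assumptions made for illustration, and `to_numpy()` stands in for `.values` on recent pandas. Absolute numbers will differ by machine and pandas version, so no timings are claimed here.

```python
import timeit

import numpy as np
import pandas as pd


def time_apply_variants(n=256, repeat=3):
    """Return the best wall-clock time (seconds) for each apply variant."""
    s = pd.Series(np.arange(float(n)))
    df = pd.DataFrame({i: s for i in range(n)})
    y = s.to_numpy()  # plain ndarray, extracted once outside the lambda

    variants = {
        # Passes Series objects straight to np.corrcoef, as in the report.
        "series": lambda: df.apply(lambda x: np.corrcoef(x, s)[0, 1]),
        # Passes the underlying arrays, as in the `.values` timings.
        "values": lambda: df.apply(lambda x: np.corrcoef(x.to_numpy(), y)[0, 1]),
    }
    return {
        name: min(timeit.repeat(fn, number=1, repeat=repeat))
        for name, fn in variants.items()
    }
```

Calling `time_apply_variants(n=4096)` mirrors the frame size used in this issue, though it takes correspondingly longer to run.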
