
Implement fast Cython Series iterator, for speeding up DataFrame.apply #309

Closed
wesm opened this issue Oct 31, 2011 · 7 comments

@wesm
Member

commented Oct 31, 2011

Having tons of calls to Series.__new__ seriously degrades performance because most of the logic isn't necessary. We could play tricks in Cython with the data pointers to avoid this.
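
Not part of the thread, but the overhead described here can be sketched with a quick timing comparison using the modern pandas API (numbers are machine-dependent and illustrative only):

```python
import timeit

import numpy as np
import pandas as pd

arr = np.random.random(1000)
idx = pd.RangeIndex(1000)

# Re-wrapping the same data in a Series pays Series.__new__ (index
# handling, dtype checks) on every call; an ndarray view does not.
t_series = timeit.timeit(lambda: pd.Series(arr, index=idx), number=10_000)
t_view = timeit.timeit(lambda: arr[:], number=10_000)
print(f"Series construction: {t_series:.4f}s  ndarray view: {t_view:.4f}s")
```

The gap between the two timings is the per-slice cost that a Cython fast path (or the `raw=True` option discussed below in the thread) avoids.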

@natekupp


commented Nov 11, 2011

Hey Wes - any way I can help on this? I just ran into this on my own, then came here and found your open issue. Some code that demonstrates the performance issue:

import pandas, time
import numpy as np
data  = pandas.DataFrame(np.random.random((10000,100)))
fn    = lambda x: len(np.unique(x)) > 100

start = time.time()
data.apply(fn, axis=0)
print(time.time() - start)

start = time.time()
np.apply_along_axis(fn, 0, data)
print(time.time() - start)

## Output
4.69282603264
0.103554964066

My use case is similar to the above example, so it'd be great to close the performance gap between DataFrame.apply and np.apply_along_axis. From your comment I'm guessing that, at the moment, DataFrame.apply calls Series.__new__ for every Series in the DataFrame? Thanks!

@wesm

Member Author

commented Nov 12, 2011

Low-hanging fruit would be an option in apply that calls np.apply_along_axis. The reason it doesn't already is that apply by default assumes each slice is a Series, which in your case may not be strictly necessary.

@wesm

Member Author

commented Nov 12, 2011

Maybe something like:

df.apply(f, axis=0, raw=True)
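
For context, this `raw=True` option did land in pandas and is still part of the `DataFrame.apply` API today. A small sketch of the semantic difference (hypothetical variable names, modern pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((1000, 10)))

# Record what type each column arrives as inside the applied function.
types_default = df.apply(lambda col: type(col).__name__, axis=0)
types_raw = df.apply(lambda col: type(col).__name__, axis=0, raw=True)

# Default apply hands each column to the function as a Series;
# raw=True hands over the underlying ndarray, skipping Series
# construction per column.
print(types_default.iloc[0], types_raw.iloc[0])
```

The trade-off is exactly the one described above: with `raw=True` the function loses the index and other Series niceties, but per-slice construction overhead disappears.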
@wesm

Member Author

commented Nov 13, 2011

What version of pandas are you using? I fixed a performance problem that was causing np.unique to be very slow.

@natekupp


commented Nov 13, 2011

I was on the latest version from PyPI. I just installed from the GitHub source and it looks much better:

## Output
0.256111860275
0.103078842163

Thanks!

@wesm

Member Author

commented Nov 13, 2011

OK, I made some further tweaks, so apply now actually beats apply_along_axis by quite a bit in the axis=1 case with your example (most of the time is spent calling unique in the axis=0 case):

In [6]: timeit data.apply(fn, axis=1, raw=True)
1 loops, best of 3: 288 ms per loop

In [7]: timeit data.apply(fn, axis=0, raw=True)
10 loops, best of 3: 82 ms per loop

In [8]: timeit np.apply_along_axis(fn, 1, data.values)
1 loops, best of 3: 518 ms per loop

In [9]: timeit np.apply_along_axis(fn, 0, data.values)
10 loops, best of 3: 82.7 ms per loop
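
Not part of the thread, but the comparison above can be reproduced today roughly as follows (a sketch; absolute timings will differ by machine and pandas version):

```python
import timeit

import numpy as np
import pandas as pd

data = pd.DataFrame(np.random.random((10000, 100)))
fn = lambda x: len(np.unique(x)) > 100

# raw=True hands each column to fn as a bare ndarray, skipping
# per-column Series construction; the results should be identical
# to both the default path and np.apply_along_axis.
res_default = data.apply(fn, axis=0)
res_raw = data.apply(fn, axis=0, raw=True)
res_numpy = np.apply_along_axis(fn, 0, data.values)

t_raw = timeit.timeit(lambda: data.apply(fn, axis=0, raw=True), number=3)
print(f"apply(raw=True), 3 runs: {t_raw:.3f}s")
```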

@wesm wesm closed this Nov 13, 2011

@natekupp


commented Nov 14, 2011

Thanks Wes!
