
.iterrows takes too long and generates a large memory footprint #7683

Closed
yrlihuan opened this issue Jul 7, 2014 · 10 comments · Fixed by #7720
Labels: Performance (Memory or execution speed performance)
Milestone: 0.15.1

Comments

yrlihuan (Contributor) commented Jul 7, 2014

When df.iterrows is used on a large DataFrame, it takes a long time to run and consumes a huge amount of memory.

The name of the method implies that it is an iterator and should return almost immediately. However, internally it uses the builtin zip, which can build a huge temporary list of tuples (zip is eager in Python 2).

Below is code that reproduces the issue on a box with 16 GB of memory.

import numpy as np
import pandas as pd

s1 = range(30000000)
s2 = np.random.randn(30000000)
ts = pd.date_range('20140101', freq='S', periods=30000000)
df = pd.DataFrame({'s1': s1, 's2': s2}, index=ts)
for r in df.iterrows():
    break  # expected to return immediately, yet it takes more than 2 minutes and uses 4 GB of memory
jreback (Contributor) commented Jul 7, 2014

What are you doing that requires iterrows? You almost never need it, nor should you use it; use vectorization instead.
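
For illustration (a minimal sketch with a hypothetical small frame, not from the report), a vectorized computation works on whole columns at once, with no Python-level loop over rows:

import numpy as np
import pandas as pd

df = pd.DataFrame({'s1': range(5), 's2': np.random.randn(5)})
# One vectorized call computes the result for every row at once.
df['total'] = df['s1'] + df['s2']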

jreback (Contributor) commented Jul 7, 2014

see here for some tips: #7194

jreback (Contributor) commented Jul 7, 2014

This does return a generator. The problem is that since you have mixed dtypes, it has to create a single-dtyped object before it does anything, which takes a lot of time (the zipping doesn't take much).
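
To illustrate (a minimal sketch with a hypothetical frame, not from the report): with an int64 column and a float64 column, every row yielded by iterrows is squeezed into the common dtype, so the int values come back as floats.

import pandas as pd

df = pd.DataFrame({'s1': [1, 2], 's2': [0.5, 1.5]})
idx, row = next(df.iterrows())
print(row.dtype)   # float64 -- both columns upcast to one dtype
print(row['s1'])   # 1.0 -- the int64 value is now a float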

yrlihuan (Contributor, Author) commented Jul 8, 2014

Profiling shows it has nothing to do with zipping, and it's not about the mixed-dtype DataFrame either. It's slow when DatetimeIndex.__iter__ is called, which appears to create all the Timestamp objects in one shot.

ncalls     tottime  percall  cumtime  percall  filename:lineno(function)
1          0.000    0.000    37.101   37.101   index.py:784(_get_object_index)
10000000   32.939   0.000    32.939   0.000    index.py:785()
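
For reference, the cost can be reproduced with the index alone (same setup as the report above); at the time of this report, DatetimeIndex.__iter__ built every Timestamp before yielding the first one:

import pandas as pd

ts = pd.date_range('20140101', freq='S', periods=30000000)
for t in ts:
    break  # slow: all 30M Timestamps are materialized before the first yield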

jreback (Contributor) commented Jul 8, 2014

You haven't answered the question: why are you using iterrows?

yrlihuan (Contributor, Author) commented Jul 8, 2014

There's a method I want to apply to each row sequentially. The method itself takes some time, so whether or not it is vectorized doesn't make much difference to the running time. I prefer iteration because it gives more control.

jreback (Contributor) commented Jul 8, 2014

You might try iterating over df.T.iteritems(), or better yet use df.apply(..., axis=1).

I suppose iterrows could be updated to iterate over the index lazily, rather than converting it all at once (currently the index loses its identity as an Index and becomes a list). Would you like to submit a pull request for this?
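
A rough sketch of the two suggestions, using the pandas API of the time and a hypothetical small frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'s1': range(5), 's2': np.random.randn(5)})

# Option 1: let pandas drive the row-wise iteration.
result = df.apply(lambda row: row['s1'] + row['s2'], axis=1)

# Option 2: iterate the columns of the transpose, one original row at a time.
# (iteritems was the 2014 spelling; later pandas renamed it to items.)
for label, row in df.T.iteritems():
    pass  # row is a Series holding one row of df

And a rough sketch (not the actual patch) of the lazy-iteration idea: box one index value at a time instead of converting the whole index up front.

import pandas as pd

def iter_timestamps(index):
    # Hypothetical illustration: yield Timestamps one by one from the
    # underlying int64 values rather than building them all in advance.
    for val in index.asi8:
        yield pd.Timestamp(val, tz=index.tz)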

jreback added this to the 0.15.1 milestone on Jul 8, 2014
yrlihuan (Contributor, Author) commented Jul 8, 2014

Sure, I can look into this.

jreback (Contributor) commented Jul 8, 2014

That would be great!

yrlihuan (Contributor, Author) commented Jul 9, 2014

PR submitted #7702
