.iterrows takes too long and generates a large memory footprint #7683

Closed
yrlihuan opened this Issue Jul 7, 2014 · 10 comments

Contributor

yrlihuan commented Jul 7, 2014

When using df.iterrows on a large DataFrame, it takes a long time to run and consumes a huge amount of memory.

The name of the function implies that it is an iterator, so calling it should cost almost nothing. However, the method uses the builtin 'zip', which can generate a huge temporary list of tuples if optimisation is not done correctly.

Below is code that reproduces the issue on a box with 16 GB of memory.

import numpy as np
import pandas as pd

s1 = range(30000000)
s2 = np.random.randn(30000000)
ts = pd.date_range('20140101', freq='S', periods=30000000)
df = pd.DataFrame({'s1': s1, 's2': s2}, index=ts)
for r in df.iterrows():
    break  # expected to return immediately, yet it takes more than 2 minutes and uses 4 GB of memory

jreback commented Jul 7, 2014

what are you doing that requires iterrows? you almost never need this, nor should you use it. use vectorization instead.
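To illustrate the point, here is a minimal sketch (mine, not from the thread) contrasting a row-by-row loop with the vectorized equivalent; the example data is arbitrary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'s1': range(5), 's2': np.random.randn(5)})

# Row-wise loop: each step materializes a new Series, which is slow.
total_loop = sum(row['s1'] + row['s2'] for _, row in df.iterrows())

# Vectorized form: a single NumPy addition over whole columns.
total_vec = (df['s1'] + df['s2']).sum()

assert np.isclose(total_loop, total_vec)
```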


jreback commented Jul 7, 2014

see here for some tips: pydata#7194


jreback commented Jul 7, 2014

This does return a generator. The problem is that since you have mixed dtypes, it has to create a single-dtype object, BEFORE IT DOES ANYTHING, which takes a lot of time (the zipping doesn't take much).
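The dtype coercion described here can be seen on a tiny frame; this sketch (mine, not from the thread) shows an int column and a float column being upcast to a single float64 Series per row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'s1': [1, 2], 's2': [0.5, 1.5]})
assert str(df['s1'].dtype) == 'int64'    # columns keep their own dtypes
assert str(df['s2'].dtype) == 'float64'

# iterrows yields each row as a single Series, so the int column
# is upcast to the common dtype (here float64).
_, row = next(df.iterrows())
assert str(row.dtype) == 'float64'
assert isinstance(row['s1'], (float, np.floating))
```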


yrlihuan commented Jul 8, 2014

Profiling shows it has nothing to do with zipping, though it's not about the mixed-dtype frame either. It's slow because DatetimeIndex.__iter__ is called, which creates all the Timestamp objects in one shot.

  ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
       1     0.000    0.000   37.101   37.101  index.py:784(_get_object_index)
10000000    32.939    0.000   32.939    0.000  index.py:785()
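The boxing behind those numbers is easy to observe: iterating a DatetimeIndex yields Timestamp objects rather than the raw datetime64 values it stores. A small sketch of mine (not from the thread):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2014-01-01', freq='s', periods=3)

# The underlying storage is a datetime64[ns] array...
assert idx.values.dtype == np.dtype('datetime64[ns]')

# ...but iteration boxes each element into a Timestamp object,
# which is what makes eagerly converting 30M elements so costly.
first = next(iter(idx))
assert isinstance(first, pd.Timestamp)
```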


jreback commented Jul 8, 2014

you haven't answered the question

why are you using iterrows?


yrlihuan commented Jul 8, 2014

There's a method I want to apply to each row sequentially. The method itself takes some time, so vectorizing it or not doesn't make much difference to the running time. I prefer iteration because it gives more control.


jreback commented Jul 8, 2014

you might try iterating over df.T.iteritems(), or better yet use df.apply(...., axis=1)

I suppose this could be updated to iterate over the index lazily, rather than converting it all at once (as it loses its identity as an Index and becomes a list). Would you like to submit a pull-request for this?
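A sketch of the df.apply(..., axis=1) route suggested above, with made-up example data and a placeholder per-row function (note also that in pandas 2.0+ iteritems has been renamed items):

```python
import pandas as pd

df = pd.DataFrame({'s1': [1, 2, 3], 's2': [10, 20, 30]})

def per_row(row):
    # stand-in for the sequential per-row method from the discussion
    return row['s1'] + row['s2']

# apply with axis=1 calls per_row once per row, passing each row
# as a Series, without going through iterrows.
result = df.apply(per_row, axis=1)
assert list(result) == [11, 22, 33]
```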

jreback added this to the 0.15.1 milestone Jul 8, 2014

jreback added the Performance label Jul 8, 2014


yrlihuan commented Jul 8, 2014

sure. i can look into this


jreback commented Jul 8, 2014

that would be gr8!


yrlihuan commented Jul 9, 2014

PR submitted #7702

@jreback jreback modified the milestone: 0.15.0, 0.15.1 Jul 9, 2014

jreback closed this in #7720 Jul 16, 2014
