.iterrows takes too long and generates a large memory footprint #7683

Closed
yrlihuan opened this Issue Jul 7, 2014 · 10 comments

Contributor

yrlihuan commented Jul 7, 2014

When using df.iterrows on a large DataFrame, it takes a long time to run and consumes a huge amount of memory.

The name of the function implies that it is an iterator, so calling it should cost almost nothing. However, the method uses the builtin 'zip', which can generate a huge temporary list of tuples if optimisation is not done correctly.

Below is code that reproduces the issue on a box with 16 GB of memory.

import numpy as np
import pandas as pd

s1 = range(30000000)
s2 = np.random.randn(30000000)
ts = pd.date_range('20140101', freq='S', periods=30000000)
df = pd.DataFrame({'s1': s1, 's2': s2}, index=ts)
for r in df.iterrows():
    break  # expected to return immediately, yet it takes more than 2 minutes and uses 4 GB of memory

jreback commented Jul 7, 2014

what are you doing that requires iterrows? you almost never need this, nor should you use it. use vectorization instead.
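To illustrate the point, here is a minimal sketch (mine, not from the thread) contrasting a row-by-row loop with the vectorized equivalent; the example data is arbitrary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'s1': range(5), 's2': np.random.randn(5)})

# Row-wise loop: each step materializes a new Series, which is slow.
total_loop = sum(row['s1'] + row['s2'] for _, row in df.iterrows())

# Vectorized form: a single NumPy addition over whole columns.
total_vec = (df['s1'] + df['s2']).sum()

assert np.isclose(total_loop, total_vec)
```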


jreback commented Jul 7, 2014

see here for some tips: pydata#7194


jreback commented Jul 7, 2014

This does return a generator. The problem is that since you have mixed dtypes, it has to create a single-dtype object, BEFORE IT DOES ANYTHING, which takes a lot of time (the zipping doesn't take much).
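The dtype coercion described here can be seen on a tiny frame; this sketch (mine, not from the thread) shows an int column and a float column being upcast to a single float64 Series per row:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'s1': [1, 2], 's2': [0.5, 1.5]})
assert str(df['s1'].dtype) == 'int64'    # columns keep their own dtypes
assert str(df['s2'].dtype) == 'float64'

# iterrows yields each row as a single Series, so the int column
# is upcast to the common dtype (here float64).
_, row = next(df.iterrows())
assert str(row.dtype) == 'float64'
assert isinstance(row['s1'], (float, np.floating))
```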


yrlihuan commented Jul 8, 2014

Profiling shows it has nothing to do with zipping, though it's not about the mixed-dtype frame either. It's slow because DatetimeIndex.__iter__ is called, which creates all the Timestamp objects in one shot.

  ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
       1     0.000    0.000   37.101   37.101  index.py:784(_get_object_index)
10000000    32.939    0.000   32.939    0.000  index.py:785()
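The boxing behind those numbers is easy to observe: iterating a DatetimeIndex yields Timestamp objects rather than the raw datetime64 values it stores. A small sketch of mine (not from the thread):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2014-01-01', freq='s', periods=3)

# The underlying storage is a datetime64[ns] array...
assert idx.values.dtype == np.dtype('datetime64[ns]')

# ...but iteration boxes each element into a Timestamp object,
# which is what makes eagerly converting 30M elements so costly.
first = next(iter(idx))
assert isinstance(first, pd.Timestamp)
```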


jreback commented Jul 8, 2014

you haven't answered the question

why are you using iterrows?


yrlihuan commented Jul 8, 2014

There's a method I want to apply to each row sequentially. The method itself takes some time, so vectorizing it or not doesn't make much difference to the running time. I prefer iteration because it gives more control.


jreback commented Jul 8, 2014

you might try iterating over df.T.iteritems(), or better yet use df.apply(...., axis=1)

I suppose this could be updated to iterate over the index lazily, rather than converting it all at once (as it loses its identity as an Index and becomes a list). Would you like to submit a pull-request for this?
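A sketch of the df.apply(..., axis=1) route suggested above, with made-up example data and a placeholder per-row function (note also that in pandas 2.0+ iteritems has been renamed items):

```python
import pandas as pd

df = pd.DataFrame({'s1': [1, 2, 3], 's2': [10, 20, 30]})

def per_row(row):
    # stand-in for the sequential per-row method from the discussion
    return row['s1'] + row['s2']

# apply with axis=1 calls per_row once per row, passing each row
# as a Series, without going through iterrows.
result = df.apply(per_row, axis=1)
assert list(result) == [11, 22, 33]
```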

jreback added this to the 0.15.1 milestone Jul 8, 2014

jreback added the Performance label Jul 8, 2014


yrlihuan commented Jul 8, 2014

sure. i can look into this


jreback commented Jul 8, 2014

that would be gr8!


yrlihuan commented Jul 9, 2014

PR submitted #7702

@jreback jreback modified the milestone: 0.15.0, 0.15.1 Jul 9, 2014

jreback closed this in #7720 Jul 16, 2014
