.iterrows takes too long and generates a large memory footprint #7683
Comments
what are you doing that requires iterrows?
see here for some tips: pydata#7194
This does return a generator. The problem is that since you have mixed dtypes, it has to create a single-dtyped object, BEFORE IT DOES ANYTHING, which takes a lot of time (the zipping doesn't take much).
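A small illustration of the consolidation being described (this snippet is mine, not from the thread): when a frame mixes numeric and object columns, each row yielded by `.iterrows` must be upcast to a common dtype, so per-column dtypes are not preserved.

```python
import pandas as pd

# Hypothetical demonstration frame: one int column, one string column.
df = pd.DataFrame({"n": [1, 2], "s": ["a", "b"]})
print(df.dtypes.tolist())   # int64 and object

# iterrows yields (index, Series) pairs; the Series must hold both
# columns, so everything is consolidated into a single object dtype.
idx, row = next(df.iterrows())
print(row.dtype)            # object — the int column was upcast
```

This consolidation cost is paid per row, which is part of why `.iterrows` is slow on large mixed-dtype frames.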
Profiling shows it has nothing to do with zipping, though it's not about the mixed-dtype data frame either. It's slow when
you haven't answered the question: why are you using iterrows?
There's a method I want to apply to each row sequentially. The method itself takes some time, so vectorizing it doesn't make much difference to the running time. I prefer iteration because it gives more control.
you might try iterating over the index. I suppose this could be updated to iterate over the index lazily, rather than all at once (as it stands, it loses its identity as an Index and becomes a list). Would you like to submit a pull-request for this?
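The suggestion above can be sketched roughly as follows (my sketch, not code from the thread): fetch rows one at a time by iterating over the index, rather than materializing every row up front.

```python
import pandas as pd

# Hypothetical small frame for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [0.1, 0.2, 0.3]})

# df.index is iterated lazily; each row is looked up on demand.
# (Each .loc lookup still builds a Series, so this trades memory for time.)
for idx in df.index:
    row = df.loc[idx]
    # ... process row ...
```

This avoids holding all row objects at once, though per-row `.loc` lookups carry their own overhead.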
jreback added this to the 0.15.1 milestone on Jul 8, 2014
jreback added the Performance label on Jul 8, 2014
sure, I can look into this
that would be gr8!
PR submitted #7702
yrlihuan commented Jul 7, 2014
When using df.iterrows on a large data frame, it takes a long time to run and consumes a huge amount of memory.
The name of the function implies that it is an iterator and should not take much to run.
However, in the method it uses the builtin 'zip', which can sometimes generate a huge temporary list of tuples if optimisation is not done correctly. Below is code which can reproduce the issue on a box with 16GB memory.
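The original reproduction snippet was not preserved in this copy of the thread; the following is a sketch in the same spirit, with the row count scaled down for illustration (scale `n` toward tens of millions of rows to approach the 16GB-class footprint described).

```python
import numpy as np
import pandas as pd

# Hypothetical mixed-dtype frame; n is illustrative, not the original size.
n = 10_000
df = pd.DataFrame({
    "a": np.arange(n),           # int64
    "b": np.random.randn(n),     # float64
    "c": np.array(["x"] * n),    # object
})

# Each iteration builds a fresh object-dtype Series for the row; on
# Python 2, the builtin zip() used internally also materialized a full
# list of row tuples before yielding anything.
count = 0
for idx, row in df.iterrows():
    count += 1

print(count)   # 10000
```

With a lazy iterator (e.g. `itertools.izip` on Python 2, as done in PR #7702), the up-front list of tuples is avoided.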