memory error when skipping rows #8681

Closed
nkulki opened this issue Oct 30, 2014 · 10 comments · Fixed by #8752
Labels: IO CSV, Performance


nkulki commented Oct 30, 2014

I have a file with over 100 million rows. When I do

    pd.read_csv(filename, skiprows=100000000, iterator=True)

Python crashes with a memory error. I have 32 GB of memory and Python eats it all up!


jreback commented Oct 30, 2014

If you specify chunksize=1000000 (or something similar), I think it might help.

That said, it could still be a bug. Would appreciate you having a deeper look if you can.
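
For reference, a chunked read looks roughly like this (the filename and the per-chunk handling are placeholders, not from this issue):

```python
import pandas as pd

# Read the file in pieces so only ~1M rows are held in memory at a time.
# "huge_file.csv" and handle_chunk() are placeholders for illustration.
reader = pd.read_csv("huge_file.csv", chunksize=1000000)

for chunk in reader:
    handle_chunk(chunk)  # e.g. filter/aggregate and discard the chunk
```

Note this only bounds the size of the parsed chunks; as the rest of the thread suggests, the spike here comes from how skiprows itself is expanded.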

jreback added the IO CSV, Bug, and Performance labels and removed the Bug label on Oct 30, 2014
jreback added this to the 0.16.0 milestone on Oct 30, 2014

jreback commented Oct 30, 2014

prob related #8661, #8679

cc @mdmueller


nkulki commented Oct 30, 2014

Hi,
I tried your suggestion, but unfortunately it did not really help. I don't have a dev version of pandas installed, but reading through the code, I suspect the issue is the use of lrange(skiprows):

    if com.is_integer(skiprows):
        skiprows = lrange(skiprows)

If skiprows is 100M, lrange will generate a very large list.
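
A rough back-of-the-envelope (my own illustration, assuming a 64-bit CPython 3 build) of what that expansion costs:

```python
import sys

n = 10 ** 8  # skiprows passed as a plain int (100M)

# A list of n distinct ints costs one 8-byte pointer per slot plus roughly
# 28 bytes per small int object, before the parser does anything with it.
pointer_bytes = n * 8
int_bytes = n * sys.getsizeof(10 ** 7)   # ~28 bytes per int on 64-bit CPython 3
print((pointer_bytes + int_bytes) / 1e9, "GB just to represent skiprows")
```

That is already several gigabytes before any copies or conversions the parser may make on top of it.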


nkulki commented Oct 30, 2014

I know that skiprows can be an array of ints or a single int. I think that if it's simply an int, we should use a more efficient code path to skip the rows instead of generating a 100M-element array.
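
A minimal sketch of that idea (plain csv module, illustrative names only, not pandas internals): when skiprows is an int, a comparison against the running row number is enough, and no 100M-element container is ever built.

```python
import csv

def iter_data_rows(path, skiprows):
    """Yield rows, skipping either the first `skiprows` rows (int case)
    or the explicitly listed row numbers (list/set case)."""
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if isinstance(skiprows, int):
                if i < skiprows:      # O(1) check, no big list in memory
                    continue
            elif i in skiprows:       # original behaviour for explicit rows
                continue
            yield row
```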


jreback commented Oct 30, 2014

Yep, that sounds right, but it would need some modifications in the code to deal with that.

have a go!


nkulki commented Oct 30, 2014

Do you have any suggestions on how best to tackle this issue?



jreback commented Oct 31, 2014

So currently the skipped rows are transformed into a list of row numbers with range (this is why it blows up memory). You could instead change the impl a bit: pass a list if a list was originally passed (e.g. skiprows=[0,3,5]), and otherwise pass, say, a slice to represent the interval. This object appears as .skiprows in parser.pyx and is then passed directly to tokenizer.c.

You prob need to change the c-impl as well (so maybe keep the original as skiprows_list and add a new skiprows_slice). The original is handled just as it was; the new one can do a range check (all done in c-land) to just skip a row when its index >= slice.start and < slice.end.

So a bit involved, but it would be a nice change.
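
A Python-level sketch of that dispatch (the actual change would live in parser.pyx / tokenizer.c, so the names here are illustrative only):

```python
def normalize_skiprows(skiprows):
    # Keep an explicit list as a list; turn a plain int into an interval,
    # modelled here with a builtin slice for illustration.
    if isinstance(skiprows, int):
        return slice(0, skiprows)        # skip rows 0 .. skiprows-1
    return list(skiprows)

def should_skip(row_number, skiprows):
    # Mirrors the check described above: list membership for the explicit
    # case, a start/stop comparison for the interval case.
    if isinstance(skiprows, slice):
        return skiprows.start <= row_number < skiprows.stop
    return row_number in skiprows
```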


jreback commented Oct 31, 2014

@cpcloud


guyrt commented Dec 12, 2014

This bug is listed in the latest release notes as closed:
http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0152-performance

Is it actually fixed?


jreback commented Dec 12, 2014

It was linked incorrectly; the PR is here: #8752.

Yes, this is fixed.

jreback closed this as completed on Dec 12, 2014