memory error when skipping rows #8681

Closed
nkulki opened this issue Oct 30, 2014 · 10 comments · Fixed by #8752
Labels: IO CSV, Performance


nkulki commented Oct 30, 2014

I have a file with over 100 million rows. When I do

    pd.read_csv(filename, skiprows=100000000, iterator=True)

Python crashes with a memory error. I have 32 GB of memory and Python eats it all up!


jreback commented Oct 30, 2014

If you specify chunksize=1000000 (or something similar), I think it might help.

That said, it could still be a bug. Would appreciate you having a deeper look if you can.
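
For reference, a chunked read looks roughly like this (the filename and the per-chunk handling are placeholders, not from this issue):

```python
import pandas as pd

# Read the file in pieces so only ~1M rows are held in memory at a time.
# "huge_file.csv" and handle_chunk() are placeholders for illustration.
reader = pd.read_csv("huge_file.csv", chunksize=1000000)

for chunk in reader:
    handle_chunk(chunk)  # e.g. filter/aggregate and discard the chunk
```

Note this only bounds the size of the parsed chunks; as the rest of the thread suggests, the spike here comes from how skiprows itself is expanded.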

jreback added the IO CSV, Bug, and Performance labels and removed the Bug label on Oct 30, 2014
jreback added this to the 0.16.0 milestone on Oct 30, 2014

jreback commented Oct 30, 2014

prob related #8661, #8679

cc @mdmueller


nkulki commented Oct 30, 2014

Hi,
I tried your suggestion, but unfortunately it did not really help. I don't have a dev version of pandas installed, but reading through the code, I suspect the issue is the use of lrange(skiprows):

    if com.is_integer(skiprows):
        skiprows = lrange(skiprows)

If skiprows is 100M, lrange will generate a very large list.
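
A rough back-of-the-envelope (my own illustration, assuming a 64-bit CPython 3 build) of what that expansion costs:

```python
import sys

n = 10 ** 8  # skiprows passed as a plain int (100M)

# A list of n distinct ints costs one 8-byte pointer per slot plus roughly
# 28 bytes per small int object, before the parser does anything with it.
pointer_bytes = n * 8
int_bytes = n * sys.getsizeof(10 ** 7)   # ~28 bytes per int on 64-bit CPython 3
print((pointer_bytes + int_bytes) / 1e9, "GB just to represent skiprows")
```

That is already several gigabytes before any copies or conversions the parser may make on top of it.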


nkulki commented Oct 30, 2014

I know that skiprows can be an array of ints or a single int. I think that if it's simply an int, we should use a more efficient code path to skip the rows instead of generating a 100M-element array.
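
A minimal sketch of that idea (plain csv module, illustrative names only, not pandas internals): when skiprows is an int, a comparison against the running row number is enough, and no 100M-element container is ever built.

```python
import csv

def iter_data_rows(path, skiprows):
    """Yield rows, skipping either the first `skiprows` rows (int case)
    or the explicitly listed row numbers (list/set case)."""
    with open(path, newline="") as f:
        for i, row in enumerate(csv.reader(f)):
            if isinstance(skiprows, int):
                if i < skiprows:      # O(1) check, no big list in memory
                    continue
            elif i in skiprows:       # original behaviour for explicit rows
                continue
            yield row
```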


jreback commented Oct 30, 2014

Yep, that sounds right, but it would need some modifications in the code to deal with that.

have a go!


nkulki commented Oct 30, 2014

Do you have any suggestions on how best to tackle this issue?



jreback commented Oct 31, 2014

So currently the skipped rows are transformed into a list of row numbers with range (this is why it blows up memory). You could instead change the impl a bit: pass a list if a list was originally passed (e.g. skiprows=[0,3,5]), and otherwise pass, say, a slice to represent the interval. This object appears as .skiprows in parser.pyx and is then passed directly to tokenizer.c.

You prob need to change the c-impl as well (so maybe keep the original as skiprows_list and add a new skiprows_slice). The original is handled just as it was; the new one can do a range check (all done in c-land) to just skip a row when its index >= slice.start and < slice.end.

So a bit involved, but it would be a nice change.
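
A Python-level sketch of that dispatch (the actual change would live in parser.pyx / tokenizer.c, so the names here are illustrative only):

```python
def normalize_skiprows(skiprows):
    # Keep an explicit list as a list; turn a plain int into an interval,
    # modelled here with a builtin slice for illustration.
    if isinstance(skiprows, int):
        return slice(0, skiprows)        # skip rows 0 .. skiprows-1
    return list(skiprows)

def should_skip(row_number, skiprows):
    # Mirrors the check described above: list membership for the explicit
    # case, a start/stop comparison for the interval case.
    if isinstance(skiprows, slice):
        return skiprows.start <= row_number < skiprows.stop
    return row_number in skiprows
```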


jreback commented Oct 31, 2014

@cpcloud


guyrt commented Dec 12, 2014

This bug is listed in the latest release notes as closed:
http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0152-performance

Is it actually fixed?


jreback commented Dec 12, 2014

It was linked incorrectly; the PR is here: #8752.

Yes, this is fixed.

jreback closed this as completed on Dec 12, 2014