BUG: Don't over-optimize memory with jagged CSV #23527

gfyoung · 2018-11-06T09:08:08Z

With jagged CSVs, we risk being too quick to dump memory that we need to allocate because previous chunks would have indicated much larger rows than we can anticipate in subsequent chunks.

Closes #23509.
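
For readers unfamiliar with the problem, here is a minimal toy sketch of why trimming a buffer after every chunk hurts with jagged input. It is not the `tokenizer.c` change itself: the function `count_regrowths`, its parameters, and the chunk sizes are all hypothetical, and the model only counts how often a buffer would have to grow again.

```python
def count_regrowths(chunk_sizes, trim_after_each_chunk):
    """Count how often a toy buffer must grow while parsing chunks.

    Hypothetical model, not pandas code: `chunk_sizes` is the space each
    chunk needs, and `capacity` is the space currently reserved.
    """
    capacity = 0
    regrowths = 0
    for needed in chunk_sizes:
        if needed > capacity:
            # The reserved space is too small, so it has to grow.
            capacity = needed
            regrowths += 1
        if trim_after_each_chunk:
            # Eager trimming: shrink to exactly what this chunk used,
            # forgetting that earlier chunks were much larger.
            capacity = needed
    return regrowths


# Jagged input: wide chunks alternate with narrow ones.
jagged = [1000, 10, 1000, 10, 1000]
print(count_regrowths(jagged, trim_after_each_chunk=True))   # 3
print(count_regrowths(jagged, trim_after_each_chunk=False))  # 1
```

In this toy model, keeping the largest capacity seen so far means growing once, while eager trimming regrows the buffer for every wide chunk.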

pep8speaks · 2018-11-06T09:08:12Z

Hello @gfyoung! Thanks for submitting the PR.

There are no PEP8 issues in the file pandas/tests/io/parser/common.py !

codecov · 2018-11-06T09:49:11Z

Codecov Report

Merging #23527 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #23527   +/-   ##
=======================================
  Coverage   92.23%   92.23%           
=======================================
  Files         161      161           
  Lines       51324    51324           
=======================================
  Hits        47339    47339           
  Misses       3985     3985

Flag	Coverage Δ
#multiple	`90.62% <ø> (ø)`	⬆️
#single	`42.29% <ø> (ø)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update dcb8b6a...17f7822.

gfyoung · 2018-11-07T17:11:10Z

@jreback : Any thoughts on this one? (I only removed the 0.24.0 label because of the group discussion to not add the milestone unless it was about to be merged...)

jreback · 2018-11-08T13:34:31Z

can you run the current benchmarks on this. do we have any that specifically target this?

gfyoung · 2018-11-08T16:59:36Z

@jreback : No specific benchmarks as far as I know, though I did not observe any meaningful changes to the benchmarks after making this change.

gfyoung · 2018-11-10T10:55:08Z

@jreback : Any other thoughts on this?

jreback

@gfyoung ok I reviewed the original issue, so lgtm. if you could add some comments on the parts you added for future readers, that would be great. ping on green.

pandas/_libs/src/parser/tokenizer.c

gfyoung · 2018-11-12T07:19:08Z

@jreback : Addressed the doc comments, and all is still green. PTAL.

jreback · 2018-11-12T13:13:05Z

thanks @gfyoung

gfyoung added Bug IO CSV read_csv, to_csv labels Nov 6, 2018

gfyoung added this to the 0.24.0 milestone Nov 6, 2018

gfyoung removed this from the 0.24.0 milestone Nov 6, 2018

gfyoung force-pushed the jagged-csv-buffer-overflow branch from feccd27 to 015a193 Compare November 11, 2018 10:41

jreback added this to the 0.24.0 milestone Nov 11, 2018

jreback requested changes Nov 11, 2018

BUG: Don't over-optimize memory with jagged CSV

17f7822

With jagged CSVs, we risk being too quick to dump memory that we need to allocate because previous chunks would have indicated much larger rows than we can anticipate in subsequent chunks. Closes gh-23509.

gfyoung force-pushed the jagged-csv-buffer-overflow branch from 015a193 to 17f7822 Compare November 12, 2018 00:46

jreback approved these changes Nov 12, 2018

jreback merged commit 011b79f into pandas-dev:master Nov 12, 2018

gfyoung deleted the jagged-csv-buffer-overflow branch November 12, 2018 19:43

gfyoung added a commit to forking-repos/pandas that referenced this pull request Jan 19, 2019

Fix memory growth bug in read_csv

12aaad0

The edge case where we hit powers of 2 every time during allocation can be painful. Closes gh-24805. xref gh-23527.

gfyoung mentioned this pull request Jan 19, 2019

Fix memory growth bug in read_csv #24837

Merged

gfyoung added a commit to forking-repos/pandas that referenced this pull request Jan 20, 2019

Fix memory growth bug in read_csv

e241796

The edge case where we hit powers of 2 every time during allocation can be painful. Closes gh-24805. xref gh-23527.
