Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Don't over-optimize memory with jagged CSV #23527

Merged
merged 1 commit into from
Nov 12, 2018

Conversation

gfyoung
Copy link
Member

@gfyoung gfyoung commented Nov 6, 2018

With jagged CSV's, we risk being too quick to dump memory that we need to allocate because previous chunks would have indicated much larger rows than we can anticipate in subsequent chunks.

Closes #23509.

@gfyoung gfyoung added Bug IO CSV read_csv, to_csv labels Nov 6, 2018
@pep8speaks
Copy link

Hello @gfyoung! Thanks for submitting the PR.

@gfyoung gfyoung added this to the 0.24.0 milestone Nov 6, 2018
@codecov
Copy link

codecov bot commented Nov 6, 2018

Codecov Report

Merging #23527 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #23527   +/-   ##
=======================================
  Coverage   92.23%   92.23%           
=======================================
  Files         161      161           
  Lines       51324    51324           
=======================================
  Hits        47339    47339           
  Misses       3985     3985
Flag Coverage Δ
#multiple 90.62% <ø> (ø) ⬆️
#single 42.29% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dcb8b6a...17f7822. Read the comment docs.

@gfyoung gfyoung removed this from the 0.24.0 milestone Nov 6, 2018
@gfyoung
Copy link
Member Author

gfyoung commented Nov 7, 2018

@jreback : Any thoughts on this one? (I only removed the 0.24.0 label because of the group discussion to not add the milestone unless it was about to be merged...)

@jreback
Copy link
Contributor

jreback commented Nov 8, 2018

can you run the current benchmarks on this. do we have any that specifically target this?

@gfyoung
Copy link
Member Author

gfyoung commented Nov 8, 2018

@jreback : No specific benchmarks as far as know, though I did not observe any meaningful changes to the benchmarks after making this change.

@gfyoung
Copy link
Member Author

gfyoung commented Nov 10, 2018

@jreback : Any other thoughts on this?

@jreback jreback added this to the 0.24.0 milestone Nov 11, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gfyoung ok I reviewed the original issue. so lgtm. if you would add some comments on the parts you added for future readers should be great. ping on green.

pandas/_libs/src/parser/tokenizer.c Show resolved Hide resolved
pandas/_libs/src/parser/tokenizer.c Show resolved Hide resolved
With jagged CSV's, we risk being too quick
to dump memory that we need to allocate
because previous chunks would have
indicated much larger rows than we can
anticipate in subsequent chunks.

Closes pandas-devgh-23509.
@gfyoung
Copy link
Member Author

gfyoung commented Nov 12, 2018

@jreback : Addressed the doc comments, and all is still green. PTAL.

@jreback jreback merged commit 011b79f into pandas-dev:master Nov 12, 2018
@jreback
Copy link
Contributor

jreback commented Nov 12, 2018

thanks @gfyoung

thoo added a commit to thoo/pandas that referenced this pull request Nov 12, 2018
* upstream/master:
  BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
  DEPR: Deprecate usecols as int in read_excel (pandas-dev#23635)
  More helpful Stata string length error. (pandas-dev#23629)
  BUG: astype fill_value for SparseArray.astype (pandas-dev#23547)
  CLN: datetimelike arrays: isort, small reorg (pandas-dev#23587)
  CI: Check in the CI that assert_raises_regex is not being used (pandas-dev#23627)
  CLN:Remove unused **kwargs from user facing methods (pandas-dev#23249)
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)
thoo added a commit to thoo/pandas that referenced this pull request Nov 12, 2018
…fixed

* upstream/master:
  DOC: avoid SparseArray.take error (pandas-dev#23637)
  CLN: remove incorrect usages of com.AbstractMethodError (pandas-dev#23625)
  DOC: Adding validation of the section order in docstrings (pandas-dev#23607)
  BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
  DEPR: Deprecate usecols as int in read_excel (pandas-dev#23635)
  More helpful Stata string length error. (pandas-dev#23629)
  BUG: astype fill_value for SparseArray.astype (pandas-dev#23547)
  CLN: datetimelike arrays: isort, small reorg (pandas-dev#23587)
  CI: Check in the CI that assert_raises_regex is not being used (pandas-dev#23627)
  CLN:Remove unused **kwargs from user facing methods (pandas-dev#23249)
@gfyoung gfyoung deleted the jagged-csv-buffer-overflow branch November 12, 2018 19:43
JustinZhengBC pushed a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018
With jagged CSV's, we risk being too quick
to dump memory that we need to allocate
because previous chunks would have
indicated much larger rows than we can
anticipate in subsequent chunks.

Closes pandas-devgh-23509.
tm9k1 pushed a commit to tm9k1/pandas that referenced this pull request Nov 19, 2018
With jagged CSV's, we risk being too quick
to dump memory that we need to allocate
because previous chunks would have
indicated much larger rows than we can
anticipate in subsequent chunks.

Closes pandas-devgh-23509.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Jan 19, 2019
The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.
gfyoung added a commit to forking-repos/pandas that referenced this pull request Jan 20, 2019
The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.
jreback pushed a commit that referenced this pull request Jan 20, 2019
* Fix memory growth bug in read_csv

The edge case where we hit powers of 2
every time during allocation can be painful.

Closes gh-24805.

xref gh-23527.

* TST: Add ASV benchmark for issue
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
With jagged CSV's, we risk being too quick
to dump memory that we need to allocate
because previous chunks would have
indicated much larger rows than we can
anticipate in subsequent chunks.

Closes pandas-devgh-23509.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
* Fix memory growth bug in read_csv

The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.

* TST: Add ASV benchmark for issue
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
With jagged CSV's, we risk being too quick
to dump memory that we need to allocate
because previous chunks would have
indicated much larger rows than we can
anticipate in subsequent chunks.

Closes pandas-devgh-23509.
Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019
* Fix memory growth bug in read_csv

The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.

* TST: Add ASV benchmark for issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants