Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Don't over-optimize memory with jagged CSV #23527

Merged
merged 1 commit into from Nov 12, 2018

Conversation

Projects
None yet
3 participants
@gfyoung
Copy link
Member

commented Nov 6, 2018

With jagged CSV's, we risk being too quick to dump memory that we need to allocate because previous chunks would have indicated much larger rows than we can anticipate in subsequent chunks.

Closes #23509.

@gfyoung gfyoung added Bug IO CSV labels Nov 6, 2018

@pep8speaks

This comment has been minimized.

Copy link

commented Nov 6, 2018

Hello @gfyoung! Thanks for submitting the PR.

@gfyoung gfyoung added this to the 0.24.0 milestone Nov 6, 2018

@codecov

This comment has been minimized.

Copy link

commented Nov 6, 2018

Codecov Report

Merging #23527 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #23527   +/-   ##
=======================================
  Coverage   92.23%   92.23%           
=======================================
  Files         161      161           
  Lines       51324    51324           
=======================================
  Hits        47339    47339           
  Misses       3985     3985
Flag Coverage Δ
#multiple 90.62% <ø> (ø) ⬆️
#single 42.29% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dcb8b6a...17f7822. Read the comment docs.

@gfyoung gfyoung removed this from the 0.24.0 milestone Nov 6, 2018

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Nov 7, 2018

@jreback : Any thoughts on this one? (I only removed the 0.24.0 label because of the group discussion to not add the milestone unless it was about to be merged...)

@jreback

This comment has been minimized.

Copy link
Contributor

commented Nov 8, 2018

can you run the current benchmarks on this. do we have any that specifically target this?

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Nov 8, 2018

@jreback : No specific benchmarks as far as know, though I did not observe any meaningful changes to the benchmarks after making this change.

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Nov 10, 2018

@jreback : Any other thoughts on this?

@gfyoung gfyoung force-pushed the forking-repos:jagged-csv-buffer-overflow branch from feccd27 to 015a193 Nov 11, 2018

@jreback jreback added this to the 0.24.0 milestone Nov 11, 2018

@jreback
Copy link
Contributor

left a comment

@gfyoung ok I reviewed the original issue. so lgtm. if you would add some comments on the parts you added for future readers should be great. ping on green.

Show resolved Hide resolved pandas/_libs/src/parser/tokenizer.c
Show resolved Hide resolved pandas/_libs/src/parser/tokenizer.c
BUG: Don't over-optimize memory with jagged CSV
With jagged CSV's, we risk being too quick
to dump memory that we need to allocate
because previous chunks would have
indicated much larger rows than we can
anticipate in subsequent chunks.

Closes gh-23509.

@gfyoung gfyoung force-pushed the forking-repos:jagged-csv-buffer-overflow branch from 015a193 to 17f7822 Nov 12, 2018

@gfyoung

This comment has been minimized.

Copy link
Member Author

commented Nov 12, 2018

@jreback : Addressed the doc comments, and all is still green. PTAL.

@jreback jreback merged commit 011b79f into pandas-dev:master Nov 12, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20181112.8 succeeded
Details
@jreback

This comment has been minimized.

Copy link
Contributor

commented Nov 12, 2018

thanks @gfyoung

thoo added a commit to thoo/pandas that referenced this pull request Nov 12, 2018

Merge remote-tracking branch 'upstream/master' into to_html-to_string
* upstream/master:
  BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
  DEPR: Deprecate usecols as int in read_excel (pandas-dev#23635)
  More helpful Stata string length error. (pandas-dev#23629)
  BUG: astype fill_value for SparseArray.astype (pandas-dev#23547)
  CLN: datetimelike arrays: isort, small reorg (pandas-dev#23587)
  CI: Check in the CI that assert_raises_regex is not being used (pandas-dev#23627)
  CLN:Remove unused **kwargs from user facing methods (pandas-dev#23249)
  DOC: Enhancing pivot / reshape docs (pandas-dev#21038)
  TST: Fix xfailing DataFrame arithmetic tests by transposing (pandas-dev#23620)

thoo added a commit to thoo/pandas that referenced this pull request Nov 12, 2018

Merge remote-tracking branch 'upstream/master' into io_csv_docstring_…
…fixed

* upstream/master:
  DOC: avoid SparseArray.take error (pandas-dev#23637)
  CLN: remove incorrect usages of com.AbstractMethodError (pandas-dev#23625)
  DOC: Adding validation of the section order in docstrings (pandas-dev#23607)
  BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
  DEPR: Deprecate usecols as int in read_excel (pandas-dev#23635)
  More helpful Stata string length error. (pandas-dev#23629)
  BUG: astype fill_value for SparseArray.astype (pandas-dev#23547)
  CLN: datetimelike arrays: isort, small reorg (pandas-dev#23587)
  CI: Check in the CI that assert_raises_regex is not being used (pandas-dev#23627)
  CLN:Remove unused **kwargs from user facing methods (pandas-dev#23249)

@gfyoung gfyoung deleted the forking-repos:jagged-csv-buffer-overflow branch Nov 12, 2018

JustinZhengBC added a commit to JustinZhengBC/pandas that referenced this pull request Nov 14, 2018

BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
With jagged CSV's, we risk being too quick
to dump memory that we need to allocate
because previous chunks would have
indicated much larger rows than we can
anticipate in subsequent chunks.

Closes pandas-devgh-23509.

brute4s99 added a commit to brute4s99/pandas that referenced this pull request Nov 19, 2018

BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
With jagged CSV's, we risk being too quick
to dump memory that we need to allocate
because previous chunks would have
indicated much larger rows than we can
anticipate in subsequent chunks.

Closes pandas-devgh-23509.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Jan 19, 2019

Fix memory growth bug in read_csv
The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.

gfyoung added a commit to forking-repos/pandas that referenced this pull request Jan 20, 2019

Fix memory growth bug in read_csv
The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.

jreback added a commit that referenced this pull request Jan 20, 2019

Fix memory growth bug in read_csv (#24837)
* Fix memory growth bug in read_csv

The edge case where we hit powers of 2
every time during allocation can be painful.

Closes gh-24805.

xref gh-23527.

* TST: Add ASV benchmark for issue

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
With jagged CSV's, we risk being too quick
to dump memory that we need to allocate
because previous chunks would have
indicated much larger rows than we can
anticipate in subsequent chunks.

Closes pandas-devgh-23509.

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Fix memory growth bug in read_csv (pandas-dev#24837)
* Fix memory growth bug in read_csv

The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.

* TST: Add ASV benchmark for issue

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

BUG: Don't over-optimize memory with jagged CSV (pandas-dev#23527)
With jagged CSV's, we risk being too quick
to dump memory that we need to allocate
because previous chunks would have
indicated much larger rows than we can
anticipate in subsequent chunks.

Closes pandas-devgh-23509.

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019

Fix memory growth bug in read_csv (pandas-dev#24837)
* Fix memory growth bug in read_csv

The edge case where we hit powers of 2
every time during allocation can be painful.

Closes pandas-devgh-24805.

xref pandas-devgh-23527.

* TST: Add ASV benchmark for issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.