
Update tokenizer to fix #8679 #8661 #8752

Merged: 1 commit merged into pandas-dev:master on Nov 27, 2014
Conversation

selasley (Contributor) commented Nov 7, 2014

Update tokenizer's handling of skipped lines.
Fixes a problem with read_csv(delim_whitespace=True) and
read_table(engine='c') when lines being skipped have trailing
whitespace.

closes #8679
closes #8681
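The failure mode is easy to reproduce; a minimal sketch (using `sep=r'\s+'`, which the C parser treats the same as `delim_whitespace=True`; the data is illustrative, not from the original reports) shows the behavior the fixed tokenizer produces:

```python
from io import StringIO
import pandas as pd

# Two skipped lines that end in trailing whitespace, then the real data.
data = "skip me   \nskip me too \nA B C\n1 2 3\n4 5 6\n"

# With the fix, trailing whitespace on the skipped lines is ignored.
df = pd.read_csv(StringIO(data), sep=r"\s+", skiprows=2)
print(df)
#    A  B  C
# 0  1  2  3
# 1  4  5  6
```

Before the fix, the trailing spaces on the skipped lines confused the tokenizer's line accounting and the data rows were misread.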

@jreback jreback added the IO CSV read_csv, to_csv label Nov 8, 2014
@jreback jreback added this to the 0.15.2 milestone Nov 8, 2014
@jreback jreback added the Bug label Nov 8, 2014
@@ -3036,7 +3036,7 @@ def test_line_comment(self):

def test_comment_skiprows(self):
data = """# empty
random line
random line with trailing spaces
Contributor

can you leave the original test in, and add a new test (you can just copy-paste), same method is fine. Also add the issue numbers as a comment.

jreback (Contributor) commented Nov 8, 2014

pls add a release note in v0.15.2 (reference both issues)

jreback (Contributor) commented Nov 9, 2014

cc @mdmueller pls have a look

mdmueller (Contributor):

Looks good to me; might as well leave the old test in, as @jreback noted above.

Thanks for the fix @selasley!

jreback (Contributor) commented Nov 9, 2014

@selasley

  • pls add back old test
  • add release note
  • rebase / squash

jreback (Contributor) commented Nov 9, 2014

@selasley ok, rebase again, you'll have a merge conflict (as you are including the whole of v0.15.2.txt again). You should be fine otherwise, repush and i'll have a look.

selasley (Contributor, Author):

I added my comments to whatsnew/v0.15.2.txt from e8b3862, did a rebase/squash and pushed the result. Let me know if I was supposed to do something different.

selasley (Contributor, Author):

Commit 3c16f33 contains a bug fix in tokenize_delim_customterm, general code cleanup, and updates of various files to their e8b3862 versions.

jreback (Contributor) commented Nov 10, 2014

you seem to be picking up other commits and then rebasing them in

what are you doing to rebase?

jreback (Contributor) commented Nov 10, 2014

selasley (Contributor, Author):
On Nov 10, 2014, at 3:37 AM, jreback notifications@github.com wrote:

https://github.com/pydata/pandas/wiki/Using-Git

Thanks for the link. I'll definitely read through it.

I've been doing git rebase -i HEAD~2 then changing the second pick to squash and editing the commit comments. Then I do a git push origin +trailing_spaces_fix

For the latest commit I copied several files from the latest master branch to my trailing_spaces_fix branch to make sure the tests still passed. In retrospect that was probably not a good thing to do.

jreback (Contributor) commented Nov 11, 2014

@selasley can you rebase this.

and can you post a memory comparison of old vs new? (use a relatively small example)

selasley (Contributor, Author):

I'll try again to rebase tonight.
I created a text file with 10000000 rows to skip, each containing 8 fields, followed by two rows each with 4 integers. In ipython3
%timeit -r 5 -n 5 df = pd.read_csv('bigfile.txt', header=None, delim_whitespace=True, skiprows=10000000)
5 loops, best of 5: 5.74 s per loop for new pandas
5 loops, best of 5: 10.3 s per loop for pandas 0.15.1 with numba and BottleNeck

According to the Allocations instrument with Xcode, reading the file allocated 4.67GB total with pandas 0.15.1 and 1.40GB with the new pandas.

With 100000 rows to skip the numbers are
5 loops, best of 5: 43.9 ms per loop, 29.23MB for new pandas
5 loops, best of 5: 88.0 ms per loop, 71.45MB for pandas 0.15.1
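The Allocations-instrument numbers above are macOS-specific; a rough cross-platform approximation can be taken with Python's tracemalloc (a sketch only: tracemalloc only sees allocations routed through Python's allocator, so memory obtained directly via malloc in the C tokenizer is undercounted, and the row counts here are scaled down from the ones in the comment):

```python
import tracemalloc
from io import StringIO
import pandas as pd

# 100,000 rows to skip, each with 8 fields, then two data rows.
rows = ["a b c d e f g h" for _ in range(100_000)]
rows += ["1 2 3 4", "5 6 7 8"]
buf = StringIO("\n".join(rows))

tracemalloc.start()
df = pd.read_csv(buf, header=None, sep=r"\s+", skiprows=100_000)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(df.shape)                    # (2, 4)
print(f"peak ~{peak / 1e6:.1f} MB")
```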

selasley (Contributor, Author):

I found a problem with custom line terminators in 0.15.1 and in my version
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'), lineterminator='_', sep=' ')
works as expected, but
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'.replace(' ', ',')), lineterminator='_', sep=r',+')
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'), lineterminator='_', delim_whitespace=True)
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'), lineterminator='_', sep=r'\ +')
all return an empty DataFrame. I would like to fix this.
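For reference, the plain single-character-terminator case is the one that works; a sanity-check sketch (using '_' in the role of the newline, as in the examples above):

```python
from io import StringIO
import pandas as pd

# '_' plays the role of the line terminator; a single-space separator.
df = pd.read_csv(StringIO("A B C_0 1 2_3 4 5"), lineterminator="_", sep=" ")
print(df)
#    A  B  C
# 0  0  1  2
# 1  3  4  5
```

The broken variants all combine a custom lineterminator with a regex separator (or delim_whitespace), which exercises a different tokenizer path.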

jreback (Contributor) commented Nov 11, 2014

@selasley fine to fix (I think there is an open issue about line terminators - can't find it in a quick search)
pls put in separate commits (or do a separate pr)
the skiprows memory issue is ok in here
but need to rebase

selasley (Contributor, Author):

I think I rebased properly this time for the trailing spaces fix. Thanks for your patience.

jreback (Contributor) commented Nov 11, 2014

rebase looks good!

see: https://github.com/pydata/pandas/wiki/Performance-Testing

need to test with -r csv
eg the vbenches related to csv reading
pls post results

I think it makes sense to add a vbench for skiprows as well
don't make the example so long though
would like them to take less than say 500ms (or so)
for this type of bench - maybe 20k rows, skip first 10k
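The suggested bench (20k rows, skip the first 10k) can be approximated outside the vbench harness with plain timeit (a sketch only; the actual change added a read_csv_skiprows vbench to io_bench.py, whose source is not shown here):

```python
import timeit
from io import StringIO
import pandas as pd

# 20,000 whitespace-delimited rows; skip the first 10,000 on read.
data = "\n".join("1 2 3 4 5" for _ in range(20_000))

def bench():
    return pd.read_csv(StringIO(data), header=None, sep=r"\s+",
                       skiprows=10_000)

per_call = timeit.timeit(bench, number=3) / 3
df = bench()
print(df.shape)                          # (10000, 5)
print(f"{per_call * 1000:.1f} ms per call")
```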

selasley (Contributor, Author):

I added a 20000 row, skiprows=10000 benchtest to io_bench.py

./test_perf.sh -b master -t HEAD -r csv

Invoked with :
--ncalls: 3
--repeats: 3


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_to_csv_date_formatting                 |  21.4180 |  26.3120 |   0.8140 |
read_csv_skiprows                            |  16.4446 |  18.6180 |   0.8833 |
read_csv_precise_converter                   |   1.8516 |   1.9027 |   0.9732 |
frame_to_csv                                 | 159.6503 | 163.0680 |   0.9790 |
read_csv_thou_vb                             |  24.6903 |  25.0720 |   0.9848 |
frame_to_csv2                                | 149.5183 | 150.1943 |   0.9955 |
write_csv_standard                           |  55.2851 |  55.2817 |   1.0001 |
frame_to_csv_mixed                           | 967.0726 | 962.2053 |   1.0051 |
packers_write_csv                            | 1647.7884 | 1638.5774 |   1.0056 |
read_csv_infer_datetime_format_custom        |  13.7510 |  13.6327 |   1.0087 |
read_csv_infer_datetime_format_iso8601       |   2.2984 |   2.2717 |   1.0118 |
read_csv_infer_datetime_format_ymd           |   2.4959 |   2.4653 |   1.0124 |
read_csv_roundtrip_converter                 |   2.8143 |   2.7333 |   1.0296 |
packers_read_csv                             | 220.8546 | 213.4230 |   1.0348 |
read_csv_standard                            |  13.8597 |  13.3590 |   1.0375 |
read_csv_vb                                  |  26.3373 |  25.1070 |   1.0490 |
read_csv_default_converter                   |   2.0603 |   1.9317 |   1.0666 |
read_csv_comment2                            |  32.6854 |  30.3557 |   1.0767 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [746afa2] : BUG CSV: fix problem with trailing whitespace, issues 8661, 8679
ENH CSV: improve memory footprint when skipping first N rows, issue 8681
Base   [cd93914] : BUG: Allow non-float values in get_yahoo_quotes.
Fixes issue: #5229

jreback (Contributor) commented Nov 11, 2014
jreback commented Nov 11, 2014

perf looks good (and vbench).

you need to rebase again. somehow picking up odd things.

git rebase thisbranch -i origin/master should do it

selasley (Contributor, Author):

git rebase trailing_spaces_fix -i origin/master and git rebase -i trailing_spaces_fix origin/master gave fatal errors so I tried the rebase that seemed to work yesterday

jreback (Contributor) commented Nov 15, 2014

@selasley pls rebase and push. is this working locally?

selasley (Contributor, Author):

I am trying to follow the recipe on the wiki. Starting from scratch
git clone https://github.com/selasley/pandas.git
git remote add upstream git://github.com/pydata/pandas
git fetch upstream
git rebase -i upstream/master
git diff usually shows that doc/source/whatsnew/v0.15.2.txt needs to be cleaned up. I do that then
git rebase --continue
git push origin +trailing_spaces_fix
Right now git rebase -i says Successfully rebased and updated refs/heads/trailing_spaces_fix.
I did add some files to my fork and pushed before reading the wiki page. Maybe that's what is causing problems. Should I start from scratch and fork from the latest master branch?

jreback (Contributor) commented Nov 15, 2014

git push origin trailing_spaces_fix -f

the -f is key

selasley (Contributor, Author):

OK, I did a git push with -f. I mistakenly thought +trailing_spaces_fix was the same as --force based on a Stack Overflow answer. Now that I've read the git push docs I see they are different.

jreback (Contributor) commented Nov 15, 2014

https://travis-ci.org/pydata/pandas/jobs/41103767#L251

it's not cythonizing/compiling correctly

selasley (Contributor, Author):

I misinterpreted the failure message on Travis. It builds without errors on my Mac with clang from Xcode 6.1 and Cython 0.21.1, but if I run nosetests in the pandas directory I get the ImportError: cannot import name 'hashtable' error.
I'll work on getting it to build on Travis.

@@ -66,6 +66,7 @@ Enhancements
- Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on `Timestamp` class (:issue:`5351`).
- Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See :ref:`here<remote_data.ga>`.
- Added flag ``order_categoricals`` to ``StataReader`` and ``read_stata`` to select whether to order imported categorical data (:issue:`8836`). See :ref:`here <io.stata-categorical>` for more information on importing categorical variables from Stata data files.
- Reduce memory usage when skiprows is an integer in read_csv (:issue:`8681`)
Contributor

Move this to performance section

jreback (Contributor) commented Nov 26, 2014

looks good. minor comment. pls rebase / squash. ping when green.

selasley (Contributor, Author):

I moved the comment to performance in whatsnew and rebased

df = self.read_csv(StringIO(data.replace(',', ' ')),
header=None, delim_whitespace=True,
skiprows=[0,1,2,3,5,6], skip_blank_lines=True)
tm.assert_almost_equal(df.values, expected)
Contributor

compare an actual dataframe here (and below), NOT the values, just tm.assert_frame_equal, this makes sure dtypes, index, columns are all correct

BUG CSV: fix problem with trailing whitespace, issues 8661, 8679

ENH CSV: Reduce memory usage when skiprows is an integer in read_csv, issue 8681
selasley (Contributor, Author):

Replaced tm.assert_almost_equal(df.values, expected) with tm.assert_frame_equal(df, expected), where expected is now a DataFrame. I had to push a couple of times; the latest Travis build should succeed.

jreback added a commit that referenced this pull request Nov 27, 2014
@jreback jreback merged commit e5fe75e into pandas-dev:master Nov 27, 2014
jreback (Contributor) commented Nov 27, 2014

thanks!

xdliao commented Dec 2, 2014

Although it works for "skiprows = integer" now, there is still a problem with "skiprows = list of rows"
when the list does not start from row 0,
such as read_table(file, skiprows=(1,2)) when all lines have trailing spaces.

selasley (Contributor, Author) commented Dec 2, 2014

I can reproduce the problem. I'll look into it:
data1="""A B C D E\n2spaces \n1space \n3spaces \nF G H I J \n0 1 2 3 4 \n5. 6 7. 8 9. """
data2="""A B C D E \n2spaces \n1space \n3spaces \nF G H I J \n0 1 2 3 4 \n5. 6 7. 8 9. """
pd.read_csv(StringIO(data1), skiprows=4, delim_whitespace=True)
F G H I J
0 0 1 2 3 4
1 5 6 7 8 9
pd.read_csv(StringIO(data1), skiprows=(1,2,3,4), delim_whitespace=True)
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
pd.read_csv(StringIO(data2), skiprows=4, delim_whitespace=True)
F G H I J
0 0 1 2 3 4
1 5 6 7 8 9
pd.read_csv(StringIO(data2), skiprows=(1,2,3,4), delim_whitespace=True)
A B C D E
0 NaN NaN NaN NaN NaN
1 F G H I J
2 0 1 2 3 4
3 5. 6 7. 8 9.

jreback (Contributor) commented Dec 3, 2014

@selasley excellent why don't u create a new issue for this (and xref this one)
@xdliao thanks for the report!

selasley (Contributor, Author) commented Feb 1, 2015


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_to_csv_date_formatting                 |  10.0880 |  10.4263 |   0.9676 |
frame_to_csv                                 | 110.0133 | 113.4770 |   0.9695 |
write_csv_standard                           |  32.4697 |  33.2243 |   0.9773 |
read_csv_infer_datetime_format_ymd           |   1.5241 |   1.5490 |   0.9839 |
read_csv_infer_datetime_format_iso8601       |   1.3983 |   1.3986 |   0.9998 |
read_csv_comment2                            |  20.4786 |  20.3200 |   1.0078 |
packers_write_csv                            | 935.2113 | 927.8717 |   1.0079 |
read_csv_infer_datetime_format_custom        |   7.3980 |   7.3347 |   1.0086 |
packers_read_csv                             | 138.5036 | 135.7140 |   1.0206 |
frame_to_csv2                                |  91.6687 |  89.4313 |   1.0250 |
read_csv_precise_converter                   |   1.2287 |   1.1946 |   1.0285 |
frame_to_csv_mixed                           | 454.8573 | 441.8160 |   1.0295 |
read_csv_roundtrip_converter                 |   1.8710 |   1.8086 |   1.0344 |
read_csv_skiprows                            |  10.4700 |  10.0637 |   1.0404 |
read_csv_thou_vb                             |  14.9264 |  14.2634 |   1.0465 |
read_csv_standard                            |   8.9880 |   8.5763 |   1.0480 |
read_csv_vb                                  |  17.1566 |  16.2590 |   1.0552 |
read_csv_default_converter                   |   1.2707 |   1.1950 |   1.0633 |
-------------------------------------------------------------------------------
Seed used: 1234

Target [8602c23] : BUG: Fix buffer overflows in tokenizer.c with certain malformed input files. GH9205
Base [9f439f0] : Merge pull request #9380 from behzadnouri/i8grby

Labels: Bug, IO CSV (read_csv, to_csv)
Milestone: 0.15.2
Linked issues: memory error when skipping rows; Bug with read_table, skiprows, and C engine
5 participants