
Update tokenizer to fix #8679 #8661 #8752

Merged: 1 commit merged into pandas-dev:master on Nov 27, 2014
Conversation

selasley (Contributor) commented Nov 7, 2014

Update tokenizer's handling of skipped lines.
Fixes a problem with read_csv(delim_whitespace=True) and
read_table(engine='c') when lines being skipped have trailing
whitespace.

closes #8679
closes #8681
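The failure mode is easy to reproduce; a minimal sketch (using `sep=r'\s+'`, which the C parser treats the same as `delim_whitespace=True`; the data is illustrative, not from the original reports) shows the behavior the fixed tokenizer produces:

```python
from io import StringIO
import pandas as pd

# Two skipped lines that end in trailing whitespace, then the real data.
data = "skip me   \nskip me too \nA B C\n1 2 3\n4 5 6\n"

# With the fix, trailing whitespace on the skipped lines is ignored.
df = pd.read_csv(StringIO(data), sep=r"\s+", skiprows=2)
print(df)
#    A  B  C
# 0  1  2  3
# 1  4  5  6
```

Before the fix, the trailing spaces on the skipped lines confused the tokenizer's line accounting and the data rows were misread.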

@jreback jreback added the IO CSV read_csv, to_csv label Nov 8, 2014
@jreback jreback added this to the 0.15.2 milestone Nov 8, 2014
@jreback jreback added the Bug label Nov 8, 2014
@@ -3036,7 +3036,7 @@ def test_line_comment(self):

def test_comment_skiprows(self):
data = """# empty
random line
random line with trailing spaces
Contributor

can you leave the original test in, and add a new test (you can just copy-paste), same method is fine. Also add the issue numbers as a comment.

jreback (Contributor) commented Nov 8, 2014

pls add a release note in v0.15.2 (reference both issues)

jreback (Contributor) commented Nov 9, 2014

cc @mdmueller pls have a look

mdmueller (Contributor):

Looks good to me; might as well leave the old test in, as @jreback noted above.

Thanks for the fix @selasley!

jreback (Contributor) commented Nov 9, 2014

@selasley

  • pls add back old test
  • add release note
  • rebase / squash

jreback (Contributor) commented Nov 9, 2014

@selasley ok, rebase again, you'll have a merge conflict (as you are including the whole of v0.15.2.txt again). You should be fine otherwise, repush and i'll have a look.

selasley (Contributor, Author):

I added my comments to whatsnew/v0.15.2.txt from e8b3862, did a rebase/squash and pushed the result. Let me know if I was supposed to do something different.

selasley (Contributor, Author):

Commit 3c16f33 contains a bug fix in tokenize_delim_customterm, general code cleanup, and updates of various files to their e8b3862 versions.

jreback (Contributor) commented Nov 10, 2014

you seem to be picking up other commits and then rebasing them in

what are you doing to rebase?

jreback (Contributor) commented Nov 10, 2014

selasley (Contributor, Author):
On Nov 10, 2014, at 3:37 AM, jreback notifications@github.com wrote:

https://github.com/pydata/pandas/wiki/Using-Git

Thanks for the link. I'll definitely read through it.

I've been doing git rebase -i HEAD~2 then changing the second pick to squash and editing the commit comments. Then I do a git push origin +trailing_spaces_fix

For the latest commit I copied several files from the latest master branch to my trailing_spaces_fix branch to make sure the tests still passed. In retrospect that was probably not a good thing to do.

jreback (Contributor) commented Nov 11, 2014

@selasley can you rebase this.

and can you post a memory comparison of old vs new? (use a relatively small example)

selasley (Contributor, Author):

I'll try again to rebase tonight.
I created a text file with 10000000 rows to skip, each containing 8 fields, followed by two rows each with 4 integers. In ipython3
%timeit -r 5 -n 5 df = pd.read_csv('bigfile.txt', header=None, delim_whitespace=True, skiprows=10000000)
5 loops, best of 5: 5.74 s per loop for new pandas
5 loops, best of 5: 10.3 s per loop for pandas 0.15.1 with numba and BottleNeck

According to the Allocations instrument with Xcode, reading the file allocated 4.67GB total with pandas 0.15.1 and 1.40GB with the new pandas.

With 100000 rows to skip the numbers are
5 loops, best of 5: 43.9 ms per loop, 29.23MB for new pandas
5 loops, best of 5: 88.0 ms per loop, 71.45MB for pandas 0.15.1
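The Allocations-instrument numbers above are macOS-specific; a rough cross-platform approximation can be taken with Python's tracemalloc (a sketch only: tracemalloc only sees allocations routed through Python's allocator, so memory obtained directly via malloc in the C tokenizer is undercounted, and the row counts here are scaled down from the ones in the comment):

```python
import tracemalloc
from io import StringIO
import pandas as pd

# 100,000 rows to skip, each with 8 fields, then two data rows.
rows = ["a b c d e f g h" for _ in range(100_000)]
rows += ["1 2 3 4", "5 6 7 8"]
buf = StringIO("\n".join(rows))

tracemalloc.start()
df = pd.read_csv(buf, header=None, sep=r"\s+", skiprows=100_000)
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(df.shape)                    # (2, 4)
print(f"peak ~{peak / 1e6:.1f} MB")
```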

selasley (Contributor, Author):

I found a problem with custom line terminators in 0.15.1 and in my version
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'), lineterminator='_', sep=' ')
works as expected, but
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'.replace(' ', ',')), lineterminator='_', sep=r',+')
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'), lineterminator='_', delim_whitespace=True)
pd.read_csv(StringIO('A B C_0 1 2_3 4 5'), lineterminator='_', sep=r'\ +')
all return an empty DataFrame. I would like to fix this.
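For reference, the plain single-character-terminator case is the one that works; a sanity-check sketch (using '_' in the role of the newline, as in the examples above):

```python
from io import StringIO
import pandas as pd

# '_' plays the role of the line terminator; a single-space separator.
df = pd.read_csv(StringIO("A B C_0 1 2_3 4 5"), lineterminator="_", sep=" ")
print(df)
#    A  B  C
# 0  0  1  2
# 1  3  4  5
```

The broken variants all combine a custom lineterminator with a regex separator (or delim_whitespace), which exercises a different tokenizer path.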

jreback (Contributor) commented Nov 11, 2014

@selasley fine to fix (I think there is an open issue about line terminators - can't find it in a quick search)
pls put in separate commits (or do a separate pr)
the skiprows memory issue is ok in here
but need to rebase

selasley (Contributor, Author):

I think I rebased properly this time for the trailing spaces fix. Thanks for your patience.

jreback (Contributor) commented Nov 11, 2014

rebase looks good!

see: https://github.com/pydata/pandas/wiki/Performance-Testing

need to test with -r csv
eg the vbenches related to csv reading
pls post results

I think it makes sense to add a vbench for skiprows as well
don't make the example so long though
would like them to take less than say 500ms (or so)
for this type of bench - maybe 20k rows, skip first 10k
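The suggested bench (20k rows, skip the first 10k) can be approximated outside the vbench harness with plain timeit (a sketch only; the actual change added a read_csv_skiprows vbench to io_bench.py, whose source is not shown here):

```python
import timeit
from io import StringIO
import pandas as pd

# 20,000 whitespace-delimited rows; skip the first 10,000 on read.
data = "\n".join("1 2 3 4 5" for _ in range(20_000))

def bench():
    return pd.read_csv(StringIO(data), header=None, sep=r"\s+",
                       skiprows=10_000)

per_call = timeit.timeit(bench, number=3) / 3
df = bench()
print(df.shape)                          # (10000, 5)
print(f"{per_call * 1000:.1f} ms per call")
```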

selasley (Contributor, Author):

I added a 20000 row, skiprows=10000 benchtest to io_bench.py

./test_perf.sh -b master -t HEAD -r csv

Invoked with :
--ncalls: 3
--repeats: 3


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_to_csv_date_formatting                 |  21.4180 |  26.3120 |   0.8140 |
read_csv_skiprows                            |  16.4446 |  18.6180 |   0.8833 |
read_csv_precise_converter                   |   1.8516 |   1.9027 |   0.9732 |
frame_to_csv                                 | 159.6503 | 163.0680 |   0.9790 |
read_csv_thou_vb                             |  24.6903 |  25.0720 |   0.9848 |
frame_to_csv2                                | 149.5183 | 150.1943 |   0.9955 |
write_csv_standard                           |  55.2851 |  55.2817 |   1.0001 |
frame_to_csv_mixed                           | 967.0726 | 962.2053 |   1.0051 |
packers_write_csv                            | 1647.7884 | 1638.5774 |   1.0056 |
read_csv_infer_datetime_format_custom        |  13.7510 |  13.6327 |   1.0087 |
read_csv_infer_datetime_format_iso8601       |   2.2984 |   2.2717 |   1.0118 |
read_csv_infer_datetime_format_ymd           |   2.4959 |   2.4653 |   1.0124 |
read_csv_roundtrip_converter                 |   2.8143 |   2.7333 |   1.0296 |
packers_read_csv                             | 220.8546 | 213.4230 |   1.0348 |
read_csv_standard                            |  13.8597 |  13.3590 |   1.0375 |
read_csv_vb                                  |  26.3373 |  25.1070 |   1.0490 |
read_csv_default_converter                   |   2.0603 |   1.9317 |   1.0666 |
read_csv_comment2                            |  32.6854 |  30.3557 |   1.0767 |
-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------

Ratio < 1.0 means the target commit is faster than the baseline.
Seed used: 1234

Target [746afa2] : BUG CSV: fix problem with trailing whitespace, issues 8661, 8679
ENH CSV: improve memory footprint when skipping first N rows, issue 8681
Base   [cd93914] : BUG: Allow non-float values in get_yahoo_quotes.
Fixes issue: #5229

jreback (Contributor) commented Nov 11, 2014
jreback commented Nov 11, 2014

perf looks good (and vbench).

you need to rebase again. somehow picking up odd things.

git rebase thisbranch -i origin/master should do it

selasley (Contributor, Author):

git rebase trailing_spaces_fix -i origin/master and git rebase -i trailing_spaces_fix origin/master gave fatal errors so I tried the rebase that seemed to work yesterday

jreback (Contributor) commented Nov 15, 2014

@selasley pls rebase and push. is this working locally?

selasley (Contributor, Author):

I am trying to follow the recipe on the wiki. Starting from scratch
git clone https://github.com/selasley/pandas.git
git remote add upstream git://github.com/pydata/pandas
git fetch upstream
git rebase -i upstream/master
git diff usually shows that doc/source/whatsnew/v0.15.2.txt needs to be cleaned up. I do that then
git rebase --continue
git push origin +trailing_spaces_fix
Right now git rebase -i says Successfully rebased and updated refs/heads/trailing_spaces_fix.
I did add some files to my fork and pushed before reading the wiki page. Maybe that's what is causing problems. Should I start from scratch and fork from the latest master branch?

jreback (Contributor) commented Nov 15, 2014

git push origin trailing_spaces_fix -f

the -f is key

selasley (Contributor, Author):

OK, I did a git push with -f. I mistakenly thought +trailing_spaces_fix was the same as --force based on a Stack Overflow answer. Now that I've read the git push docs I see they are different.

jreback (Contributor) commented Nov 15, 2014

https://travis-ci.org/pydata/pandas/jobs/41103767#L251

it's not cythonizing/compiling correctly

selasley (Contributor, Author):

I misinterpreted the failure message on Travis. It builds without errors on my Mac with clang from Xcode 6.1 and Cython 0.21.1, but if I run nosetests in the pandas directory I get the ImportError: cannot import name 'hashtable' error.
I'll work on getting it to build on Travis.

@@ -66,6 +66,7 @@ Enhancements
- Added support for ``utcfromtimestamp()``, ``fromtimestamp()``, and ``combine()`` on `Timestamp` class (:issue:`5351`).
- Added Google Analytics (`pandas.io.ga`) basic documentation (:issue:`8835`). See :ref:`here<remote_data.ga>`.
- Added flag ``order_categoricals`` to ``StataReader`` and ``read_stata`` to select whether to order imported categorical data (:issue:`8836`). See :ref:`here <io.stata-categorical>` for more information on importing categorical variables from Stata data files.
- Reduce memory usage when skiprows is an integer in read_csv (:issue:`8681`)
Contributor

Move this to performance section

jreback (Contributor) commented Nov 26, 2014

looks good. minor comment. pls rebase / squash. ping when green.

selasley (Contributor, Author):

I moved the comment to performance in whatsnew and rebased

df = self.read_csv(StringIO(data.replace(',', ' ')),
header=None, delim_whitespace=True,
skiprows=[0,1,2,3,5,6], skip_blank_lines=True)
tm.assert_almost_equal(df.values, expected)
Contributor

compare an actual dataframe here (and below), NOT the values, just tm.assert_frame_equal, this makes sure dtypes, index, columns are all correct

BUG CSV: fix problem with trailing whitespace, issues 8661, 8679

ENH CSV: Reduce memory usage when skiprows is an integer in read_csv, issue 8681
selasley (Contributor, Author):

Replaced tm.assert_almost_equal(df.values, expected) with tm.assert_frame_equal(df, expected), where expected is now a DataFrame. I had to push a couple of times; the latest Travis build should succeed.

jreback added a commit that referenced this pull request Nov 27, 2014
@jreback jreback merged commit e5fe75e into pandas-dev:master Nov 27, 2014
jreback (Contributor) commented Nov 27, 2014

thanks!

xdliao commented Dec 2, 2014

Although it works for "skiprows = integer" now, there is still a problem with "skiprows = list of rows"
when the list does not start from row 0,
such as read_table(file, skiprows=(1,2)) when all lines have trailing spaces.

selasley (Contributor, Author) commented Dec 2, 2014

I can reproduce the problem. I'll look into it:
data1="""A B C D E\n2spaces \n1space \n3spaces \nF G H I J \n0 1 2 3 4 \n5. 6 7. 8 9. """
data2="""A B C D E \n2spaces \n1space \n3spaces \nF G H I J \n0 1 2 3 4 \n5. 6 7. 8 9. """
pd.read_csv(StringIO(data1), skiprows=4, delim_whitespace=True)
F G H I J
0 0 1 2 3 4
1 5 6 7 8 9
pd.read_csv(StringIO(data1), skiprows=(1,2,3,4), delim_whitespace=True)
A B C D E
0 0 1 2 3 4
1 5 6 7 8 9
pd.read_csv(StringIO(data2), skiprows=4, delim_whitespace=True)
F G H I J
0 0 1 2 3 4
1 5 6 7 8 9
pd.read_csv(StringIO(data2), skiprows=(1,2,3,4), delim_whitespace=True)
A B C D E
0 NaN NaN NaN NaN NaN
1 F G H I J
2 0 1 2 3 4
3 5. 6 7. 8 9.

jreback (Contributor) commented Dec 3, 2014

@selasley excellent why don't u create a new issue for this (and xref this one)
@xdliao thanks for the report!

selasley (Contributor, Author) commented Feb 1, 2015


-------------------------------------------------------------------------------
Test name                                    | head[ms] | base[ms] |  ratio   |
-------------------------------------------------------------------------------
frame_to_csv_date_formatting                 |  10.0880 |  10.4263 |   0.9676 |
frame_to_csv                                 | 110.0133 | 113.4770 |   0.9695 |
write_csv_standard                           |  32.4697 |  33.2243 |   0.9773 |
read_csv_infer_datetime_format_ymd           |   1.5241 |   1.5490 |   0.9839 |
read_csv_infer_datetime_format_iso8601       |   1.3983 |   1.3986 |   0.9998 |
read_csv_comment2                            |  20.4786 |  20.3200 |   1.0078 |
packers_write_csv                            | 935.2113 | 927.8717 |   1.0079 |
read_csv_infer_datetime_format_custom        |   7.3980 |   7.3347 |   1.0086 |
packers_read_csv                             | 138.5036 | 135.7140 |   1.0206 |
frame_to_csv2                                |  91.6687 |  89.4313 |   1.0250 |
read_csv_precise_converter                   |   1.2287 |   1.1946 |   1.0285 |
frame_to_csv_mixed                           | 454.8573 | 441.8160 |   1.0295 |
read_csv_roundtrip_converter                 |   1.8710 |   1.8086 |   1.0344 |
read_csv_skiprows                            |  10.4700 |  10.0637 |   1.0404 |
read_csv_thou_vb                             |  14.9264 |  14.2634 |   1.0465 |
read_csv_standard                            |   8.9880 |   8.5763 |   1.0480 |
read_csv_vb                                  |  17.1566 |  16.2590 |   1.0552 |
read_csv_default_converter                   |   1.2707 |   1.1950 |   1.0633 |
-------------------------------------------------------------------------------
Seed used: 1234

Target [8602c23] : BUG: Fix buffer overflows in tokenizer.c with certain malformed input files. GH9205
Base [9f439f0] : Merge pull request #9380 from behzadnouri/i8grby

Labels: Bug, IO CSV (read_csv, to_csv)
Milestone: 0.15.2
Linked issues: memory error when skipping rows; Bug with read_table, skiprows, and C engine
5 participants