Speed up tokenizing of a row in csv and xstrtod parsing #25784

Merged
7 commits merged into pandas-dev:master from anmyachev:speed-up-tokenizer on Mar 20, 2019


@vnlitvin
Contributor

commented Mar 19, 2019

  • closes: N/A
  • tests passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

I will update the PR when CI finishes running, as I only tested io.parser.test_common locally.

@jreback
Contributor

left a comment

Can you run the asv benchmarks for csv and report the changes? We may not have benchmarks that specifically hit the code you changed; if so, can you add one?
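For readers unfamiliar with asv, a benchmark of the kind requested here is a class with a `setup` method and one or more `time_*` methods. The following is a minimal sketch only: the class name `ReadCSVManyFloats`, the column layout, and the row counts are hypothetical and are not taken from this PR or from the pandas benchmark suite.

```python
import io

import pandas as pd


class ReadCSVManyFloats:
    """Hypothetical asv-style benchmark exercising float parsing in read_csv."""

    # asv runs every time_* method once per parameter value.
    params = [1_000, 100_000]
    param_names = ["nrows"]

    def setup(self, nrows):
        # Build the CSV text once, outside the timed region.
        lines = ["a,b,c"]
        lines.extend(f"{i},{i * 0.5},{i * 1.25e-3}" for i in range(nrows))
        self.csv_text = "\n".join(lines)

    def time_read_csv(self, nrows):
        # Only the parse itself is timed; a fresh StringIO is needed per call
        # because reading consumes the buffer.
        pd.read_csv(io.StringIO(self.csv_text))
```

Keeping data generation in `setup` means the timing reflects tokenizing and number conversion rather than string building.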

@@ -132,7 +132,8 @@ Performance Improvements
- Improved performance of :meth:`Series.searchsorted`. The speedup is especially large when the dtype is
int8/int16/int32 and the searched key is within the integer bounds for the dtype (:issue:`22034`)
- Improved performance of :meth:`pandas.core.groupby.GroupBy.quantile` (:issue:`20405`)

- Improved performance of `tokenize_bytes` in `tokenizer.c`

@jreback

jreback Mar 19, 2019

Contributor

Also mention :meth:`read_csv`; a user has no idea what any of the other things are.

@vnlitvin

Contributor Author

commented Mar 20, 2019

asv continuous -f 1.05 origin/master HEAD -b io.csv results:

      before         after    ratio  test name
   [e8d951d]     [a4f6dcd]
      master  speed-up-tokenizer
 1.58±0.02ms   1.49±0.02ms     0.94  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')
  5.23±0.1ms   4.87±0.03ms     0.93  io.csv.ReadUint64Integers.time_read_uint64_na_values
  10.1±0.2ms    9.15±0.3ms     0.91  io.csv.ReadCSVSkipRows.time_skipprows(10000)
  13.0±0.3ms    11.6±0.2ms     0.89  io.csv.ReadCSVThousands.time_thousands('|', None)

So this gives roughly a 5% to 10% speedup on these micro-benchmarks (ratios from 0.94 down to 0.89). Note that this scales quite well: in cases where CSV files are a million lines long, it still gives the same 5-10%.
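To make the ratio column concrete: it is the "after" time divided by the "before" time, so values below 1 mean the branch is faster. The short sketch below recomputes the ratios from the four rows reported above and converts them to a percentage of time saved; the helper name `time_saved_percent` is illustrative only.

```python
# (before_ms, after_ms, reported_ratio) copied from the asv output above.
RESULTS = {
    "ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')": (1.58, 1.49, 0.94),
    "ReadUint64Integers.time_read_uint64_na_values": (5.23, 4.87, 0.93),
    "ReadCSVSkipRows.time_skipprows(10000)": (10.1, 9.15, 0.91),
    "ReadCSVThousands.time_thousands('|', None)": (13.0, 11.6, 0.89),
}


def time_saved_percent(before_ms, after_ms):
    """Percentage of wall-clock time removed by the change."""
    return (1.0 - after_ms / before_ms) * 100.0


for name, (before, after, reported) in RESULTS.items():
    ratio = after / before
    saved = time_saved_percent(before, after)
    print(f"{name}: ratio={ratio:.2f} (reported {reported}), saved={saved:.1f}%")
```

The `-f 1.05` flag in the command sets the factor threshold, which is why only benchmarks that changed by more than about 5% appear in the output.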

@vnlitvin vnlitvin force-pushed the anmyachev:speed-up-tokenizer branch from a4f6dcd to 12a6df9 Mar 20, 2019

@vnlitvin

Contributor Author

commented Mar 20, 2019

The results are even more promising if I allow more warmup and more sampling time, so that Turbo Boost and frequency scaling don't affect the measurements too much.

Running asv continuous -f 1.05 origin/master HEAD -b io.csv -a sample_time=2 -a warmup_time=2 yields:

      before         after    ratio  test name
   [e8d951d]     [a4f6dcd]
      master  speed-up-tokenizer
  34.9±0.2ms      32.9±1ms     0.94  io.csv.ReadCSVCategorical.time_convert_direct
 13.5±0.02ms   12.6±0.08ms     0.93  io.csv.ReadCSVThousands.time_thousands(',', ',')
 5.10±0.07ms   4.71±0.09ms     0.92  io.csv.ReadUint64Integers.time_read_uint64_neg_values
 14.6±0.06ms   13.0±0.04ms     0.89  io.csv.ReadCSVThousands.time_thousands('|', ',')
  16.0±0.3ms   13.8±0.09ms     0.86  io.csv.ReadCSVSkipRows.time_skipprows(None)
  10.3±0.1ms    8.81±0.1ms     0.86  io.csv.ReadCSVSkipRows.time_skipprows(10000)
  12.9±0.1ms   10.7±0.09ms     0.84  io.csv.ReadCSVThousands.time_thousands('|', None)
 13.0±0.05ms   10.8±0.08ms     0.83  io.csv.ReadCSVThousands.time_thousands(',', None)
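
The effect of a longer warmup and sampling window can be illustrated with a small standard-library timing loop. This is a rough sketch of the general idea, not asv's actual measurement code; the function `measure` and its default durations are assumptions made for illustration.

```python
import statistics
import time


def measure(func, warmup_time=0.1, sample_time=0.5):
    """Run func for warmup_time seconds untimed, then sample for sample_time seconds.

    Returns the median duration of one call (in seconds) and the sample count.
    """
    clock = time.perf_counter

    # Warmup: let CPU frequency and caches settle; results are discarded.
    deadline = clock() + warmup_time
    while clock() < deadline:
        func()

    # Sampling: record each call until the window closes (at least one sample).
    samples = []
    deadline = clock() + sample_time
    while clock() < deadline or not samples:
        start = clock()
        func()
        samples.append(clock() - start)

    return statistics.median(samples), len(samples)


def parse_row():
    # Stand-in workload: split one CSV row and convert its fields to float.
    return [float(field) for field in "1.5,2.25,3.125,4.0".split(",")]


median_seconds, count = measure(parse_row, warmup_time=0.05, sample_time=0.1)
print(f"median {median_seconds * 1e6:.2f} µs over {count} samples")
```

A longer sampling window collects more samples, which makes the median less sensitive to short bursts of clock-speed change.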
@vnlitvin

Contributor Author

commented Mar 20, 2019

@jreback I've fixed the whatsnew entry per your comment and rebased onto the latest master for a clean history.

@codecov


commented Mar 20, 2019

Codecov Report

Merging #25784 into master will decrease coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #25784      +/-   ##
==========================================
- Coverage   91.26%   91.26%   -0.01%     
==========================================
  Files         173      173              
  Lines       52982    52982              
==========================================
- Hits        48356    48355       -1     
- Misses       4626     4627       +1
Flag        Coverage Δ
#multiple   89.83% <ø> (ø) ⬆️
#single     41.76% <ø> (ø) ⬆️

Impacted Files           Coverage Δ
pandas/util/testing.py   89.3% <0%> (-0.11%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4663951...12a6df9. Read the comment docs.

@codecov


commented Mar 20, 2019

Codecov Report

Merging #25784 into master will increase coverage by <.01%.
The diff coverage is n/a.


@@            Coverage Diff             @@
##           master   #25784      +/-   ##
==========================================
+ Coverage   91.26%   91.27%   +<.01%     
==========================================
  Files         173      173              
  Lines       52982    53002      +20     
==========================================
+ Hits        48356    48375      +19     
- Misses       4626     4627       +1
Flag        Coverage Δ
#multiple   89.83% <ø> (ø) ⬆️
#single     41.77% <ø> (+0.01%) ⬆️

Impacted Files                     Coverage Δ
pandas/util/testing.py             89.3% <0%> (-0.11%) ⬇️
pandas/core/series.py              93.67% <0%> (-0.01%) ⬇️
pandas/core/ops.py                 91.74% <0%> (ø) ⬆️
pandas/core/frame.py               96.79% <0%> (ø) ⬆️
pandas/core/generic.py             93.52% <0%> (ø) ⬆️
pandas/core/computation/expr.py    88.52% <0%> (+0.35%) ⬆️
pandas/core/computation/common.py  89.47% <0%> (+3.75%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4663951...72c7570. Read the comment docs.

@jreback jreback added this to the 0.25.0 milestone Mar 20, 2019

@jreback
Contributor

left a comment

comment on whatsnew, ping on green.

doc/source/whatsnew/v0.25.0.rst (outdated, resolved)
@jreback

Contributor

commented Mar 20, 2019

lgtm.

@vnlitvin

Contributor Author

commented Mar 20, 2019

ping on green

@jreback all tests are green

@jreback jreback merged commit 4c21e5c into pandas-dev:master Mar 20, 2019

8 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
pandas-dev.pandas: Build #20190320.38 succeeded
pandas-dev.pandas (Checks_and_doc): Checks_and_doc succeeded
pandas-dev.pandas (Linux py36_locale_slow): Linux py36_locale_slow succeeded
pandas-dev.pandas (Linux py37_locale): Linux py37_locale succeeded
pandas-dev.pandas (Linux py37_np_dev): Linux py37_np_dev succeeded
pandas-dev.pandas (Windows py36_np14): Windows py36_np14 succeeded
pandas-dev.pandas (macOS py35_macos): macOS py35_macos succeeded
@jreback

Contributor

commented Mar 20, 2019

thanks @vnlitvin nice patch!

thoo added a commit to thoo/pandas that referenced this pull request Mar 20, 2019

Merge remote-tracking branch 'upstream/master' into pivot
* upstream/master: (55 commits)
  PERF: Improve performance of StataReader (pandas-dev#25780)
  Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784)
  BUG: Fix _binop for operators for serials which has more than one returns (divmod/rdivmod). (pandas-dev#25588)
  BUG-24971 copying blocks also considers ndim (pandas-dev#25521)
  CLN: Panel reference from documentation (pandas-dev#25649)
  ENH: Quoting column names containing spaces with backticks to use them in query and eval. (pandas-dev#24955)
  BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769)
  DOC: clean bug fix section in whatsnew (pandas-dev#25792)
  DOC: Fixed PeriodArray api ref (pandas-dev#25526)
  Move locale code out of tm, into _config (pandas-dev#25757)
  Unpin pycodestyle (pandas-dev#25789)
  Add test for rdivmod on EA array (GH23287) (pandas-dev#24047)
  ENH: Support datetime.timezone objects (pandas-dev#25065)
  Cython language level 3 (pandas-dev#24538)
  API: concat on sparse values (pandas-dev#25719)
  TST: assert_produces_warning works with filterwarnings (pandas-dev#25721)
  make core.config self-contained (pandas-dev#25613)
  CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721)
  TST: Check pytables<3.5.1 when skipping (pandas-dev#25773)
  DOC: Fix typo in docstring of DataFrame.memory_usage  (pandas-dev#25770)
  ...

@vnlitvin vnlitvin deleted the anmyachev:speed-up-tokenizer branch Mar 28, 2019

anmyachev added a commit to anmyachev/pandas that referenced this pull request Apr 18, 2019
