Speed up tokenizing of a row in csv and xstrtod parsing #25784

vnlitvinov · 2019-03-19T16:46:03Z

closes: N/A
tests passed
passes `git diff upstream/master -u -- "*.py" | flake8 --diff`
whatsnew entry

I will update the PR when CI finishes running, since I only tested io.parser.test_common locally.

jreback

Can you run the asv benchmarks for csv and report the changes? We may not have benchmarks that specifically hit the code you changed; if so, can you add one?

jreback · 2019-03-19T16:50:45Z

doc/source/whatsnew/v0.25.0.rst

@@ -132,7 +132,8 @@ Performance Improvements
 - Improved performance of :meth:`Series.searchsorted`. The speedup is especially large when the dtype is
  int8/int16/int32 and the searched key is within the integer bounds for the dtype (:issue:`22034`)
 - Improved performance of :meth:`pandas.core.groupby.GroupBy.quantile` (:issue:`20405`)
-
+- Improved performance of `tokenize_bytes` in `tokenizer.c`


Also say :meth:`read_csv`; a user has no idea what any of the other things are.

vnlitvinov · 2019-03-20T10:05:54Z

`asv continuous -f 1.05 origin/master HEAD -b io.csv` results:

| before (master, `e8d951d`) | after (speed-up-tokenizer, `a4f6dcd`) | ratio | test name |
|---|---|---|---|
| 1.58±0.02ms | 1.49±0.02ms | 0.94 | io.csv.ReadCSVFloatPrecision.time_read_csv(',', '.', 'high') |
| 5.23±0.1ms | 4.87±0.03ms | 0.93 | io.csv.ReadUint64Integers.time_read_uint64_na_values |
| 10.1±0.2ms | 9.15±0.3ms | 0.91 | io.csv.ReadCSVSkipRows.time_skipprows(10000) |
| 13.0±0.3ms | 11.6±0.2ms | 0.89 | io.csv.ReadCSVThousands.time_thousands('\|', None) |

So this gives roughly a 5% to 10% speedup on the mini-benchmarks. It also scales quite well: on csv files that are a million lines long it still gives the same 5-10%.
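For anyone reading the numbers: the ratio column is simply after/before, so the time saved can be re-derived with a few lines of Python. This is only a sketch over the rounded means from the table above (the ± spreads are ignored), and `time_saved_percent` is a made-up helper name, not part of asv.

```python
# Rounded mean timings in ms, copied from the asv table: (before, after).
timings_ms = {
    "ReadCSVFloatPrecision.time_read_csv(',', '.', 'high')": (1.58, 1.49),
    "ReadUint64Integers.time_read_uint64_na_values": (5.23, 4.87),
    "ReadCSVSkipRows.time_skipprows(10000)": (10.1, 9.15),
    "ReadCSVThousands.time_thousands('|', None)": (13.0, 11.6),
}


def time_saved_percent(before: float, after: float) -> float:
    """Hypothetical helper: share of the run time removed, in percent."""
    return (1.0 - after / before) * 100.0


for name, (before, after) in timings_ms.items():
    ratio = after / before  # same definition as the asv "ratio" column
    saved = time_saved_percent(before, after)
    print(f"{name}: ratio={ratio:.2f}, saved={saved:.1f}%")
```

Recomputed this way, the ratios round to the same 0.94, 0.93, 0.91 and 0.89 that asv reports, which corresponds to roughly 6% to 11% less time per benchmark.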

vnlitvinov · 2019-03-20T11:20:09Z

The results are even more promising if I allow more warmup and more sampling time, so that Turbo Boost and frequency scaling don't affect the measurements too much.

Running `asv continuous -f 1.05 origin/master HEAD -b io.csv -a sample_time=2 -a warmup_time=2` yields:

| before (master, `e8d951d`) | after (speed-up-tokenizer, `a4f6dcd`) | ratio | test name |
|---|---|---|---|
| 34.9±0.2ms | 32.9±1ms | 0.94 | io.csv.ReadCSVCategorical.time_convert_direct |
| 13.5±0.02ms | 12.6±0.08ms | 0.93 | io.csv.ReadCSVThousands.time_thousands(',', ',') |
| 5.10±0.07ms | 4.71±0.09ms | 0.92 | io.csv.ReadUint64Integers.time_read_uint64_neg_values |
| 14.6±0.06ms | 13.0±0.04ms | 0.89 | io.csv.ReadCSVThousands.time_thousands('\|', ',') |
| 16.0±0.3ms | 13.8±0.09ms | 0.86 | io.csv.ReadCSVSkipRows.time_skipprows(None) |
| 10.3±0.1ms | 8.81±0.1ms | 0.86 | io.csv.ReadCSVSkipRows.time_skipprows(10000) |
| 12.9±0.1ms | 10.7±0.09ms | 0.84 | io.csv.ReadCSVThousands.time_thousands('\|', None) |
| 13.0±0.05ms | 10.8±0.08ms | 0.83 | io.csv.ReadCSVThousands.time_thousands(',', None) |

vnlitvinov · 2019-03-20T11:20:47Z

@jreback I've fixed the whatsnew entry per your comment and rebased onto the latest master for a clean history.

codecov · 2019-03-20T11:58:12Z

Codecov Report

Merging #25784 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25784      +/-   ##
==========================================
- Coverage   91.26%   91.26%   -0.01%     
==========================================
  Files         173      173              
  Lines       52982    52982              
==========================================
- Hits        48356    48355       -1     
- Misses       4626     4627       +1

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `89.83% <ø> (ø)` | ⬆️ |
| #single | `41.76% <ø> (ø)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/util/testing.py | `89.3% <0%> (-0.11%)` | ⬇️ |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4663951...12a6df9. Read the comment docs.

codecov · 2019-03-20T11:58:12Z

Codecov Report

Merging #25784 into master will increase coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #25784      +/-   ##
==========================================
+ Coverage   91.26%   91.27%   +<.01%     
==========================================
  Files         173      173              
  Lines       52982    53002      +20     
==========================================
+ Hits        48356    48375      +19     
- Misses       4626     4627       +1

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | `89.83% <ø> (ø)` | ⬆️ |
| #single | `41.77% <ø> (+0.01%)` | ⬆️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/util/testing.py | `89.3% <0%> (-0.11%)` | ⬇️ |
| pandas/core/series.py | `93.67% <0%> (-0.01%)` | ⬇️ |
| pandas/core/ops.py | `91.74% <0%> (ø)` | ⬆️ |
| pandas/core/frame.py | `96.79% <0%> (ø)` | ⬆️ |
| pandas/core/generic.py | `93.52% <0%> (ø)` | ⬆️ |
| pandas/core/computation/expr.py | `88.52% <0%> (+0.35%)` | ⬆️ |
| pandas/core/computation/common.py | `89.47% <0%> (+3.75%)` | ⬆️ |

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4663951...72c7570. Read the comment docs.

jreback

comment on whatsnew, ping on green.

doc/source/whatsnew/v0.25.0.rst

jreback · 2019-03-20T13:16:14Z

lgtm.

vnlitvinov · 2019-03-20T13:51:23Z

> ping on green

@jreback all tests are green

jreback · 2019-03-20T14:03:59Z

thanks @vnlitvinov, nice patch!

* upstream/master: (55 commits)
  PERF: Improve performance of StataReader (pandas-dev#25780)
  Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784)
  BUG: Fix _binop for operators for serials which has more than one returns (divmod/rdivmod). (pandas-dev#25588)
  BUG-24971 copying blocks also considers ndim (pandas-dev#25521)
  CLN: Panel reference from documentation (pandas-dev#25649)
  ENH: Quoting column names containing spaces with backticks to use them in query and eval. (pandas-dev#24955)
  BUG: reading windows utf8 filenames in py3.6 (pandas-dev#25769)
  DOC: clean bug fix section in whatsnew (pandas-dev#25792)
  DOC: Fixed PeriodArray api ref (pandas-dev#25526)
  Move locale code out of tm, into _config (pandas-dev#25757)
  Unpin pycodestyle (pandas-dev#25789)
  Add test for rdivmod on EA array (GH23287) (pandas-dev#24047)
  ENH: Support datetime.timezone objects (pandas-dev#25065)
  Cython language level 3 (pandas-dev#24538)
  API: concat on sparse values (pandas-dev#25719)
  TST: assert_produces_warning works with filterwarnings (pandas-dev#25721)
  make core.config self-contained (pandas-dev#25613)
  CLN: replace %s syntax with .format in pandas.io.parsers (pandas-dev#24721)
  TST: Check pytables<3.5.1 when skipping (pandas-dev#25773)
  DOC: Fix typo in docstring of DataFrame.memory_usage (pandas-dev#25770)
  ...

jreback requested changes Mar 19, 2019

jreback added the Performance (memory or execution speed performance) and IO CSV (read_csv, to_csv) labels Mar 19, 2019

vnlitvinov force-pushed the speed-up-tokenizer branch from a4f6dcd to 12a6df9 on March 20, 2019 11:16

vnlitvinov and others added 6 commits March 20, 2019 06:17

Try to speed up COLITER_NEXT

3e0cd7f

speed up parser's functions by register key word

c157716

Speed up comparisons for special symbols during tokenizing csv row

73edece

Speed up xstrtod by first computing integer part of a float number in int variable

b4c2ca8

Added whatsnew entry

41d1747

Fixed whatsnew entry per @jreback comment

12a6df9
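
To illustrate the idea named in the xstrtod commit title above (accumulate the integer part in an integer variable before switching to floating point), here is a minimal Python sketch. It is not the C code from this PR, which is not shown in this thread; the function name, the error handling, and the lack of exponent and thousands-separator support are all assumptions made to keep the example short.

```python
def parse_decimal_sketch(text: str, decimal: str = ".") -> float:
    """Hypothetical illustration only; this is not pandas' xstrtod."""
    pos = 0
    negative = False
    if pos < len(text) and text[pos] in "+-":
        negative = text[pos] == "-"
        pos += 1

    # Integer part: one integer multiply-add per digit, no float math yet.
    # (In C this accumulator could overflow; Python ints cannot.)
    int_part = 0
    int_digits = 0
    while pos < len(text) and "0" <= text[pos] <= "9":
        int_part = int_part * 10 + (ord(text[pos]) - ord("0"))
        pos += 1
        int_digits += 1

    # Fractional part: also collected as an integer plus a digit count.
    frac_part = 0
    frac_digits = 0
    if pos < len(text) and text[pos] == decimal:
        pos += 1
        while pos < len(text) and "0" <= text[pos] <= "9":
            frac_part = frac_part * 10 + (ord(text[pos]) - ord("0"))
            pos += 1
            frac_digits += 1

    # Reject empty input, a lone separator, or trailing characters.
    if int_digits + frac_digits == 0 or pos != len(text):
        raise ValueError(f"not a plain decimal number: {text!r}")

    # Floating point is only touched once, at the very end.
    value = int_part + frac_part / 10 ** frac_digits
    return -value if negative else value


print(parse_decimal_sketch("3.25"))
print(parse_decimal_sketch("1,5", decimal=","))
```

The point of the sketch is that the per-digit work stays in integer arithmetic, which is presumably what makes the approach cheaper than a floating-point multiply-add per digit; a real C version would also have to guard the integer accumulator against overflow.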

jreback added this to the 0.25.0 milestone Mar 20, 2019

jreback requested changes Mar 20, 2019

doc/source/whatsnew/v0.25.0.rst (outdated)

Fine-tuning whatsnew entry for improved readability

72c7570

jreback approved these changes Mar 20, 2019

jreback merged commit 4c21e5c into pandas-dev:master Mar 20, 2019

vnlitvinov deleted the speed-up-tokenizer branch March 28, 2019 14:30

anmyachev pushed a commit to anmyachev/pandas that referenced this pull request Apr 18, 2019

Speed up tokenizing of a row in csv and xstrtod parsing (pandas-dev#25784)

ad6e391
