
PERF: Avoid unnecessary string operations in loadtxt. #19734

Closed
wants to merge 5 commits

Conversation

@anntzer (Contributor) commented Aug 23, 2021

This PR goes on top of #19687 (only the last commit is new), but showcases another speed benefit of special-casing numeric types in loadtxt, so I thought I might as well post it already :-)


When using _IMPLICIT_CONVERTERS, it is actually OK if strings are
passed with trailing newlines (so we don't need to strip "\r\n"),
and comments can be implicitly detected because the converters raise
ValueError on them. Therefore, one can use an "approximate" line
splitter, which doesn't remove trailing comments and newlines, falling
back on the full line splitter if needed. This provides a 10-20%
speedup in the case where there are actually no comments in the file (we
could instead check the value of the comments kwarg, but it defaults
to a non-empty value and it seems likely most users will not notice that
a large speedup can be achieved by emptying it).

However, if there are actual comments in the file, then recreating the
original string from the approximately split one and then re-splitting
it is very costly (it would incur a >2x slowdown), so switch back to the
full splitter (controlled by a local flag) in that case. Overall, only
very short (10 rows) loads that include comments are slowed down by ~10%
(likely by the extra processing on the row with comments).

(To be fully explicit, despite its name, the "approximate" splitter will
never parse incorrect values; it may simply "fail", but we just fall
back to the full/slow splitter in that case.)


The obligatory benchmarks:

       before           after         ratio
     [df5ee9f3]       [a40561a7]
     <loadtxtflatdtype>       <loadtxt-approx-split-line>
+      44.9±0.5μs       50.0±0.1μs     1.11  bench_io.LoadtxtCSVComments.time_comment_loadtxt_csv(10)
-     46.3±0.05μs       44.0±0.1μs     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 10)
-     47.5±0.05μs       45.1±0.4μs     0.95  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 10)
-      45.3±0.1μs      42.7±0.08μs     0.94  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 10)
-       133±0.7ms        113±0.2ms     0.85  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(10000)
-       146±0.7ms        124±0.5ms     0.85  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(0)
-       145±0.7ms        123±0.4ms     0.85  bench_io.LoadtxtCSVSkipRows.time_skiprows_csv(500)
-         115±1μs       97.2±0.2μs     0.84  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 100)
-         125±1μs        106±0.5μs     0.84  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100)
-     8.94±0.05ms      7.49±0.09ms     0.84  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 10000)
-         116±1μs       96.9±0.6μs     0.83  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 100)
-      88.9±0.6ms       73.8±0.6ms     0.83  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('complex128', 100000)
-         113±2μs       93.2±0.6μs     0.82  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100)
-         795±6μs          654±4μs     0.82  bench_io.LoadtxtReadUint64Integers.time_read_uint64(1000)
-         459±4μs          376±1μs     0.82  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(550)
-         798±6μs          654±5μs     0.82  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(1000)
-         458±2μs          375±2μs     0.82  bench_io.LoadtxtReadUint64Integers.time_read_uint64(550)
-     7.94±0.03ms      6.47±0.04ms     0.82  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 10000)
-         114±1μs       92.5±0.3μs     0.81  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100)
-      78.3±0.3ms       63.6±0.2ms     0.81  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float32', 100000)
-      7.68±0.1ms      6.23±0.09ms     0.81  bench_io.LoadtxtReadUint64Integers.time_read_uint64_neg_values(10000)
-     7.97±0.04ms      6.47±0.05ms     0.81  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 10000)
-     7.67±0.07ms      6.21±0.08ms     0.81  bench_io.LoadtxtReadUint64Integers.time_read_uint64(10000)
-      79.2±0.7ms       64.0±0.4ms     0.81  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('float64', 100000)
-     7.67±0.05ms       6.00±0.1ms     0.78  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 10000)
-      76.1±0.8ms       59.5±0.9ms     0.78  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 100000)
-      7.69±0.1ms       5.98±0.1ms     0.78  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int32', 10000)
-      76.6±0.4ms       59.1±0.9ms     0.77  bench_io.LoadtxtCSVdtypes.time_loadtxt_dtypes_csv('int64', 100000)

DFEvans and others added 5 commits August 26, 2021 16:20
Closes numpy#17277. If loadtxt is passed an unsized string or byte dtype,
the size is set automatically from the longest entry in the first
50000 lines. If longer entries appeared later, they were silently
truncated.
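The silent truncation being fixed is ordinary NumPy fixed-width string behavior, which the following snippet illustrates (this is a standalone demonstration, not the loadtxt code itself):

```python
import numpy as np

# A fixed-width "U5" dtype silently chops off anything past 5 characters,
# which is what happened to entries beyond the sizing window in loadtxt.
a = np.array(["short"], dtype="U5")
a[0] = "much longer string"
print(a[0])  # truncated to 5 characters: "much "
```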
This is much faster (~30%) for loading actual structured dtypes (by
skipping the recursive packer), somewhat faster (~5-10%) for large loads
(>10000 rows, perhaps because shape inference of the final array is
faster?), and much slower (nearly 2x) for very small loads (10 rows) or
for reads using `dtype=object` (due to the extraneous limitation on
object views, which could be fixed separately); however, the main point
is to allow further optimizations.
This patch takes advantage of the possibility of assigning a tuple of
*strs* to a structured dtype with e.g. float fields, and have the strs
be implicitly converted to floats by numpy at the C-level.  (A
Python-level fallback is kept to support e.g. hex floats.)  Together
with the previous commit, this provides a massive speedup (~2x on the
loadtxt_dtypes_csv benchmark for 10_000+ ints or floats), but is
beneficial with as little as 100 rows.  Very small reads (10 rows) are
still slower (nearly 2x for object), as well as reads using object
dtypes (due to the extra copy), but the tradeoff seems worthwhile.
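The trick this commit relies on can be seen in isolation: assigning a tuple of strings to a row of a structured array with float fields lets NumPy do the string-to-float conversion at the C level, with no Python-level `float()` calls.

```python
import numpy as np

# Row assignment from a tuple of *strings*; NumPy converts each string
# to the field's dtype (here float) internally.
arr = np.empty(1, dtype=[("x", float), ("y", float)])
arr[0] = ("1.5", "2.5")
print(arr[0])
```

This is why the fast path can hand freshly split string fields straight to the array, keeping a Python-level fallback only for inputs NumPy's C parser rejects (e.g. hex floats).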
In the fast-path of loadtxt, the conversion to np.void implicitly checks
the number of fields.  Removing the explicit length check saves ~5% for
the largest loads (100_000 rows) of numeric scalar types.
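The implicit check mentioned here is the one NumPy performs when converting a tuple to a structured scalar: a tuple with the wrong number of fields is rejected with a ValueError, so a separate explicit length check duplicates work. A small demonstration:

```python
import numpy as np

arr = np.empty(1, dtype=[("x", float), ("y", float)])
arr[0] = ("1.0", "2.0")        # correct field count: accepted

try:
    arr[0] = ("3.0",)          # wrong field count: rejected by NumPy itself
    raised = False
except ValueError:
    raised = True
```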
When using _IMPLICIT_CONVERTERS, it is actually OK if strings are
passed with trailing newlines (so we don't need to strip `"\r\n"`),
and comments can be implicitly detected because the converters raise
ValueError on them.  Therefore, one can use an "approximate" line
splitter, which doesn't remove trailing comments and newlines, falling
back on the full line splitter if needed.  This provides a 5-23%
speedup in the case where there are actually no comments in the file (we
could instead check the value of the `comments` kwarg, but it defaults
to a non-empty value and it seems likely most users will not notice that
a large speedup can be achieved by emptying it).

However, if there *are* actual comments in the file, then recreating the
original string from the approximately split one and then re-splitting
it is very costly (it would incur a >2x slowdown), so switch back to the
full splitter (controlled by a local flag) in that case.  Overall, only
very short (10 rows) loads that include comments are slowed down by ~10%
(likely by the extra processing on the row with comments).

(To be fully explicit, despite its name, the "approximate" splitter will
never parse incorrect values; it may simply "fail", but we just fall
back to the full/slow splitter in that case.)
@seberg (Member) commented Jan 16, 2022

Going to close this; I am very sure gh-20580 will land soon enough that it is not worthwhile to keep this open. Plus, you get around a 10× speedup :).

@seberg seberg closed this Jan 16, 2022
@anntzer anntzer deleted the loadtxt-approx-split-line branch January 16, 2022 22:48