PERF: Avoid unnecessary string operations in loadtxt. #19734
Closed
Conversation
anntzer force-pushed the loadtxt-approx-split-line branch 2 times, most recently from 8575415 to 1b0b344 on August 25, 2021 at 12:18.
Closes numpy#17277. If loadtxt is passed an unsized string or byte dtype, the size is set automatically from the longest entry in the first 50000 lines. If longer entries appeared later, they were silently truncated.
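As a hedged illustration of the truncation being fixed (using a fixed-width dtype directly rather than loadtxt itself): entries longer than a fixed-width string dtype are silently clipped, which is exactly what happened to entries past the first 50000 lines when the width was inferred from that prefix only.

```python
import numpy as np

# Fixed-width string dtypes silently truncate longer entries. If the
# width is inferred only from a prefix of the file, later, longer
# entries are clipped the same way.
a = np.array(["short", "a much longer entry"], dtype="U5")
print(a)  # the second entry is clipped to 5 characters
```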
This is much faster (~30%) for loading actual structured dtypes (by skipping the recursive packer), somewhat faster (~5-10%) for large loads (>10000 rows, perhaps because shape inference of the final array is faster?), and much slower (nearly 2x) for very small loads (10 rows) or for reads using `dtype=object` (due to the extraneous limitation on object views, which could be fixed separately); however, the main point is to allow further optimizations.
This patch takes advantage of the possibility of assigning a tuple of *strs* to a structured dtype with e.g. float fields, and have the strs be implicitly converted to floats by numpy at the C-level. (A Python-level fallback is kept to support e.g. hex floats.) Together with the previous commit, this provides a massive speedup (~2x on the loadtxt_dtypes_csv benchmark for 10_000+ ints or floats), but is beneficial with as little as 100 rows. Very small reads (10 rows) are still slower (nearly 2x for object), as well as reads using object dtypes (due to the extra copy), but the tradeoff seems worthwhile.
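A minimal sketch of the mechanism the paragraph describes (illustrative, not the PR's code): when a tuple of strings is assigned into a structured array with numeric fields, NumPy converts each string at the C level; inputs such as hex floats are rejected with ValueError, which is why a Python-level fallback is kept.

```python
import numpy as np

dt = np.dtype([("a", np.float64), ("b", np.float64)])
row = np.empty(1, dtype=dt)

# Assigning a tuple of *strs*: each field converts its string at the
# C level, with no per-field Python-side float() calls.
row[0] = ("1.5", "2.5")
print(row[0])  # (1.5, 2.5)

# Hex floats are not accepted by this path, hence the Python-level
# fallback (e.g. via float.fromhex) mentioned above.
try:
    row[0] = ("0x1.8p0", "2.5")
except ValueError:
    print("needs the Python-level fallback")
```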
In the fast-path of loadtxt, the conversion to np.void implicitly checks the number of fields. Removing the explicit length check saves ~5% for the largest loads (100_000 rows) of numeric scalar types.
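For instance (an illustrative snippet, not the PR's code), assigning a wrong-length tuple to a structured element already raises, so an explicit length check before the assignment is redundant:

```python
import numpy as np

dt = np.dtype([("x", np.float64), ("y", np.float64)])
arr = np.empty(1, dtype=dt)

arr[0] = ("1.0", "2.0")   # field count matches: fine
try:
    arr[0] = ("1.0",)     # too few fields: NumPy raises on its own,
except ValueError:        # so no explicit length check is needed
    print("field-count mismatch detected implicitly")
```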
When using _IMPLICIT_CONVERTERS, it is actually OK if strings are passed with trailing newlines (so we don't need to strip `"\r\n"`), and comments can be implicitly detected because the converters raise ValueError on them. Therefore, one can use an "approximate" line splitter, which doesn't remove trailing comments and newlines, falling back on the full line splitter if needed. This provides a 5-23% speedup in the case where there are actually no comments in the file (we could instead check the value of the `comments` kwarg, but it defaults to a non-empty value and it seems likely most users will not notice that a large speedup can be achieved by emptying it).

However, if there *are* actual comments in the file, then recreating the original string from the approximately split one and then re-splitting it is very costly (it would incur a >2x slowdown), so switch back to the full splitter (controlled by a local flag) in that case. Overall, only very short (10 rows) loads that include comments are slowed down by ~10% (likely by the extra processing on the row with comments).

(To be fully explicit, despite its name, the "approximate" splitter will never parse incorrect values; it may simply "fail", but we just fall back to the full/slow splitter in that case.)
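The fast-splitter-with-fallback idea can be sketched as follows (a hedged, standalone sketch; the names `full_split` and `parse_rows` are illustrative, not loadtxt's internals):

```python
def full_split(line, comment="#", delim=","):
    # Slow path: strip trailing comments and the newline before splitting.
    head, _, _ = line.partition(comment)
    return head.strip().split(delim)

def parse_rows(lines):
    approx_ok = True  # local flag: drop to the full splitter permanently
    rows = []
    for line in lines:
        if approx_ok:
            # Fast path: split only, keeping "\r\n" and any comment text.
            fields = line.split(",")
            try:
                # float() tolerates trailing whitespace ("2\n") but raises
                # ValueError on comment residue ("2 # note"), which is how
                # comments are detected implicitly.
                rows.append([float(f) for f in fields])
                continue
            except ValueError:
                approx_ok = False  # a comment appeared; stop re-splitting
        fields = full_split(line)
        if fields != [""]:  # skip blank/comment-only lines
            rows.append([float(f) for f in fields])
    return rows
```

For example, `parse_rows(["1,2\n", "3,4 # note\n", "5,6\n"])` parses the first line on the fast path, then switches to the full splitter when the comment triggers a ValueError.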
anntzer force-pushed the loadtxt-approx-split-line branch from 1b0b344 to 0d5642d on August 26, 2021 at 14:23.
Going to close this; I am very sure gh-20580 will land soon enough that it is not worthwhile to keep this open. Plus, you get around 10× faster :)
This PR goes on top of #19687 (only the last commit is new), but showcases another speed benefit of special-casing numeric types in loadtxt, so I thought I may as well post it already :-)
When using _IMPLICIT_CONVERTERS, it is actually OK if strings are passed with trailing newlines (so we don't need to strip `"\r\n"`), and comments can be implicitly detected because the converters raise ValueError on them. Therefore, one can use an "approximate" line splitter, which doesn't remove trailing comments and newlines, falling back on the full line splitter if needed. This provides a 10-20% speedup in the case where there are actually no comments in the file (we could instead check the value of the `comments` kwarg, but it defaults to a non-empty value and it seems likely most users will not notice that a large speedup can be achieved by emptying it).

However, if there are actual comments in the file, then recreating the original string from the approximately split one and then re-splitting it is very costly (it would incur a >2x slowdown), so switch back to the full splitter (controlled by a local flag) in that case. Overall, only very short (10 rows) loads that include comments are slowed down by ~10% (likely by the extra processing on the row with comments).

(To be fully explicit, despite its name, the "approximate" splitter will never parse incorrect values; it may simply "fail", but we just fall back to the full/slow splitter in that case.)
The obligatory benchmarks: