perf(encoding): UTF-8 encoder walks code units inline instead of for char in src #242
Open
mizchi wants to merge 3 commits into
… of for char in src

encoding::encode(UTF8 | UTF16BE, ...) walked the source with 'for char in src', which goes through String::iter + Iter::next<Char> per code point plus surrogate-pair decoding inside Iter::next. On an ASCII-heavy 64 KiB payload the iterator path was ~30% of self time (String::iter 19.6% + Iter::next<Char> 10.2%) -- more than the actual UTF-8 emission.

Replace with an explicit code-unit walk that handles surrogate pairs inline; the ASCII (cu < 0xD800) path skips surrogate detection entirely.

encoding_utf8 bench (native, 3-run median, 256 KiB ASCII x 5000 iters):
baseline: 528 ms
patched : 408 ms (-22.7%)
Collaborator
Same as #241, this should be fixed with the compiler
Summary
`encoding::encode(UTF8 | UTF16BE, ...)` walks the source string with `for char in src`. That goes through `String::iter` + `Iter::next<Char>` per code point, plus surrogate-pair decoding inside `Iter::next`. On an ASCII-heavy 64 KiB payload the iterator path is ~30% of self time (`String::iter` 19.6% + `Iter::next<Char>` 10.2%), more than the actual UTF-8 emission.

Replace with an explicit UTF-16 code-unit walk that handles surrogate pairs inline:
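The patch itself is MoonBit and is not reproduced here. As a rough illustration of the technique, the sketch below does the same kind of walk in TypeScript, whose strings are also sequences of UTF-16 code units. The function name `encodeUtf8`, the buffer sizing, and the choice to emit U+FFFD for an unpaired surrogate are assumptions made for this example, not details taken from the patch.

```typescript
// Illustrative sketch only; encodeUtf8 is a hypothetical name, not the patched function.
function encodeUtf8(src: string): Uint8Array {
  const n = src.length;
  // Worst case is 3 output bytes per UTF-16 code unit.
  const out = new Uint8Array(n * 3);
  let o = 0;
  let i = 0;
  while (i < n) {
    const cu = src.charCodeAt(i++);
    if (cu < 0xd800) {
      // Fast path: no surrogate check needed below 0xD800 (covers all ASCII).
      if (cu < 0x80) {
        out[o++] = cu;
      } else if (cu < 0x800) {
        out[o++] = 0xc0 | (cu >> 6);
        out[o++] = 0x80 | (cu & 0x3f);
      } else {
        out[o++] = 0xe0 | (cu >> 12);
        out[o++] = 0x80 | ((cu >> 6) & 0x3f);
        out[o++] = 0x80 | (cu & 0x3f);
      }
      continue;
    }
    let cp = cu;
    if (cu <= 0xdfff) {
      // Surrogate arm: combine a high/low pair inline.
      const lo = cu <= 0xdbff && i < n ? src.charCodeAt(i) : 0;
      if (lo >= 0xdc00 && lo <= 0xdfff) {
        cp = 0x10000 + ((cu - 0xd800) << 10) + (lo - 0xdc00);
        i++;
      } else {
        cp = 0xfffd; // assumption: unpaired surrogates become U+FFFD
      }
    }
    if (cp < 0x10000) {
      // 0xE000..0xFFFF, or the U+FFFD replacement: 3 bytes.
      out[o++] = 0xe0 | (cp >> 12);
      out[o++] = 0x80 | ((cp >> 6) & 0x3f);
      out[o++] = 0x80 | (cp & 0x3f);
    } else {
      // Supplementary plane: 4 bytes.
      out[o++] = 0xf0 | (cp >> 18);
      out[o++] = 0x80 | ((cp >> 12) & 0x3f);
      out[o++] = 0x80 | ((cp >> 6) & 0x3f);
      out[o++] = 0x80 | (cp & 0x3f);
    }
  }
  return out.slice(0, o);
}
```

The point of the shape is that the first branch is a single comparison against 0xD800, so ASCII-heavy input never reaches the pair-decoding logic and no per-character iterator object is involved.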
Code units below the surrogate range (`cu < 0xD800`, which includes all of ASCII) skip the surrogate-detection arm entirely, which matches the common case.

Benchmark
Scenario: `bench-x/cmd/encoding_utf8/main.mbt`, 4 KiB ASCII chunk × 64 = ~256 KiB payload, UTF-8 encoded 5000× per run. Native release, Linux x86_64, 3-run median wall time.

Tests
Background
Same iter-overhead pattern as the moonbitlang/async gzip `crc32_update` patch and the base64 patch in this series (both also -20% to -37% from the same root cause).