
perf(encoding): UTF-8 encoder walks code units inline instead of for char in src #242

Open
mizchi wants to merge 3 commits into moonbitlang:main from mizchi:pr-encoding-utf8-code-unit-walk

Contributor

@mizchi mizchi commented May 23, 2026

Summary

encoding::encode(UTF8 | UTF16BE, ...) walks the source string with for char in src. Each code point therefore goes through String::iter and Iter::next<Char>, with surrogate-pair decoding inside Iter::next. On an ASCII-heavy 64 KiB payload the iterator path accounts for ~30% of self time (String::iter 19.6% + Iter::next<Char> 10.2%), which is more than the actual UTF-8 emission.

Replace with an explicit UTF-16 code-unit walk that handles surrogate pairs inline:

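let len = src.length()
let mut i = 0
while i < len {
  let cu = src.unsafe_get(i).to_int()
  i += 1
  // High surrogate with at least one code unit left: try to form a pair.
  let cp = if cu >= 0xD800 && cu <= 0xDBFF && i < len {
    let cu2 = src.unsafe_get(i).to_int()
    if cu2 >= 0xDC00 && cu2 <= 0xDFFF {
      i += 1
      // Combine the pair into a supplementary-plane code point.
      ((cu - 0xD800) * 0x400) + (cu2 - 0xDC00) + 0x10000
    } else { cu } // no low surrogate follows: pass the high surrogate through
  } else { cu } // not a pairable high surrogate: use the code unit as-is
  write(new_buf, cp.unsafe_to_char())
}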

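Code units below 0xD800, which include all of ASCII, fail the first range check and never enter the surrogate-pair arm, so the common case costs a single comparison before the write. An unpaired surrogate is passed through as-is.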
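To check the pairing logic in isolation, here is a minimal sketch that runs the same walk but collects code points into an array instead of calling write. The helper name code_points is hypothetical and not part of moonbitlang/x. The sketch assumes, as the snippet above does, that String::unsafe_get(i).to_int() yields the UTF-16 code unit, and that the toolchain provides assert_eq in test blocks.

```moonbit
// Hypothetical helper, not part of moonbitlang/x: collects the code points
// produced by the code-unit walk so the pairing logic can be tested alone.
fn code_points(src : String) -> Array[Int] {
  let out : Array[Int] = []
  let len = src.length()
  let mut i = 0
  while i < len {
    let cu = src.unsafe_get(i).to_int()
    i += 1
    let cp = if cu >= 0xD800 && cu <= 0xDBFF && i < len {
      let cu2 = src.unsafe_get(i).to_int()
      if cu2 >= 0xDC00 && cu2 <= 0xDFFF {
        i += 1
        (cu - 0xD800) * 0x400 + (cu2 - 0xDC00) + 0x10000
      } else {
        cu
      }
    } else {
      cu
    }
    out.push(cp)
  }
  out
}

test "code-unit walk pairs surrogates" {
  // "a" is one code unit; U+1F600 is stored as the pair 0xD83D 0xDE00.
  assert_eq(code_points("a\u{1F600}"), [0x61, 0x1F600])
}
```

For the pair 0xD83D 0xDE00 the arithmetic is (0x3D * 0x400) + 0x200 + 0x10000 = 0xF400 + 0x200 + 0x10000 = 0x1F600.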

Benchmark

Scenario (bench-x/cmd/encoding_utf8/main.mbt): a 4 KiB ASCII chunk × 64 = 256 KiB payload, UTF-8 encoded 5000× per run. Native release build, Linux x86_64, median wall time of 3 runs.

benchmark       baseline   patched   delta
encoding_utf8   528 ms     408 ms    -22.7%
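The delta is the relative change in median wall time from baseline to patched:

$$
\Delta = \frac{t_{\text{patched}} - t_{\text{baseline}}}{t_{\text{baseline}}} = \frac{408 - 528}{528} \approx -22.7\%
$$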

Tests

moonbitlang/x/encoding    71 / 71 pass

Background

Same iter-overhead pattern as the moonbitlang/async gzip crc32_update patch and the base64 patch in this series; both also measured -20% to -37% from the same root cause.

mizchi and others added 3 commits May 23, 2026 19:23
… of for char in src

encoding::encode(UTF8 | UTF16BE, ...) walked the source with 'for char
in src', which goes through String::iter + Iter::next<Char> per code
point plus surrogate-pair decoding inside Iter::next. On an ASCII-
heavy 64 KiB payload the iterator path was ~30% of self time
(String::iter 19.6% + Iter::next<Char> 10.2%) -- more than the
actual UTF-8 emission.

Replace with an explicit code-unit walk that handles surrogate pairs
inline; the ASCII (cu < 0xD800) path skips surrogate detection
entirely.

encoding_utf8 bench (native, 3-run median, 256 KiB ASCII x 5000 iters):
  baseline: 528 ms
  patched : 408 ms  (-22.7%)
@peter-jerry-ye
Collaborator

Same as #241, this should be fixed with the compiler
