perf(encoding): UTF-8 encoder walks code units inline instead of for char in src by mizchi · Pull Request #242 · moonbitlang/x

mizchi · 2026-05-23T10:23:29Z

Summary

encoding::encode(UTF8 | UTF16BE, ...) walks the source string with for char in src. That goes through String::iter + Iter::next<Char> per code point, plus surrogate-pair decoding inside Iter::next. On an ASCII-heavy 64 KiB payload the iterator path is ~30% of self time (String::iter 19.6% + Iter::next<Char> 10.2%) — more than the actual UTF-8 emission.

Replace with an explicit UTF-16 code-unit walk that handles surrogate pairs inline:

let len = src.length()
let mut i = 0
while i < len {
  let cu = src.unsafe_get(i).to_int()
  i += 1
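  // A high surrogate followed by a low surrogate combines into one code point;
  // any other unit (including an unpaired surrogate) passes through unchanged.
  let cp = if cu >= 0xD800 && cu <= 0xDBFF && i < len {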
    let cu2 = src.unsafe_get(i).to_int()
    if cu2 >= 0xDC00 && cu2 <= 0xDFFF {
      i += 1
      ((cu - 0xD800) * 0x400) + (cu2 - 0xDC00) + 0x10000
    } else { cu }
  } else { cu }
  write(new_buf, cp.unsafe_to_char())
}

Code units below 0xD800, which include all of ASCII, fail the first range check and skip the surrogate-detection arm entirely, which matches the common case.
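For readers who want to experiment with the walk outside MoonBit, here is a minimal sketch in TypeScript, whose strings are also sequences of UTF-16 code units. It is an illustration only, not the patched code: `codePointsByCodeUnitWalk` is a hypothetical name, and it collects code points into an array instead of writing UTF-8 bytes to a buffer.

```typescript
// Hypothetical helper (not part of moonbitlang/x): decode a string into
// code points by walking its UTF-16 code units, pairing surrogates inline.
function codePointsByCodeUnitWalk(src: string): number[] {
  const out: number[] = [];
  const len = src.length;
  let i = 0;
  while (i < len) {
    const cu = src.charCodeAt(i);
    i += 1;
    let cp = cu;
    // Only a high surrogate with at least one following unit can start a pair.
    if (cu >= 0xd800 && cu <= 0xdbff && i < len) {
      const cu2 = src.charCodeAt(i);
      if (cu2 >= 0xdc00 && cu2 <= 0xdfff) {
        // Valid pair: consume the low surrogate and combine both halves.
        i += 1;
        cp = (cu - 0xd800) * 0x400 + (cu2 - 0xdc00) + 0x10000;
      }
      // Otherwise the high surrogate is unpaired and passes through as-is.
    }
    out.push(cp);
  }
  return out;
}
```

As a worked case, the pair 0xD83D 0xDE00 gives (0x3D * 0x400) + 0x200 + 0x10000 = 0x1F600, so `codePointsByCodeUnitWalk("A" + String.fromCharCode(0xd83d, 0xde00))` returns `[0x41, 0x1f600]`. An unpaired 0xD83D followed by "A" comes back as two separate values, mirroring the `else { cu }` branches above.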

Benchmark

Scenario: bench-x/cmd/encoding_utf8/main.mbt — 4 KiB ASCII chunk × 64 = ~256 KiB payload, UTF-8 encoded 5000× per run. Native release, Linux x86_64, 3-run median wall time.

benchmark        baseline    patched    delta
encoding_utf8    528 ms      408 ms     -22.7%
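The delta column can be re-derived from the two wall times. A small sketch, assuming the delta is the signed change relative to the baseline (`relativeDeltaPercent` is a hypothetical helper, not part of the benchmark harness):

```typescript
// Hypothetical helper: signed change of `patchedMs` relative to `baselineMs`, in percent.
function relativeDeltaPercent(baselineMs: number, patchedMs: number): number {
  return ((patchedMs - baselineMs) / baselineMs) * 100;
}

// (408 - 528) / 528 = -0.2272..., which rounds to -22.7 at one decimal place.
const deltaText = relativeDeltaPercent(528, 408).toFixed(1) + "%"; // "-22.7%"
```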

Tests

moonbitlang/x/encoding    71 / 71 pass

Background

Same iter-overhead pattern as the moonbitlang/async gzip crc32_update patch and the base64 patch in this series (both also -20% to -37% from the same root cause).

… of for char in src

encoding::encode(UTF8 | UTF16BE, ...) walked the source with 'for char in src', which goes through String::iter + Iter::next<Char> per code point plus surrogate-pair decoding inside Iter::next. On an ASCII-heavy 64 KiB payload the iterator path was ~30% of self time (String::iter 19.6% + Iter::next<Char> 10.2%) -- more than the actual UTF-8 emission.

Replace with an explicit code-unit walk that handles surrogate pairs inline; the ASCII (cu < 0xD800) path skips surrogate detection entirely.

encoding_utf8 bench (native, 3-run median, 256 KiB ASCII x 5000 iters):
  baseline: 528 ms
  patched : 408 ms (-22.7%)

peter-jerry-ye · 2026-05-26T02:04:50Z

Same as #241, this should be fixed with the compiler

mizchi and others added 3 commits May 23, 2026 19:23

Update generated interfaces for stable moon

1c6e471

Apply stable formatter output

435d448

mizchi wants to merge 3 commits into moonbitlang:main from mizchi:pr-encoding-utf8-code-unit-walk