refactor: Preallocate UTF-8 encoding output #3434

Open
peter-jerry-ye wants to merge 1 commit into main from zihang/adjust-utf8-encode

@peter-jerry-ye
Collaborator

Summary

  • replace @encoding/utf8.encode's buffer-based implementation with a two-pass encoder that sizes the UTF-8 output first and writes directly into one FixedArray[Byte]
  • validate surrogate structure during encoding so malformed UTF-16 input aborts instead of being silently encoded
  • add regression tests for valid encoding behavior and malformed surrogate panic cases

Why

The previous implementation encoded through Buffer::write_string_utf8, which iterated by code point and paid repeated grow checks on the hot path. This change keeps the work at the UTF-16 code-unit level, does one preallocation, and then writes bytes directly.
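As a rough illustration of the two-pass shape described above, here is a minimal Python sketch. It is not the MoonBit code from this PR: the names `utf16_units` and `utf8_encode_utf16` are hypothetical, the input is modeled as a list of 16-bit integers, and BOM handling is left out. The first pass validates surrogates and computes the exact byte count, and the second pass writes into a single preallocated buffer.

```python
def utf16_units(s):
    """Hypothetical helper: split a Python str into UTF-16 code units."""
    raw = s.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


def utf8_encode_utf16(units):
    """Encode UTF-16 code units (ints in 0..0xFFFF) to UTF-8 in two passes."""
    n = len(units)

    # Pass 1: validate surrogate structure and compute the exact output size.
    size = 0
    i = 0
    while i < n:
        u = units[i]
        if u < 0x80:
            size += 1
        elif u < 0x800:
            size += 2
        elif 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate at index {i}")
            size += 4
            i += 1  # skip the low surrogate
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            size += 3
        i += 1

    # One allocation of the final size; no growth checks while writing.
    out = bytearray(size)

    # Pass 2: write bytes directly. The input is already known to be valid.
    i = 0
    pos = 0
    while i < n:
        u = units[i]
        if u < 0x80:
            out[pos] = u
            pos += 1
        elif u < 0x800:
            out[pos] = 0xC0 | (u >> 6)
            out[pos + 1] = 0x80 | (u & 0x3F)
            pos += 2
        elif 0xD800 <= u <= 0xDBFF:
            # Combine the surrogate pair into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out[pos] = 0xF0 | (cp >> 18)
            out[pos + 1] = 0x80 | ((cp >> 12) & 0x3F)
            out[pos + 2] = 0x80 | ((cp >> 6) & 0x3F)
            out[pos + 3] = 0x80 | (cp & 0x3F)
            pos += 4
            i += 1
        else:
            out[pos] = 0xE0 | (u >> 12)
            out[pos + 1] = 0x80 | ((u >> 6) & 0x3F)
            out[pos + 2] = 0x80 | (u & 0x3F)
            pos += 3
        i += 1
    return bytes(out)
```

Because validation happens entirely in the first pass, the write loop can skip error checks. The cost is that the input is scanned twice.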

Impact

  • improves UTF-16 -> UTF-8 encoding throughput
  • preserves BOM behavior and existing output for valid inputs
  • now aborts on malformed surrogate sequences instead of silently encoding them
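To make the behavioral difference concrete, the following sketch uses Python's built-in codec as an analogy. It does not use the MoonBit API, and the function names are illustrative only. The `surrogatepass` error handler roughly corresponds to "silently encoding" a lone surrogate, while the default strict mode corresponds to rejecting it.

```python
def encode_lenient(s):
    # Analogy for silent encoding: a lone surrogate is written out
    # as a three-byte sequence that is not well-formed UTF-8.
    return s.encode("utf-8", "surrogatepass")


def encode_strict(s):
    # Analogy for explicit rejection: raises UnicodeEncodeError
    # when the string contains an unpaired surrogate.
    return s.encode("utf-8")


# Valid input, including a leading U+FEFF, encodes identically either way.
assert encode_strict("\ufeffhi") == encode_lenient("\ufeffhi") == b"\xef\xbb\xbfhi"

# Malformed input: lenient mode emits bytes, strict mode refuses.
assert encode_lenient("\ud800") == b"\xed\xa0\x80"
try:
    encode_strict("\ud800")
except UnicodeEncodeError:
    pass  # rejected, as expected
```

Callers that previously relied on the lenient output for malformed strings would see a failure after a change like this one, so this is the case most worth covering in regression tests.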

Validation

  • moon test encoding/utf8/encode_test.mbt --target all
  • moon check encoding/utf8 --target all
  • moon fmt
  • moon info encoding/utf8
  • moon bench -p tmp/utf8_encode_bench --target native --release
  • moon bench -p tmp/utf8_encode_bench --target js --release
  • moon bench -p tmp/utf8_encode_bench --target wasm --release
  • moon bench -p tmp/utf8_encode_bench --target wasm-gc --release

Notes

The temporary tmp/utf8_encode_bench package used for measurement was removed before commit.

@coveralls
Collaborator

Coverage Report for CI Build 3833

Coverage decreased (-0.02%) to 94.892%

Details

  • Coverage decreased (-0.02%) from the base build.
  • Patch coverage: 5 uncovered changes across 1 file (36 of 41 lines covered, 87.8%).
  • No coverage regressions found.

Uncovered Changes

| File | Changed | Covered | % |
|---|---|---|---|
| encoding/utf8/encode.mbt | 41 | 36 | 87.8% |

Coverage Regressions

No coverage regressions found.


Coverage Stats

Relevant Lines: 15504
Covered Lines: 14712
Line Coverage: 94.89%
Coverage Strength: 220621.5 hits per line

💛 - Coveralls

@peter-jerry-ye marked this pull request as ready for review on April 17, 2026, 07:23

@devin-ai-integration devin-ai-integration bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

@peter-jerry-ye changed the title from "[codex] Preallocate UTF-8 encoding output" to "refactor: Preallocate UTF-8 encoding output" on Apr 17, 2026