fix(fast-build): properly decode multi-byte UTF-8 in JSON string parser#7410
Merged
…tring parser Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
Fixes the hand-rolled JSON string parsing in `crates/microsoft-fast-build` so that non-ASCII literal characters in JSON strings (e.g., emoji, accented letters) are decoded as proper multi-byte UTF-8 sequences instead of being corrupted by per-byte `as char` casting. This improves correctness for SSR/state-driven rendering paths that consume JSON.
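A minimal sketch of the corruption described above, for illustration only. `decode_per_byte` is a hypothetical helper, not the crate's actual code; it mimics casting every byte to `char` independently.

```rust
/// Hypothetical stand-in for the old behaviour: each byte is cast to a
/// `char` on its own, so a multi-byte UTF-8 sequence becomes several
/// unrelated code points in the U+0080..=U+00FF range.
fn decode_per_byte(input: &str) -> String {
    input.bytes().map(|b| b as char).collect()
}

/// Returns the code points of the per-byte decoding, for inspection.
fn corrupted_code_points(input: &str) -> Vec<u32> {
    decode_per_byte(input).chars().map(|c| c as u32).collect()
}
```

Under this sketch, pure ASCII input survives unchanged, while `corrupted_code_points("⭐")` yields the three byte values `0xE2, 0xAD, 0x90` as separate code points instead of the single scalar U+2B50.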
Changes:
- Update `parse_string` to decode multi-byte UTF-8 sequences as a whole (keeping a fast ASCII path).
- Add unit tests in `json.rs` covering emoji and 2/3/4-byte UTF-8 sequences in strings and objects.
- Add integration tests in `tests/bindings.rs` to ensure multi-byte characters round-trip through bindings/rendering.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| crates/microsoft-fast-build/src/json.rs | Fix UTF-8 decoding in parse_string and add targeted unit tests. |
| crates/microsoft-fast-build/tests/bindings.rs | Add integration tests to verify correct rendering/binding with non-ASCII content. |
| crates/microsoft-fast-build/DESIGN.md | Document the updated parse_string behavior for literal non-ASCII UTF-8. |
Replace manual lead-byte range inference with from_utf8 on the remaining slice + chars().next(), then advance by ch.len_utf8(). This avoids misclassifying continuation bytes (0x80..0xBF) as 2-byte lead bytes, and makes malformed-input behaviour unambiguous. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
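The approach in this commit message can be sketched roughly as follows. This is an illustration under stated assumptions, not the code merged in `json.rs`: `next_char` and `decode_all` are hypothetical names, and the sketch assumes the input originated as a valid `&str`, so `from_utf8` on the remaining slice only fails when `pos` is not on a character boundary.

```rust
/// Hypothetical helper: decode the character starting at byte offset `pos`.
/// Returns the character and its encoded length in bytes, or `None` if
/// `pos` is out of range or the remaining bytes are not valid UTF-8.
fn next_char(bytes: &[u8], pos: usize) -> Option<(char, usize)> {
    let b = *bytes.get(pos)?;
    if b < 0x80 {
        // Fast path: a single ASCII byte maps directly to a char.
        return Some((b as char, 1));
    }
    // Validate the remaining slice, then take its first character.
    // A stray continuation byte (0x80..=0xBF) fails validation here
    // instead of being misread as a lead byte.
    let rest = std::str::from_utf8(&bytes[pos..]).ok()?;
    let ch = rest.chars().next()?;
    Some((ch, ch.len_utf8()))
}

/// Hypothetical driver: walk the input, advancing by each char's byte length.
fn decode_all(input: &str) -> Option<String> {
    let bytes = input.as_bytes();
    let mut out = String::new();
    let mut pos = 0;
    while pos < bytes.len() {
        let (ch, len) = next_char(bytes, pos)?;
        out.push(ch);
        pos += len;
    }
    Some(out)
}
```

Advancing by `ch.len_utf8()` keeps the cursor on character boundaries, so each subsequent call starts at either an ASCII byte or a valid lead byte.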
Pull Request
📖 Description
Fixes a bug in the hand-rolled JSON string parser (`json.rs`) where non-ASCII characters were corrupted during parsing.

The `parse_string` function advanced through the input byte-by-byte and used `b as char` for unescaped characters. For ASCII bytes this is correct, but for multi-byte UTF-8 sequences it casts each byte independently to a Unicode scalar, producing garbage. For example, ⭐ (U+2B50, UTF-8: `0xE2 0xAD 0x90`) was rendered as `â` (U+00E2) followed by two invisible control characters.

The fix determines the byte-length of each UTF-8 sequence from the leading byte, reads the full sequence, and decodes it with `std::str::from_utf8`. ASCII bytes (`< 0x80`) continue to use the fast `b as char` path.

This explains the `â SELECTED` corruption observed in the observer-map fixture output (issue seen in #7388).
📑 Test Plan
New unit tests in `json.rs`:
- `test_parse_string_with_emoji`: ⭐ round-trips correctly
- `test_parse_string_with_multibyte_chars`: 2-byte (é), 3-byte (✓), and 4-byte (🎉) sequences
- `test_parse_string_emoji_in_object`: emoji values inside a JSON object

New integration tests in `tests/bindings.rs`:
- `test_binding_emoji_from_state`: emoji from JSON state renders correctly in a binding
- `test_binding_emoji_literal_in_template`: emoji in template literal text is preserved verbatim
- `test_binding_multibyte_chars`: general multi-byte character round-trip

All existing tests pass (`cargo test`).
✅ Checklist
General
$ npm run change