perf(json): scan whitespace directly to avoid per-call StringView alloc #3632
mizchi wants to merge 1 commit into
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Optimizes JSON whitespace skipping to reduce allocations in the lexer hot path by replacing regex-based matching with a direct scan over the input.
Changes:
- Replaced `lexmatch` on a freshly created `StringView` with a loop that scans ASCII whitespace directly.
- Added rationale comments describing allocation reductions and UTF-16 code unit scanning semantics.
    while ctx.offset < ctx.end_offset {
      let c = ctx.input.unsafe_get(ctx.offset)
      if c == ' ' || c == '\t' || c == '\r' || c == '\n' {
        ctx.offset = ctx.offset + 1
      } else {
        break
      }
    }
Done — offset and end are now hoisted into locals at the top of the function and ctx.offset is written once after the loop. New revision pushed.
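The revision described in this reply is not shown on the page. A minimal sketch of what hoisting the bounds into locals might look like follows. The struct is a stand-in so the snippet is self-contained: the real `ParseContext` has more fields, the `StringView` type of `input` is an assumption, and the local is named `end_offset` rather than `end` by choice.

```moonbit
// Stand-in for the parser state (assumption: the real ParseContext in
// moonbitlang/core carries additional fields, and `input` is a StringView).
struct ParseContext {
  input : StringView
  mut offset : Int
  end_offset : Int
}

// Sketch of the hoisted-locals variant: read the bounds once, advance a
// local cursor inside the loop, and write ctx.offset a single time at the end.
fn ParseContext::lex_skip_whitespace(ctx : ParseContext) -> Unit {
  let end_offset = ctx.end_offset
  let mut offset = ctx.offset
  while offset < end_offset {
    // Same accessor and comparisons as the diff: one UTF-16 code unit
    // checked against the four JSON whitespace characters.
    let c = ctx.input.unsafe_get(offset)
    if c == ' ' || c == '\t' || c == '\r' || c == '\n' {
      offset = offset + 1
    } else {
      break
    }
  }
  ctx.offset = offset
}
```

Keeping the cursor in a local means each consumed character updates a local variable instead of a struct field, which is the motivation given in the amended commit message.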
     fn ParseContext::lex_skip_whitespace(ctx : ParseContext) -> Unit {
    -  let rest = ctx.input.view(start_offset=ctx.offset, end_offset=ctx.end_offset)
    -  lexmatch rest with longest {
    -    ("[ \t\r\n]+", next) => ctx.offset = ctx.end_offset - next.length()
    -    _ => ()
    +  while ctx.offset < ctx.end_offset {
    +    let c = ctx.input.unsafe_get(ctx.offset)
    +    if c == ' ' || c == '\t' || c == '\r' || c == '\n' {
    +      ctx.offset = ctx.offset + 1
    +    } else {
    +      break
    +    }
       }
     }
Added a focused test covering the empty, all-whitespace, leading-whitespace, no-whitespace, and non-ASCII (U+00A0 NBSP must not count) cases. It is included in the amended commit.
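The test itself is not shown on the page. The five cases listed could be exercised roughly as below. This is a sketch only: `skip_whitespace` is a hypothetical free-function form of the scan, written so the snippet stands alone, and is not the PR's `ParseContext` method or its actual test.

```moonbit
// Hypothetical helper: returns the offset of the first code unit at or
// after `start` that is not JSON whitespace (space, tab, CR, LF).
fn skip_whitespace(input : StringView, start : Int) -> Int {
  let end_offset = input.length()
  let mut offset = start
  while offset < end_offset {
    let c = input.unsafe_get(offset)
    if c == ' ' || c == '\t' || c == '\r' || c == '\n' {
      offset = offset + 1
    } else {
      break
    }
  }
  offset
}

test "skip_whitespace cases" {
  // Empty input: nothing to consume.
  assert_eq(skip_whitespace("".view(), 0), 0)
  // All whitespace: the scan stops at the end of the input.
  assert_eq(skip_whitespace(" \t\r\n".view(), 0), 4)
  // Leading whitespace: the scan stops at the first non-whitespace unit.
  assert_eq(skip_whitespace("  [1]".view(), 0), 2)
  // No leading whitespace: the offset is unchanged.
  assert_eq(skip_whitespace("[1]".view(), 0), 0)
  // U+00A0 (NBSP) is not one of the four JSON whitespace characters.
  assert_eq(skip_whitespace("\u{a0}[1]".view(), 0), 0)
}
```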
`ParseContext::lex_skip_whitespace` constructs a fresh StringView of the remaining input on every call (to feed `lexmatch [ \t\r\n]+`), which the new memprofile flagged as the #1 alloc source in JSON parsing. Replacing it with a direct ASCII scan over the existing view drops the json_parse bench from 145.13 MB / 13.15 M allocs to 107.37 MB / 9.85 M allocs (-26.0 % / -25.1 %) at unchanged output.

Lives under notes/pr-drafts/06- to follow the existing PR-draft convention. Filed upstream as moonbitlang/core#3632.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`ParseContext::lex_skip_whitespace` is called at the start of every `lex_value` / `lex_after_*` step. The original implementation built a `StringView` of the remaining input on every call and ran `lexmatch [ \t\r\n]+` on it. That StringView wrapper showed up as the #1 allocation source in JSON parsing workloads.

Since JSON whitespace is ASCII-only, we can scan UTF-16 code units directly on the existing input view with identical semantics and zero allocations. `offset` is held in a local for the duration of the scan so each consumed whitespace character is one `local.set` instead of a `struct.set` against `ctx`.

Measured on a 197 KB JSON-array-of-1000-objects parsed 50 times (wasm-gc, wasmtime):

    Total alloc bytes : 145.13 MB -> 107.37 MB (-26.0 %)
    Total #allocs     : 13.15 M   -> 9.85 M    (-25.1 %)
    StringView::view  : 32.62 MB  -> 7.44 MB   (-77.2 %)
    String::view      : 12.59 MB  -> 0         (gone)

Adds a focused `test "lex_skip_whitespace"` covering the empty-input, all-whitespace, leading-whitespace, no-leading-whitespace, and non-ASCII (U+00A0 NBSP must NOT count as JSON whitespace per RFC 8259) cases.

Reproduction: github.com/mizchi/pprof-mbt (moon-pprof memprofile + summary --diff).

Tests: moon test passes (6507 / 6507).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from e0c8af3 to d1d9d2a.
Summary
`ParseContext::lex_skip_whitespace` is called at the start of every `lex_value` / `lex_after_*` step. The original implementation built a `StringView` of the remaining input on every call and ran `lexmatch [ \t\r\n]+` on it. That `StringView` wrapper shows up as the #1 allocation source in JSON parsing workloads.

Since JSON whitespace is ASCII-only, we can scan UTF-16 code units directly on the existing input view with identical semantics and zero allocations.
Numbers
Measured on a 197 KB JSON-array-of-1000-objects parsed 50 times under wasmtime + the wasm-gc backend, with every allocation captured via walrus instrumentation + `wasmtime::WasmBacktrace`:

| Metric | Before | After | Change |
| --- | --- | --- | --- |
| Total alloc bytes | 145.13 MB | 107.37 MB | -26.0 % |
| Total #allocs | 13.15 M | 9.85 M | -25.1 % |
| `StringView::view` | 32.62 MB | 7.44 MB | -77.2 % |
| `String::view` | 12.59 MB | 0 | gone |

The `String::view` line disappears because the previous code path went through the `lexmatch` view-of-view materialization, which is no longer constructed.

Reproduction tooling: mizchi/pprof-mbt (`moon-pprof memprofile` + `moon-pprof summary --diff`).

Test plan
- `moon test`: 6506 / 6506 pass
- `moon fmt --check`: clean
- `cmd/json_parse` still produces identical parse output (`parsed=50 input_len=197001` matches baseline)

🤖 Generated with Claude Code