perf(json): scan whitespace directly to avoid per-call StringView alloc #3632

Open
mizchi wants to merge 1 commit into moonbitlang:main from mizchi:json/lex-skip-whitespace-no-view

@mizchi mizchi commented May 26, 2026

Summary

`ParseContext::lex_skip_whitespace` is called at the start of every `lex_value` / `lex_after_*` step. The original implementation built a `StringView` of the remaining input on every call and ran `lexmatch [ \t\r\n]+` on it. That `StringView` wrapper shows up as the #1 allocation source in JSON parsing workloads.

Since JSON whitespace is ASCII-only, we can scan UTF-16 code units directly on the existing input view with identical semantics and zero allocations.

Numbers

Measured on a 197 KB JSON-array-of-1000-objects parsed 50 times under wasmtime + the wasm-gc backend, with every allocation captured via walrus instrumentation + wasmtime::WasmBacktrace:

| metric | before | after | delta |
| --- | --- | --- | --- |
| Total alloc bytes | 145.13 MB | 107.37 MB | −26.0 % |
| Total #allocations | 13.15 M | 9.85 M | −25.1 % |
| `StringView::view` | 32.62 MB | 7.44 MB | −77.2 % |
| `String::view` | 12.59 MB | 0 | gone |

The String::view line disappears because the previous code path went through the lexmatch view-of-view materialization, which is no longer constructed.
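
For reference, the delta column is consistent with the usual relative change, (after − before) / before, applied to the rounded figures in the table:

$$
\frac{107.37 - 145.13}{145.13} \approx -0.260, \qquad
\frac{9.85 - 13.15}{13.15} \approx -0.251, \qquad
\frac{7.44 - 32.62}{32.62} \approx -0.772
$$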

Reproduction tooling: mizchi/pprof-mbt (moon-pprof memprofile + moon-pprof summary --diff).

Test plan

  • moon test — 6506 / 6506 pass
  • moon fmt --check — clean
  • Bench cmd/json_parse still produces identical parse output (parsed=50 input_len=197001 matches baseline)

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings May 26, 2026 09:16
Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Optimizes JSON whitespace skipping to reduce allocations in the lexer hot path by replacing regex-based matching with a direct scan over the input.

Changes:

  • Replaced lexmatch on a freshly created StringView with a loop that scans ASCII whitespace directly.
  • Added rationale comments describing allocation reductions and UTF-16 code unit scanning semantics.

Comment thread json/lex_misc.mbt Outdated
Comment on lines 124 to 131
    while ctx.offset < ctx.end_offset {
      let c = ctx.input.unsafe_get(ctx.offset)
      if c == ' ' || c == '\t' || c == '\r' || c == '\n' {
        ctx.offset = ctx.offset + 1
      } else {
        break
      }
    }
Contributor Author

Done: `offset` and `end` are now hoisted into locals at the top of the function, and `ctx.offset` is written once after the loop. New revision pushed.
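
The revised function is not shown in this thread, so here is a minimal sketch of what the hoisted-locals version might look like, based only on the description above. The three-field `ParseContext` is a hypothetical stand-in for the real struct in `json/`, and the local names are assumptions. The `unsafe_get` call and the comparison against character literals are copied from the diff in this PR.

```moonbit
///|
/// Hypothetical stand-in for the parser state; the real `ParseContext`
/// is not shown in this thread and likely has more fields.
priv struct ParseContext {
  input : StringView
  mut offset : Int
  end_offset : Int
}

///|
/// Skip JSON whitespace (space, tab, CR, LF) starting at `ctx.offset`.
/// Assumes `ctx.end_offset <= ctx.input.length()`, so the unchecked
/// `unsafe_get` never reads past the view.
fn ParseContext::lex_skip_whitespace(ctx : ParseContext) -> Unit {
  // Read both bounds once; the loop then only touches locals.
  let end = ctx.end_offset
  let mut offset = ctx.offset
  while offset < end {
    let c = ctx.input.unsafe_get(offset)
    if c == ' ' || c == '\t' || c == '\r' || c == '\n' {
      offset = offset + 1
    } else {
      break
    }
  }
  // Single write back to the context after the scan.
  ctx.offset = offset
}
```

Keeping the cursor in a local means the loop body does not write to `ctx` for each consumed character; the only field write is the final assignment.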

Comment thread json/lex_misc.mbt
Comment on lines 123 to 132
      fn ParseContext::lex_skip_whitespace(ctx : ParseContext) -> Unit {
    -   let rest = ctx.input.view(start_offset=ctx.offset, end_offset=ctx.end_offset)
    -   lexmatch rest with longest {
    -     ("[ \t\r\n]+", next) => ctx.offset = ctx.end_offset - next.length()
    -     _ => ()
    +   while ctx.offset < ctx.end_offset {
    +     let c = ctx.input.unsafe_get(ctx.offset)
    +     if c == ' ' || c == '\t' || c == '\r' || c == '\n' {
    +       ctx.offset = ctx.offset + 1
    +     } else {
    +       break
    +     }
        }
      }
Contributor Author

Added a focused test covering the empty, all-whitespace, leading-whitespace, no-whitespace, and non-ASCII (U+00A0 NBSP must not count) cases; it is included in the amended commit.
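
The test itself is not shown on this page. As a rough illustration of the five cases listed, here is a sketch written against a hypothetical free function, `skip_json_whitespace`, so that it is self-contained. The name, signature, and exact inputs are assumptions, not the contents of the actual `test "lex_skip_whitespace"`.

```moonbit
///|
/// Hypothetical free-function form of the scan. Returns the index of the
/// first code unit at or after `start` that is not JSON whitespace
/// (space, tab, CR, LF). Assumes 0 <= start <= input.length().
fn skip_json_whitespace(input : StringView, start : Int) -> Int {
  let end = input.length()
  let mut offset = start
  while offset < end {
    let c = input.unsafe_get(offset)
    if c == ' ' || c == '\t' || c == '\r' || c == '\n' {
      offset = offset + 1
    } else {
      break
    }
  }
  offset
}

///|
test "skip_json_whitespace" {
  // Empty input: nothing to consume.
  assert_eq(skip_json_whitespace("".view(), 0), 0)
  // All whitespace: the scan stops at the end of the view.
  assert_eq(skip_json_whitespace(" \t\r\n".view(), 0), 4)
  // Leading whitespace: the scan stops at the first non-whitespace unit.
  assert_eq(skip_json_whitespace("  x".view(), 0), 2)
  // No leading whitespace: the offset is unchanged.
  assert_eq(skip_json_whitespace("x ".view(), 0), 0)
  // U+00A0 (NBSP) is not one of the four JSON whitespace characters.
  assert_eq(skip_json_whitespace("\u{A0}x".view(), 0), 0)
}
```

The NBSP case is the one that guards the "identical semantics" claim: a scan built on a general Unicode whitespace predicate would skip it, whereas the four-character comparison does not.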

mizchi added a commit to mizchi/pprof-mbt that referenced this pull request May 26, 2026
`ParseContext::lex_skip_whitespace` constructs a fresh StringView of
the remaining input on every call (to feed `lexmatch [ \t\r\n]+`),
which the new memprofile flagged as the #1 alloc source in JSON
parsing. Replacing it with a direct ASCII scan over the existing
view drops the json_parse bench from 145.13 MB / 13.15 M allocs to
107.37 MB / 9.85 M allocs (-26.0 % / -25.1 %) at unchanged output.

Lives under notes/pr-drafts/06- to follow the existing PR-draft
convention. Filed upstream as moonbitlang/core#3632.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`ParseContext::lex_skip_whitespace` is called at the start of every
`lex_value` / `lex_after_*` step. The original implementation built
a `StringView` of the remaining input on every call and ran
`lexmatch [ \t\r\n]+` on it. That StringView wrapper showed up as the
#1 allocation source in JSON parsing workloads.

Since JSON whitespace is ASCII-only, we can scan UTF-16 code units
directly on the existing input view with identical semantics and zero
allocations. `offset` is held in a local for the duration of the scan
so each consumed whitespace character is one `local.set` instead of a
`struct.set` against `ctx`.

Measured on a 197 KB JSON-array-of-1000-objects parsed 50 times
(wasm-gc, wasmtime):

  Total alloc bytes : 145.13 MB ->  107.37 MB   (-26.0 %)
  Total #allocs     :  13.15 M  ->    9.85 M    (-25.1 %)
  StringView::view  :  32.62 MB ->    7.44 MB   (-77.2 %)
  String::view      :  12.59 MB ->   0          (gone)

Adds a focused `test "lex_skip_whitespace"` covering the empty-input,
all-whitespace, leading-whitespace, no-leading-whitespace, and
non-ASCII (U+00A0 NBSP must NOT count as JSON whitespace per
RFC 8259) cases.

Reproduction: github.com/mizchi/pprof-mbt (moon-pprof memprofile +
summary --diff).

Tests: moon test passes (6507 / 6507).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>