Replies: 1 comment 1 reply
Learned a lot, and also spotted a minor link error - PR 10377 should point to swc-project/swc#10377. :)
Original Chinese Version
Background
Recently both Rspack and Rslint ran into bugs that stemmed from misusing their lexers. Those incidents exposed a few interesting quirks in JavaScript lexers, so I took the opportunity to study them.
- `getTokensFromNode` is wrong when the ast contains template microsoft/typescript-go#1554

Tokens
A lexer’s core job is to split a plain string into a stream of tokens, for example turning `a b c` into `['a', 'b', 'c']`.

However, more complicated cases exist. Consider the input `/a/g`. Which of the following tokenizations should be considered correct?

To some extent, both are valid.
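As a rough sketch of the two candidate splits (the exact token streams shown in the original post are an assumption here, not a quotation), they look roughly like this:

```rust
fn main() {
    // Option A: treat `/` as a punctuator, yielding four separate tokens.
    let option_a = vec!["/", "a", "/", "g"];

    // Option B: treat the whole input as a single regular-expression literal.
    let option_b = vec!["/a/g"];

    println!("A = {option_a:?}\nB = {option_b:?}");
}
```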
These two outputs happen to match the two current ways that swc drives its lexer:

Option A: `lexer` + collect

When you call `lexer.next()` unconditionally outside of the parser, you get result A.

Option B: `lexer` inside the parser

When the parser drives the lexer, swc can collect the tokens produced during parsing via `Capturing`, and you obtain result B.

We see that the two approaches produce different streams, yet both are meaningful. Let’s call the tokens produced by option A lexer tokens, and those produced by option B ECMAScript tokens.
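As a concrete sketch of option A, the lexer can be iterated directly with no parser involved. This is a minimal sketch, and the exact signatures (for example whether `new_source_file` takes a `FileName` or an `Lrc<FileName>`) differ between swc versions:

```rust
use swc_common::{sync::Lrc, FileName, SourceMap};
use swc_ecma_parser::{lexer::Lexer, StringInput, Syntax};

fn main() {
    let cm: Lrc<SourceMap> = Default::default();
    let fm = cm.new_source_file(FileName::Custom("input.js".into()), "/a/g".into());

    // Option A: run the lexer on its own, with no parser driving it.
    let lexer = Lexer::new(
        Syntax::Es(Default::default()),
        Default::default(),
        StringInput::from(&*fm),
        None,
    );

    // The lexer implements Iterator, so this walks the raw lexer tokens.
    for token_and_span in lexer {
        println!("{:?}", token_and_span.token);
    }
}
```

Option B, capturing tokens while the parser runs, is sketched further down in this post.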
Lexer Tokens vs. ECMAScript Tokens
There is no strict definition of lexer tokens, and different parser/lexer implementations vary widely. The same source code can generate very different lexer tokens across tools.
For example, the tokenization of `>>` differs greatly between biome and swc:

- swc turns it into a single `>>` token
- biome emits two separate `>` tokens

This happens because a lexer ultimately serves the parser, and the parsing strategies of swc and biome differ significantly. Biome’s lexer performs some aggressive optimizations for performance. See https://github.com/biomejs/biome/blob/main/crates/biome_js_parser/src/lexer/mod.rs#L11-L14.
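As an illustration of why a lexer might prefer one split over the other (my example, not a statement of biome’s rationale), the same `>>` characters play different roles depending on context:

```rust
fn main() {
    // In expression position, `>>` is a single right-shift operator.
    let shift_expr = "a >> b";

    // In TypeScript type position, the trailing `>>` closes two nested
    // type-argument lists, so it effectively acts as two `>` tokens.
    let nested_generics = "let xs: Array<Array<number>> = [];";

    println!("{shift_expr}\n{nested_generics}");
}
```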
The core goal of lexer tokens is therefore to drive the parser efficiently. Once you leave the parsing context, the behavior of the raw token stream becomes unstable, so it is better not to expose it as a public API.
Even though lexer tokens differ wildly, the ECMAScript tokens obtained when the parser drives the lexer stay consistent: both swc and biome produce `>>`.

Unlike lexer tokens, ECMAScript tokens generated during parsing tend to be similar across tools because they follow a well-defined specification (the lexical grammar of the ECMAScript specification).
This shows that regardless of how the underlying lexer tokenizes the input, the parser layer can produce a fairly uniform token sequence. Most tools therefore offer a way to capture ECMAScript tokens during parsing:

- the `tokens` option
- `syntax_node.token`
- `node.getChildren(sourceFile)` (TypeScript)
- swc’s `Capturing` (a usage sketch follows this list)
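A minimal sketch of the `Capturing` route, adapted from the capturing example in the `swc_ecma_parser` documentation; again, the exact signatures vary between swc versions:

```rust
use swc_common::{sync::Lrc, FileName, SourceMap};
use swc_ecma_parser::{lexer::Lexer, Capturing, Parser, StringInput, Syntax};

fn main() {
    let cm: Lrc<SourceMap> = Default::default();
    let fm = cm.new_source_file(FileName::Custom("input.js".into()), "/a/g".into());

    let lexer = Lexer::new(
        Syntax::Es(Default::default()),
        Default::default(),
        StringInput::from(&*fm),
        None,
    );

    // Wrap the lexer so the tokens consumed by the parser are recorded.
    let capturing = Capturing::new(lexer);
    let mut parser = Parser::new_from(capturing);

    let _script = parser.parse_script().expect("failed to parse");

    // The tokens collected here are the ECMAScript tokens produced while parsing.
    let tokens = parser.input().take();
    println!("{:?}", tokens);
}
```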
Interestingly, each of these parsers records ECMAScript tokens using a completely different approach, and each approach has its own pros and cons. swc’s `Capturing`, for instance, records tokens by intercepting the lexer’s `next` calls.

Relationship Among `swc_ecma_parser::lexer`, `swc_ecma_parser::parser`, `swc_ecma_lexer::lexer`, and `swc_ecma_lexer::parser`
PR 10377 (swc-project/swc#10377) introduced a new implementation of the lexer and parser for performance reasons. That can be confusing for swc users. My understanding is:
- `swc_ecma_lexer` is the legacy module that provides the old parser and lexer. Some community users (primarily Deno) still depend on it, and migrating to the new version would incur costs, so it continues to be maintained.
- `swc_ecma_parser` is the new high-performance parser and lexer. Rspack hopes to switch entirely to this version. For API compatibility, the new implementation mimics the old interfaces, which means the new crates depend on the old ones.

Rspack Bug Analysis
PR 11357 attempted to swap `swc_ecma_parser::lexer` in for `swc_ecma_lexer::lexer` for performance reasons, and all tests passed.

A key difference between the new and old lexers is that the new one no longer guarantees the validity of lexer tokens when run outside the parser (loosening constraints allows more optimizations). Its token stream can contain many `TokenError`s, which is what ultimately triggered Issue 11551. As mentioned earlier, lexer tokens have no formal contract, so producing many `TokenError`s is not considered a bug as long as the parser still yields valid ECMAScript tokens.

The real root cause, then, is that Rspack relied on the unstable lexer tokens for Automatic Semicolon Insertion (ASI) analysis. Although PR 11555 reverted to the old lexer to fix the regression, I do not think that is an ideal solution.
Discussion of Fixes
Rspack should base its analysis on the more stable ECMAScript tokens. Today it can capture them during parsing via `Capturing`, which simultaneously avoids the extra lexer cost and depends on a more stable API. PR 11577 experiments with analyzing parser tokens, and two observations stand out:

- The Rust benchmarks speed up (fix: using tokens from parser to handle asi web-infra-dev/rspack#11577 (comment)), likely because the redundant lexer runs go away.
- Binary size grows noticeably because `Capturing` causes two generic instantiations of the parser (see the toy sketch after this list).
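A toy sketch (not swc’s real types) of why capturing the token stream can double the parser’s monomorphized code:

```rust
// Toy stand-ins for the real swc types; the names are illustrative only.
struct Lexer;
struct Capturing<I>(I);
struct Parser<I> {
    input: I,
}

// Because Parser is generic over its token source, a binary that parses both
// with and without capture ends up with two monomorphized copies of Parser.
fn parse_plain(_p: Parser<Lexer>) {}
fn parse_with_capture(_p: Parser<Capturing<Lexer>>) {}

fn main() {
    parse_plain(Parser { input: Lexer });
    parse_with_capture(Parser {
        input: Capturing(Lexer),
    });
}
```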