Javascript Compiler: Lexer Token #2

hardfist · 2025-10-11T12:40:31Z

hardfist
Oct 11, 2025
Maintainer

Background

Rspack and Rslint both recently hit bugs caused by misusing a lexer. The bugs expose some interesting properties of JavaScript lexers, so this is a good occasion to look at how lexers work.

Rspack issue: [Bug]: v1.5 compiled code reports an error when running, the same code compiled with v1.4 does not have this problem web-infra-dev/rspack#11551
Rslint issue (tsgo issue): getTokensFromNode is wrong when the ast contains template microsoft/typescript-go#1554

tokens

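The core job of a lexer is to split source text into a token stream, for example splitting a b c into ['a', 'b', 'c'].

There are more complicated cases, though. Consider the following input:

/a/g. Which of the two token sequences below is the correct one?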

A

[
  {Kind: SLASH, value: '/'},
  {Kind: IDENTIFIER, value: 'a'},
  {Kind: SLASH, value: '/'},
  {Kind: IDENTIFIER, value: 'g'}
]

B

[
  {Kind: REGEXLITERAL, value: '/a/g'}
]

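To some extent, both are correct.

They correspond to the different results produced by the two ways of calling the lexer in swc today.

A: lexer + collect

Calling lexer.next() unconditionally in a loop, without a parser, produces result A above.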

let lexer = Lexer::new(syntax, Default::default(), StringInput::from(&*fm), None);
let token1: Vec<_> = lexer.clone().collect();

B: lexer in parser

That is, the tokens produced when the parser drives the lexer. In swc, Capturing collects the tokens seen during parsing, and the result is B above.

let lexer = Lexer::new(syntax, Default::default(), StringInput::from(&*fm), None);
let capturing = input::Capturing::new(lexer);
let mut parser = parser::Parser::new_from(capturing);
let _ = parser.parse_module()?;
let tokens = parser.input_mut().iter_mut().take();

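The two approaches produce different results, yet both are meaningful, and neither can be called wrong. For the rest of this post, the tokens produced by approach A are called lexer tokens and the tokens produced by approach B are called ECMAScript tokens.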
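
The difference between A and B comes down to whether the lexer knows what the parser expects next. The following is a minimal, self-contained Rust sketch of that idea, not swc's implementation: Token, Goal, and tokenize are hypothetical names, and the toy grammar only knows identifiers, /, and escape-free regex bodies.

```rust
#[derive(Debug, PartialEq)]
enum Token {
    Ident(String),
    Slash,
    Regex(String),
}

/// What the caller expects next. A stand-alone token dump has no such
/// information; a parser does. (Hypothetical type for illustration.)
#[derive(Clone, Copy)]
enum Goal {
    /// Treat every `/` as punctuation.
    Div,
    /// Allow `/` to start a regex wherever an operand is expected.
    RegExp,
}

fn tokenize(src: &str, goal: Goal) -> Vec<Token> {
    let chars: Vec<char> = src.chars().collect();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        if c.is_alphanumeric() || c == '_' {
            let start = i;
            while i < chars.len() && (chars[i].is_alphanumeric() || chars[i] == '_') {
                i += 1;
            }
            tokens.push(Token::Ident(chars[start..i].iter().collect()));
        } else if c == '/' {
            // In this toy grammar an operand is expected at the start of the
            // input or right after a `/` operator, never after an operand.
            let operand_expected =
                !matches!(tokens.last(), Some(Token::Ident(_) | Token::Regex(_)));
            // Offset of the closing `/`, if any (no escapes or classes handled).
            let closing = chars[i + 1..].iter().position(|&ch| ch == '/');
            match (goal, operand_expected, closing) {
                (Goal::RegExp, true, Some(offset)) => {
                    let mut end = i + 1 + offset + 1; // one past the closing `/`
                    while end < chars.len() && chars[end].is_alphabetic() {
                        end += 1; // regex flags such as `g`
                    }
                    tokens.push(Token::Regex(chars[i..end].iter().collect()));
                    i = end;
                }
                _ => {
                    tokens.push(Token::Slash);
                    i += 1;
                }
            }
        } else {
            i += 1; // whitespace and anything this toy does not model
        }
    }
    tokens
}
```

With Goal::Div the input /a/g comes out as result A (four tokens), and with Goal::RegExp it comes out as result B (a single regex token), while x / a / g stays a chain of divisions under either goal.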

lexer tokens vs ECMAScript tokens

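Lexer tokens have no strict definition. Lexer implementations differ widely from parser to parser, so the same code can produce very different lexer tokens.

For example, >> is tokenized very differently by biome and swc.

swc: tokenized as a single >> token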

[
    TokenAndSpan {
        token: >>,
        had_line_break: true,
        span: 1..3,
    },
]

biome: tokenized as two > tokens

R_ANGLE@0..1 ">" [] [],
R_ANGLE@1..2 ">" [] [],

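This is because a lexer exists to serve the parser, and the parsers in swc and biome are implemented quite differently. biome's lexer also applies some fairly advanced performance optimizations.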

https://github.com/biomejs/biome/blob/main/crates/biome_js_parser/src/lexer/mod.rs#L11-L14

//! An extremely fast, lookup table based, ECMAScript lexer which yields SyntaxKind tokens used by the rome-js parser.
//! For the purposes of error recovery, tokens may have an error attached to them, which is reflected in the Iterator Item.
//! The lexer will also yield `COMMENT` and `WHITESPACE` tokens.
//!
//! The lexer operates on raw bytes to take full advantage of lookup table optimizations, these bytes **must** be valid utf8,
//! therefore making a lexer from a `&[u8]` is unsafe since you must make sure the bytes are valid utf8.
//! Do not use this to learn how to lex JavaScript, this is just needlessly fast and demonic because i can't control myself :)
//!
//! basic ANSI syntax highlighting is also offered through the `highlight` feature.
//!
//! # Warning ⚠️
//!
//! `>>` and `>>>` are not emitted as single tokens, they are emitted as multiple `>` tokens. This is because of
//! TypeScript parsing and productions such as `T<U<N>>`

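So the core goal of lexer tokens is to drive parsing as efficiently as possible. The token stream produced outside a parser is generally not stable, and it is best not to treat it as a public API.

Although the lexer tokens differ a lot, the ECMAScript tokens (the tokens produced when the lexer is called from the parser) are the same in both tools: each yields >>.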
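
One way to picture how a lexer that only emits > can still lead to a >> token at the parser level is a merge step that the parser applies in expression position. The sketch below is an illustration under that assumption, not biome's or swc's actual code; Tok, Kind, lex_angles, and merge_in_expression are made-up names.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Gt,   // >
    Shr,  // >>
    UShr, // >>>
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Tok {
    kind: Kind,
    start: usize,
    end: usize,
}

/// Lexer side: every `>` is emitted on its own; other characters are ignored
/// to keep the sketch short.
fn lex_angles(src: &str) -> Vec<Tok> {
    src.char_indices()
        .filter(|&(_, c)| c == '>')
        .map(|(i, _)| Tok { kind: Kind::Gt, start: i, end: i + 1 })
        .collect()
}

/// Parser side, expression position: glue up to three directly adjacent `>`
/// tokens into a single operator token.
fn merge_in_expression(tokens: &[Tok]) -> Vec<Tok> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        let mut j = i;
        // Extend the run while the next `>` starts exactly where this one ends.
        while j + 1 < tokens.len() && j + 1 - i < 3 && tokens[j + 1].start == tokens[j].end {
            j += 1;
        }
        let kind = match j - i {
            0 => Kind::Gt,
            1 => Kind::Shr,
            _ => Kind::UShr,
        };
        out.push(Tok { kind, start: tokens[i].start, end: tokens[j].end });
        i = j + 1;
    }
    out
}
```

In a type-argument position such as T<U<N>>, a parser would skip the merge and consume each > on its own, which is the case the biome comment above describes.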

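Unlike lexer tokens, which follow no specification, the ECMAScript tokens that most tools produce during parsing are close to one another and consistent, because they have a fairly strict definition: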

Input elements other than white space and comments form the terminal symbols for the syntactic grammar for ECMAScript and are called ECMAScript tokens. These tokens are the reserved words, identifiers, literals, and punctuators of the ECMAScript language. Moreover, line terminators, although not considered to be tokens, also become part of the stream of input elements and guide the process of automatic semicolon insertion (12.10). Simple white space and single-line comments are discarded and do not appear in the stream of input elements for the syntactic grammar. A MultiLineComment (that is, a comment of the form /* … */ regardless of whether it spans more than one line) is likewise simply discarded if it contains no line terminator; but if a MultiLineComment contains one or more line terminators, then it is replaced by a single line terminator, which becomes part of the stream of input elements for the syntactic grammar. (Source: ECMAScript specification, "ECMAScript tokens")

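This shows that however much the underlying tokenization differs between lexers, a fairly consistent token sequence can still be produced at the parser level. Most tools today can collect ECMAScript tokens during parsing:

babel: the tokens option
biome: syntax_node.token
typescript: node.getChildren(sourceFile);
swc: Capturing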

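Interestingly, these four parsers take completely different approaches to producing ECMAScript tokens, each with its own trade-offs:

babel: records token information during parsing. Performance is good, and the only cost is one extra parse option.
biome: uses a CST, whose nodes carry the token information themselves.
typescript: uses an AST and stores no token information, so every call has to re-lex the sourceFile text within the node's span to produce tokens. This is the slowest approach, but the API is mainly used by the language service and rarely in bulk, so the impact is small. Avoid calling it heavily in compilation scenarios.
swc: decorates the lexer and intercepts next calls to collect token information.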

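The relationship between swc_ecma_parser::lexer, swc_ecma_parser::parser, swc_ecma_lexer::lexer, and swc_ecma_lexer::parser

PR 10377 introduced a new lexer and parser implementation for performance reasons, which can make the module layout harder for swc users to follow. As I understand it:

swc_ecma_lexer is a legacy module containing the old parser and lexer implementation. Some community users (mainly deno) depend on these two modules and migrating them to the new implementation takes time, so it is still maintained.
swc_ecma_parser is the new, high-performance parser and lexer implementation, and Rspack hopes to switch to it completely in the future. For API compatibility, the new implementation implements the interfaces of the old one, which is why the new crate depends on the old crate.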

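Rspack bug analysis

PR 11357 tried to replace swc_ecma_lexer::lexer with swc_ecma_parser::lexer for performance, and it passed the test cases.

A core difference between the new lexer and the old one is that the new lexer no longer guarantees valid lexer tokens when it runs without a parser (fewer constraints leave more room for optimization). Its tokens may contain many TokenError entries, and this is what caused issue 11551. As noted earlier, lexer tokens follow no specification, so a token stream with many TokenError entries is not a bug in the new lexer: it only has to guarantee that the ECMAScript tokens produced through the parser are valid.

The root cause of this bug is therefore that Rspack relied on unstable lexer tokens for its automatic semicolon insertion (ASI) analysis. PR 11555 fixed the bug by reverting to the old lexer, but I do not think this fix is entirely sound either:

Creating a second lexer adds overhead.
The old lexer's tokens are stable to some degree, but they are still not suitable as a public API. There is no guarantee that they always come out the same, so changes there can still affect Rspack.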

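Discussion of the fix

Rspack should base its analysis on the more stable ECMAScript tokens. In Rspack, Capturing can capture the tokens produced during parsing, which avoids the duplicate lexing cost and relies on a more stable API. PR 11577 tries analyzing with parser tokens, and two things can be observed:

The Rust benchmark shows a performance improvement: https://github.com/web-infra-dev/rspack/pull/11577#issuecomment-3252033003, probably because the duplicate lexer pass is gone.

Binary size increases significantly, because using Capturing produces two generic instantiations of Parser.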
