Replies: 1 comment 1 reply
Learned a lot, and also spotted a minor link error - PR 10377 should point to swc-project/swc#10377. :)
Original Chinese Version
Background
Recently both Rspack and Rslint ran into bugs that stemmed from misusing their lexers. Those incidents exposed a few interesting quirks in JavaScript lexers, so I took the opportunity to study them.
- `getTokensFromNode` is wrong when the ast contains template microsoft/typescript-go#1554

Tokens
A lexer’s core job is to split a plain string into a stream of tokens, for example turning `a b c` into `['a', 'b', 'c']`.

However, more complicated cases exist. Consider the input `/a/g`. Which of the following tokenizations should be considered correct?

To some extent, both are valid.
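As a rough sketch of the two candidate splits (the exact token streams shown in the original post are an assumption here, not a quotation), they look roughly like this:

```rust
fn main() {
    // Option A: treat `/` as a punctuator, yielding four separate tokens.
    let option_a = vec!["/", "a", "/", "g"];

    // Option B: treat the whole input as a single regular-expression literal.
    let option_b = vec!["/a/g"];

    println!("A = {option_a:?}\nB = {option_b:?}");
}
```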
These two outputs happen to match the two current ways that swc drives its lexer:

Option A: `lexer` + collect

When you call `lexer.next()` unconditionally outside of the parser, you get result A.

Option B: `lexer` inside the parser

When the parser drives the lexer, swc can collect the tokens produced during parsing via `Capturing`, and you obtain result B.

We see that the two approaches produce different streams, yet both are meaningful. Let’s call the tokens produced by option A lexer tokens, and those produced by option B ECMAScript tokens.
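As a concrete sketch of option A, the lexer can be iterated directly with no parser involved. This is a minimal sketch, and the exact signatures (for example whether `new_source_file` takes a `FileName` or an `Lrc<FileName>`) differ between swc versions:

```rust
use swc_common::{sync::Lrc, FileName, SourceMap};
use swc_ecma_parser::{lexer::Lexer, StringInput, Syntax};

fn main() {
    let cm: Lrc<SourceMap> = Default::default();
    let fm = cm.new_source_file(FileName::Custom("input.js".into()), "/a/g".into());

    // Option A: run the lexer on its own, with no parser driving it.
    let lexer = Lexer::new(
        Syntax::Es(Default::default()),
        Default::default(),
        StringInput::from(&*fm),
        None,
    );

    // The lexer implements Iterator, so this walks the raw lexer tokens.
    for token_and_span in lexer {
        println!("{:?}", token_and_span.token);
    }
}
```

Option B, capturing tokens while the parser runs, is sketched further down in this post.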
Lexer Tokens vs. ECMAScript Tokens
There is no strict definition of lexer tokens, and different parser/lexer implementations vary widely. The same source code can generate very different lexer tokens across tools.
For example, the tokenization of `>>` differs greatly between biome and swc:

- swc turns it into a single `>>` token
- biome emits two separate `>` tokens

This happens because a lexer ultimately serves the parser, and the parsing strategies of swc and biome differ significantly. Biome’s lexer performs some aggressive optimizations for performance. See https://github.com/biomejs/biome/blob/main/crates/biome_js_parser/src/lexer/mod.rs#L11-L14.
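As an illustration of why a lexer might prefer one split over the other (my example, not a statement of biome’s rationale), the same `>>` characters play different roles depending on context:

```rust
fn main() {
    // In expression position, `>>` is a single right-shift operator.
    let shift_expr = "a >> b";

    // In TypeScript type position, the trailing `>>` closes two nested
    // type-argument lists, so it effectively acts as two `>` tokens.
    let nested_generics = "let xs: Array<Array<number>> = [];";

    println!("{shift_expr}\n{nested_generics}");
}
```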
The core goal of lexer tokens is therefore to drive the parser efficiently. Once you leave the parsing context, the behavior of the raw token stream becomes unstable, so it is better not to expose it as a public API.
Even though lexer tokens differ wildly, the ECMAScript tokens obtained when the parser drives the lexer stay consistent: both swc and biome produce `>>`.

Unlike lexer tokens, ECMAScript tokens generated during parsing tend to be similar across tools because they follow a well-defined specification (the lexical grammar of the ECMAScript specification).
This shows that regardless of how the underlying lexer tokenizes the input, the parser layer can produce a fairly uniform token sequence. Most tools therefore offer a way to capture ECMAScript tokens during parsing:

- the `tokens` option
- `syntax_node.token`
- `node.getChildren(sourceFile)` (TypeScript)
- swc’s `Capturing` (a usage sketch follows this list)
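A minimal sketch of the `Capturing` route, adapted from the capturing example in the `swc_ecma_parser` documentation; again, the exact signatures vary between swc versions:

```rust
use swc_common::{sync::Lrc, FileName, SourceMap};
use swc_ecma_parser::{lexer::Lexer, Capturing, Parser, StringInput, Syntax};

fn main() {
    let cm: Lrc<SourceMap> = Default::default();
    let fm = cm.new_source_file(FileName::Custom("input.js".into()), "/a/g".into());

    let lexer = Lexer::new(
        Syntax::Es(Default::default()),
        Default::default(),
        StringInput::from(&*fm),
        None,
    );

    // Wrap the lexer so the tokens consumed by the parser are recorded.
    let capturing = Capturing::new(lexer);
    let mut parser = Parser::new_from(capturing);

    let _script = parser.parse_script().expect("failed to parse");

    // The tokens collected here are the ECMAScript tokens produced while parsing.
    let tokens = parser.input().take();
    println!("{:?}", tokens);
}
```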
Interestingly, each of these parsers records ECMAScript tokens using a completely different approach, and each approach has its own pros and cons. swc’s `Capturing`, for instance, records tokens by intercepting the lexer’s `next` calls.

Relationship Among `swc_ecma_parser::lexer`, `swc_ecma_parser::parser`, `swc_ecma_lexer::lexer`, and `swc_ecma_lexer::parser`
PR 10377 (swc-project/swc#10377) introduced a new implementation of the lexer and parser for performance reasons. That can be confusing for swc users. My understanding is:
- `swc_ecma_lexer` is the legacy module that provides the old parser and lexer. Some community users (primarily Deno) still depend on it, and migrating to the new version would incur costs, so it continues to be maintained.
- `swc_ecma_parser` is the new high-performance parser and lexer. Rspack hopes to switch entirely to this version. For API compatibility, the new implementation mimics the old interfaces, which means the new crates depend on the old ones.

Rspack Bug Analysis
PR 11357 attempted to swap `swc_ecma_parser::lexer` in for `swc_ecma_lexer::lexer` for performance reasons, and all tests passed.

A key difference between the new and old lexers is that the new one no longer guarantees the validity of lexer tokens when run outside the parser (loosening constraints allows more optimizations). Its token stream can contain many `TokenError`s, which is what ultimately triggered Issue 11551. As mentioned earlier, lexer tokens have no formal contract, so producing many `TokenError`s is not considered a bug as long as the parser still yields valid ECMAScript tokens.

The real root cause, then, is that Rspack relied on the unstable lexer tokens for Automatic Semicolon Insertion (ASI) analysis. Although PR 11555 reverted to the old lexer to fix the regression, I do not think that is an ideal solution.
Discussion of Fixes
Rspack should base its analysis on the more stable ECMAScript tokens. Today it can capture them during parsing via `Capturing`, which simultaneously avoids the extra lexer cost and depends on a more stable API. PR 11577 experiments with analyzing parser tokens, and two observations stand out:

- The Rust benchmarks speed up (fix: using tokens from parser to handle asi web-infra-dev/rspack#11577 (comment)), likely because the redundant lexer runs go away.
- Binary size grows noticeably because `Capturing` causes two generic instantiations of the parser (see the toy sketch after this list).
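A toy sketch (not swc’s real types) of why capturing the token stream can double the parser’s monomorphized code:

```rust
// Toy stand-ins for the real swc types; the names are illustrative only.
struct Lexer;
struct Capturing<I>(I);
struct Parser<I> {
    input: I,
}

// Because Parser is generic over its token source, a binary that parses both
// with and without capture ends up with two monomorphized copies of Parser.
fn parse_plain(_p: Parser<Lexer>) {}
fn parse_with_capture(_p: Parser<Capturing<Lexer>>) {}

fn main() {
    parse_plain(Parser { input: Lexer });
    parse_with_capture(Parser {
        input: Capturing(Lexer),
    });
}
```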