The previous post, JavaScript Lexer 1: TokenKind, explained how to split tokens. The next question is how to represent a token. A typical token carries at least the following elements:
- TokenKind: the token type, such as EQUALS, SLASH, or REGEX
- TokenText (RawText): the textual representation, for example EQUALS ('='), SLASH ('/'), or REGEX ('/\.js$/g')
- TokenValue: the decoded form of TokenText, in which escape sequences (such as Unicode escapes) are resolved into the host-language representation, e.g. a Rust or Go String
Previously we looked at how different lexers treat TokenKind. Now we will see how various compilers handle TokenText and TokenValue.
It just so happens that Rspack recently ran into two categories of TokenValue bugs (the last three issues fall into the same bucket).
The first challenge with TokenValue is how to store it. JavaScript strings use UTF-16, whereas Rust and Go use UTF-8 (UTF-8 was in fact invented by one of Go's creators). Translating between the two encodings introduces format conversion issues. Before we dive deeper, we need to clarify a few core concepts about strings.
Byte, Char, and String
These three notions describe strings from different angles:
- String: loosely speaking, a string is a sequence of characters. (This is a simplification—different languages have different expectations. Some treat strings as arbitrary byte streams. Here we follow the "sequence of characters" interpretation.) The string "我爱🦀" corresponds to the character sequence "我", "爱", and "🦀".
- Character: a single textual symbol such as 我, 爱, or 🦀. A character is an abstract entity: it can be rendered in different fonts or systems and can even represent control characters. To communicate unambiguously, we assign each character an identifier—its code point. (Again this is a simplification; a Unicode character may be represented by multiple code points when grapheme clusters are involved.) Today the most widely used character encoding system is Unicode, though others exist (GBK, Big5, Latin-1, etc.).
Strings and characters are concepts independent of computers and programs. Even in contexts such as telegraphy we can encode characters (for example in Morse code) and transmit strings.
We can inspect the Unicode code point of each character in JavaScript via codePointAt (unless otherwise noted, all encoding discussions assume Unicode):
| Char | Code point |
| ---- | ---------- |
| 我   | 25105      |
| 爱   | 29233      |
| 🦀   | 129408     |
Another easy-to-confuse pair is code point versus code unit:
- Code point: the abstract Unicode character number, independent of a particular encoding. For instance the crab emoji 🦀 has code point 129408 (retrieved via '🦀'.codePointAt(0)).
- Code unit: the smallest storage unit in a specific encoding (which varies between UTF-8, UTF-16, UTF-32, etc.). In UTF-16 the same 🦀 requires two code units: [\uD83E, \uDD80] (accessible via '🦀'[0] and '🦀'[1]).
A given code point may map to different numbers of code units depending on the encoding.
- Byte: essentially an unsigned 8-bit value. Turning code points into sequences of u8 values is the process of encoding. UTF-8 is the mainstream encoding today (used by Go and Rust), but many others exist, such as UTF-16 (JavaScript) and UTF-32. We can examine the results via Buffer.from('我爱🦀').
Notice that Buffer.from('爱') and Buffer.from('🦀') have different lengths because UTF-8 is variable-width.
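To see all three levels at once, here is a small Rust counterpart to the Buffer.from experiment:

```rust
fn main() {
    let s = "我爱🦀";
    // Code points: three characters.
    assert_eq!(s.chars().count(), 3);
    // UTF-16 code units: 我 and 爱 take one each, 🦀 takes a surrogate pair.
    assert_eq!(s.encode_utf16().count(), 4);
    // UTF-8 bytes: 3 + 3 + 4, because UTF-8 is variable-width.
    assert_eq!(s.len(), 10);
    assert_eq!("爱".len(), 3);
    assert_eq!("🦀".len(), 4);
}
```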
Another pair of related concepts is the character encoding system versus the character encoding scheme. The former maps abstract characters to numerical identifiers (e.g. Unicode code points) while the latter maps code points onto concrete storage representations (bytes).
Representations in Programming Languages
Most languages expose data structures for bytes, characters, and strings. For example, Rust has u8, char, and String, and Go has byte, rune, and string.
The C language is a bit special. The language and its standard library do not define String or Char. What we typically call a C string is a byte array, and a char is effectively an ASCII byte. Encoding and decoding are left to the programmer.
Even though C does not provide a UTF-8 string type, it supports UTF-8 string literals such as const char *string = "我爱🦀";.
Source File Encoding
How does a compiler handle strings in the source file?
```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main() {
    setlocale(LC_ALL, "");
    const char *string = "我爱🦀"; // Is this UTF-8 or UTF-16?
    printf("string: %s\n", string);
    // Print the byte array
    printf("bytes: [");
    size_t slen = strlen(string);
    for (size_t i = 0; i < slen; ++i) {
        if (i > 0) printf(", ");
        printf("%d", (unsigned char)string[i]);
    }
    printf("]\n");
    return 0;
}
```
It is easy to confuse how a source file stores text with how a compiler interprets string literals (their runtime semantics). The two are independent.
Text editors usually let you choose the encoding. Most default to UTF-8 and allow switching to others.
For example, VS Code encodes files as UTF-8 by default, but JavaScript strings are UTF-16. The JavaScript engine reads UTF-8 source text and converts it internally into UTF-16 strings.
Most languages require source text to be UTF-8. If your source file uses another encoding you generally have to convert it yourself or instruct the compiler to do so. GCC and Clang provide -finput-charset for this purpose.
C is a bit different: it does not mandate a specific encoding for string literals. Instead it offers specialized literals for UTF-8 and other encodings:
```c
char     s1[] = "a猫🍌";   // depends on -fexec-charset
char     s2[] = u8"a猫🍌"; // UTF-8 string literal
char16_t s3[] = u"a猫🍌";  // prior to C23 an unspecified 16-bit encoding; since C23, UTF-16
char32_t s4[] = U"a猫🍌";  // prior to C23 an unspecified 32-bit encoding; since C23, UTF-32
```
Two GCC flags highlight the distinction between textual and runtime encodings. -finput-charset tells the compiler how the source file is encoded (e.g. UTF-16) so it can convert it to the internal encoding (often UTF-8). -fexec-charset tells the compiler which encoding to use for string literals at runtime. For instance, if const char* s = "我爱🦀"; is compiled with -fexec-charset=utf16, s will store UTF-16 code units.
Escape Sequences
Converting between UTF-16 and UTF-8 is usually straightforward: decode bytes into a code point sequence and re-encode. (Ignoring performance for the moment.)
```rust
// This is intentionally inefficient; it only illustrates the conversion steps.
fn utf8_to_utf16(utf8_buffer: Vec<u8>) -> Vec<u16> {
    // 1. Decode the UTF-8 bytes into a sequence of code points.
    let codepoints = String::from_utf8(utf8_buffer).expect("invalid UTF-8");
    // 2. Re-encode each code point as UTF-16 code units.
    let utf16_buffer: Vec<u16> = codepoints.encode_utf16().collect();
    utf16_buffer
}
```
Because most strings are language-agnostic, the conversion itself is trivial. The tricky bit is that most languages allow Unicode escape sequences inside string literals, and each language defines them differently.
To repeat the caution: unlike ordinary character representations, Unicode escape sequences are language features and vary greatly across languages. Be very careful when exchanging strings between languages.
Also remember that textual representations (such as 🦀, \u{1F980}, or \uD83E\uDD80) are distinct from runtime semantics. The same code point may have many textual forms.
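The same point in Rust, where both spellings below denote one runtime string:

```rust
fn main() {
    // Two textual forms in source; one identical runtime value.
    assert_eq!("\u{1F980}", "🦀");
}
```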
UTF-8 vs. UTF-16
A JavaScript compiler must frequently convert UTF-8 source text into UTF-16 strings required at runtime.
Surrogate Pairs
Both UTF-8 and UTF-16 are variable-width encodings. Code points beyond U+FFFF require additional storage. UTF-16 represents these via surrogate pairs: a high surrogate in U+D800–U+DBFF followed by a low surrogate in U+DC00–U+DFFF. The code point is recovered as: code_point = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00).
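A quick Rust check of the formula, recovering 🦀 from its surrogate pair:

```rust
fn main() {
    let (high, low): (u32, u32) = (0xD83E, 0xDD80);
    let code_point = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00);
    assert_eq!(code_point, 0x1F980);
    assert_eq!(char::from_u32(code_point), Some('🦀'));
}
```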
Surrogates must appear in pairs. A solitary high or low surrogate is invalid, and we call such a code unit a lone surrogate.
Unfortunately, while lone surrogates are invalid in Unicode, JavaScript strings allow them:

```js
let str = "\uD800"; // legal JavaScript: a string containing a single lone surrogate
```

Rust, on the other hand, rejects them:

```rust
let s = "\u{D800}"; // compile error: surrogate code points are not allowed in escapes
```
This leads to a direct problem: JavaScript strings cannot be losslessly represented as Rust String. Any API built on such a conversion is unsound: Rust converts unsupported sequences to U+FFFD, so strings that differ in JavaScript may become equal in Rust.
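A minimal demonstration of the lossiness, using std's String::from_utf16_lossy as a stand-in for whatever conversion API sits at the boundary:

```rust
fn main() {
    // Two JavaScript strings that differ: "\uD800" vs. "\uD801".
    let a = String::from_utf16_lossy(&[0xD800]);
    let b = String::from_utf16_lossy(&[0xD801]);
    // Both lone surrogates collapse to U+FFFD, so the Rust strings are equal.
    assert_eq!(a, b);
    assert_eq!(a, "\u{FFFD}");
}
```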
Conclusion: when lone surrogates matter, storing JavaScript strings as Rust String is a bad idea (there are hacky workarounds, but we will skip them here). wasm-bindgen exposes a JsString type for bridging Rust and JavaScript strings and documents the pitfalls: https://wasm-bindgen.github.io/wasm-bindgen/reference/types/str.html#utf-16-vs-utf-8
Context-Sensitive Unicode Escape Sequences
Not every position in JavaScript source treats \u sequences as escapes; different syntactic contexts use different rules. Consider the difference between raw strings and ordinary string literals: many languages support a notion of raw string in which escape sequences such as \u12 are treated as plain text (the four characters \, u, 1, 2).
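A small Rust sketch of the distinction:

```rust
fn main() {
    // Ordinary literal: \u{1F980} is an escape denoting a single code point.
    assert_eq!("\u{1F980}".chars().count(), 1);
    // Raw literal: the same nine characters are taken verbatim.
    assert_eq!(r"\u{1F980}".chars().count(), 9);
}
```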
Rspack Bug Analysis
Emoji Paths Not Supported
This bug is amusing—it is the result of several bugs stacking together. Reproduction: https://github.com/hardfist/rspress-emoji-bug
We found that the original file 🦀.md was transformed into \uD83E\uDD80.md. Investigation showed that swc-loader performed the conversion because it defaulted to jsc.output.charset=ascii, rewriting every non-ASCII UTF-16 string into ASCII plus Unicode escape sequences. This is usually safe, but SWC has a bug when handling Unicode escape sequences: the AST visitor received a TokenValue equal to the escaped form (\uD83E\uDD80), so \uD83E\uDD80.md != 🦀.md, causing path lookups to fail.
Later, web-infra-dev/rspack#11568 added support for jsc.output.charset=utf8, which incidentally fixed the Rspress bug and restored emoji paths. In that configuration the Rust side receives the original 🦀 from the AST visitor, so 🦀 == 🦀 and the path matches.
The key takeaway is that string equality compares underlying code points, not the textual representation.
```js
'🦀' == '\uD83E\uDD80'             // true: two textual forms of the same code point; both have length 2
'\uD83E\uDD80'.length              // 2 code units
[...'\uD83E\uDD80']                // ['🦀']
'\\uD83E\\uDD80' == '\uD83E\uDD80' // false: the first is literal text, the second is an escape sequence
'\\uD83E\\uDD80'.length            // 12
[...'\\uD83E\\uDD80']              // ['\\', 'u', 'D', '8', '3', 'E', '\\', 'u', 'D', 'D', '8', '0']
```
The remaining SWC issues stem from the same root cause—bugs in escape-sequence handling.
How JavaScript Compilers Handle Strings
What a parser does with strings depends on its goals. Because the same code point can have multiple textual forms (e.g. 🦀, \uD83E\uDD80, \u{1f980}), code generators must decide what to emit, and different tools make different choices; SWC, for example, exposes a charset option with ascii and utf8 modes.
Transformers and minifiers do not necessarily need to keep the original spelling, but formatters do. Supporting verbatim output is harder because the parser must store not only code points (TokenValue) but also the original text (TokenText). Raw text is not part of the ESTree specification (see estree/estree#291). ASTs that carry raw text can be seen as "extended ASTs"; concrete syntax trees (CSTs) are another example.
Boa, V8, QuickJS, Esbuild
These engines take similar approaches: since a Rust String cannot hold lone surrogates, they store UTF-16 code units directly, for example as Vec<u16> (code points above U+FFFF occupy two adjacent u16 values). Taking Boa as an example, its lexer stores string literals as Vec<u16> while recording whether escape sequences were present (https://github.com/boa-dev/boa/blob/44de1e64850fdd07881ec83fb998bd6b7f516b65/core/parser/src/lexer/string.rs#L136):
```rust
/// The string interner for Boa.
#[derive(Debug, Default)]
pub struct Interner {
    utf8_interner: RawInterner<u8>,
    utf16_interner: RawInterner<u16>,
}
```
Provide UTF-8 and UTF-16 accessors. Since some interned strings cannot be represented in UTF-8, the UTF-8 accessor returns Option. (An utf8_lossy helper could be added for callers that do not require strict UTF-8.) Interning also improves performance: many tokens are guaranteed ASCII (e.g. keywords), so storing them as Vec<u8> instead of Vec<u16> saves space.
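A minimal sketch of this two-sided storage idea; the names below are illustrative, not Boa's actual API:

```rust
// ASCII-only tokens (keywords, most identifiers) fit in the u8 side;
// everything else is kept as UTF-16 code units.
enum Interned {
    Ascii(Vec<u8>),
    Utf16(Vec<u16>),
}

impl Interned {
    /// UTF-8 view; `None` when the string contains lone surrogates.
    fn as_utf8(&self) -> Option<String> {
        match self {
            Interned::Ascii(bytes) => String::from_utf8(bytes.clone()).ok(),
            Interned::Utf16(units) => String::from_utf16(units).ok(),
        }
    }
}

fn main() {
    let ok = Interned::Utf16("🦀".encode_utf16().collect());
    assert_eq!(ok.as_utf8().as_deref(), Some("🦀"));
    let lone = Interned::Utf16(vec![0xD800]); // lone surrogate
    assert_eq!(lone.as_utf8(), None);
}
```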
TokenKind::StringLiteral stores both the interned value and whether it contained escape sequences.
tsgo
tsgo appears not to support lone surrogates. It stores JavaScript strings as Go strings, so lone surrogates cause issues (microsoft/typescript-go#1701). Interestingly, its printer still works because it emits string literals from slices of the source file (sourceFile[node.start:node.end]), bypassing token values: https://github.com/microsoft/typescript-go/blob/0216862d44c9b14717b7400818cf300f99ec5d1f/internal/scanner/utilities.go#L31
TypeScript
Because the parser is written in JavaScript, it can store TokenValue as plain JavaScript strings, which naturally support all JavaScript string semantics: https://github.com/microsoft/TypeScript/blob/b504a1eed45e35b5f54694a1e0a09f35d0a5663c/src/compiler/scanner.ts#L1707
Biome
Biome behaves differently. During parsing it does not resolve token text; it records only the token range, and token.text is a lazy getter that reads the original text when needed. Biome exposes only token_text, not TokenValue. Consumers parse the escape sequences themselves; treating escapes as raw text avoids the issue entirely.
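A sketch of the range-only approach; the types below are illustrative, not Biome's real ones:

```rust
// The lexer records only byte ranges; text is sliced out on demand.
struct Token {
    start: usize,
    end: usize,
}

impl Token {
    /// Lazily reads the raw text; the lexer never decodes escapes.
    fn text<'a>(&self, source: &'a str) -> &'a str {
        &source[self.start..self.end]
    }
}

fn main() {
    let source = r#"let s = "\uD83E\uDD80";"#;
    let tok = Token { start: 8, end: 22 };
    assert_eq!(tok.text(source), r#""\uD83E\uDD80""#);
}
```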
This raises the question: does the parser even need to compute TokenValue? Could SWC skip it and keep only TokenText?
OXC
OXC takes another path: it stores JavaScript strings in Rust String, but encodes lone surrogates specially. Before converting to Rust String, it rewrites lone surrogates—for example, \uD800 becomes \u{FFFD}d800—to prevent Rust from replacing it with U+FFFD. Consumers must explicitly decode \u{FFFD}d800 back to \uD800. To tell user-supplied \u{FFFD} apart from encoded placeholders, OXC encodes the former as \u{FFFD}fffd. See oxc-project/oxc#10041.
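A sketch of that substitution scheme; this illustrates the idea, not OXC's actual implementation:

```rust
// Encode one UTF-16 code unit into the escaped UTF-8 form:
// lone surrogates and a literal U+FFFD become "\u{FFFD}" + 4 hex digits.
fn encode_unit(unit: u16) -> String {
    match unit {
        0xD800..=0xDFFF | 0xFFFD => format!("\u{FFFD}{:04x}", unit),
        _ => char::from_u32(unit as u32).unwrap().to_string(),
    }
}

fn main() {
    assert_eq!(encode_unit(0xD800), "\u{FFFD}d800"); // lone surrogate
    assert_eq!(encode_unit(0xFFFD), "\u{FFFD}fffd"); // user-written U+FFFD
    assert_eq!(encode_unit(0x7A), "z");              // ordinary unit passes through
}
```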
This approach has drawbacks: AST visitors in plugins must be aware of the encoding scheme and decode strings explicitly, otherwise they observe different values than the original source—an extra burden for plugin authors.
SWC
SWC actually has two related but distinct problems:
- It needlessly escapes strings such as \uD800, changing their semantics.
- It stores JavaScript strings as Rust String, so lone surrogates are unsupported.
Possible Fixes
Because JavaScript and Rust strings have fundamentally different invariants, treating a JavaScript string as String inside Rspack or SWC is risky and hard to debug. Ideally we should wrap JavaScript strings in a dedicated type (say JsString) as Nova does.
- Nova: converts the parser's encoded Rust string into EcmaScript::String, which stores WTF-8 internally.
- wasm-bindgen: converts JavaScript strings into its JsString type right away and keeps them in that form (see wasm-bindgen/wasm-bindgen#1348, "Incorrect handling of unpaired surrogates in JS strings").
WTF-8
WTF-8 can be viewed as a superset of UTF-8. When a string contains only valid Unicode code points, the two encodings produce identical byte sequences. The difference is that UTF-8 refuses to encode invalid code points, whereas WTF-8 encodes them as though they were valid. This makes WTF-8 an excellent ABI-compatible superset of UTF-8.
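To make this concrete, here is a toy generalized encoder; wtf8_encode is a name invented for this sketch:

```rust
// Encode any code point, surrogate or not, with the ordinary UTF-8 bit
// pattern. For U+D800 this yields [0xED, 0xA0, 0x80], which strict UTF-8
// would reject but WTF-8 accepts.
fn wtf8_encode(code_point: u32) -> Vec<u8> {
    match code_point {
        0..=0x7F => vec![code_point as u8],
        0x80..=0x7FF => vec![
            0xC0 | (code_point >> 6) as u8,
            0x80 | (code_point & 0x3F) as u8,
        ],
        0x800..=0xFFFF => vec![
            0xE0 | (code_point >> 12) as u8,
            0x80 | ((code_point >> 6) & 0x3F) as u8,
            0x80 | (code_point & 0x3F) as u8,
        ],
        _ => vec![
            0xF0 | (code_point >> 18) as u8,
            0x80 | ((code_point >> 12) & 0x3F) as u8,
            0x80 | ((code_point >> 6) & 0x3F) as u8,
            0x80 | (code_point & 0x3F) as u8,
        ],
    }
}

fn main() {
    assert_eq!(wtf8_encode(0xD800), [0xED, 0xA0, 0x80]);      // lone surrogate: fine in WTF-8
    assert_eq!(wtf8_encode('🦀' as u32), "🦀".as_bytes());    // valid code points match UTF-8
    assert!(String::from_utf8(wtf8_encode(0xD800)).is_err()); // strict UTF-8 rejects it
}
```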
A Possible Fix
Our ideal fix should meet the following requirements:
- Preserve ABI stability so existing AST serialization stays unchanged.
- Preserve API stability so callers need no changes.
One idea is to back Atom with WTF8Buf:
- Strings without lone surrogates remain identical, preserving the ABI.
- Strings with lone surrogates retain their original code point information losslessly.
We also considered changing as_str to:
- return lossless UTF-8 when no lone surrogates are present;
- replace invalid code points with \u{FFFD} otherwise, matching JavaScript semantics.
That would keep most APIs working unchanged for the common case.
Instead we can add an as_wtf8 accessor. Callers that need WTF-8 (identifiers, string literals, template strings) call as_wtf8 rather than as_str.
Drawbacks
We lose some type safety: consumers must know when to treat an atom as WTF-8 or UTF-8. Currently only string literals, identifiers, and templates need WTF-8. If more cases arise, we can migrate as_str callers gradually.
We could further distinguish the exposed value types in the AST: identifiers, string literals, and templates would return WTF8Atom, while other nodes return Atom. This enforces the distinction at the type level, preventing visitor mistakes. It would be an API-breaking change but not ABI-breaking.
```rust
// Illustrative pseudocode: it sketches the API shape, not compilable as-is.
struct Atom {
    buf: WTF8Buf,
}

struct WTF8Atom {
    buf: WTF8Buf,
}

// Today: every node exposes Atom, so a visitor can trip over WTF-8 content.
impl Visitor for CollectVisitor {
    fn string_literal(node) {
        let token_value = node.value(); // Atom -> WTF-8 buffer
        let chars = token_value.chars().collect(); // panics on a lone surrogate
    }
}

// With the split: string-like nodes return WTF8Atom, other nodes return Atom,
// so misuse is caught at the type level.
impl Visitor for CollectVisitor {
    fn string_literal(node) {
        let token_value = node.value(); // WTF8Atom
        let chars = token_value.chars().collect();
    }

    fn function_decl(node) {
        let token_value = node.value(); // Atom
        let chars = token_value.chars().collect();
    }
}
```
Cross-Language String Handling
napi-rs & napi
At first glance, converting between JavaScript and Rust strings with napi-rs seems easy because napi encapsulates the complexity: the Rust binding sys::napi_get_value_string_utf8 ultimately calls the C API napi_get_value_string_utf8, which automatically replaces lone surrogates.
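For illustration, a minimal hypothetical napi-rs export (the function and its name are invented for this sketch); by the time Rust sees the argument, any lone surrogate has already been replaced:

```rust
use napi_derive::napi;

// Hypothetical binding: `p` is produced by napi's UTF-8 conversion, so a
// JS string like "\uD800.md" reaches Rust as "\u{FFFD}.md"; the original
// code unit is unrecoverable from here.
#[napi]
pub fn file_name_len(p: String) -> u32 {
    p.chars().count() as u32
}
```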
Grapheme Clusters and Unicode Normalization Forms
This topic is tangential to lexers and parsers but still interesting. For a detailed treatment see https://go.dev/blog/normalization.
Token Span
Alongside TokenValue we also have TokenSpan. What does a span measure?
- Byte offset?
- Code unit offset?
- Code point offset?
Unfortunately compilers disagree and there is no standard. (ESTree does not specify the unit for locations: estree/estree#80.) Converting between span formats is costly. Rspack and SWC currently use spans for:
- SWC's own source maps
- rspack-source source maps
- rspack-source edit operations
SWC measures spans in BytePos (byte offsets). Babel, Acorn, and similar tools use code unit offsets. If Webpack were to switch to SWC as its parser, it would have to translate byte offsets into code unit offsets.
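A small sketch of why the unit matters: the same token sits at three different offsets depending on how you count.

```rust
fn main() {
    let source = "🦀 = 1";
    // Byte offset (UTF-8): 🦀 occupies 4 bytes, the space 1, so '=' is at byte 5.
    assert_eq!(source.find('='), Some(5));
    // Code point offset: 🦀 is one code point, so '=' is at index 2.
    assert_eq!(source.chars().position(|c| c == '='), Some(2));
    // UTF-16 code unit offset: 🦀 is two code units, so '=' is at index 3.
    assert_eq!(source.encode_utf16().position(|u| u == '=' as u16), Some(3));
}
```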