The previous post, JavaScript Lexer 1: TokenKind, explained how to split tokens. The next question is how to represent a token. A typical token carries at least the following elements:
- TokenKind: the token type, such as EQUALS, SLASH, or REGEX
- TokenText (RawText): the textual representation, for example EQUALS ('='), SLASH ('/'), or REGEX ('/\.js$/g')
- TokenValue: the decoded form of TokenText, in which escape sequences (such as Unicode escapes) are resolved into the host-language representation, e.g. a Rust or Go String
Previously we looked at how different lexers treat TokenKind. Now we will see how various compilers handle TokenText and TokenValue.
It just so happens that Rspack recently ran into two categories of TokenValue bugs (the last three issues fall into the same bucket).
The first challenge with TokenValue is how to store it. JavaScript strings use UTF-16, whereas Rust and Go use UTF-8 (UTF-8 was in fact invented by one of Go's creators). Translating between the two encodings introduces format conversion issues. Before we dive deeper, we need to clarify a few core concepts about strings.
Byte, Char, and String
These three notions describe strings from different angles:
- String: loosely speaking, a string is a sequence of characters. (This is a simplification—different languages have different expectations. Some treat strings as arbitrary byte streams. Here we follow the "sequence of characters" interpretation.) The string "我爱🦀" corresponds to the character sequence "我", "爱", and "🦀".
- Character: a single textual symbol such as 我, 爱, or 🦀. A character is an abstract entity: it can be rendered in different fonts or systems and can even represent control characters. To communicate unambiguously, we assign each character an identifier—its code point. (Again this is a simplification; a Unicode character may be represented by multiple code points when grapheme clusters are involved.) Today the most widely used character encoding system is Unicode, though others exist (GBK, Big5, Latin-1, etc.).
Strings and characters are concepts independent of computers and programs. Even in contexts such as telegraphy we can encode characters (for example in Morse code) and transmit strings.
We can inspect the Unicode code point of each character in JavaScript via codePointAt (unless otherwise noted, all encoding discussions assume Unicode):
| Char | Code point |
| ---- | ---------- |
| 我   | 25105      |
| 爱   | 29233      |
| 🦀   | 129408     |
Another easy-to-confuse pair is code point versus code unit:
- Code point: the abstract Unicode character number, independent of a particular encoding. For instance the crab emoji 🦀 has code point 129408 (retrieved via '🦀'.codePointAt(0)).
- Code unit: the smallest storage unit in a specific encoding (which varies between UTF-8, UTF-16, UTF-32, etc.). In UTF-16 the same 🦀 requires two code units: [\uD83E, \uDD80] (accessible via '🦀'[0] and '🦀'[1]).
A given code point may map to different numbers of code units depending on the encoding.
- Byte: essentially an unsigned 8-bit value. Turning code points into sequences of u8 values is the process of encoding. UTF-8 is the mainstream encoding today (used by Go and Rust), but many others exist, such as UTF-16 (JavaScript) and UTF-32. We can examine the results via Buffer.from('我爱🦀').
Notice that Buffer.from('爱') and Buffer.from('🦀') have different lengths because UTF-8 is variable-width.
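To see all three levels at once, here is a small Rust counterpart to the Buffer.from experiment:

```rust
fn main() {
    let s = "我爱🦀";
    // Code points: three characters.
    assert_eq!(s.chars().count(), 3);
    // UTF-16 code units: 我 and 爱 take one each, 🦀 takes a surrogate pair.
    assert_eq!(s.encode_utf16().count(), 4);
    // UTF-8 bytes: 3 + 3 + 4, because UTF-8 is variable-width.
    assert_eq!(s.len(), 10);
    assert_eq!("爱".len(), 3);
    assert_eq!("🦀".len(), 4);
}
```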
Another pair of related concepts is the character encoding system versus the character encoding scheme. The former maps abstract characters to numerical identifiers (e.g. Unicode code points) while the latter maps code points onto concrete storage representations (bytes).
Representations in Programming Languages
Most languages expose data structures for bytes, characters, and strings. For example, Rust has u8, char, and String, and Go has byte, rune, and string.
The C language is a bit special. The language and its standard library do not define String or Char. What we typically call a C string is a byte array, and a char is effectively an ASCII byte. Encoding and decoding are left to the programmer.
Even though C does not provide a UTF-8 string type, it supports UTF-8 string literals such as const char *string = "我爱🦀";.
Source File Encoding
How does a compiler handle strings in the source file?
```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main() {
    setlocale(LC_ALL, "");
    const char *string = "我爱🦀"; // Is this UTF-8 or UTF-16?
    printf("string: %s\n", string);
    // Print the byte array
    printf("bytes: [");
    size_t slen = strlen(string);
    for (size_t i = 0; i < slen; ++i) {
        if (i > 0) printf(", ");
        printf("%d", (unsigned char)string[i]);
    }
    printf("]\n");
    return 0;
}
```
It is easy to confuse how a source file stores text with how a compiler interprets string literals (their runtime semantics). The two are independent.
Text editors usually let you choose the encoding. Most default to UTF-8 and allow switching to others.
For example, VS Code encodes files as UTF-8 by default, but JavaScript strings are UTF-16. The JavaScript engine reads UTF-8 source text and converts it internally into UTF-16 strings.
Most languages require source text to be UTF-8. If your source file uses another encoding you generally have to convert it yourself or instruct the compiler to do so. GCC and Clang provide -finput-charset for this purpose.
C is a bit different: it does not mandate a specific encoding for string literals. Instead it offers specialized literals for UTF-8 and other encodings:
```c
char     s1[] = "a猫🍌";   // depends on -fexec-charset
char     s2[] = u8"a猫🍌"; // UTF-8 string literal
char16_t s3[] = u"a猫🍌";  // prior to C23 an unspecified 16-bit encoding; since C23, UTF-16
char32_t s4[] = U"a猫🍌";  // prior to C23 an unspecified 32-bit encoding; since C23, UTF-32
```
Two GCC flags highlight the distinction between textual and runtime encodings. -finput-charset tells the compiler how the source file is encoded (e.g. UTF-16) so it can convert it to the internal encoding (often UTF-8). -fexec-charset tells the compiler which encoding to use for string literals at runtime. For instance, if const char* s = "我爱🦀"; is compiled with -fexec-charset=utf16, s will store UTF-16 code units.
Escape Sequences
Converting between UTF-16 and UTF-8 is usually straightforward: decode bytes into a code point sequence and re-encode. (Ignoring performance for the moment.)
```rust
// This is intentionally inefficient; it only illustrates the conversion steps.
fn utf8_to_utf16(utf8_buffer: Vec<u8>) -> Vec<u16> {
    // 1. Decode the UTF-8 bytes into a sequence of code points.
    let codepoints = String::from_utf8(utf8_buffer).expect("invalid UTF-8");
    // 2. Re-encode each code point as UTF-16 code units.
    let utf16_buffer: Vec<u16> = codepoints.encode_utf16().collect();
    utf16_buffer
}
```
Because most strings are language-agnostic, the conversion itself is trivial. The tricky bit is that most languages allow Unicode escape sequences inside string literals, and each language defines them differently.
To repeat the caution: unlike ordinary character representations, Unicode escape sequences are language features and vary greatly across languages. Be very careful when exchanging strings between languages.
Also remember that textual representations (such as 🦀, \u{1F980}, or \uD83E\uDD80) are distinct from runtime semantics. The same code point may have many textual forms.
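The same point in Rust, where both spellings below denote one runtime string:

```rust
fn main() {
    // Two textual forms in source; one identical runtime value.
    assert_eq!("\u{1F980}", "🦀");
}
```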
UTF-8 vs. UTF-16
A JavaScript compiler must frequently convert UTF-8 source text into UTF-16 strings required at runtime.
Surrogate Pairs
Both UTF-8 and UTF-16 are variable-width encodings. Code points beyond U+FFFF require additional storage. UTF-16 represents these via surrogate pairs: a high surrogate in U+D800–U+DBFF followed by a low surrogate in U+DC00–U+DFFF. The code point is recovered as: code_point = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00).
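A quick Rust check of the formula, recovering 🦀 from its surrogate pair:

```rust
fn main() {
    let (high, low): (u32, u32) = (0xD83E, 0xDD80);
    let code_point = 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00);
    assert_eq!(code_point, 0x1F980);
    assert_eq!(char::from_u32(code_point), Some('🦀'));
}
```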
Surrogates must appear in pairs. A solitary high or low surrogate is invalid, and we call such a code unit a lone surrogate.
Unfortunately, while lone surrogates are invalid in Unicode, JavaScript strings allow them:

```js
let str = "\uD800"; // legal JavaScript: a string containing a single lone surrogate
```

Rust, on the other hand, rejects them:

```rust
let s = "\u{D800}"; // compile error: surrogate code points are not allowed in escapes
```
This leads to a direct problem: JavaScript strings cannot be losslessly represented as Rust String. Any API built on such a conversion is unsound: Rust converts unsupported sequences to U+FFFD, so strings that differ in JavaScript may become equal in Rust.
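A minimal demonstration of the lossiness, using std's String::from_utf16_lossy as a stand-in for whatever conversion API sits at the boundary:

```rust
fn main() {
    // Two JavaScript strings that differ: "\uD800" vs. "\uD801".
    let a = String::from_utf16_lossy(&[0xD800]);
    let b = String::from_utf16_lossy(&[0xD801]);
    // Both lone surrogates collapse to U+FFFD, so the Rust strings are equal.
    assert_eq!(a, b);
    assert_eq!(a, "\u{FFFD}");
}
```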
Conclusion: when lone surrogates matter, storing JavaScript strings as Rust String is a bad idea (there are hacky workarounds, but we will skip them here). wasm-bindgen exposes a JsString type for bridging Rust and JavaScript strings and documents the pitfalls: https://wasm-bindgen.github.io/wasm-bindgen/reference/types/str.html#utf-16-vs-utf-8
Context-Sensitive Unicode Escape Sequences
Not every position in JavaScript source treats \u sequences as escapes; different syntactic contexts use different rules. Consider the difference between raw strings and ordinary string literals: many languages support a notion of raw string in which escape sequences such as \u12 are treated as plain text (the four characters \, u, 1, 2).
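A small Rust sketch of the distinction:

```rust
fn main() {
    // Ordinary literal: \u{1F980} is an escape denoting a single code point.
    assert_eq!("\u{1F980}".chars().count(), 1);
    // Raw literal: the same nine characters are taken verbatim.
    assert_eq!(r"\u{1F980}".chars().count(), 9);
}
```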
Rspack Bug Analysis
Emoji Paths Not Supported
This bug is amusing—it is the result of several bugs stacking together. Reproduction: https://github.com/hardfist/rspress-emoji-bug
We found that the original file 🦀.md was transformed into \uD83E\uDD80.md. Investigation showed that swc-loader performed the conversion because it defaulted to jsc.output.charset=ascii, rewriting every non-ASCII UTF-16 string into ASCII plus Unicode escape sequences. This is usually safe, but SWC has a bug when handling Unicode escape sequences: the AST visitor received a TokenValue equal to the escaped form (\uD83E\uDD80), so \uD83E\uDD80.md != 🦀.md, causing path lookups to fail.
Later, web-infra-dev/rspack#11568 added support for jsc.output.charset=utf8, which incidentally fixed the Rspress bug and restored emoji paths. In that configuration the Rust side receives the original 🦀 from the AST visitor, so 🦀 == 🦀 and the path matches.
The key takeaway is that string equality compares underlying code points, not the textual representation.
```js
'🦀' == '\uD83E\uDD80'             // true: two textual forms of the same code point; both have length 2
'\uD83E\uDD80'.length              // 2 code units
[...'\uD83E\uDD80']                // ['🦀']
'\\uD83E\\uDD80' == '\uD83E\uDD80' // false: the first is literal text, the second is an escape sequence
'\\uD83E\\uDD80'.length            // 12
[...'\\uD83E\\uDD80']              // ['\\', 'u', 'D', '8', '3', 'E', '\\', 'u', 'D', 'D', '8', '0']
```
The remaining SWC issues stem from the same root cause—bugs in escape-sequence handling.
How JavaScript Compilers Handle Strings
What a parser does with strings depends on its goals. Because the same code point can have multiple textual forms (e.g. 🦀, \uD83E\uDD80, \u{1f980}), code generators must decide what to emit, and different tools make different choices; SWC, for example, exposes a charset option with ascii and utf8 modes.
Transformers and minifiers do not necessarily need to keep the original spelling, but formatters do. Supporting verbatim output is harder because the parser must store not only code points (TokenValue) but also the original text (TokenText). Raw text is not part of the ESTree specification (see estree/estree#291). ASTs that carry raw text can be seen as "extended ASTs"; concrete syntax trees (CSTs) are another example.
Boa, V8, QuickJS, Esbuild
These engines take similar approaches: since a Rust String cannot hold lone surrogates, they store UTF-16 code units directly, for example as Vec<u16> (code points above U+FFFF occupy two adjacent u16 values). Taking Boa as an example, its lexer stores string literals as Vec<u16> while recording whether escape sequences were present (https://github.com/boa-dev/boa/blob/44de1e64850fdd07881ec83fb998bd6b7f516b65/core/parser/src/lexer/string.rs#L136):
```rust
/// The string interner for Boa.
#[derive(Debug, Default)]
pub struct Interner {
    utf8_interner: RawInterner<u8>,
    utf16_interner: RawInterner<u16>,
}
```
Provide UTF-8 and UTF-16 accessors. Since some interned strings cannot be represented in UTF-8, the UTF-8 accessor returns Option. (An utf8_lossy helper could be added for callers that do not require strict UTF-8.) Interning also improves performance: many tokens are guaranteed ASCII (e.g. keywords), so storing them as Vec<u8> instead of Vec<u16> saves space.
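A minimal sketch of this two-sided storage idea; the names below are illustrative, not Boa's actual API:

```rust
// ASCII-only tokens (keywords, most identifiers) fit in the u8 side;
// everything else is kept as UTF-16 code units.
enum Interned {
    Ascii(Vec<u8>),
    Utf16(Vec<u16>),
}

impl Interned {
    /// UTF-8 view; `None` when the string contains lone surrogates.
    fn as_utf8(&self) -> Option<String> {
        match self {
            Interned::Ascii(bytes) => String::from_utf8(bytes.clone()).ok(),
            Interned::Utf16(units) => String::from_utf16(units).ok(),
        }
    }
}

fn main() {
    let ok = Interned::Utf16("🦀".encode_utf16().collect());
    assert_eq!(ok.as_utf8().as_deref(), Some("🦀"));
    let lone = Interned::Utf16(vec![0xD800]); // lone surrogate
    assert_eq!(lone.as_utf8(), None);
}
```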
TokenKind::StringLiteral stores both the interned value and whether it contained escape sequences.
tsgo
tsgo appears not to support lone surrogates. It stores JavaScript strings as Go strings, so lone surrogates cause issues (microsoft/typescript-go#1701). Interestingly, its printer still works because it emits string literals from slices of the source file (sourceFile[node.start:node.end]), bypassing token values: https://github.com/microsoft/typescript-go/blob/0216862d44c9b14717b7400818cf300f99ec5d1f/internal/scanner/utilities.go#L31
TypeScript
Because the parser is written in JavaScript, it can store TokenValue as plain JavaScript strings, which naturally support all JavaScript string semantics: https://github.com/microsoft/TypeScript/blob/b504a1eed45e35b5f54694a1e0a09f35d0a5663c/src/compiler/scanner.ts#L1707
Biome
Biome behaves differently. During parsing it does not resolve token text; it records only the token range, and token.text is a lazy getter that reads the original text when needed. Biome exposes only token_text, not TokenValue. Consumers parse the escape sequences themselves; treating escapes as raw text avoids the issue entirely.
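A sketch of the range-only approach; the types below are illustrative, not Biome's real ones:

```rust
// The lexer records only byte ranges; text is sliced out on demand.
struct Token {
    start: usize,
    end: usize,
}

impl Token {
    /// Lazily reads the raw text; the lexer never decodes escapes.
    fn text<'a>(&self, source: &'a str) -> &'a str {
        &source[self.start..self.end]
    }
}

fn main() {
    let source = r#"let s = "\uD83E\uDD80";"#;
    let tok = Token { start: 8, end: 22 };
    assert_eq!(tok.text(source), r#""\uD83E\uDD80""#);
}
```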
This raises the question: does the parser even need to compute TokenValue? Could SWC skip it and keep only TokenText?
OXC
OXC takes another path: it stores JavaScript strings in Rust String, but encodes lone surrogates specially. Before converting to Rust String, it rewrites lone surrogates—for example, \uD800 becomes \u{FFFD}d800—to prevent Rust from replacing it with U+FFFD. Consumers must explicitly decode \u{FFFD}d800 back to \uD800. To tell user-supplied \u{FFFD} apart from encoded placeholders, OXC encodes the former as \u{FFFD}fffd. See oxc-project/oxc#10041.
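A sketch of that substitution scheme; this illustrates the idea, not OXC's actual implementation:

```rust
// Encode one UTF-16 code unit into the escaped UTF-8 form:
// lone surrogates and a literal U+FFFD become "\u{FFFD}" + 4 hex digits.
fn encode_unit(unit: u16) -> String {
    match unit {
        0xD800..=0xDFFF | 0xFFFD => format!("\u{FFFD}{:04x}", unit),
        _ => char::from_u32(unit as u32).unwrap().to_string(),
    }
}

fn main() {
    assert_eq!(encode_unit(0xD800), "\u{FFFD}d800"); // lone surrogate
    assert_eq!(encode_unit(0xFFFD), "\u{FFFD}fffd"); // user-written U+FFFD
    assert_eq!(encode_unit(0x7A), "z");              // ordinary unit passes through
}
```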
This approach has drawbacks: AST visitors in plugins must be aware of the encoding scheme and decode strings explicitly, otherwise they observe different values than the original source—an extra burden for plugin authors.
SWC
SWC actually has two related but distinct problems:
- It needlessly escapes strings such as \uD800, changing their semantics.
- It stores JavaScript strings as Rust String, so lone surrogates are unsupported.
Possible Fixes
Because JavaScript and Rust strings have fundamentally different invariants, treating a JavaScript string as String inside Rspack or SWC is risky and hard to debug. Ideally we should wrap JavaScript strings in a dedicated type (say JsString) as Nova does.
- Nova: converts the parser's encoded Rust string into EcmaScript::String, which stores WTF-8 internally.
- wasm-bindgen: converts JavaScript strings into its JsString type right away and keeps them in that form (see wasm-bindgen/wasm-bindgen#1348, "Incorrect handling of unpaired surrogates in JS strings").
WTF-8
WTF-8 can be viewed as a superset of UTF-8. When a string contains only valid Unicode code points, the two encodings produce identical byte sequences. The difference is that UTF-8 refuses to encode invalid code points, whereas WTF-8 encodes them as though they were valid. This makes WTF-8 an excellent ABI-compatible superset of UTF-8.
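To make this concrete, here is a toy generalized encoder; wtf8_encode is a name invented for this sketch:

```rust
// Encode any code point, surrogate or not, with the ordinary UTF-8 bit
// pattern. For U+D800 this yields [0xED, 0xA0, 0x80], which strict UTF-8
// would reject but WTF-8 accepts.
fn wtf8_encode(code_point: u32) -> Vec<u8> {
    match code_point {
        0..=0x7F => vec![code_point as u8],
        0x80..=0x7FF => vec![
            0xC0 | (code_point >> 6) as u8,
            0x80 | (code_point & 0x3F) as u8,
        ],
        0x800..=0xFFFF => vec![
            0xE0 | (code_point >> 12) as u8,
            0x80 | ((code_point >> 6) & 0x3F) as u8,
            0x80 | (code_point & 0x3F) as u8,
        ],
        _ => vec![
            0xF0 | (code_point >> 18) as u8,
            0x80 | ((code_point >> 12) & 0x3F) as u8,
            0x80 | ((code_point >> 6) & 0x3F) as u8,
            0x80 | (code_point & 0x3F) as u8,
        ],
    }
}

fn main() {
    assert_eq!(wtf8_encode(0xD800), [0xED, 0xA0, 0x80]);      // lone surrogate: fine in WTF-8
    assert_eq!(wtf8_encode('🦀' as u32), "🦀".as_bytes());    // valid code points match UTF-8
    assert!(String::from_utf8(wtf8_encode(0xD800)).is_err()); // strict UTF-8 rejects it
}
```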
A Possible Fix
Our ideal fix should meet the following requirements:
- Preserve ABI stability so existing AST serialization stays unchanged.
- Preserve API stability so callers need no changes.
One idea is to back Atom with WTF8Buf:
- Strings without lone surrogates remain identical, preserving the ABI.
- Strings with lone surrogates retain their original code point information losslessly.
We also considered changing as_str to:
- return lossless UTF-8 when no lone surrogates are present;
- replace invalid code points with \u{FFFD} otherwise, matching JavaScript semantics.
That would keep most APIs working unchanged for the common case.
Instead we can add an as_wtf8 accessor. Callers that need WTF-8 (identifiers, string literals, template strings) call as_wtf8 rather than as_str.
Drawbacks
We lose some type safety: consumers must know when to treat an atom as WTF-8 or UTF-8. Currently only string literals, identifiers, and templates need WTF-8. If more cases arise, we can migrate as_str callers gradually.
We could further distinguish the exposed value types in the AST: identifiers, string literals, and templates would return WTF8Atom, while other nodes return Atom. This enforces the distinction at the type level, preventing visitor mistakes. It would be an API-breaking change but not ABI-breaking.
```rust
// Illustrative pseudocode: it sketches the API shape, not compilable as-is.
struct Atom {
    buf: WTF8Buf,
}

struct WTF8Atom {
    buf: WTF8Buf,
}

// Today: every node exposes Atom, so a visitor can trip over WTF-8 content.
impl Visitor for CollectVisitor {
    fn string_literal(node) {
        let token_value = node.value(); // Atom -> WTF-8 buffer
        let chars = token_value.chars().collect(); // panics on a lone surrogate
    }
}

// With the split: string-like nodes return WTF8Atom, other nodes return Atom,
// so misuse is caught at the type level.
impl Visitor for CollectVisitor {
    fn string_literal(node) {
        let token_value = node.value(); // WTF8Atom
        let chars = token_value.chars().collect();
    }

    fn function_decl(node) {
        let token_value = node.value(); // Atom
        let chars = token_value.chars().collect();
    }
}
```
Cross-Language String Handling
napi-rs & napi
At first glance, converting between JavaScript and Rust strings with napi-rs seems easy because napi encapsulates the complexity: the Rust binding sys::napi_get_value_string_utf8 ultimately calls the C API napi_get_value_string_utf8, which automatically replaces lone surrogates.
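For illustration, a minimal hypothetical napi-rs export (the function and its name are invented for this sketch); by the time Rust sees the argument, any lone surrogate has already been replaced:

```rust
use napi_derive::napi;

// Hypothetical binding: `p` is produced by napi's UTF-8 conversion, so a
// JS string like "\uD800.md" reaches Rust as "\u{FFFD}.md"; the original
// code unit is unrecoverable from here.
#[napi]
pub fn file_name_len(p: String) -> u32 {
    p.chars().count() as u32
}
```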
Grapheme Clusters and Unicode Normalization Forms
This topic is tangential to lexers and parsers but still interesting. For a detailed treatment see https://go.dev/blog/normalization.
Token Span
Alongside TokenValue we also have TokenSpan. What does a span measure?
- Byte offset?
- Code unit offset?
- Code point offset?
Unfortunately compilers disagree and there is no standard. (ESTree does not specify the unit for locations: estree/estree#80.) Converting between span formats is costly. Rspack and SWC currently use spans for:
- SWC's own source maps
- rspack-source source maps
- rspack-source edit operations
SWC measures spans in BytePos (byte offsets). Babel, Acorn, and similar tools use code unit offsets. If Webpack were to switch to SWC as its parser, it would have to translate byte offsets into code unit offsets.
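A small sketch of why the unit matters: the same token sits at three different offsets depending on how you count.

```rust
fn main() {
    let source = "🦀 = 1";
    // Byte offset (UTF-8): 🦀 occupies 4 bytes, the space 1, so '=' is at byte 5.
    assert_eq!(source.find('='), Some(5));
    // Code point offset: 🦀 is one code point, so '=' is at index 2.
    assert_eq!(source.chars().position(|c| c == '='), Some(2));
    // UTF-16 code unit offset: 🦀 is two code units, so '=' is at index 3.
    assert_eq!(source.encode_utf16().position(|u| u == '=' as u16), Some(3));
}
```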