Reduce size of Token? #1880

Closed
overlookmotel opened this issue Jan 3, 2024 · 16 comments
Labels
A-parser Area - Parser C-performance Category - Solution not expected to change functional behavior, only performance

Comments

@overlookmotel
Collaborator

overlookmotel commented Jan 3, 2024

The Token struct used in the parser is currently 48 bytes. This could be reduced to 40 bytes with a couple of reasonably small changes, or to 32 bytes with more invasive changes.

Anatomy of a Token

Token is currently structured as:

// 48 bytes
pub struct Token<'a> {
  pub kind: Kind,             // 1 byte
  pub start: u32,             // 4 bytes
  pub end: u32,               // 4 bytes
  pub is_on_new_line: bool,   // 1 byte
  pub escaped: bool,          // 1 byte
  pub value: TokenValue<'a>,  // 32 bytes
  // padding for align 8      // 5 bytes
}

// 32 bytes
pub enum TokenValue<'a> {
  None,                       // 0 bytes
  Number(f64),                // 8 bytes
  BigInt(num_bigint::BigInt), // 32 bytes
  String(&'a str),            // 16 bytes
  RegExp(RegExp<'a>),         // 24 bytes
}

// 24 bytes
pub struct RegExp<'a> {
  pub pattern: &'a str,       // 16 bytes
  pub flags: RegExpFlags,     // 1 byte
}

Reducing the size

To reduce TokenValue to 24 bytes:

1. Box BigInt

BigInt is the largest variant of the TokenValue enum. Replacing BigInt with Box<BigInt> would reduce that variant from 32 bytes to 8 bytes.

This would impose a cost of indirection when reading the value of TokenValue::BigInt, but BigInts are fairly rare in JS code, so for most code this cost would be insignificant.
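For reference, on a 64-bit target (num_bigint's BigInt is Vec-backed, matching the 32 bytes quoted above, while a Box is a single thin pointer):

use std::mem::size_of;

fn main() {
  // BigInt stores a sign plus a Vec of digits inline: 32 bytes on 64-bit.
  assert_eq!(size_of::<num_bigint::BigInt>(), 32);
  // Boxing it leaves only an 8-byte pointer in the enum variant.
  assert_eq!(size_of::<Box<num_bigint::BigInt>>(), 8);
}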

2. Remove RegExp variant

Move RegExpFlags into Token (using one of the 5 bytes currently used for padding).

Then all that remains in RegExp is a &str, which is identical to the String variant. So those 2 variants can be merged.

Result is:

// 40 bytes
pub struct Token<'a> {
  pub kind: Kind,                  // 1 byte
  pub start: u32,                  // 4 bytes
  pub end: u32,                    // 4 bytes
  pub is_on_new_line: bool,        // 1 byte
  pub escaped: bool,               // 1 byte
  pub regexp_flags: RegExpFlags,   // 1 byte
  pub value: TokenValue<'a>,       // 24 bytes
  // padding for align 8           // 4 bytes
}

// 24 bytes
pub enum TokenValue<'a> {
  // <discriminant>                // 8 bytes
  None,                            // 0 bytes
  Number(f64),                     // 8 bytes
  BigInt(Box<num_bigint::BigInt>), // 8 bytes
  String(&'a str),                 // 16 bytes
}

All the variants are now 16 bytes or less. The reason a discriminant is now required, where it wasn't before, is that the largest variant, &'a str, has only one niche, which is insufficient to contain the discriminant. In the current layout, BigInt has multiple niches, so Rust "smuggles" the discriminant into those niches.
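This is easy to verify (sizes assume a 64-bit target and the current rustc layout; Box<i64> stands in for Box<num_bigint::BigInt> just to keep the snippet self-contained):

use std::mem::size_of;

#[allow(dead_code)]
enum TokenValue<'a> {
  None,
  Number(f64),
  BigInt(Box<i64>),   // stand-in for Box<num_bigint::BigInt>
  String(&'a str),
}

fn main() {
  // &'a str has only the null-pointer niche: one spare value can't encode
  // three extra variants, so a tag word is added: 8 (tag) + 16 (&str) = 24.
  assert_eq!(size_of::<TokenValue<'static>>(), 24);
  // With a single extra variant the niche is enough and no tag is needed.
  assert_eq!(size_of::<Option<&str>>(), 16);
}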

Reducing TokenValue to 16 bytes

3. Remove the discriminant

The discriminant is not required, as it duplicates information already present in Token.kind, e.g. a Token with kind Ident always contains a string in value, whereas kind Number is associated with an f64 in value.

So TokenValue could change from an enum to a union with no discriminant, saving a further 8 bytes.

TokenValue::as_number etc would become methods on Token, which read the kind field and return the requested value only if kind matches.

The difficulty is in producing a safe interface which doesn't allow setting incompatible kind and value pairs. Should be doable, but would require a lot of changes to the lexer.
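A rough sketch of the union approach (names are illustrative, not a final design; a real implementation would also need to exclude bigint kinds from as_number and keep kind and value in sync):

use std::mem::ManuallyDrop;

// Sketch only: Token.kind becomes the single source of truth for which field is live.
union TokenValue<'a> {
  none: (),
  number: f64,
  // unions can't hold types that need dropping, hence ManuallyDrop
  big_int: ManuallyDrop<Box<num_bigint::BigInt>>,
  string: &'a str,
}

impl<'a> Token<'a> {
  pub fn as_number(&self) -> Option<f64> {
    // Safety: the lexer only ever writes `value.number` for number kinds.
    self.kind.is_number().then(|| unsafe { self.value.number })
  }
}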

Alternative + further optimization

Once the new Atom is implemented (#1803), it will be 16 bytes and contain multiple niches, so if String(&'a str) is replaced with String(Atom<'a>), that will also reduce TokenValue to 16 bytes without having to convert it to a union.

However, there's also an inefficiency around the use of TokenValue.

The parser is often doing things like this:

fn parse_literal_expression(&mut self) -> Result<Expression<'a>> {
  match self.cur_kind() {
    kind if kind.is_number() => {
      if self.cur_src().ends_with('n') {
        self.parse_literal_bigint().map(|literal| self.ast.literal_bigint_expression(literal))
      } else {
        self.parse_literal_number().map(|literal| self.ast.literal_number_expression(literal))
      }
    },
    // ...
  }
}
fn parse_literal_number(&mut self) -> Result<NumberLiteral<'a>> {
  let value = self.cur_token().value.as_number();
  // ...
}

parse_literal_expression branches on kind. Then in parse_literal_number, by calling value.as_number() it branches again on TokenValue's discriminant.

This is unnecessary - we already know value is a number, or we wouldn't be in parse_literal_number.

I'd imagine correct branch prediction removes most of the cost of the 2nd branch, but this is presumably a hot path.

Again, the difficulty is in finding a safe interface to express this.
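For illustration only, one possible (unsafe, not yet safe) shape for such an accessor, assuming the enum TokenValue above:

impl<'a> Token<'a> {
  /// Caller must have already checked `kind.is_number()`.
  pub unsafe fn number_unchecked(&self) -> f64 {
    debug_assert!(self.kind.is_number());
    match &self.value {
      TokenValue::Number(n) => *n,
      // the lexer guarantees that number kinds always carry Number values
      _ => unsafe { std::hint::unreachable_unchecked() },
    }
  }
}

A genuinely safe interface would need a stronger guarantee than a debug_assert, which is exactly the difficulty mentioned above.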

Questions

Is any of this worthwhile? Would reducing the size of Token likely make a measurable impact on performance?

If so, how much work is it worth? Take the easy-ish gain of getting it down to 40 bytes, or go the whole hog to reach 32?

@overlookmotel overlookmotel changed the title Reduce size of Token Reduce size of Token? Jan 3, 2024
@Boshen
Member

Boshen commented Jan 3, 2024

Love this investigation! I made various attempts to reduce the token size when I finished the lexer.

Is any of this worthwhile? Would reducing the size of Token likely make a measurable impact on performance?

Reducing the token size will bring significant performance improvements: less memory + better CPU cache utilization. Previous attempts are #32 and #151.

If so, how much work is it worth? Take the easy-ish gain of getting it down to 40 bytes, or go the whole hog to reach 32?

We have a very easy performance measurement setup with codspeed, just make PRs and see their impacts.


To get started, let's try boxing the bigint and regex variants, then merge in the Atom change.

To get aggressive, I also had the idea of making everything SoA (from the DoD talk https://vimeo.com/649009599), but unfortunately I don't have the time to experiment with such big changes.


I'll give you write access to the repo if you wish to take Oxc's performance to another level.

@Boshen Boshen added A-parser Area - Parser C-performance Category - Solution not expected to change functional behavior, only performance labels Jan 3, 2024
@overlookmotel
Collaborator Author

Thanks for your enthusiasm!

After I submitted this issue, I rethought and concluded it was probably a waste of time, and reducing the size of Token would make no significant difference.

I'm new to OXC's codebase, but from what I can see, the design means that only the tiny Kind is passed around to functions, and there's only a single Token in existence at any time, which doesn't move (well except for LexerCheckpoints when rewinding may be necessary, but presumably that's fairly uncommon).

So I'd be interested to know why you think it might have a significant impact.

But, like you say, the best way to find out is to do it and look at the benchmarks. I'd like to finish Atom first, so I can measure what difference that makes in isolation, without the knock-on effect of it reducing the size of Token, but will hopefully be able to try this out after.

Data-oriented design is not something I'd encountered before. Interesting. I'll watch that video when I get time.

PS Thanks for the offer of write access. Really kind of you to have such faith! However, I have sadly little time for coding, so I think you might find my commitment lacking. Also, I'm pretty new to Rust, so I think it's better if others check my work. So, if you don't mind, I'd prefer to continue just tinkering and submitting PRs.

@overlookmotel
Collaborator Author

Ah, I see now. next_token() does return a Token, and I guess that'll get called a lot, so reducing Token's size could make a difference.

Anyway, I'll try and make the changes above and see what the benchmarks say.

@Boshen
Member

Boshen commented Jan 5, 2024

I'm nerd sniped, rewatching the DoD video ... wish me luck this weekend.

@Boshen
Member

Boshen commented Jan 5, 2024


Our token size would be 12 bytes if we remove the token value entirely.
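Roughly, a value-free token would look like this (sketch, not a final layout):

// 12 bytes (align 4)
pub struct Token {
  pub kind: Kind,           // 1 byte
  pub is_on_new_line: bool, // 1 byte
  pub escaped: bool,        // 1 byte
  // padding                // 1 byte
  pub start: u32,           // 4 bytes
  pub end: u32,             // 4 bytes
}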

Differences from the Zig parser:

  • lazy token values: this is going to be slightly slower for TypeScript's parser rewind, but those situations are rare.
  • getting the unescaped keywords, identifiers and strings: we can utilize the escaped value and derive things lazily. This is going to be a hard one.

Making all the changes is going to be challenging, I'll plan them out.

@overlookmotel
Collaborator Author

overlookmotel commented Jan 6, 2024

I watched the Andrew Kelley video too. Fascinating stuff.

Something else occurred to me on this subject:

As I understand it, the flow between parser and lexer goes as follows:

  • Parser asks lexer for next token.
  • Lexer gets next token.
  • In some circumstances, lexer has to look ahead to disambiguate e.g. /* vs /=. But in the end, it makes a decision and ends up with a single token stored in lexer.current.token.
  • Lexer returns that token to the parser, and resets lexer.current.token, ready for next time.

My thinking is:

At any point in time in the parser, there's only a single token in play. So couldn't Lexer::next_token() just return Kind (1 byte), and then the parser gets whatever further info it needs about the token from lexer.current.token?

Lexer wouldn't reset lexer.current.token until start of the next call to next_token().

Result is that Tokens don't get passed around. The current token is always in the same place in memory, and it'll hopefully sit continuously in L1 cache. In that case, the size of Token is less important, as long as it's under 64 bytes.
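Sketched out (names here are purely illustrative, not Oxc's actual API):

impl<'a> Lexer<'a> {
  pub fn next_kind(&mut self) -> Kind {
    // overwrite the single token slot in place; no Token is returned by value
    self.current.token = self.read_next_token();
    self.current.token.kind
  }
}

impl<'a> Parser<'a> {
  pub fn cur_token(&self) -> &Token<'a> {
    // the full token stays put in the lexer; only the 1-byte Kind moves around
    &self.lexer.current.token
  }
}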

I guess there are exceptions to this rule, e.g. when the parser receives the token ( as the start of an expression. Is it a parenthesised expression or an arrow function? In such cases, it may need to retrieve many more tokens from the lexer before it can decide.

But I'd imagine that'd be the minority of cases. So maybe better to handle storing a backlog of ambiguous tokens in the parser as a special case, and preserve the advantage of "only one token in play at a time" in the general case.

I wonder if that might achieve some of the gains of a full-on DoD approach, without the massive overhaul of the codebase that it sounds like you're contemplating?

Does this make any sense? I'm not that familiar with OXC's parser yet, so I could be missing something really obvious.

@Boshen
Member

Boshen commented Jan 6, 2024

At any point in time in the parser, there's only a single token in play.

This is untrue:

lookahead: VecDeque::with_capacity(4), // 4 is the maximum lookahead for TypeScript

So couldn't Lexer::next_token() just return Kind

Theoretically yes, but parsing JavaScript is ridiculous, e.g.:

ASI:

/// [Automatic Semicolon Insertion](https://tc39.es/ecma262/#sec-automatic-semicolon-insertion)
/// # Errors
pub(crate) fn asi(&mut self) -> Result<()> {
    if !self.can_insert_semicolon() {
        let span = Span::new(self.prev_token_end, self.cur_token().start);
        return Err(diagnostics::AutoSemicolonInsertion(span).into());
    }
    if self.at(Kind::Semicolon) {
        self.advance(Kind::Semicolon);
    }
    Ok(())
}

pub(crate) fn can_insert_semicolon(&self) -> bool {
    let kind = self.cur_kind();
    if kind == Kind::Semicolon {
        return true;
    }
    kind == Kind::RCurly || kind.is_eof() || self.cur_token().is_on_new_line
}

escaped keyword:

/// `StringValue` of `IdentifierName` normalizes any Unicode escape sequences
/// in `IdentifierName` hence such escapes cannot be used to write an Identifier
/// whose code point sequence is the same as a `ReservedWord`.
#[inline]
fn test_escaped_keyword(&mut self, kind: Kind) {
    if self.cur_token().escaped && kind.is_all_keyword() {
        let span = self.cur_token().span();
        self.error(diagnostics::EscapedKeyword(span));
    }
}


Fun stuff:

std::mem::take(&mut self.current.token)

drop gets called on the token (because the token value's bigint has a drop). Removing the token value will also remove this drop, which is pretty nice.

@overlookmotel
Collaborator Author

I'm heading to bed so just a brief-ish reply on 2 points:

This is untrue:
lookahead: VecDeque::with_capacity(4),

Ah yes, but that's in the lexer. My point was that the lexer only passes a single token back to the parser, and then that's the only token "in play" until the next call to next_token() (probably with some exceptions which could be special-cased).

Well, at least that's if I understand how the parser works, which quite possibly I don't!

drop gets called on the token (because the token value's bigint has a drop)

Yes, I hadn't considered that. Removing the drop on Token and TokenValue would be a big improvement.

But... there's currently a memory leak due to BigInts in ASTs not getting dropped. Personally, I can't see a good way to solve that without replacing BigInt with an ArenaBigInt<'a> which stores its internal data in the arena (similar to replacing CompactString with Atom<'a>). If we go that way, the drop would disappear.

The other solution (which you suggested above) is to defer converting a slice of code to a BigInt until later, so TokenValue just contains a &str to represent a bigint. That'd also make TokenValue non-drop.

By the way, I wasn't trying to critique your planned approach. Just wanted to raise another option in case it's useful.

@overlookmotel
Collaborator Author

overlookmotel commented Jan 6, 2024

I woke up today, and suddenly the plan you outlined before clicked for me. It's genius!

Why optimize a structure, when you can remove it entirely?

I had missed the point before. If parsing a string/bigint/number/regex from a chunk of source code can be deferred until the AST node is being created, then the only info required to do that is kind, start, end, and a few flags.
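i.e. roughly this (cur_src() as shown earlier; parse_number is a hypothetical helper that would have to handle hex/octal/binary literals and numeric separators):

fn parse_literal_number(&mut self) -> Result<NumberLiteral<'a>> {
  // no TokenValue lookup: the raw text is just the token's span of source code
  let raw = self.cur_src();
  let value = parse_number(raw)?;
  // ... build the NumberLiteral from `value`, `raw` and the token's span
}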

As a side effect, that also removes the extraneous lookups of TokenValue's discriminant which I mentioned at top of this issue. That's another gain. Very nice indeed.

Apologies I completely missed the point before.

I do still wonder why it's necessary to copy Token to parser.token when Parser::cur_token() could just get it from parser.lexer.current.token. But that optimization is insignificant compared to what you're proposing.

Boshen added a commit that referenced this issue Jan 7, 2024
This PR is part of #1880.

`Token` size is reduced from 48 to 40 bytes.

To reconstruct the regex pattern and flags within the parser, the regex string is
re-parsed from the end by reading all valid flags.

In order to make things work nicely, the lexer will no longer recover
from an invalid regex.
Boshen added a commit that referenced this issue Jan 8, 2024
This PR partially fixes #1803 and is part of #1880.

BigInt is removed from the `Token` value, so that the token size can be
reduced once we have removed all the variants.

`Token` is now also `Copy`, which removes all the `clone` and `drop`
calls.

This yields 5% performance improvement for the parser.
Boshen added a commit that referenced this issue Jan 8, 2024
This PR is part of #1880.

Token size is reduced from 40 to 32 bytes.
@Boshen
Member

Boshen commented Jan 8, 2024

We gain around 5 - 8% improvements from these 3 PRs, I'm now staring at TokenValue::String 😭

@IWANABETHATGUY
Contributor

We don't even need a TokenValue type; just using Option<&str> to represent TokenValue::String is enough:
https://github.com/oxc-project/oxc/pull/1945/files#diff-1900da6e4d59908339755be451de7da17e7f4b3f62d9c412dbecd800dd2d6510R39-R42

@Boshen
Member

Boshen commented Jan 8, 2024

We gain around 5 - 8% improvements from these 3 PRs, I'm now staring at TokenValue::String 😭

We don't even need a TokenValue type; just using Option<&str> to represent TokenValue::String is enough: https://github.com/oxc-project/oxc/pull/1945/files#diff-1900da6e4d59908339755be451de7da17e7f4b3f62d9c412dbecd800dd2d6510R39-R42

The goal is to remove it from Token; Option<&str> is the same thing, so it doesn't change anything.

Boshen added a commit that referenced this issue Jan 9, 2024
Part of #1880

`Token` size is reduced from 32 to 16 bytes by changing the previous
token value `Option<&'a str>` to a u32 index handle.

It would be nice if this handle were eliminated entirely, because
the normal case for a string is always `&source_text[token.span.start..token.span.end]`.

Unfortunately, JavaScript allows escaped characters to appear in
identifiers, strings and templates. These strings need to be unescaped
for equality checks, i.e. `"\a" === "a"`.

This leads us to adding an `escaped_strings` vec for storing these unescaped and allocated
strings.

Performance regression for adding this vec should be minimal because escaped strings are rare.

Background Reading:

* https://floooh.github.io/2018/06/17/handles-vs-pointers.html
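(A minimal sketch of the handle approach described in the commit above; field and method names are illustrative, and the actual code in #1962 may differ.)

// 16 bytes
pub struct Token {
  pub kind: Kind,
  pub start: u32,
  pub end: u32,
  pub is_on_new_line: bool,
  pub escaped: bool,
  // 0 = no unescaped copy needed; otherwise (index + 1) into Lexer::escaped_strings
  escaped_string_id: u32,
}

impl<'a> Lexer<'a> {
  fn token_str(&self, token: &Token) -> &'a str {
    if token.escaped_string_id == 0 {
      // common case: the string is just the token's slice of the source text
      &self.source_text[token.start as usize..token.end as usize]
    } else {
      // rare case: escapes were already unescaped into an arena-allocated copy
      self.escaped_strings[(token.escaped_string_id - 1) as usize]
    }
  }
}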
@Boshen
Member

Boshen commented Jan 9, 2024

Down to 16 bytes in #1962. I think we've strived for a balance between code complexity and performance. Shall we resolve this issue?

@overlookmotel
Collaborator Author

Well, this issue originally proposed reducing TokenValue down to 24 or 16 bytes, and you've got it down to 4. So I'd say this issue has been resolved!

The diff on #1962 is quite large, and I won't have time to go through it properly for a while.

So yes, feel free to close this issue, and if I can find an opportunity for further optimization later on, I'll open another one.

What was the grand total of the performance boost from all this work in the end? +15%?

@Boshen
Member

Boshen commented Jan 10, 2024

@Boshen Boshen closed this as completed Jan 10, 2024
@Boshen
Member

Boshen commented Jan 13, 2024

Down to 12 bytes in https://github.com/oxc-project/oxc/pull/2010/files
