Reduce size of Token? #1880

Closed
overlookmotel opened this issue Jan 3, 2024 · 16 comments
Labels
A-parser Area - Parser C-performance Category - Solution not expected to change functional behavior, only performance

Comments

@overlookmotel
Collaborator

overlookmotel commented Jan 3, 2024

The Token struct used in the parser is currently 48 bytes. This could be reduced to 40 bytes with a couple of reasonably small changes, or to 32 bytes with more invasive changes.

Anatomy of a Token

Token is currently structured as:

// 48 bytes
pub struct Token<'a> {
  pub kind: Kind,             // 1 byte
  pub start: u32,             // 4 bytes
  pub end: u32,               // 4 bytes
  pub is_on_new_line: bool,   // 1 byte
  pub escaped: bool,          // 1 byte
  pub value: TokenValue<'a>,  // 32 bytes
  // padding for align 8      // 5 bytes
}

// 32 bytes
pub enum TokenValue<'a> {
  None,                       // 0 bytes
  Number(f64),                // 8 bytes
  BigInt(num_bigint::BigInt), // 32 bytes
  String(&'a str),            // 16 bytes
  RegExp(RegExp<'a>),         // 24 bytes
}

// 24 bytes
pub struct RegExp<'a> {
  pub pattern: &'a str,       // 16 bytes
  pub flags: RegExpFlags,     // 1 byte
}

Reducing the size

To reduce TokenValue to 24 bytes:

1. Box BigInt

BigInt is the largest variant of the TokenValue enum. Replacing BigInt with Box<BigInt> would reduce that variant from 32 bytes to 8 bytes.

This would impose a cost of indirection when reading the value of TokenValue::BigInt, but BigInts are fairly rare in JS code, so for most code this cost would be insignificant.
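For reference, on a 64-bit target (num_bigint's BigInt is Vec-backed, matching the 32 bytes quoted above, while a Box is a single thin pointer):

use std::mem::size_of;

fn main() {
  // BigInt stores a sign plus a Vec of digits inline: 32 bytes on 64-bit.
  assert_eq!(size_of::<num_bigint::BigInt>(), 32);
  // Boxing it leaves only an 8-byte pointer in the enum variant.
  assert_eq!(size_of::<Box<num_bigint::BigInt>>(), 8);
}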

2. Remove RegExp variant

Move RegExpFlags into Token (using one of the 5 bytes currently used for padding).

Then all that remains in RegExp is a &str, which is identical to the String variant. So those 2 variants can be merged.

Result is:

// 40 bytes
pub struct Token<'a> {
  pub kind: Kind,                  // 1 byte
  pub start: u32,                  // 4 bytes
  pub end: u32,                    // 4 bytes
  pub is_on_new_line: bool,        // 1 byte
  pub escaped: bool,               // 1 byte
  pub regexp_flags: RegExpFlags,   // 1 byte
  pub value: TokenValue<'a>,       // 24 bytes
  // padding for align 8           // 4 bytes
}

// 24 bytes
pub enum TokenValue<'a> {
  // <discriminant>                // 8 bytes
  None,                            // 0 bytes
  Number(f64),                     // 8 bytes
  BigInt(Box<num_bigint::BigInt>), // 8 bytes
  String(&'a str),                 // 16 bytes
}

All the variants are now 16 bytes or less. The reason a discriminant is now required, where it wasn't before, is that the largest variant, &'a str, has only one niche, which is insufficient to contain the discriminant. In the current layout, BigInt has multiple niches, so Rust "smuggles" the discriminant into those niches.
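This is easy to verify (sizes assume a 64-bit target and the current rustc layout; Box<i64> stands in for Box<num_bigint::BigInt> just to keep the snippet self-contained):

use std::mem::size_of;

#[allow(dead_code)]
enum TokenValue<'a> {
  None,
  Number(f64),
  BigInt(Box<i64>),   // stand-in for Box<num_bigint::BigInt>
  String(&'a str),
}

fn main() {
  // &'a str has only the null-pointer niche: one spare value can't encode
  // three extra variants, so a tag word is added: 8 (tag) + 16 (&str) = 24.
  assert_eq!(size_of::<TokenValue<'static>>(), 24);
  // With a single extra variant the niche is enough and no tag is needed.
  assert_eq!(size_of::<Option<&str>>(), 16);
}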

Reducing TokenValue to 16 bytes

3. Remove the discriminant

The discriminant is not required, as it duplicates information already present in Token.kind, e.g. a Token with kind Ident always contains a string in value, whereas kind Number is associated with an f64 in value.

So TokenValue could change from an enum to a union with no discriminant, saving a further 8 bytes.

TokenValue::as_number etc would become methods on Token, which read the kind field and return the requested value only if kind matches.

The difficulty is in producing a safe interface which doesn't allow setting incompatible kind and value pairs. Should be doable, but would require a lot of changes to the lexer.
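A rough sketch of the union approach (names are illustrative, not a final design; a real implementation would also need to exclude bigint kinds from as_number and keep kind and value in sync):

use std::mem::ManuallyDrop;

// Sketch only: Token.kind becomes the single source of truth for which field is live.
union TokenValue<'a> {
  none: (),
  number: f64,
  // unions can't hold types that need dropping, hence ManuallyDrop
  big_int: ManuallyDrop<Box<num_bigint::BigInt>>,
  string: &'a str,
}

impl<'a> Token<'a> {
  pub fn as_number(&self) -> Option<f64> {
    // Safety: the lexer only ever writes `value.number` for number kinds.
    self.kind.is_number().then(|| unsafe { self.value.number })
  }
}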

Alternative + further optimization

Once the new Atom is implemented (#1803), it will be 16 bytes and contain multiple niches, so if String(&'a str) is replaced with String(Atom<'a>), that will also reduce TokenValue to 16 bytes without having to convert it to a union.

However, there's also an inefficiency around the use of TokenValue.

The parser is often doing things like this:

fn parse_literal_expression(&mut self) -> Result<Expression<'a>> {
  match self.cur_kind() {
    kind if kind.is_number() => {
      if self.cur_src().ends_with('n') {
        self.parse_literal_bigint().map(|literal| self.ast.literal_bigint_expression(literal))
      } else {
        self.parse_literal_number().map(|literal| self.ast.literal_number_expression(literal))
      }
    },
    // ...
  }
}
fn parse_literal_number(&mut self) -> Result<NumberLiteral<'a>> {
  let value = self.cur_token().value.as_number();
  // ...
}

parse_literal_expression branches on kind. Then in parse_literal_number, by calling value.as_number() it branches again on TokenValue's discriminant.

This is unnecessary - we already know value is a number, or we wouldn't be in parse_literal_number.

I'd imagine correct branch prediction removes most of the cost of the 2nd branch, but this is presumably a hot path.

Again, the difficulty is in finding a safe interface to express this.
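For illustration only, one possible (unsafe, not yet safe) shape for such an accessor, assuming the enum TokenValue above:

impl<'a> Token<'a> {
  /// Caller must have already checked `kind.is_number()`.
  pub unsafe fn number_unchecked(&self) -> f64 {
    debug_assert!(self.kind.is_number());
    match &self.value {
      TokenValue::Number(n) => *n,
      // the lexer guarantees that number kinds always carry Number values
      _ => unsafe { std::hint::unreachable_unchecked() },
    }
  }
}

A genuinely safe interface would need a stronger guarantee than a debug_assert, which is exactly the difficulty mentioned above.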

Questions

Is any of this worthwhile? Would reducing the size of Token likely make a measurable impact on performance?

If so, how much work is it worth? Take the easy-ish gain of getting it down to 40 bytes, or go the whole hog to reach 32?

@overlookmotel overlookmotel changed the title Reduce size of Token Reduce size of Token? Jan 3, 2024
@Boshen
Member

Boshen commented Jan 3, 2024

Love this investigation! I made various attempts to reduce the token size when I finished the lexer.

Is any of this worthwhile? Would reducing the size of Token likely make a measurable impact on performance?

Reducing the token size will bring significant performance improvements: less memory + better CPU cache utilization. Previous attempts are #32 and #151.

If so, how much work is it worth? Take the easy-ish gain of getting it down to 40 bytes, or go the whole hog to reach 32?

We have a very easy performance measurement setup with codspeed, just make PRs and see their impacts.


To get started, let's try boxing the bigint and regex variants, then merge in the Atom change.

To get aggressive, I also had the idea of making everything SoA (from the DoD talk https://vimeo.com/649009599), but unfortunately I don't have the time to experiment with such big changes.


I'll give you write access to the repo if you wish to take Oxc's performance to another level.

@Boshen Boshen added A-parser Area - Parser C-performance Category - Solution not expected to change functional behavior, only performance labels Jan 3, 2024
@overlookmotel
Collaborator Author

Thanks for your enthusiasm!

After I submitted this issue, I rethought and concluded it was probably a waste of time, and reducing the size of Token would make no significant difference.

I'm new to OXC's codebase, but from what I can see, the design means that only the tiny Kind is passed around to functions, and there's only a single Token in existence at any time, which doesn't move (well except for LexerCheckpoints when rewinding may be necessary, but presumably that's fairly uncommon).

So I'd be interested to know why you think it might have a significant impact.

But, like you say, the best way to find out is to do it and look at the benchmarks. I'd like to finish Atom first, so I can measure what difference that makes in isolation, without the knock-on effect of it reducing the size of Token, but will hopefully be able to try this out after.

Data-oriented design is not something I'd encountered before. Interesting. I'll watch that video when I get time.

PS Thanks for the offer of write access. Really kind of you to have such faith! However, I have sadly little time for coding, so I think you might find my commitment lacking. Also, I'm pretty new to Rust, so I think it's better if others check my work. So, if you don't mind, I'd prefer to continue just tinkering and submitting PRs.

@overlookmotel
Collaborator Author

Ah, I see now. next_token() does return a Token, and I guess that'll get called a lot, so reducing Token's size could make a difference.

Anyway, I'll try and make the changes above and see what the benchmarks say.

@Boshen
Member

Boshen commented Jan 5, 2024

I'm nerd sniped, rewatching the DoD video ... wish me luck this weekend.

@Boshen
Member

Boshen commented Jan 5, 2024


Our token size would be 12 bytes if we remove the token value entirely.
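Roughly, a value-free token would look like this (sketch, not a final layout):

// 12 bytes (align 4)
pub struct Token {
  pub kind: Kind,           // 1 byte
  pub is_on_new_line: bool, // 1 byte
  pub escaped: bool,        // 1 byte
  // padding                // 1 byte
  pub start: u32,           // 4 bytes
  pub end: u32,             // 4 bytes
}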

Differences from the Zig parser:

  • lazy token values: this is going to be slightly slower for TypeScript's parser rewind, but those situations are rare.
  • getting the unescaped keywords, identifiers and strings: we can utilize the escaped value and derive things lazily. This is going to be a hard one.

Making all the changes is going to be challenging, I'll plan them out.

@overlookmotel
Collaborator Author

overlookmotel commented Jan 6, 2024

I watched the Andrew Kelley video too. Fascinating stuff.

Something else occurred to me on this subject:

As I understand it, the flow between parser and lexer goes as follows:

  • Parser asks lexer for next token.
  • Lexer gets next token.
  • In some circumstances, lexer has to look ahead to disambiguate e.g. /* vs /=. But in the end, it makes a decision and ends up with a single token stored in lexer.current.token.
  • Lexer returns that token to the parser, and resets lexer.current.token, ready for next time.

My thinking is:

At any point in time in the parser, there's only a single token in play. So couldn't Lexer::next_token() just return Kind (1 byte), and then the parser gets whatever further info it needs about the token from lexer.current.token?

Lexer wouldn't reset lexer.current.token until start of the next call to next_token().

Result is that Tokens don't get passed around. The current token is always in the same place in memory, and it'll hopefully sit continuously in L1 cache. In that case, the size of Token is less important, as long as it's under 64 bytes.
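Sketched out (names here are purely illustrative, not Oxc's actual API):

impl<'a> Lexer<'a> {
  pub fn next_kind(&mut self) -> Kind {
    // overwrite the single token slot in place; no Token is returned by value
    self.current.token = self.read_next_token();
    self.current.token.kind
  }
}

impl<'a> Parser<'a> {
  pub fn cur_token(&self) -> &Token<'a> {
    // the full token stays put in the lexer; only the 1-byte Kind moves around
    &self.lexer.current.token
  }
}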

I guess there are exceptions to this rule, e.g. when the parser receives the token ( as the start of an expression. Is it a parenthesised expression or an arrow function? In such cases, it may need to retrieve many more tokens from the lexer before it can decide.

But I'd imagine that'd be the minority of cases. So maybe better to handle storing a backlog of ambiguous tokens in the parser as a special case, and preserve the advantage of "only one token in play at a time" in the general case.

I wonder if that might achieve some of the gains of a full-on DoD approach, without the massive overhaul of the codebase that it sounds like you're contemplating?

Does this make any sense? I'm not that familiar with OXC's parser yet, so I could be missing something really obvious.

@Boshen
Member

Boshen commented Jan 6, 2024

At any point in time in the parser, there's only a single token in play.

This is untrue:

lookahead: VecDeque::with_capacity(4), // 4 is the maximum lookahead for TypeScript

So couldn't Lexer::next_token() just return Kind

Theoretically yes, but parsing JavaScript is ridiculous, e.g.:

ASI:

/// [Automatic Semicolon Insertion](https://tc39.es/ecma262/#sec-automatic-semicolon-insertion)
/// # Errors
pub(crate) fn asi(&mut self) -> Result<()> {
    if !self.can_insert_semicolon() {
        let span = Span::new(self.prev_token_end, self.cur_token().start);
        return Err(diagnostics::AutoSemicolonInsertion(span).into());
    }
    if self.at(Kind::Semicolon) {
        self.advance(Kind::Semicolon);
    }
    Ok(())
}

pub(crate) fn can_insert_semicolon(&self) -> bool {
    let kind = self.cur_kind();
    if kind == Kind::Semicolon {
        return true;
    }
    kind == Kind::RCurly || kind.is_eof() || self.cur_token().is_on_new_line
}

escaped keyword:

/// `StringValue` of `IdentifierName` normalizes any Unicode escape sequences
/// in `IdentifierName` hence such escapes cannot be used to write an Identifier
/// whose code point sequence is the same as a `ReservedWord`.
#[inline]
fn test_escaped_keyword(&mut self, kind: Kind) {
    if self.cur_token().escaped && kind.is_all_keyword() {
        let span = self.cur_token().span();
        self.error(diagnostics::EscapedKeyword(span));
    }
}


Fun stuff:

std::mem::take(&mut self.current.token)

drop gets called on the token (because the token value's bigint has a drop). Removing the token value will also remove this drop, which is pretty nice.

@overlookmotel
Collaborator Author

I'm heading to bed so just a brief-ish reply on 2 points:

This is untrue:
lookahead: VecDeque::with_capacity(4),

Ah yes, but that's in the lexer. My point was that the lexer only passes a single token back to the parser, and then that's the only token "in play" until the next call to next_token() (probably with some exceptions which could be special-cased).

Well, at least that's if I understand how the parser works, which quite possibly I don't!

drop gets called on the token (because the token value's bigint has a drop)

Yes, I hadn't considered that. Removing the drop on Token and TokenValue would be a big improvement.

But... there's currently a memory leak due to BigInts in ASTs not getting dropped. Personally, I can't see a good way to solve that without replacing BigInt with an ArenaBigInt<'a> which stores its internal data in the arena (similar to replacing CompactString with Atom<'a>). If we go that way, the drop would disappear.

The other solution (which you suggested above) is to defer converting a slice of code to a BigInt until later, so TokenValue just contains a &str to represent a bigint. That'd also make TokenValue non-drop.

By the way, I wasn't trying to critique your planned approach. Just wanted to raise another option in case it's useful.

@overlookmotel
Collaborator Author

overlookmotel commented Jan 6, 2024

I woke up today, and suddenly the plan you outlined before clicked for me. It's genius!

Why optimize a structure, when you can remove it entirely?

I had missed the point before. If parsing a string/bigint/number/regex from a chunk of source code can be deferred until the AST node is being created, then the only info required to do that is kind, start, end, and a few flags.
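i.e. roughly this (cur_src() as shown earlier; parse_number is a hypothetical helper that would have to handle hex/octal/binary literals and numeric separators):

fn parse_literal_number(&mut self) -> Result<NumberLiteral<'a>> {
  // no TokenValue lookup: the raw text is just the token's span of source code
  let raw = self.cur_src();
  let value = parse_number(raw)?;
  // ... build the NumberLiteral from `value`, `raw` and the token's span
}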

As a side effect, that also removes the extraneous lookups of TokenValue's discriminant which I mentioned at top of this issue. That's another gain. Very nice indeed.

Apologies I completely missed the point before.

I do still wonder why it's necessary to copy Token to parser.token when Parser::cur_token() could just get it from parser.lexer.current.token. But that optimization is insignificant compared to what you're proposing.

Boshen added a commit that referenced this issue Jan 7, 2024
This PR is part of #1880.

`Token` size is reduced from 48 to 40 bytes.

To reconstruct the regex pattern and flags within the parser, the regex string is
re-parsed from the end by reading all valid flags.

In order to make things work nicely, the lexer will no longer recover
from an invalid regex.
Boshen added a commit that referenced this issue Jan 8, 2024
This PR partially fixes #1803 and is part of #1880.

BigInt is removed from the `Token` value, so that the token size can be
reduced once we have removed all the variants.

`Token` is now also `Copy`, which removes all the `clone` and `drop`
calls.

This yields 5% performance improvement for the parser.
Boshen added a commit that referenced this issue Jan 8, 2024
This PR is part of #1880.

Token size is reduced from 40 to 32 bytes.
@Boshen
Member

Boshen commented Jan 8, 2024

We gain around 5 - 8% improvements from these 3 PRs, I'm now staring at TokenValue::String 😭

@IWANABETHATGUY
Contributor

We don't even need a TokenValue type; just using Option<&str> to represent TokenValue::String is enough:
https://github.com/oxc-project/oxc/pull/1945/files#diff-1900da6e4d59908339755be451de7da17e7f4b3f62d9c412dbecd800dd2d6510R39-R42

@Boshen
Member

Boshen commented Jan 8, 2024

We gain around 5 - 8% improvements from these 3 PRs, I'm now staring at TokenValue::String 😭

We don't even need a TokenValue type; just using Option<&str> to represent TokenValue::String is enough: https://github.com/oxc-project/oxc/pull/1945/files#diff-1900da6e4d59908339755be451de7da17e7f4b3f62d9c412dbecd800dd2d6510R39-R42

The goal is to remove it from Token; Option<&str> is the same thing, so it doesn't change anything.

Boshen added a commit that referenced this issue Jan 9, 2024
Part of #1880

`Token` size is reduced from 32 to 16 bytes by changing the previous
token value `Option<&'a str>` to a u32 index handle.

It would be nice if this handle were eliminated entirely, because
the normal case for a string is always `&source_text[token.span.start..token.span.end]`.

Unfortunately, JavaScript allows escaped characters to appear in
identifiers, strings and templates. These strings need to be unescaped
for equality checks, i.e. `"\a" === "a"`.

This leads us to adding an `escaped_strings` vec for storing these unescaped and allocated
strings.

Performance regression for adding this vec should be minimal because escaped strings are rare.

Background Reading:

* https://floooh.github.io/2018/06/17/handles-vs-pointers.html
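(A minimal sketch of the handle approach described in the commit above; field and method names are illustrative, and the actual code in #1962 may differ.)

// 16 bytes
pub struct Token {
  pub kind: Kind,
  pub start: u32,
  pub end: u32,
  pub is_on_new_line: bool,
  pub escaped: bool,
  // 0 = no unescaped copy needed; otherwise (index + 1) into Lexer::escaped_strings
  escaped_string_id: u32,
}

impl<'a> Lexer<'a> {
  fn token_str(&self, token: &Token) -> &'a str {
    if token.escaped_string_id == 0 {
      // common case: the string is just the token's slice of the source text
      &self.source_text[token.start as usize..token.end as usize]
    } else {
      // rare case: escapes were already unescaped into an arena-allocated copy
      self.escaped_strings[(token.escaped_string_id - 1) as usize]
    }
  }
}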
@Boshen
Member

Boshen commented Jan 9, 2024

Down to 16 bytes in #1962. I think we've strived for a balance between code complexity and performance. Shall we resolve this issue?

@overlookmotel
Collaborator Author

Well, this issue originally proposed reducing TokenValue down to 24 or 16 bytes, and you've got it down to 4. So I'd say this issue has been resolved!

The diff on #1962 is quite large, and I won't have time to go through it properly for a while.

So yes, feel free to close this issue, and if I can find an opportunity for further optimization later on, I'll open another one.

What was the grand total of the performance boost from all this work in the end? +15%?

@Boshen
Member

Boshen commented Jan 10, 2024

@Boshen Boshen closed this as completed Jan 10, 2024
@Boshen
Member

Boshen commented Jan 13, 2024

Down to 12 bytes in https://github.com/oxc-project/oxc/pull/2010/files
