
perf(parser): lexer byte handlers consume ASCII chars faster #2046

Merged: 2 commits into oxc-project:main from overlookmotel:lexer-bytes3 on Jan 16, 2024

Conversation

overlookmotel (Collaborator) commented Jan 16, 2024

In the lexer, most `BYTE_HANDLER`s immediately consume the current char with `lexer.consume_char()`.

Byte handlers are only called if there's a certain value (or range of values) for the next char. This is their entire purpose. So in all cases we know for sure that we're not at EOF, and that the next char is a single-byte ASCII character.

The compiler, however, doesn't seem to be able to "see through" the `BYTE_HANDLERS[byte](self)` call and understand these invariants. So it produces very verbose ASM for `lexer.consume_char()`.

This PR replaces `lexer.consume_char()` in the byte handlers with an unsafe `lexer.consume_ascii_char()` which skips on to the next char with a single `inc` instruction.

The difference in codegen can be seen here: https://godbolt.org/z/1ha3cr9W5 (compare the 2 x `core::ops::function::FnOnce::call_once` handlers).

The downside is that this introduces a lot of unsafe blocks, but in my opinion they're all pretty trivial to validate.
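To illustrate the idea, here's a minimal standalone sketch (not the actual oxc implementation; the struct and field names are simplified for illustration):

```rs
use std::str::Chars;

// Simplified stand-in for the real lexer, for illustration only.
struct MiniLexer<'a> {
    chars: Chars<'a>,
}

impl<'a> MiniLexer<'a> {
    /// Consume the next char, which the caller guarantees is ASCII.
    ///
    /// # SAFETY
    /// The remaining source must be non-empty and its first byte must be
    /// ASCII (< 128). Otherwise this is undefined behaviour.
    unsafe fn consume_ascii_char(&mut self) -> char {
        let s = self.chars.as_str();
        // SAFETY: caller guarantees the source is non-empty.
        let byte = unsafe { *s.as_bytes().get_unchecked(0) };
        // SAFETY: caller guarantees the first byte is ASCII, i.e. exactly
        // 1 byte, so offset 1 is a char boundary. Advancing the iterator is
        // then just a pointer increment, with no EOF or UTF-8 length checks.
        self.chars = unsafe { s.get_unchecked(1..) }.chars();
        byte as char
    }
}
```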

github-actions bot added the A-parser (Area - Parser) label on Jan 16, 2024

codspeed-hq bot commented Jan 16, 2024

CodSpeed Performance Report

Merging #2046 will not alter performance

Comparing overlookmotel:lexer-bytes3 (0daa3ef) with main (09c7570)

Summary

✅ 14 untouched benchmarks

Boshen (Member) commented Jan 16, 2024

[screenshot: benchmark results]

Boshen (Member) commented Jan 16, 2024

I shrank the unsafe surface area because all calls are self-contained (within the same file).

Boshen merged commit 66a7a68 into oxc-project:main Jan 16, 2024
17 checks passed
overlookmotel (Collaborator, Author)

Interesting that it benefits TypeScript more than plain JS.

I see from your new commit that you're not keen on all the unsafe!

I totally get that. However, in my opinion, consume_ascii_char() should be an unsafe function. Calling it at EOF or if the next char isn't ASCII would be UB. If it's not an unsafe function, it'd be easy for someone to call it from somewhere else in lexer's codebase without considering the constraints they must uphold.

I did try to come up with another way to do it where the unsafe block is only within read_next_token(), rather than in every handler, but it's hard to find a good solution.

overlookmotel (Collaborator, Author) commented Jan 16, 2024

Best alternative I came up with which moves the burden of satisfying the safety constraint into read_next_token() and removes the unsafe blocks from byte handlers:

```rs
mod with_ascii {
  use super::Lexer;

  pub struct LexerWithNextCharAscii<'a>(&'a mut Lexer<'a>);

  impl<'a> LexerWithNextCharAscii<'a> {
    /// Create a wrapped `Lexer` where next char is guaranteed to be ASCII.
    /// SAFETY: Caller must ensure this is the case.
    pub unsafe fn new(lexer: &'a mut Lexer<'a>) -> Self {
      Self(lexer)
    }

    /// Unwrap to `Lexer<'a>`
    pub fn lexer(self) -> &'a mut Lexer<'a> {
      self.0
    }

    /// Consume next char and unwrap to `Lexer<'a>`
    pub fn consume_char(self) -> (char, &'a mut Lexer<'a>) {
      // SAFETY: Invariant of this type is that lexer's next char is ASCII
      let c = unsafe { self.0.consume_ascii_char() };
      (c, self.0)
    }
  }
}
use with_ascii::LexerWithNextCharAscii;
```

Then the safety invariant is enforced in read_next_token():

```rs
let byte = remaining.as_bytes()[0];
let kind = if byte < 128 {
  // SAFETY: We've just determined next byte is ASCII
  let ascii_lexer = unsafe { LexerWithNextCharAscii::new(self) };
  BYTE_HANDLERS[byte as usize](ascii_lexer)
} else {
  // ...
}
```

`BYTE_HANDLERS` are called with a `LexerWithNextCharAscii`, which allows consuming only one char as ASCII and then returning to a standard `Lexer`. For example:

```rs
type ByteHandler = fn(LexerWithNextCharAscii<'_>) -> Kind;

const QOT: ByteHandler = |ascii_lexer| {
  // Consume next char efficiently as we know it's ASCII
  // No unsafe here!
  let (c, lexer) = ascii_lexer.consume_char();
  // `lexer` is now a normal `Lexer` without the ASCII guarantee
  if lexer.context == LexerContext::JsxAttributeValue {
    lexer.read_jsx_string_literal(c)
  } else {
    lexer.read_string_literal(c)
  }
};
```

This only uses the type system to enforce the constraints, and should have no runtime cost. But still it's a bit bananas...

Personally I think it's cleaner to keep consume_ascii_char() as an unsafe function and have unsafe blocks in the handlers.

PS Yes, I take your point that it's all contained within 1 file. But that file is 1800 lines!

Boshen (Member) commented Jan 16, 2024

Perhaps we should move away from the Chars interface and code it like C 😅

I guess this is why unsafe is frowned upon: now we're nitpicking over where to put the unsafe.

Both of our points are valid.

The rule of thumb is to make the code more approachable for future readers, so feel free to disagree with my changes and make adjustments; I'm happy to merge.

overlookmotel (Collaborator, Author) commented Jan 16, 2024

Thanks for coming back.

The Chars interface does have some advantages, in that it enforces that you can never have the pointer in the middle of a Unicode byte sequence. So that's one less thing to think about for contributors, and one less source of nasty bugs.
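For example, iterating by `char` can never stop in the middle of a multi-byte sequence, whereas a raw byte offset can land inside one (a small illustration, not code from the lexer):

```rs
fn main() {
    let source = "a€b"; // '€' is 3 bytes in UTF-8
    let mut chars = source.chars();
    chars.next(); // consumes 'a' (advances 1 byte)
    chars.next(); // consumes '€' (advances 3 bytes; never stops mid-sequence)
    assert_eq!(chars.as_str(), "b");

    // A raw byte offset of 2, by contrast, falls inside '€':
    assert!(!source.is_char_boundary(2));
}
```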

I do see why you would prefer to avoid unsafe in every byte handler. It is ugly.

My personal approach to unsafe is quite rigid. I like that when you call a safe function, there's a whole bunch of things you don't need to worry about. But obviously that breaks down if you can't trust safe functions to be safe!

I guess it's inevitable that there's going to be some trade-off between performance and readability in a project like this where performance is a focus. It's tough to strike that balance.

Anyway, thanks for being open to disagreement. I'll give it some thought...

Out of interest, how do you feel about the kind of type system-based alternative I outlined above?

Boshen (Member) commented Jan 16, 2024

> Out of interest, how do you feel about the kind of type system-based alternative I outlined above?

It feels like over-abstracting with concepts, which isn't beginner friendly. The idea of trying to keep it simple came from reading the Rome / Biome codebase (and partly swc as well).

overlookmotel (Collaborator, Author)

I think I found a solution which squares the circle. No unsafe, but without sacrificing the performance gain. Will test it out and make a PR.

Boshen pushed a commit that referenced this pull request Jan 17, 2024
As discussed on #2046, it wasn't ideal to have `unsafe {
lexer.consume_ascii_char() }` in every byte handler. It also wasn't
great to have a safe function `consume_ascii_char()` which could cause
UB if called incorrectly (so wasn't really safe at all).

This PR achieves the same objective of #2046, but using a macro to
define byte handlers for ASCII chars, which builds in the assertion that
next char is guaranteed to be ASCII.

Before #2046:

```rs
const SPS: ByteHandler = |lexer| {
  lexer.consume_char();
  Kind::WhiteSpace
};
```

After this PR:

```rs
ascii_byte_handler!(SPS(lexer) {
  lexer.consume_char();
  Kind::WhiteSpace
});
```

i.e. the bodies of the handlers are unchanged from how they were before #2046.

This expands to:

```rs
const SPS: ByteHandler = |lexer| {
  unsafe {
    let s = lexer.current.chars.as_str();
    assert_unchecked!(!s.is_empty());
    assert_unchecked!(s.as_bytes()[0] < 128);
  }
  lexer.consume_char();
  Kind::WhiteSpace
};
```

But due to the assertions the macro inserts, `consume_char()` is now
optimized for ASCII characters, and reduces to a single instruction. So
the `consume_ascii_char()` function introduced by #2046 is unnecessary,
and can be removed again.
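For reference, a macro along these lines could be written roughly like this (a sketch based on the expansion above, not necessarily the exact macro in this PR):

```rs
macro_rules! ascii_byte_handler {
    ($id:ident($lexer:ident) $body:block) => {
        const $id: ByteHandler = |$lexer| {
            // SAFETY: this macro is only used for handlers registered on
            // ASCII byte values, so the remaining source is non-empty and
            // its first byte is < 128.
            unsafe {
                let s = $lexer.current.chars.as_str();
                assert_unchecked!(!s.is_empty());
                assert_unchecked!(s.as_bytes()[0] < 128);
            }
            $body
        };
    };
}
```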

The "boundary of unsafe" is moved to a new function `handle_byte()`
which `read_next_token()` calls. `read_next_token()` is responsible for
upholding the safety invariants, which include ensuring that
`ascii_byte_handler!()` macro is not being misused (that last part is
strictly speaking a bit of a cheat, but...).
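In rough shape, the dispatch looks something like this (a sketch of the structure described above; exact names and signatures may differ from the real code):

```rs
impl<'a> Lexer<'a> {
    fn read_next_token(&mut self) -> Kind {
        // ... trivia skipping and token-start bookkeeping elided ...
        let remaining = self.current.chars.as_str();
        let Some(&byte) = remaining.as_bytes().first() else {
            return Kind::Eof;
        };
        // SAFETY: `byte` was just read from the (non-empty) remaining
        // source, and ASCII handlers are only registered for byte values
        // < 128, so the assertions inserted by `ascii_byte_handler!` hold.
        unsafe { self.handle_byte(byte) }
    }

    /// # SAFETY
    /// Caller must guarantee the remaining source is non-empty and starts
    /// with `byte`.
    unsafe fn handle_byte(&mut self, byte: u8) -> Kind {
        BYTE_HANDLERS[byte as usize](self)
    }
}
```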

I am not a fan of macros, as they're not great for readability. But in
this case I don't think it's *too* bad, because:

1. The macro is well-documented.
2. It's not too clever (only one syntax is accepted).
3. It's used repetitively in a clear pattern, and once you've understood
one, you understand them all.

What do you think? Does this strike a reasonable balance between
readability and safety?
IWANABETHATGUY pushed two commits to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024