
perf(parser): lexer byte handlers consume ASCII chars faster #2046

Merged: 2 commits into oxc-project:main from overlookmotel:lexer-bytes3 on Jan 16, 2024

Conversation

overlookmotel (Collaborator) commented Jan 16, 2024

In the lexer, most `BYTE_HANDLER`s immediately consume the current char with `lexer.consume_char()`.

Byte handlers are only called if there's a certain value (or range of values) for the next char. This is their entire purpose. So in all cases we know for sure that we're not at EOF, and that the next char is a single-byte ASCII character.

The compiler, however, doesn't seem to be able to "see through" the `BYTE_HANDLERS[byte](self)` call and understand these invariants. So it produces very verbose ASM for `lexer.consume_char()`.

This PR replaces `lexer.consume_char()` in the byte handlers with an unsafe `lexer.consume_ascii_char()` which skips on to the next char with a single `inc` instruction.

The difference in codegen can be seen here: https://godbolt.org/z/1ha3cr9W5 (compare the 2 x `core::ops::function::FnOnce::call_once` handlers).

The downside is that this introduces a lot of unsafe blocks, but in my opinion they're all pretty trivial to validate.
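To illustrate the idea, here's a minimal standalone sketch (not the actual oxc implementation; the struct and field names are simplified for illustration):

```rs
use std::str::Chars;

// Simplified stand-in for the real lexer, for illustration only.
struct MiniLexer<'a> {
    chars: Chars<'a>,
}

impl<'a> MiniLexer<'a> {
    /// Consume the next char, which the caller guarantees is ASCII.
    ///
    /// # SAFETY
    /// The remaining source must be non-empty and its first byte must be
    /// ASCII (< 128). Otherwise this is undefined behaviour.
    unsafe fn consume_ascii_char(&mut self) -> char {
        let s = self.chars.as_str();
        // SAFETY: caller guarantees the source is non-empty.
        let byte = unsafe { *s.as_bytes().get_unchecked(0) };
        // SAFETY: caller guarantees the first byte is ASCII, i.e. exactly
        // 1 byte, so offset 1 is a char boundary. Advancing the iterator is
        // then just a pointer increment, with no EOF or UTF-8 length checks.
        self.chars = unsafe { s.get_unchecked(1..) }.chars();
        byte as char
    }
}
```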

github-actions bot added the A-parser (Area - Parser) label on Jan 16, 2024

codspeed-hq bot commented Jan 16, 2024

CodSpeed Performance Report

Merging #2046 will not alter performance

Comparing overlookmotel:lexer-bytes3 (0daa3ef) with main (09c7570)

Summary

✅ 14 untouched benchmarks

Boshen (Member) commented Jan 16, 2024

[screenshot: benchmark results]

Boshen (Member) commented Jan 16, 2024

I shrank the unsafe surface area because all calls are self-contained (within the same file).

Boshen merged commit 66a7a68 into oxc-project:main Jan 16, 2024
17 checks passed
overlookmotel (Collaborator, Author)

Interesting that it benefits TypeScript more than plain JS.

I see from your new commit that you're not keen on all the unsafe!

I totally get that. However, in my opinion, consume_ascii_char() should be an unsafe function. Calling it at EOF or if the next char isn't ASCII would be UB. If it's not an unsafe function, it'd be easy for someone to call it from somewhere else in lexer's codebase without considering the constraints they must uphold.

I did try to come up with another way to do it where the unsafe block is only within read_next_token(), rather than in every handler, but it's hard to find a good solution.

overlookmotel (Collaborator, Author) commented Jan 16, 2024

Best alternative I came up with which moves the burden of satisfying the safety constraint into read_next_token() and removes the unsafe blocks from byte handlers:

```rs
mod with_ascii {
  use super::Lexer;

  pub struct LexerWithNextCharAscii<'a>(&'a mut Lexer<'a>);

  impl<'a> LexerWithNextCharAscii<'a> {
    /// Create a wrapped `Lexer` where next char is guaranteed to be ASCII.
    /// SAFETY: Caller must ensure this is the case.
    pub unsafe fn new(lexer: &'a mut Lexer<'a>) -> Self {
      Self(lexer)
    }

    /// Unwrap to `Lexer<'a>`
    pub fn lexer(self) -> &'a mut Lexer<'a> {
      self.0
    }

    /// Consume next char and unwrap to `Lexer<'a>`
    pub fn consume_char(self) -> (char, &'a mut Lexer<'a>) {
      // SAFETY: Invariant of this type is that lexer's next char is ASCII
      let c = unsafe { self.0.consume_ascii_char() };
      (c, self.0)
    }
  }
}
use with_ascii::LexerWithNextCharAscii;
```

Then the safety invariant is enforced in read_next_token():

```rs
let byte = remaining.as_bytes()[0];
let kind = if byte < 128 {
  // SAFETY: We've just determined next byte is ASCII
  let ascii_lexer = unsafe { LexerWithNextCharAscii::new(self) };
  BYTE_HANDLERS[byte as usize](ascii_lexer)
} else {
  // ...
}
```

`BYTE_HANDLERS` are called with a `LexerWithNextCharAscii`, which allows consuming only one char as ASCII and then returning to a standard `Lexer`. For example:

```rs
type ByteHandler = fn(LexerWithNextCharAscii<'_>) -> Kind;

const QOT: ByteHandler = |ascii_lexer| {
  // Consume next char efficiently as we know it's ASCII
  // No unsafe here!
  let (c, lexer) = ascii_lexer.consume_char();
  // `lexer` is now a normal `Lexer` without the ASCII guarantee
  if lexer.context == LexerContext::JsxAttributeValue {
    lexer.read_jsx_string_literal(c)
  } else {
    lexer.read_string_literal(c)
  }
};
```

This only uses the type system to enforce the constraints, and should have no runtime cost. But still it's a bit bananas...

Personally I think it's cleaner to keep consume_ascii_char() as an unsafe function and have unsafe blocks in the handlers.

PS Yes, I take your point that it's all contained within 1 file. But that file is 1800 lines!

Boshen (Member) commented Jan 16, 2024

Perhaps we should move away from the Chars interface and code it like C 😅

I guess this is why unsafe is frowned upon: now we're nitpicking over where to put the unsafe.

Both of our points are valid.

The rule of thumb is to make the code more approachable for future readers, so feel free to disagree with my changes and make adjustments; I'm happy to merge.

overlookmotel (Collaborator, Author) commented Jan 16, 2024

Thanks for coming back.

The Chars interface does have some advantages, in that it enforces that you can never have the pointer in the middle of a Unicode byte sequence. So that's one less thing to think about for contributors, and one less source of nasty bugs.
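For example, iterating by `char` can never stop in the middle of a multi-byte sequence, whereas a raw byte offset can land inside one (a small illustration, not code from the lexer):

```rs
fn main() {
    let source = "a€b"; // '€' is 3 bytes in UTF-8
    let mut chars = source.chars();
    chars.next(); // consumes 'a' (advances 1 byte)
    chars.next(); // consumes '€' (advances 3 bytes; never stops mid-sequence)
    assert_eq!(chars.as_str(), "b");

    // A raw byte offset of 2, by contrast, falls inside '€':
    assert!(!source.is_char_boundary(2));
}
```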

I do see why you would prefer to avoid unsafe in every byte handler. It is ugly.

My personal approach to unsafe is quite rigid. I like that when you call a safe function, there's a whole bunch of things you don't need to worry about. But obviously that breaks down if you can't trust safe functions to be safe!

I guess it's inevitable that there's going to be some trade-off between performance and readability in a project like this where performance is a focus. It's tough to strike that balance.

Anyway, thanks for being open to disagreement. I'll give it some thought...

Out of interest, how do you feel about the kind of type system-based alternative I outlined above?

Boshen (Member) commented Jan 16, 2024

> Out of interest, how do you feel about the kind of type system-based alternative I outlined above?

It feels like over-abstracting with concepts, which isn't beginner friendly. The idea of trying to keep it simple came from reading the Rome / Biome codebase (and partly swc as well).

overlookmotel (Collaborator, Author)

I think I found a solution which squares the circle. No unsafe, but without sacrificing the performance gain. Will test it out and make a PR.

Boshen pushed a commit that referenced this pull request Jan 17, 2024
As discussed on #2046, it wasn't ideal to have `unsafe {
lexer.consume_ascii_char() }` in every byte handler. It also wasn't
great to have a safe function `consume_ascii_char()` which could cause
UB if called incorrectly (so wasn't really safe at all).

This PR achieves the same objective of #2046, but using a macro to
define byte handlers for ASCII chars, which builds in the assertion that
next char is guaranteed to be ASCII.

Before #2046:

```rs
const SPS: ByteHandler = |lexer| {
  lexer.consume_char();
  Kind::WhiteSpace
};
```

After this PR:

```rs
ascii_byte_handler!(SPS(lexer) {
  lexer.consume_char();
  Kind::WhiteSpace
});
```

i.e. the bodies of the handlers are unchanged from how they were before #2046.

This expands to:

```rs
const SPS: ByteHandler = |lexer| {
  unsafe {
    let s = lexer.current.chars.as_str();
    assert_unchecked!(!s.is_empty());
    assert_unchecked!(s.as_bytes()[0] < 128);
  }
  lexer.consume_char();
  Kind::WhiteSpace
};
```

But due to the assertions the macro inserts, `consume_char()` is now
optimized for ASCII characters, and reduces to a single instruction. So
the `consume_ascii_char()` function introduced by #2046 is unnecessary,
and can be removed again.
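For reference, a macro along these lines could be written roughly like this (a sketch based on the expansion above, not necessarily the exact macro in this PR):

```rs
macro_rules! ascii_byte_handler {
    ($id:ident($lexer:ident) $body:block) => {
        const $id: ByteHandler = |$lexer| {
            // SAFETY: this macro is only used for handlers registered on
            // ASCII byte values, so the remaining source is non-empty and
            // its first byte is < 128.
            unsafe {
                let s = $lexer.current.chars.as_str();
                assert_unchecked!(!s.is_empty());
                assert_unchecked!(s.as_bytes()[0] < 128);
            }
            $body
        };
    };
}
```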

The "boundary of unsafe" is moved to a new function `handle_byte()`
which `read_next_token()` calls. `read_next_token()` is responsible for
upholding the safety invariants, which include ensuring that
`ascii_byte_handler!()` macro is not being misused (that last part is
strictly speaking a bit of a cheat, but...).
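In rough shape, the dispatch looks something like this (a sketch of the structure described above; exact names and signatures may differ from the real code):

```rs
impl<'a> Lexer<'a> {
    fn read_next_token(&mut self) -> Kind {
        // ... trivia skipping and token-start bookkeeping elided ...
        let remaining = self.current.chars.as_str();
        let Some(&byte) = remaining.as_bytes().first() else {
            return Kind::Eof;
        };
        // SAFETY: `byte` was just read from the (non-empty) remaining
        // source, and ASCII handlers are only registered for byte values
        // < 128, so the assertions inserted by `ascii_byte_handler!` hold.
        unsafe { self.handle_byte(byte) }
    }

    /// # SAFETY
    /// Caller must guarantee the remaining source is non-empty and starts
    /// with `byte`.
    unsafe fn handle_byte(&mut self, byte: u8) -> Kind {
        BYTE_HANDLERS[byte as usize](self)
    }
}
```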

I am not a fan of macros, as they're not great for readability. But in
this case I don't think it's *too* bad, because:

1. The macro is well-documented.
2. It's not too clever (only one syntax is accepted).
3. It's used repetitively in a clear pattern, and once you've understood
one, you understand them all.

What do you think? Does this strike a reasonable balance between
readability and safety?
IWANABETHATGUY pushed two commits to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024