refactor(parser): macro for ASCII byte handlers #2066

overlookmotel · 2024-01-17T04:42:59Z

As discussed on #2046, it wasn't ideal to have unsafe { lexer.consume_ascii_char() } in every byte handler. It also wasn't great to have a safe function consume_ascii_char() which could cause UB if called incorrectly (so wasn't really safe at all).

This PR achieves the same objective of #2046, but using a macro to define byte handlers for ASCII chars, which builds in the assertion that next char is guaranteed to be ASCII.

Before #2046:

const SPS: ByteHandler = |lexer| {
  lexer.consume_char();
  Kind::WhiteSpace
};

After this PR:

ascii_byte_handler!(SPS(lexer) {
  lexer.consume_char();
  Kind::WhiteSpace
});

i.e. The body of the handlers are unchanged from how they were before #2046.

This expands to:

const SPS: ByteHandler = |lexer| {
  unsafe {
    let s = lexer.current.chars.as_str();
    assert_unchecked!(!s.is_empty());
    assert_unchecked!(s.as_bytes()[0] < 128);
  }
  lexer.consume_char();
  Kind::WhiteSpace
};

But due to the assertions the macro inserts, consume_char() is now optimized for ASCII characters, and reduces to a single instruction. So the consume_ascii_char() function introduced by #2046 is unnecessary, and can be removed again.

The "boundary of unsafe" is moved to a new function handle_byte() which read_next_token() calls. read_next_token() is responsible for upholding the safety invariants, which include ensuring that ascii_byte_handler!() macro is not being misused (that last part is strictly speaking a bit of a cheat, but...).

I am not a fan of macros, as they're not great for readability. But in this case I don't think it's too bad, because:

The macro is well-documented.
It's not too clever (only one syntax is accepted).
It's used repetitively in a clear pattern, and once you've understood one, you understand them all.

What do you think? Does this strike a reasonable balance between readability and safety?

overlookmotel · 2024-01-17T04:45:50Z

NB: This approach is also slightly better in terms of performance. It produces a small speed-up (~0.3%) as the assertions now cover more cases, so compiler can optimize a few other bits of code.

codspeed-hq · 2024-01-17T04:54:30Z

CodSpeed Performance Report

Merging #2066 will not alter performance

_{Comparing overlookmotel:lexer-bytes-unsafe2 (0a4b4c3) with main (18a58d4)}

Summary

✅ 14 untouched benchmarks

overlookmotel · 2024-01-17T04:54:46Z

I also tried another approach, replacing BYTE_HANDLERS with a big match byte { ... } and allowing the compiler to create the jump table itself.

But for reasons I don't understand, this didn't work at all. It did result in consume_char() being optimized as I'd hoped, but something else about this approach was inefficient, and parser benchmarks dropped about 5%:

https://codspeed.io/overlookmotel/oxc/branches/lexer-bytes6

Boshen · 2024-01-17T07:28:30Z

Neat!

FYI the code used to be match char, with slower performance of course :-)

Boshen · 2024-01-17T07:31:45Z

Here's the original source: https://github.com/ratel-rust/ratel-core/blob/e55a1310ba69a3f5ce2a9a6eef643feced02ac08/ratel/src/lexer/mod.rs#L60

You may find other optimization techniques in there, but I didn't delve deeper.

overlookmotel · 2024-01-17T13:38:42Z

Great! Glad we found a compromise where we're both happy with the readability/safety balance.

match char would of course be worse performing, as it involves checking against up to 4 bytes. But I am surprised that match byte doesn't perform equivalently to the BYTE_HANDLERS lookup table. From testing with a reduced example in Godbolt, it looks like the compiler just builds an almost-identical lookup table itself. In fact I had thought it might be more performant as it might cluster the handler functions together in memory, or do something else clever. But no!

Is the maintainer of Ratel involved in OXC at all? Presumably he chose the BYTE_HANDLERS approach for a reason, and he might have an answer to this question.

Boshen · 2024-01-18T13:08:28Z

Is the maintainer of Ratel involved in OXC at all? Presumably he chose the BYTE_HANDLERS approach for a reason, and he might have an answer to this question.

Ratel predates oxc by 8 years, rslint 4 years and biome 3 years. oxc is only 1 year old in the open. The later project copied some coded from the predecessor. (FYI I never looked at swc's code while implementing oxc)

Presumably he chose the BYTE_HANDLERS approach for a reason, and he might have an answer to this question.

Unfortunately he stopped working in public by the looks of his github profile.

But he has many really good crates that you can learn from, e.g. the beef crate.

I think he sup

overlookmotel · 2024-01-18T17:02:48Z

Thanks for the notes. I've been looking at Ratel, and there may be some other optimizations we can port across, though they do make the code a bit more C-style. beef looks great, I'll check out his other crates.

As discussed on oxc-project#2046, it wasn't ideal to have `unsafe { lexer.consume_ascii_char() }` in every byte handler. It also wasn't great to have a safe function `consume_ascii_char()` which could cause UB if called incorrectly (so wasn't really safe at all). This PR achieves the same objective of oxc-project#2046, but using a macro to define byte handlers for ASCII chars, which builds in the assertion that next char is guaranteed to be ASCII. Before oxc-project#2046: ```rs const SPS: ByteHandler = |lexer| { lexer.consume_char(); Kind::WhiteSpace }; ``` After this PR: ```rs ascii_byte_handler!(SPS(lexer) { lexer.consume_char(); Kind::WhiteSpace }); ``` i.e. The body of the handlers are unchanged from how they were before oxc-project#2046. This expands to: ```rs const SPS: ByteHandler = |lexer| { unsafe { let s = lexer.current.chars.as_str(); assert_unchecked!(!s.is_empty()); assert_unchecked!(s.as_bytes()[0] < 128); } lexer.consume_char(); Kind::WhiteSpace }; ``` But due to the assertions the macro inserts, `consume_char()` is now optimized for ASCII characters, and reduces to a single instruction. So the `consume_ascii_char()` function introduced by oxc-project#2046 is unnecessary, and can be removed again. The "boundary of unsafe" is moved to a new function `handle_byte()` which `read_next_token()` calls. `read_next_token()` is responsible for upholding the safety invariants, which include ensuring that `ascii_byte_handler!()` macro is not being misused (that last part is strictly speaking a bit of a cheat, but...). I am not a fan of macros, as they're not great for readability. But in this case I don't think it's *too* bad, because: 1. The macro is well-documented. 2. It's not too clever (only one syntax is accepted). 3. It's used repetitively in a clear pattern, and once you've understood one, you understand them all. What do you think? Does this strike a reasonable balance between readability and safety?

refactor(parser): macro for ASCII byte handlers

0a4b4c3

github-actions bot added the A-parser Area - Parser label Jan 17, 2024

Boshen approved these changes Jan 17, 2024

View reviewed changes

Boshen merged commit 8d5f5b8 into oxc-project:main Jan 17, 2024
17 checks passed

overlookmotel deleted the lexer-bytes-unsafe2 branch January 17, 2024 13:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(parser): macro for ASCII byte handlers #2066

refactor(parser): macro for ASCII byte handlers #2066

overlookmotel commented Jan 17, 2024 •

edited

Loading

overlookmotel commented Jan 17, 2024

codspeed-hq bot commented Jan 17, 2024

overlookmotel commented Jan 17, 2024

Boshen commented Jan 17, 2024

Boshen commented Jan 17, 2024 •

edited

Loading

overlookmotel commented Jan 17, 2024

Boshen commented Jan 18, 2024

overlookmotel commented Jan 18, 2024 •

edited

Loading

refactor(parser): macro for ASCII byte handlers #2066

refactor(parser): macro for ASCII byte handlers #2066

Conversation

overlookmotel commented Jan 17, 2024 • edited Loading

overlookmotel commented Jan 17, 2024

codspeed-hq bot commented Jan 17, 2024

CodSpeed Performance Report

Merging #2066 will not alter performance

Summary

overlookmotel commented Jan 17, 2024

Boshen commented Jan 17, 2024

Boshen commented Jan 17, 2024 • edited Loading

overlookmotel commented Jan 17, 2024

Boshen commented Jan 18, 2024

overlookmotel commented Jan 18, 2024 • edited Loading

overlookmotel commented Jan 17, 2024 •

edited

Loading

Boshen commented Jan 17, 2024 •

edited

Loading

overlookmotel commented Jan 18, 2024 •

edited

Loading