Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(parser): macro for ASCII byte handlers #2066

Merged
merged 1 commit into from
Jan 17, 2024

Conversation

overlookmotel
Copy link
Collaborator

@overlookmotel overlookmotel commented Jan 17, 2024

As discussed on #2046, it wasn't ideal to have unsafe { lexer.consume_ascii_char() } in every byte handler. It also wasn't great to have a safe function consume_ascii_char() which could cause UB if called incorrectly (so wasn't really safe at all).

This PR achieves the same objective of #2046, but using a macro to define byte handlers for ASCII chars, which builds in the assertion that next char is guaranteed to be ASCII.

Before #2046:

const SPS: ByteHandler = |lexer| {
  lexer.consume_char();
  Kind::WhiteSpace
};

After this PR:

ascii_byte_handler!(SPS(lexer) {
  lexer.consume_char();
  Kind::WhiteSpace
});

i.e. The body of the handlers are unchanged from how they were before #2046.

This expands to:

const SPS: ByteHandler = |lexer| {
  unsafe {
    let s = lexer.current.chars.as_str();
    assert_unchecked!(!s.is_empty());
    assert_unchecked!(s.as_bytes()[0] < 128);
  }
  lexer.consume_char();
  Kind::WhiteSpace
};

But due to the assertions the macro inserts, consume_char() is now optimized for ASCII characters, and reduces to a single instruction. So the consume_ascii_char() function introduced by #2046 is unnecessary, and can be removed again.

The "boundary of unsafe" is moved to a new function handle_byte() which read_next_token() calls. read_next_token() is responsible for upholding the safety invariants, which include ensuring that ascii_byte_handler!() macro is not being misused (that last part is strictly speaking a bit of a cheat, but...).

I am not a fan of macros, as they're not great for readability. But in this case I don't think it's too bad, because:

  1. The macro is well-documented.
  2. It's not too clever (only one syntax is accepted).
  3. It's used repetitively in a clear pattern, and once you've understood one, you understand them all.

What do you think? Does this strike a reasonable balance between readability and safety?

@github-actions github-actions bot added the A-parser Area - Parser label Jan 17, 2024
@overlookmotel
Copy link
Collaborator Author

NB: This approach is also slightly better in terms of performance. It produces a small speed-up (~0.3%) as the assertions now cover more cases, so compiler can optimize a few other bits of code.

Copy link

codspeed-hq bot commented Jan 17, 2024

CodSpeed Performance Report

Merging #2066 will not alter performance

Comparing overlookmotel:lexer-bytes-unsafe2 (0a4b4c3) with main (18a58d4)

Summary

✅ 14 untouched benchmarks

@overlookmotel
Copy link
Collaborator Author

I also tried another approach, replacing BYTE_HANDLERS with a big match byte { ... } and allowing the compiler to create the jump table itself.

But for reasons I don't understand, this didn't work at all. It did result in consume_char() being optimized as I'd hoped, but something else about this approach was inefficient, and parser benchmarks dropped about 5%:

https://codspeed.io/overlookmotel/oxc/branches/lexer-bytes6

@Boshen
Copy link
Member

Boshen commented Jan 17, 2024

Neat!


FYI the code used to be match char, with slower performance of course :-)

@Boshen Boshen merged commit 8d5f5b8 into oxc-project:main Jan 17, 2024
17 checks passed
@Boshen
Copy link
Member

Boshen commented Jan 17, 2024

Here's the original source: https://github.com/ratel-rust/ratel-core/blob/e55a1310ba69a3f5ce2a9a6eef643feced02ac08/ratel/src/lexer/mod.rs#L60

You may find other optimization techniques in there, but I didn't delve deeper.

@overlookmotel
Copy link
Collaborator Author

Great! Glad we found a compromise where we're both happy with the readability/safety balance.

match char would of course be worse performing, as it involves checking against up to 4 bytes. But I am surprised that match byte doesn't perform equivalently to the BYTE_HANDLERS lookup table. From testing with a reduced example in Godbolt, it looks like the compiler just builds an almost-identical lookup table itself. In fact I had thought it might be more performant as it might cluster the handler functions together in memory, or do something else clever. But no!

Is the maintainer of Ratel involved in OXC at all? Presumably he chose the BYTE_HANDLERS approach for a reason, and he might have an answer to this question.

@overlookmotel overlookmotel deleted the lexer-bytes-unsafe2 branch January 17, 2024 13:39
@Boshen
Copy link
Member

Boshen commented Jan 18, 2024

Is the maintainer of Ratel involved in OXC at all? Presumably he chose the BYTE_HANDLERS approach for a reason, and he might have an answer to this question.

Ratel predates oxc by 8 years, rslint 4 years and biome 3 years. oxc is only 1 year old in the open. The later project copied some coded from the predecessor. (FYI I never looked at swc's code while implementing oxc)

Presumably he chose the BYTE_HANDLERS approach for a reason, and he might have an answer to this question.

Unfortunately he stopped working in public by the looks of his github profile.

But he has many really good crates that you can learn from, e.g. the beef crate.

I think he sup

@overlookmotel
Copy link
Collaborator Author

overlookmotel commented Jan 18, 2024

Thanks for the notes. I've been looking at Ratel, and there may be some other optimizations we can port across, though they do make the code a bit more C-style. beef looks great, I'll check out his other crates.

IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024
As discussed on oxc-project#2046, it wasn't ideal to have `unsafe {
lexer.consume_ascii_char() }` in every byte handler. It also wasn't
great to have a safe function `consume_ascii_char()` which could cause
UB if called incorrectly (so wasn't really safe at all).

This PR achieves the same objective of oxc-project#2046, but using a macro to
define byte handlers for ASCII chars, which builds in the assertion that
next char is guaranteed to be ASCII.

Before oxc-project#2046:

```rs
const SPS: ByteHandler = |lexer| {
  lexer.consume_char();
  Kind::WhiteSpace
};
```

After this PR:

```rs
ascii_byte_handler!(SPS(lexer) {
  lexer.consume_char();
  Kind::WhiteSpace
});
```

i.e. The body of the handlers are unchanged from how they were before
oxc-project#2046.

This expands to:

```rs
const SPS: ByteHandler = |lexer| {
  unsafe {
    let s = lexer.current.chars.as_str();
    assert_unchecked!(!s.is_empty());
    assert_unchecked!(s.as_bytes()[0] < 128);
  }
  lexer.consume_char();
  Kind::WhiteSpace
};
```

But due to the assertions the macro inserts, `consume_char()` is now
optimized for ASCII characters, and reduces to a single instruction. So
the `consume_ascii_char()` function introduced by oxc-project#2046 is unnecessary,
and can be removed again.

The "boundary of unsafe" is moved to a new function `handle_byte()`
which `read_next_token()` calls. `read_next_token()` is responsible for
upholding the safety invariants, which include ensuring that
`ascii_byte_handler!()` macro is not being misused (that last part is
strictly speaking a bit of a cheat, but...).

I am not a fan of macros, as they're not great for readability. But in
this case I don't think it's *too* bad, because:

1. The macro is well-documented.
2. It's not too clever (only one syntax is accepted).
3. It's used repetitively in a clear pattern, and once you've understood
one, you understand them all.

What do you think? Does this strike a reasonable balance between
readability and safety?
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-parser Area - Parser
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants