-
-
Notifications
You must be signed in to change notification settings - Fork 408
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
perf(parser): lexer byte handlers consume ASCII chars faster #2046
Conversation
CodSpeed Performance ReportMerging #2046 will not alter performanceComparing Summary
|
I shrank the |
Interesting that it benefits TypeScript more than plain JS. I see from your new commit that you're not keen on all the I totally get that. However, in my opinion, I did try to come up with another way to do it where the |
Best alternative I came up with which moves the burden of satisfying the safety constraint into mod with_ascii {
use super::Lexer;
pub struct LexerWithNextCharAscii<'a>(&'a mut Lexer<'a>);
impl<'a> LexerWithNextCharAscii<'a> {
/// Create a wrapped `Lexer` where next char is guaranteed to be ASCII.
/// SAFETY: Caller must ensure this is the case.
pub unsafe fn new(lexer: &'a mut Lexer<'a>) -> Self {
Self(lexer)
}
/// Unwrap to `Lexer<'a>`
pub fn lexer(self) -> &'a mut Lexer<'a> {
self.0
}
/// Consume next char and unwrap to `Lexer<'a>`
pub fn consume_char(self) -> (char, &'a mut Lexer<'a>) {
// SAFETY: Invariant of this type is that lexer's next char is ASCII
let c = unsafe { self.0.consume_ascii_char() };
(c, self.0)
}
}
}
use with_ascii::LexerWithNextCharAscii; Then the safety invariant is enforced in let byte = remaining.as_bytes()[0];
let kind = if byte < 128 {
// SAFETY: We've just determined next byte is ASCII
let ascii_lexer = unsafe { LexerWithNextCharAscii::new(self) };
BYTE_HANDLERS[byte as usize](ascii_lexer)
} else {
// ...
}
type ByteHandler = fn(LexerWithNextCharAscii<'_>) -> Kind;
const QOT: ByteHandler = |ascii_lexer| {
// Consume next char efficiently as we know it's ASCII
// No unsafe here!
let (c, lexer) = ascii_lexer.consume_char();
// `lexer` is now a normal `Lexer` without the ASCII guarantee
if lexer.context == LexerContext::JsxAttributeValue {
lexer.read_jsx_string_literal(c)
} else {
lexer.read_string_literal(c)
}
}; This only uses the type system to enforce the constraints, and should have no runtime cost. But still it's a bit bananas... Personally I think it's cleaner to keep PS Yes, I take your point that it's all contained within 1 file. But that file is 1800 lines! |
Perhaps we should move away from the I guess this is why Both of our points are valid. The rule of thumb is to make the code more approachable for future readers, so feel free to disagree with my changes and make adjustments, I'm happy to merge. |
Thanks for coming back. The I do see why you would prefer to avoid unsafe in every byte handler. It is ugly. My personal approach to unsafe is quite rigid. I like that when you call a safe function, there's a whole bunch of things you don't need to worry about. But obviously that breaks down if you can't trust safe functions to be safe! I guess it's inevitable that there's going to be some trade-off between performance and readability in a project like this where performance is a focus. It's tough to strike that balance. Anyway, thanks for being open to disagreement. I'll give it some thought... Out of interest, how do you feel about the kind of type system-based alternative I outlined above? |
Feels like over abstracting with concepts, which isn't beginner friendly. The trying to keep it simple idea came from reading the Rome / Biome codebase (and partly swc as well). |
I think I found a solution which squares the circle. No unsafe, but without sacrificing the performance gain. Will test it out and make a PR. |
As discussed on #2046, it wasn't ideal to have `unsafe { lexer.consume_ascii_char() }` in every byte handler. It also wasn't great to have a safe function `consume_ascii_char()` which could cause UB if called incorrectly (so wasn't really safe at all). This PR achieves the same objective of #2046, but using a macro to define byte handlers for ASCII chars, which builds in the assertion that next char is guaranteed to be ASCII. Before #2046: ```rs const SPS: ByteHandler = |lexer| { lexer.consume_char(); Kind::WhiteSpace }; ``` After this PR: ```rs ascii_byte_handler!(SPS(lexer) { lexer.consume_char(); Kind::WhiteSpace }); ``` i.e. The body of the handlers are unchanged from how they were before #2046. This expands to: ```rs const SPS: ByteHandler = |lexer| { unsafe { let s = lexer.current.chars.as_str(); assert_unchecked!(!s.is_empty()); assert_unchecked!(s.as_bytes()[0] < 128); } lexer.consume_char(); Kind::WhiteSpace }; ``` But due to the assertions the macro inserts, `consume_char()` is now optimized for ASCII characters, and reduces to a single instruction. So the `consume_ascii_char()` function introduced by #2046 is unnecessary, and can be removed again. The "boundary of unsafe" is moved to a new function `handle_byte()` which `read_next_token()` calls. `read_next_token()` is responsible for upholding the safety invariants, which include ensuring that `ascii_byte_handler!()` macro is not being misused (that last part is strictly speaking a bit of a cheat, but...). I am not a fan of macros, as they're not great for readability. But in this case I don't think it's *too* bad, because: 1. The macro is well-documented. 2. It's not too clever (only one syntax is accepted). 3. It's used repetitively in a clear pattern, and once you've understood one, you understand them all. What do you think? Does this strike a reasonable balance between readability and safety?
…ject#2046) In the lexer, most `BYTE_HANDLER`s immediately consume the current char with `lexer.consume_char()`. Byte handlers are only called if there's a certain value (or range of values) for the next char. This is their entire purpose. So in all cases we know for sure that we're not at EOF, and that the next char is a single-byte ASCII character. The compiler, however, doesn't seem to be able to "see through" the `BYTE_HANDLERS[byte](self)` call and understand these invariants. So it produces very verbose ASM for `lexer.consume_char()`. This PR replaces `lexer.consume_char()` in the byte handlers with an unsafe `lexer.consume_ascii_char()` which skips on to next char with a single `inc` instruction. The difference in codegen can be seen here: https://godbolt.org/z/1ha3cr9W5 (compare the 2 x `core::ops::function::FnOnce::call_once` handlers). Downside is that this does introduce a lot of unsafe blocks, but in my opinion they're all pretty trivial to validate. --------- Co-authored-by: Boshen <boshenc@gmail.com>
As discussed on oxc-project#2046, it wasn't ideal to have `unsafe { lexer.consume_ascii_char() }` in every byte handler. It also wasn't great to have a safe function `consume_ascii_char()` which could cause UB if called incorrectly (so wasn't really safe at all). This PR achieves the same objective of oxc-project#2046, but using a macro to define byte handlers for ASCII chars, which builds in the assertion that next char is guaranteed to be ASCII. Before oxc-project#2046: ```rs const SPS: ByteHandler = |lexer| { lexer.consume_char(); Kind::WhiteSpace }; ``` After this PR: ```rs ascii_byte_handler!(SPS(lexer) { lexer.consume_char(); Kind::WhiteSpace }); ``` i.e. The body of the handlers are unchanged from how they were before oxc-project#2046. This expands to: ```rs const SPS: ByteHandler = |lexer| { unsafe { let s = lexer.current.chars.as_str(); assert_unchecked!(!s.is_empty()); assert_unchecked!(s.as_bytes()[0] < 128); } lexer.consume_char(); Kind::WhiteSpace }; ``` But due to the assertions the macro inserts, `consume_char()` is now optimized for ASCII characters, and reduces to a single instruction. So the `consume_ascii_char()` function introduced by oxc-project#2046 is unnecessary, and can be removed again. The "boundary of unsafe" is moved to a new function `handle_byte()` which `read_next_token()` calls. `read_next_token()` is responsible for upholding the safety invariants, which include ensuring that `ascii_byte_handler!()` macro is not being misused (that last part is strictly speaking a bit of a cheat, but...). I am not a fan of macros, as they're not great for readability. But in this case I don't think it's *too* bad, because: 1. The macro is well-documented. 2. It's not too clever (only one syntax is accepted). 3. It's used repetitively in a clear pattern, and once you've understood one, you understand them all. What do you think? Does this strike a reasonable balance between readability and safety?
In the lexer, most
BYTE_HANDLER
s immediately consume the current char withlexer.consume_char()
.Byte handlers are only called if there's a certain value (or range of values) for the next char. This is their entire purpose. So in all cases we know for sure that we're not at EOF, and that the next char is a single-byte ASCII character.
The compiler, however, doesn't seem to be able to "see through" the
BYTE_HANDLERS[byte](self)
call and understand these invariants. So it produces very verbose ASM forlexer.consume_char()
.This PR replaces
lexer.consume_char()
in the byte handlers with an unsafelexer.consume_ascii_char()
which skips on to next char with a singleinc
instruction.The difference in codegen can be seen here: https://godbolt.org/z/1ha3cr9W5 (compare the 2 x
core::ops::function::FnOnce::call_once
handlers).Downside is that this does introduce a lot of unsafe blocks, but in my opinion they're all pretty trivial to validate.