Convert lexer to iterate over bytes instead of chars #3291
**Labels**

- `A-parser` (Area - Parser)
- `C-performance` (Category - Solution not expected to change functional behavior, only performance)
- `E-Help Wanted` (Experience level - For the experienced collaborators)
## Why the lexer is slower than it could be
The `Chars` iterator is really slow. The lexer should iterate byte-by-byte rather than char-by-char. In almost all cases, we're only matching against ASCII characters anyway (e.g. `self.next_eq('.')`), so calculating the Unicode `char` is completely pointless; we only care whether it's a particular ASCII byte. Surprisingly, the compiler seems unable to see this, and doesn't optimize out the char calculation itself. It generates a lot of pointless and slow code.

## How to make it faster
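Byte-level matching of ASCII is sound because of how UTF-8 is laid out: every byte of a multi-byte sequence has its high bit set (`>= 0x80`), so an ASCII byte found in the byte stream is always a whole character, never a fragment of one. A quick standalone check (a sketch, not oxc code):

```rust
fn main() {
    // Multi-byte chars surrounding an ASCII '.'.
    let s = "名.前";

    // Byte search and char search agree on the position of '.'.
    let dot_pos = s.as_bytes().iter().position(|&b| b == b'.');
    assert_eq!(dot_pos, s.find('.'));

    // No byte inside the multi-byte chars is in the ASCII range,
    // so a byte-level scan for ASCII can never match mid-character.
    assert!(s.as_bytes().iter().filter(|&&b| b != b'.').all(|&b| b >= 0x80));

    println!("'.' found at byte offset {:?}", dot_pos);
}
```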
`Source` was introduced with the intention of easing the transition to byte-by-byte iteration through source code. #2352 got some heavy perf improvements from using it (along with other tricks), but I have not been able to find the time to complete the work.
The following APIs should be converted to take/return `u8` and use `Source::peek_byte` instead of `Source::peek_char`:

- `peek`
- `peek2`
- `next_eq`

Usages of these APIs should be refactored, unless they're directly preceded by a `peek`:

- `next_char`
- `consume_char`
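To make the conversion concrete, here is a hypothetical sketch of the converted API shapes. The struct fields and method bodies below are illustrative only, not oxc's real `Source`/`Lexer` implementation:

```rust
// Hypothetical stand-ins for the real types; field layout is assumed.
struct Source<'a> {
    bytes: &'a [u8],
    pos: usize,
}

impl<'a> Source<'a> {
    /// Return the next byte without advancing.
    fn peek_byte(&self) -> Option<u8> {
        self.bytes.get(self.pos).copied()
    }
}

struct Lexer<'a> {
    source: Source<'a>,
}

impl<'a> Lexer<'a> {
    /// After conversion, `peek` returns a `u8` instead of a `char`.
    fn peek(&self) -> Option<u8> {
        self.source.peek_byte()
    }

    /// `next_eq` compares against an ASCII byte and only advances on a match.
    /// Advancing past an ASCII byte can never land in the middle of a
    /// multi-byte UTF-8 char, so the position invariant is preserved.
    fn next_eq(&mut self, b: u8) -> bool {
        debug_assert!(b.is_ascii());
        if self.peek() == Some(b) {
            self.source.pos += 1;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut lexer = Lexer { source: Source { bytes: b"a.b", pos: 0 } };
    assert!(lexer.next_eq(b'a'));
    assert!(lexer.next_eq(b'.'));
    assert!(!lexer.next_eq(b'.'));
    assert_eq!(lexer.peek(), Some(b'b'));
}
```

Callers that previously wrote `self.next_eq('.')` would become `self.next_eq(b'.')`, with no `char` decoding anywhere on the hot path.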
## Suggested implementation
`Source::next_byte` is an unsafe function, as it can break the invariant that `Source` must always be positioned on a UTF-8 character boundary (i.e. not in the middle of a multi-byte Unicode char). It's preferable not to use it, to avoid littering the lexer with unsafe code.

However, `Source::next_char` is written to be much more transparent to the compiler than `Chars::next` is, so the compiler is able to optimize safe code using it down to the equivalent of a direct byte-level loop (at least that's what I remember from a few months ago when I checked it with Godbolt).
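The original snippets from this issue are not preserved here, so the following is a hedged reconstruction of the pattern being described, using standard-library calls in place of oxc's `Source`. The two functions behave identically; whether the compiler emits the same machine code for both is exactly the Godbolt check mentioned above:

```rust
/// Safe, char-level loop (standing in for code using `Source::next_char`):
/// returns the byte offset of `target`, or the source length if absent.
fn skip_until_char(source: &str, target: char) -> usize {
    for (i, c) in source.char_indices() {
        if c == target {
            return i;
        }
    }
    source.len()
}

/// The byte-level loop the optimizer can reduce the above to when
/// `target` is ASCII: a plain `u8` comparison per iteration.
fn skip_until_byte(source: &str, target: u8) -> usize {
    let bytes = source.as_bytes();
    let mut pos = 0;
    while pos < bytes.len() {
        if bytes[pos] == target {
            return pos;
        }
        pos += 1;
    }
    pos
}

fn main() {
    let src = "abc;def";
    assert_eq!(skip_until_char(src, ';'), skip_until_byte(src, b';'));
    // Both report byte offsets, so they also agree on non-ASCII input.
    assert_eq!(skip_until_char("héllo;x", ';'), skip_until_byte("héllo;x", b';'));
}
```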
(originally mentioned in #3250 (comment))