perf(parser): lex identifiers as bytes not chars #2352

overlookmotel · 2024-02-08T20:15:32Z

This PR re-implements lexing identifiers with a fast path for the most common case - identifiers which are pure ASCII characters, using the new Source / SourcePosition APIs.

Lexing identifiers is a hot path, and accounts for the majority of the time the Lexer spends. The performance bump from this change is (if I do say so myself!) quite decent.

I've spent a lot of time tuning the implementation, which gained a further 10-15% on the Lexer benchmarks compared to my first, simpler attempt. Some of the design decisions, if they look odd, are likely motivated by gains in performance.

Techniques

This implementation uses a few different strategies for performance:

Search byte-by-byte, not char-by-char.
Process batches of 32 bytes at a time to reduce bounds checks.
Mark uncommon paths #[cold].

Structure

The implementation is built in 3 layers:

ASCII characters only.
ASCII and Unicode characters.
\ escape sequences (and all the above).

identifier_name_handler starts at the top layer, and is optimized for consuming ASCII as fast as possible. Each "layer" is considered more uncommon than the previous, and dropping down a layer is a de-opt.

I'm assuming that 95%+ of JavaScript code does not include either Unicode characters or escapes in identifiers, so the speed of the fast path is prioritised.

That said, once a Unicode character is encountered, the next layer does expect to find further Unicode characters, rather than de-opting over and over again. If an identifier starts with a Unicode character, it enters the code straight on the 2nd layer, so is not penalised by going through a #[cold] boundary. Lexing Unicode is never going to be as fast as ASCII, but still I felt it was important not to penalise it unnecessarily, so as not to be Anglo-centric.

ASCII search macro

The main ASCII search is implemented as a macro. I found that, for reasons I don't understand, it's significantly faster to have all the code in a single function, even compared to multiple functions marked #[inline] or #[inline(always)]. The fastest implementation also requires some code to be repeated twice, which is nicer to do with a macro.

This macro, and the ByteMatchTable types that go with it, are designed to be re-usable. Next step will be to apply them for whitespace and strings, which should be fairly simple.

Searching in batches of 32 bytes is also designed to be forward-compatible with SIMD.

Bye bye `AutoCow`

AutoCow is removed. Instead, a string-builder is only created if it's needed, when a \ escape is first encountered. The string builder is also more efficient than AutoCow was, as it copies bytes in chunks, rather than 1-by-1.

This won't make much difference for identifiers, as escapes are so rare anyway, but this same technique can be used for strings, where they're more common.

overlookmotel · 2024-02-08T20:15:42Z

Current dependencies on/for this PR:

main
- PR refactor(parser): all pointer manipulation through SourcePosition #2350
  - PR refactor(parser): macro for ASCII identifier byte handlers #2351
    - PR perf(parser): lex identifiers as bytes not chars #2352 👈
      - PR perf(parser): eat whitespace after line break #2353
        
        PR perf(parser): skip single space between tokens branchlessly #2354

This stack of pull requests is managed by Graphite.

codspeed-hq · 2024-02-08T20:25:32Z

CodSpeed Performance Report

Merging #2352 will improve performances by 56.76%

_{Comparing 02-08-perf_parser_lex_identifiers_as_bytes_not_chars (9a8d7b7) with main (6910e4f)}

Summary

⚡ 17 improvements
✅ 10 untouched benchmarks

Benchmarks breakdown

	Benchmark	`main`	`lex_identifiers_as_bytes_not_chars`	Change
⚡	`transformer[checker.ts]`	1.1 s	1.1 s	+4.63%
⚡	`lexer[checker.ts]`	146.1 ms	95.6 ms	+52.76%
⚡	`lexer[pdf.mjs]`	35.3 ms	22.5 ms	+56.76%
⚡	`parser[cal.com.tsx]`	176.4 ms	159.7 ms	+10.45%
⚡	`lexer[cal.com.tsx]`	59.7 ms	43 ms	+38.74%
⚡	`transformer[cal.com.tsx]`	520.7 ms	504 ms	+3.31%
⚡	`lexer[antd.js]`	230 ms	157.7 ms	+45.81%
⚡	`minifier[react.development.js]`	12.2 ms	11.2 ms	+9.14%
⚡	`transformer[antd.js]`	1.7 s	1.6 s	+4.5%
⚡	`parser[RadixUIAdoptionSection.jsx]`	421.9 µs	397.2 µs	+6.22%
⚡	`minifier[typescript.js]`	1.8 s	1.6 s	+9.65%
⚡	`transformer[pdf.mjs]`	300.9 ms	288.2 ms	+4.41%
⚡	`transformer[RadixUIAdoptionSection.jsx]`	819.2 µs	794.8 µs	+3.07%
⚡	`parser[pdf.mjs]`	124.3 ms	111.6 ms	+11.39%
⚡	`lexer[RadixUIAdoptionSection.jsx]`	196.5 µs	148.7 µs	+32.18%
⚡	`parser[checker.ts]`	419.2 ms	369.2 ms	+13.55%
⚡	`parser[antd.js]`	773.1 ms	698.3 ms	+10.71%

Boshen · 2024-02-09T03:50:46Z

Merge activity

Feb 8, 10:50 PM EST: @Boshen started a stack merge that includes this pull request via Graphite.
Feb 8, 10:56 PM EST: Graphite rebased this pull request as part of a merge.
Feb 8, 11:01 PM EST: @Boshen merged this pull request with Graphite.

Add a macro for ASCII identifier byte handlers. This is a preparatory step towards #2352.

Uses the `byte_search!` macro introduced in #2352 to consume whitespace after a line break.

…ct#2351) Add a macro for ASCII identifier byte handlers. This is a preparatory step towards oxc-project#2352.

This PR re-implements lexing identifiers with a fast path for the most common case - identifiers which are pure ASCII characters, using the new `Source` / `SourcePosition` APIs. Lexing identifiers is a hot path, and accounts for the majority of the time the Lexer spends. The performance bump from this change is (if I do say so myself!) quite decent. I've spent a lot of time tuning the implementation, which gained a further 10-15% on the Lexer benchmarks compared to my first, simpler attempt. Some of the design decisions, if they look odd, are likely motivated by gains in performance. ### Techniques This implementation uses a few different strategies for performance: * Search byte-by-byte, not char-by-char. * Process batches of 32 bytes at a time to reduce bounds checks. * Mark uncommon paths `#[cold]`. ### Structure The implementation is built in 3 layers: 1. ASCII characters only. 2. ASCII and Unicode characters. 3. `\` escape sequences (and all the above). `identifier_name_handler` starts at the top layer, and is optimized for consuming ASCII as fast as possible. Each "layer" is considered more uncommon than the previous, and dropping down a layer is a de-opt. I'm assuming that 95%+ of JavaScript code does not include either Unicode characters or escapes in identifiers, so the speed of the fast path is prioritised. That said, once a Unicode character is encountered, the next layer does expect to find further Unicode characters, rather than de-opting over and over again. If an identifier *starts* with a Unicode character, it enters the code straight on the 2nd layer, so is not penalised by going through a `#[cold]` boundary. Lexing Unicode is never going to be as fast as ASCII, but still I felt it was important not to penalise it unnecessarily, so as not to be Anglo-centric. ### ASCII search macro The main ASCII search is implemented as a macro. I found that, for reasons I don't understand, it's significantly faster to have all the code in a single function, even compared to multiple functions marked `#[inline]` or `#[inline(always)]`. The fastest implementation also requires some code to be repeated twice, which is nicer to do with a macro. This macro, and the `ByteMatchTable` types that go with it, are designed to be re-usable. Next step will be to apply them for whitespace and strings, which should be fairly simple. Searching in batches of 32 bytes is also designed to be forward-compatible with SIMD. ### Bye bye `AutoCow` `AutoCow` is removed. Instead, a string-builder is only created if it's needed, when a `\` escape is first encountered. The string builder is also more efficient than `AutoCow` was, as it copies bytes in chunks, rather than 1-by-1. This won't make much difference for identifiers, as escapes are so rare anyway, but this same technique can be used for strings, where they're more common.

Uses the `byte_search!` macro introduced in oxc-project#2352 to consume whitespace after a line break.

This was referenced Feb 8, 2024

refactor(parser): all pointer manipulation through SourcePosition #2350

Merged

refactor(parser): macro for ASCII identifier byte handlers #2351

Merged

github-actions bot added the A-parser Area - Parser label Feb 8, 2024

overlookmotel force-pushed the 02-08-perf_parser_lex_identifiers_as_bytes_not_chars branch from 6360f04 to 5b9f1a7 Compare February 8, 2024 21:23

This was referenced Feb 8, 2024

perf(parser): eat whitespace after line break #2353

Merged

perf(parser): skip single space between tokens branchlessly #2354

Closed

overlookmotel marked this pull request as draft February 9, 2024 01:53

overlookmotel marked this pull request as ready for review February 9, 2024 02:11

overlookmotel force-pushed the 02-08-perf_parser_lex_identifiers_as_bytes_not_chars branch 2 times, most recently from 5b9f1a7 to 070d3c4 Compare February 9, 2024 02:13

Boshen force-pushed the 02-08-refactor_parser_macro_for_ASCII_identifier_byte_handlers branch from 593a602 to beafc6c Compare February 9, 2024 03:51

Base automatically changed from 02-08-refactor_parser_macro_for_ASCII_identifier_byte_handlers to main February 9, 2024 03:55

Boshen pushed a commit that referenced this pull request Feb 9, 2024

refactor(parser): macro for ASCII identifier byte handlers (#2351)

6910e4f

Add a macro for ASCII identifier byte handlers. This is a preparatory step towards #2352.

perf(parser): lex identifiers as bytes not chars

9a8d7b7

Boshen force-pushed the 02-08-perf_parser_lex_identifiers_as_bytes_not_chars branch from 070d3c4 to 9a8d7b7 Compare February 9, 2024 03:55

Boshen merged commit d3a59f2 into main Feb 9, 2024
21 checks passed

Boshen deleted the 02-08-perf_parser_lex_identifiers_as_bytes_not_chars branch February 9, 2024 04:01

Boshen pushed a commit that referenced this pull request Feb 9, 2024

perf(parser): eat whitespace after line break (#2353)

8376f15

Uses the `byte_search!` macro introduced in #2352 to consume whitespace after a line break.

overlookmotel mentioned this pull request Feb 9, 2024

SIMD in the lexer #2296

Open

This was referenced May 13, 2024

perf(lexer): use SIMD char search when parsing numbers #3250

Closed

Convert lexer to iterate over bytes instead of chars #3291

Open

IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024

refactor(parser): macro for ASCII identifier byte handlers (oxc-proje…

655d9bc

…ct#2351) Add a macro for ASCII identifier byte handlers. This is a preparatory step towards oxc-project#2352.

IWANABETHATGUY pushed a commit to IWANABETHATGUY/oxc that referenced this pull request May 29, 2024

perf(parser): eat whitespace after line break (oxc-project#2353)

4e0b1a0

Uses the `byte_search!` macro introduced in oxc-project#2352 to consume whitespace after a line break.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(parser): lex identifiers as bytes not chars #2352

perf(parser): lex identifiers as bytes not chars #2352

overlookmotel commented Feb 8, 2024 •

edited

Loading

overlookmotel commented Feb 8, 2024 •

edited

Loading

codspeed-hq bot commented Feb 8, 2024 •

edited by Boshen

Loading

Boshen commented Feb 9, 2024 •

edited

Loading

perf(parser): lex identifiers as bytes not chars #2352

perf(parser): lex identifiers as bytes not chars #2352

Conversation

overlookmotel commented Feb 8, 2024 • edited Loading

Techniques

Structure

ASCII search macro

Bye bye AutoCow

overlookmotel commented Feb 8, 2024 • edited Loading

codspeed-hq bot commented Feb 8, 2024 • edited by Boshen Loading

CodSpeed Performance Report

Merging #2352 will improve performances by 56.76%

Summary

Benchmarks breakdown

Boshen commented Feb 9, 2024 • edited Loading

Merge activity

overlookmotel commented Feb 8, 2024 •

edited

Loading

Bye bye `AutoCow`

overlookmotel commented Feb 8, 2024 •

edited

Loading

codspeed-hq bot commented Feb 8, 2024 •

edited by Boshen

Loading

Boshen commented Feb 9, 2024 •

edited

Loading