Faster tokenizer by bschnurr · Pull Request #11392 · microsoft/pyright

bschnurr · 2026-04-16T00:08:46Z

Already committed as 462a76a2f. Here's the updated PR description:

Faster tokenizer

Summary

Continues the tokenizer optimization work with a series of hot-path improvements, plus review follow-ups on the ignore-directive scanner and identifier-intern behavior. Combined result: ~20–30% faster on large Python corpora.

Benchmark results vs `main`

Methodology: separate git worktrees of origin/main and faster-tokenizer, identical tokenizerBenchmark.test.ts harness (each corpus run in a fresh Node process, 3 warmup + 10 measured iterations). Numbers below are the median of per-run medians across 3–6 runs per side. Small corpora (< 2 ms, < 10 KB) are dominated by V8 JIT/GC jitter and noted as noise.

| Corpus | Size | main | this PR | Δ |
|---|---|---|---|---|
| large_stdlib_10x | 430 KB | 25.98 ms | 19.62 ms | −24% |
| large_class | 24.5 KB | 2.59 ms | 1.78 ms | −31% |
| large_stdlib | 43 KB | 3.54 ms | 2.80 ms | −21% |
| repetitive_identifiers | 6.7 KB | 2.25 ms | 1.96 ms | −13% |
| union_heavy | 13 KB | 1.72 ms | 1.64 ms | −5% |
| import_heavy | 7.9 KB | 1.10 ms | 1.20 ms | +9% (noise) |
| fstring_heavy | 8.4 KB | 1.39 ms | 1.64 ms | +18% (noise) |
| comment_heavy | 9.5 KB | 1.63 ms | 1.89 ms | +16% (noise) |

The large_stdlib_10x result is the most trustworthy signal (10× work, proportionally less noise) and shows a consistent −24%.

Enhancements

Tokenizer hot paths

Replaced regex-based keyword / operator / ignore-directive matching with table-driven scans and 128-entry boolean lookup tables (_canStartString, _asciiIdentifierStart, _asciiIdentifierContinue, _keywordFirstCharTable, _singleCharOperatorTypeTable, …).
Added an ASCII fast-path in _tryIdentifier that advances over ASCII identifier chars in a tight charCodeAt loop, falling back to the unicode/surrogate path only when a non-ASCII char is encountered. The single biggest win on real-world code.
Added a direct-mapped identifier intern cache (2048 slots, hashed by firstChar/lastChar/length, no chaining) that deduplicates repeated identifiers (self, cls, True, None, str, int, …) within a single tokenize pass. Addresses the memory concern raised during review about per-token string allocation, without the overhead of the previous Map-based intern table. The new repetitive_identifiers benchmark corpus locks in the tradeoff (−13% vs main on that corpus).
Inlined CharacterStream.skipWhitespace as a tight charCodeAt loop that updates _position / _currentChar / _isEndOfStream directly, avoiding per-iteration method calls.
Cached the "any non-trivial tokens seen yet?" flag in _handleComment so the O(n) _tokens.findIndex scan no longer runs on every type: ignore directive.
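
To make the table-driven ASCII fast path concrete, here is a minimal sketch of the idea. It is not the code from this PR: `identStartTable`, `identContinueTable`, and `scanAsciiIdentifier` are hypothetical names, and the real `_tryIdentifier` does more (keyword detection, the unicode/surrogate fallback). The sketch only shows a 128-entry boolean table consulted inside a tight `charCodeAt` loop, with a signal for when the slow path would have to take over.

```typescript
// Hypothetical sketch; table and function names are illustrative, not pyright's.
// 128-entry lookup tables indexed by ASCII char code.
const identStartTable: boolean[] = [];
const identContinueTable: boolean[] = [];

for (let code = 0; code < 128; code++) {
    const isLetter = (code >= 0x41 && code <= 0x5a) || (code >= 0x61 && code <= 0x7a);
    const isDigit = code >= 0x30 && code <= 0x39;
    const isUnderscore = code === 0x5f;
    identStartTable.push(isLetter || isUnderscore);
    identContinueTable.push(isLetter || isDigit || isUnderscore);
}

interface AsciiScanResult {
    // Offset just past the last ASCII identifier char consumed.
    end: number;
    // True when the scan stopped on a char >= 128, i.e. a caller would
    // need to continue on a slower unicode-aware path.
    sawNonAscii: boolean;
}

function scanAsciiIdentifier(text: string, start: number): AsciiScanResult {
    let pos = start;
    while (pos < text.length) {
        const code = text.charCodeAt(pos);
        if (code >= 128) {
            return { end: pos, sawNonAscii: true };
        }
        // The first char uses the "start" table; later chars may also be digits.
        const table = pos === start ? identStartTable : identContinueTable;
        if (!table[code]) {
            break;
        }
        pos++;
    }
    return { end: pos, sawNonAscii: false };
}
```

A result with `end === start` and `sawNonAscii === false` means no identifier starts at that offset. Pure-ASCII identifiers never leave the loop, which is the property the fast path relies on.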

Ignore-directive scanner (review follow-ups)

Preserves support for namespaced type: ignore[...] rules (e.g. ty:rule-name).
Rejects malformed type: ignore[ / pyright: ignore[ comments with an unclosed bracket rather than treating them as "ignore all diagnostics". New test TypeIgnoreLineMalformedBracketWithSpace locks in this behavior for the # type: ignore [broken case.
Extracted the duplicated bracket-content character-class validation into a single parseIgnoreBracketContent helper — both the "bracket-after-space" and "bracket-immediately-after-ignore" branches now share one implementation.
Added a fast pre-filter in _handleComment that uses indexOf('ignore', …) before invoking the directive scanner, so comments without the word ignore don't pay directive-parsing cost.
matchIgnoreDirective uses a bounded hand-rolled scan for the directive keyword. (An earlier iteration used String.prototype.indexOf, which has no end bound and scanned well past the current comment on comment-heavy files, producing O(n²) behavior; the worktree-vs-main comparison caught this, and reverting to the bounded scan fixed it.)
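
The difference between an unbounded `indexOf` and a bounded scan can be sketched as follows. `boundedIndexOf` is a hypothetical helper written for illustration, not the function in this PR. It assumes the caller knows the `[start, end)` range of the current comment.

```typescript
// Hypothetical helper: finds `needle` only within text[start, end).
// String.prototype.indexOf accepts a start position but no end position,
// so a miss inside one comment keeps scanning the rest of the file.
function boundedIndexOf(text: string, needle: string, start: number, end: number): number {
    const last = Math.min(end, text.length) - needle.length;
    for (let i = Math.max(start, 0); i <= last; i++) {
        let j = 0;
        while (j < needle.length && text.charCodeAt(i + j) === needle.charCodeAt(j)) {
            j++;
        }
        if (j === needle.length) {
            return i;
        }
    }
    return -1;
}

// Three comments; only the last one contains "ignore".
const sampleSource = "# a\n# b\n# type: ignore\n";

// Bounded to the first comment (offsets 0..3): no match.
const boundedHit = boundedIndexOf(sampleSource, "ignore", 0, 3);

// Unbounded: reports a match that belongs to a later comment.
const unboundedHit = sampleSource.indexOf("ignore", 0);
```

With the unbounded call, every comment that lacks the keyword pays for a scan that can run to the next match or to the end of the file, which is where the quadratic cost on comment-heavy input comes from.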

Parser / source-file touch-ups

Small follow-up adjustments in the parser and source-file layer needed by the tokenizer updates.

Benchmark infrastructure

Added tokenizer and parser benchmark suites (tokenizerBenchmark.test.ts, parserBenchmark.test.ts) with representative corpora: large_stdlib, large_stdlib_10x, fstring_heavy, comment_heavy, large_class, import_heavy, union_heavy, and repetitive_identifiers (the last specifically validates the identifier intern cache).
Each corpus runs in a fresh Node process (spawned via execFileSync) with 3 warmup + 10 measured iterations; results written as JSON under .generated/benchmark-results/ for side-by-side comparison.
Benchmark helpers (calculateStats, printResultTable) take ReadonlyArray parameters per review feedback.
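
As a rough illustration of the measurement scheme (warmup iterations that are discarded, measured iterations that are kept, then a median of per-run medians), here is a self-contained sketch. The names `measure`, `median`, and `medianOfMedians` are hypothetical and are not the `calculateStats` / `printResultTable` helpers from this PR. The clock is passed in as a parameter, which is an assumption made here so the sketch has no runtime dependency.

```typescript
// Hypothetical sketch of the benchmark statistics; not the PR's helpers.

// Median of a sample; takes ReadonlyArray so callers' data is not mutated.
function median(values: ReadonlyArray<number>): number {
    if (values.length === 0) {
        throw new Error("median of empty sample");
    }
    const sorted = values.slice().sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Runs `work` warmup times without recording, then `measured` times with timing.
function measure(work: () => void, warmup: number, measured: number, now: () => number): number[] {
    for (let i = 0; i < warmup; i++) {
        work();
    }
    const samples: number[] = [];
    for (let i = 0; i < measured; i++) {
        const begin = now();
        work();
        samples.push(now() - begin);
    }
    return samples;
}

// Each inner array holds the measured samples of one fresh-process run.
function medianOfMedians(runs: ReadonlyArray<ReadonlyArray<number>>): number {
    return median(runs.map((run) => median(run)));
}
```

Taking the median at both levels keeps a single slow iteration (a GC pause, for example) or a single slow run from shifting the reported number.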

Tokenizer regression tests

Added coverage for ignore-directive parsing: malformed unclosed brackets (with and without leading space), namespaced codes, and mixed plain + namespaced codes.

Behavior notes

# type: ignore[unclosed and # type: ignore [unclosed are now rejected entirely (no typeIgnoreLine recorded) instead of falling back to "ignore all". This matches the intent of the original regex for the [ branch and is now consistent between the space and no-space cases.
Namespaced codes (ty:rule-name) remain valid inside type: ignore[...] but not inside pyright: ignore[...].
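
The two behavior notes above can be sketched as a single bracket parser. This is an illustrative reconstruction, not `parseIgnoreBracketContent` itself: the function name, signature, and return shape are assumptions, whitespace is simplified to space and tab, and rejecting on a disallowed character is also an assumption. What it takes from the description is that an unclosed bracket yields no directive, and that `:` is accepted only for the `type` directive.

```typescript
// Hypothetical sketch; name, signature, and return type are assumptions.
type IgnoreDirective = "type" | "pyright";

function isWordCode(code: number): boolean {
    return (
        (code >= 0x30 && code <= 0x39) || // 0-9
        (code >= 0x41 && code <= 0x5a) || // A-Z
        (code >= 0x61 && code <= 0x7a) || // a-z
        code === 0x5f // _
    );
}

// Parses "[rule, rule, ...]" beginning at text[openBracket].
// Returns the rule names, or undefined when the directive must be rejected.
function parseBracketRules(
    text: string,
    openBracket: number,
    end: number,
    directive: IgnoreDirective
): string[] | undefined {
    if (text.charCodeAt(openBracket) !== 0x5b /* [ */) {
        return undefined;
    }
    const limit = Math.min(end, text.length);
    for (let pos = openBracket + 1; pos < limit; pos++) {
        const code = text.charCodeAt(pos);
        if (code === 0x5d /* ] */) {
            return text
                .slice(openBracket + 1, pos)
                .split(",")
                .map((rule) => rule.trim())
                .filter((rule) => rule.length > 0);
        }
        const allowed =
            isWordCode(code) ||
            code === 0x20 || // space
            code === 0x09 || // tab
            code === 0x2c || // ,
            code === 0x2d || // -
            (code === 0x3a && directive === "type"); // : only for type: ignore
        if (!allowed) {
            return undefined;
        }
    }
    // Reached the end of the comment without a closing bracket: reject,
    // rather than falling back to "ignore all".
    return undefined;
}
```

Under these assumptions, `[ty:rule-name, other]` parses for `type` but is rejected for `pyright`, and `[unclosed` is rejected for both.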

Validation

npm test (full pyright-internal suite) — green.
Targeted: npx jest tokenizer parser.test --forceExit — 137/137 passing.
npm run test:benchmark — table above; raw JSON in .generated/benchmark-results/tokenizer/.

rchiodo · 2026-04-16T00:30:52Z

detachSubstring replaces the old cloneStr + _identifierInternedStrings Map for identifiers. The old intern map deduplicated repeated identifiers within a tokenization pass (e.g., 10,000 occurrences of self → 1 string object). The new code creates a fresh string per token. The benchmark corpora are 300–1700 lines and won't surface memory regressions on very large files with repetitive identifiers (e.g., generated code). Consider adding a "repetitive identifier" benchmark corpus to validate this tradeoff, or document that the intern map was intentionally removed with the expectation that per-token cost savings outweigh deduplication benefits.

bschnurr · 2026-04-17T23:54:46Z

Addressed in commit 462a76a: added a repetitive_identifiers benchmark corpus (234 lines, 2775 tokens) that exercises heavy reuse of identifiers like self, cls, str, int, T, K, V, repeated method signatures, and list/dict comprehensions with common names. Result vs main: branch ~1.96 ms, main ~2.25 ms (~13% faster), confirming the intern cache outperforms per-token allocation on this workload.
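
For readers unfamiliar with the term, a direct-mapped cache of this kind can be sketched as below. This is a hypothetical illustration, not the PR's implementation: the class name, the exact hash mix, and the `hits` / `misses` counters are assumptions added so the behavior is observable. From the description it keeps only the essentials: 2048 slots, a hash over first char, last char, and length, and no chaining, so a colliding identifier simply overwrites the slot.

```typescript
// Hypothetical sketch; the hash mix and counters are illustrative assumptions.
const CACHE_SLOTS = 2048; // power of two, so the hash can be masked instead of using %

// True when `candidate` appears in `text` at offset `start`.
function matchesAt(text: string, start: number, candidate: string): boolean {
    for (let i = 0; i < candidate.length; i++) {
        if (text.charCodeAt(start + i) !== candidate.charCodeAt(i)) {
            return false;
        }
    }
    return true;
}

class DirectMappedInternCache {
    hits = 0;
    misses = 0;
    private readonly slots: (string | undefined)[] = [];

    constructor() {
        for (let i = 0; i < CACHE_SLOTS; i++) {
            this.slots.push(undefined);
        }
    }

    // Returns a string for text[start, start + length), reusing the cached
    // string when the slot already holds the same identifier.
    intern(text: string, start: number, length: number): string {
        const first = text.charCodeAt(start);
        const last = text.charCodeAt(start + length - 1);
        const slot = ((first * 31 + last) * 31 + length) & (CACHE_SLOTS - 1);

        const cached = this.slots[slot];
        if (cached !== undefined && cached.length === length && matchesAt(text, start, cached)) {
            this.hits++;
            return cached;
        }

        // Miss or collision: allocate once and overwrite the slot (no chaining).
        const fresh = text.slice(start, start + length);
        this.slots[slot] = fresh;
        this.misses++;
        return fresh;
    }
}
```

The tradeoff is that two hot identifiers sharing first char, last char, and length evict each other on every alternation, in exchange for a lookup that is one array read and a short comparison.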

…for : in type: ignore[...] bracket codes (e.g. ty:unresolved-reference). The fix adds Char.Colon to the allowed bracket-content characters in both bracket-parsing branches of matchIgnoreDirective, but only when directive === 'type' — matching the original regex difference ([\s\w:,-]* for type: vs [\s\w-,]* for pyright:).

This PR improves parser/tokenizer performance on common hot paths and moves benchmark suites out of the normal Jest test matrix. The main changes are:

- replace regex-based directive and continuation scanning with manual scans
- reduce tokenizer overhead on common identifier paths
- clean up a few parser/token access paths that benefit from the tokenizer work
- add dedicated benchmark coverage and keep benchmark runs opt-in

## Benchmark Results

I reran the tokenizer benchmarks against `main` using the isolated harness so each corpus runs in a fresh process. Representative median results vs `main`:

- `large_stdlib`: `3.13ms` vs `3.98ms` (`21%` faster)
- `fstring_heavy`: `1.77ms` vs `1.94ms` (`9%` faster)
- `large_class`: `1.97ms` vs `2.12ms` (`7%` faster)
- `union_heavy`: `1.63ms` vs `2.18ms` (`25%` faster)
- `large_stdlib_10x`: `21.17ms` vs `24.42ms` (`13%` faster)

`comment_heavy` was effectively flat, and `import_heavy` remained too noisy to treat as a reliable headline result. Overall, the larger and more representative tokenizer-heavy corpora improved relative to `main`.

## Testing

- tokenizer regression tests passed
- tokenizer test suite passed
- full `pyright-internal` test suite passed
- isolated tokenizer benchmark runs completed successfully

…cts type: ignore[ or pyright: ignore[ when the closing ] is missing, instead of silently treating them as bare ignore directives. I also added a regression test in tokenizer.test.ts for the malformed-bracket case. Focused validation passed: the TypeIgnore|PyrightIgnore slice now reports 9 passing tests, including TypeIgnoreLineMalformedBracket.

github-actions · 2026-04-22T20:37:01Z

According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅

rchiodo reviewed Apr 16, 2026

Comment thread packages/pyright-internal/src/parser/tokenizer.ts

rchiodo reviewed Apr 16, 2026

Comment thread packages/pyright-internal/src/parser/tokenizer.ts

rchiodo reviewed Apr 17, 2026

Comment thread packages/pyright-internal/src/parser/tokenizer.ts

rchiodo reviewed Apr 17, 2026

Comment thread packages/pyright-internal/src/tests/benchmarks/tokenizerBenchmark.test.ts

rchiodo reviewed Apr 17, 2026

Comment thread packages/pyright-internal/src/tests/tokenizer.test.ts

bschnurr added 18 commits April 22, 2026 13:20

197fff5 initial optimization
2767153 Address PR review: extract bracket parser, ReadonlyArray, space-before-bracket test, fast-path comment handler
307ff8d tokenizer: cache typeIgnoreAll scan result (drop O(n) findIndex per directive)
d8b72f6 tokenizer: ASCII fast-path for _tryIdentifier (15-25% faster tokenize)
79ece28 characterStream: inline skipWhitespace tight loop (avoid per-char method calls)
fc51114 tokenizer: direct-mapped identifier intern cache (~14% faster on large files)
b99e9b3 tokenizer: revert indexOf to bounded hand-rolled scan in matchIgnoreDirective (fix O(n^2) on comment-heavy files)
9f06d94 benchmark: add repetitive_identifiers corpus to validate intern-cache tradeoff
602dadb benchmark: add @ts-nocheck to runBenchmarkJest.js (fix TS7016 typecheck failure)
ccf0c74 benchmark: exclude runBenchmarkJest.js from tsc (fix TS7016, replaces @ts-nocheck approach)
a720a09 benchmark: replace runBenchmarkJest.js launcher with cross-env in npm script
049a51d tokenizer: revert Token.create to single-shape form (avoid V8 IC shape split)
2327437 tokenizer: restore tokenizerTypes.ts to main (revert all two-shape comments patterns)
02ecb30 tokenizer: restore two-shape token creation (omit comments slot when undefined)
2c8cb90 tokenizer: add comments explaining two-shape token allocation optimization

bschnurr force-pushed the faster-tokenizer branch from 3a82f80 to 2c8cb90 Compare April 22, 2026 20:25

bschnurr merged commit 3619ecd into microsoft:main Apr 22, 2026
16 checks passed

bschnurr merged 18 commits into microsoft:main from bschnurr:faster-tokenizer
