Faster tokenizer #11392
`detachSubstring` replaces the old `cloneStr` + `_identifierInternedStrings` Map for identifiers. The old intern map deduplicated repeated identifiers within a tokenization pass (e.g., 10,000 occurrences of `self` → 1 string object). The new code creates a fresh string per token. The benchmark corpora are 300–1700 lines and won't surface memory regressions on very large files with repetitive identifiers (e.g., generated code). Consider adding a "repetitive identifier" benchmark corpus to validate this tradeoff, or document that the intern map was intentionally removed with the expectation that per-token cost savings outweigh deduplication benefits.
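To make the tradeoff in this comment concrete, here is a minimal sketch of what a per-pass identifier intern map does. The class `IdentifierInterner` and its members are hypothetical and are not the pyright implementation; the sketch only shows that repeated identifiers collapse to one stored string, at the price of one `Map` lookup per token.

```typescript
// Hypothetical sketch, not pyright's code: a per-pass intern table for identifier text.
class IdentifierInterner {
    private readonly _strings = new Map<string, string>();

    // Returns the string already stored for this text, or stores and returns the new one.
    intern(text: string): string {
        const existing = this._strings.get(text);
        if (existing !== undefined) {
            return existing;
        }
        this._strings.set(text, text);
        return text;
    }

    // Number of distinct identifier strings retained by the table.
    get uniqueCount(): number {
        return this._strings.size;
    }
}

// Simulate a source text containing 10,000 occurrences of `self`.
const sourceText = 'self '.repeat(10000);
const interner = new IdentifierInterner();
let tokenCount = 0;
for (let start = 0; start < sourceText.length; start += 5) {
    // Each substring is a fresh token string; interning maps it to the shared copy.
    interner.intern(sourceText.substring(start, start + 4));
    tokenCount++;
}
console.log(tokenCount, interner.uniqueCount);
```

The `Map` lookup runs for every identifier token, which is the per-token cost the comment weighs against the memory saved by deduplication.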
Addressed in commit 462a76a: added a …
…for : in type: ignore[...] bracket codes (e.g. ty:unresolved-reference). The fix adds Char.Colon to the allowed bracket-content characters in both bracket-parsing branches of matchIgnoreDirective, but only when directive === 'type' — matching the original regex difference ([\s\w:,-]* for type: vs [\s\w-,]* for pyright:).
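The directive-specific character rule described in this commit message can be sketched as a small predicate. All names below (`IgnoreDirective`, `isAllowedBracketChar`, `isValidBracketContent`) are hypothetical stand-ins, and whitespace is simplified to space and tab; the sketch only mirrors the stated regex difference, where a colon is accepted for `type:` but not for `pyright:`.

```typescript
// Hypothetical sketch of the bracket-content rule; not the actual matchIgnoreDirective code.
type IgnoreDirective = 'type' | 'pyright';

const CHAR_COLON = 0x3a; // ':'
const CHAR_COMMA = 0x2c; // ','
const CHAR_HYPHEN = 0x2d; // '-'
const CHAR_UNDERSCORE = 0x5f; // '_'

// Equivalent of \w for ASCII input: letters, digits, underscore.
function isWordChar(ch: number): boolean {
    return (
        (ch >= 0x30 && ch <= 0x39) ||
        (ch >= 0x41 && ch <= 0x5a) ||
        (ch >= 0x61 && ch <= 0x7a) ||
        ch === CHAR_UNDERSCORE
    );
}

// Simplification: only space and tab are treated as whitespace here.
function isSpaceOrTab(ch: number): boolean {
    return ch === 0x20 || ch === 0x09;
}

function isAllowedBracketChar(ch: number, directive: IgnoreDirective): boolean {
    if (isWordChar(ch) || isSpaceOrTab(ch) || ch === CHAR_COMMA || ch === CHAR_HYPHEN) {
        return true;
    }
    // The colon is accepted only for `type: ignore[...]`.
    return directive === 'type' && ch === CHAR_COLON;
}

function isValidBracketContent(content: string, directive: IgnoreDirective): boolean {
    for (let i = 0; i < content.length; i++) {
        if (!isAllowedBracketChar(content.charCodeAt(i), directive)) {
            return false;
        }
    }
    return true;
}

console.log(isValidBracketContent('ty:unresolved-reference', 'type'));
console.log(isValidBracketContent('ty:unresolved-reference', 'pyright'));
```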
This PR improves parser/tokenizer performance on common hot paths and moves benchmark suites out of the normal Jest test matrix.

The main changes are:
- replace regex-based directive and continuation scanning with manual scans
- reduce tokenizer overhead on common identifier paths
- clean up a few parser/token access paths that benefit from the tokenizer work
- add dedicated benchmark coverage and keep benchmark runs opt-in

## Benchmark Results
I reran the tokenizer benchmarks against `main` using the isolated harness so each corpus runs in a fresh process.

Representative median results vs `main`:
- `large_stdlib`: `3.13ms` vs `3.98ms` (`21%` faster)
- `fstring_heavy`: `1.77ms` vs `1.94ms` (`9%` faster)
- `large_class`: `1.97ms` vs `2.12ms` (`7%` faster)
- `union_heavy`: `1.63ms` vs `2.18ms` (`25%` faster)
- `large_stdlib_10x`: `21.17ms` vs `24.42ms` (`13%` faster)

`comment_heavy` was effectively flat, and `import_heavy` remained too noisy to treat as a reliable headline result. Overall, the larger and more representative tokenizer-heavy corpora improved relative to `main`.

## Testing
- tokenizer regression tests passed
- tokenizer test suite passed
- full `pyright-internal` test suite passed
- isolated tokenizer benchmark runs completed successfully
…cts `type: ignore[` or `pyright: ignore[` when the closing `]` is missing, instead of silently treating them as bare ignore directives. I also added a regression test in `tokenizer.test.ts` for the malformed-bracket case. Focused validation passed: the `TypeIgnore|PyrightIgnore` slice now reports 9 passing tests, including `TypeIgnoreLineMalformedBracket`.
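The malformed-bracket behavior in this commit message can be illustrated with a short sketch. The function `parseAfterIgnore` and the `IgnoreResult` type are hypothetical; the sketch assumes the caller has already matched the directive and the word `ignore`, and it only demonstrates that an unclosed `[` yields a rejection rather than a bare "ignore everything" result.

```typescript
// Hypothetical sketch; not pyright's parser. `undefined` means the directive is rejected.
type IgnoreResult = { kind: 'all' } | { kind: 'rules'; rules: string[] } | undefined;

// `rest` is the comment text immediately following the word "ignore".
function parseAfterIgnore(rest: string): IgnoreResult {
    let i = 0;
    // Skip optional spaces or tabs between "ignore" and "[".
    while (i < rest.length && (rest[i] === ' ' || rest[i] === '\t')) {
        i++;
    }
    if (i >= rest.length || rest[i] !== '[') {
        // No bracket at all: a bare ignore directive.
        return { kind: 'all' };
    }
    const close = rest.indexOf(']', i + 1);
    if (close < 0) {
        // Opening bracket without a closing one: reject instead of ignoring everything.
        return undefined;
    }
    const rules = rest
        .substring(i + 1, close)
        .split(',')
        .map((rule) => rule.trim())
        .filter((rule) => rule.length > 0);
    return { kind: 'rules', rules };
}

console.log(parseAfterIgnore(''));
console.log(parseAfterIgnore('[broken'));
console.log(parseAfterIgnore(' [broken'));
console.log(parseAfterIgnore('[ruleA, ruleB]'));
```

Returning `undefined` for the unclosed case keeps the space and no-space forms consistent, since both go through the same bracket check.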
…e-bracket test, fast-path comment handler
…irective (fix O(n^2) on comment-heavy files)
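The O(n^2) problem named in the commit title above comes from running an unbounded search once per comment. A minimal sketch of the difference follows; `boundedIndexOf` is a hypothetical helper, not the function used in the PR, and the sample text is invented for illustration.

```typescript
// Hypothetical helper: finds `needle` inside text[start, end) and never reads past `end`.
function boundedIndexOf(text: string, needle: string, start: number, end: number): number {
    const lastStart = Math.min(end, text.length) - needle.length;
    for (let i = Math.max(start, 0); i <= lastStart; i++) {
        let j = 0;
        while (j < needle.length && text.charCodeAt(i + j) === needle.charCodeAt(j)) {
            j++;
        }
        if (j === needle.length) {
            return i;
        }
    }
    return -1;
}

// Two comments: only the second one contains a directive.
const sampleText = 'x = 1  # note\ny = 2  # type: ignore\n';
const firstStart = sampleText.indexOf('#');
const firstEnd = sampleText.indexOf('\n', firstStart);

// String.prototype.indexOf has no end bound, so it keeps scanning into later lines.
const unbounded = sampleText.indexOf('type:', firstStart);
// The bounded scan stops at the end of the first comment.
const bounded = boundedIndexOf(sampleText, 'type:', firstStart, firstEnd);

console.log(unbounded > firstEnd, bounded);
```

With many comments and few directives, each unbounded call can walk a large part of the remaining file, which is how the total work grows quadratically.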
3a82f80 to 2c8cb90
According to mypy_primer, this change doesn't affect type check results on a corpus of open source code. ✅
Already committed as 462a76a2f. Here's the updated PR description:
Faster tokenizer
Summary
Continues the tokenizer optimization work with a series of hot-path improvements, plus review follow-ups on the ignore-directive scanner and identifier-intern behavior. Combined result: ~20–30% faster on large Python corpora.
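As one illustration of the hot-path work summarized here, an ASCII identifier fast path (table lookups inside a tight `charCodeAt` loop, with a fallback when a non-ASCII character appears) could look roughly like the sketch below. The table contents and the function `scanAsciiIdentifier` are assumptions made for illustration, not the code in this PR.

```typescript
// Hypothetical sketch: lookup tables indexed by ASCII char code (1 = allowed, 0 = not).
const asciiIdentifierStart = new Uint8Array(128);
const asciiIdentifierContinue = new Uint8Array(128);
for (let ch = 0; ch < 128; ch++) {
    const isLetterOrUnderscore = (ch >= 0x41 && ch <= 0x5a) || (ch >= 0x61 && ch <= 0x7a) || ch === 0x5f;
    const isDigit = ch >= 0x30 && ch <= 0x39;
    asciiIdentifierStart[ch] = isLetterOrUnderscore ? 1 : 0;
    asciiIdentifierContinue[ch] = isLetterOrUnderscore || isDigit ? 1 : 0;
}

// Returns the exclusive end offset of a pure-ASCII identifier starting at `start`.
// Returns -1 when the fast path does not apply (no ASCII identifier starts here, or a
// non-ASCII char is encountered), so the caller can fall back to the general path.
function scanAsciiIdentifier(text: string, start: number): number {
    if (start >= text.length) {
        return -1;
    }
    let ch = text.charCodeAt(start);
    if (ch >= 128 || asciiIdentifierStart[ch] !== 1) {
        return -1;
    }
    let pos = start + 1;
    while (pos < text.length) {
        ch = text.charCodeAt(pos);
        if (ch >= 128) {
            return -1; // Non-ASCII: defer to the Unicode/surrogate-aware path.
        }
        if (asciiIdentifierContinue[ch] !== 1) {
            break;
        }
        pos++;
    }
    return pos;
}

console.log(scanAsciiIdentifier('self.value', 0));
console.log(scanAsciiIdentifier('na\u00efve', 0));
```

The loop touches only a char code and a typed-array slot per character, which is why this kind of scan tends to be cheaper than a regex or per-character method calls.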
Benchmark results vs `main`
Methodology: separate git worktrees of `origin/main` and `faster-tokenizer`, identical `tokenizerBenchmark.test.ts` harness (each corpus run in a fresh Node process, 3 warmup + 10 measured iterations). Numbers below are the median of per-run medians across 3–6 runs per side. Small corpora (< 2 ms, < 10 KB) are dominated by V8 JIT/GC jitter and noted as noise.
The `large_stdlib_10x` result is the most trustworthy signal (10× work, proportionally less noise) and shows a consistent −24%.
Enhancements
Tokenizer hot paths
- … `_canStartString`, `_asciiIdentifierStart`, `_asciiIdentifierContinue`, `_keywordFirstCharTable`, `_singleCharOperatorTypeTable`, …).
- … `_tryIdentifier` that advances over ASCII identifier chars in a tight `charCodeAt` loop, falling back to the unicode/surrogate path only when a non-ASCII char is encountered. The single biggest win on real-world code.
- … `firstChar`/`lastChar`/`length`, no chaining) that deduplicates repeated identifiers (`self`, `cls`, `True`, `None`, `str`, `int`, …) within a single tokenize pass. Addresses the memory concern raised during review about per-token string allocation, without the overhead of the previous `Map`-based intern table. The new `repetitive_identifiers` benchmark corpus locks in the tradeoff (−13% vs main on that corpus).
- … `CharacterStream.skipWhitespace` as a tight `charCodeAt` loop that updates `_position`/`_currentChar`/`_isEndOfStream` directly, avoiding per-iteration method calls.
- … `_handleComment` so the O(n) `_tokens.findIndex` scan no longer runs on every `type: ignore` directive.
Ignore-directive scanner (review follow-ups)
- … `type: ignore[...]` rules (e.g. `ty:rule-name`).
- … `type: ignore[` / `pyright: ignore[` comments with an unclosed bracket rather than treating them as "ignore all diagnostics". New test `TypeIgnoreLineMalformedBracketWithSpace` locks in this behavior for the `# type: ignore [broken` case.
- … `parseIgnoreBracketContent` helper: both the "bracket-after-space" and "bracket-immediately-after-ignore" branches now share one implementation.
- … `_handleComment` that uses `indexOf('ignore', …)` before invoking the directive scanner, so comments without the word `ignore` don't pay directive-parsing cost.
- `matchIgnoreDirective` uses a bounded hand-rolled scan for the directive keyword. (An earlier iteration used `String.prototype.indexOf`, which has no end bound and scanned well past the current comment on comment-heavy files, producing O(n²) behavior; the worktree-vs-main comparison caught and fixed this.)
Parser / source-file touch-ups
Benchmark infrastructure
- … `large_stdlib`, `large_stdlib_10x`, `fstring_heavy`, `comment_heavy`, `large_class`, `import_heavy`, `union_heavy`, and `repetitive_identifiers` (the last specifically validates the identifier intern cache).
- … `execFileSync`) with 3 warmup + 10 measured iterations; results written as JSON under `.generated/benchmark-results/` for side-by-side comparison.
- … `calculateStats`, `printResultTable`) take `ReadonlyArray` parameters per review feedback.
Tokenizer regression tests
Behavior notes
- `# type: ignore[unclosed` and `# type: ignore [unclosed` are now rejected entirely (no `typeIgnoreLine` recorded) instead of falling back to "ignore all". This matches the intent of the original regex for the `[` branch and is now consistent between the space and no-space cases.
- … `ty:rule-name`) remain valid inside `type: ignore[...]` but not inside `pyright: ignore[...]`.
Validation
- `npm test` (full pyright-internal suite): green.
- `npx jest tokenizer parser.test --forceExit`: 137/137 passing.
- `npm run test:benchmark`: table above; raw JSON in `.generated/benchmark-results/tokenizer/`.
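For readers comparing the raw JSON themselves, the "median of per-run medians" summary from the methodology can be sketched as below. The function names are hypothetical and are not the harness's `calculateStats`; the per-run timings in the example are made-up placeholders, while the `3.98` / `3.13` pair is the `large_stdlib` result quoted earlier in this conversation.

```typescript
// Hypothetical sketch of the summary statistics; inputs are read-only, as in the review feedback.
function median(values: ReadonlyArray<number>): number {
    if (values.length === 0) {
        throw new Error('median requires at least one value');
    }
    // Copy before sorting so the caller's array is never mutated.
    const sorted = [...values].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Each inner array holds the measured iterations of one benchmark run.
function medianOfRunMedians(runs: ReadonlyArray<ReadonlyArray<number>>): number {
    return median(runs.map((run) => median(run)));
}

// Percentage by which `candidateMs` is faster than `baselineMs`.
function percentFaster(baselineMs: number, candidateMs: number): number {
    return ((baselineMs - candidateMs) / baselineMs) * 100;
}

// Placeholder timings (not measured data) for three runs of three iterations each.
const placeholderRuns = [
    [3, 1, 2],
    [5, 4, 6],
    [9, 7, 8],
];
console.log(medianOfRunMedians(placeholderRuns));
console.log(Math.round(percentFaster(3.98, 3.13)));
```

Taking the median twice damps both per-iteration outliers (GC pauses) and per-run outliers (a noisy process start), which fits the note that small corpora are dominated by jitter.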