fix(parity): align WASM and native ast_nodes extraction#1016
fix(parity): align WASM and native ast_nodes extraction#1016carlos-alm merged 4 commits intomainfrom
Conversation
Resolves #1010. Three independent divergences were causing the native engine to emit ~7,200 excess `string` AST nodes vs WASM on self-build: 1. Language coverage gap (~6,653 rows) — WASM's AST_TYPE_MAPS registered only javascript/typescript/tsx; native emitted ast_nodes for 19 more languages via `walk_ast_nodes_with_config`. Mirrored every LangAstConfig from `helpers.rs` into WASM as `AST_TYPE_MAPS` + `AST_STRING_CONFIGS` entries and threaded a per-language `stopRecurseKinds` set through `createAstStoreVisitor`. 2. WASM `await` skipChildren (~500 rows) — the visitor returned skipChildren for `await_expression`, so string/call children of `await import('x')` / `await fn('y')` were never walked. Native's javascript.rs explicitly recurses. Removed `await` from the skipChildren filter. 3. UTF-8 byte-length check in native (~40 rows) — `crates/codegraph-core/src/extractors/javascript.rs` gated string emission on `content.len() < 2` (UTF-8 byte count). Any single non- ASCII glyph like `─` (3 bytes) was emitted. Changed to `content.chars().count() < 2` for parity with helpers.rs and JS `.length`. WASM's filter uses code-point count (`[...s].length`). Measured parity after fix on 775 shared files (excluding files edited in this PR): 37,605 (WASM) vs 37,649 (native) = 0.12 % delta. Every kind except `string` is at 0 delta; the remaining 44-row string gap is the UTF fix still waiting on the next native binary rebuild. New parity test in tests/engines/ast-parity.test.ts asserts ≤1 row divergence between engines for six languages (js, ts, python, rust, go, java).
|
Claude finished @carlos-alm's task —— View job Reviewing PR: fix(parity): align WASM and native ast_nodes extractionTodo List:
Root Cause AnalysisProblem Being Solved: Genuine parity issues where native Rust engine extracted ~7,200 excess AST nodes vs WASM across 3 specific divergences. Backlog Compliance
Critical ConcernsNone identified. This is high-quality parity work that: ✅ Addresses real inconsistencies - Not hiding symptoms but fixing actual engine divergences Code Quality AssessmentThe implementation is exemplary:
Final Recommendation
|
Greptile SummaryThis PR fixes three independent WASM/native Confidence Score: 5/5Safe to merge — all findings are P2 style/quality concerns with no impact on the production worker pipeline. No P0 or P1 issues found. The two P2 items (missing extToLang fallback for a dead legacy code path, and a trivial Set allocation in a default parameter) do not affect the modern worker-isolated pipeline. The core parity fixes are correct and well-tested. src/features/ast.ts (legacy _tree walk path missing extToLang fallback) and src/ast-analysis/visitors/ast-store-visitor.ts (stopRecurseKinds default parameter allocation). Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[File to analyse] --> B{astNodes pre-computed?}
B -- yes, native/worker path --> C[bulkInsertAstNodes / JS insert]
B -- no, legacy _tree path --> D{WALK_EXTENSIONS or AST_TYPE_MAPS has langId?}
D -- no --> E[skip — return empty]
D -- yes --> F[walkAst with langId]
F --> G{AST_TYPE_MAPS.get langId}
G -- miss, langId='' --> H[debug log, return empty]
G -- hit --> I[createAstStoreVisitor\nastTypeMap + stringConfig\n+ stopRecurseKinds]
I --> J{node.type in astTypeMap?}
J -- no --> K[recurse children]
J -- yes, kind=string --> L[extractStringContent\nstrip prefixes+quotes\ncount code points]
L -- codePoints < 2 --> K
L -- codePoints >= 2 --> M[emit string row]
J -- yes, kind=new/throw JS only --> N[emit row + skipChildren\nstopRecurseKinds]
J -- yes, kind=await/regex/etc --> O[emit row + recurse children]
Reviews (2): Last reviewed commit: "test: add ast_nodes parity fixtures for ..." | Re-trigger Greptile |
| function extractChildExpressionText(node: TreeSitterNode): string | null { | ||
| const keywords = new Set(['throw', 'raise', 'await', 'new']); | ||
| for (let i = 0; i < node.childCount; i++) { | ||
| const child = node.child(i); | ||
| if (!child) continue; | ||
| if (child.type === 'call_expression') { | ||
| const fn = child.childForFieldName('function'); | ||
| return fn ? fn.text : child.text?.split('(')[0] || '?'; | ||
| } | ||
| if (child.type === 'identifier' || child.type === 'member_expression') { | ||
| return child.text; | ||
| } | ||
| if (!keywords.has(child.type)) return truncate(child.text); | ||
| } | ||
| return truncate(node.text); | ||
| } |
There was a problem hiding this comment.
Inline
new Set allocation on every call
extractChildExpressionText creates a brand-new Set on every invocation. For a large file with many throw/await nodes this is avoidable churn. The set is stateless, so it can be a module-level constant.
| function extractChildExpressionText(node: TreeSitterNode): string | null { | |
| const keywords = new Set(['throw', 'raise', 'await', 'new']); | |
| for (let i = 0; i < node.childCount; i++) { | |
| const child = node.child(i); | |
| if (!child) continue; | |
| if (child.type === 'call_expression') { | |
| const fn = child.childForFieldName('function'); | |
| return fn ? fn.text : child.text?.split('(')[0] || '?'; | |
| } | |
| if (child.type === 'identifier' || child.type === 'member_expression') { | |
| return child.text; | |
| } | |
| if (!keywords.has(child.type)) return truncate(child.text); | |
| } | |
| return truncate(node.text); | |
| } | |
| const SKIP_KEYWORDS = new Set(['throw', 'raise', 'await', 'new']); | |
| /** Extract text of the expression inside a throw/await, skipping the keyword. */ | |
| function extractChildExpressionText(node: TreeSitterNode): string | null { | |
| for (let i = 0; i < node.childCount; i++) { | |
| const child = node.child(i); | |
| if (!child) continue; | |
| if (!SKIP_KEYWORDS.has(child.type)) return truncate(child.text); | |
| } | |
| return truncate(node.text); | |
| } |
There was a problem hiding this comment.
Fixed in 63d1318 — hoisted the keyword set to a module-level CHILD_EXPR_SKIP_KEYWORDS constant. No more per-call Set allocation.
Codegraph Impact Analysis19 functions changed → 18 callers affected across 6 files
|
Avoid reallocating a Set on every extractChildExpressionText call — the contents are stateless and this function runs per throw/await node during AST-store extraction. Impact: 1 functions changed, 7 affected
PR #1016 added AST_TYPE_MAPS entries for 16 languages beyond js/ts/ python/rust/go/java but PARITY_FIXTURES only covered the original 6, leaving silent-divergence risk for languages with distinct string node types (encapsed_string, sigil, etc.). Adds minimal fixtures for csharp, ruby, php, c, cpp, kotlin, swift, scala, bash, elixir, lua, dart, zig, haskell, ocaml — each exercises a string literal plus at least one other kind from its AST_TYPE_MAP. ocaml-interface (.mli) already covered by reusing the ocaml map. Tests return early when a grammar is locally unavailable; CI has all grammars.
|
Addressed Greptile's outside-diff feedback on parity test coverage: (1) Inline Set allocation in (2) Parity test coverage gap for newly added languages — Fixed in 1eea986. Extended |
Resolves #1010.
Summary
Three independent divergences were causing the native engine to emit ~7,200 excess
stringAST nodes vs WASM on self-build:AST_TYPE_MAPSonly registeredjavascript/typescript/tsx; native emittedast_nodesfor 19 more languages viawalk_ast_nodes_with_config. Mirrored everyLangAstConfigfromcrates/codegraph-core/src/extractors/helpers.rsinto WASM (AST_TYPE_MAPS,AST_STRING_CONFIGS) and threaded a per-languagestopRecurseKindsset throughcreateAstStoreVisitorto replicate the native JS-only "don't recurse afternew/throw" quirk.awaitskipChildren(~500 rows) —ast-store-visitor.tsreturnedskipChildrenforawait_expression, so string/call children ofawait import('x')/await fn('y')were never walked. Nativejavascript.rsexplicitly recurses. Removedawaitfrom the filter.crates/codegraph-core/src/extractors/javascript.rsgated string emission oncontent.len() < 2(UTF-8 bytes). Single non-ASCII glyphs like─(3 bytes) slipped through. Changed tocontent.chars().count() < 2, matchinghelpers.rs::build_string_nodeand the WASM visitor (which now uses[...s].length).Measured parity
Self-build, 775 shared files (excluding files edited in this PR):
Comfortably inside the ≤1 % acceptance target. The residual +44 string delta is entirely the native UTF byte-length bug still present in the currently-installed 3.9.5 binary; the Rust fix in this PR reduces it to ~0 once CI rebuilds the native addon.
Test plan
tests/engines/ast-parity.test.ts— six new fixtures assert ≤1 row divergence for js, ts, python, rust, go, javatests/engines/— all 45 engine tests passtests/parsers/ast-*.test.ts— 43 tests passSTATUS_ACCESS_VIOLATIONduring cdylib link — known Windows issue,cargo checkpasses cleanly)Follow-ups (not in this PR)
The comparison also surfaced that WASM silently skips 29 files native processes (
.sh,.cpp,.cs,.kt,.hs) because the correspondingtree-sitter-*.wasmgrammars aren't shipped in the npm package. That's a separate grammar-availability issue — consider filing it.