
fix(parity): align WASM and native ast_nodes extraction#1016

Merged
carlos-alm merged 4 commits into main from fix/ast-nodes-parity-1010 on Apr 24, 2026

Conversation

@carlos-alm
Contributor

Resolves #1010.

Summary

Three independent divergences were causing the native engine to emit ~7,200 excess string AST nodes vs WASM on self-build:

  1. Language-coverage gap (~6,653 rows): WASM's AST_TYPE_MAPS only registered javascript/typescript/tsx; native emitted ast_nodes for 19 more languages via walk_ast_nodes_with_config. Mirrored every LangAstConfig from crates/codegraph-core/src/extractors/helpers.rs into WASM (AST_TYPE_MAPS, AST_STRING_CONFIGS) and threaded a per-language stopRecurseKinds set through createAstStoreVisitor to replicate the native JS-only "don't recurse after new/throw" quirk.
  2. WASM await skipChildren (~500 rows): ast-store-visitor.ts returned skipChildren for await_expression, so string/call children of await import('x') / await fn('y') were never walked. Native javascript.rs explicitly recurses. Removed await from the filter.
  3. UTF-8 byte-length check in native (~40 rows): crates/codegraph-core/src/extractors/javascript.rs gated string emission on content.len() < 2 (UTF-8 bytes). Single non-ASCII glyphs like ─ (3 bytes) slipped through. Changed to content.chars().count() < 2, matching helpers.rs::build_string_node and the WASM visitor (which now uses [...s].length).
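
The third divergence comes down to which "length" the filter measures. The following is a minimal TypeScript sketch, not the project's code: the helper names (`utf8ByteLength`, `codePointLength`, `isTooShort`) are hypothetical, and the "fewer than 2" threshold is the only part taken from the PR text.

```typescript
// Two ways to measure the "length" of a string's content.
// Both helper names are illustrative, not taken from the codebase.
const utf8ByteLength = (s: string): number => new TextEncoder().encode(s).length;
const codePointLength = (s: string): number => [...s].length;

// Hypothetical filter mirroring the "skip content shorter than 2" rule.
const isTooShort = (s: string, measure: (s: string) => number): boolean =>
  measure(s) < 2;

const glyph = "─"; // U+2500, encoded as 3 bytes in UTF-8

console.log(utf8ByteLength(glyph)); // 3
console.log(codePointLength(glyph)); // 1
console.log(isTooShort(glyph, utf8ByteLength)); // false: a byte count lets the glyph through
console.log(isTooShort(glyph, codePointLength)); // true: a code-point count filters it out
```

A filter based on byte count keeps the single glyph while one based on code-point count drops it, which is the kind of row the ~40-row excess consisted of.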

Measured parity

Self-build, 775 shared files (excluding files edited in this PR):

| kind | WASM | Native | Δ |
|---|---|---|---|
| new | 1097 | 1097 | 0 |
| throw | 230 | 230 | 0 |
| await | 620 | 620 | 0 |
| regex | 345 | 345 | 0 |
| string | 35313 | 35357 | +44 |
| total | 37605 | 37649 | +44 (0.12 %) |

Comfortably inside the ≤1 % acceptance target. The residual +44 string delta is entirely the native UTF-8 byte-length bug still present in the currently installed 3.9.5 binary; the Rust fix in this PR reduces it to ~0 once CI rebuilds the native addon.
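
For reference, the totals and the 0.12 % figure can be reproduced from the per-kind counts. This is only an arithmetic sketch: the `KindRow` type and `summarize` function are hypothetical names, and it assumes the percentage is taken relative to the WASM total.

```typescript
// Per-kind row counts copied from the measured-parity table above.
interface KindRow {
  kind: string;
  wasm: number;
  native: number;
}

const rows: KindRow[] = [
  { kind: "new", wasm: 1097, native: 1097 },
  { kind: "throw", wasm: 230, native: 230 },
  { kind: "await", wasm: 620, native: 620 },
  { kind: "regex", wasm: 345, native: 345 },
  { kind: "string", wasm: 35313, native: 35357 },
];

// Sum both engines and express the excess as a percentage of the WASM total
// (the choice of denominator is an assumption of this sketch).
function summarize(input: KindRow[]): { wasm: number; native: number; delta: number; pct: number } {
  let wasm = 0;
  let native = 0;
  for (const row of input) {
    wasm += row.wasm;
    native += row.native;
  }
  const delta = native - wasm;
  return { wasm, native, delta, pct: (delta / wasm) * 100 };
}

const summary = summarize(rows);
console.log(summary.wasm, summary.native, summary.delta); // 37605 37649 44
console.log(summary.pct.toFixed(2)); // "0.12"
```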

Test plan

  • tests/engines/ast-parity.test.ts — six new fixtures assert ≤1 row divergence for js, ts, python, rust, go, java
  • tests/engines/ — all 45 engine tests pass
  • tests/parsers/ast-*.test.ts — 43 tests pass
  • Local self-build on both engines — per-file kind counts verified
  • CI will rebuild native addon (local rebuild failed with rustc STATUS_ACCESS_VIOLATION during cdylib link — known Windows issue, cargo check passes cleanly)

Follow-ups (not in this PR)

The comparison also surfaced that WASM silently skips 29 files native processes (.sh, .cpp, .cs, .kt, .hs) because the corresponding tree-sitter-*.wasm grammars aren't shipped in the npm package. That's a separate grammar-availability issue — consider filing it.

Resolves #1010. Three independent divergences were causing the native
engine to emit ~7,200 excess `string` AST nodes vs WASM on self-build:

1. Language coverage gap (~6,653 rows) — WASM's AST_TYPE_MAPS registered
   only javascript/typescript/tsx; native emitted ast_nodes for 19 more
   languages via `walk_ast_nodes_with_config`. Mirrored every
   LangAstConfig from `helpers.rs` into WASM as `AST_TYPE_MAPS` +
   `AST_STRING_CONFIGS` entries and threaded a per-language
   `stopRecurseKinds` set through `createAstStoreVisitor`.

2. WASM `await` skipChildren (~500 rows) — the visitor returned
   skipChildren for `await_expression`, so string/call children of
   `await import('x')` / `await fn('y')` were never walked. Native's
   javascript.rs explicitly recurses. Removed `await` from the
   skipChildren filter.

3. UTF-8 byte-length check in native (~40 rows) —
   `crates/codegraph-core/src/extractors/javascript.rs` gated string
   emission on `content.len() < 2` (UTF-8 byte count). Any single non-
   ASCII glyph like `─` (3 bytes) was emitted. Changed to
   `content.chars().count() < 2` for parity with helpers.rs and JS
   `.length`. WASM's filter uses code-point count (`[...s].length`).
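
The effect of the second divergence (returning skipChildren for `await_expression`) can be illustrated with a toy walker. This is a simplified sketch under stated assumptions: `ToyNode`, `typeMap`, and `collectRows` are hypothetical stand-ins, not the real `createAstStoreVisitor`, and real tree-sitter nodes carry far more structure.

```typescript
// Hypothetical minimal node shape for illustration only.
interface ToyNode {
  type: string;
  text: string;
  children: ToyNode[];
}

interface Row {
  kind: string;
  text: string;
}

// Tiny stand-in for one language's entry in a node-type-to-kind map.
const typeMap = new Map<string, string>([
  ["await_expression", "await"],
  ["new_expression", "new"],
  ["string", "string"],
]);

// Emit a row for every mapped node; stop descending for kinds in stopRecurseKinds.
function collectRows(root: ToyNode, stopRecurseKinds: ReadonlySet<string>): Row[] {
  const out: Row[] = [];
  const visit = (node: ToyNode): void => {
    const kind = typeMap.get(node.type);
    if (kind !== undefined) {
      out.push({ kind, text: node.text });
      if (stopRecurseKinds.has(kind)) return; // equivalent of skipChildren
    }
    for (const child of node.children) visit(child);
  };
  visit(root);
  return out;
}

// Toy tree for: await import('x')
const tree: ToyNode = {
  type: "await_expression",
  text: "await import('x')",
  children: [
    {
      type: "call_expression",
      text: "import('x')",
      children: [{ type: "string", text: "'x'", children: [] }],
    },
  ],
};

const withAwaitSkipped = collectRows(tree, new Set(["new", "throw", "await"]));
const withAwaitWalked = collectRows(tree, new Set(["new", "throw"]));
console.log(withAwaitSkipped.map((r) => r.kind)); // ["await"]
console.log(withAwaitWalked.map((r) => r.kind)); // ["await", "string"]
```

With `await` in the stop set, the nested string row is never emitted; dropping it from the set yields the extra row, which matches the direction of the ~500-row gap described above.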

Measured parity after fix on 775 shared files (excluding files edited
in this PR): 37,605 (WASM) vs 37,649 (native) = 0.12 % delta. Every
kind except `string` is at 0 delta; the remaining 44-row string gap is
the UTF fix still waiting on the next native binary rebuild.

New parity test in tests/engines/ast-parity.test.ts asserts ≤1 row
divergence between engines for six languages (js, ts, python, rust,
go, java).
@claude

claude Bot commented Apr 24, 2026

Claude finished @carlos-alm's task —— View job


Reviewing PR: fix(parity): align WASM and native ast_nodes extraction

Todo List:

  • Read and analyze all changed files to understand implementation
  • Phase 1: Root cause analysis - verify genuine fixes vs symptom hiding
  • Phase 2: Configuration change detection - check for rule relaxation
  • Phase 3: Test integrity - verify test quality and assertions
  • Phase 4: Code quality - review tree-sitter extraction logic
  • Phase 5: Backlog compliance - rate against criteria
  • Provide final assessment with rating and recommendation

Root Cause Analysis

Problem Being Solved: Genuine parity issues where native Rust engine extracted ~7,200 excess AST nodes vs WASM across 3 specific divergences.
Why This Approach: Surgical fixes for actual extraction differences - language coverage gap (6,653 rows), WASM await skipChildren bug (500 rows), UTF-8 byte vs char length (40 rows).
Risk Assessment: LOW RISK - Precise fixes with comprehensive test coverage, no symptom hiding.

Backlog Compliance

  • Zero-dep: ✓ — No new runtime dependencies, builds on existing tree-sitter/SQLite infrastructure
  • Foundation-aligned: ✓ — Improves engine consistency without violating any core principles (P1-P8)
  • Problem-fit: 4/5 — Directly improves agent accuracy by ensuring consistent AST extraction between engines, reducing hallucinated dependencies
  • Breaking: No — Purely internal engine consistency fix, no API/CLI changes
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Critical Concerns

None identified. This is high-quality parity work that:

  • Addresses real inconsistencies - Not hiding symptoms but fixing actual engine divergences
  • Maintains test integrity - Adds stricter validation without weakening existing assertions
  • Follows documented patterns - Mirrors native LangAstConfig systematically across all languages
  • Preserves code quality - No configuration relaxation, maintains linting standards
  • Includes comprehensive testing - 6 language fixtures with ≤1 row tolerance validation

Code Quality Assessment

The implementation is exemplary:

  • Systematic approach: Complete mirroring of native LangAstConfig constants (AST_TYPE_MAPS, AST_STRING_CONFIGS, astStopRecurseKinds) in src/ast-analysis/rules/index.ts:76-259
  • Proper abstraction: Uses existing infrastructure rather than ad-hoc fixes
  • Thread-safe design: Updates both main thread and WASM worker contexts (src/domain/wasm-worker-entry.ts:590-597)
  • Precise bug fixes:
    • Removes await from WASM skipChildren set at ast-store-visitor.ts:196
    • Fixes UTF-8 byte vs char count at javascript.rs:559 and ast-store-visitor.ts:148

Final Recommendation

  • Rating: ⭐⭐⭐⭐⭐ (5/5)
  • Action: APPROVE
  • Reasoning: Exemplary engineering work that fixes genuine parity issues with comprehensive testing, systematic implementation, and no downsides. The 0.12% final divergence is well within the documented ≤1% acceptance criteria. This is exactly the kind of precision fix that improves engine reliability without breaking changes.

Branch: fix/ast-nodes-parity-1010

@greptile-apps
Contributor

greptile-apps Bot commented Apr 24, 2026

Greptile Summary

This PR fixes three independent WASM/native ast_nodes extraction divergences (~7,200 excess rows): it mirrors all 16 missing language configs into AST_TYPE_MAPS/AST_STRING_CONFIGS, removes the spurious await_expression skipChildren in the WASM visitor, and switches the native JS string filter from UTF-8 byte length to Unicode code-point count. The approach is well-reasoned and the measured residual delta (0.12 %) is well inside the acceptance target.

Confidence Score: 5/5

Safe to merge — all findings are P2 style/quality concerns with no impact on the production worker pipeline.

No P0 or P1 issues found. The two P2 items (missing extToLang fallback for a dead legacy code path, and a trivial Set allocation in a default parameter) do not affect the modern worker-isolated pipeline. The core parity fixes are correct and well-tested.

src/features/ast.ts (legacy _tree walk path missing extToLang fallback) and src/ast-analysis/visitors/ast-store-visitor.ts (stopRecurseKinds default parameter allocation).

Important Files Changed

| Filename | Overview |
|---|---|
| crates/codegraph-core/src/extractors/javascript.rs | Single-line fix: content.len() < 2 → content.chars().count() < 2 to count Unicode code points instead of UTF-8 bytes, closing the ~40-row native excess for non-ASCII single-glyph strings. |
| src/ast-analysis/rules/index.ts | Adds AST type maps and string configs for 16 new languages, mirroring native LangAstConfig; introduces astStopRecurseKinds to replicate JS/TS-only stop-recurse behaviour. Well-structured and clearly documented. |
| src/ast-analysis/visitors/ast-store-visitor.ts | Major refactor: replaces ad-hoc JS-only string stripping with a generic extractStringContent (mirroring Rust helpers.rs), adds IDENT_TYPES/CALL_TYPES sets, and threads stringConfig/stopRecurseKinds through the visitor. One minor allocation nit on the stopRecurseKinds default parameter. |
| src/features/ast.ts | Extends the WASM fallback walk to cover all languages in AST_TYPE_MAPS (not just extension-matched JS/TS); correctly threads stringConfig and stopRecurseKinds. Missing extToLang fallback for the legacy _tree path noted. |
| src/ast-analysis/engine.ts | Minor plumbing change: threads AST_STRING_CONFIGS and astStopRecurseKinds into setupAstVisitor, consistent with the same pattern used in wasm-worker-entry.ts and ast.ts. |
| src/domain/wasm-worker-entry.ts | Small update to setupVisitorsLocal: passes stringConfig and stopRecurseKinds to createAstStoreVisitor, consistent with the rest of the PR. |
| tests/engines/ast-parity.test.ts | Adds 16 per-language parity fixtures comparing WASM vs native row counts with ≤1 tolerance. Coverage gap for Ruby, PHP, Elixir, Bash etc. was flagged in a previous review comment. |
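
As a rough illustration of the generic string handling summarized for ast-store-visitor.ts, the sketch below strips language-specific prefixes and quote characters and then applies the code-point minimum. It is an assumption-heavy approximation: only the names `extractStringContent`, `trimLeadingChars`, `trimTrailingChars`, `quoteChars`, and `stringPrefixes` appear in this PR; their signatures, field types, and the example configs here are guesses, not the project's actual implementation.

```typescript
// Assumed shape: the field names come from the PR, the string types are a guess.
interface AstStringConfig {
  quoteChars: string; // characters treated as quotes
  stringPrefixes: string; // prefix letters that may precede the opening quote
}

// Drop every leading character that belongs to `chars`.
function trimLeadingChars(s: string, chars: string): string {
  let start = 0;
  while (start < s.length && chars.indexOf(s.charAt(start)) !== -1) start++;
  return s.slice(start);
}

// Drop every trailing character that belongs to `chars`.
function trimTrailingChars(s: string, chars: string): string {
  let end = s.length;
  while (end > 0 && chars.indexOf(s.charAt(end - 1)) !== -1) end--;
  return s.slice(0, end);
}

// Returns the unquoted content, or null when it has fewer than 2 code points.
function extractStringContent(raw: string, cfg: AstStringConfig): string | null {
  const withoutPrefix = trimLeadingChars(raw, cfg.stringPrefixes);
  const unquoted = trimTrailingChars(
    trimLeadingChars(withoutPrefix, cfg.quoteChars),
    cfg.quoteChars,
  );
  return [...unquoted].length < 2 ? null : unquoted;
}

// Example configs, invented for this sketch.
const jsLike: AstStringConfig = { quoteChars: "'\"`", stringPrefixes: "" };
const pythonLike: AstStringConfig = { quoteChars: "'\"", stringPrefixes: "rbfuRBFU" };

console.log(extractStringContent("'hello'", jsLike)); // "hello"
console.log(extractStringContent('rb"data"', pythonLike)); // "data"
console.log(extractStringContent("'─'", jsLike)); // null (one code point)
```

Keeping the per-language differences in a small config object, rather than in per-language branches, is what lets one visitor serve every language in the type map.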

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[File to analyse] --> B{astNodes pre-computed?}
    B -- yes, native/worker path --> C[bulkInsertAstNodes / JS insert]
    B -- no, legacy _tree path --> D{WALK_EXTENSIONS or AST_TYPE_MAPS has langId?}
    D -- no --> E[skip — return empty]
    D -- yes --> F[walkAst with langId]
    F --> G{AST_TYPE_MAPS.get langId}
    G -- miss, langId='' --> H[debug log, return empty]
    G -- hit --> I[createAstStoreVisitor\nastTypeMap + stringConfig\n+ stopRecurseKinds]
    I --> J{node.type in astTypeMap?}
    J -- no --> K[recurse children]
    J -- yes, kind=string --> L[extractStringContent\nstrip prefixes+quotes\ncount code points]
    L -- codePoints < 2 --> K
    L -- codePoints >= 2 --> M[emit string row]
    J -- yes, kind=new/throw JS only --> N[emit row + skipChildren\nstopRecurseKinds]
    J -- yes, kind=await/regex/etc --> O[emit row + recurse children]

Reviews (2): Last reviewed commit: "test: add ast_nodes parity fixtures for ..."

Comment on lines +120 to 128
function extractChildExpressionText(node: TreeSitterNode): string | null {
  const keywords = new Set(['throw', 'raise', 'await', 'new']);
  for (let i = 0; i < node.childCount; i++) {
    const child = node.child(i);
    if (!child) continue;
    if (child.type === 'call_expression') {
      const fn = child.childForFieldName('function');
      return fn ? fn.text : child.text?.split('(')[0] || '?';
    }
    if (child.type === 'identifier' || child.type === 'member_expression') {
      return child.text;
    }
    if (!keywords.has(child.type)) return truncate(child.text);
  }
  return truncate(node.text);
}
Contributor

P2 Inline new Set allocation on every call

extractChildExpressionText creates a brand-new Set on every invocation. For a large file with many throw/await nodes this is avoidable churn. The set is stateless, so it can be a module-level constant.

Suggested change
- function extractChildExpressionText(node: TreeSitterNode): string | null {
-   const keywords = new Set(['throw', 'raise', 'await', 'new']);
-   for (let i = 0; i < node.childCount; i++) {
-     const child = node.child(i);
-     if (!child) continue;
-     if (child.type === 'call_expression') {
-       const fn = child.childForFieldName('function');
-       return fn ? fn.text : child.text?.split('(')[0] || '?';
-     }
-     if (child.type === 'identifier' || child.type === 'member_expression') {
-       return child.text;
-     }
-     if (!keywords.has(child.type)) return truncate(child.text);
-   }
-   return truncate(node.text);
- }
+ const SKIP_KEYWORDS = new Set(['throw', 'raise', 'await', 'new']);
+ /** Extract text of the expression inside a throw/await, skipping the keyword. */
+ function extractChildExpressionText(node: TreeSitterNode): string | null {
+   for (let i = 0; i < node.childCount; i++) {
+     const child = node.child(i);
+     if (!child) continue;
+     if (!SKIP_KEYWORDS.has(child.type)) return truncate(child.text);
+   }
+   return truncate(node.text);
+ }

Contributor Author

Fixed in 63d1318 — hoisted the keyword set to a module-level CHILD_EXPR_SKIP_KEYWORDS constant. No more per-call Set allocation.

@github-actions
Contributor

github-actions Bot commented Apr 24, 2026

Codegraph Impact Analysis

19 functions changed → 18 callers affected across 6 files

  • walk_ast_nodes_depth in crates/codegraph-core/src/extractors/javascript.rs:499 (2 transitive callers)
  • setupAstVisitor in src/ast-analysis/engine.ts:450 (3 transitive callers)
  • AstStringConfig.quoteChars in src/ast-analysis/rules/index.ts:190 (5 transitive callers)
  • AstStringConfig.stringPrefixes in src/ast-analysis/rules/index.ts:191 (5 transitive callers)
  • astStopRecurseKinds in src/ast-analysis/rules/index.ts:254 (9 transitive callers)
  • truncate in src/ast-analysis/visitors/ast-store-visitor.ts:54 (10 transitive callers)
  • trimLeadingChars in src/ast-analysis/visitors/ast-store-visitor.ts:59 (5 transitive callers)
  • trimTrailingChars in src/ast-analysis/visitors/ast-store-visitor.ts:66 (5 transitive callers)
  • extractConstructorName in src/ast-analysis/visitors/ast-store-visitor.ts:74 (8 transitive callers)
  • extractCallName in src/ast-analysis/visitors/ast-store-visitor.ts:90 (6 transitive callers)
  • extractThrowName in src/ast-analysis/visitors/ast-store-visitor.ts:100 (7 transitive callers)
  • extractAwaitName in src/ast-analysis/visitors/ast-store-visitor.ts:113 (7 transitive callers)
  • extractChildExpressionText in src/ast-analysis/visitors/ast-store-visitor.ts:125 (7 transitive callers)
  • extractStringContent in src/ast-analysis/visitors/ast-store-visitor.ts:139 (7 transitive callers)
  • createAstStoreVisitor in src/ast-analysis/visitors/ast-store-visitor.ts:157 (9 transitive callers)
  • enterNode in src/ast-analysis/visitors/ast-store-visitor.ts:237 (0 transitive callers)
  • setupVisitorsLocal in src/domain/wasm-worker-entry.ts:569 (2 transitive callers)
  • collectFileAstRows in src/features/ast.ts:145 (3 transitive callers)
  • walkAst in src/features/ast.ts:226 (3 transitive callers)

Avoid reallocating a Set on every extractChildExpressionText call —
the contents are stateless and this function runs per throw/await
node during AST-store extraction.

Impact: 1 functions changed, 7 affected

PR #1016 added AST_TYPE_MAPS entries for 16 languages beyond js/ts/
python/rust/go/java but PARITY_FIXTURES only covered the original 6,
leaving silent-divergence risk for languages with distinct string
node types (encapsed_string, sigil, etc.). Adds minimal fixtures for
csharp, ruby, php, c, cpp, kotlin, swift, scala, bash, elixir, lua,
dart, zig, haskell, ocaml — each exercises a string literal plus at
least one other kind from its AST_TYPE_MAP. ocaml-interface (.mli)
already covered by reusing the ocaml map. Tests return early when a
grammar is locally unavailable; CI has all grammars.
@carlos-alm
Contributor Author

Addressed Greptile's outside-diff feedback on parity test coverage:

(1) Inline Set allocation in extractChildExpressionText — Fixed in 63d1318. Hoisted to a module-level CHILD_EXPR_SKIP_KEYWORDS constant (src/ast-analysis/visitors/ast-store-visitor.ts).

(2) Parity test coverage gap for newly added languages — Fixed in 1eea986. Extended PARITY_FIXTURES in tests/engines/ast-parity.test.ts with minimal snippets for csharp, ruby, php, c, cpp, kotlin, swift, scala, bash, elixir, lua, dart, zig, haskell, ocaml (15 new fixtures). Each exercises at least one string literal plus one other kind from its AST_TYPE_MAP (throw, new, await, regex/sigil). ocaml-interface shares the ocaml map so it's transitively covered. Fixtures that hit languages whose grammar isn't locally available return early (wasmCount === -1); CI has all grammars loaded so the full parity check runs there.
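
The per-fixture check described here (a tolerance of at most 1 row, with an early return when the grammar is unavailable) could look roughly like the sketch below. The `ParityResult` type and `checkParity` function are hypothetical; only the `wasmCount === -1` sentinel and the ≤1 tolerance come from the PR text, and the real test presumably uses its test framework's assertions instead of returning a status.

```typescript
// Hypothetical result type for this sketch.
type ParityResult = "skipped" | "ok" | "diverged";

// Compare row counts from both engines for one fixture.
// A wasmCount of -1 is treated as "grammar not available locally".
function checkParity(wasmCount: number, nativeCount: number, tolerance = 1): ParityResult {
  if (wasmCount === -1) return "skipped";
  return Math.abs(wasmCount - nativeCount) <= tolerance ? "ok" : "diverged";
}

console.log(checkParity(12, 12)); // "ok"
console.log(checkParity(12, 13)); // "ok" (within the 1-row tolerance)
console.log(checkParity(12, 15)); // "diverged"
console.log(checkParity(-1, 15)); // "skipped"
```

One consequence of the early return is that a locally missing grammar passes silently, so the full parity guarantee depends on CI having every grammar loaded, as noted above.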

@carlos-alm
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit a865127 into main Apr 24, 2026
20 checks passed
@carlos-alm carlos-alm deleted the fix/ast-nodes-parity-1010 branch April 24, 2026 21:57
@github-actions github-actions Bot locked and limited conversation to collaborators Apr 24, 2026

Development

Successfully merging this pull request may close these issues.

bug(parity): native engine emits ~7,200 excess string AST nodes vs WASM — 7.5% DB bloat in 3.9.5

1 participant