fix(parity): align WASM and native ast_nodes extraction by carlos-alm · Pull Request #1016 · optave/ops-codegraph-tool

carlos-alm · 2026-04-24T18:53:09Z

Resolves #1010.

Summary

Three independent divergences were causing the native engine to emit ~7,200 excess string AST nodes vs WASM on self-build:

1. Language-coverage gap (~6,653 rows): WASM's AST_TYPE_MAPS only registered javascript/typescript/tsx; native emitted ast_nodes for 19 more languages via walk_ast_nodes_with_config. Mirrored every LangAstConfig from crates/codegraph-core/src/extractors/helpers.rs into WASM (AST_TYPE_MAPS, AST_STRING_CONFIGS) and threaded a per-language stopRecurseKinds set through createAstStoreVisitor to replicate the native JS-only "don't recurse after new/throw" quirk.
2. WASM await skipChildren (~500 rows): ast-store-visitor.ts returned skipChildren for await_expression, so string/call children of await import('x') / await fn('y') were never walked. Native javascript.rs explicitly recurses. Removed await from the filter.
3. UTF-8 byte-length check in native (~40 rows): crates/codegraph-core/src/extractors/javascript.rs gated string emission on content.len() < 2 (UTF-8 bytes). Single non-ASCII glyphs like ─ (3 bytes) slipped through. Changed to content.chars().count() < 2, matching helpers.rs::build_string_node and the WASM visitor (which now uses [...s].length).
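The byte-length versus code-point distinction behind the UTF-8 item above can be seen in a few lines. This is a minimal TypeScript sketch, not code from the PR: `utf8ByteLength` and `codePointLength` are hypothetical helper names, and only the `< 2` threshold and the `─` example come from the description.

```typescript
// Hypothetical helpers illustrating the two length measures discussed above.
const encoder = new TextEncoder();

/** Number of UTF-8 bytes, analogous to Rust's `str::len()`. */
function utf8ByteLength(s: string): number {
  return encoder.encode(s).length;
}

/** Number of Unicode code points, analogous to Rust's `chars().count()`.
 *  `Array.from(s)` iterates by code point, the same as `[...s]`. */
function codePointLength(s: string): number {
  return Array.from(s).length;
}

const glyph = '─'; // U+2500 BOX DRAWINGS LIGHT HORIZONTAL

// A byte-based "too short" filter lets the single glyph through (3 bytes)...
const skippedByBytes = utf8ByteLength(glyph) < 2; // false
// ...while a code-point-based filter drops it like any other 1-character string.
const skippedByCodePoints = codePointLength(glyph) < 2; // true
```

Plain `s.length` in JavaScript counts UTF-16 code units, so it agrees with the code-point count for `─` but not for characters outside the Basic Multilingual Plane, which is presumably why the visitor spreads the string before measuring it.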

Measured parity

Self-build, 775 shared files (excluding files edited in this PR):

kind	WASM	Native	Δ
new	1097	1097	0
throw	230	230	0
await	620	620	0
regex	345	345	0
string	35313	35357	+44
total	37605	37649	+44 (0.12 %)

Comfortably inside the ≤1 % acceptance target. The residual +44 string delta is entirely the native UTF-8 byte-length bug still present in the currently-installed 3.9.5 binary; the Rust fix in this PR reduces it to ~0 once CI rebuilds the native addon.
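The totals and the 0.12 % figure can be re-derived from the table. The sketch below is only an arithmetic check; it assumes the percentage is taken relative to the WASM total (dividing by the native total rounds to the same value).

```typescript
// Row counts copied from the table above: [kind, WASM, native].
const rows: Array<[string, number, number]> = [
  ['new', 1097, 1097],
  ['throw', 230, 230],
  ['await', 620, 620],
  ['regex', 345, 345],
  ['string', 35313, 35357],
];

const wasmTotal = rows.reduce((sum, row) => sum + row[1], 0);
const nativeTotal = rows.reduce((sum, row) => sum + row[2], 0);
const delta = nativeTotal - wasmTotal;

// Relative delta against the WASM total, as a percentage (assumed denominator).
const deltaPercent = (delta / wasmTotal) * 100;

console.log(wasmTotal, nativeTotal, delta, deltaPercent.toFixed(2));
// prints: 37605 37649 44 0.12
```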

Test plan

tests/engines/ast-parity.test.ts — six new fixtures assert ≤1 row divergence for js, ts, python, rust, go, java
tests/engines/ — all 45 engine tests pass
tests/parsers/ast-*.test.ts — 43 tests pass
Local self-build on both engines — per-file kind counts verified
CI will rebuild native addon (local rebuild failed with rustc STATUS_ACCESS_VIOLATION during cdylib link — known Windows issue, cargo check passes cleanly)

Follow-ups (not in this PR)

The comparison also surfaced that WASM silently skips 29 files native processes (.sh, .cpp, .cs, .kt, .hs) because the corresponding tree-sitter-*.wasm grammars aren't shipped in the npm package. That's a separate grammar-availability issue — consider filing it.

claude · 2026-04-24T18:53:24Z

Claude finished @carlos-alm's task —— View job

Reviewing PR: fix(parity): align WASM and native ast_nodes extraction

Todo List:

Read and analyze all changed files to understand implementation
Phase 1: Root cause analysis - verify genuine fixes vs symptom hiding
Phase 2: Configuration change detection - check for rule relaxation
Phase 3: Test integrity - verify test quality and assertions
Phase 4: Code quality - review tree-sitter extraction logic
Phase 5: Backlog compliance - rate against criteria
Provide final assessment with rating and recommendation

Root Cause Analysis

Problem Being Solved: Genuine parity issues where native Rust engine extracted ~7,200 excess AST nodes vs WASM across 3 specific divergences.
Why This Approach: Surgical fixes for actual extraction differences - language coverage gap (6,653 rows), WASM await skipChildren bug (500 rows), UTF-8 byte vs char length (40 rows).
Risk Assessment: LOW RISK - Precise fixes with comprehensive test coverage, no symptom hiding.

Backlog Compliance

Zero-dep: ✓ — No new runtime dependencies, builds on existing tree-sitter/SQLite infrastructure
Foundation-aligned: ✓ — Improves engine consistency without violating any core principles (P1-P8)
Problem-fit: 4/5 — Directly improves agent accuracy by ensuring consistent AST extraction between engines, reducing hallucinated dependencies
Breaking: No — Purely internal engine consistency fix, no API/CLI changes
Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Critical Concerns

None identified. This is high-quality parity work that:

✅ Addresses real inconsistencies - Not hiding symptoms but fixing actual engine divergences
✅ Maintains test integrity - Adds stricter validation without weakening existing assertions
✅ Follows documented patterns - Mirrors native LangAstConfig systematically across all languages
✅ Preserves code quality - No configuration relaxation, maintains linting standards
✅ Includes comprehensive testing - 6 language fixtures with ≤1 row tolerance validation

Code Quality Assessment

The implementation is exemplary:

Systematic approach: Complete mirroring of native LangAstConfig constants (AST_TYPE_MAPS, AST_STRING_CONFIGS, astStopRecurseKinds) in src/ast-analysis/rules/index.ts:76-259
Proper abstraction: Uses existing infrastructure rather than ad-hoc fixes
Thread-safe design: Updates both main thread and WASM worker contexts (src/domain/wasm-worker-entry.ts:590-597)
Precise bug fixes:
- Removes await from WASM skipChildren set at ast-store-visitor.ts:196
- Fixes UTF-8 byte vs char count at javascript.rs:559 and ast-store-visitor.ts:148

Final Recommendation

Rating: ⭐⭐⭐⭐⭐ (5/5)
Action: APPROVE
Reasoning: Exemplary engineering work that fixes genuine parity issues with comprehensive testing, systematic implementation, and no downsides. The 0.12% final divergence is well within the documented ≤1% acceptance criteria. This is exactly the kind of precision fix that improves engine reliability without breaking changes.
Branch: fix/ast-nodes-parity-1010

greptile-apps · 2026-04-24T18:59:09Z

Greptile Summary

This PR fixes three independent WASM/native ast_nodes extraction divergences (~7,200 excess rows): it mirrors all 16 missing language configs into AST_TYPE_MAPS/AST_STRING_CONFIGS, removes the spurious await_expression skipChildren in the WASM visitor, and switches the native JS string filter from UTF-8 byte length to Unicode code-point count. The approach is well-reasoned and the measured residual delta (0.12 %) is well inside the acceptance target.

Confidence Score: 5/5

Safe to merge — all findings are P2 style/quality concerns with no impact on the production worker pipeline.

No P0 or P1 issues found. The two P2 items (missing extToLang fallback for a dead legacy code path, and a trivial Set allocation in a default parameter) do not affect the modern worker-isolated pipeline. The core parity fixes are correct and well-tested.

The two P2 items are in src/features/ast.ts (legacy _tree walk path missing extToLang fallback) and src/ast-analysis/visitors/ast-store-visitor.ts (stopRecurseKinds default parameter allocation).

Important Files Changed

Filename	Overview
crates/codegraph-core/src/extractors/javascript.rs	Single-line fix: `content.len() < 2` → `content.chars().count() < 2` to count Unicode code points instead of UTF-8 bytes, closing the ~40-row native excess for non-ASCII single-glyph strings.
src/ast-analysis/rules/index.ts	Adds AST type maps and string configs for 16 new languages, mirroring native LangAstConfig; introduces `astStopRecurseKinds` to replicate JS/TS-only stop-recurse behaviour. Well-structured and clearly documented.
src/ast-analysis/visitors/ast-store-visitor.ts	Major refactor: replaces ad-hoc JS-only string stripping with a generic `extractStringContent` (mirroring Rust helpers.rs), adds `IDENT_TYPES`/`CALL_TYPES` sets, and threads `stringConfig`/`stopRecurseKinds` through the visitor. One minor allocation nit on the `stopRecurseKinds` default parameter.
src/features/ast.ts	Extends the WASM fallback walk to cover all languages in AST_TYPE_MAPS (not just extension-matched JS/TS); correctly threads stringConfig and stopRecurseKinds. Missing extToLang fallback for the legacy _tree path noted.
src/ast-analysis/engine.ts	Minor plumbing change: threads AST_STRING_CONFIGS and astStopRecurseKinds into setupAstVisitor, consistent with the same pattern used in wasm-worker-entry.ts and ast.ts.
src/domain/wasm-worker-entry.ts	Small update to setupVisitorsLocal: passes stringConfig and stopRecurseKinds to createAstStoreVisitor, consistent with the rest of the PR.
tests/engines/ast-parity.test.ts	Adds 16 per-language parity fixtures comparing WASM vs native row counts with ≤1 tolerance. Coverage gap for Ruby, PHP, Elixir, Bash etc. was flagged in a previous review comment.
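To make the "generic `extractStringContent`" row more concrete, here is a rough sketch of how a config-driven extractor of that kind could work. It is not the PR's implementation: the field names `quoteChars` and `stringPrefixes` and the helper names `trimLeadingChars` / `trimTrailingChars` appear elsewhere in this thread, but their types, the exact stripping order, and the `pythonLike` config are assumptions made for illustration.

```typescript
// Assumed shape: only the two field names come from the PR discussion.
interface AstStringConfig {
  quoteChars: string;     // characters treated as quote delimiters
  stringPrefixes: string; // prefix letters that may precede the opening quote
}

function trimLeadingChars(s: string, chars: string): string {
  let i = 0;
  while (i < s.length && chars.includes(s.charAt(i))) i++;
  return s.slice(i);
}

function trimTrailingChars(s: string, chars: string): string {
  let end = s.length;
  while (end > 0 && chars.includes(s.charAt(end - 1))) end--;
  return s.slice(0, end);
}

/** Returns the unquoted content, or null when it has fewer than 2 code points. */
function extractStringContent(raw: string, cfg: AstStringConfig): string | null {
  const withoutPrefix = trimLeadingChars(raw, cfg.stringPrefixes);
  const content = trimTrailingChars(
    trimLeadingChars(withoutPrefix, cfg.quoteChars),
    cfg.quoteChars,
  );
  // Count code points, not UTF-16 units or UTF-8 bytes.
  return Array.from(content).length < 2 ? null : content;
}

// Hypothetical config for a Python-like language.
const pythonLike: AstStringConfig = { quoteChars: `'"`, stringPrefixes: 'rbfuRBFU' };

const kept = extractStringContent(`f"hello"`, pythonLike); // "hello"
const dropped = extractStringContent(`"─"`, pythonLike);   // null (1 code point)
```

Keeping the per-language differences in data rather than in branches is what would let one visitor serve every language in AST_TYPE_MAPS.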

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[File to analyse] --> B{astNodes pre-computed?}
    B -- yes, native/worker path --> C[bulkInsertAstNodes / JS insert]
    B -- no, legacy _tree path --> D{WALK_EXTENSIONS or AST_TYPE_MAPS has langId?}
    D -- no --> E[skip — return empty]
    D -- yes --> F[walkAst with langId]
    F --> G{AST_TYPE_MAPS.get langId}
    G -- miss, langId='' --> H[debug log, return empty]
    G -- hit --> I[createAstStoreVisitor\nastTypeMap + stringConfig\n+ stopRecurseKinds]
    I --> J{node.type in astTypeMap?}
    J -- no --> K[recurse children]
    J -- yes, kind=string --> L[extractStringContent\nstrip prefixes+quotes\ncount code points]
    L -- codePoints < 2 --> K
    L -- codePoints >= 2 --> M[emit string row]
    J -- yes, kind=new/throw JS only --> N[emit row + skipChildren\nstopRecurseKinds]
    J -- yes, kind=await/regex/etc --> O[emit row + recurse children]

Reviews (2): Last reviewed commit: "test: add ast_nodes parity fixtures for ..."
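The flowchart branch where new/throw rows stop recursion while await rows keep descending can be reproduced with a toy walker. The sketch below uses a simplified, hypothetical `Node` shape rather than the real tree-sitter API, and it omits the string-length filter; it only illustrates why listing `await` among the stop-recurse kinds hides the string inside `await import('x')`.

```typescript
// Simplified stand-in for a syntax node (not the tree-sitter API).
interface Node { type: string; text: string; children: Node[]; }
interface Row { kind: string; text: string; }

function collectRows(
  root: Node,
  typeMap: Map<string, string>,          // node type -> row kind
  stopRecurseKinds: ReadonlySet<string>, // kinds whose children are skipped
): Row[] {
  const rows: Row[] = [];
  const walk = (node: Node): void => {
    const kind = typeMap.get(node.type);
    if (kind !== undefined) {
      rows.push({ kind, text: node.text });
      if (stopRecurseKinds.has(kind)) return; // the "skipChildren" case
    }
    for (const child of node.children) walk(child);
  };
  walk(root);
  return rows;
}

// Toy tree for: await import('x')
const tree: Node = {
  type: 'await_expression',
  text: "await import('x')",
  children: [{
    type: 'call_expression',
    text: "import('x')",
    children: [{ type: 'string', text: "'x'", children: [] }],
  }],
};

const typeMap = new Map([['await_expression', 'await'], ['string', 'string']]);

// Old WASM behaviour: await stops recursion, so the string child is never seen.
const before = collectRows(tree, typeMap, new Set(['new', 'throw', 'await']));
// After the fix: only new/throw stop recursion.
const after = collectRows(tree, typeMap, new Set(['new', 'throw']));
```

Here `before` contains only the await row, while `after` contains the await row followed by the string row.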

greptile-apps · 2026-04-24T18:59:13Z

+function extractChildExpressionText(node: TreeSitterNode): string | null {
+  const keywords = new Set(['throw', 'raise', 'await', 'new']);
  for (let i = 0; i < node.childCount; i++) {
    const child = node.child(i);
    if (!child) continue;
-    if (child.type === 'call_expression') {
-      const fn = child.childForFieldName('function');
-      return fn ? fn.text : child.text?.split('(')[0] || '?';
-    }
-    if (child.type === 'identifier' || child.type === 'member_expression') {
-      return child.text;
-    }
+    if (!keywords.has(child.type)) return truncate(child.text);
  }
  return truncate(node.text);
 }


Inline new Set allocation on every call

extractChildExpressionText creates a brand-new Set on every invocation. For a large file with many throw/await nodes this is avoidable churn. The set is stateless, so it can be a module-level constant.

Suggested change

-function extractChildExpressionText(node: TreeSitterNode): string | null {
-  const keywords = new Set(['throw', 'raise', 'await', 'new']);
-  for (let i = 0; i < node.childCount; i++) {
-    const child = node.child(i);
-    if (!child) continue;
-    if (child.type === 'call_expression') {
-      const fn = child.childForFieldName('function');
-      return fn ? fn.text : child.text?.split('(')[0] || '?';
-    }
-    if (child.type === 'identifier' || child.type === 'member_expression') {
-      return child.text;
-    }
-    if (!keywords.has(child.type)) return truncate(child.text);
-  }
-  return truncate(node.text);
-}
+const SKIP_KEYWORDS = new Set(['throw', 'raise', 'await', 'new']);
+/** Extract text of the expression inside a throw/await, skipping the keyword. */
+function extractChildExpressionText(node: TreeSitterNode): string | null {
+  for (let i = 0; i < node.childCount; i++) {
+    const child = node.child(i);
+    if (!child) continue;
+    if (!SKIP_KEYWORDS.has(child.type)) return truncate(child.text);
+  }
+  return truncate(node.text);
+}

Fixed in 63d1318 — hoisted the keyword set to a module-level CHILD_EXPR_SKIP_KEYWORDS constant. No more per-call Set allocation.

github-actions · 2026-04-24T19:00:47Z

Codegraph Impact Analysis

19 functions changed → 18 callers affected across 6 files

walk_ast_nodes_depth in crates/codegraph-core/src/extractors/javascript.rs:499 (2 transitive callers)
setupAstVisitor in src/ast-analysis/engine.ts:450 (3 transitive callers)
AstStringConfig.quoteChars in src/ast-analysis/rules/index.ts:190 (5 transitive callers)
AstStringConfig.stringPrefixes in src/ast-analysis/rules/index.ts:191 (5 transitive callers)
astStopRecurseKinds in src/ast-analysis/rules/index.ts:254 (9 transitive callers)
truncate in src/ast-analysis/visitors/ast-store-visitor.ts:54 (10 transitive callers)
trimLeadingChars in src/ast-analysis/visitors/ast-store-visitor.ts:59 (5 transitive callers)
trimTrailingChars in src/ast-analysis/visitors/ast-store-visitor.ts:66 (5 transitive callers)
extractConstructorName in src/ast-analysis/visitors/ast-store-visitor.ts:74 (8 transitive callers)
extractCallName in src/ast-analysis/visitors/ast-store-visitor.ts:90 (6 transitive callers)
extractThrowName in src/ast-analysis/visitors/ast-store-visitor.ts:100 (7 transitive callers)
extractAwaitName in src/ast-analysis/visitors/ast-store-visitor.ts:113 (7 transitive callers)
extractChildExpressionText in src/ast-analysis/visitors/ast-store-visitor.ts:125 (7 transitive callers)
extractStringContent in src/ast-analysis/visitors/ast-store-visitor.ts:139 (7 transitive callers)
createAstStoreVisitor in src/ast-analysis/visitors/ast-store-visitor.ts:157 (9 transitive callers)
enterNode in src/ast-analysis/visitors/ast-store-visitor.ts:237 (0 transitive callers)
setupVisitorsLocal in src/domain/wasm-worker-entry.ts:569 (2 transitive callers)
collectFileAstRows in src/features/ast.ts:145 (3 transitive callers)
walkAst in src/features/ast.ts:226 (3 transitive callers)

PR #1016 added AST_TYPE_MAPS entries for 16 languages beyond js/ts/python/rust/go/java, but PARITY_FIXTURES only covered the original 6, leaving a risk of silent divergence for languages with distinct string node types (encapsed_string, sigil, etc.). Adds minimal fixtures for csharp, ruby, php, c, cpp, kotlin, swift, scala, bash, elixir, lua, dart, zig, haskell, ocaml; each exercises a string literal plus at least one other kind from its AST_TYPE_MAP. ocaml-interface (.mli) is already covered because it reuses the ocaml map. Tests return early when a grammar is locally unavailable; CI has all grammars.

carlos-alm · 2026-04-24T21:33:50Z

Addressed Greptile's outside-diff feedback on parity test coverage:

(1) Inline Set allocation in extractChildExpressionText — Fixed in 63d1318. Hoisted to a module-level CHILD_EXPR_SKIP_KEYWORDS constant (src/ast-analysis/visitors/ast-store-visitor.ts).

(2) Parity test coverage gap for newly added languages — Fixed in 1eea986. Extended PARITY_FIXTURES in tests/engines/ast-parity.test.ts with minimal snippets for csharp, ruby, php, c, cpp, kotlin, swift, scala, bash, elixir, lua, dart, zig, haskell, ocaml (15 new fixtures). Each exercises at least one string literal plus one other kind from its AST_TYPE_MAP (throw, new, await, regex/sigil). ocaml-interface shares the ocaml map so it's transitively covered. Fixtures that hit languages whose grammar isn't locally available return early (wasmCount === -1); CI has all grammars loaded so the full parity check runs there.
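The fixture behaviour described in (2), a ≤1 row tolerance plus an early return when the WASM grammar is missing, could look roughly like the following. This is a hedged sketch, not the contents of ast-parity.test.ts: `checkParity`, `CountFn`, and `ParityResult` are hypothetical names, and only the `wasmCount === -1` sentinel and the tolerance of 1 come from the thread.

```typescript
// Hypothetical counter: returns the number of ast_nodes rows for a snippet,
// or -1 when the grammar for that language is not available locally.
type CountFn = (source: string) => number;

interface ParityResult { skipped: boolean; ok: boolean; diff: number; }

function checkParity(
  source: string,
  countWasm: CountFn,
  countNative: CountFn,
  tolerance = 1,
): ParityResult {
  const wasmCount = countWasm(source);
  // Grammar not shipped locally: skip instead of failing; CI runs the full check.
  if (wasmCount === -1) return { skipped: true, ok: true, diff: 0 };
  const diff = Math.abs(countNative(source) - wasmCount);
  return { skipped: false, ok: diff <= tolerance, diff };
}

// Stub counters standing in for the two engines.
const withinTolerance = checkParity('puts "hello"', () => 3, () => 4);
const diverged = checkParity('puts "hello"', () => 3, () => 6);
const noGrammar = checkParity('puts "hello"', () => -1, () => 4);
```

One trade-off of the early return is that a locally skipped fixture reports success, so the guarantee only holds where every grammar is loaded, which the comment above says is the case in CI.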

carlos-alm · 2026-04-24T21:34:36Z

@greptileai

greptile-apps Bot reviewed Apr 24, 2026

carlos-alm added 2 commits April 24, 2026 15:32

perf: hoist SKIP_KEYWORDS set to module scope (#1016)

63d1318

Avoid reallocating a Set on every extractChildExpressionText call — the contents are stateless and this function runs per throw/await node during AST-store extraction. Impact: 1 functions changed, 7 affected

Merge branch 'main' into fix/ast-nodes-parity-1010

ac4e357

carlos-alm merged commit a865127 into main Apr 24, 2026
20 checks passed

carlos-alm deleted the fix/ast-nodes-parity-1010 branch April 24, 2026 21:57

github-actions Bot locked and limited conversation to collaborators Apr 24, 2026

carlos-alm merged 4 commits into main from fix/ast-nodes-parity-1010
