fix(native): backfill silently-dropped files via WASM for engine parity#970

Merged
carlos-alm merged 4 commits into main from fix/native-engine-parity-backfill
Apr 20, 2026

Conversation

@carlos-alm
Contributor

Summary

Fixes #967. The native engine's parse_files_parallel (Rust) silently drops any file whose read/parse/extract step fails — including files in languages where the installed native addon lacks an extractor (e.g. HCL/Scala/Swift on the currently-published v3.9.2 binary). WASM handles those languages fine, so benchmarks on this repo showed 668 native vs 728 WASM file nodes on the same tree.

The fix closes the parity gap in two places in the JS layer:

  • parseFilesAuto (src/domain/parser.ts) — tracks which files the native parser actually returned; re-parses the rest with WASM so the JS pipeline gets complete coverage.
  • tryNativeOrchestrator (src/domain/graph/builder/pipeline.ts) — after the native Rust orchestrator writes to the DB, diffs collectFiles() against the kind='file' rows and backfills any missing files via a WASM pass + batchInsertNodes.

Both paths emit a warn() when a drop happens so the underlying Rust regression stays visible rather than being silently masked.
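The parseFilesAuto side of the fix boils down to a set difference followed by a second parse. The following is a minimal sketch of that idea, not the code in src/domain/parser.ts: `findDroppedFiles`, `parseWithBackfill`, and the injected `parseNative`/`parseWasm`/`warn` callbacks are hypothetical names chosen for illustration.

```typescript
// Sketch only: all names here are hypothetical stand-ins, not the PR's code.
type Symbols = unknown;

// Files that were requested but are absent from the native result.
function findDroppedFiles(
  requested: readonly string[],
  nativeResult: ReadonlyMap<string, Symbols>,
): string[] {
  return requested.filter((file) => !nativeResult.has(file));
}

// Parse natively, then re-parse whatever native dropped with the WASM parser.
async function parseWithBackfill(
  requested: readonly string[],
  parseNative: (files: readonly string[]) => Promise<Map<string, Symbols>>,
  parseWasm: (files: readonly string[]) => Promise<Map<string, Symbols>>,
  warn: (msg: string) => void,
): Promise<Map<string, Symbols>> {
  const result = new Map(await parseNative(requested));
  const dropped = findDroppedFiles(requested, result);
  if (dropped.length === 0) return result;

  // Keep the drop visible instead of masking it.
  warn(`native parser dropped ${dropped.length} file(s); backfilling via WASM`);
  for (const [file, symbols] of await parseWasm(dropped)) {
    // Never overwrite a result that native did return.
    if (!result.has(file)) result.set(file, symbols);
  }
  return result;
}
```

Injecting the two parsers as callbacks keeps the sketch independent of either engine; the real function calls the engines directly.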

Local verification

On this repo (ts=484, rs=57, sh=18, … hcl=4, scala=4, swift=4):

| Run | kind='file' rows in DB |
| --- | --- |
| Before fix (native) | 668 |
| After fix (native) | 680 |
| WASM baseline | 680 (same 48 files lack local WASM grammars on both sides) |

The backfill correctly adds the 12 HCL/Scala/Swift files. The remaining 48 (clojure/cuda/erlang/fsharp/gleam/groovy/julia/objc/r/solidity/verilog) have no WASM grammar installed locally, so both engines skip them identically — true parity.

Complementary work

This fix protects users on older native addons. A separate Rust-side change should still harden parallel.rs::parse_files_parallel to not silently swallow extract failures — that belongs in a follow-up.
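As an illustration of what "not silently swallowing" failures could look like, here is a TypeScript analogue of the pattern. It is not the Rust code in parallel.rs, and `parseAllReportingFailures` and `parseOne` are hypothetical names. The point is only that failures are collected alongside successes rather than filtered out.

```typescript
// TypeScript analogue only; the actual follow-up would be a Rust change.
interface ParseReport<T> {
  parsed: Map<string, T>;
  failed: Map<string, string>; // file -> error message
}

function parseAllReportingFailures<T>(
  files: readonly string[],
  parseOne: (file: string) => T, // hypothetical per-file parser that may throw
): ParseReport<T> {
  const parsed = new Map<string, T>();
  const failed = new Map<string, string>();
  for (const file of files) {
    try {
      parsed.set(file, parseOne(file));
    } catch (err) {
      // Record the failure instead of dropping the file without a trace.
      failed.set(file, err instanceof Error ? err.message : String(err));
    }
  }
  return { parsed, failed };
}
```

A caller can then decide whether to warn, retry with another engine, or fail the build, because the list of failed files is no longer lost.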

Test plan

  • Local: node ./dist/cli.js build . --engine native → DB file count increases from 668 → 680, with all 4×3 HCL/Scala/Swift rows present.
  • Local: parseFilesAuto-using tests pass (tests/parsers/unified.test.ts, cfg-all-langs.test.ts, ast-all-langs.test.ts, ast-nodes.test.ts).
  • CI: benchmark job confirms files.total matches between native and WASM.

The native engine's `parse_files_parallel` (Rust) uses `filter_map` with the
`?` operator, which silently drops any file whose read/parse/extract step
fails — including files in languages where the installed native addon lacks
an extractor (e.g. HCL/Scala/Swift on v3.9.2). WASM handles these languages
correctly, so benchmarks showed 668 native vs 728 WASM file nodes on the
same tree (#967).

Fix the parity gap in two places:

- `parseFilesAuto`: track which files the native parser actually returned;
  re-parse the rest with WASM so the JS pipeline gets complete coverage.
- `tryNativeOrchestrator`: after the native Rust orchestrator writes to the
  DB, diff `collectFiles()` against the `kind='file'` rows and backfill any
  missing files via a WASM pass + `batchInsertNodes`.

Both paths warn explicitly when a drop happens so the underlying Rust
regression stays visible. This complements the Rust-side fix (adding
HCL/Scala/Swift extractors in newer binaries) by protecting users who
still have older native addons installed.
@claude

claude bot commented Apr 19, 2026

Claude finished @carlos-alm's task —— View job


Root Cause Analysis

Problem Being Solved: This PR fixes a real bug where the native Rust engine silently drops files during parsing when their language extractor/grammar is missing or fails (e.g., HCL/Scala/Swift on older native binaries v3.9.2). This caused a parity gap where native produced 668 file nodes vs WASM's 728 on the same codebase.

Why This Approach: The fix correctly addresses the root cause in two strategic places:

  1. parseFilesAuto - tracks which files native actually returned and backfills missing ones via WASM
  2. tryNativeOrchestrator - diffs expected vs actual file nodes in DB after native orchestrator and backfills gaps

Both paths emit explicit warnings to keep the underlying Rust issue visible rather than silently masking it.

Risk Assessment: ✅ LOW RISK - This is a proper fix, not a workaround. It protects users on older binaries while maintaining transparency about the underlying issue.

Backlog Compliance

  • Zero-dep: ✓ - No new runtime dependencies added, builds on existing tree-sitter/SQLite/WASM infrastructure
  • Foundation-aligned: ✓ - Fully complies with all FOUNDATION.md principles, improves accuracy without violating core tenets
  • Problem-fit: 5/5 - Directly addresses the core problem of AI agents getting inconsistent dependency graphs between engines, preventing hallucinated dependencies
  • Breaking: No - Purely additive fix that improves accuracy without changing APIs or behavior
  • Tier: 1 (zero-dep ✓ + foundation-aligned ✓)

Code Quality Assessment

Strengths:

  • Proper error handling: Both backfill paths include appropriate warnings that maintain visibility into the underlying Rust issue
  • Clean implementation: The fix is surgical and doesn't introduce architectural complexity
  • Engine parity focus: Correctly addresses the core requirement that both engines must produce identical results
  • Test coverage: Existing parseFilesAuto tests in unified.test.ts will exercise the new backfill logic
  • Performance considerations: Backfill only runs when files are actually dropped, minimal overhead on normal operation

Technical Details:

  • Line 907-920 (parser.ts): Clean detection of dropped files by comparing nativeParsed set against expected files with supported extensions
  • Line 714-763 (pipeline.ts): Proper DB connection management - switches from NativeDbProxy to real better-sqlite3 for INSERT operations
  • Line 743: Forces WASM engine explicitly to avoid recursion back to native
  • Line 745-762: Correctly inserts both file and symbol nodes to maintain DB consistency

Security & Performance:

  • No security concerns - operates on local file system only
  • Minimal performance impact - only runs when files are dropped
  • Proper WASM tree cleanup handled by existing parseFilesWasm

Critical Concerns

None. This is a well-implemented bug fix that properly addresses the root cause while maintaining system transparency.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: This is a high-quality fix for a real parity bug. It's correctly implemented, doesn't mask symptoms, maintains transparency via warnings, and directly improves the accuracy of dependency graph generation. The approach is surgical and respects the existing architecture. Only reason it's not 5 stars is that it's a defensive fix for an upstream Rust issue rather than an innovative feature.

The fix protects users on older native binaries while keeping the underlying Rust parsing issue visible through explicit warnings. This is exactly how engine parity gaps should be handled.

@greptile-apps
Contributor

greptile-apps bot commented Apr 19, 2026

Greptile Summary

This PR closes a native-engine parity gap (#967) where parse_files_parallel in Rust silently dropped files for languages whose extractor was missing (HCL, Scala, Swift on v3.9.2), causing native builds to produce fewer kind='file' DB rows than WASM. The fix adds two WASM backfill paths — one in parseFilesAuto for per-file incremental parity, and one in backfillNativeDroppedFiles for full-build orchestrator parity — both scoped to extensions with an installed WASM grammar to avoid false warnings for languages neither engine supports. All previous review concerns (warn count inflation, null qualified_name, incremental-build overhead, missing exported=1 update) have been addressed in this iteration.

Confidence Score: 5/5

Safe to merge — all prior P0/P1 findings are resolved and no new critical issues are present.

All four previous review concerns (warn dilution, null qualified_name/parent_id, unconditional filesystem scan, missing exports pass) have been fixed in this revision. The remaining implementation is logically sound: nativeParsed uses consistent absolute paths, INSERT OR IGNORE is safe against races, the chunked UPDATE mirrors the reference implementation, and the module-level grammar cache is correctly scoped to the immutable shipped grammars directory. No P1 or P0 findings remain.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/domain/parser.ts | Adds getInstalledWasmExtensions() (module-cached set of installed WASM grammar extensions) and engine-parity fallback in parseFilesAuto that re-parses native-dropped files via WASM; logic is correct and path comparison uses absolute paths consistently. |
| src/domain/graph/builder/pipeline.ts | Adds backfillNativeDroppedFiles (full-build only) that walks the filesystem, diffs against DB file rows, WASM-parses missing files, then inserts file/definition/export rows with qualified_name, scope, and a second-pass exported=1 UPDATE; all previous feedback has been addressed. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[tryNativeOrchestrator] --> B{result.isFullBuild?}
    B -- No --> Z[closeDbPair / return]
    B -- Yes --> C[backfillNativeDroppedFiles]

    C --> D{ctx.nativeFirstProxy?}
    D -- Yes --> E[closeNativeDb → openDb as better-sqlite3]
    D -- No --> F
    E --> F[collectFilesUtil → expected set]
    F --> G[SELECT DISTINCT file FROM nodes WHERE kind='file' → existing set]
    G --> H[getInstalledWasmExtensions → installedExts]
    H --> I{missingAbs = expected − existing filtered by installedExts}
    I -- empty --> Z
    I -- non-empty --> J[warn: dropped N files]
    J --> K[parseFilesAuto missingAbs engine=wasm]
    K --> L[Build rows: file row + def rows + export rows]
    L --> M[batchInsertNodes INSERT OR IGNORE]
    M --> N{exportKeys non-empty?}
    N -- Yes --> O[Chunked UPDATE nodes SET exported=1 WHERE name/kind/file/line]
    N -- No --> Z
    O --> Z

    subgraph parseFilesAuto [parseFilesAuto parity path]
        P[native.parseFiles / parseFilesFull] --> Q[Track nativeParsed set]
        Q --> R{dropped = filePaths − nativeParsed filtered by installedExts}
        R -- non-empty --> S[warn + parseFilesWasm dropped files]
        S --> T[Merge WASM results into result map]
        R -- empty --> T2[return result]
        T --> T2
    end

Reviews (3): Last reviewed commit: "fix(native): backfill exported symbols i..." | Re-trigger Greptile

Comment on lines +740 to +742
warn(
  `Native orchestrator dropped ${missingAbs.length} file(s); backfilling via WASM for engine parity`,
);

P2 Warn count inflates to include un-parseable languages

collectFilesUtil uses EXTENSIONS (= SUPPORTED_EXTENSIONS), which includes every language in the registry — even those whose WASM grammar is not installed (groovy, erlang, clojure, etc.). When neither the native addon nor WASM can parse a file, it never appears in the DB, so it always ends up in missingAbs. The warning message blames "native orchestrator dropped N file(s)" for all of them, even though N includes files that no engine can handle. On a repo with many unsupported-language files, the log signal for the real HCL/Scala/Swift regression gets diluted.

Consider filtering missingAbs to extensions with an installed WASM grammar before emitting the warning, similar to how parser.ts uses _extToLang.has(...) as a filter gate:

// Only warn about and backfill files that WASM can actually handle.
const recoverable = missingAbs.filter(
  (f) => _extToLang.has(path.extname(f).toLowerCase()),
);
if (recoverable.length === 0) return;
warn(`Native orchestrator dropped ${recoverable.length} file(s); backfilling via WASM for engine parity`);
const wasmResults = await parseFilesAuto(recoverable, ctx.rootDir, { engine: 'wasm' });

Contributor Author


Fixed — now filters missingAbs to files with installed WASM grammars via a new getInstalledWasmExtensions() helper in parser.ts. Same helper also applied to the parseFilesAuto warn in parser.ts, so both sites stay in sync. Languages without an installed grammar (groovy, erlang, etc.) no longer inflate the native-drop warn.
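The filtering described in this reply can be sketched as follows. This is an assumption-laden illustration rather than the merged code: `extOf` and `filterRecoverable` are hypothetical helpers, and the `installedExts` set stands in for whatever getInstalledWasmExtensions() returns.

```typescript
// Sketch only: helper names are hypothetical; the real helper lives in parser.ts.

// Lower-cased extension including the dot, or '' when there is none.
// Dotfiles such as ".gitignore" count as having no extension.
function extOf(file: string): string {
  const sep = Math.max(file.lastIndexOf("/"), file.lastIndexOf("\\"));
  const base = file.slice(sep + 1);
  const dot = base.lastIndexOf(".");
  return dot > 0 ? base.slice(dot).toLowerCase() : "";
}

// Keep only the missing files that an installed WASM grammar can parse,
// so the warning counts real native drops and nothing else.
function filterRecoverable(
  missing: readonly string[],
  installedExts: ReadonlySet<string>,
): string[] {
  return missing.filter((file) => installedExts.has(extOf(file)));
}
```

With this gate in front of the warning, a repository full of groovy or erlang files produces no native-drop warning unless a grammar for those languages is actually installed.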

Comment on lines +749 to +760
  rows.push([
    def.name,
    def.kind,
    relPath,
    def.line,
    def.endLine ?? null,
    null,
    null,
    null,
    def.visibility ?? null,
  ]);
}

P2 Backfilled symbol nodes have null qualified_name and parent_id

For every def row inserted here, qualified_name (index 6) is null and parent_id (index 5) is null. Any downstream query that joins on qualified_name (e.g. cross-file reference resolution, "go to definition") will not find these symbols. If the backfilled files are only expected to contribute to kind='file' counts that's fine, but the loop also inserts definition rows (def.kind) that will appear in the graph without a qualified name, which may produce confusing incomplete results in some queries.

Contributor Author


Fixed — backfilled symbol rows now populate qualified_name (def.name) and scope (dotted prefix if present), matching how insertDefinitionsAndExports builds rows. Cross-file reference resolution and go-to-definition will now find these symbols.

Comment on lines +714 to +722
async function backfillNativeDroppedFiles(ctx: PipelineContext): Promise<void> {
  // Needs a real better-sqlite3 connection for INSERT.
  if (ctx.nativeFirstProxy) {
    closeNativeDb(ctx, 'pre-parity-backfill');
    ctx.db = openDb(ctx.dbPath);
    ctx.nativeFirstProxy = false;
  }

  const collected = collectFilesUtil(ctx.rootDir, [], ctx.config, new Set<string>());

P2 Filesystem scan on every native build

backfillNativeDroppedFiles is called unconditionally at the end of every tryNativeOrchestrator run, including incremental builds where the native engine is healthy. It always executes collectFilesUtil (a recursive filesystem walk) and a SELECT DISTINCT query against the DB before it can determine there's nothing to backfill. For large repositories this adds measurable overhead to every native build, even when no files are dropped. Consider making the check opt-in (e.g., skip when result.isFullBuild is false and no regression is expected) or caching the filesystem listing already computed earlier in the pipeline.

Contributor Author


Fixed — backfillNativeDroppedFiles now only runs when result.isFullBuild is true. Incremental builds parse through parseFilesAuto which has its own per-file backfill, so the filesystem scan here was redundant on incremental paths.

@github-actions
Contributor

github-actions bot commented Apr 19, 2026

Codegraph Impact Analysis

4 functions changed, 14 callers affected across 9 files

  • tryNativeOrchestrator in src/domain/graph/builder/pipeline.ts:562 (5 transitive callers)
  • backfillNativeDroppedFiles in src/domain/graph/builder/pipeline.ts:721 (4 transitive callers)
  • getInstalledWasmExtensions in src/domain/parser.ts:350 (10 transitive callers)
  • parseFilesAuto in src/domain/parser.ts:892 (9 transitive callers)

- Scope backfill/warn to installed WASM grammars. Extract
  getInstalledWasmExtensions() helper so the parity check in
  parseFilesAuto and the post-orchestrator backfill both ignore
  languages neither engine can parse (keeps warn count meaningful).
- Populate qualified_name and scope on backfilled symbol nodes so
  cross-file reference resolution and go-to-definition queries can
  find them (matches insertDefinitionsAndExports).
- Only run the post-orchestrator filesystem scan on full builds.
  Incremental builds parse through parseFilesAuto, which already has
  its own per-file backfill, so the extra scan was wasted work.
- Fix import ordering flagged by biome.

Impact: 4 functions changed, 14 affected
@carlos-alm
Contributor Author

@greptileai

Comment on lines +761 to +778
for (const def of symbols.definitions ?? []) {
  // Populate qualified_name/scope the same way the JS fallback does so
  // downstream queries (cross-file references, "go to definition") find
  // these symbols.
  const dotIdx = def.name.lastIndexOf('.');
  const scope = dotIdx !== -1 ? def.name.slice(0, dotIdx) : null;
  rows.push([
    def.name,
    def.kind,
    relPath,
    def.line,
    def.endLine ?? null,
    null,
    def.name,
    scope,
    def.visibility ?? null,
  ]);
}

P1 Exported symbols silently dropped for backfilled files

backfillNativeDroppedFiles iterates only symbols.definitions, but insertDefinitionsAndExports (the reference implementation at insert-nodes.ts:212-214) also iterates symbols.exports — inserting export rows and then issuing a second-pass UPDATE nodes SET exported = 1 for those rows. The backfill skips both steps, so exported symbols from HCL/Scala/Swift files won't be findable by any query that filters on exported = 1 (e.g. cross-file reference resolution). Add the exports pass to match the reference:

for (const exp of symbols.exports ?? []) {
  rows.push([exp.name, exp.kind, relPath, exp.line, null, null, exp.name, null, null]);
}

After batchInsertNodes, a second pass is also needed to mark these rows with exported = 1, matching the UPDATE nodes SET exported = 1 logic in insertDefinitionsAndExports.

Contributor Author


Fixed in c3c786a. The backfill now mirrors insertDefinitionsAndExports exactly — iterates symbols.exports ?? [] after symbols.definitions, pushes INSERT OR IGNORE rows, and runs the same chunked (500-row) second-pass UPDATE nodes SET exported = 1 WHERE (name = ? AND kind = ? AND file = ? AND line = ?) OR ... with a prepared-statement cache. Backfilled exports from HCL/Scala/Swift files are now discoverable by queries filtering on exported = 1 (dead-export detection, reverse-dep walks, etc.). Also aligned the file row's qualified_name to NULL to match the reference flow.
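The chunked second pass mentioned above might look roughly like the sketch below. It only builds the SQL text and flat parameter lists and leaves execution to the caller. `buildExportedUpdates`, `ExportKey`, and `UpdateBatch` are hypothetical names, and the `sqlByChunkLength` map is a stand-in for the prepared-statement cache, not the actual one in the codebase.

```typescript
// Sketch only: names are hypothetical; no database driver is involved here.
type ExportKey = readonly [string, string, string, number]; // name, kind, file, line

interface UpdateBatch {
  sql: string;
  params: (string | number)[];
}

function buildExportedUpdates(
  keys: readonly ExportKey[],
  chunkSize = 500,
): UpdateBatch[] {
  const batches: UpdateBatch[] = [];
  // One SQL string per distinct chunk length, so a full-size chunk reuses
  // the same statement text (and, in a real driver, the same prepared statement).
  const sqlByChunkLength = new Map<number, string>();
  const clause = "(name = ? AND kind = ? AND file = ? AND line = ?)";

  for (let i = 0; i < keys.length; i += chunkSize) {
    const chunk = keys.slice(i, i + chunkSize);
    let sql = sqlByChunkLength.get(chunk.length);
    if (sql === undefined) {
      const where = new Array<string>(chunk.length).fill(clause).join(" OR ");
      sql = `UPDATE nodes SET exported = 1 WHERE ${where}`;
      sqlByChunkLength.set(chunk.length, sql);
    }
    // Flatten each (name, kind, file, line) key into positional parameters.
    batches.push({ sql, params: chunk.flatMap((key) => [...key]) });
  }
  return batches;
}
```

Chunking keeps each statement's parameter count bounded, and caching by chunk length means at most two distinct statements are needed per run: one for full chunks and one for the final partial chunk.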

The engine-parity backfill in `backfillNativeDroppedFiles` was only
inserting `symbols.definitions` and skipping `symbols.exports`. Queries
filtering on `exported = 1` (e.g. dead-export detection, reverse-dep
walks, "exports of file") would miss symbols from backfilled files —
HCL/Scala/Swift on stale native addons.

Mirror the reference flow in
`src/domain/graph/builder/stages/insert-nodes.ts:insertDefinitionsAndExports`:

- Iterate `symbols.exports ?? []`, push an `INSERT OR IGNORE` row (no-op
  if a definition row for the same name/kind/line already exists), and
  collect the (name, kind, file, line) key for a second pass.
- After `batchInsertNodes`, run chunked `UPDATE nodes SET exported = 1
  WHERE …` statements (chunk size 500 with a prepared-statement cache)
  so backfilled exports are discoverable.

Also fix a smaller parity gap in the file row: `qualified_name` was
`relPath`; the reference flow uses `NULL`. Matches the canonical insert
so `kind='file'` rows are shape-identical across both paths.

Addresses Greptile P1 (comment 3107870224) on PR #970.

Impact: 1 functions changed, 4 affected
@carlos-alm
Contributor Author

@greptileai

@carlos-alm
Contributor Author

Round 2 sweep update

Greptile P1 (exports backfill) — fixed in c3c786a. The backfill now mirrors insertDefinitionsAndExports exactly: iterates symbols.exports ?? [], pushes INSERT OR IGNORE rows, and runs chunked (500-row) second-pass UPDATE nodes SET exported = 1 WHERE … with prepared-statement cache. Also aligned the file row's qualified_name to NULL to match the reference flow.

regression-guard failure (typescript precision 100% → 93.8%) — this is a pre-existing regression on main, not caused by this PR. Evidence:

  • The resolution benchmark runs with engine: 'wasm' only (see scripts/resolution-benchmark.ts:236 and tests/benchmarks/resolution/resolution-benchmark.test.ts:138), so the native backfill added in this PR is never exercised during resolution measurement.
  • CI on main is red with the same failure: https://github.com/optave/ops-codegraph-tool/actions/runs/24643982400 (run from the 3.9.4 benchmark merge).
  • The checked-in v3.9.4 data was produced against published v3.9.4 (which does not include this PR's code), so the FP comes from changes already on main.

Root cause identified: PR #947 added extractCallbackReferenceCalls which emits a dynamic call edge for any member_expression argument. For this.store.set(user.id, user), user.id is a property read, but the extractor emits save → id with receiver user, which then resolves against User.id (extracted from interface User { id: string } as a method via extractInterfaceMethods). That's the single FP flipping TS from 15/15 (100%) to 15/16 (93.8%).

Tracked in #971 with candidate fixes. Not fixing it in this PR because (a) it predates and is orthogonal to the backfill change, and (b) the correct remediation (arity heuristic vs. callee allowlist vs. resolver-side guard) needs scoping beyond a sweep pass.

@carlos-alm carlos-alm merged commit f6f5482 into main Apr 20, 2026
21 of 25 checks passed
@carlos-alm carlos-alm deleted the fix/native-engine-parity-backfill branch April 20, 2026 02:43
@github-actions github-actions bot locked and limited conversation to collaborators Apr 20, 2026

Development

Successfully merging this pull request may close these issues.

bug: native engine silently drops 60 files vs WASM in 3.9.4 benchmark (engine parity regression)

1 participant