fix(native): backfill silently-dropped files via WASM for engine parity#970
fix(native): backfill silently-dropped files via WASM for engine parity#970carlos-alm merged 4 commits intomainfrom
Conversation
The native engine's `parse_files_parallel` (Rust) uses `filter_map` with the `?` operator, which silently drops any file whose read/parse/extract step fails — including files in languages where the installed native addon lacks an extractor (e.g. HCL/Scala/Swift on v3.9.2). WASM handles these languages correctly, so benchmarks showed 668 native vs 728 WASM file nodes on the same tree (#967). Fix the parity gap in two places: - `parseFilesAuto`: track which files the native parser actually returned; re-parse the rest with WASM so the JS pipeline gets complete coverage. - `tryNativeOrchestrator`: after the native Rust orchestrator writes to the DB, diff `collectFiles()` against the `kind='file'` rows and backfill any missing files via a WASM pass + `batchInsertNodes`. Both paths warn explicitly when a drop happens so the underlying Rust regression stays visible. This complements the Rust-side fix (adding HCL/Scala/Swift extractors in newer binaries) by protecting users who still have older native addons installed.
|
Claude finished @carlos-alm's task —— View job Root Cause AnalysisProblem Being Solved: This PR fixes a real bug where the native Rust engine silently drops files during parsing when their language extractor/grammar is missing or fails (e.g., HCL/Scala/Swift on older native binaries v3.9.2). This caused a parity gap where native produced 668 file nodes vs WASM's 728 on the same codebase. Why This Approach: The fix correctly addresses the root cause in two strategic places:
Both paths emit explicit warnings to keep the underlying Rust issue visible rather than silently masking it. Risk Assessment: ✅ LOW RISK - This is a proper fix, not a workaround. It protects users on older binaries while maintaining transparency about the underlying issue. Backlog Compliance
Code Quality AssessmentStrengths:
Technical Details:
Security & Performance:
Critical ConcernsNone. This is a well-implemented bug fix that properly addresses the root cause while maintaining system transparency. Final Recommendation
The fix protects users on older native binaries while keeping the underlying Rust parsing issue visible through explicit warnings. This is exactly how engine parity gaps should be handled. |
Greptile SummaryThis PR closes a native-engine parity gap (#967) where Confidence Score: 5/5Safe to merge — all prior P0/P1 findings are resolved and no new critical issues are present. All four previous review concerns (warn dilution, null qualified_name/parent_id, unconditional filesystem scan, missing exports pass) have been fixed in this revision. The remaining implementation is logically sound: No files require special attention. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[tryNativeOrchestrator] --> B{result.isFullBuild?}
B -- No --> Z[closeDbPair / return]
B -- Yes --> C[backfillNativeDroppedFiles]
C --> D{ctx.nativeFirstProxy?}
D -- Yes --> E[closeNativeDb → openDb as better-sqlite3]
D -- No --> F
E --> F[collectFilesUtil → expected set]
F --> G[SELECT DISTINCT file FROM nodes WHERE kind='file' → existing set]
G --> H[getInstalledWasmExtensions → installedExts]
H --> I{missingAbs = expected − existing filtered by installedExts}
I -- empty --> Z
I -- non-empty --> J[warn: dropped N files]
J --> K[parseFilesAuto missingAbs engine=wasm]
K --> L[Build rows: file row + def rows + export rows]
L --> M[batchInsertNodes INSERT OR IGNORE]
M --> N{exportKeys non-empty?}
N -- Yes --> O[Chunked UPDATE nodes SET exported=1 WHERE name/kind/file/line]
N -- No --> Z
O --> Z
subgraph parseFilesAuto [parseFilesAuto parity path]
P[native.parseFiles / parseFilesFull] --> Q[Track nativeParsed set]
Q --> R{dropped = filePaths − nativeParsed filtered by installedExts}
R -- non-empty --> S[warn + parseFilesWasm dropped files]
S --> T[Merge WASM results into result map]
R -- empty --> T2[return result]
T --> T2
end
Reviews (3): Last reviewed commit: "fix(native): backfill exported symbols i..." | Re-trigger Greptile |
| warn( | ||
| `Native orchestrator dropped ${missingAbs.length} file(s); backfilling via WASM for engine parity`, | ||
| ); |
There was a problem hiding this comment.
Warn count inflates to include un-parseable languages
collectFilesUtil uses EXTENSIONS (= SUPPORTED_EXTENSIONS), which includes every language in the registry — even those whose WASM grammar is not installed (groovy, erlang, clojure, etc.). When neither the native addon nor WASM can parse a file, it never appears in the DB, so it always ends up in missingAbs. The warning message blames "native orchestrator dropped N file(s)" for all of them, even though N includes files that no engine can handle. On a repo with many unsupported-language files, the log signal for the real HCL/Scala/Swift regression gets diluted.
Consider filtering missingAbs to extensions with an installed WASM grammar before emitting the warning, similar to how parser.ts uses _extToLang.has(...) as a filter gate:
// Only warn about and backfill files that WASM can actually handle.
const recoverable = missingAbs.filter(
(f) => _extToLang.has(path.extname(f).toLowerCase()),
);
if (recoverable.length === 0) return;
warn(`Native orchestrator dropped ${recoverable.length} file(s); backfilling via WASM for engine parity`);
const wasmResults = await parseFilesAuto(recoverable, ctx.rootDir, { engine: 'wasm' });There was a problem hiding this comment.
Fixed — now filters missingAbs to files with installed WASM grammars via a new getInstalledWasmExtensions() helper in parser.ts. Same helper also applied to the parseFilesAuto warn in parser.ts, so both sites stay in sync. Languages without an installed grammar (groovy, erlang, etc.) no longer inflate the native-drop warn.
| rows.push([ | ||
| def.name, | ||
| def.kind, | ||
| relPath, | ||
| def.line, | ||
| def.endLine ?? null, | ||
| null, | ||
| null, | ||
| null, | ||
| def.visibility ?? null, | ||
| ]); | ||
| } |
There was a problem hiding this comment.
Backfilled symbol nodes have null
qualified_name and parent_id
For every def row inserted here, qualified_name (index 6) is null and parent_id (index 5) is null. Any downstream query that joins on qualified_name (e.g. cross-file reference resolution, "go to definition") will not find these symbols. If the backfilled files are only expected to contribute to kind='file' counts that's fine, but the loop also inserts definition rows (def.kind) that will appear in the graph without a qualified name, which may produce confusing incomplete results in some queries.
There was a problem hiding this comment.
Fixed — backfilled symbol rows now populate qualified_name (def.name) and scope (dotted prefix if present), matching how insertDefinitionsAndExports builds rows. Cross-file reference resolution and go-to-definition will now find these symbols.
| async function backfillNativeDroppedFiles(ctx: PipelineContext): Promise<void> { | ||
| // Needs a real better-sqlite3 connection for INSERT. | ||
| if (ctx.nativeFirstProxy) { | ||
| closeNativeDb(ctx, 'pre-parity-backfill'); | ||
| ctx.db = openDb(ctx.dbPath); | ||
| ctx.nativeFirstProxy = false; | ||
| } | ||
|
|
||
| const collected = collectFilesUtil(ctx.rootDir, [], ctx.config, new Set<string>()); |
There was a problem hiding this comment.
Filesystem scan on every native build
backfillNativeDroppedFiles is called unconditionally at the end of every tryNativeOrchestrator run, including incremental builds where the native engine is healthy. It always executes collectFilesUtil (a recursive filesystem walk) and a SELECT DISTINCT query against the DB before it can determine there's nothing to backfill. For large repositories this adds measurable overhead to every native build, even when no files are dropped. Consider making the check opt-in (e.g., skip when result.isFullBuild is false and no regression is expected) or caching the filesystem listing already computed earlier in the pipeline.
There was a problem hiding this comment.
Fixed — backfillNativeDroppedFiles now only runs when result.isFullBuild is true. Incremental builds parse through parseFilesAuto which has its own per-file backfill, so the filesystem scan here was redundant on incremental paths.
Codegraph Impact Analysis4 functions changed → 14 callers affected across 9 files
|
- Scope backfill/warn to installed WASM grammars. Extract getInstalledWasmExtensions() helper so the parity check in parseFilesAuto and the post-orchestrator backfill both ignore languages neither engine can parse (keeps warn count meaningful). - Populate qualified_name and scope on backfilled symbol nodes so cross-file reference resolution and go-to-definition queries can find them (matches insertDefinitionsAndExports). - Only run the post-orchestrator filesystem scan on full builds. Incremental builds parse through parseFilesAuto, which already has its own per-file backfill, so the extra scan was wasted work. - Fix import ordering flagged by biome. Impact: 4 functions changed, 14 affected
| for (const def of symbols.definitions ?? []) { | ||
| // Populate qualified_name/scope the same way the JS fallback does so | ||
| // downstream queries (cross-file references, "go to definition") find | ||
| // these symbols. | ||
| const dotIdx = def.name.lastIndexOf('.'); | ||
| const scope = dotIdx !== -1 ? def.name.slice(0, dotIdx) : null; | ||
| rows.push([ | ||
| def.name, | ||
| def.kind, | ||
| relPath, | ||
| def.line, | ||
| def.endLine ?? null, | ||
| null, | ||
| def.name, | ||
| scope, | ||
| def.visibility ?? null, | ||
| ]); | ||
| } |
There was a problem hiding this comment.
Exported symbols silently dropped for backfilled files
backfillNativeDroppedFiles iterates only symbols.definitions, but insertDefinitionsAndExports (the reference implementation at insert-nodes.ts:212-214) also iterates symbols.exports — inserting export rows and then issuing a second-pass UPDATE nodes SET exported = 1 for those rows. The backfill skips both steps, so exported symbols from HCL/Scala/Swift files won't be findable by any query that filters on exported = 1 (e.g. cross-file reference resolution). Add the exports pass to match the reference:
for (const exp of symbols.exports ?? []) {
rows.push([exp.name, exp.kind, relPath, exp.line, null, null, exp.name, null, null]);
}After batchInsertNodes, a second pass is also needed to mark these rows with exported = 1, matching the UPDATE nodes SET exported = 1 logic in insertDefinitionsAndExports.
There was a problem hiding this comment.
Fixed in c3c786a. The backfill now mirrors insertDefinitionsAndExports exactly — iterates symbols.exports ?? [] after symbols.definitions, pushes INSERT OR IGNORE rows, and runs the same chunked (500-row) second-pass UPDATE nodes SET exported = 1 WHERE (name = ? AND kind = ? AND file = ? AND line = ?) OR ... with a prepared-statement cache. Backfilled exports from HCL/Scala/Swift files are now discoverable by queries filtering on exported = 1 (dead-export detection, reverse-dep walks, etc.). Also aligned the file row's qualified_name to NULL to match the reference flow.
The engine-parity backfill in `backfillNativeDroppedFiles` was only inserting `symbols.definitions` and skipping `symbols.exports`. Queries filtering on `exported = 1` (e.g. dead-export detection, reverse-dep walks, "exports of file") would miss symbols from backfilled files — HCL/Scala/Swift on stale native addons. Mirror the reference flow in `src/domain/graph/builder/stages/insert-nodes.ts:insertDefinitionsAndExports`: - Iterate `symbols.exports ?? []`, push an `INSERT OR IGNORE` row (no-op if a definition row for the same name/kind/line already exists), and collect the (name, kind, file, line) key for a second pass. - After `batchInsertNodes`, run chunked `UPDATE nodes SET exported = 1 WHERE …` statements (chunk size 500 with a prepared-statement cache) so backfilled exports are discoverable. Also fix a smaller parity gap in the file row: `qualified_name` was `relPath`; the reference flow uses `NULL`. Matches the canonical insert so `kind='file'` rows are shape-identical across both paths. Addresses Greptile P1 (comment 3107870224) on PR #970. Impact: 1 functions changed, 4 affected
Round 2 sweep updateGreptile P1 (exports backfill) — fixed in c3c786a. The backfill now mirrors regression-guard failure (
Root cause identified: PR #947 added Tracked in #971 with candidate fixes. Not fixing it in this PR because (a) it predates and is orthogonal to the backfill change, and (b) the correct remediation (arity heuristic vs. callee allowlist vs. resolver-side guard) needs scoping beyond a |
Summary
Fixes #967. The native engine's
parse_files_parallel(Rust) silently drops any file whose read/parse/extract step fails — including files in languages where the installed native addon lacks an extractor (e.g. HCL/Scala/Swift on the currently-published v3.9.2 binary). WASM handles those languages fine, so benchmarks on this repo showed 668 native vs 728 WASM file nodes on the same tree.The fix closes the parity gap in two places in the JS layer:
parseFilesAuto(src/domain/parser.ts) — tracks which files the native parser actually returned; re-parses the rest with WASM so the JS pipeline gets complete coverage.tryNativeOrchestrator(src/domain/graph/builder/pipeline.ts) — after the native Rust orchestrator writes to the DB, diffscollectFiles()against thekind='file'rows and backfills any missing files via a WASM pass +batchInsertNodes.Both paths emit a
warn()when a drop happens so the underlying Rust regression stays visible rather than being silently masked.Local verification
On this repo (ts=484, rs=57, sh=18, … hcl=4, scala=4, swift=4):
kind='file'rows in DBThe backfill correctly adds the 12 HCL/Scala/Swift files. The remaining 48 (clojure/cuda/erlang/fsharp/gleam/groovy/julia/objc/r/solidity/verilog) have no WASM grammar installed locally, so both engines skip them identically — true parity.
Complementary work
This fix protects users on older native addons. A separate Rust-side change should still harden
parallel.rs::parse_files_parallelto not silently swallow extract failures — that belongs in a follow-up.Test plan
node ./dist/cli.js build . --engine native→ DB file count increases from 668 → 680, with all 4×3 HCL/Scala/Swift rows present.parseFilesAuto-using tests pass (tests/parsers/unified.test.ts,cfg-all-langs.test.ts,ast-all-langs.test.ts,ast-nodes.test.ts).files.totalmatches between native and WASM.