Skip to content

fix(perf): scope WASM grammar load in engine-parity backfill (#1054)#1058

Merged
carlos-alm merged 4 commits into
mainfrom
fix/issue-1054-scoped-wasm-backfill
May 4, 2026
Merged

fix(perf): scope WASM grammar load in engine-parity backfill (#1054)#1058
carlos-alm merged 4 commits into
mainfrom
fix/issue-1054-scoped-wasm-backfill

Conversation

@carlos-alm
Copy link
Copy Markdown
Contributor

Summary

  • Avoid the worker-pool first-call overhead (~1.7s on slow CI runners) when only a handful of files need WASM backfill.
  • Adds parseFilesWasmInline — a main-thread parse path that loads only the grammars for the input extensions and returns symbols with _tree set, so the unified walker in runAnalyses populates AST/CFG/dataflow downstream.
  • Routes both backfill call sites (parseFilesAuto per-call backfill, backfillNativeDroppedFiles post-orchestrator backfill) through parseFilesWasmForBackfill, which picks inline for batches ≤ 16 files and the worker pool for larger batches (preserving tree-sitter WASM crash isolation bug: WASM engine crashes V8 reproducibly on Windows + Node 22 when building codegraph source #965 where it matters).

Context

#1054 — every native buildGraph call in the pre-publish gate logs Native orchestrator dropped 4 file(s) in natively-supported languages — likely a Rust extractor bug. Backfilling via WASM: .tf and falls back via the worker pool. This PR makes the fallback cheap for that case. The underlying Rust-side drop (why the freshly-built binary drops these 4 fixture files when the v3.9.6 published binary doesn't) is tracked separately and addressed in a follow-up PR.

Test plan

  • npx tsc --noEmit clean.
  • npx vitest run tests/integration/build-parity.test.ts — 4/4 pass.
  • npx vitest run tests/benchmarks/resolution/resolution-benchmark.test.ts -t "hcl" — 5/5 pass (HCL backfill is the prime case the new path serves).
  • Local incremental-benchmark.ts reports identical numbers (full=2773ms, noop=17ms, 1-file=266ms) to before — Windows didn't pay the worker overhead anyway, so unchanged is expected. Effect should land on CI Linux.
  • Pre-publish benchmark gate runs cleanly on dispatch.

Refs #1054

The native engine drops files in some build environments (#1054), triggering
a WASM backfill via the worker pool. The pool's first-call overhead is fine
for full builds (amortized over hundreds of files) but dwarfs the actual
parse work for small backfill batches — on slow CI runners, ~1.7s for 4
fixture files in one language.

Add `parseFilesWasmInline`: a main-thread, no-worker parse path that loads
only the grammars matching the input extensions and returns symbols with
`_tree` set so the unified walker in `runAnalyses` populates AST/CFG/dataflow
data downstream. New `parseFilesWasmForBackfill` chooses inline for batches
≤ 16 files, keeping worker isolation for larger batches where tree-sitter
WASM crash protection matters more (#965).

Routes both backfill sites through the new helper:
- `parseFilesAuto`'s per-call inline backfill in `domain/parser.ts`
- `backfillNativeDroppedFiles` in `domain/graph/builder/pipeline.ts`

Refs #1054

Impact: 4 functions changed, 13 affected
@claude
Copy link
Copy Markdown

claude Bot commented May 4, 2026

Claude finished @carlos-alm's task —— View job


Root Cause Analysis

Problem Being Solved: The worker pool's cold-start overhead (~1.7s on slow CI runners) was dominating the actual parse work (~10ms) when only a few files needed WASM backfill during engine-parity operations. This primarily affected the pre-publish benchmark gate which logs native orchestrator drops like "Native orchestrator dropped 4 file(s)... Backfilling via WASM".

Why This Approach: The solution maintains the existing worker pool isolation for large batches (preserving crash safety from tree-sitter WASM issues #965) while routing small batches (≤16 files) through a new inline parsing path that avoids worker spawn overhead. The threshold (16 files) aligns with typical engine-parity drop sizes.

Risk Assessment: Low risk - this is a pure performance optimization that preserves existing behavior. The inline path reuses established grammar loading patterns and still sets _tree for downstream analysis compatibility.

Backlog Compliance

  • Zero-dep: ✓ — No new runtime dependencies added. Builds entirely on existing tree-sitter WASM infrastructure
  • Foundation-aligned: ✓ — Directly supports P1 (always-current graph) by reducing rebuild latency on CI, especially for the pre-publish gate that validates every release
  • Problem-fit: 3 — Performance optimization that reduces CI friction but doesn't directly address the core "lost AI" problem
  • Breaking: No — Purely internal optimization with identical external behavior

Tier: 1

Configuration Changes Assessment

No problematic configuration changes detected. The PR introduces a single hardcoded threshold (INLINE_BACKFILL_THRESHOLD = 16) but this is appropriate as a performance tuning constant. Similar patterns exist in the config system.

Code Quality Assessment

Strengths:

  • Clean separation with parseFilesWasmForBackfill routing function
  • Preserves crash isolation for large batches where it matters most
  • Maintains backward compatibility by setting _tree in inline path
  • Good documentation explaining performance trade-offs

Minor Issues:

  • The threshold constant could benefit from a comment explaining the 16-file rationale
  • No security or bug concerns detected

Critical Concerns

None identified. This is a well-scoped performance optimization.

Final Recommendation

  • Rating: ⭐⭐⭐⭐☆ (4/5)
  • Action: APPROVE
  • Reasoning: Clean performance optimization that directly supports the foundation principle of keeping the graph always-current by reducing CI rebuild times. Well-implemented with appropriate safety preservation for larger batches. The only minor improvement would be adding a comment explaining the 16-file threshold choice.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 4, 2026

Greptile Summary

This PR scopes WASM grammar loading for small engine-parity backfills by introducing parseFilesWasmInline (main-thread parse, no worker spawn) and routing both backfill call sites through parseFilesWasmForBackfill, which picks inline for ≤16 files and the worker pool above that. The previously-reported P1 tree leak in backfillNativeDroppedFiles is correctly fixed by the new cleanup loop at the end of that function. One residual concern: the same inline path now also runs inside parseFilesAuto's drop-backfill, and callers like detect-changes.ts discard those results without freeing _tree — a bounded but cumulative leak in in-process test runs.

Confidence Score: 4/5

Safe to merge — the primary leak fix is correct and the residual leak is small, bounded, and confined to edge-case callers.

The previously-reported P1 is fixed. The remaining finding is P2: a bounded WASM tree leak (≤16 trees) in the secondary parseFilesAuto callers that don't route through ctx.allSymbols. It only triggers when the native engine drops files in those paths, which is already an error condition, so real-world impact is minimal. No correctness or data-loss risk.

src/domain/parser.ts — the inline-backfill symbols returned from parseFilesAuto to detect-changes.ts and resolve-imports.ts callers carry unreleased _tree references.

Important Files Changed

Filename Overview
src/domain/parser.ts Adds parseFilesWasmInline (main-thread, sets _tree) and parseFilesWasmForBackfill (threshold router); replaces parseFilesWasm with parseFilesWasmForBackfill in the parseFilesAuto drop-backfill path — symbols with _tree returned to callers that don't route into ctx.allSymbols (e.g., detect-changes) are never freed.
src/domain/graph/builder/pipeline.ts Switches backfillNativeDroppedFiles to parseFilesWasmForBackfill and adds an explicit cleanup loop for the inline-path _tree references — correctly addresses the previously-reported WASM memory leak.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[parseFilesWasmForBackfill] -->|"filePaths.length <= 16"| B[parseFilesWasmInline\nmain-thread, sets _tree]
    A -->|"filePaths.length > 16"| C[parseFilesWasm\nworker pool, no _tree]

    D[backfillNativeDroppedFiles] --> A
    D -->|"after DB insert"| E["cleanup loop\ntree.delete() + _tree = undefined ✅"]

    F[parseFilesAuto] --> A
    F -->|caller: parse-files.ts| G["ctx.allSymbols\n→ releaseWasmTrees ✅"]
    F -->|caller: detect-changes.ts| H["analysisSymbols\n→ runAnalyses\n→ discarded, no cleanup ⚠️"]
    F -->|caller: resolve-imports.ts| I["ctx.fileSymbols\n→ may not reach releaseWasmTrees ⚠️"]
Loading

Fix All in Claude Code

Reviews (2): Last reviewed commit: "docs(parser): explain INLINE_BACKFILL_TH..." | Re-trigger Greptile

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 4, 2026

Codegraph Impact Analysis

4 functions changed13 callers affected across 7 files

  • backfillNativeDroppedFiles in src/domain/graph/builder/pipeline.ts:747 (4 transitive callers)
  • parseFilesWasmInline in src/domain/parser.ts:1092 (7 transitive callers)
  • parseFilesWasmForBackfill in src/domain/parser.ts:1125 (10 transitive callers)
  • parseFilesAuto in src/domain/parser.ts:1138 (7 transitive callers)

carlos-alm added 3 commits May 4, 2026 04:02
The inline backfill path sets symbols._tree (live web-tree-sitter Tree
backed by WASM linear memory) on every result, but those symbols are
consumed locally for DB row construction in backfillNativeDroppedFiles
and never added to ctx.allSymbols, so the finalize-stage releaseWasmTrees
sweep never frees them. Without explicit cleanup, trees leak WASM memory
until process exit — bounded per run but cumulative across in-process
integration tests.

Adds a cleanup loop after batchInsertNodes that mirrors releaseWasmTrees,
and drops the now-unused parseFilesAuto import.
Adds context for the 16-file threshold per Claude review feedback:
sized for typical engine-parity drops (recurring HCL case is 4 files);
above it, the worker-pool's IPC + crash-isolation cost is amortized over
enough parse work to be worth paying; below it, the cold-start dominates.
@carlos-alm
Copy link
Copy Markdown
Contributor Author

Addressed Greptile's P1 — WASM tree leak in backfillNativeDroppedFiles.

Added a cleanup pass after batchInsertNodes that mirrors releaseWasmTrees: iterates the wasmResults map, calls tree.delete() on every symbols._tree, and clears _tree / _langId. Now the inline-backfill path frees its WASM linear memory on the same path the finalize sweep would have if these symbols had been routed through ctx.allSymbols.

Commit: 49c9461

@carlos-alm
Copy link
Copy Markdown
Contributor Author

Addressed Claude's review feedback:

  • Added a doc comment on INLINE_BACKFILL_THRESHOLD explaining the 16-file rationale (sized for typical engine-parity drops; recurring HCL case is 4 files; threshold is where worker-pool IPC + crash-isolation cost amortizes).

Commit: ffd3431

@carlos-alm
Copy link
Copy Markdown
Contributor Author

@greptileai

@carlos-alm carlos-alm merged commit 251b2c4 into main May 4, 2026
21 checks passed
@carlos-alm carlos-alm deleted the fix/issue-1054-scoped-wasm-backfill branch May 4, 2026 10:39
@github-actions github-actions Bot locked and limited conversation to collaborators May 4, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant