fix(parity): log per-file reasons for native orchestrator drops (#1011) by carlos-alm · Pull Request #1024 · optave/ops-codegraph-tool

carlos-alm · 2026-04-29T09:11:27Z

Summary

Fixes #1011. Replaces the coarse "Native orchestrator dropped N file(s); backfilling via WASM for engine parity" warning with per-extension counts and sample paths, so users can tell legitimate parser limits from real native bugs.

What changed

src/domain/parser.ts: Adds NATIVE_SUPPORTED_EXTENSIONS (mirrors LanguageKind::from_extension in crates/codegraph-core/src/parser_registry.rs) and a pure classifyNativeDrops(relPaths) helper that buckets dropped files by reason × extension.
src/domain/graph/builder/pipeline.ts: backfillNativeDroppedFiles now classifies drops:
- unsupported-by-native (no Rust extractor — legitimate parser limit) → logged at info level.
- native-extractor-failure (extension IS in the addon yet the file was still dropped) → logged at warn level with explicit "likely a Rust extractor bug" framing so it stays loud.
- Output caps at 3 sample paths per extension and 6 extensions per line for readability.
tests/parsers/native-drop-classification.test.ts: Unit tests covering classification, mixed inputs, case-insensitive extensions, and empty input.
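
As an illustration of the bucketing described above, here is a minimal sketch of a classifier with the same return shape the unit tests in this PR exercise (`byReason` maps plus `totals`). It is not the code from `src/domain/parser.ts`: the `NativeDropReason` type name, the `lowerExt` helper, the abbreviated `SUPPORTED_SAMPLE` set, and the `classifyNativeDropsSketch` name are all assumptions made for the example.

```typescript
// Sketch only: the shape follows the PR's tests, the bodies are illustrative assumptions.
type NativeDropReason = 'unsupported-by-native' | 'native-extractor-failure';

interface NativeDropClassification {
  byReason: Record<NativeDropReason, Map<string, string[]>>;
  totals: Record<NativeDropReason, number>;
}

// Abbreviated, hypothetical stand-in for NATIVE_SUPPORTED_EXTENSIONS.
const SUPPORTED_SAMPLE: ReadonlySet<string> = new Set(['.ts', '.py', '.go', '.rs']);

// Lowercased extension including the dot, or '' when the basename has none.
function lowerExt(relPath: string): string {
  const base = relPath.slice(relPath.lastIndexOf('/') + 1);
  const dot = base.lastIndexOf('.');
  return dot > 0 ? base.slice(dot).toLowerCase() : '';
}

function classifyNativeDropsSketch(relPaths: readonly string[]): NativeDropClassification {
  const byReason: NativeDropClassification['byReason'] = {
    'unsupported-by-native': new Map(),
    'native-extractor-failure': new Map(),
  };
  const totals: NativeDropClassification['totals'] = {
    'unsupported-by-native': 0,
    'native-extractor-failure': 0,
  };
  for (const rel of relPaths) {
    const ext = lowerExt(rel);
    // A supported extension that was still dropped points at a native bug.
    const reason: NativeDropReason = SUPPORTED_SAMPLE.has(ext)
      ? 'native-extractor-failure'
      : 'unsupported-by-native';
    const bucket = byReason[reason].get(ext) ?? [];
    bucket.push(rel);
    byReason[reason].set(ext, bucket);
    totals[reason] += 1;
  }
  return { byReason, totals };
}
```

Keeping the function pure (paths in, buckets out, no logging) is what lets the tests assert on the classification without touching the pipeline.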

Why this approach

The issue offers two acceptance paths: fix the gap (add native extractors for 11 WASM-only languages) or log per-file drop reasons that are legitimate parser limits. The WASM-only languages (.fs, .gleam, .clj, .jl, .R, .erl, .sol, .m, .cu, .groovy, .v — ~48 of the 49 dropped files on a 3.9.5 self-build) are precisely the legitimate limits the issue describes; the second criterion fits the scope. Adding 11 Rust extractors is a much larger effort and a separate concern.

The categorization makes any future regression in a natively-supported language stand out as a warn rather than blending into the existing baseline noise — the CI parity gate from #1014 still catches the count-level threshold.

Sample output

Before:

[codegraph WARN] Native orchestrator dropped 49 file(s); backfilling via WASM for engine parity

After (info — legitimate gap):

[codegraph INFO] Native orchestrator skipped 48 file(s) in languages without a Rust extractor; backfilling via WASM: .clj (5: a.clj, b.clj, c.clj, +2 more); .jl (5: a.jl, b.jl, c.jl, +2 more); .r (5: a.R, b.R, c.R, +2 more); .fs (4: a.fs, b.fs, c.fs, +1 more); .gleam (4: a.gleam, b.gleam, c.gleam, +1 more); .erl (4: a.erl, b.erl, c.erl, +1 more); +5 more extension(s)

After (warn — real bug):

[codegraph WARN] Native orchestrator dropped 1 file(s) in natively-supported languages — likely a Rust extractor bug. Backfilling via WASM: .ts (1: src/foo.ts)
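
The per-extension summaries in the sample output can be reproduced with a small formatter. The sketch below assumes the caps quoted in this PR (3 sample paths per extension, 6 extensions per line) and descending-count ordering; the function name `formatDropSummarySketch` and its body are illustrative, not the merged `formatDropExtensionSummary`.

```typescript
// Caps described in the PR; the function body is an illustrative assumption.
const MAX_EXTS = 6;
const MAX_SAMPLES = 3;

function formatDropSummarySketch(buckets: ReadonlyMap<string, readonly string[]>): string {
  // Largest buckets first so the most affected extensions survive the cap.
  const sorted = Array.from(buckets.entries()).sort((a, b) => b[1].length - a[1].length);
  const parts = sorted.slice(0, MAX_EXTS).map(([ext, paths]) => {
    const shown = paths.slice(0, MAX_SAMPLES);
    const hidden = paths.length - shown.length;
    const samples = hidden > 0 ? [...shown, `+${hidden} more`] : shown;
    return `${ext} (${paths.length}: ${samples.join(', ')})`;
  });
  const hiddenExts = sorted.length - MAX_EXTS;
  if (hiddenExts > 0) parts.push(`+${hiddenExts} more extension(s)`);
  return parts.join('; ');
}

// Usage: a single dropped TypeScript file.
const summary = formatDropSummarySketch(new Map([['.ts', ['src/foo.ts']]]));
// summary === '.ts (1: src/foo.ts)'
```

Both caps report how much was elided (`+N more`, `+N more extension(s)`), so a truncated line never hides the true totals.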

Test plan

Unit tests added in tests/parsers/native-drop-classification.test.ts (logic verified locally via a standalone Node script; a pre-existing Node 24 vitest config bug, "--strip-types is not allowed in NODE_OPTIONS", blocks running tests on this machine; CI on Node 22 will run them)
npx tsc --noEmit — clean
npx biome check on touched files — only pre-existing unused-function warnings in unrelated parser internals
CI: vitest, build, parity gate (the #1014 threshold should pass; the file-set gap is unaffected since backfill still runs)

Out of scope

Adding native Rust extractors for .fs/.gleam/.clj/.jl/.R/.erl/.sol/.m/.cu/.groovy/.v/.sv is a separate, larger effort. Once any of those are added on the Rust side, adding the corresponding entries to NATIVE_SUPPORTED_EXTENSIONS will automatically reclassify those drops as failures (good).

The native orchestrator's WASM backfill warning was a single line, "dropped N file(s); backfilling via WASM", which made it impossible to tell which files were dropped or why. With 49 drops on the 3.9.5 self-build, users could not distinguish legitimate parser limits (languages with no Rust extractor, e.g. .fs/.gleam/.clj/.jl/.R/.sol/.cu) from real extractor regressions in natively-supported languages.

Add a `classifyNativeDrops` helper in `src/domain/parser.ts` that buckets each missing file by extension and reason:
- `unsupported-by-native`: extension has no Rust extractor, a legitimate parser limit per the issue's acceptance criteria. Logged at info level.
- `native-extractor-failure`: extension IS in the Rust addon yet the file was still dropped, which points at a real native bug. Logged at warn level with explicit "likely a Rust extractor bug" framing so it stays loud.

`backfillNativeDroppedFiles` in `pipeline.ts` now uses the helper to emit per-extension counts and up to 3 sample paths per extension (capped at 6 extensions to keep the log line readable).

The classifier set `NATIVE_SUPPORTED_EXTENSIONS` mirrors `LanguageKind::from_extension` in `crates/codegraph-core/src/parser_registry.rs` and carries a comment flagging the sync requirement: the native addon ships as a separate npm package, so JS has no runtime way to discover its language coverage.

Adds unit tests covering the classification, sample paths, case-insensitive extensions, and empty input.

Note: the Node 24 vitest config has a pre-existing incompatibility (`--strip-types is not allowed in NODE_OPTIONS`); test logic was verified locally via a standalone Node script that mirrors the unit tests, and the PR's CI on Node 22 will run the vitest suite normally.

docs check acknowledged: internal logging refactor; no user-visible language support, command, or roadmap deliverable changes.

claude · 2026-04-29T09:11:43Z

Claude finished @carlos-alm's task

PR Review: Fix Native Orchestrator Drop Logging

Reviewing PR #1024 that improves logging for native orchestrator drops.

Todo List

Read and analyze changed files
Perform root cause analysis
Check for configuration changes (high scrutiny)
Verify test integrity
Review code quality
Assess backlog compliance
Provide final rating and recommendation

Root Cause Analysis

Problem Being Solved: Issue #1011 - Replace the vague Native orchestrator dropped N file(s) warning with detailed per-extension classification so users can distinguish legitimate parser limits (WASM-only languages) from real native bugs (supported extensions that still get dropped).

Why This Approach: The PR chooses to improve logging rather than add 11 new Rust extractors. This is justified because:

The 48 dropped files on a self-build are primarily WASM-only languages (.fs, .gleam, .clj, .jl, .R, .erl, .sol, .m, .cu, .groovy, .v)
Adding 11 new Rust extractors is a much larger effort and separate concern
The categorization makes future regressions in natively-supported languages stand out as warn rather than blending into baseline noise
The CI parity gate from #1014 (ci(bench): gate release benchmark on engine parity thresholds) still catches count-level threshold issues

Risk Assessment: Low risk. This is purely additive logging with no functional changes to parsing logic.

Critical Concerns

None identified. This is a well-implemented logging improvement, as the checks below show:

✅ No Configuration Relaxations

No changes to linting rules, test timeouts, coverage thresholds, or build configs.

✅ Test Integrity Maintained

New test file native-drop-classification.test.ts has proper unit tests covering:
- WASM-only languages categorized as unsupported-by-native
- Natively-supported extensions flagged as native-extractor-failure
- Mixed scenarios and edge cases (empty input, case-insensitive extensions)
No weakened assertions or conditional test logic
The test file is focused on the new helpers and does not alter the code under test

✅ Clean Code Quality

parser.ts:427-467 - NATIVE_SUPPORTED_EXTENSIONS constant mirrors Rust LanguageKind::from_extension accurately. Good documentation explains the sync requirement.

parser.ts:488-513 - classifyNativeDrops() is a pure function with clear separation of concerns. Clean implementation using Map buckets and proper type safety.

pipeline.ts:747-760 - formatDropExtensionSummary() has reasonable output limits (3 samples per extension, 6 extensions total) to prevent log spam.

pipeline.ts:800-814 - Integration point provides clear differentiation:

info level for legitimate gaps with "languages without a Rust extractor"
warn level for real bugs with "likely a Rust extractor bug" framing

The logging approach ensures real regressions stay loud while legitimate limitations become quieter info messages.
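
A minimal sketch of the two-tier dispatch described above, assuming a logger with `info` and `warn` methods. The `DropLogger` interface, the `logNativeDropsSketch` name, and the `{ count, summary }` argument shape are hypothetical; only the message wording is taken from the sample output in the PR description.

```typescript
// Hypothetical logger shape; the project's real logger is not shown in this thread.
interface DropLogger {
  info(msg: string): void;
  warn(msg: string): void;
}

interface DropTier {
  count: number;   // number of dropped files in this tier
  summary: string; // pre-formatted per-extension summary
}

function logNativeDropsSketch(logger: DropLogger, unsupported: DropTier, failures: DropTier): void {
  // Legitimate gap: no Rust extractor exists, so keep it quiet.
  if (unsupported.count > 0) {
    logger.info(
      `Native orchestrator skipped ${unsupported.count} file(s) in languages without a Rust extractor; backfilling via WASM: ${unsupported.summary}`,
    );
  }
  // Supported extension that was still dropped: keep it loud.
  if (failures.count > 0) {
    logger.warn(
      `Native orchestrator dropped ${failures.count} file(s) in natively-supported languages — likely a Rust extractor bug. Backfilling via WASM: ${failures.summary}`,
    );
  }
}
```

Each tier logs independently, so a run with only WASM-only languages emits no warning at all.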

Backlog Compliance

Zero-dep: ✓ - No new runtime dependencies. Builds entirely on existing tree-sitter/parser infrastructure.
Foundation-aligned: ✓ - Fully aligned with core principles. Improves debugging without changing core pipeline performance or adding external dependencies. Supports P1 (always-current) by making native engine issues more diagnosable.
Problem-fit: 3 - Useful improvement but doesn't directly address the core "lost AI" problem. This is infrastructure debugging rather than agent guidance.
Breaking: No - Purely additive logging changes, no API/CLI/schema changes.
Tier: 1

Final Recommendation

Rating: ⭐⭐⭐⭐☆ (4/5)
Action: APPROVE
Reasoning: This is a solid, well-tested infrastructure improvement. The implementation is clean, the approach is justified, and it addresses a real usability issue without introducing risk. The categorization will help users distinguish legitimate parser limits from actual bugs, making the native engine more debuggable. The only reason this isn't 5 stars is that it's infrastructure work rather than a feature that directly improves the user experience.

greptile-apps · 2026-04-29T09:14:16Z

Greptile Summary

Replaces the coarse native-orchestrator drop warning with per-extension, per-reason log messages: unsupported-by-native drops (no Rust extractor) are demoted to info, while native-extractor-failure drops (supported extension that still failed) stay at warn to keep regressions loud. New classifyNativeDrops and formatDropExtensionSummary helpers are pure functions with solid unit coverage, and the drift guard cross-checks NATIVE_SUPPORTED_EXTENSIONS against the Rust source at CI time.

Confidence Score: 5/5

Safe to merge — logic is correct, well-tested, and the only finding is a robustness nit in the drift guard test.

No P0 or P1 issues found. The single P2 comment concerns a fragile boundary condition in the drift guard regex that could produce spurious test failures but cannot cause a mis-classification in production.

tests/parsers/native-drop-classification.test.ts — drift guard body-slice fallback could be unbounded if from_extension is the last pub fn in the file.

Important Files Changed

Filename	Overview
src/domain/parser.ts	Adds NATIVE_SUPPORTED_EXTENSIONS constant, classifyNativeDrops, and formatDropExtensionSummary — all pure, well-typed, and accompanied by matching tests; logic is correct.
src/domain/graph/builder/pipeline.ts	backfillNativeDroppedFiles now classifies drops into info/warn tiers; missingRel is correctly gathered alongside missingAbs and passed to classifyNativeDrops; info import was already present.
tests/parsers/native-drop-classification.test.ts	Thorough unit tests for classifyNativeDrops and formatDropExtensionSummary; drift guard regex could produce spurious failures if from_extension is the last pub fn in the file or if non-pub functions follow it.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[backfillNativeDroppedFiles] --> B{file in existing DB?}
    B -- yes --> C[skip]
    B -- no --> D{WASM grammar installed?}
    D -- no --> C
    D -- yes --> E[add to missingRel / missingAbs]
    E --> F[classifyNativeDrops]
    F --> G{extension in NATIVE_SUPPORTED_EXTENSIONS?}
    G -- yes --> H[native-extractor-failure bucket]
    G -- no --> I[unsupported-by-native bucket]
    H --> J[warn log + formatDropExtensionSummary]
    I --> K[info log + formatDropExtensionSummary]
    J --> L[parseFilesAuto via WASM backfill]
    K --> L

Reviews (2): Last reviewed commit: "test(parity): cover formatDropExtensionS..."

greptile-apps · 2026-04-29T09:14:21Z

+/**
+ * Lowercase file extensions covered by the native Rust addon.
+ *
+ * Mirrors `LanguageKind::from_extension` in
+ * `crates/codegraph-core/src/parser_registry.rs`. Used to classify why the
+ * native orchestrator dropped a file: extensions outside this set are a
+ * legitimate parser limit (no Rust extractor exists), while extensions inside
+ * it indicate a real native bug (parse/read/extract failure).
+ *
+ * Keep this list in sync with the Rust enum — the native addon is a separate
+ * npm package, so JS has no runtime way to discover its language coverage.
+ */
+export const NATIVE_SUPPORTED_EXTENSIONS: ReadonlySet<string> = new Set([
+  '.js',
+  '.jsx',
+  '.mjs',
+  '.cjs',
+  '.ts',
+  '.tsx',
+  '.py',
+  '.pyi',
+  '.tf',
+  '.hcl',
+  '.go',
+  '.rs',
+  '.java',
+  '.cs',
+  '.rb',
+  '.rake',
+  '.gemspec',
+  '.php',
+  '.phtml',
+  '.c',
+  '.h',
+  '.cpp',
+  '.cc',
+  '.cxx',
+  '.hpp',
+  '.kt',
+  '.kts',
+  '.swift',
+  '.scala',
+  '.sh',
+  '.bash',
+  '.ex',
+  '.exs',
+  '.lua',
+  '.dart',
+  '.zig',
+  '.hs',
+  '.ml',
+  '.mli',
+]);


Silent mis-classification risk on native addon version drift

NATIVE_SUPPORTED_EXTENSIONS is keyed to one specific snapshot of LanguageKind::from_extension. If the Rust addon gains a new language (or drops one) between addon releases without a matching JS update, drops will be silently mis-classified: a real native failure shows up as unsupported-by-native (info, quiet) instead of native-extractor-failure (warn, loud). The inverse case — removed support — would spam false native-extractor-failure warnings. There's no runtime assertion that the two lists agree, so the drift won't be caught until a user notices wrong log levels. Consider adding a CI step or a startup assertion that cross-checks the set against the native addon's own exported metadata if the addon exposes it; if it doesn't, at minimum add an integration test that verifies the current addon version is the one this set was generated from.

Fixed in cbbc9ae — added a drift guard test that parses crates/codegraph-core/src/parser_registry.rs and asserts NATIVE_SUPPORTED_EXTENSIONS agrees with the Rust LanguageKind::from_extension arms. The native addon doesn't expose its own metadata, so source-level cross-check at CI time is the cheapest way to catch drift before users see mis-classified log levels. If parser_registry.rs adds or removes an extension, the test fails loudly with a list of mismatches.
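
The source-level cross-check can be sketched roughly as follows. This is not the test added in cbbc9ae: the Rust excerpt is a made-up stand-in (the real `parser_registry.rs` is not shown in this thread), the match-arm layout is an assumption, and `extractRustExtensions` and `diffExtensionSets` are hypothetical names. The sketch slices the function body by brace matching instead of searching for the next `pub fn`, which assumes no braces appear inside string literals.

```typescript
// Hypothetical excerpt; the real parser_registry.rs may be laid out differently.
const RUST_SOURCE_SAMPLE = `
impl LanguageKind {
    pub fn from_extension(ext: &str) -> Option<Self> {
        match ext {
            "js" | "jsx" | "mjs" | "cjs" => Some(Self::JavaScript),
            "ts" | "tsx" => Some(Self::TypeScript),
            "py" | "pyi" => Some(Self::Python),
            _ => None,
        }
    }

    pub fn name(&self) -> &'static str {
        "unused"
    }
}
`;

// Collect the string literals used as match-arm patterns inside from_extension.
function extractRustExtensions(source: string): Set<string> {
  const start = source.indexOf('fn from_extension');
  if (start === -1) throw new Error('from_extension not found');
  const open = source.indexOf('{', start);
  if (open === -1) throw new Error('from_extension has no body');
  let depth = 0;
  let end = -1;
  for (let i = open; i < source.length; i++) {
    const ch = source[i];
    if (ch === '{') depth++;
    else if (ch === '}' && --depth === 0) {
      end = i;
      break;
    }
  }
  if (end === -1) throw new Error('unbalanced braces in from_extension');
  const body = source.slice(open, end + 1);
  const found = new Set<string>();
  // A literal counts only when followed by `|` or `=>`, i.e. when it is an arm pattern.
  const armLiteral = /"([^"]+)"\s*(?:\||=>)/g;
  let m: RegExpExecArray | null;
  while ((m = armLiteral.exec(body)) !== null) {
    const lit = m[1];
    if (lit) found.add('.' + lit.toLowerCase());
  }
  return found;
}

// Report mismatches in both directions so a failing test can list them.
function diffExtensionSets(js: ReadonlySet<string>, rust: ReadonlySet<string>) {
  const missingInJs: string[] = [];
  const missingInRust: string[] = [];
  rust.forEach((e) => { if (!js.has(e)) missingInJs.push(e); });
  js.forEach((e) => { if (!rust.has(e)) missingInRust.push(e); });
  return { missingInJs: missingInJs.sort(), missingInRust: missingInRust.sort() };
}
```

A test built on this would read the Rust file from disk and assert that both mismatch lists are empty.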

greptile-apps · 2026-04-29T09:14:22Z

+import { describe, expect, it } from 'vitest';
+import { classifyNativeDrops, NATIVE_SUPPORTED_EXTENSIONS } from '../../src/domain/parser.js';
+
+describe('classifyNativeDrops', () => {
+  it('groups WASM-only languages under unsupported-by-native', () => {
+    const { byReason, totals } = classifyNativeDrops([
+      'src/a.fs',
+      'src/b.gleam',
+      'src/c.clj',
+      'src/d.jl',
+      'src/e.R',
+      'src/f.erl',
+      'src/g.sol',
+      'src/h.cu',
+      'src/i.groovy',
+      'src/j.v',
+      'src/k.m',
+    ]);
+    expect(totals['unsupported-by-native']).toBe(11);
+    expect(totals['native-extractor-failure']).toBe(0);
+    expect(byReason['unsupported-by-native'].get('.fs')).toEqual(['src/a.fs']);
+    expect(byReason['unsupported-by-native'].get('.gleam')).toEqual(['src/b.gleam']);
+    expect(byReason['unsupported-by-native'].get('.r')).toEqual(['src/e.R']);
+  });
+
+  it('flags natively-supported extensions as native-extractor-failure', () => {
+    const { byReason, totals } = classifyNativeDrops([
+      'src/a.ts',
+      'src/b.py',
+      'src/c.go',
+      'src/d.rs',
+    ]);
+    expect(totals['native-extractor-failure']).toBe(4);
+    expect(totals['unsupported-by-native']).toBe(0);
+    expect(byReason['native-extractor-failure'].get('.ts')).toEqual(['src/a.ts']);
+    expect(byReason['native-extractor-failure'].get('.py')).toEqual(['src/b.py']);
+  });
+
+  it('handles a mix of supported and unsupported extensions', () => {
+    const { byReason, totals } = classifyNativeDrops([
+      'src/a.ts',
+      'src/b.fs',
+      'src/c.fs',
+      'src/d.gleam',
+    ]);
+    expect(totals['native-extractor-failure']).toBe(1);
+    expect(totals['unsupported-by-native']).toBe(3);
+    expect(byReason['unsupported-by-native'].get('.fs')).toEqual(['src/b.fs', 'src/c.fs']);
+    expect(byReason['unsupported-by-native'].get('.gleam')).toEqual(['src/d.gleam']);
+  });
+
+  it('lowercases extensions so .R and .r share a bucket', () => {
+    const { byReason, totals } = classifyNativeDrops(['scripts/a.R', 'scripts/b.r']);
+    expect(totals['unsupported-by-native']).toBe(2);
+    expect(byReason['unsupported-by-native'].get('.r')).toEqual(['scripts/a.R', 'scripts/b.r']);
+  });
+
+  it('returns empty buckets when no files are passed', () => {
+    const { byReason, totals } = classifyNativeDrops([]);
+    expect(totals['native-extractor-failure']).toBe(0);
+    expect(totals['unsupported-by-native']).toBe(0);
+    expect(byReason['native-extractor-failure'].size).toBe(0);
+    expect(byReason['unsupported-by-native'].size).toBe(0);
+  });
+
+  it('exposes the native-supported extension set for callers', () => {
+    expect(NATIVE_SUPPORTED_EXTENSIONS.has('.ts')).toBe(true);
+    expect(NATIVE_SUPPORTED_EXTENSIONS.has('.py')).toBe(true);
+    expect(NATIVE_SUPPORTED_EXTENSIONS.has('.fs')).toBe(false);
+    expect(NATIVE_SUPPORTED_EXTENSIONS.has('.gleam')).toBe(false);
+  });
+});


formatDropExtensionSummary cap logic is not unit-tested

The non-trivial formatting function (MAX_EXTS = 6, MAX_SAMPLES = 3, +N more extension(s) display) lives in pipeline.ts but has no dedicated unit tests. The current test suite only covers classifyNativeDrops. A regression in the cap logic (e.g., off-by-one in the +N more calculation, or the sort being reversed) would produce silently truncated log lines without any test failure. Adding a small test or extracting and testing formatDropExtensionSummary directly would cover that path.

Fixed in cbbc9ae — moved formatDropExtensionSummary from pipeline.ts to parser.ts (next to classifyNativeDrops) and exported it. Added 6 unit tests covering: empty buckets, under-cap rendering, the +N more per-extension sample cap, the exact-cap boundary, the +N more extension(s) suffix when the extension cap is hit, and descending-count ordering so a future regression in the sort or off-by-one in the +N math fails loudly.

github-actions · 2026-04-29T09:19:17Z

Codegraph Impact Analysis

5 functions changed → 5 callers affected across 3 files

backfillNativeDroppedFiles in src/domain/graph/builder/pipeline.ts:747 (4 transitive callers)
NativeDropClassification.byReason in src/domain/parser.ts:479 (0 transitive callers)
NativeDropClassification.totals in src/domain/parser.ts:481 (0 transitive callers)
classifyNativeDrops in src/domain/parser.ts:489 (3 transitive callers)
formatDropExtensionSummary in src/domain/parser.ts:522 (3 transitive callers)

…RTED_EXTENSIONS drift (#1024)

Address Greptile P2 review feedback on #1024:
- Move `formatDropExtensionSummary` from `pipeline.ts` to `parser.ts` next to `classifyNativeDrops` and export it so the cap logic can be unit-tested.
- Add tests for the formatter covering: empty buckets, under-cap rendering, per-extension sample cap with `+N more`, exact-cap boundary, extension cap with `+N more extension(s)` suffix, and descending-count ordering.
- Add a drift guard test that parses `crates/codegraph-core/src/parser_registry.rs` and asserts the JS `NATIVE_SUPPORTED_EXTENSIONS` set agrees with the Rust `LanguageKind::from_extension` arms. The native addon doesn't expose its own metadata, so source-level cross-check at CI time is the cheapest way to catch drift before users see mis-classified log levels.

Impact: 1 functions changed, 3 affected

carlos-alm · 2026-04-29T19:12:56Z

@greptileai

greptile-apps Bot reviewed Apr 29, 2026


carlos-alm merged commit 57a6df4 into main Apr 29, 2026
29 checks passed

carlos-alm deleted the fix/native-drop-per-file-reasons-1011 branch April 29, 2026 19:35

github-actions Bot locked and limited conversation to collaborators Apr 29, 2026

carlos-alm merged 2 commits into main from fix/native-drop-per-file-reasons-1011
