fix(extraction): drop duplicate export-var nodes and honour maxFileSize in bulk path #8
Open
mschreib28 wants to merge 1 commit into
Two correctness bugs in the core extraction pipeline, surfaced by an
adversarial stress corpus (5k synthetic export-const declarations
plus a deliberate 8MB single-line file):
1) Every `export const X = ...` produced TWO nodes for the same
symbol — one kind:'variable' from extractExportedVariables, plus
one kind:'constant' from extractVariable (called when the walker
descended into the export_statement child). Stress test showed
100% duplication across 5,003 export-const declarations. The
dedicated extractVariable dispatch is the correct one — it picks
kind from isConst, captures the initializer signature, and walks
type annotations; the export-statement helper was redundant
because the language extractors' isExported predicate already
walks parent chains. Remove the export_statement branch from the
dispatch (children are descended into normally) and drop the
private helper.
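
A minimal sketch of the dispatch shape described in (1), not the project's actual code. The `AstNode` and `GraphNode` types, the `link` helper, and the `variable_declaration` node type are hypothetical stand-ins; only `extractVariable`, `isExported`, `isConst`, and `export_statement` are names taken from the description above. The point is that with no `export_statement` branch, the walker reaches the declaration exactly once and the export status comes from the parent chain.

```typescript
// Hypothetical minimal AST shape; the real extractor works on parser nodes.
interface AstNode {
  type: string;
  name?: string;
  isConst?: boolean;
  parent?: AstNode;
  children: AstNode[];
}

interface GraphNode {
  kind: "constant" | "variable";
  name: string;
  exported: boolean;
}

// Walk the parent chain looking for an export wrapper.
function isExported(node: AstNode): boolean {
  for (let p = node.parent; p; p = p.parent) {
    if (p.type === "export_statement") return true;
  }
  return false;
}

// Single source of truth for variable nodes: kind is picked from isConst.
function extractVariable(node: AstNode): GraphNode {
  return {
    kind: node.isConst ? "constant" : "variable",
    name: node.name ?? "<anonymous>",
    exported: isExported(node),
  };
}

// No export_statement branch: wrappers are descended into like any other node.
function walk(node: AstNode, out: GraphNode[] = []): GraphNode[] {
  if (node.type === "variable_declaration") {
    out.push(extractVariable(node));
  }
  for (const child of node.children) walk(child, out);
  return out;
}

// Test helper (assumption): fill in parent links for a literal tree.
function link(node: AstNode): AstNode {
  for (const child of node.children) {
    child.parent = node;
    link(child);
  }
  return node;
}

// Models `export const X = ...`.
const program = link({
  type: "program",
  children: [
    {
      type: "export_statement",
      children: [
        { type: "variable_declaration", name: "X", isConst: true, children: [] },
      ],
    },
  ],
});

const extracted = walk(program);
```

In this sketch `extracted` holds a single exported node of kind `constant` for `X`, which is the behaviour the updated tests look for.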
2) The bulk indexAll path read each file's stats but never compared
stats.size against config.maxFileSize. Vendored generated files
(multi-MB headers, minified bundles, etc.) were indexed regardless
of the user's size cap. The single-file extractFile path enforced
it; only the bulk path was missing the check. Mirror the
single-file behaviour: emit a 'size_exceeded' warning, count the
file as skipped, advance progress, and continue.
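
The check described in (2) can be sketched as follows. This is an illustration under stated assumptions, not the real `indexAll`: the function name `indexAllSketch`, the `FileStat`, `IndexWarning`, and `BulkResult` types, and the `onProgress` callback are all hypothetical. Only the `stats.size > maxFileSize` comparison and the `size_exceeded` warning code come from the description.

```typescript
// Hypothetical shapes; the real pipeline reads stats from the filesystem.
interface FileStat {
  path: string;
  size: number; // bytes
}

interface IndexWarning {
  code: "size_exceeded";
  path: string;
  size: number;
  limit: number;
}

interface BulkResult {
  indexed: string[];
  skipped: number;
  processed: number;
  warnings: IndexWarning[];
}

function indexAllSketch(
  files: FileStat[],
  maxFileSize: number,
  onProgress?: (done: number, total: number) => void,
): BulkResult {
  const result: BulkResult = { indexed: [], skipped: 0, processed: 0, warnings: [] };

  for (const file of files) {
    if (file.size > maxFileSize) {
      // Mirror the single-file path: warn, count as skipped, advance progress, continue.
      result.warnings.push({
        code: "size_exceeded",
        path: file.path,
        size: file.size,
        limit: maxFileSize,
      });
      result.skipped++;
      result.processed++;
      onProgress?.(result.processed, files.length);
      continue;
    }

    // Placeholder for the real extraction step.
    result.indexed.push(file.path);
    result.processed++;
    onProgress?.(result.processed, files.length);
  }

  return result;
}

const MB = 1024 * 1024;
const bulkResult = indexAllSketch(
  [
    { path: "big-fns.ts", size: 3 * MB },
    { path: "single-line.js", size: 8 * MB },
    { path: "small.ts", size: 1024 },
  ],
  1 * MB,
);
```

Advancing progress on the skip branch matters: without it, a progress bar driven by `processed` would stall short of the total whenever files are skipped.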
On the stress workspace (5,005 synthetic files; 50,000 fns in one
3MB file; 8MB single-line file; 5,000 export-const declarations):
before: 65,014 nodes (100% var/const duplication, every >1MB file
indexed despite maxFileSize=1MB)
after: 10,008 nodes (0 duplicates, large files correctly skipped
with size_exceeded warnings)
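
One way a duplicate count like the one above could be measured is to group nodes by location and name and keep the groups with more than one member. The sketch below is an assumption about methodology, not the script actually used; `IndexedNode` and `findDuplicateSymbols` are hypothetical names.

```typescript
// Hypothetical node record; the real graph schema may differ.
interface IndexedNode {
  file: string;
  name: string;
  line: number;
  kind: string;
}

// Returns only the groups where the same symbol produced more than one node.
function findDuplicateSymbols(nodes: IndexedNode[]): Map<string, IndexedNode[]> {
  const groups = new Map<string, IndexedNode[]>();
  for (const node of nodes) {
    const key = node.file + ":" + node.line + ":" + node.name;
    const group = groups.get(key);
    if (group) {
      group.push(node);
    } else {
      groups.set(key, [node]);
    }
  }

  const duplicates = new Map<string, IndexedNode[]>();
  groups.forEach((group, key) => {
    if (group.length > 1) duplicates.set(key, group);
  });
  return duplicates;
}

// Before the fix: one `export const X` yields a variable node and a constant node.
const beforeFix: IndexedNode[] = [
  { file: "a.ts", name: "X", line: 1, kind: "variable" },
  { file: "a.ts", name: "X", line: 1, kind: "constant" },
  { file: "a.ts", name: "Y", line: 2, kind: "function" },
];

// After the fix: a single constant node per declaration.
const afterFix: IndexedNode[] = [
  { file: "a.ts", name: "X", line: 1, kind: "constant" },
  { file: "a.ts", name: "Y", line: 2, kind: "function" },
];

const dupesBefore = findDuplicateSymbols(beforeFix);
const dupesAfter = findDuplicateSymbols(afterFix);
```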
Tests calibrated to the duplicate behaviour were updated to look for
kind:'constant' on `export const`, which is the correct kind. Full
suite: 380 passed (was 374 passed, 6 failed before this fix).
## Summary

Two correctness bugs in the core extraction pipeline, surfaced by stress-testing against an adversarial corpus (5k synthetic export-const declarations plus an 8MB single-line file).

### 1. Every `export const` produces two duplicate nodes

`export const X = ...` was producing two nodes for the same symbol:
- one with `kind: 'variable'` from `extractExportedVariables`
- one with `kind: 'constant'` from `extractVariable` (invoked when the walker descended into the `export_statement` child)

Stress-testing showed 100% duplication across 5,003 `export const` declarations.

The dedicated `extractVariable` dispatch is the correct one: it picks `kind` from `isConst`, captures the initializer signature, and walks type annotations. The `extractExportedVariables` helper was redundant because each language extractor's `isExported` predicate already walks the parent chain to detect the export wrapper. Removed the `export_statement` branch from the walker dispatch (children are descended into normally) and dropped the private helper.

### 2. `maxFileSize` silently ignored on the bulk-index path

`extractFile()` (single-file API) checked `stats.size > config.maxFileSize`, but the bulk `indexAll()` path read each file's stats and never compared. Vendored generated files (multi-MB headers, minified bundles) were indexed regardless of the user's cap.

Mirrored the single-file behaviour: emit a `size_exceeded` warning, count the file as skipped, advance progress, continue.

## Test plan

Verified live against a stress workspace (5,005 synthetic files; 50,000 fns in one 3MB file; 8MB single-line file; 5,000 `export const` declarations):

| Metric | Before | After |
|--------|--------|-------|
| Duplicate var/const node sets | 5,003 (100%) | 0 |
| Files >1MB indexed despite `maxFileSize: 1MB` | 2 | 0 |
| Total nodes | 65,014 | 10,008 |

- [x] `npx vitest run`: 380 passed (was 374 passed, 6 failed before this fix; the 6 failing tests asserted the duplicate behavior, now updated to match the correct `kind: 'constant'`)
- [x] `npx tsc --noEmit` passes
- [x] `npm run build` succeeds
- [x] Live `codegraph index` against stress corpus produces the expected node counts and `size_exceeded` warnings

🤖 Generated with Claude Code
Copied from colbymchenry/codegraph#129