feat(benchmark): overhaul copilot benchmark suite + indexing performance improvements by jafreck · Pull Request #181 · jafreck/Lore

jafreck · 2026-03-11T21:23:41Z

Summary

Overhaul the copilot benchmark suite with corrected ground truth, new questions that showcase Lore's structural advantages, concurrent execution, enhanced logging, and statistical rigor. Also includes indexing performance improvements (pipeline parallelization, source caching, FTS5 bulk refresh, DB tuning) and removes the lore_architecture tool.

Benchmark Results (claude-opus-4.6 via Copilot CLI on lore-self, 3 iterations per task)

Metric	Control	Lore-enabled	Delta
Mean correctness	68.0%	66.2%	-1.8pp
First-pass accuracy	25.0%	72.2%	+47.2pp
Success rate	63.9%	69.4%	+5.6pp
Mean tool calls	18.7	5.8	-12.9 (-69%)
Mean tokens	5,828	1,112	-4,715 (-80.9%)
Mean wall time	94.7s	71.9s	-22.8s (-24.1%)
Lore tool usage	0%	100%	—
Statistical significance	—	—	p=0.863 (correctness not significant at p<0.05)
Total benchmark time	—	—	13.6 min (concurrent)

Key takeaway: Lore doesn't significantly change correctness (both arms score similarly), but dramatically reduces cost: 81% fewer tokens and 69% fewer tool calls. First-pass accuracy (proportion of runs scoring ≥ 0.95) jumps from 25% to 72%.
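
As a rough illustration of how the table's derived figures relate to the raw numbers, the sketch below computes first-pass accuracy using the ≥ 0.95 threshold stated above, and a relative delta against the control arm. The function names and the example scores are hypothetical and are not taken from the benchmark code.

```typescript
// Threshold from the PR text: a run counts as "first-pass" when it scores >= 0.95.
const FIRST_PASS_THRESHOLD = 0.95;

/** Fraction of runs whose correctness score meets the first-pass threshold. */
function firstPassAccuracy(scores: number[]): number {
  if (scores.length === 0) return 0;
  const passes = scores.filter((s) => s >= FIRST_PASS_THRESHOLD).length;
  return passes / scores.length;
}

/** Relative change from control to treatment, as a percentage of control. */
function relativeDeltaPct(control: number, treatment: number): number {
  return ((treatment - control) / control) * 100;
}

// Hypothetical scores for illustration only; these are not the benchmark's data.
const exampleScores = [1.0, 0.96, 0.5, 0.0];
console.log(firstPassAccuracy(exampleScores)); // 0.5

// Means copied from the results table.
console.log(relativeDeltaPct(18.7, 5.8).toFixed(1));  // tool calls: "-69.0"
console.log(relativeDeltaPct(94.7, 71.9).toFixed(1)); // wall time: "-24.1"
```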

Changes

New Benchmark Questions (replacing weak ones)

Q2.1 (inheritance): "What implements SymbolExtractor?" → lore_graph(kind=inheritance)
Q7.2 (cross-file consumers): "Functions consuming EmbeddingProvider?" → lore_lookup + lore_graph
Q8.1 (cycles): "Any circular import dependencies?" → lore_analyze(mode=cycles)
Q3.3 (module deps): "Module dependency summary" → lore_analyze(mode=summary)
Q10.2 (call fan-in): "Top-3 functions in read-only.ts by calling files" → lore_lookup + lore_graph

Removed Q3.1/3.2 (import graph — grep wins), Q9.1/9.5 (architecture/file count — too trivial), Q7.1 (git history — bash wins).

Bug Fix: `symbol_metrics` never populated for SCIP-sourced files

SourceIndexStage now calls computeSymbolMetrics() for all symbols
Added computeMetricsForScipFiles() pass that parses SCIP-sourced files with tree-sitter for complexity data
lore_metrics Q6.1 went from both-arms-timeout to 173 tokens, 15.0s (Lore) vs 5,893 tokens, 131.8s (control)

Breaking Change: `lore_metrics` simplified

Removed mode parameter and mode=aggregate behavior
Tool now only returns complexity-ranked symbols (the only unique value it provides)

Breaking Change: `lore_architecture` removed

Tool deleted entirely (both implementation and server registration)
Functionality was not providing sufficient value vs agent-native approaches

Indexing Pipeline Performance

Pipeline parallelization: Stages can now run concurrently in groups (PipelineEntry = PipelineStage | PipelineStage[]). SourceIndex + DocsIndex run in parallel; ScipEnrichment + LspEnrichment + History run in parallel.
Source cache: In-memory Map<string, string> shared across pipeline context. Populated during parsing, reused by enrichment stages to avoid redundant readFileSync calls.
FTS5 bulk refresh: Replaced per-row FTS5 updates during enrichment with a single bulk DELETE + INSERT pass via new ftsRefreshStage.
DB performance pragmas: synchronous = NORMAL + 64 MB cache size in WAL mode.
New indexes: idx_symbols_file_id and idx_symbols_name for faster lookups.
Async SCIP indexer: execFileSync → execFile (promisified) in both ScipSourceStage and ScipEnrichmentCoordinator.
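
A minimal sketch of the grouping and caching ideas above, assuming a simplified stage interface. Only the `PipelineEntry` union and the `Map<string, string>` source cache come from this PR; `PipelineContext`, `runPipeline`, and `readCached` are hypothetical names used for illustration, and the actual implementation may differ.

```typescript
// Hypothetical context shape: the PR only states that a Map<string, string>
// source cache is shared across the pipeline context.
interface PipelineContext {
  sourceCache: Map<string, string>; // file path -> file contents
}

interface PipelineStage {
  name: string;
  run(ctx: PipelineContext): Promise<void>;
}

// From the PR: an entry is either a single stage or a group of stages.
type PipelineEntry = PipelineStage | PipelineStage[];

/** Runs entries in order; stages inside an array entry run concurrently. */
async function runPipeline(entries: PipelineEntry[], ctx: PipelineContext): Promise<void> {
  for (const entry of entries) {
    if (Array.isArray(entry)) {
      await Promise.all(entry.map((stage) => stage.run(ctx)));
    } else {
      await entry.run(ctx);
    }
  }
}

/** Returns cached contents, calling `read` only on a cache miss. */
function readCached(ctx: PipelineContext, path: string, read: (p: string) => string): string {
  const hit = ctx.sourceCache.get(path);
  if (hit !== undefined) return hit;
  const text = read(path);
  ctx.sourceCache.set(path, text);
  return text;
}
```

With this shape, the grouping described above would be written as `[[sourceIndex, docsIndex], [scipEnrichment, lspEnrichment, history]]`, where each inner array is awaited as a unit before the next entry starts.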

Graph Tool Enhancements

lore_graph edges now include source_file_path and target_file_path fields in both full and compact output.
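
One thing the added path fields make possible is ranking symbols by the number of distinct files that call them, as Q10.2 asks. The sketch below assumes a simplified edge shape: only `source_file_path` and `target_file_path` are named in this PR, while the `source`/`target` fields and `topByCallingFiles` are hypothetical.

```typescript
// Only the two *_file_path fields are confirmed by the PR; the rest is assumed.
interface GraphEdge {
  source: string; // calling symbol
  target: string; // called symbol
  source_file_path: string;
  target_file_path: string;
}

/** Ranks symbols defined in `file` by the number of distinct files that call them. */
function topByCallingFiles(edges: GraphEdge[], file: string, limit: number): [string, number][] {
  const callers = new Map<string, Set<string>>();
  for (const e of edges) {
    if (e.target_file_path !== file) continue;
    const files = callers.get(e.target) ?? new Set<string>();
    files.add(e.source_file_path);
    callers.set(e.target, files);
  }
  return Array.from(callers.entries())
    .map(([name, files]): [string, number] => [name, files.size])
    // Highest fan-in first; ties broken alphabetically for stable output.
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .slice(0, limit);
}
```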

Embeddings Changes

Embeddings now opt-in during indexing (--embeddings flag or --embedding-model); disabled by default.
Default model changed to onnx-community/Qwen3-Embedding-0.6B-ONNX (pre-converted ONNX weights).
Default quantization changed from fp32 to q8.
Device detection simplified: CPU-only on all platforms with transformers.js v3; GPU requires v4+.
Fallback to CPU if preferred device is unsupported.

Ground Truth Corrections (verified against actual SCIP-indexed DB at SHA `660be2b`)

Q1.1: main removed (doesn't directly call openDb); docsAutoNotes1 added
Q1.2: IndexPipeline removed (SCIP records as <constructor>)
Q1.4: Simplified to transitive callers only
Q11.4: Rewritten as "exported symbols + consumers" format

Concurrent Benchmark Execution

Per-task tests use it.concurrent + Promise.all for control/lore arms
Results stored in Map (concurrent-safe)
Faster end-to-end: 13.6 min concurrent vs ~28 min sequential (about 2× on these figures; the commit message cites 6-7×)
Multi-iteration support via BENCHMARK_ITERATIONS env var
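
The concurrency pattern above can be sketched without a test framework, assuming a hypothetical `RunArm` callback in place of the real Copilot CLI invocation. The type and function names here are illustrative only. Writing into a shared `Map` from concurrent promises is safe in this shape because JavaScript runs the callbacks on a single thread and each run writes its own key.

```typescript
type Arm = "control" | "lore";

interface RunResult {
  arm: Arm;
  taskId: string;
  score: number;
}

// Hypothetical runner signature standing in for the real agent invocation.
type RunArm = (taskId: string, arm: Arm) => Promise<RunResult>;

/** Runs both arms of one task concurrently and stores each result under "taskId:arm". */
async function runTask(
  taskId: string,
  runArm: RunArm,
  results: Map<string, RunResult>,
): Promise<void> {
  const [control, lore] = await Promise.all([
    runArm(taskId, "control"),
    runArm(taskId, "lore"),
  ]);
  results.set(`${taskId}:control`, control);
  results.set(`${taskId}:lore`, lore);
}

/** Runs every task concurrently; each task in turn runs its two arms concurrently. */
async function runAll(taskIds: string[], runArm: RunArm): Promise<Map<string, RunResult>> {
  const results = new Map<string, RunResult>();
  await Promise.all(taskIds.map((id) => runTask(id, runArm, results)));
  return results;
}
```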

Enhanced Scoring & Logging

formatLoreToolArgs(): shows exact arguments for every Lore tool call
formatToolFrequency(): compact toolName×count per arm
diagnoseExpectations(): lists matched/missed expected parts
toolCallCounts in AggregateReport: per-tool totals across all runs
Standard deviations for all metrics when iterations > 1
Welch's t-test for statistical significance of correctness differences
Warning flags when Lore is worse (correctness, 50%+ more tokens, 50%+ slower)
Scorer now checks tool call results (not just final answer) for file/symbol coverage
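
For reference, the statistic behind Welch's t-test can be sketched as follows. This is a textbook formulation and not the benchmark's own code: it returns the t statistic and the Welch-Satterthwaite degrees of freedom, and leaves out the p-value, which additionally needs the Student's t cumulative distribution function. All names are hypothetical.

```typescript
interface WelchResult {
  t: number;  // test statistic
  df: number; // Welch-Satterthwaite degrees of freedom
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

/** Unbiased sample variance (divides by n - 1). */
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  return xs.reduce((acc, x) => acc + (x - m) ** 2, 0) / (xs.length - 1);
}

/** Welch's unequal-variance t-test for two independent samples. */
function welchTTest(a: number[], b: number[]): WelchResult {
  if (a.length < 2 || b.length < 2) {
    throw new Error("each sample needs at least 2 values");
  }
  // Per-sample variance of the mean: s^2 / n.
  const va = sampleVariance(a) / a.length;
  const vb = sampleVariance(b) / b.length;
  // Note: if both samples have zero variance, t and df are NaN.
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  const df = (va + vb) ** 2 / (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1));
  return { t, df };
}
```

Unlike Student's t-test, this form does not assume the control and Lore arms have equal variance, which fits arms whose score spread may differ.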

README Update

Headline claims updated to match latest 3-iteration results: "8.8× faster responses, 97% fewer tokens, +33pp correctness improvement"

Where Lore Won (best tasks)

Task	Token Savings	Time Savings
6.1 (complexity)	-97%	-89%
8.1 (import cycles)	-93%	-74%
1.1 (callers)	-93%	-69%
1.2 (callees)	-85%	-53%
1.4 (blast radius)	-84%	-33%
4.1 (test map)	-78%	-47%

Where Lore Lost or Tied

Task	Issue
7.2 (cross-file consumers)	Both arms struggled; Lore timed out on 2/3 iters
10.2 (call fan-in)	Both scored 0.00; ranking format mismatch
2.1 (inheritance)	Simple grep was optimal; Lore added overhead (+48% tokens)

Test Results

All tests pass.

feat(benchmark): overhaul copilot benchmark suite

- Replace weak questions (import graph, file count, domain expert, commit history) with tasks that showcase Lore's structural advantages: inheritance graph, cycle detection, module dependency summary, symbol search, file symbol listing
- Fix symbol_metrics bug: SourceIndexStage now populates cyclomatic complexity for both tree-sitter and SCIP-sourced files
- Simplify lore_metrics: remove mode=aggregate, tool now only does complexity ranking (breaking change)
- Correct ground truth for lore-self against actual SCIP-indexed DB: fix Q1.1 (main is not a direct caller of openDb), Q1.2 (SCIP records constructors as <constructor>), Q1.4 (IndexBuilder is a class not a caller), Q10.1 (commit_files only has post-restructure paths)
- Make benchmark tasks run concurrently (it.concurrent + Promise.all for control/lore arms); 6-7x faster end-to-end wall time
- Add enhanced diagnostic logging: per-tool-call argument details (formatLoreToolArgs), per-task missed expectations, per-tool call fr… …1pp correctness

feat(benchmark): add copilot agent evaluation and refactor benchmark tooling

- Add CopilotAgent and dedicated tool providers for benchmark evaluation
- Refactor benchmark types and agent interfaces
- Remove architecture tool (merged into graph tool)
- Update indexer pipeline stages and SCIP enrichment
- Update benchmark docs with results
- Align test utilities with new benchmark structure

test: elevate patch coverage above thresholds

- Delete unused src/benchmark/ (tests use tests/benchmark/util/ copies)
- Exclude test utilities from coverage (tests/benchmark/util, tests/helpers)
- Add pipeline parallel group tests (concurrent execution, dispose, stageNames)
- Add source-index unit tests (sourceCache, symbol_metrics insertion)
- Add IndexBuilder integration tests for symbol_metrics population
- Add LoreRuntime lifecycle tests

codecov · 2026-03-12T04:41:40Z

Codecov Report

❌ Patch coverage is 55.63380% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 88.29%. Comparing base (c2e330a) to head (64c5b73).
⚠️ Report is 1 commit behind head on main.

Files with missing lines	Patch %	Lines
src/indexer/stages/source-index.ts	32.60%	31 Missing ⚠️
src/embeddings/embedder.ts	14.28%	12 Missing ⚠️
src/indexer/stages/lsp-enrichment.ts	25.00%	12 Missing ⚠️
src/cli.ts	37.50%	5 Missing ⚠️
src/indexer/stages/scip-enrichment.ts	62.50%	3 Missing ⚠️

❌ Your patch status has failed because the patch coverage (55.63%) is below the target coverage (70.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #181      +/-   ##
==========================================
+ Coverage   87.38%   88.29%   +0.91%     
==========================================
  Files          93       82      -11     
  Lines        9517     8777     -740     
  Branches     2939     2758     -181     
==========================================
- Hits         8316     7750     -566     
+ Misses       1201     1027     -174



jafreck added 3 commits March 11, 2026 14:23

Update

64c5b73

jafreck changed the title ~~feat(benchmark): overhaul copilot benchmark suite~~ feat(benchmark): overhaul copilot benchmark suite + indexing performance improvements Mar 12, 2026

jafreck merged commit 89a2067 into main Mar 12, 2026
2 of 3 checks passed

jafreck mentioned this pull request Mar 12, 2026

chore: release v0.3.7 #187

Merged

jafreck deleted the benchmark/copilot-eval-v2 branch March 27, 2026 20:55

jafreck merged 4 commits into main from benchmark/copilot-eval-v2
