feat(benchmark): overhaul copilot benchmark suite + indexing performance improvements #181
Merged
- Replace weak questions (import graph, file count, domain expert, commit history) with tasks that showcase Lore's structural advantages: inheritance graph, cycle detection, module dependency summary, symbol search, file symbol listing
- Fix symbol_metrics bug: SourceIndexStage now populates cyclomatic complexity for both tree-sitter and SCIP-sourced files
- Simplify lore_metrics: remove mode=aggregate, tool now only does complexity ranking (breaking change)
- Correct ground truth for lore-self against actual SCIP-indexed DB: fix Q1.1 (main is not a direct caller of openDb), Q1.2 (SCIP records constructors as <constructor>), Q1.4 (IndexBuilder is a class, not a caller), Q10.1 (commit_files only has post-restructure paths)
- Make benchmark tasks run concurrently (it.concurrent + Promise.all for control/lore arms), giving 6-7x faster end-to-end wall time
- Add enhanced diagnostic logging: per-tool-call argument details (formatLoreToolArgs), per-task missed expectations, per-tool call frequency
…tooling
- Add CopilotAgent and dedicated tool providers for benchmark evaluation
- Refactor benchmark types and agent interfaces
- Remove architecture tool (merged into graph tool)
- Update indexer pipeline stages and SCIP enrichment
- Update benchmark docs with results
- Align test utilities with new benchmark structure
- Delete unused src/benchmark/ (tests use tests/benchmark/util/ copies)
- Exclude test utilities from coverage (tests/benchmark/util, tests/helpers)
- Add pipeline parallel group tests (concurrent execution, dispose, stageNames)
- Add source-index unit tests (sourceCache, symbol_metrics insertion)
- Add IndexBuilder integration tests for symbol_metrics population
- Add LoreRuntime lifecycle tests
Codecov Report
❌ Patch coverage is 55.63%, which is below the target coverage of 70.00%, so the patch status check failed. You can increase the patch coverage or adjust the target coverage.
@@ Coverage Diff @@
## main #181 +/- ##
==========================================
+ Coverage 87.38% 88.29% +0.91%
==========================================
Files 93 82 -11
Lines 9517 8777 -740
Branches 2939 2758 -181
==========================================
- Hits 8316 7750 -566
+ Misses 1201 1027 -174
jafreck added a commit that referenced this pull request on Mar 27, 2026
…nce improvements (#181)
* feat(benchmark): overhaul copilot benchmark suite
* feat(benchmark): add copilot agent evaluation and refactor benchmark tooling
* test: elevate patch coverage above thresholds
* Update
Summary
Overhaul the copilot benchmark suite with corrected ground truth, new questions that showcase Lore's structural advantages, concurrent execution, enhanced logging, and statistical rigor. Also includes indexing performance improvements (pipeline parallelization, source caching, FTS5 bulk refresh, DB tuning) and removes the lore_architecture tool.
Benchmark Results (claude-opus-4.6 via Copilot CLI on lore-self, 3 iterations per task)
Key takeaway: Lore doesn't significantly change correctness (both arms score similarly), but dramatically reduces cost: 81% fewer tokens and 69% fewer tool calls. First-pass accuracy (proportion of runs scoring ≥ 0.95) jumps from 25% to 72%.
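As a rough illustration of the first-pass accuracy metric defined above (the proportion of runs scoring ≥ 0.95), here is a minimal sketch. The helper name, its signature, and the sample scores are assumptions made for illustration; they are not taken from the benchmark code or its results.

```typescript
// Hypothetical helper: name and signature are illustrative only.
// Returns the fraction of runs whose score meets the threshold.
function firstPassAccuracy(scores: number[], threshold: number = 0.95): number {
  if (scores.length === 0) {
    return 0; // no runs: define the proportion as 0 rather than NaN
  }
  const passing = scores.filter((score) => score >= threshold).length;
  return passing / scores.length;
}

// Made-up scores, not benchmark data: three of the four runs reach 0.95.
const exampleScores: number[] = [1.0, 0.95, 0.6, 0.97];
console.log(firstPassAccuracy(exampleScores)); // 0.75
```

A threshold-based proportion like this is sensitive to near-misses (a run at 0.94 counts the same as a run at 0), which is one reason it can move sharply even when mean correctness is similar across arms.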
Changes
New Benchmark Questions (replacing weak ones)
- …SymbolExtractor?" → lore_graph(kind=inheritance)
- …EmbeddingProvider?" → lore_lookup + lore_graph
- … → lore_analyze(mode=cycles)
- … → lore_analyze(mode=summary)
- …read-only.ts by calling files" → lore_lookup + lore_graph
Removed Q3.1/3.2 (import graph: grep wins), Q9.1/9.5 (architecture/file count: too trivial), Q7.1 (git history: bash wins).
Bug Fix: symbol_metrics never populated for SCIP-sourced files
- SourceIndexStage now calls computeSymbolMetrics() for all symbols
- computeMetricsForScipFiles() pass that parses SCIP-sourced files with tree-sitter for complexity data
- lore_metrics Q6.1 went from both-arms-timeout to 173 tokens, 15.0s (Lore) vs 5,893 tokens, 131.8s (control)
Breaking Change: lore_metrics simplified
- Removed mode parameter and mode=aggregate behavior
Breaking Change: lore_architecture removed
Indexing Pipeline Performance
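The parallel stage groups and the shared source cache covered in this section can be pictured with a small sketch. Only the `PipelineEntry = PipelineStage | PipelineStage[]` union and the `Map<string, string>` cache type come from the PR text; the `PipelineStage` and `PipelineContext` shapes, `runPipeline`, `readCached`, and `makeStage` are hypothetical names assumed here, and the real implementation may differ.

```typescript
// Assumed stage shape; the PR text only gives the PipelineEntry union below.
interface PipelineStage {
  name: string;
  run(ctx: PipelineContext): Promise<void>;
}

// Quoted from the PR: an entry is one stage or a group of stages run in parallel.
type PipelineEntry = PipelineStage | PipelineStage[];

// Assumed context shape: a source cache mapping file path -> file contents.
interface PipelineContext {
  sourceCache: Map<string, string>;
  log: string[];
}

// Entries run in order; stages inside an array entry run concurrently.
async function runPipeline(entries: PipelineEntry[], ctx: PipelineContext): Promise<void> {
  for (const entry of entries) {
    if (Array.isArray(entry)) {
      await Promise.all(entry.map((stage) => stage.run(ctx)));
    } else {
      await entry.run(ctx);
    }
  }
}

// Reads through the cache so later stages do not hit the disk again.
// The reader is injected (e.g. a wrapper around readFileSync) to keep the sketch self-contained.
function readCached(ctx: PipelineContext, path: string, read: (p: string) => string): string {
  let text = ctx.sourceCache.get(path);
  if (text === undefined) {
    text = read(path);
    ctx.sourceCache.set(path, text);
  }
  return text;
}

// Trivial stage factory used only for this illustration.
const makeStage = (name: string): PipelineStage => ({
  name,
  async run(ctx: PipelineContext): Promise<void> {
    ctx.log.push(name);
  },
});

// Grouping that mirrors the two parallel groups named in the PR description.
const examplePipeline: PipelineEntry[] = [
  [makeStage("SourceIndex"), makeStage("DocsIndex")],
  [makeStage("ScipEnrichment"), makeStage("LspEnrichment"), makeStage("History")],
];
```

Running groups with `Promise.all` means a group takes roughly as long as its slowest stage, while the sequential outer loop still guarantees that enrichment stages only start after indexing has finished.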
- Pipeline parallelization (PipelineEntry = PipelineStage | PipelineStage[]). SourceIndex + DocsIndex run in parallel; ScipEnrichment + LspEnrichment + History run in parallel.
- Source caching: Map<string, string> shared across pipeline context. Populated during parsing, reused by enrichment stages to avoid redundant readFileSync calls.
- FTS5 bulk refresh: DELETE + INSERT pass via new ftsRefreshStage.
- DB tuning: synchronous = NORMAL + 64 MB cache size in WAL mode.
- idx_symbols_file_id and idx_symbols_name for faster lookups.
- execFileSync → execFile (promisified) in both ScipSourceStage and ScipEnrichmentCoordinator.
Graph Tool Enhancements
- lore_graph edges now include source_file_path and target_file_path fields in both full and compact output.
Embeddings Changes
- … --embeddings flag or --embedding-model); disabled by default.
- onnx-community/Qwen3-Embedding-0.6B-ONNX (pre-converted ONNX weights).
- … fp32 to q8.
Ground Truth Corrections (verified against actual SCIP-indexed DB at SHA 660be2b)
- Q1.1: main removed (doesn't directly call openDb); docsAutoNotes1 added
- Q1.2: IndexPipeline removed (SCIP records as <constructor>)
Concurrent Benchmark Execution
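A minimal sketch of the concurrency pattern this section refers to: both arms of a task are awaited together with `Promise.all`, and results are collected in a `Map` keyed by task. The `Arm`, `ArmResult`, and `RunArm` types and the `runTask` function are hypothetical stand-ins; the test-runner side (`it.concurrent`) and the `BENCHMARK_ITERATIONS` env var handling are not shown, and the iteration count is simply passed in as a parameter.

```typescript
type Arm = "control" | "lore";

// Assumed result shape for one agent run.
interface ArmResult {
  arm: Arm;
  score: number;
}

// Hypothetical stand-in for launching one agent run for a task.
type RunArm = (arm: Arm, taskId: string) => Promise<ArmResult>;

// Runs both arms concurrently for each iteration and stores the task's results.
async function runTask(
  taskId: string,
  iterations: number,
  runArm: RunArm,
  results: Map<string, ArmResult[]>,
): Promise<void> {
  const collected: ArmResult[] = [];
  for (let i = 0; i < iterations; i++) {
    // Control and Lore arms start together; wait for both before the next iteration.
    const pair = await Promise.all([runArm("control", taskId), runArm("lore", taskId)]);
    collected.push(...pair);
  }
  // Each task writes only its own key, so concurrently running tasks never share an entry.
  results.set(taskId, collected);
}
```

Because JavaScript runs callbacks on a single thread, writes to the `Map` from concurrently running tasks cannot interleave mid-operation; keeping one key per task is what keeps the results from overwriting each other.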
- it.concurrent + Promise.all for control/lore arms
- Map (concurrent-safe)
- BENCHMARK_ITERATIONS env var
Enhanced Scoring & Logging
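To make the compact toolName×count summary in this section concrete, here is one way such a formatter could look. This is a sketch under assumptions: the PR does not show the signature or exact output of `formatToolFrequency()`, so the function below uses a different, hypothetical name (`sketchToolFrequency`), takes a plain list of tool names, and picks its own ordering and separator.

```typescript
// Hypothetical formatter: counts tool calls and renders them as "name×count" pairs.
// Ordering (most frequent first, then alphabetical) and the ", " separator are assumptions.
function sketchToolFrequency(toolNames: string[]): string {
  const counts = new Map<string, number>();
  for (const name of toolNames) {
    counts.set(name, (counts.get(name) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .sort((a, b) => b[1] - a[1] || a[0].localeCompare(b[0]))
    .map(([name, count]) => `${name}×${count}`)
    .join(", ");
}

// Example with tool names that appear in this PR; the call sequence is made up.
console.log(sketchToolFrequency(["lore_graph", "lore_lookup", "lore_graph"]));
// lore_graph×2, lore_lookup×1
```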
- formatLoreToolArgs(): shows exact arguments for every Lore tool call
- formatToolFrequency(): compact toolName×count per arm
- diagnoseExpectations(): lists matched/missed expected parts
- toolCallCounts in AggregateReport: per-tool totals across all runs
README Update
Where Lore Won (best tasks)
Where Lore Lost or Tied
Test Results
All tests pass.