feat(benchmark): expand postgres question set and add results by jafreck · Pull Request #217 · jafreck/Lore

jafreck · 2026-03-16T01:29:39Z

Summary

Expands the PostgreSQL benchmark question set from 11 to 15 questions and adds a dedicated results page for the postgres benchmark run.

Changes

Ground truth fixes

Removed question 3.3 (top-level module dependency map) — the ground truth expected src/backend/* subdirectory-level dependencies, but the question asks for "top-level modules". Both arms answered correctly at the actual top-level granularity (src/port, src/common, src/backend, etc.) but scored 0 because the expected answer was at the wrong level. Neither arm used Lore tools for this question.
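The granularity mismatch can be illustrated with a small sketch. It assumes dependency edges are stored as pairs of directory paths. The edges listed are hypothetical placeholders, not data from the benchmark, and `module_at_depth` and `collapse_edges` are illustrative names rather than part of Lore.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]

# Hypothetical edges for illustration only; not taken from the benchmark run.
EDGES = [
    ("src/backend/executor", "src/backend/access"),
    ("src/backend/utils", "src/common"),
    ("src/backend/storage", "src/port"),
]


def module_at_depth(path: str, depth: int) -> str:
    """Truncate a directory path to its first `depth` components."""
    return "/".join(path.split("/")[:depth])


def collapse_edges(edges: Iterable[Edge], depth: int) -> Set[Edge]:
    """Collapse edges to the given depth, dropping edges that become self-loops."""
    collapsed = set()
    for src, dst in edges:
        a, b = module_at_depth(src, depth), module_at_depth(dst, depth)
        if a != b:
            collapsed.add((a, b))
    return collapsed


top_level = collapse_edges(EDGES, depth=2)     # src/backend, src/common, src/port
subdir_level = collapse_edges(EDGES, depth=3)  # src/backend/* granularity
print(sorted(top_level))
print(sorted(subdir_level))
```

Under these assumptions the two sets differ, so an answer that is correct at depth 2 cannot match a ground truth written at depth 3, which is consistent with both arms scoring 0.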

New questions (verified against postgres source)

| ID | Category | Ground Truth |
|---|---|---|
| 1.6 | Dead code detection | `src/backend/parser/analyze.c`: all exported functions are called → "None" |
| 2.1 | Type hierarchy | `TupleTableSlotOps` vtable → 4 implementations: `TTSOpsVirtual`, `TTSOpsHeapTuple`, `TTSOpsMinimalTuple`, `TTSOpsBufferHeapTuple` |
| 3.5 | External packages | `libxml2 → src/backend/utils/adt`, `libxslt → contrib/xml2`, `uuid → contrib/uuid-ossp` |
| 5.1 | Semantic similarity | Functions similar to `heap_insert`: `heap_update`, `heap_delete`, `heap_multi_insert`, `simple_heap_insert` |
| 12.1 | Architecture layers | No layering violations → "None" |

Benchmark results page

New docs/benchmark-results-postgres.md with full results from a 15-question run:

| Metric | Control | Lore | Delta |
|---|---|---|---|
| Mean correctness | 69.4% | 72.2% | +2.8pp |
| First-pass accuracy | 33.3% | 53.3% | +20.0pp |
| Mean tokens | 11,584 | 3,096 | -73.3% |
| Mean wall time | 134.8 s | 60.3 s | -55.3% |
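The Delta column mixes two kinds of change: percentage points (pp) for the two accuracy rows and relative change for tokens and wall time. A minimal sketch of that arithmetic, using the values from the table, is below. The function names are illustrative, and this is not necessarily how the benchmark harness computes the figures.

```python
def pp_delta(control_pct: float, lore_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return round(lore_pct - control_pct, 1)


def relative_change(control: float, lore: float) -> float:
    """Change relative to the control value, expressed as a percentage."""
    return round((lore - control) / control * 100, 1)


print(pp_delta(69.4, 72.2))          # mean correctness
print(pp_delta(33.3, 53.3))          # first-pass accuracy
print(relative_change(11584, 3096))  # mean tokens
print(relative_change(134.8, 60.3))  # mean wall time (seconds)
```

The results reproduce the table: 2.8 and 20.0 percentage points, then -73.3 and -55.3 percent.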

Biggest wins: task 12.1 (layer violations, -99% tokens) and 1.4 (3-hop blast radius, -95% tokens).

Test results

15/15 tasks completed, 14/15 passed
1 assertion failure: task 2.1. Both arms scored perfectly, but the Lore arm used grep instead of any lore_* tool (the question is simple enough to answer with grep on a C struct pattern).
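To show why a plain pattern search can answer task 2.1, here is a rough sketch that finds `TupleTableSlotOps` definitions with a regular expression, followed by a check in the spirit of the tool-usage assertion. The C excerpt is an abbreviated illustration with initializer bodies elided. The list-of-tool-names format and the name `lore_example_tool` are hypothetical and do not describe the harness's actual log format.

```python
import re
from typing import List

# Abbreviated, illustrative excerpt; initializer contents are elided.
SAMPLE_C = """
const TupleTableSlotOps TTSOpsVirtual = { /* callbacks elided */ };
const TupleTableSlotOps TTSOpsHeapTuple = { /* callbacks elided */ };
const TupleTableSlotOps TTSOpsMinimalTuple = { /* callbacks elided */ };
const TupleTableSlotOps TTSOpsBufferHeapTuple = { /* callbacks elided */ };
"""

# Match "const TupleTableSlotOps <name> =" at the start of a line.
VTABLE_DEF = re.compile(
    r"^\s*(?:static\s+)?const\s+TupleTableSlotOps\s+(\w+)\s*=", re.MULTILINE
)


def find_implementations(source: str) -> List[str]:
    """Return the names of TupleTableSlotOps instances defined in `source`."""
    return VTABLE_DEF.findall(source)


def used_lore_tool(tool_calls: List[str]) -> bool:
    """True if any recorded tool name starts with the lore_ prefix."""
    return any(name.startswith("lore_") for name in tool_calls)


print(find_implementations(SAMPLE_C))
print(used_lore_tool(["grep"]))                       # grep-only run
print(used_lore_tool(["grep", "lore_example_tool"]))  # hypothetical tool name
```

On the excerpt, the pattern finds the four implementations listed in the ground truth, and a grep-only run fails the prefix check even though its answer is correct.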

- Remove question 3.3 (module dependency map) whose ground truth was at the wrong granularity for a C codebase
- Add 5 new verified questions: 1.6 (dead code), 2.1 (type hierarchy), 3.5 (external packages), 5.1 (semantic similarity), 12.1 (architecture)
- Postgres now has 15 questions (up from 11)
- Add benchmark results page for postgres run (73% token reduction, 55% wall-time reduction, +20pp first-pass accuracy)

codecov · 2026-03-16T01:31:24Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.49%. Comparing base (a564531) to head (eb90133).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #217   +/-   ##
=======================================
  Coverage   87.49%   87.49%           
=======================================
  Files          85       85           
  Lines        9475     9475           
  Branches     2932     2932           
=======================================
  Hits         8290     8290           
  Misses       1185     1185

jafreck merged commit af76cab into main Mar 16, 2026
3 checks passed

jafreck merged 1 commit into main from benchmark/postgres-questions