feat(benchmark): expand postgres question set and add results #217

Merged: jafreck merged 1 commit into main from benchmark/postgres-questions on Mar 16, 2026

@jafreck (Owner) commented Mar 16, 2026

Summary

Expands the PostgreSQL benchmark question set from 11 to 15 questions and adds a dedicated results page for the postgres benchmark run.

Changes

Ground truth fixes

  • Removed question 3.3 (top-level module dependency map) — the ground truth expected src/backend/* subdirectory-level dependencies, but the question asks for "top-level modules". Both arms answered correctly at the actual top-level granularity (src/port, src/common, src/backend, etc.) but scored 0 because the expected answer was at the wrong level. Neither arm used Lore tools for this question.

New questions (verified against postgres source)

| ID | Category | Ground Truth |
|---|---|---|
| 1.6 | Dead code detection | src/backend/parser/analyze.c: all exported functions are called → "None" |
| 2.1 | Type hierarchy | TupleTableSlotOps vtable → 4 implementations: TTSOpsVirtual, TTSOpsHeapTuple, TTSOpsMinimalTuple, TTSOpsBufferHeapTuple |
| 3.5 | External packages | libxml2 → src/backend/utils/adt, libxslt → contrib/xml2, uuid → contrib/uuid-ossp |
| 5.1 | Semantic similarity | Functions similar to heap_insert: heap_update, heap_delete, heap_multi_insert, simple_heap_insert |
| 12.1 | Architecture layers | No layering violations → "None" |

Benchmark results page

New docs/benchmark-results-postgres.md with full results from a 15-question run:

| Metric | Control | Lore | Delta |
|---|---|---|---|
| Mean correctness | 69.4% | 72.2% | +2.8pp |
| First-pass accuracy | 33.3% | 53.3% | +20.0pp |
| Mean tokens | 11,584 | 3,096 | -73.3% |
| Mean wall time | 134.8 s | 60.3 s | -55.3% |

Biggest wins: task 12.1 (layer violations, -99% tokens) and 1.4 (3-hop blast radius, -95% tokens).
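
For readers checking the Delta column, the sketch below recomputes it from the Control and Lore values in the table. It assumes that the accuracy rows are absolute differences in percentage points and that the token and wall-time rows are relative changes against the control arm. The function names are illustrative and are not taken from the benchmark harness.

```python
def relative_change_pct(control: float, lore: float) -> float:
    """Relative change of the Lore arm versus the control arm, in percent."""
    return (lore - control) / control * 100.0


def point_delta(control_pct: float, lore_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return lore_pct - control_pct


# Values copied from the results table above.
print(f"{point_delta(69.4, 72.2):+.1f}pp")          # +2.8pp  (mean correctness)
print(f"{point_delta(33.3, 53.3):+.1f}pp")          # +20.0pp (first-pass accuracy)
print(f"{relative_change_pct(11584, 3096):+.1f}%")  # -73.3%  (mean tokens)
print(f"{relative_change_pct(134.8, 60.3):+.1f}%")  # -55.3%  (mean wall time)
```

Under these assumptions, all four reported deltas are reproduced to one decimal place.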

Test results

  • 15/15 tasks completed, 14/15 passed
  • 1 assertion failure: task 2.1. Both arms scored perfectly, but the lore arm used grep instead of any lore_* tool (the question was simple enough to answer with grep on a C struct pattern)
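
The tool-usage assertion that failed for task 2.1 can be pictured roughly as follows. This is a minimal sketch, not the harness's actual check: it assumes each run records the names of the tools the agent invoked, and `lore_search` is a hypothetical tool name used only for illustration. The only detail taken from this PR is the `lore_` prefix.

```python
def used_lore_tool(tool_calls: list[str], prefix: str = "lore_") -> bool:
    """Return True if at least one invoked tool name starts with the prefix."""
    return any(name.startswith(prefix) for name in tool_calls)


# A run that answered correctly using only grep: the answer can score
# perfectly while the tool-usage assertion still fails.
grep_only_run = ["grep", "grep"]
print(used_lore_tool(grep_only_run))   # False

# A run that called a (hypothetical) lore_* tool at least once.
mixed_run = ["grep", "lore_search"]
print(used_lore_tool(mixed_run))       # True
```

A check of this shape measures tool adoption rather than answer quality, which is why a task can score perfectly in both arms and still count as a failed assertion.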

- Remove question 3.3 (module dependency map) whose ground truth was
  at the wrong granularity for a C codebase
- Add 5 new verified questions: 1.6 (dead code), 2.1 (type hierarchy),
  3.5 (external packages), 5.1 (semantic similarity), 12.1 (architecture)
- Postgres now has 15 questions (up from 11)
- Add benchmark results page for postgres run (73% token reduction,
  55% wall-time reduction, +20pp first-pass accuracy)

codecov bot commented Mar 16, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.49%. Comparing base (a564531) to head (eb90133).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #217   +/-   ##
=======================================
  Coverage   87.49%   87.49%           
=======================================
  Files          85       85           
  Lines        9475     9475           
  Branches     2932     2932           
=======================================
  Hits         8290     8290           
  Misses       1185     1185           

@jafreck jafreck merged commit af76cab into main Mar 16, 2026
3 checks passed