Add Phase 1 benchmark runner and CI gates #14

Merged
magic-alt merged 4 commits into chore/phase0-eval-gates from feat/phase1-benchmark-runner on May 22, 2026

Conversation

@magic-alt
Owner

Summary

This stacked PR advances the remaining Phase 1 benchmark work on top of chore/phase0-eval-gates.

It adds a runnable benchmark manifest flow for local datasets, broadens key-fact evaluation coverage, introduces the first committed local private manifest batches, and wires the eval threshold gate into CI and local CI.

What Changed

  • Extended scripts/eval.py to:
    • run local benchmark manifests via --benchmark-manifest
    • emit document-level metrics in the report
    • compute per-key-fact aggregate accuracy metrics
    • keep the synthetic golden suite as the default path
  • Expanded synthetic golden coverage in tests/golden/conftest.py so the suite now exercises the key facts needed for gating.
  • Added alias-aware metric handling for concept names such as operating_cf vs operating_cash_flow.
  • Extended normalization and statement extraction heuristics for:
    • gross_profit
    • operating_income
    • cash_and_equivalents
    • operating_cash_flow
    • capex
  • Added committed local manifest batches under benchmarks/local_manifests/ for:
    • US 10-K / 10-Q filings
    • stress cases
  • Updated benchmarks/README.md with local private dataset usage.
  • Updated benchmarks/thresholds/golden_minimum.json to gate the new key-fact metrics.
  • Wired the eval threshold gate into:
    • .github/workflows/ci.yml
    • scripts/local_ci.sh
  • Added/updated tests for:
    • benchmark manifest structure
    • eval runner behavior
    • key-fact metrics
    • normalization aliases
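
As a rough illustration of the alias-aware per-key-fact metric described above, the sketch below counts a fact as correct when the extracted value matches the expected one after both concept names are mapped to a canonical form. The alias table, function names, and tolerance are assumptions made for this example and are not taken from src/utils/metrics.py.

```python
from collections import defaultdict

# Hypothetical alias table: maps variant concept names to one canonical name.
CONCEPT_ALIASES = {
    "operating_cf": "operating_cash_flow",
}


def canonical(concept):
    """Return the canonical concept name, or the input if it has no alias."""
    return CONCEPT_ALIASES.get(concept, concept)


def key_fact_accuracy(cases, rel_tol=1e-6):
    """Compute per-key-fact accuracy over (expected, extracted) dict pairs."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for expected, extracted in cases:
        # Normalize extracted names so that aliases compare equal.
        found = {canonical(name): value for name, value in extracted.items()}
        for name, want in expected.items():
            key = canonical(name)
            totals[key] += 1
            got = found.get(key)
            if got is not None and abs(got - want) <= rel_tol * max(1.0, abs(want)):
                hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


# Two illustrative documents: the first reports the alias name, the second
# extracts a wrong capex value.
cases = [
    ({"operating_cash_flow": 120.0, "capex": 30.0},
     {"operating_cf": 120.0, "capex": 30.0}),
    ({"operating_cash_flow": 80.0, "capex": 10.0},
     {"operating_cash_flow": 80.0, "capex": 12.0}),
]
print(key_fact_accuracy(cases))
```

Canonicalizing both sides before comparison keeps the metric independent of which spelling the extractor or the manifest happens to use.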

Why

The previous slice established the fact layer and the first eval runner, but Phase 1 still lacked three operational pieces:

  1. a runnable local benchmark manifest path
  2. per-key-fact metrics aligned with the roadmap acceptance criteria
  3. a real CI gate that fails when eval quality regresses

This PR closes those gaps and makes the benchmark/eval layer reviewable and enforceable.
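
For reviewers unfamiliar with the gating idea, here is a minimal sketch of a threshold check that fails when any metric drops below its minimum. The metric names and the flat name-to-minimum layout are assumptions for illustration; the actual schema of benchmarks/thresholds/golden_minimum.json and the logic in scripts/eval.py may differ.

```python
import sys


def check_thresholds(metrics, thresholds):
    """Return a list of failure messages; an empty list means the gate passes."""
    failures = []
    for name, minimum in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # Assumption: a gated metric that is missing counts as a failure.
            failures.append(f"{name}: metric missing (minimum {minimum:.3f})")
        elif value < minimum:
            failures.append(f"{name}: {value:.3f} < minimum {minimum:.3f}")
    return failures


def gate_exit_code(metrics, thresholds):
    """Print failures to stderr and return a process exit code (0 = pass)."""
    failures = check_thresholds(metrics, thresholds)
    for line in failures:
        print(f"FAIL {line}", file=sys.stderr)
    return 1 if failures else 0


# Hypothetical metric names and values, for illustration only.
example_metrics = {
    "key_fact_accuracy.capex": 0.95,
    "key_fact_accuracy.gross_profit": 0.70,
}
example_thresholds = {
    "key_fact_accuracy.capex": 0.90,
    "key_fact_accuracy.gross_profit": 0.80,
}
print(gate_exit_code(example_metrics, example_thresholds))
```

A CI step can pass the returned code to sys.exit so that a regression in any gated metric fails the job.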

Validation

  • python -m pytest tests/test_metrics.py tests/test_eval_script.py tests/test_benchmark_manifest.py tests/test_normalizer_and_events.py tests/golden/test_golden.py -q
  • python scripts/eval.py --skip-pytest --thresholds benchmarks/thresholds/golden_minimum.json --output-dir data/eval-gate-check
  • python scripts/eval.py --benchmark-manifest benchmarks/local_manifests/us_filings_private_batch.json --skip-pytest --output-dir data/eval-us-private
  • python scripts/eval.py --benchmark-manifest benchmarks/local_manifests/stress_private_batch.json --skip-pytest --output-dir data/eval-stress-private

Notes

  • The committed local manifests point to ignored private paths under benchmarks/private/; raw PDFs remain out of git.
  • This PR is intentionally stacked on chore/phase0-eval-gates so the review stays focused on the Phase 1 delta.
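
Because the private PDFs are not in git, a fresh checkout will not have the files the local manifests point to. The sketch below shows one way to resolve each case's source path relative to the manifest directory and separate runnable cases from missing ones. The case layout (a "source" mapping with "type" and "path" keys) and both helper names are assumptions for this example; whether scripts/eval.py skips or fails on missing files is not stated here.

```python
from pathlib import Path

SUPPORTED_SOURCE_TYPES = {"synthetic", "pdf"}


def resolve_case_source(case, manifest_dir):
    """Validate the source type, then resolve its path against the manifest directory."""
    source = case["source"]
    source_type = source["type"]
    if source_type not in SUPPORTED_SOURCE_TYPES:
        raise ValueError(f"Unsupported benchmark source type '{source_type}'")
    path = Path(source["path"])
    if not path.is_absolute():
        path = Path(manifest_dir) / path
    return source_type, path


def split_available(cases, manifest_dir):
    """Partition cases into those whose source file exists and those without one."""
    available, missing = [], []
    for case in cases:
        _, path = resolve_case_source(case, manifest_dir)
        (available if path.exists() else missing).append(case)
    return available, missing


# Hypothetical case entry; the relative path is illustrative only.
example_case = {"source": {"type": "pdf", "path": "../private/example_10k.pdf"}}
print(resolve_case_source(example_case, "benchmarks/local_manifests"))
```

Validating the source type before touching the path means an unsupported entry fails with a clear message rather than a file error.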

Copilot AI review requested due to automatic review settings May 22, 2026 05:49

Copilot AI left a comment

Pull request overview

Adds a Phase 1 benchmark-manifest evaluation path and enforces quality gates in CI/local CI, while expanding the “golden” synthetic suite and normalization/alias handling to cover additional key financial facts.

Changes:

  • Extended scripts/eval.py to run manifest-driven benchmarks (synthetic fixtures + local PDFs), emit per-document metrics, and compute per-key-fact aggregate accuracy metrics.
  • Added concept alias support (notably operating_cf vs operating_cash_flow) and expanded extraction/normalization heuristics for additional key facts.
  • Introduced CI/local-CI eval threshold gating and committed sample + private local manifest batches plus fixtures/docs.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/test_normalizer_and_events.py | Adds normalization tests for new aliases (cash equivalents, capex). |
| tests/test_metrics.py | Updates fact helper and adds a test for operating cash flow alias handling in metrics. |
| tests/test_eval_script.py | Adds tests for running a sample benchmark manifest, PDF source handling, and key-fact alias metrics. |
| tests/test_benchmark_manifest.py | Adds assertions for committed local-private manifests and new threshold metrics. |
| tests/golden/conftest.py | Expands golden synthetic cases to include additional expected key facts (cash, gross profit, OI, OCF, capex). |
| src/utils/metrics.py | Introduces concept alias lookup and uses it in statement/fact accuracy calculations. |
| src/finance/normalizer.py | Adds new account-name normalizations for key facts and additional language variants. |
| src/agent/nodes.py | Expands totals capture and label-based extraction heuristics for additional key facts across statements. |
| scripts/local_ci.sh | Adds eval-thresholds step and includes scripts/ in ruff linting. |
| scripts/eval.py | Implements benchmark manifest runner, dataset metadata, per-case metrics in reports, and key-fact aggregate accuracy metrics. |
| benchmarks/thresholds/golden_minimum.json | Tightens thresholds and adds per-key-fact minimum gates. |
| benchmarks/sample_manifest.json | Updates sample manifest to point at committed synthetic fixture and adds expected facts. |
| benchmarks/README.md | Documents running sample and local-private benchmark manifests. |
| benchmarks/local_manifests/us_filings_private_batch.json | Adds committed local-private manifest batch for US filings (paths point to ignored private PDFs). |
| benchmarks/local_manifests/stress_private_batch.json | Adds committed local-private manifest batch for stress cases (paths point to ignored private PDFs). |
| benchmarks/fixtures/synthetic_income_001.json | Adds committed synthetic fixture used by the sample manifest. |
| .github/workflows/ci.yml | Adds eval thresholds gate step and includes scripts/ in ruff linting. |

Comment thread scripts/eval.py
Comment on lines +377 to +385
    source_path = _resolve_source_path(source["path"], manifest_dir)
    expected_fact_entries = case.get("expected_facts", [])
    pages = _load_synthetic_pages(source_path) if source_type == "synthetic" else None
    pdf_path = str(source_path) if source_type == "pdf" else None
    if source_type not in {"synthetic", "pdf"}:
        raise ValueError(
            f"Unsupported benchmark source type '{source_type}'. The current eval runner supports synthetic fixtures and local PDFs."
        )

Comment thread scripts/eval.py
Comment on lines +262 to +264
    return {
        "statement_accuracy_by_type": statement_metrics,
        "balance_equation_pass": balance_equation_pass_rate([statements]) if statements else None,
@@ -32,9 +68,45 @@
"evidence": [
{"page": 1, "quote": "Revenue 100"}
@magic-alt merged commit 25275ca into chore/phase0-eval-gates on May 22, 2026
3 checks passed
2 participants