Add Phase 1 benchmark runner and CI gates by magic-alt · Pull Request #14 · magic-alt/jetbot

magic-alt · 2026-05-22T05:49:05Z

Summary

This stacked PR advances the remaining Phase 1 benchmark work on top of chore/phase0-eval-gates.

It adds a runnable benchmark manifest flow for local datasets, broadens key-fact evaluation coverage, introduces the first committed local private manifest batches, and wires the eval threshold gate into CI and local CI.

What Changed

- Extended scripts/eval.py to:
  - run local benchmark manifests via --benchmark-manifest
  - emit document-level metrics in the report
  - compute per-key-fact aggregate accuracy metrics
  - keep the synthetic golden suite as the default path
- Expanded synthetic golden coverage in tests/golden/conftest.py so the suite now exercises the key facts needed for gating.
- Added alias-aware metric handling for concept names such as operating_cf vs operating_cash_flow.
- Extended normalization and statement extraction heuristics for:
  - gross_profit
  - operating_income
  - cash_and_equivalents
  - operating_cash_flow
  - capex
- Added committed local manifest batches under benchmarks/local_manifests/ for:
  - US 10-K / 10-Q filings
  - stress cases
- Updated benchmarks/README.md with local private dataset usage.
- Updated benchmarks/thresholds/golden_minimum.json to gate the new key-fact metrics.
- Wired the eval threshold gate into:
  - .github/workflows/ci.yml
  - scripts/local_ci.sh
- Added/updated tests for:
  - benchmark manifest structure
  - eval runner behavior
  - key-fact metrics
  - normalization aliases
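
For reviewers who want a concrete picture of the alias-aware metric handling listed above, here is a minimal sketch. It is not the code in src/utils/metrics.py: `CONCEPT_ALIASES`, `canonical_concept`, and `concepts_match` are hypothetical names, and the only alias pair taken from this PR is operating_cf / operating_cash_flow.

```python
# Hypothetical alias table. Only the operating_cf -> operating_cash_flow
# pair is mentioned in the PR; the real table may contain more entries.
CONCEPT_ALIASES = {
    "operating_cf": "operating_cash_flow",
}


def canonical_concept(name: str) -> str:
    """Map a concept name to a canonical form (trimmed, lowercased, alias-resolved)."""
    key = name.strip().lower()
    return CONCEPT_ALIASES.get(key, key)


def concepts_match(expected: str, actual: str) -> bool:
    """Treat two concept names as equal when they share a canonical form."""
    return canonical_concept(expected) == canonical_concept(actual)
```

With a lookup of this kind, an expected fact labelled operating_cf and an extracted fact labelled operating_cash_flow would count as the same concept when accuracy is computed.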

Why

The previous slice established the fact layer and the first eval runner, but Phase 1 still lacked three operational pieces:

- a runnable local benchmark manifest path
- per-key-fact metrics aligned with the roadmap acceptance criteria
- a real CI gate that fails when eval quality regresses

This PR closes those gaps and makes the benchmark/eval layer reviewable and enforceable.
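
To illustrate what "a CI gate that fails when eval quality regresses" can look like, the sketch below compares a metrics report against minimum thresholds and returns a non-zero exit code on any shortfall. It assumes both JSON files are flat maps from metric name to number; the actual layout of benchmarks/thresholds/golden_minimum.json and of the eval report is not shown in this PR, and `failed_gates` and `run_gate` are hypothetical names.

```python
import json
import sys


def failed_gates(metrics: dict, minimums: dict) -> list:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing (minimum {minimum})")
        elif value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures


def run_gate(report_path: str, thresholds_path: str) -> int:
    """Load both JSON files (assumed flat name -> number) and return an exit code."""
    with open(report_path, encoding="utf-8") as fh:
        metrics = json.load(fh)
    with open(thresholds_path, encoding="utf-8") as fh:
        minimums = json.load(fh)
    failures = failed_gates(metrics, minimums)
    for line in failures:
        print(f"GATE FAILED {line}", file=sys.stderr)
    # A non-zero exit code is what makes a CI step fail.
    return 1 if failures else 0
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that silently passes when a metric disappears from the report would not catch a regression in the runner itself.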

Validation

python -m pytest tests/test_metrics.py tests/test_eval_script.py tests/test_benchmark_manifest.py tests/test_normalizer_and_events.py tests/golden/test_golden.py -q
python scripts/eval.py --skip-pytest --thresholds benchmarks/thresholds/golden_minimum.json --output-dir data/eval-gate-check
python scripts/eval.py --benchmark-manifest benchmarks/local_manifests/us_filings_private_batch.json --skip-pytest --output-dir data/eval-us-private
python scripts/eval.py --benchmark-manifest benchmarks/local_manifests/stress_private_batch.json --skip-pytest --output-dir data/eval-stress-private

Notes

The committed local manifests point to ignored private paths under benchmarks/private/; raw PDFs remain out of git.
This PR is intentionally stacked on chore/phase0-eval-gates so the review stays focused on the Phase 1 delta.
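
Because the committed manifests reference PDFs that are not in git, a contributor without the private dataset may want to check which cases are runnable before starting a batch. The sketch below is one way to do that, under assumptions: it treats relative paths as relative to the manifest's directory, and it assumes each case has an "id" and a "source" object with a "path". Only `source["path"]` appears in the diff; `resolve_source_path` and `missing_sources` are hypothetical helpers, not the runner's `_resolve_source_path`.

```python
from pathlib import Path


def resolve_source_path(raw_path: str, manifest_dir: Path) -> Path:
    """Resolve a manifest path; relative paths are joined to the manifest directory."""
    path = Path(raw_path)
    return path if path.is_absolute() else manifest_dir / path


def missing_sources(cases: list, manifest_dir: Path) -> list:
    """List the ids of cases whose source file is not present on this machine."""
    missing = []
    for case in cases:
        source_path = resolve_source_path(case["source"]["path"], manifest_dir)
        if not source_path.exists():
            missing.append(case["id"])
    return missing
```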

Copilot

Pull request overview

Adds a Phase 1 benchmark-manifest evaluation path and enforces quality gates in CI/local CI, while expanding the “golden” synthetic suite and normalization/alias handling to cover additional key financial facts.

Changes:

- Extended scripts/eval.py to run manifest-driven benchmarks (synthetic fixtures + local PDFs), emit per-document metrics, and compute per-key-fact aggregate accuracy metrics.
- Added concept alias support (notably operating_cf ↔ operating_cash_flow) and expanded extraction/normalization heuristics for additional key facts.
- Introduced CI/local-CI eval threshold gating and committed sample + private local manifest batches plus fixtures/docs.
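
As a rough illustration of a per-key-fact aggregate accuracy metric, the sketch below counts, for each concept, the share of expected facts whose extracted value falls within a relative tolerance. The input shape (one `(expected, extracted)` pair of dicts per document), the 1% tolerance, and the name `key_fact_accuracy` are assumptions made for this example, not details taken from scripts/eval.py.

```python
from collections import defaultdict


def key_fact_accuracy(cases, rel_tol=0.01):
    """Per-concept share of expected facts matched within rel_tol.

    `cases` is a list of (expected, extracted) dict pairs, one per document,
    each mapping a concept name to a numeric value.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for expected, extracted in cases:
        for concept, want in expected.items():
            totals[concept] += 1
            got = extracted.get(concept)
            # A missing extraction counts as a miss, not as "not evaluated".
            if got is not None and abs(got - want) <= rel_tol * max(abs(want), 1e-9):
                hits[concept] += 1
    return {concept: hits[concept] / totals[concept] for concept in totals}
```

Aggregating per concept rather than per document is what lets a threshold file gate, say, capex and gross_profit independently.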

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

File	Description
tests/test_normalizer_and_events.py	Adds normalization tests for new aliases (cash equivalents, capex).
tests/test_metrics.py	Updates fact helper + adds test for operating cash flow alias handling in metrics.
tests/test_eval_script.py	Adds tests for running a sample benchmark manifest, PDF source handling, and key-fact alias metrics.
tests/test_benchmark_manifest.py	Adds assertions for committed local-private manifests and new threshold metrics.
tests/golden/conftest.py	Expands golden synthetic cases to include additional expected key facts (cash, gross profit, OI, OCF, capex).
src/utils/metrics.py	Introduces concept alias lookup and uses it in statement/fact accuracy calculations.
src/finance/normalizer.py	Adds new account-name normalizations for key facts and additional language variants.
src/agent/nodes.py	Expands totals capture and label-based extraction heuristics for additional key facts across statements.
scripts/local_ci.sh	Adds eval-thresholds step and includes `scripts/` in ruff linting.
scripts/eval.py	Implements benchmark manifest runner, dataset metadata, per-case metrics in reports, and key-fact aggregate accuracy metrics.
benchmarks/thresholds/golden_minimum.json	Tightens thresholds and adds per-key-fact minimum gates.
benchmarks/sample_manifest.json	Updates sample manifest to point at committed synthetic fixture and adds expected facts.
benchmarks/README.md	Documents running sample and local-private benchmark manifests.
benchmarks/local_manifests/us_filings_private_batch.json	Adds committed local-private manifest batch for US filings (paths point to ignored private PDFs).
benchmarks/local_manifests/stress_private_batch.json	Adds committed local-private manifest batch for stress cases (paths point to ignored private PDFs).
benchmarks/fixtures/synthetic_income_001.json	Adds committed synthetic fixture used by the sample manifest.
.github/workflows/ci.yml	Adds eval thresholds gate step and includes `scripts/` in ruff linting.

+    source_path = _resolve_source_path(source["path"], manifest_dir)
+    expected_fact_entries = case.get("expected_facts", [])
+    pages = _load_synthetic_pages(source_path) if source_type == "synthetic" else None
+    pdf_path = str(source_path) if source_type == "pdf" else None
+    if source_type not in {"synthetic", "pdf"}:
+        raise ValueError(
+            f"Unsupported benchmark source type '{source_type}'. The current eval runner supports synthetic fixtures and local PDFs."
+        )
+


+    return {
+        "statement_accuracy_by_type": statement_metrics,
+        "balance_equation_pass": balance_equation_pass_rate([statements]) if statements else None,
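
The body of `balance_equation_pass_rate` is not part of this excerpt. As a hedged illustration of what such a metric typically checks, the sketch below tests the accounting identity assets = liabilities + equity within a relative tolerance and reports the share of statements that pass. The field names, the 1% tolerance, and the function names are assumptions, and the `None` result for an empty input only mirrors the `if statements else None` guard in the fragment above.

```python
REQUIRED_TOTALS = ("total_assets", "total_liabilities", "total_equity")


def balance_equation_holds(statement: dict, rel_tol: float = 0.01) -> bool:
    """Check assets = liabilities + equity within a relative tolerance."""
    assets = statement["total_assets"]
    liabilities_and_equity = statement["total_liabilities"] + statement["total_equity"]
    return abs(assets - liabilities_and_equity) <= rel_tol * max(abs(assets), 1.0)


def sketch_balance_pass_rate(statements: list):
    """Share of checkable statements that satisfy the balance equation.

    Statements missing any required total are skipped; returns None when
    nothing can be checked.
    """
    checkable = [s for s in statements if all(k in s for k in REQUIRED_TOTALS)]
    if not checkable:
        return None
    return sum(balance_equation_holds(s) for s in checkable) / len(checkable)
```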


@@ -32,9 +68,45 @@
          "evidence": [
            {"page": 1, "quote": "Revenue 100"}


add phase 1 benchmark runner and ci gates

803fc3d

Copilot AI review requested due to automatic review settings May 22, 2026 05:49

Copilot started reviewing on behalf of magic-alt May 22, 2026 05:49

Copilot AI reviewed May 22, 2026

magic-alt added 3 commits May 22, 2026 13:57

fix PR14 CI regressions

309830b

fix eval gate extraction

fe89119

fix markdown lint and ts config warnings

5b47c78

magic-alt merged commit 25275ca into chore/phase0-eval-gates May 22, 2026
3 checks passed

magic-alt mentioned this pull request May 22, 2026

Recover Phase 1 benchmark runner and CI gates onto main #15

Merged

magic-alt merged 4 commits into chore/phase0-eval-gates from feat/phase1-benchmark-runner
