Add Phase 1 benchmark runner and CI gates #14
Merged
Pull request overview
Adds a Phase 1 benchmark-manifest evaluation path and enforces quality gates in CI/local CI, while expanding the “golden” synthetic suite and normalization/alias handling to cover additional key financial facts.
Changes:
- Extended `scripts/eval.py` to run manifest-driven benchmarks (synthetic fixtures + local PDFs), emit per-document metrics, and compute per-key-fact aggregate accuracy metrics.
- Added concept alias support (notably `operating_cf` ↔ `operating_cash_flow`) and expanded extraction/normalization heuristics for additional key facts.
- Introduced CI/local-CI eval threshold gating and committed sample + private local manifest batches plus fixtures/docs.
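
The alias handling described above can be illustrated with a minimal sketch. Only the `operating_cf` ↔ `operating_cash_flow` pair comes from this PR; the table name `CONCEPT_ALIASES` and the helpers `canonical_concept` and `lookup_fact` are hypothetical and are not the actual contents of `src/utils/metrics.py`.

```python
from typing import Mapping, Optional

# Hypothetical alias table: maps an alias to its canonical concept name.
# Only the operating_cf -> operating_cash_flow pair is named in the PR.
CONCEPT_ALIASES = {
    "operating_cf": "operating_cash_flow",
}


def canonical_concept(name: str) -> str:
    """Normalize a concept name and resolve it through the alias table."""
    key = name.strip().lower()
    return CONCEPT_ALIASES.get(key, key)


def lookup_fact(facts: Mapping[str, float], concept: str) -> Optional[float]:
    """Return the value stored under `concept` or any of its aliases."""
    wanted = canonical_concept(concept)
    for name, value in facts.items():
        if canonical_concept(name) == wanted:
            return value
    return None
```

With a lookup like this, an expected fact keyed as `operating_cash_flow` still matches an extracted fact keyed as `operating_cf`, so accuracy metrics would not penalize a naming difference.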
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_normalizer_and_events.py | Adds normalization tests for new aliases (cash equivalents, capex). |
| tests/test_metrics.py | Updates fact helper + adds test for operating cash flow alias handling in metrics. |
| tests/test_eval_script.py | Adds tests for running a sample benchmark manifest, PDF source handling, and key-fact alias metrics. |
| tests/test_benchmark_manifest.py | Adds assertions for committed local-private manifests and new threshold metrics. |
| tests/golden/conftest.py | Expands golden synthetic cases to include additional expected key facts (cash, gross profit, OI, OCF, capex). |
| src/utils/metrics.py | Introduces concept alias lookup and uses it in statement/fact accuracy calculations. |
| src/finance/normalizer.py | Adds new account-name normalizations for key facts and additional language variants. |
| src/agent/nodes.py | Expands totals capture and label-based extraction heuristics for additional key facts across statements. |
| scripts/local_ci.sh | Adds eval-thresholds step and includes scripts/ in ruff linting. |
| scripts/eval.py | Implements benchmark manifest runner, dataset metadata, per-case metrics in reports, and key-fact aggregate accuracy metrics. |
| benchmarks/thresholds/golden_minimum.json | Tightens thresholds and adds per-key-fact minimum gates. |
| benchmarks/sample_manifest.json | Updates sample manifest to point at committed synthetic fixture and adds expected facts. |
| benchmarks/README.md | Documents running sample and local-private benchmark manifests. |
| benchmarks/local_manifests/us_filings_private_batch.json | Adds committed local-private manifest batch for US filings (paths point to ignored private PDFs). |
| benchmarks/local_manifests/stress_private_batch.json | Adds committed local-private manifest batch for stress cases (paths point to ignored private PDFs). |
| benchmarks/fixtures/synthetic_income_001.json | Adds committed synthetic fixture used by the sample manifest. |
| .github/workflows/ci.yml | Adds eval thresholds gate step and includes scripts/ in ruff linting. |
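
As a rough illustration of the eval thresholds gate listed in the table, the sketch below compares computed metrics against per-metric minimums. It assumes the thresholds are a flat mapping from metric name to minimum value; the actual schema of `benchmarks/thresholds/golden_minimum.json` is not shown in this PR, and `check_thresholds` is a hypothetical helper name.

```python
from typing import Dict, List


def check_thresholds(metrics: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Return one failure message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in sorted(minimums.items()):
        value = metrics.get(name)
        if value is None:
            # A gated metric that was never computed counts as a failure.
            failures.append(f"{name}: missing (minimum {minimum})")
        elif value < minimum:
            failures.append(f"{name}: {value} < minimum {minimum}")
    return failures
```

A CI step could then exit with a non-zero status whenever the returned list is non-empty, which is what makes the gate enforceable rather than advisory.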
Comment on lines +377 to +385

    source_path = _resolve_source_path(source["path"], manifest_dir)
    expected_fact_entries = case.get("expected_facts", [])
    pages = _load_synthetic_pages(source_path) if source_type == "synthetic" else None
    pdf_path = str(source_path) if source_type == "pdf" else None
    if source_type not in {"synthetic", "pdf"}:
        raise ValueError(
            f"Unsupported benchmark source type '{source_type}'. The current eval runner supports synthetic fixtures and local PDFs."
        )
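
The excerpt checks `source_type` only after `pages` and `pdf_path` have been derived. One possible alternative ordering, shown here as a sketch rather than as the change made in this PR, validates the type first and only then branches. `SUPPORTED_SOURCE_TYPES` and `classify_source` are hypothetical names, and loading of synthetic pages is left to the caller.

```python
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical constant mirroring the two source types named in the excerpt.
SUPPORTED_SOURCE_TYPES = {"synthetic", "pdf"}


def classify_source(source_type: str, source_path: Path) -> Tuple[Optional[Path], Optional[str]]:
    """Return (synthetic_fixture_path, pdf_path), rejecting unknown types up front."""
    if source_type not in SUPPORTED_SOURCE_TYPES:
        raise ValueError(
            f"Unsupported benchmark source type '{source_type}'. "
            "The current eval runner supports synthetic fixtures and local PDFs."
        )
    if source_type == "synthetic":
        return source_path, None
    return None, str(source_path)
```

Failing before any path is interpreted keeps the error message as the first thing an author of a malformed manifest sees.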
Comment on lines +262 to +264

    return {
        "statement_accuracy_by_type": statement_metrics,
        "balance_equation_pass": balance_equation_pass_rate([statements]) if statements else None,

    @@ -32,9 +68,45 @@
        "evidence": [
            {"page": 1, "quote": "Revenue 100"}
Summary
This stacked PR advances the remaining Phase 1 benchmark work on top of `chore/phase0-eval-gates`. It adds a runnable benchmark manifest flow for local datasets, broadens key-fact evaluation coverage, introduces the first committed local private manifest batches, and wires the eval threshold gate into CI and local CI.
What Changed
- `scripts/eval.py` to:
  - `--benchmark-manifest`
- `tests/golden/conftest.py` so the suite now exercises the key facts needed for gating.
- `operating_cf` vs `operating_cash_flow`.
  - `gross_profit`
  - `operating_income`
  - `cash_and_equivalents`
  - `operating_cash_flow`
  - `capex`
- `benchmarks/local_manifests/` for:
- `benchmarks/README.md` with local private dataset usage.
- `benchmarks/thresholds/golden_minimum.json` to gate the new key-fact metrics.
  - `.github/workflows/ci.yml`
  - `scripts/local_ci.sh`

Why
The previous slice established the fact layer and the first eval runner, but Phase 1 still lacked three operational pieces:
This PR closes those gaps and makes the benchmark/eval layer reviewable and enforceable.
Validation
- `python -m pytest tests/test_metrics.py tests/test_eval_script.py tests/test_benchmark_manifest.py tests/test_normalizer_and_events.py tests/golden/test_golden.py -q`
- `python scripts/eval.py --skip-pytest --thresholds benchmarks/thresholds/golden_minimum.json --output-dir data/eval-gate-check`
- `python scripts/eval.py --benchmark-manifest benchmarks/local_manifests/us_filings_private_batch.json --skip-pytest --output-dir data/eval-us-private`
- `python scripts/eval.py --benchmark-manifest benchmarks/local_manifests/stress_private_batch.json --skip-pytest --output-dir data/eval-stress-private`

Notes
- `benchmarks/private/`; raw PDFs remain out of git.
- `chore/phase0-eval-gates` so the review stays focused on the Phase 1 delta.