Recover Phase 1 benchmark runner and CI gates onto main #15
Merged
Pull request overview
This PR reintroduces the “Phase 1” benchmark runner and CI quality gates onto main, expanding eval coverage (including key-fact/document-level metrics), adding local benchmark manifests/fixtures, and wiring the thresholds gate into CI and the local CI script.
Changes:
- Extend `scripts/eval.py` to support running benchmark manifests (`--benchmark-manifest`), emit per-document metrics, and compute key-fact accuracy metrics used by threshold gates.
- Add committed local manifest metadata and a synthetic fixture to exercise the manifest flow, and expand golden cases/normalization to cover additional key facts (cash, gross profit, operating income, operating cash flow, capex).
- Wire eval threshold gating into CI/local CI and update web TS config/tooling (TypeScript upgrade + config warnings cleanup).
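As a rough illustration of what a thresholds gate does, the sketch below compares reported metrics against minimum values and produces a non-zero exit code on any shortfall. It is not the code in `scripts/eval.py`: the function name `check_thresholds`, the flat name-to-minimum mapping, and the metric names and values are all assumptions made for the example.

```python
from __future__ import annotations


def check_thresholds(metrics: dict[str, float], minimums: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in sorted(minimums.items()):
        value = metrics.get(name)
        if value is None:
            # A gated metric that was never reported counts as a failure.
            failures.append(f"{name}: missing from report (minimum {minimum})")
        elif value < minimum:
            failures.append(f"{name}: {value} is below minimum {minimum}")
    return failures


# Hypothetical metric names and values, for illustration only.
report_metrics = {"key_fact_accuracy": 0.92, "statement_accuracy": 0.80}
minimums = {"key_fact_accuracy": 0.90, "statement_accuracy": 0.85}

failures = check_thresholds(report_metrics, minimums)
exit_code = 1 if failures else 0  # a CI step would exit with this code
```

Treating a missing metric as a failure is a design choice in this sketch: it keeps a gate from passing silently when a metric stops being emitted.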
Reviewed changes
Copilot reviewed 20 out of 22 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| web/tsconfig.json | Adjust TS config (deprecation ignore + path mapping tweak). |
| web/package.json | Bump TypeScript dev dependency. |
| web/package-lock.json | Lockfile updates for the TypeScript bump. |
| tests/test_normalizer_and_events.py | Add normalization alias tests for cash equivalents/capex. |
| tests/test_metrics.py | Update test helper + add alias coverage for operating cash flow. |
| tests/test_eval_script.py | Add tests for manifest execution, document metrics output, and local PDF manifest spec. |
| tests/test_benchmark_manifest.py | Validate committed local manifests and ensure thresholds include key-fact metrics. |
| tests/golden/conftest.py | Expand golden cases to include expected facts for new key metrics (incl. capex/cash). |
| src/utils/metrics.py | Add concept alias support and reuse it for statement/fact metric lookups. |
| src/finance/normalizer.py | Expand account-name normalization mappings (EN + ZH) for new concepts. |
| src/agent/nodes.py | Expand extracted totals/labels; refine labeled-metric extraction segmentation/value selection. |
| scripts/local_ci.sh | Add eval thresholds step and expand ruff target set. |
| scripts/eval.py | Add benchmark-manifest runner, dataset metadata, per-case metrics, and key-fact accuracy metrics. |
| README.md | Formatting-only newline normalization at EOF. |
| docs/financial_fact_platform_roadmap.md | Heading formatting + EOF newline normalization. |
| benchmarks/thresholds/golden_minimum.json | Tighten thresholds and add key-fact accuracy gates. |
| benchmarks/sample_manifest.json | Switch sample manifest to use a committed synthetic fixture + richer expected facts. |
| benchmarks/README.md | Document running sample manifest + local private batches. |
| benchmarks/local_manifests/us_filings_private_batch.json | Add committed “local-private” US filings manifest metadata. |
| benchmarks/local_manifests/stress_private_batch.json | Add committed “local-private” stress manifest metadata. |
| benchmarks/fixtures/synthetic_income_001.json | Add synthetic fixture used by sample manifest. |
| .github/workflows/ci.yml | Run ruff on scripts/, add eval thresholds gate, and change docker job triggering behavior. |
Files not reviewed (1)
- web/package-lock.json: Language not supported
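The table above mentions concept alias support in `src/utils/metrics.py` and key-fact accuracy metrics in `scripts/eval.py`. A minimal sketch of how alias resolution could feed such a metric is shown below. The alias table, the function names `canonical_concept` and `key_fact_accuracy`, and the 1% relative tolerance are hypothetical and are not taken from the repository.

```python
from __future__ import annotations

import math

# Hypothetical alias table: canonical concept -> labels that map to it.
CONCEPT_ALIASES = {
    "capex": {"capex", "capital expenditures"},
    "cash": {"cash", "cash and cash equivalents"},
    "operating_cash_flow": {"operating cash flow"},
}

# Reverse index so each label resolves to its canonical concept in one lookup.
_LOOKUP = {
    alias: concept
    for concept, aliases in CONCEPT_ALIASES.items()
    for alias in aliases
}


def canonical_concept(label: str) -> str | None:
    """Lower-case the label, collapse whitespace, and resolve it to a concept."""
    return _LOOKUP.get(" ".join(label.lower().split()))


def key_fact_accuracy(
    expected: dict[str, float],
    extracted: dict[str, float],
    rel_tol: float = 0.01,
) -> float:
    """Fraction of expected concepts whose extracted value matches within rel_tol."""
    if not expected:
        return 0.0
    resolved: dict[str, float] = {}
    for label, value in extracted.items():
        concept = canonical_concept(label)
        if concept is not None:
            resolved.setdefault(concept, value)  # first matching label wins
    hits = sum(
        1
        for concept, want in expected.items()
        if concept in resolved and math.isclose(resolved[concept], want, rel_tol=rel_tol)
    )
    return hits / len(expected)


expected = {"capex": 120.0, "cash": 500.0}
extracted = {"Capital Expenditures": 120.0, "Cash and cash  equivalents": 450.0}
accuracy = key_fact_accuracy(expected, extracted)  # capex matches, cash does not
```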
Comment on lines 126 to +130
"## Metrics",
"",
]
dataset = report.get("dataset")
if dataset:
@@ -79,7 +82,6 @@ jobs:
docker:
Summary
This PR recovers the Phase 1 benchmark runner and CI gate changes onto `main`.

The original Phase 1 work was previously reviewed in PR #14, but that PR was intentionally stacked on `chore/phase0-eval-gates`. PR #13 merged the Phase 0 branch into `main` first, while PR #14 later merged only into `chore/phase0-eval-gates`. As a result, the 4 Phase 1 commits are still not present on `main`.

This PR re-targets the same Phase 1 delta directly to `main` so the work is visible and mergeable on the default branch.

Included commits
- 803fc3d add phase 1 benchmark runner and ci gates
- 309830b fix PR14 CI regressions
- fe89119 fix eval gate extraction
- 5b47c78 fix markdown lint and ts config warnings

What Changed
- `scripts/eval.py` with local benchmark manifest support and document-level metrics
- `benchmarks/local_manifests/`

Why this recovery PR exists
PR #14 was merged into `chore/phase0-eval-gates`, not into `main`.

Because `main` already received PR #13 earlier, GitHub no longer shows the original stacked path as the active route for getting this work onto the default branch. This PR restores that route explicitly.

Validation
- `python -m pytest tests/test_metrics.py tests/test_eval_script.py tests/test_benchmark_manifest.py tests/test_normalizer_and_events.py tests/golden/test_golden.py -q`
- `python scripts/eval.py --skip-pytest --thresholds benchmarks/thresholds/golden_minimum.json --output-dir data/eval-gate-check`
- `python scripts/eval.py --benchmark-manifest benchmarks/local_manifests/us_filings_private_batch.json --skip-pytest --output-dir data/eval-us-private`
- `python scripts/eval.py --benchmark-manifest benchmarks/local_manifests/stress_private_batch.json --skip-pytest --output-dir data/eval-stress-private`
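
The validation commands above pass four flags to `scripts/eval.py`: `--benchmark-manifest`, `--thresholds`, `--skip-pytest`, and `--output-dir`. For readers unfamiliar with the script, here is a minimal `argparse` sketch of a command-line interface that accepts those flags. Only the flag names come from this PR; the help texts, the defaults (including `data/eval`), and the `build_parser` helper are assumptions, and the real script may define further options.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the four flags used in the validation commands."""
    parser = argparse.ArgumentParser(description="Evaluation runner (illustrative sketch)")
    parser.add_argument("--benchmark-manifest", type=Path, default=None,
                        help="Path to a benchmark manifest JSON file")
    parser.add_argument("--thresholds", type=Path, default=None,
                        help="Path to a thresholds JSON file used as a quality gate")
    parser.add_argument("--skip-pytest", action="store_true",
                        help="Skip the pytest stage and run only the evaluation")
    # The default output directory below is an assumption, not the script's real default.
    parser.add_argument("--output-dir", type=Path, default=Path("data/eval"),
                        help="Directory where reports are written")
    return parser


# Parse an argument list shaped like the manifest runs in the validation commands.
args = build_parser().parse_args([
    "--benchmark-manifest", "benchmarks/sample_manifest.json",
    "--skip-pytest",
    "--output-dir", "data/eval-sample",
])
```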