Recover Phase 1 benchmark runner and CI gates onto main by magic-alt · Pull Request #15 · magic-alt/jetbot

magic-alt · 2026-05-22T06:36:22Z

Summary

This PR recovers the Phase 1 benchmark runner and CI gate changes onto main.

The original Phase 1 work was previously reviewed in PR #14, but that PR was intentionally stacked on chore/phase0-eval-gates. PR #13 merged the Phase 0 branch into main first, while PR #14 later merged only into chore/phase0-eval-gates. As a result, the 4 Phase 1 commits are still not present on main.

This PR re-targets the same Phase 1 delta directly to main so the work is visible and mergeable on the default branch.

Included commits

803fc3d add phase 1 benchmark runner and ci gates
309830b fix PR14 CI regressions
fe89119 fix eval gate extraction
5b47c78 fix markdown lint and ts config warnings

What Changed

Extend scripts/eval.py with local benchmark manifest support and document-level metrics
Add committed local manifest batches under benchmarks/local_manifests/
Expand key-fact eval coverage and thresholds
Wire eval threshold gating into CI and local CI
Include the follow-up CI, markdownlint, and TypeScript config fixes from the original stacked branch
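
As a rough illustration of the threshold gating listed above, the sketch below compares a metrics dictionary against minimum values and maps the result to a process exit code. It assumes a flat JSON object of metric name to minimum value; the real schema of benchmarks/thresholds/golden_minimum.json and the gate logic in scripts/eval.py are not shown in this PR description, and `check_thresholds`, `gate_exit_code`, and the metric names are hypothetical.

```python
import json
from typing import Dict, List


def check_thresholds(metrics: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return one message per metric that is missing or below its minimum."""
    failures = []
    for name, minimum in sorted(thresholds.items()):
        value = metrics.get(name)
        if value is None:
            # A metric the gate expects but the run did not produce counts as a failure.
            failures.append(f"{name}: missing (minimum {minimum})")
        elif value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures


def gate_exit_code(metrics: Dict[str, float], thresholds_json: str) -> int:
    """Return 0 when every threshold is met, 1 otherwise (suitable for sys.exit)."""
    thresholds = json.loads(thresholds_json)  # assumed shape: {"metric_name": minimum}
    failures = check_thresholds(metrics, thresholds)
    for message in failures:
        print(f"THRESHOLD FAIL {message}")
    return 1 if failures else 0
```

A CI step would then fail whenever the script exits non-zero, which is presumably how the gate blocks a merge.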

Why this recovery PR exists

PR #14 was merged into chore/phase0-eval-gates, not into main.

Because PR #13 had already merged chore/phase0-eval-gates into main before PR #14 landed on that branch, the Phase 1 commits were left on the Phase 0 branch with no open PR carrying them to the default branch. This PR provides that route explicitly.

Validation

python -m pytest tests/test_metrics.py tests/test_eval_script.py tests/test_benchmark_manifest.py tests/test_normalizer_and_events.py tests/golden/test_golden.py -q
python scripts/eval.py --skip-pytest --thresholds benchmarks/thresholds/golden_minimum.json --output-dir data/eval-gate-check
python scripts/eval.py --benchmark-manifest benchmarks/local_manifests/us_filings_private_batch.json --skip-pytest --output-dir data/eval-us-private
python scripts/eval.py --benchmark-manifest benchmarks/local_manifests/stress_private_batch.json --skip-pytest --output-dir data/eval-stress-private

Copilot

Pull request overview

This PR reintroduces the “Phase 1” benchmark runner and CI quality gates onto main, expanding eval coverage (including key-fact/document-level metrics), adding local benchmark manifests/fixtures, and wiring the thresholds gate into CI and the local CI script.

Changes:

Extend scripts/eval.py to support running benchmark manifests (--benchmark-manifest), emit per-document metrics, and compute key-fact accuracy metrics used by threshold gates.
Add committed local manifest metadata + a synthetic fixture to exercise the manifest flow, and expand golden cases/normalization to cover additional key facts (cash, gross profit, operating income, operating cash flow, capex).
Wire eval threshold gating into CI/local CI and update web TS config/tooling (TypeScript upgrade + config warnings cleanup).
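
To make the per-document and key-fact metrics described above more concrete, here is a minimal sketch of one way they could be computed from a manifest. The manifest shape (`cases`, `id`, `expected_facts`), the 1% relative tolerance, and the function names are assumptions for illustration, not the format or logic actually used by scripts/eval.py.

```python
import math
from typing import Dict, Optional


def fact_matches(expected: float, actual: Optional[float], rel_tol: float = 0.01) -> bool:
    """A fact counts as correct when it was extracted and is within rel_tol of the expected value."""
    if actual is None:
        return False
    return math.isclose(actual, expected, rel_tol=rel_tol, abs_tol=1e-9)


def evaluate_manifest(manifest: dict, extracted_by_case: Dict[str, Dict[str, float]]) -> dict:
    """Compute key-fact accuracy per document and over the whole manifest."""
    documents = []
    total = matched = 0
    for case in manifest["cases"]:  # hypothetical manifest layout
        expected = case["expected_facts"]
        extracted = extracted_by_case.get(case["id"], {})
        hits = sum(fact_matches(value, extracted.get(name)) for name, value in expected.items())
        documents.append({
            "id": case["id"],
            # Documents without expected facts get no score rather than a misleading 0 or 1.
            "key_fact_accuracy": hits / len(expected) if expected else None,
        })
        total += len(expected)
        matched += hits
    return {
        "documents": documents,
        "key_fact_accuracy": matched / total if total else None,
    }
```

The aggregate here is micro-averaged over facts, so documents with more expected facts weigh more; a macro average over documents would be an equally plausible choice.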

Reviewed changes

Copilot reviewed 20 out of 22 changed files in this pull request and generated 2 comments.

File	Description
web/tsconfig.json	Adjust TS config (deprecation ignore + path mapping tweak).
web/package.json	Bump TypeScript dev dependency.
web/package-lock.json	Lockfile updates for the TypeScript bump.
tests/test_normalizer_and_events.py	Add normalization alias tests for cash equivalents/capex.
tests/test_metrics.py	Update test helper + add alias coverage for operating cash flow.
tests/test_eval_script.py	Add tests for manifest execution, document metrics output, and local PDF manifest spec.
tests/test_benchmark_manifest.py	Validate committed local manifests and ensure thresholds include key-fact metrics.
tests/golden/conftest.py	Expand golden cases to include expected facts for new key metrics (incl. capex/cash).
src/utils/metrics.py	Add concept alias support and reuse it for statement/fact metric lookups.
src/finance/normalizer.py	Expand account-name normalization mappings (EN + ZH) for new concepts.
src/agent/nodes.py	Expand extracted totals/labels; refine labeled-metric extraction segmentation/value selection.
scripts/local_ci.sh	Add eval thresholds step and expand ruff target set.
scripts/eval.py	Add benchmark-manifest runner, dataset metadata, per-case metrics, and key-fact accuracy metrics.
README.md	Formatting-only newline normalization at EOF.
docs/financial_fact_platform_roadmap.md	Heading formatting + EOF newline normalization.
benchmarks/thresholds/golden_minimum.json	Tighten thresholds and add key-fact accuracy gates.
benchmarks/sample_manifest.json	Switch sample manifest to use a committed synthetic fixture + richer expected facts.
benchmarks/README.md	Document running sample manifest + local private batches.
benchmarks/local_manifests/us_filings_private_batch.json	Add committed “local-private” US filings manifest metadata.
benchmarks/local_manifests/stress_private_batch.json	Add committed “local-private” stress manifest metadata.
benchmarks/fixtures/synthetic_income_001.json	Add synthetic fixture used by sample manifest.
.github/workflows/ci.yml	Run ruff on `scripts/`, add eval thresholds gate, and change docker job triggering behavior.
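
The "concept alias support" noted for src/utils/metrics.py can be pictured roughly as follows. This is a sketch under assumptions: the alias table, the normalization rule, and `lookup_concept` are hypothetical and are not taken from the repository.

```python
from typing import Mapping, Optional

# Illustrative alias table only; the repository's actual mappings are not shown in this PR page.
CONCEPT_ALIASES = {
    "cash": ("cash_and_cash_equivalents",),
    "capex": ("capital_expenditures", "purchase_of_property_and_equipment"),
    "operating_cash_flow": ("net_cash_provided_by_operating_activities",),
}


def _normalize(name: str) -> str:
    """Lower-case a label and join its words with underscores."""
    return "_".join(name.strip().lower().split())


def lookup_concept(values: Mapping[str, float], concept: str) -> Optional[float]:
    """Look up a concept by its canonical name first, then by each known alias."""
    normalized = {_normalize(key): value for key, value in values.items()}
    canonical = _normalize(concept)
    for candidate in (canonical, *CONCEPT_ALIASES.get(canonical, ())):
        if candidate in normalized:
            return normalized[candidate]
    return None
```

Checking the canonical name before its aliases keeps the lookup deterministic when a document reports the same concept under more than one label.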

Files not reviewed (1)

web/package-lock.json: Language not supported

magic-alt added 4 commits May 22, 2026 13:47

803fc3d add phase 1 benchmark runner and ci gates
309830b fix PR14 CI regressions
fe89119 fix eval gate extraction
5b47c78 fix markdown lint and ts config warnings

Copilot AI review requested due to automatic review settings May 22, 2026 06:36

Copilot started reviewing on behalf of magic-alt May 22, 2026 06:36

magic-alt merged commit f742895 into main May 22, 2026
8 checks passed

Copilot AI reviewed May 22, 2026

Comment thread scripts/eval.py

Comment on lines 126 to +130

        "## Metrics",
        "",
    ]
+    dataset = report.get("dataset")
+    if dataset:

Comment thread .github/workflows/ci.yml

@@ -79,7 +82,6 @@ jobs:
  docker:
magic-alt merged 4 commits into main from feat/phase1-benchmark-runner
