Add benchmark thresholds and positioning docs by magic-alt · Pull Request #13 · magic-alt/jetbot

magic-alt · 2026-05-22T03:50:24Z

Summary

This PR closes the remaining Phase 0 positioning work and adds the first executable Phase 1 quality gate for evaluation.

It formalizes Jetbot as a Filing-to-Model Copilot / Financial Fact Platform, adds committed benchmark manifest metadata, and makes scripts/eval.py fail when metrics fall below configured thresholds.

What Changed

- Updated README.md to align product positioning with Filing-to-Model Copilot / Financial Fact Platform.
- Updated docs/financial_fact_platform_roadmap.md to reflect the post-PR12 state and to sequence next work as Phase 0 closure plus Phase 1 eval gating.
- Added benchmark documentation and committed metadata under benchmarks/:
  - benchmarks/README.md
  - benchmarks/manifest.schema.json
  - benchmarks/sample_manifest.json
  - benchmarks/thresholds/golden_minimum.json
- Updated .gitignore to keep raw/private benchmark data out of git while allowing schemas, manifests, and thresholds.
- Added --thresholds support to scripts/eval.py.
- Made eval return a non-zero exit code when thresholds fail.
- Added unit tests for threshold evaluation and benchmark manifest structure.
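
The exit-code behavior described in this list can be sketched roughly as follows. This is a minimal illustration, not the actual contents of scripts/eval.py: `build_parser`, `exit_code_for`, and the injectable `run_eval` callable are hypothetical names introduced here. The exit code `2` follows the review summary on this page, which states that `main()` returns `2` when thresholds fail; the `{"status": ...}` shape follows the diff excerpt in this PR.

```python
from __future__ import annotations

import argparse
from typing import Any, Callable

# Assumed shape of a threshold result: {"status": "passed" | "failed", "checks": [...]}
ThresholdResult = dict[str, Any]


def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser with the two flags used in the Validation commands."""
    parser = argparse.ArgumentParser(description="Run eval with optional threshold gating.")
    parser.add_argument("--thresholds", default=None, help="Path to a thresholds JSON file.")
    parser.add_argument("--output-dir", default="data/eval-dev", help="Directory for eval reports.")
    return parser


def exit_code_for(result: ThresholdResult | None) -> int:
    """Map a threshold result to a process exit code."""
    if result is None:
        return 0  # no thresholds configured, so there is nothing to enforce
    return 0 if result.get("status") == "passed" else 2


def main(
    argv: list[str] | None = None,
    run_eval: Callable[[str | None], ThresholdResult | None] | None = None,
) -> int:
    """Parse arguments, run the (hypothetical) eval callable, return an exit code."""
    args = build_parser().parse_args(argv)
    # run_eval stands in for the real evaluation; the default does no gating.
    runner = run_eval or (lambda thresholds_path: None)
    return exit_code_for(runner(args.thresholds))
```

A script would typically end with `sys.exit(main())` so that a CI job sees the non-zero code and fails the step.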

Why

PR12 added the fact foundation and eval runner, but two gaps remained:

- the product/docs layer was not fully aligned around the Filing-to-Model Copilot direction
- eval was informative but not yet enforceable as a CI quality gate

This PR closes both gaps so the next implementation slice can move into correction APIs, review UX, and exports with an actual regression gate in place.

Validation

python -m ruff check src tests scripts
python -m mypy src --ignore-missing-imports
python -m pytest -q --timeout=60
python scripts/eval.py --thresholds benchmarks/thresholds/golden_minimum.json --output-dir data/eval-dev

Follow-up

- Add correction APIs and effective facts
- Add evidence review UI and PDF bbox highlighting
- Add facts export endpoints
- Extend benchmark manifests beyond synthetic golden cases

Copilot

Pull request overview

This PR finalizes Phase 0 positioning docs and introduces an executable Phase 1 quality gate by adding benchmark metadata artifacts and making the evaluation script enforce configurable metric thresholds.

Changes:

- Added benchmark manifest documentation + JSON schema/sample manifest and a committed “golden minimum” thresholds config.
- Extended scripts/eval.py to load thresholds, evaluate min/max metric gates, include threshold results in reports, and return a non-zero exit code on threshold failure.
- Added tests covering threshold evaluation and basic manifest JSON sanity checks; updated README/roadmap and benchmark data policy guidance.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

Show a summary per file

| File | Description |
| --- | --- |
| `scripts/eval.py` | Adds `--thresholds`, evaluates min/max metric gates, records threshold results in the report, and exits non-zero when gates fail. |
| `tests/test_eval_script.py` | Adds unit tests for threshold evaluation and verifies `main()` returns `2` when thresholds fail. |
| `tests/test_benchmark_manifest.py` | Adds a basic JSON-load sanity test for the manifest schema and sample manifest. |
| `benchmarks/manifest.schema.json` | Introduces a JSON Schema for benchmark manifests. |
| `benchmarks/sample_manifest.json` | Adds a committed sample benchmark manifest illustrating expected shape and data policy. |
| `benchmarks/thresholds/golden_minimum.json` | Adds initial threshold configuration for the golden suite. |
| `benchmarks/README.md` | Documents benchmark data policy and how to run eval with thresholds. |
| `README.md` | Updates product positioning and documents benchmark policy + thresholded eval invocation. |
| `docs/financial_fact_platform_roadmap.md` | Updates roadmap to reflect Phase 0 closure and Phase 1 eval gating. |
| `.gitignore` | Ignores raw/private benchmark data and common binary formats under `benchmarks/`. |
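
The "basic JSON-load sanity test" mentioned for `tests/test_benchmark_manifest.py` might look something like the sketch below. The helper names `load_json_object` and `check_benchmark_files` are hypothetical, and the check deliberately asserts nothing about manifest fields, since the schema contents are not shown on this page. It only confirms that each committed file parses as a JSON object.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any

# File names taken from the PR's file list; the directory is passed in by the caller.
BENCHMARK_FILES = ["manifest.schema.json", "sample_manifest.json"]


def load_json_object(path: Path) -> dict[str, Any]:
    """Parse a file as JSON and require the top-level value to be an object."""
    data = json.loads(path.read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise TypeError(f"{path} must contain a JSON object, got {type(data).__name__}")
    return data


def check_benchmark_files(root: Path) -> list[str]:
    """Load every expected benchmark file under root; return the names that loaded."""
    for name in BENCHMARK_FILES:
        load_json_object(root / name)
    return list(BENCHMARK_FILES)
```

In a pytest suite this would usually be called as `check_benchmark_files(Path("benchmarks"))` from inside a test function. Validating the sample manifest against the schema itself would need a JSON Schema validator, which this sketch leaves out.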


+    for metric, threshold in thresholds.get("max_metrics", {}).items():
+        actual = metrics.get(metric)
+        passed = _is_number(actual) and float(actual) <= float(threshold)
+        failed = failed or not passed
+        checks.append(_threshold_check("max", metric, actual, threshold, passed))
+    return {"status": "failed" if failed else "passed", "checks": checks}


+def load_thresholds(path: str | None) -> dict[str, Any] | None:
+    if not path:
+        return None
+    return json.loads(Path(path).read_text(encoding="utf-8"))
+
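
For readers who want to run the gate logic in the excerpt above, here is a self-contained sketch. Only the `max_metrics` loop and the returned `{"status", "checks"}` shape come from the diff. Everything else is an assumption: the function name `evaluate_thresholds`, the bodies of `_is_number` and `_threshold_check`, the `min_metrics` key (mirrored from `max_metrics` because the review mentions min/max gates), and the metric names in the usage note.

```python
from __future__ import annotations

from typing import Any


def _is_number(value: Any) -> bool:
    # Assumed helper body: bool is excluded because it is a subclass of int.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _threshold_check(kind: str, metric: str, actual: Any, threshold: Any, passed: bool) -> dict[str, Any]:
    # Assumed record shape for a single check in the report.
    return {"kind": kind, "metric": metric, "actual": actual, "threshold": threshold, "passed": passed}


def evaluate_thresholds(metrics: dict[str, Any], thresholds: dict[str, Any]) -> dict[str, Any]:
    """Compare metrics against min/max gates; any failed or missing metric fails the run."""
    checks: list[dict[str, Any]] = []
    failed = False
    # "min_metrics" is an assumed key: the metric must be at least the threshold.
    for metric, threshold in thresholds.get("min_metrics", {}).items():
        actual = metrics.get(metric)
        passed = _is_number(actual) and float(actual) >= float(threshold)
        failed = failed or not passed
        checks.append(_threshold_check("min", metric, actual, threshold, passed))
    # "max_metrics" matches the excerpt: the metric must be at most the threshold.
    for metric, threshold in thresholds.get("max_metrics", {}).items():
        actual = metrics.get(metric)
        passed = _is_number(actual) and float(actual) <= float(threshold)
        failed = failed or not passed
        checks.append(_threshold_check("max", metric, actual, threshold, passed))
    return {"status": "failed" if failed else "passed", "checks": checks}
```

With hypothetical inputs `{"precision": 0.92, "error_rate": 0.10}` and thresholds `{"min_metrics": {"precision": 0.90}, "max_metrics": {"error_rate": 0.05}}`, the `precision` check passes and the `error_rate` check fails, so the overall status is `"failed"`. A metric that is absent from the results also fails its gate, because `metrics.get` returns `None` and `_is_number(None)` is false.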


add benchmark thresholds and positioning docs

0982cde

Copilot AI review requested due to automatic review settings May 22, 2026 03:50

Copilot started reviewing on behalf of magic-alt May 22, 2026 03:50

magic-alt merged commit 73efd9c into main May 22, 2026
5 checks passed

Copilot AI reviewed May 22, 2026


magic-alt mentioned this pull request May 22, 2026

Recover Phase 1 benchmark runner and CI gates onto main #15

Merged

magic-alt merged 1 commit into main from chore/phase0-eval-gates

2 participants
