Add benchmark thresholds and positioning docs #13

Merged: magic-alt merged 1 commit into main from chore/phase0-eval-gates on May 22, 2026

@magic-alt
Owner

Summary

This PR closes the remaining Phase 0 positioning work and adds the first executable Phase 1 quality gate for evaluation.

It formalizes Jetbot as a Filing-to-Model Copilot / Financial Fact Platform, adds committed benchmark manifest metadata, and makes scripts/eval.py fail when metrics fall below configured thresholds.

What Changed

  • Updated README.md to align product positioning with Filing-to-Model Copilot / Financial Fact Platform.
  • Updated docs/financial_fact_platform_roadmap.md to reflect the post-PR12 state and to sequence next work as Phase 0 closure plus Phase 1 eval gating.
  • Added benchmark documentation and committed metadata under benchmarks/:
    • benchmarks/README.md
    • benchmarks/manifest.schema.json
    • benchmarks/sample_manifest.json
    • benchmarks/thresholds/golden_minimum.json
  • Updated .gitignore to keep raw/private benchmark data out of git while allowing schemas, manifests, and thresholds.
  • Added --thresholds support to scripts/eval.py.
  • Made eval return a non-zero exit code when thresholds fail.
  • Added unit tests for threshold evaluation and benchmark manifest structure.
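
The flag and exit-code behavior listed above could be wired roughly as follows. This is a minimal sketch under stated assumptions, not the contents of scripts/eval.py: `build_parser`, `exit_code_for`, and the `EXIT_*` constants are hypothetical names. Only the `--thresholds` and `--output-dir` flags (from the Validation commands) and the exit code 2 (which the review summary reports for `main()` on threshold failure) come from this PR.

```python
from __future__ import annotations

import argparse

EXIT_OK = 0
EXIT_THRESHOLDS_FAILED = 2  # value the review summary reports for main()


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the two flags used in the Validation commands."""
    parser = argparse.ArgumentParser(description="Run eval with optional threshold gating")
    parser.add_argument("--thresholds", default=None, help="Path to a thresholds JSON file")
    parser.add_argument("--output-dir", default=None, help="Directory for eval reports")
    return parser


def exit_code_for(threshold_result: dict | None) -> int:
    """Map a threshold result to a process exit code.

    Assumption: when no thresholds are configured (None), eval stays
    informational and exits 0.
    """
    if threshold_result is not None and threshold_result.get("status") == "failed":
        return EXIT_THRESHOLDS_FAILED
    return EXIT_OK
```

Returning the code from `main()` and passing it to `sys.exit` at the entry point keeps the gate easy to unit test without spawning a subprocess.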

Why

PR12 added the fact foundation and eval runner, but two gaps remained:

  1. the product/docs layer was not fully aligned around the Filing-to-Model Copilot direction
  2. eval was informative but not yet enforceable as a CI quality gate

This PR closes both gaps so the next implementation slice can move into correction APIs, review UX, and exports with an actual regression gate in place.

Validation

  • python -m ruff check src tests scripts
  • python -m mypy src --ignore-missing-imports
  • python -m pytest -q --timeout=60
  • python scripts/eval.py --thresholds benchmarks/thresholds/golden_minimum.json --output-dir data/eval-dev

Follow-up

  1. Add correction APIs and effective facts
  2. Add evidence review UI and PDF bbox highlighting
  3. Add facts export endpoints
  4. Extend benchmark manifests beyond synthetic golden cases

Copilot AI review requested due to automatic review settings May 22, 2026 03:50
@magic-alt magic-alt merged commit 73efd9c into main May 22, 2026
5 checks passed

Copilot AI left a comment

Pull request overview

This PR finalizes Phase 0 positioning docs and introduces an executable Phase 1 quality gate by adding benchmark metadata artifacts and making the evaluation script enforce configurable metric thresholds.

Changes:

  • Added benchmark manifest documentation + JSON schema/sample manifest and a committed “golden minimum” thresholds config.
  • Extended scripts/eval.py to load thresholds, evaluate min/max metric gates, include threshold results in reports, and return a non-zero exit code on threshold failure.
  • Added tests covering threshold evaluation and basic manifest JSON sanity checks; updated README/roadmap and benchmark data policy guidance.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| scripts/eval.py | Adds --thresholds, evaluates min/max metric gates, records threshold results in the report, and exits non-zero when gates fail. |
| tests/test_eval_script.py | Adds unit tests for threshold evaluation and verifies main() returns 2 when thresholds fail. |
| tests/test_benchmark_manifest.py | Adds a basic JSON-load sanity test for the manifest schema and sample manifest. |
| benchmarks/manifest.schema.json | Introduces a JSON Schema for benchmark manifests. |
| benchmarks/sample_manifest.json | Adds a committed sample benchmark manifest illustrating expected shape and data policy. |
| benchmarks/thresholds/golden_minimum.json | Adds initial threshold configuration for the golden suite. |
| benchmarks/README.md | Documents benchmark data policy and how to run eval with thresholds. |
| README.md | Updates product positioning and documents benchmark policy + thresholded eval invocation. |
| docs/financial_fact_platform_roadmap.md | Updates roadmap to reflect Phase 0 closure and Phase 1 eval gating. |
| .gitignore | Ignores raw/private benchmark data and common binary formats under benchmarks/. |

Comment thread scripts/eval.py
Comment on lines +140 to +145
    for metric, threshold in thresholds.get("max_metrics", {}).items():
        actual = metrics.get(metric)
        passed = _is_number(actual) and float(actual) <= float(threshold)
        failed = failed or not passed
        checks.append(_threshold_check("max", metric, actual, threshold, passed))
    return {"status": "failed" if failed else "passed", "checks": checks}
Comment thread scripts/eval.py
Comment on lines +123 to +127
def load_thresholds(path: str | None) -> dict[str, Any] | None:
    if not path:
        return None
    return json.loads(Path(path).read_text(encoding="utf-8"))
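
As excerpted, `load_thresholds` returns whatever JSON the file contains. If stricter loading were wanted, one option is to check the shape up front so that a malformed config fails with a clear message before any metrics are compared. The sketch below is an illustration only and is not taken from the review: `ThresholdConfigError`, `ALLOWED_SECTIONS`, and the `min_metrics` key are hypothetical.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any

ALLOWED_SECTIONS = ("min_metrics", "max_metrics")  # "min_metrics" is an assumed key


class ThresholdConfigError(ValueError):
    """Hypothetical error type for unreadable or malformed threshold files."""


def load_thresholds(path: str | None) -> dict[str, Any] | None:
    if not path:
        return None
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        raise ThresholdConfigError(f"cannot read thresholds from {path}: {exc}") from exc
    if not isinstance(data, dict):
        raise ThresholdConfigError("thresholds file must contain a JSON object")
    for section in ALLOWED_SECTIONS:
        values = data.get(section, {})
        if not isinstance(values, dict):
            raise ThresholdConfigError(f"{section} must map metric names to numbers")
        for metric, limit in values.items():
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(limit, bool) or not isinstance(limit, (int, float)):
                raise ThresholdConfigError(f"{section}.{metric} must be a number, got {limit!r}")
    return data
```

Validating at load time means a typo in the config surfaces as a configuration error instead of a `ValueError` from `float(threshold)` deep inside the gate loop.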
