Add benchmark thresholds and positioning docs #13
Merged
Pull request overview
This PR finalizes Phase 0 positioning docs and introduces an executable Phase 1 quality gate by adding benchmark metadata artifacts and making the evaluation script enforce configurable metric thresholds.
Changes:
- Added benchmark manifest documentation + JSON schema/sample manifest and a committed “golden minimum” thresholds config.
- Extended `scripts/eval.py` to load thresholds, evaluate min/max metric gates, include threshold results in reports, and return a non-zero exit code on threshold failure.
- Added tests covering threshold evaluation and basic manifest JSON sanity checks; updated README/roadmap and benchmark data policy guidance.
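As a rough illustration of the exit-code behavior described above, the sketch below maps a threshold result to a process exit code. The result shape (`status` plus `checks`) and the value 2 follow what this PR describes; the helper name `exit_code_for` and the constant name are hypothetical and not taken from `scripts/eval.py`.

```python
from __future__ import annotations

from typing import Any

# The PR states that main() returns 2 when thresholds fail; the constant name is an assumption.
THRESHOLD_FAILURE_EXIT_CODE = 2


def exit_code_for(threshold_result: dict[str, Any] | None) -> int:
    """Translate a threshold result into an exit code (hypothetical helper)."""
    # No thresholds file was supplied, so there is no gate to fail.
    if threshold_result is None:
        return 0
    # Any status other than "passed" is treated as a gate failure.
    if threshold_result.get("status") == "passed":
        return 0
    return THRESHOLD_FAILURE_EXIT_CODE
```

A caller could end with `sys.exit(exit_code_for(result))` so that CI treats a failed gate as a failed job.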
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `scripts/eval.py` | Adds `--thresholds`, evaluates min/max metric gates, records threshold results in the report, and exits non-zero when gates fail. |
| `tests/test_eval_script.py` | Adds unit tests for threshold evaluation and verifies `main()` returns 2 when thresholds fail. |
| `tests/test_benchmark_manifest.py` | Adds a basic JSON-load sanity test for the manifest schema and sample manifest. |
| `benchmarks/manifest.schema.json` | Introduces a JSON Schema for benchmark manifests. |
| `benchmarks/sample_manifest.json` | Adds a committed sample benchmark manifest illustrating expected shape and data policy. |
| `benchmarks/thresholds/golden_minimum.json` | Adds initial threshold configuration for the golden suite. |
| `benchmarks/README.md` | Documents benchmark data policy and how to run eval with thresholds. |
| `README.md` | Updates product positioning and documents benchmark policy + thresholded eval invocation. |
| `docs/financial_fact_platform_roadmap.md` | Updates roadmap to reflect Phase 0 closure and Phase 1 eval gating. |
| `.gitignore` | Ignores raw/private benchmark data and common binary formats under `benchmarks/`. |
Comment on lines +140 to +145

    for metric, threshold in thresholds.get("max_metrics", {}).items():
        actual = metrics.get(metric)
        passed = _is_number(actual) and float(actual) <= float(threshold)
        failed = failed or not passed
        checks.append(_threshold_check("max", metric, actual, threshold, passed))
    return {"status": "failed" if failed else "passed", "checks": checks}
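For context, the snippet above is only the tail of the threshold evaluation. A minimal self-contained sketch of the whole min/max gate might look like the following. The function name `evaluate_thresholds`, the `min_metrics` key, and the bodies of `_is_number` and `_threshold_check` are assumptions made for illustration; only the `max_metrics` loop and the returned shape are taken from the diff.

```python
from __future__ import annotations

import math
from typing import Any


def _is_number(value: Any) -> bool:
    # Assumed behavior: accept ints and floats, reject bools, None, strings, and NaN.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return not (isinstance(value, float) and math.isnan(value))


def _threshold_check(kind: str, metric: str, actual: Any, threshold: Any, passed: bool) -> dict[str, Any]:
    # Assumed record shape for one gate; the real report fields may differ.
    return {"kind": kind, "metric": metric, "actual": actual, "threshold": threshold, "passed": passed}


def evaluate_thresholds(metrics: dict[str, Any], thresholds: dict[str, Any]) -> dict[str, Any]:
    checks: list[dict[str, Any]] = []
    failed = False
    # Lower bounds: the metric must be at least the threshold ("min_metrics" is an assumed key).
    for metric, threshold in thresholds.get("min_metrics", {}).items():
        actual = metrics.get(metric)
        passed = _is_number(actual) and float(actual) >= float(threshold)
        failed = failed or not passed
        checks.append(_threshold_check("min", metric, actual, threshold, passed))
    # Upper bounds: the metric must be at most the threshold (mirrors the diff above).
    for metric, threshold in thresholds.get("max_metrics", {}).items():
        actual = metrics.get(metric)
        passed = _is_number(actual) and float(actual) <= float(threshold)
        failed = failed or not passed
        checks.append(_threshold_check("max", metric, actual, threshold, passed))
    return {"status": "failed" if failed else "passed", "checks": checks}
```

In this sketch a metric that is missing or non-numeric fails its gate rather than being skipped, which matches the `_is_number(actual) and ...` pattern in the reviewed lines.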
Comment on lines +123 to +127

    def load_thresholds(path: str | None) -> dict[str, Any] | None:
        if not path:
            return None
        return json.loads(Path(path).read_text(encoding="utf-8"))
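The reviewer's comment text for these lines is not preserved here. One possible hardening, shown purely as a sketch, is to validate the shape of the loaded file before it reaches the gate logic. The `min_metrics` key name and the specific error messages are assumptions; `max_metrics` comes from the diff.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Any

# "max_metrics" appears in the diff; "min_metrics" is an assumed counterpart.
THRESHOLD_SECTIONS = ("min_metrics", "max_metrics")


def load_thresholds(path: str | None) -> dict[str, Any] | None:
    if not path:
        return None
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError(f"{path}: thresholds file must contain a JSON object")
    for section in THRESHOLD_SECTIONS:
        values = data.get(section, {})
        if not isinstance(values, dict):
            raise ValueError(f"{path}: {section} must be an object of metric -> number")
        for metric, threshold in values.items():
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
                raise ValueError(f"{path}: {section}.{metric} must be a number")
    return data
```

Failing at load time gives a clearer error than a `float(threshold)` conversion failure inside the evaluation loop.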
Summary
This PR closes the remaining Phase 0 positioning work and adds the first executable Phase 1 quality gate for evaluation.
It formalizes Jetbot as a Filing-to-Model Copilot / Financial Fact Platform, adds committed benchmark manifest metadata, and makes `scripts/eval.py` fail when metrics miss configured thresholds.
What Changed
- Updated `README.md` to align product positioning with Filing-to-Model Copilot / Financial Fact Platform.
- Updated `docs/financial_fact_platform_roadmap.md` to reflect the post-PR12 state and to sequence next work as Phase 0 closure plus Phase 1 eval gating.
- Added benchmark metadata under `benchmarks/`:
  - `benchmarks/README.md`
  - `benchmarks/manifest.schema.json`
  - `benchmarks/sample_manifest.json`
  - `benchmarks/thresholds/golden_minimum.json`
- Updated `.gitignore` to keep raw/private benchmark data out of git while allowing schemas, manifests, and thresholds.
- Added `--thresholds` support to `scripts/eval.py`.
Why
PR12 added the fact foundation and eval runner, but two gaps remained:
This PR closes both gaps so the next implementation slice can move into correction APIs, review UX, and exports with an actual regression gate in place.
Validation
- `python -m ruff check src tests scripts`
- `python -m mypy src --ignore-missing-imports`
- `python -m pytest -q --timeout=60`
- `python scripts/eval.py --thresholds benchmarks/thresholds/golden_minimum.json --output-dir data/eval-dev`
Follow-up