Implement WP2 baselines and public mini sets #61
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b0f6651c38
```python
summary = _load_summary(Path(run_dir))
metrics = summary.get("metrics", {})
metadata = summary.get("metadata", {}) or {}

baseline = metadata.get("baseline_name") or metadata.get("baseline") or summary.get("model")
model = summary.get("model")
```
**Preserve baseline_name to avoid overwriting scoreboard entries**
The scoreboard key is derived from `summary["metadata"]`, but the summaries generated by `generate_summary` only retain a small set of metadata fields (`git_sha`/`env_version`/etc.) and drop `baseline_name`/`baseline`. As a result, `baseline` falls back to `summary["model"]`, and `entries_by_key` will overwrite results whenever multiple baselines share the same model (e.g., two prompt baselines with different prompts but identical model). This loses runs in the scoreboard and makes the baseline column misleading; you likely need to carry `baseline_name` through summary.json or load metadata.json directly here.
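A minimal, self-contained sketch of the failure mode described above — the summary contents are hypothetical, and `entries_by_key` only mirrors the fallback chain from the diff, not the repo's actual scoreboard code:

```python
# Hypothetical summary.json contents for two runs: generate_summary dropped
# baseline_name, so only the model survives in metadata.
summaries = [
    {"model": "gpt-5-mini", "metadata": {"git_sha": "abc123"}},  # was the e1-prompt baseline
    {"model": "gpt-5-mini", "metadata": {"git_sha": "abc123"}},  # was the e1-cot baseline
]

entries_by_key = {}
for summary in summaries:
    metadata = summary.get("metadata", {}) or {}
    # Fallback chain from the diff: baseline_name is gone, so both runs
    # resolve to the model name and collide on one key.
    baseline = metadata.get("baseline_name") or metadata.get("baseline") or summary.get("model")
    entries_by_key[baseline] = summary  # the second run silently overwrites the first

print(len(entries_by_key))  # prints 1 — both baselines collapsed into "gpt-5-mini"
```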
**Code Review**

I found 2 issues that need to be addressed:

Issue 1: Missing
@claude take another look
Claude finished @intertwine's task in 2m 22s

**Code Review Complete ✅**

Great improvements! The two issues from the previous review have been fixed:
However, I found one critical issue that needs to be addressed:

**Issue: Baseline name lost in summary.json, causing scoreboard overwrites**

Location:
Problem: The
Current flow:
Example scenario:

```
# Both baselines use gpt-5-mini as the model
baseline-e1-prompt-20260204 (model: gpt-5-mini, baseline_name: "e1-prompt")
baseline-e1-cot-20260204 (model: gpt-5-mini, baseline_name: "e1-cot")
# Result: Only one entry in scoreboard, keyed by "gpt-5-mini"
```

Fix: Add `baseline_name` to `metadata_fields`:

```python
metadata_fields = {
    "git_sha": metadata.get("git_sha") or metadata.get("git_commit"),
    "env_version": metadata.get("env_version"),
    "python_version": metadata.get("python_version"),
    "verifiers_version": metadata.get("verifiers_version"),
    "seed": metadata.get("seed"),
    "baseline_name": metadata.get("baseline_name"),  # ADD THIS
    "baseline_type": metadata.get("baseline_type"),  # OPTIONAL: also preserve baseline type
}
```

Alternatively, if you want to keep

Other observations (non-blocking):
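The "load metadata.json directly" alternative suggested in the original Codex comment could look roughly like this sketch — the filename, helper name, and run-directory layout are assumptions, not the repo's actual API:

```python
import json
from pathlib import Path


def resolve_baseline(run_dir: Path):
    """Prefer baseline_name from summary.json metadata, then from the run's
    metadata.json, then fall back to the model name. Layout is assumed."""
    summary = json.loads((run_dir / "summary.json").read_text())
    metadata = summary.get("metadata", {}) or {}

    baseline = metadata.get("baseline_name") or metadata.get("baseline")
    if not baseline:
        # generate_summary may have dropped baseline_name; recover it from
        # the metadata.json written at run time (assumed filename).
        metadata_file = run_dir / "metadata.json"
        if metadata_file.exists():
            run_metadata = json.loads(metadata_file.read_text())
            baseline = run_metadata.get("baseline_name")

    return baseline or summary.get("model")
```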
@claude one last look
Claude finished @intertwine's task in 2m 6s

**Final Code Review ✅**

Todo List:
**Status: One Critical Issue Remains**

**The Issue: Baseline name lost in summary.json (Critical)**

Location:
Current behavior:
Evidence from scoreboards:

```
// bench/scoreboards/e1_scoreboard.json shows:
{"baseline": "gpt-5-mini", "model": "gpt-5-mini", ...}  // Should be "e1-prompt"
{"baseline": "heuristic", "model": "heuristic", ...}    // Should be "e1-heuristic"
```

Impact:
Fix: Add `baseline_name` to `metadata_fields`:

```python
metadata_fields = {
    "git_sha": metadata.get("git_sha") or metadata.get("git_commit"),
    "env_version": metadata.get("env_version"),
    "python_version": metadata.get("python_version"),
    "verifiers_version": metadata.get("verifiers_version"),
    "seed": metadata.get("seed"),
    "baseline_name": metadata.get("baseline_name"),  # ADD THIS
    "baseline_type": metadata.get("baseline_type"),  # OPTIONAL: also preserve type
    "system_prompt_file": metadata.get("system_prompt_file"),  # OPTIONAL: track prompt files
}
```

**What's Working Well ✅**
**Recommendation**

Before merge: Fix the `baseline_name` preservation issue. This is a one-line change that prevents scoreboard corruption when multiple baselines share the same model.

After fix: The PR is ready to merge. The baseline infrastructure is well-designed and the public mini datasets provide a good foundation for WP2.
@claude check again
Claude finished @intertwine's task in 2m 34s

**Final Code Review ✅**

Todo List:
**Status: All Issues Resolved ✅**

The critical

**What Was Fixed Since Last Review**
**Complete PR Review**

**Infrastructure ✅**

Baseline runners (`scripts/baselines/`):
Scoreboard generation (scripts/baselines/update_scoreboard.py):
Evaluation script enhancements:
**Datasets ✅**

Public mini datasets (`datasets/public_mini/`):
**Documentation ✅**

READMEs: Clear instructions in:
Prompts: Well-crafted system prompts in:
**Makefile Integration ✅**

New targets:
**Code Quality ✅**

Previous issues fixed:
No remaining issues found.

**Recommendation**

LGTM - Ready to merge! 🚀 This PR successfully delivers WP2 with:
The `baseline_name` preservation issue has been properly addressed with both a fix in

Note: The roadmap shows WP2 as complete in the header but the checklist still has
## Summary

- add public mini datasets for E1/E2 with generator script
- add baseline runners (E1 heuristic, E2 tool-only) and prompt baseline support
- add baseline Make targets and scoreboard generator
- document baselines and mini sets

## Testing

- make lint
- make format
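As a rough end-to-end illustration of the scoreboard-generation flow this PR adds — aggregating run summaries and keying entries by (baseline, model) so same-model baselines survive — here is a hedged sketch; the directory layout, filenames, and schema are assumptions, not the actual `update_scoreboard.py` implementation:

```python
import json
from pathlib import Path


def build_scoreboard(runs_root: Path) -> list:
    """Aggregate per-run summary.json files into scoreboard entries.

    Keying by (baseline, model) rather than the bare baseline/model string
    avoids the overwrite bug discussed in the review. Layout is illustrative:
    runs_root/<run-name>/summary.json.
    """
    entries_by_key = {}
    for summary_file in sorted(runs_root.glob("*/summary.json")):
        summary = json.loads(summary_file.read_text())
        metadata = summary.get("metadata", {}) or {}
        baseline = (
            metadata.get("baseline_name")
            or metadata.get("baseline")
            or summary.get("model")
        )
        model = summary.get("model")
        entries_by_key[(baseline, model)] = {
            "baseline": baseline,
            "model": model,
            "metrics": summary.get("metrics", {}),
        }
    return list(entries_by_key.values())
```

With `baseline_name` preserved in each summary's metadata, two runs sharing `gpt-5-mini` as the model produce two distinct scoreboard entries instead of one.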