
Implement WP2 baselines and public mini sets #61

Merged
intertwine merged 9 commits into main from codex/wp2-baselines on Feb 5, 2026

Conversation

@intertwine
Owner

Summary

- add public mini datasets for E1/E2 with generator script
- add baseline runners (E1 heuristic, E2 tool-only) and prompt baseline support
- add baseline Make targets and scoreboard generator
- document baselines and mini sets

Testing

- make lint
- make format


@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0f6651c38

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread on scripts/baselines/update_scoreboard.py (outdated)
Comment on lines +114 to +119
summary = _load_summary(Path(run_dir))
metrics = summary.get("metrics", {})
metadata = summary.get("metadata", {}) or {}

baseline = metadata.get("baseline_name") or metadata.get("baseline") or summary.get("model")
model = summary.get("model")

P2: Preserve baseline_name to avoid overwriting scoreboard entries

The scoreboard key is derived from summary['metadata'], but the summaries generated by generate_summary only retain a small set of metadata fields (git_sha/env_version/etc.) and drop baseline_name/baseline. As a result, baseline falls back to summary['model'], and entries_by_key will overwrite results whenever multiple baselines share the same model (e.g., two prompt baselines with different prompts but identical model). This loses runs in the scoreboard and makes the baseline column misleading; you likely need to carry baseline_name through summary.json or load metadata.json directly here.

Useful? React with 👍 / 👎.

@claude

claude Bot commented Feb 4, 2026

Code Review

I found 2 issues that need to be addressed:

Issue 1: Missing sys.path manipulation in baseline scripts

Three new baseline scripts import from local modules (bench.report and scripts.eval_utils) without adding the repository root to sys.path. These scripts will fail with ModuleNotFoundError when executed:

  • scripts/baselines/run_e1_heuristic.py (lines 14-15)
  • scripts/baselines/run_e2_tool_only.py (lines 13-14)
  • scripts/baselines/update_scoreboard.py (line 12)

Problem: The imports come BEFORE REPO_ROOT is defined, and there's no sys.path.insert(0, str(REPO_ROOT)) call.

Fix: Follow the pattern used in existing scripts like scripts/eval_config_verification.py:32-38:

import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parents[2]
sys.path.insert(0, str(REPO_ROOT))

from bench.report import generate_report_md, generate_summary
from scripts.eval_utils import load_api_key_from_env, resolve_model_name

Apply this pattern to all three scripts.


Issue 2: Dead code in E1 heuristic baseline

In scripts/baselines/run_e1_heuristic.py:69-70, there's unreachable code:

if dest_port == 22 and ("FLAGS [S]" in upper or "AUTH_FAILED" in upper):
    return "Malicious", 0.65

This code can never execute because:

  1. Port 22 is in SUSPICIOUS_PORTS, so the check at lines 64-65 returns first with confidence 0.7
  2. "AUTH_FAILED" is in SUSPICIOUS_KEYWORDS, so the check at line 58 returns first with confidence 0.8

Fix options:

  • Remove port 22 from SUSPICIOUS_PORTS if you want the special case at lines 69-70 to execute
  • Remove lines 69-70 if the special case isn't needed
  • Reorder the checks to handle port 22 specially before the general SUSPICIOUS_PORTS check (see the sketch below)
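
A minimal sketch of the reordering option, using the names the review mentions (SUSPICIOUS_KEYWORDS, SUSPICIOUS_PORTS); the classify_line helper, the set contents, and the Benign fallback confidence are invented for illustration and may not match the script:

SUSPICIOUS_KEYWORDS = {"AUTH_FAILED", "EXFIL"}  # illustrative values only
SUSPICIOUS_PORTS = {22, 23, 4444}               # illustrative values only


def classify_line(line: str, dest_port: int) -> tuple[str, float]:
    upper = line.upper()

    # Port-22 special case runs first, so it is actually reachable.
    if dest_port == 22 and ("FLAGS [S]" in upper or "AUTH_FAILED" in upper):
        return "Malicious", 0.65

    # General keyword check (previously shadowed the special case).
    if any(keyword in upper for keyword in SUSPICIOUS_KEYWORDS):
        return "Malicious", 0.8

    # General port check (previously shadowed the special case too).
    if dest_port in SUSPICIOUS_PORTS:
        return "Malicious", 0.7

    return "Benign", 0.6  # fallback confidence is illustrative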

@intertwine
Owner Author

@claude take another look

@claude

claude Bot commented Feb 4, 2026

Claude finished @intertwine's task in 2m 22s (View job)


Code Review Complete ✅

Great improvements! The two issues from the previous review have been fixed:

  1. sys.path manipulation - All three baseline scripts now properly add REPO_ROOT to sys.path
  2. Dead code removal - The unreachable port 22 special case has been removed from the E1 heuristic

However, I found one critical issue that needs to be addressed:


Issue: Baseline name lost in summary.json, causing scoreboard overwrites

Location: bench/report.py:882-889 and scripts/baselines/update_scoreboard.py:120

Problem: The baseline_name field from metadata.json is not preserved when generate_summary() creates summary.json. This causes the scoreboard update script to incorrectly key entries by model name instead of baseline name, resulting in overwrites when multiple baselines share the same model.

Current flow:

  1. Baseline runner writes metadata.json with baseline_name field
  2. generate_summary() creates summary.json but only preserves: git_sha, env_version, python_version, verifiers_version, seed
  3. update_scoreboard.py reads summary.json → its metadata section is missing baseline_name
  4. Line 120 falls back to summary.get("model") → multiple baselines with same model overwrite each other

Example scenario:

# Both baselines use gpt-5-mini as the model
baseline-e1-prompt-20260204 (model: gpt-5-mini, baseline_name: "e1-prompt")
baseline-e1-cot-20260204 (model: gpt-5-mini, baseline_name: "e1-cot")

# Result: Only one entry in scoreboard, keyed by "gpt-5-mini"

Fix: Add baseline_name to the preserved metadata fields in bench/report.py:

metadata_fields = {
    "git_sha": metadata.get("git_sha") or metadata.get("git_commit"),
    "env_version": metadata.get("env_version"),
    "python_version": metadata.get("python_version"),
    "verifiers_version": metadata.get("verifiers_version"),
    "seed": metadata.get("seed"),
    "baseline_name": metadata.get("baseline_name"),  # ADD THIS
    "baseline_type": metadata.get("baseline_type"),  # OPTIONAL: also preserve baseline type
}

Alternatively, if you want to keep summary.json minimal, modify update_scoreboard.py to read baseline_name directly from metadata.json instead of relying on summary.json.
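
A minimal sketch of that alternative, assuming each run directory holds metadata.json alongside summary.json; the helper name resolve_baseline_name is hypothetical, not the repo's API:

import json
from pathlib import Path


def resolve_baseline_name(run_dir: Path, summary: dict) -> str | None:
    """Prefer baseline_name from the summary, else from metadata.json."""
    name = (summary.get("metadata") or {}).get("baseline_name")
    if name:
        return name

    metadata_path = run_dir / "metadata.json"
    if metadata_path.exists():
        metadata = json.loads(metadata_path.read_text())
        return metadata.get("baseline_name") or metadata.get("baseline")

    # Last resort: fall back to the model name, matching current behavior.
    return summary.get("model")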

Fix this issue →


Other observations (non-blocking):

  1. Public mini dataset generation (scripts/data/build_public_mini.py) looks solid - deterministic sampling with balanced E1 splits
  2. Baseline runners are well-structured with proper metadata tracking
  3. Makefile targets provide good UX for running baselines
  4. Documentation in READMEs clearly explains baseline purpose and usage

@intertwine
Owner Author

@claude one last look

@claude

claude Bot commented Feb 5, 2026

Claude finished @intertwine's task in 2m 6s (View job)


Final Code Review ✅

Todo List:

  • Check if baseline_name preservation issue has been fixed
  • Review baseline runner implementations
  • Review scoreboard generation logic
  • Check public mini dataset generation
  • Verify documentation completeness
  • Provide final assessment

Status: One Critical Issue Remains

The baseline_name preservation issue from the previous review is still present. While the previous sys.path and dead code issues have been fixed, the core problem persists.


Issue: Baseline name lost in summary.json (Critical)

Location: bench/report.py:882-889 and scripts/baselines/update_scoreboard.py:120

Current behavior:

  1. ✅ Baseline runners write baseline_name to metadata.json (e.g., "e1-heuristic", "e1-prompt")
  2. ✅ Eval scripts write baseline_name to metadata.json
  3. ❌ generate_summary() in bench/report.py drops baseline_name when creating summary.json
  4. ❌ Scoreboard update script can't find baseline_name in summary, falls back to model name
  5. ❌ Multiple baselines with same model overwrite each other in scoreboard

Evidence from scoreboards:

// bench/scoreboards/e1_scoreboard.json shows:
{"baseline": "gpt-5-mini", "model": "gpt-5-mini", ...}  // Should be "e1-prompt"
{"baseline": "heuristic", "model": "heuristic", ...}    // Should be "e1-heuristic"

Impact:

  • Current scoreboards use model names instead of baseline names
  • If you run two different prompt baselines with the same model (e.g., "e1-prompt-v1" and "e1-prompt-v2", both using gpt-5-mini), the second will overwrite the first
  • Baseline comparison becomes ambiguous

Fix: Add baseline_name to preserved metadata fields in bench/report.py:882-889:

metadata_fields = {
    "git_sha": metadata.get("git_sha") or metadata.get("git_commit"),
    "env_version": metadata.get("env_version"),
    "python_version": metadata.get("python_version"),
    "verifiers_version": metadata.get("verifiers_version"),
    "seed": metadata.get("seed"),
    "baseline_name": metadata.get("baseline_name"),      # ADD THIS
    "baseline_type": metadata.get("baseline_type"),      # OPTIONAL: also preserve type
    "system_prompt_file": metadata.get("system_prompt_file"),  # OPTIONAL: track prompt files
}

Fix this issue →


What's Working Well ✅

  1. Baseline infrastructure is solid:

    • run_e1_heuristic.py: Clean heuristic implementation with keyword/port detection
    • run_e2_tool_only.py: Proper tool-only oracle extraction
    • Both write complete metadata including baseline_name/baseline_type
  2. Public mini datasets:

    • build_public_mini.py: Deterministic balanced sampling
    • Good metadata tagging (source_dataset, public_mini flag)
    • Proper documentation
  3. Makefile integration:

    • Clean baseline-e1/baseline-e2 targets
    • Proper timestamp generation and run ID management
    • Scoreboard auto-update after each run
  4. Documentation:

    • Clear READMEs in baselines/, bench/scoreboards/, datasets/public_mini/
    • Usage examples provided
  5. Previous issues fixed:

    • ✅ sys.path manipulation added to all baseline scripts
    • ✅ Dead port 22 code removed from E1 heuristic

Recommendation

Before merge: Fix the baseline_name preservation issue. This is a one-line change that prevents scoreboard corruption when multiple baselines share the same model.

After fix: The PR is ready to merge. The baseline infrastructure is well-designed and the public mini datasets provide a good foundation for WP2.


@intertwine
Owner Author

@claude check again

@claude

claude Bot commented Feb 5, 2026

Claude finished @intertwine's task in 2m 34s (View job)


Final Code Review ✅

Todo List:

  • Check if baseline_name preservation issue has been fixed in bench/report.py
  • Verify scoreboard generation logic handles baseline_name correctly
  • Review baseline runner implementations
  • Check scoreboard files for correct baseline names
  • Provide final assessment

Status: All Issues Resolved ✅

The critical baseline_name preservation issue from previous reviews has been completely fixed. The PR is now ready for merge.


What Was Fixed Since Last Review

  1. ✅ baseline_name preservation in bench/report.py

    • Lines 888-890 now preserve: baseline_name, baseline_type, and system_prompt_file
    • These fields are correctly included in the metadata_clean dict that goes into summary.json
  2. ✅ Scoreboard update script robustness

    • Lines 127-128 in update_scoreboard.py provide fallback to metadata.json if baseline_name is missing
    • This ensures backwards compatibility with older runs
  3. ✅ Scoreboard files show correct baseline names

    • E1: "e1-heuristic", "e1-prompt" (not "heuristic", "gpt-5-mini")
    • E2: "e2-prompt", "e2-tool-only" (not "gpt-5-mini", "tool-only")

Complete PR Review

Infrastructure ✅

Baseline runners (scripts/baselines/):

  • run_e1_heuristic.py: Rule-based baseline with keyword/port detection
  • run_e2_tool_only.py: Oracle extraction from tool findings
  • Both correctly set baseline_name and baseline_type in metadata

Scoreboard generation (scripts/baselines/update_scoreboard.py):

  • Merges new runs into existing scoreboard
  • Proper deduplication by (baseline, model, dataset, n); see the sketch after this list
  • Markdown table generation with all key metrics
  • Robust fallback for missing metadata fields
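
An illustrative sketch of that deduplication keying; entries_by_key echoes the earlier Codex comment, while merge_entries and the entry field names are assumptions about the scoreboard schema, not the script's actual code:

def merge_entries(existing: list[dict], new: list[dict]) -> list[dict]:
    """Merge scoreboard entries, letting newer runs replace older ones."""
    entries_by_key: dict[tuple, dict] = {}
    for entry in existing + new:
        key = (
            entry.get("baseline"),
            entry.get("model"),
            entry.get("dataset"),
            entry.get("n"),
        )
        entries_by_key[key] = entry  # later entries win for the same key
    return list(entries_by_key.values())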

Evaluation script enhancements:

  • eval_network_logs.py and eval_config_verification.py both support --baseline-name
  • Passed through to build_base_metadata() for proper tracking

Datasets ✅

Public mini datasets (datasets/public_mini/):

  • e1.jsonl: 100 items, balanced benign/malicious sampling
  • e2.jsonl: 100 items, 80 k8s + 20 terraform
  • sampling-public-mini.json: Reproducible metadata (seed=42)
  • build_public_mini.py: Deterministic generator with proper metadata tagging (see the sampling sketch below)
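
A minimal sketch of deterministic balanced sampling with seed=42, in the spirit of build_public_mini.py; the function name and the label field are assumptions, not the script's actual interface:

import random


def sample_balanced(items: list[dict], n: int, seed: int = 42) -> list[dict]:
    """Draw n items, split evenly between benign and malicious labels."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    benign = [item for item in items if item["label"] == "Benign"]
    malicious = [item for item in items if item["label"] == "Malicious"]
    half = n // 2
    return rng.sample(benign, half) + rng.sample(malicious, n - half)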

Documentation ✅

READMEs: Clear instructions in:

  • baselines/README.md: Overview of all 4 baselines
  • datasets/public_mini/README.md: Source datasets and regeneration
  • bench/scoreboards/README.md: Update process

Prompts: Well-crafted system prompts in:

  • baselines/e1_prompt/system_prompt.txt: Clear classification guidance
  • baselines/e2_prompt/system_prompt.txt: Tool-first security auditing

Makefile Integration ✅

New targets:

  • make baseline-e1: Runs heuristic + prompt baselines, updates scoreboard
  • make baseline-e2: Runs tool-only + prompt baselines, updates scoreboard
  • Proper timestamp generation and run ID management

Code Quality ✅

Previous issues fixed:

  • ✅ sys.path manipulation added to all baseline scripts
  • ✅ Dead port 22 code removed from E1 heuristic
  • ✅ baseline_name preservation in summary.json

No remaining issues found


Recommendation

LGTM - Ready to merge! 🚀

This PR successfully delivers WP2 with:

  • ✅ Complete baseline infrastructure (4 baselines across E1/E2)
  • ✅ Public mini datasets (100 items each, safe for CI/quick testing)
  • ✅ Automated scoreboard generation
  • ✅ Clean Makefile integration
  • ✅ Comprehensive documentation

The baseline_name preservation issue has been properly addressed with both a fix in generate_summary() and a robust fallback in the scoreboard update script.


Note: The roadmap shows WP2 as complete in the header but the checklist still has [ ] instead of [x]. Consider updating line 95 in plans/ROADMAP-Q1-2026.md to mark WP2 complete.


@intertwine intertwine merged commit 5133025 into main on Feb 5, 2026
5 checks passed
@intertwine intertwine deleted the codex/wp2-baselines branch February 5, 2026 02:58