
Implement WP2 baselines and public mini sets #61

Merged
intertwine merged 9 commits into main from codex/wp2-baselines on Feb 5, 2026

Conversation

@intertwine
Owner

Summary

- add public mini datasets for E1/E2 with generator script
- add baseline runners (E1 heuristic, E2 tool-only) and prompt baseline support
- add baseline Make targets and scoreboard generator
- document baselines and mini sets

Testing

- make lint
- make format


@chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0f6651c38

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you:

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread on scripts/baselines/update_scoreboard.py (outdated)
Comment on lines +114 to +119
summary = _load_summary(Path(run_dir))
metrics = summary.get("metrics", {})
metadata = summary.get("metadata", {}) or {}

baseline = metadata.get("baseline_name") or metadata.get("baseline") or summary.get("model")
model = summary.get("model")

P2: Preserve baseline_name to avoid overwriting scoreboard entries

The scoreboard key is derived from summary['metadata'], but the summaries generated by generate_summary only retain a small set of metadata fields (git_sha/env_version/etc.) and drop baseline_name/baseline. As a result, baseline falls back to summary['model'], and entries_by_key will overwrite results whenever multiple baselines share the same model (e.g., two prompt baselines with different prompts but identical model). This loses runs in the scoreboard and makes the baseline column misleading; you likely need to carry baseline_name through summary.json or load metadata.json directly here.

Useful? React with 👍 / 👎.

@claude

claude Bot commented Feb 4, 2026

Code Review

I found 2 issues that need to be addressed:

Issue 1: Missing sys.path manipulation in baseline scripts

Three new baseline scripts import from local modules (bench.report and scripts.eval_utils) without adding the repository root to sys.path. These scripts will fail with ModuleNotFoundError when executed:

  • scripts/baselines/run_e1_heuristic.py (lines 14-15)
  • scripts/baselines/run_e2_tool_only.py (lines 13-14)
  • scripts/baselines/update_scoreboard.py (line 12)

Problem: The imports come BEFORE REPO_ROOT is defined, and there's no sys.path.insert(0, str(REPO_ROOT)) call.

Fix: Follow the pattern used in existing scripts like scripts/eval_config_verification.py:32-38:

import sys
from pathlib import Path

REPO_ROOT = Path(__file__).resolve().parents[2]
sys.path.insert(0, str(REPO_ROOT))

from bench.report import generate_report_md, generate_summary
from scripts.eval_utils import load_api_key_from_env, resolve_model_name

Apply this pattern to all three scripts.


Issue 2: Dead code in E1 heuristic baseline

In scripts/baselines/run_e1_heuristic.py:69-70, there's unreachable code:

if dest_port == 22 and ("FLAGS [S]" in upper or "AUTH_FAILED" in upper):
    return "Malicious", 0.65

This code can never execute because:

  1. Port 22 is in SUSPICIOUS_PORTS, so the check at lines 64-65 returns first with confidence 0.7
  2. "AUTH_FAILED" is in SUSPICIOUS_KEYWORDS, so the check at line 58 returns first with confidence 0.8

Fix options:

  • Remove port 22 from SUSPICIOUS_PORTS if you want the special case at lines 69-70 to execute
  • Remove lines 69-70 if the special case isn't needed
  • Reorder the checks to handle port 22 specially before the general SUSPICIOUS_PORTS check (see the sketch below)
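
A minimal sketch of the reordering option, using the names the review mentions (SUSPICIOUS_KEYWORDS, SUSPICIOUS_PORTS); the classify_line helper, the set contents, and the Benign fallback confidence are invented for illustration and may not match the script:

SUSPICIOUS_KEYWORDS = {"AUTH_FAILED", "EXFIL"}  # illustrative values only
SUSPICIOUS_PORTS = {22, 23, 4444}               # illustrative values only


def classify_line(line: str, dest_port: int) -> tuple[str, float]:
    upper = line.upper()

    # Port-22 special case runs first, so it is actually reachable.
    if dest_port == 22 and ("FLAGS [S]" in upper or "AUTH_FAILED" in upper):
        return "Malicious", 0.65

    # General keyword check (previously shadowed the special case).
    if any(keyword in upper for keyword in SUSPICIOUS_KEYWORDS):
        return "Malicious", 0.8

    # General port check (previously shadowed the special case too).
    if dest_port in SUSPICIOUS_PORTS:
        return "Malicious", 0.7

    return "Benign", 0.6  # fallback confidence is illustrative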

@intertwine
Owner Author

@claude take another look

@claude

claude Bot commented Feb 4, 2026

Claude finished @intertwine's task in 2m 22s (View job)


Code Review Complete ✅

Great improvements! The two issues from the previous review have been fixed:

  1. sys.path manipulation - All three baseline scripts now properly add REPO_ROOT to sys.path
  2. Dead code removal - The unreachable port 22 special case has been removed from the E1 heuristic

However, I found one critical issue that needs to be addressed:


Issue: Baseline name lost in summary.json, causing scoreboard overwrites

Location: bench/report.py:882-889 and scripts/baselines/update_scoreboard.py:120

Problem: The baseline_name field from metadata.json is not preserved when generate_summary() creates summary.json. This causes the scoreboard update script to incorrectly key entries by model name instead of baseline name, resulting in overwrites when multiple baselines share the same model.

Current flow:

  1. Baseline runner writes metadata.json with baseline_name field
  2. generate_summary() creates summary.json but only preserves: git_sha, env_version, python_version, verifiers_version, seed
  3. update_scoreboard.py reads summary.json → its metadata section is missing baseline_name
  4. Line 120 falls back to summary.get("model") → multiple baselines with same model overwrite each other

Example scenario:

# Both baselines use gpt-5-mini as the model
baseline-e1-prompt-20260204 (model: gpt-5-mini, baseline_name: "e1-prompt")
baseline-e1-cot-20260204 (model: gpt-5-mini, baseline_name: "e1-cot")

# Result: Only one entry in scoreboard, keyed by "gpt-5-mini"

Fix: Add baseline_name to the preserved metadata fields in bench/report.py:

metadata_fields = {
    "git_sha": metadata.get("git_sha") or metadata.get("git_commit"),
    "env_version": metadata.get("env_version"),
    "python_version": metadata.get("python_version"),
    "verifiers_version": metadata.get("verifiers_version"),
    "seed": metadata.get("seed"),
    "baseline_name": metadata.get("baseline_name"),  # ADD THIS
    "baseline_type": metadata.get("baseline_type"),  # OPTIONAL: also preserve baseline type
}

Alternatively, if you want to keep summary.json minimal, modify update_scoreboard.py to read baseline_name directly from metadata.json instead of relying on summary.json.
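
A minimal sketch of that alternative, assuming each run directory holds metadata.json alongside summary.json; the helper name resolve_baseline_name is hypothetical, not the repo's API:

import json
from pathlib import Path


def resolve_baseline_name(run_dir: Path, summary: dict) -> str | None:
    """Prefer baseline_name from the summary, else from metadata.json."""
    name = (summary.get("metadata") or {}).get("baseline_name")
    if name:
        return name

    metadata_path = run_dir / "metadata.json"
    if metadata_path.exists():
        metadata = json.loads(metadata_path.read_text())
        return metadata.get("baseline_name") or metadata.get("baseline")

    # Last resort: fall back to the model name, matching current behavior.
    return summary.get("model")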

Fix this issue →


Other observations (non-blocking):

  1. Public mini dataset generation (scripts/data/build_public_mini.py) looks solid - deterministic sampling with balanced E1 splits
  2. Baseline runners are well-structured with proper metadata tracking
  3. Makefile targets provide good UX for running baselines
  4. Documentation in READMEs clearly explains baseline purpose and usage

@intertwine
Owner Author

@claude one last look

@claude

claude Bot commented Feb 5, 2026

Claude finished @intertwine's task in 2m 6s (View job)


Final Code Review ✅

Todo List:

  • Check if baseline_name preservation issue has been fixed
  • Review baseline runner implementations
  • Review scoreboard generation logic
  • Check public mini dataset generation
  • Verify documentation completeness
  • Provide final assessment

Status: One Critical Issue Remains

The baseline_name preservation issue from the previous review is still present. While the previous sys.path and dead code issues have been fixed, the core problem persists.


Issue: Baseline name lost in summary.json (Critical)

Location: bench/report.py:882-889 and scripts/baselines/update_scoreboard.py:120

Current behavior:

  1. ✅ Baseline runners write baseline_name to metadata.json (e.g., "e1-heuristic", "e1-prompt")
  2. ✅ Eval scripts write baseline_name to metadata.json
  3. ❌ generate_summary() in bench/report.py drops baseline_name when creating summary.json
  4. ❌ Scoreboard update script can't find baseline_name in summary, falls back to model name
  5. ❌ Multiple baselines with same model overwrite each other in scoreboard

Evidence from scoreboards:

// bench/scoreboards/e1_scoreboard.json shows:
{"baseline": "gpt-5-mini", "model": "gpt-5-mini", ...}  // Should be "e1-prompt"
{"baseline": "heuristic", "model": "heuristic", ...}    // Should be "e1-heuristic"

Impact:

  • Current scoreboards use model names instead of baseline names
  • If you run two different prompt baselines with the same model (e.g., "e1-prompt-v1" and "e1-prompt-v2", both using gpt-5-mini), the second will overwrite the first
  • Baseline comparison becomes ambiguous

Fix: Add baseline_name to preserved metadata fields in bench/report.py:882-889:

metadata_fields = {
    "git_sha": metadata.get("git_sha") or metadata.get("git_commit"),
    "env_version": metadata.get("env_version"),
    "python_version": metadata.get("python_version"),
    "verifiers_version": metadata.get("verifiers_version"),
    "seed": metadata.get("seed"),
    "baseline_name": metadata.get("baseline_name"),      # ADD THIS
    "baseline_type": metadata.get("baseline_type"),      # OPTIONAL: also preserve type
    "system_prompt_file": metadata.get("system_prompt_file"),  # OPTIONAL: track prompt files
}

Fix this issue →


What's Working Well ✅

  1. Baseline infrastructure is solid:

    • run_e1_heuristic.py: Clean heuristic implementation with keyword/port detection
    • run_e2_tool_only.py: Proper tool-only oracle extraction
    • Both write complete metadata including baseline_name/baseline_type
  2. Public mini datasets:

    • build_public_mini.py: Deterministic balanced sampling
    • Good metadata tagging (source_dataset, public_mini flag)
    • Proper documentation
  3. Makefile integration:

    • Clean baseline-e1/baseline-e2 targets
    • Proper timestamp generation and run ID management
    • Scoreboard auto-update after each run
  4. Documentation:

    • Clear READMEs in baselines/, bench/scoreboards/, datasets/public_mini/
    • Usage examples provided
  5. Previous issues fixed:

    • ✅ sys.path manipulation added to all baseline scripts
    • ✅ Dead port 22 code removed from E1 heuristic

Recommendation

Before merge: Fix the baseline_name preservation issue. This is a one-line change that prevents scoreboard corruption when multiple baselines share the same model.

After fix: The PR is ready to merge. The baseline infrastructure is well-designed and the public mini datasets provide a good foundation for WP2.


@intertwine
Owner Author

@claude check again

@claude

claude Bot commented Feb 5, 2026

Claude finished @intertwine's task in 2m 34s (View job)


Final Code Review ✅

Todo List:

  • Check if baseline_name preservation issue has been fixed in bench/report.py
  • Verify scoreboard generation logic handles baseline_name correctly
  • Review baseline runner implementations
  • Check scoreboard files for correct baseline names
  • Provide final assessment

Status: All Issues Resolved ✅

The critical baseline_name preservation issue from previous reviews has been completely fixed. The PR is now ready for merge.


What Was Fixed Since Last Review

  1. ✅ baseline_name preservation in bench/report.py

    • Lines 888-890 now preserve: baseline_name, baseline_type, and system_prompt_file
    • These fields are correctly included in the metadata_clean dict that goes into summary.json
  2. ✅ Scoreboard update script robustness

    • Lines 127-128 in update_scoreboard.py provide fallback to metadata.json if baseline_name is missing
    • This ensures backwards compatibility with older runs
  3. ✅ Scoreboard files show correct baseline names

    • E1: "e1-heuristic", "e1-prompt" (not "heuristic", "gpt-5-mini")
    • E2: "e2-prompt", "e2-tool-only" (not "gpt-5-mini", "tool-only")

Complete PR Review

Infrastructure ✅

Baseline runners (scripts/baselines/):

  • run_e1_heuristic.py: Rule-based baseline with keyword/port detection
  • run_e2_tool_only.py: Oracle extraction from tool findings
  • Both correctly set baseline_name and baseline_type in metadata

Scoreboard generation (scripts/baselines/update_scoreboard.py):

  • Merges new runs into existing scoreboard
  • Proper deduplication by (baseline, model, dataset, n); see the sketch after this list
  • Markdown table generation with all key metrics
  • Robust fallback for missing metadata fields
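
An illustrative sketch of that deduplication keying; entries_by_key echoes the earlier Codex comment, while merge_entries and the entry field names are assumptions about the scoreboard schema, not the script's actual code:

def merge_entries(existing: list[dict], new: list[dict]) -> list[dict]:
    """Merge scoreboard entries, letting newer runs replace older ones."""
    entries_by_key: dict[tuple, dict] = {}
    for entry in existing + new:
        key = (
            entry.get("baseline"),
            entry.get("model"),
            entry.get("dataset"),
            entry.get("n"),
        )
        entries_by_key[key] = entry  # later entries win for the same key
    return list(entries_by_key.values())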

Evaluation script enhancements:

  • eval_network_logs.py and eval_config_verification.py both support --baseline-name
  • Passed through to build_base_metadata() for proper tracking

Datasets ✅

Public mini datasets (datasets/public_mini/):

  • e1.jsonl: 100 items, balanced benign/malicious sampling
  • e2.jsonl: 100 items, 80 k8s + 20 terraform
  • sampling-public-mini.json: Reproducible metadata (seed=42)
  • build_public_mini.py: Deterministic generator with proper metadata tagging (see the sampling sketch below)
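
A minimal sketch of deterministic balanced sampling with seed=42, in the spirit of build_public_mini.py; the function name and the label field are assumptions, not the script's actual interface:

import random


def sample_balanced(items: list[dict], n: int, seed: int = 42) -> list[dict]:
    """Draw n items, split evenly between benign and malicious labels."""
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    benign = [item for item in items if item["label"] == "Benign"]
    malicious = [item for item in items if item["label"] == "Malicious"]
    half = n // 2
    return rng.sample(benign, half) + rng.sample(malicious, n - half)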

Documentation ✅

READMEs: Clear instructions in:

  • baselines/README.md: Overview of all 4 baselines
  • datasets/public_mini/README.md: Source datasets and regeneration
  • bench/scoreboards/README.md: Update process

Prompts: Well-crafted system prompts in:

  • baselines/e1_prompt/system_prompt.txt: Clear classification guidance
  • baselines/e2_prompt/system_prompt.txt: Tool-first security auditing

Makefile Integration ✅

New targets:

  • make baseline-e1: Runs heuristic + prompt baselines, updates scoreboard
  • make baseline-e2: Runs tool-only + prompt baselines, updates scoreboard
  • Proper timestamp generation and run ID management

Code Quality ✅

Previous issues fixed:

  • ✅ sys.path manipulation added to all baseline scripts
  • ✅ Dead port 22 code removed from E1 heuristic
  • ✅ baseline_name preservation in summary.json

No remaining issues found


Recommendation

LGTM - Ready to merge! 🚀

This PR successfully delivers WP2 with:

  • ✅ Complete baseline infrastructure (4 baselines across E1/E2)
  • ✅ Public mini datasets (100 items each, safe for CI/quick testing)
  • ✅ Automated scoreboard generation
  • ✅ Clean Makefile integration
  • ✅ Comprehensive documentation

The baseline_name preservation issue has been properly addressed with both a fix in generate_summary() and a robust fallback in the scoreboard update script.


Note: The roadmap shows WP2 as complete in the header but the checklist still has [ ] instead of [x]. Consider updating line 95 in plans/ROADMAP-Q1-2026.md to mark WP2 complete.


@intertwine intertwine merged commit 5133025 into main on Feb 5, 2026
5 checks passed
@intertwine intertwine deleted the codex/wp2-baselines branch February 5, 2026 02:58