v0.1.1: seed sanity check + correction

@immu4989 immu4989 released this 26 Jun 13:43
· 4 commits to main since this release

This is a methodology release, not a feature release. The workspace result in v0.1.0 came from a single-seed run at small N (N=5 user tasks). This release adds a 3-seed sanity check and a correction note.

What changed

The seed-0 optimizer ordering reported in v0.1.0 (bootstrap > mipro > gepa) does not hold across seeds. Aggregated over seeds {0, 1, 2}, BootstrapFewShot has the lowest security on important_instructions (0.600), while MIPROv2 and GEPA tie at 0.733. Standard deviations are 0.4 to 0.5, so at this scale the rankings are dominated by noise.
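To make the noise claim concrete, here is a rough sketch that expresses the 0.600 vs 0.733 gap in units of the standard error of the difference. The means and the 0.4 lower bound on the standard deviation come from the text above; the sample count per cell is not stated in these notes, so the two values of `n` tried here (3 seed-level means, or 15 pooled task outcomes) are assumptions, as is the helper name `gap_in_standard_errors`.

```python
import math


def gap_in_standard_errors(mean_a, mean_b, std, n):
    """Return |mean_a - mean_b| divided by the standard error of the difference.

    Simplifying assumption: both cells share the same std and the same
    number of independent samples n.
    """
    se_diff = math.sqrt(2.0 * std ** 2 / n)
    return abs(mean_a - mean_b) / se_diff


# Means are the reported important_instructions security values.
# std=0.4 is the optimistic end of the reported 0.4 to 0.5 range.
# n is an assumption: 3 (one value per seed) or 15 (3 seeds x 5 tasks).
for n in (3, 15):
    ratio = gap_in_standard_errors(0.600, 0.733, std=0.4, n=n)
    print(f"n={n}: gap is {ratio:.2f} standard errors")
```

Under either assumption the gap stays below one standard error, which is consistent with reading the ordering as noise rather than signal.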

What does hold across seeds:

  • BootstrapFewShot Pareto-dominates on the direct attack (60% utility, 100% security).
  • The unoptimized baseline gets 0% utility on every seed.
  • Every optimizer trends below the unoptimized 80% security baseline on important_instructions, though the differences are within the std bars.

New artifacts

  • scripts/run_v02_phase1.py — single-seed GEPA addition to the optimizer comparison.
  • scripts/run_v02_phase1_seeds.py — re-runs the stochastic optimizers with additional seeds and aggregates mean ± std per (optimizer, attack) cell.
  • data/results/workspace_v02_phase1_seed1_results.csv
  • data/results/workspace_v02_phase1_seed2_results.csv
  • data/results/workspace_v02_phase1_seeds_all.csv
  • data/results/workspace_v02_phase1_seeds_summary.csv
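The "mean ± std per (optimizer, attack) cell" aggregation mentioned above could look roughly like the following. This is a standalone sketch, not the code in scripts/run_v02_phase1_seeds.py: the column names (`optimizer`, `attack`, `seed`, `security`), the `summarize` helper, and the placeholder rows are all hypothetical, and the values are dummy data rather than real results.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(rows, metric="security"):
    """Group rows by (optimizer, attack) and return {cell: (mean, sample std)}."""
    cells = defaultdict(list)
    for row in rows:
        cells[(row["optimizer"], row["attack"])].append(float(row[metric]))

    summary = {}
    for key, values in sorted(cells.items()):
        # Sample std needs at least two values; report 0.0 for a single value.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[key] = (mean(values), spread)
    return summary


# Placeholder rows with made-up values, only to exercise the function.
rows = [
    {"optimizer": "opt_a", "attack": "attack_x", "seed": 0, "security": 1.0},
    {"optimizer": "opt_a", "attack": "attack_x", "seed": 1, "security": 0.0},
    {"optimizer": "opt_a", "attack": "attack_x", "seed": 2, "security": 1.0},
    {"optimizer": "opt_b", "attack": "attack_x", "seed": 0, "security": 1.0},
]

for (optimizer, attack), (m, s) in summarize(rows).items():
    print(f"{optimizer:8s} {attack:10s} {m:.3f} ± {s:.3f}")
```

In practice the rows would be read from the per-seed CSVs with `csv.DictReader`, assuming they carry comparable columns.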

Other notes

  • README has an update callout at the top of the v0.1 results section.
  • Substack and Medium versions of the launch blog have matching update notes pinned at the top.
  • The original v0.1 results table, charts, and numbers are preserved unchanged. The sanity check is additive.

What's next

v0.2 phase 2 will scale N from 5 to roughly 20 user tasks per cell across all four AgentDojo suites (workspace, banking, travel, slack), three seeds, and four attacks (direct, important_instructions, tool_knowledge, ignore_previous). That's the experiment that puts any optimizer-ranking claim on defensible statistical ground.
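As a rough sense of scale, the planned grid can be enumerated as below. The suite and attack names are taken from the paragraph above; reusing seeds 0, 1, 2, treating "roughly 20" as exactly 20, and counting one cell per (suite, seed, attack) combination for each optimizer are assumptions made for illustration.

```python
from itertools import product

SUITES = ("workspace", "banking", "travel", "slack")
SEEDS = (0, 1, 2)  # assumption: same seeds as the v0.1.1 sanity check
ATTACKS = ("direct", "important_instructions", "tool_knowledge", "ignore_previous")
TASKS_PER_CELL = 20  # "roughly 20" in the notes; the exact count is an assumption

# One entry per (suite, seed, attack) combination, for a single optimizer.
grid = list(product(SUITES, SEEDS, ATTACKS))

print(f"combinations per optimizer: {len(grid)}")
print(f"task runs per optimizer:    {len(grid) * TASKS_PER_CELL}")
```

Multiplying by the number of optimizers under comparison gives the total run budget for the phase.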