multivon-eval 0.12.0 — persona simulator + scaled gated generation
Two features adapted from the synthetic-eval-data space (issues #10/#11) — with the
part vendors leave out: validation, provenance, and labeled uncertainty.
Added
- multivon-eval simulate: persona-driven adaptive multi-turn evaluation (#10).
Static scripts assume a fixed conversation path; the simulator drives one live: a
persona LLM (profile, goal, success criteria, behavior traits) converses with your
model_fn, adapting each turn, stopping on goal-reached / refusal / max_turns /
budget. Transcripts become conversation-shaped EvalCases scored by the existing
conversation evaluators plus a goal-completion judge. Personas come from a JSONL or
propose_personas() (one LLM call, always includes an adversarial persona).
Honesty contract, test-pinned: every output carries "simulated personas — measures
behavior under synthetic users, not real traffic"; hard budget_usd ceiling with
pre-spend estimate (personas cut off carry stop_reason="budget_exceeded", partials
never lost); judge model/temperature recorded, NO determinism claim. Recorder
synergy: each conversation binds its case_uid, so --record-prompts during
simulation yields observed case→site bindings: simulation with provenance.
- Scaled + gated case generation (#11).
bootstrap --n-seed-cases now works up to
500 (batched ≤30/call, later batches steered away from already-accepted inputs),
and every generated case passes gates: well-formed (structural), duplicate
(NFC-loose-normalize identity OR token-Jaccard ≥ 0.85, cross-batch), and, behind
--validate-cases --baseline-model-file, the 0.8.0 hardness band via
auto.validate_adversarial_cases. No silent caps: BootstrapResult.generation_report
and a DISCOVERY_REPORT "Case generation" section print "generated N, accepted M —
dropped k duplicates, j malformed[, i outside hardness band]"; a skipped hardness
gate says so. Per-case metadata["generation"] carries batch, gates passed, and
hardness. New --budget-usd pre-spend ceiling (estimate checked before any LLM call).
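
For readers who want a feel for the duplicate gate described above, here is a minimal sketch in plain Python. It is not the library's implementation: the function names (loose_normalize, token_jaccard, is_duplicate, gate_batches) are hypothetical, and the meaning of "loose" normalization (NFC plus casefolding and whitespace collapsing) and whitespace tokenization are assumptions. Only the 0.85 threshold and the cross-batch behavior come from the notes.

```python
import unicodedata
from typing import Iterable, List, Tuple

JACCARD_THRESHOLD = 0.85  # threshold quoted in the release notes


def loose_normalize(text: str) -> str:
    # Assumption: "loose" means NFC + casefold + collapsed whitespace.
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def token_jaccard(a: str, b: str) -> float:
    # Jaccard similarity over whitespace-separated token sets (assumed tokenizer).
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    if not union:
        return 1.0  # two empty inputs are treated as identical
    return len(tokens_a & tokens_b) / len(union)


def is_duplicate(candidate: str, accepted: Iterable[str]) -> bool:
    # Duplicate if normalized texts are identical OR token overlap is high enough.
    cand = loose_normalize(candidate)
    for prev in accepted:
        prev_norm = loose_normalize(prev)
        if cand == prev_norm or token_jaccard(cand, prev_norm) >= JACCARD_THRESHOLD:
            return True
    return False


def gate_batches(batches: Iterable[Iterable[str]]) -> Tuple[List[str], int]:
    # Cross-batch: each candidate is compared against everything accepted so far,
    # including cases accepted from earlier batches.
    accepted: List[str] = []
    dropped = 0
    for batch in batches:
        for text in batch:
            if is_duplicate(text, accepted):
                dropped += 1
            else:
                accepted.append(text)
    return accepted, dropped


if __name__ == "__main__":
    batches = [
        ["Reset my password", "What is the refund policy?"],
        ["reset  my PASSWORD", "How do I export data?"],
    ]
    accepted, dropped = gate_batches(batches)
    print(len(accepted), "accepted;", dropped, "dropped as duplicates")
```

Keeping a single accepted list across batches is what makes the check cross-batch; the pairwise comparison is quadratic, which is unlikely to matter at a 500-case ceiling.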
Notes
- 54 new tests (25 simulator, 29 generation/gates); 1092 green on the tracked suite.
- Both features verified live before release: a haiku-driven persona reached its goal
and scored 1.00/0.67/1.00 on the conversation evaluators; a 60-case bootstrap
produced the batch-accounting report end-to-end.
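
The simulator's stop conditions listed under Added (goal-reached, refusal, max_turns, budget) can be pictured with a small loop. This is a sketch under stated assumptions, not the shipped code: the Transcript class, the simulate signature, the callables persona_next, is_refusal, and goal_reached, and the flat per-turn cost are all hypothetical. Only stop_reason="budget_exceeded" is a value named in the notes; the other stop_reason strings are illustrative labels.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (persona message, model reply)


@dataclass
class Transcript:
    turns: List[Turn] = field(default_factory=list)
    spent_usd: float = 0.0
    stop_reason: str = ""


def simulate(
    persona_next: Callable[[List[Turn]], str],   # hypothetical: persona picks its next message
    model_fn: Callable[[str], str],              # assumed here to map a user message to a reply
    is_refusal: Callable[[str], bool],           # hypothetical refusal detector
    goal_reached: Callable[[List[Turn]], bool],  # hypothetical goal check
    max_turns: int,
    budget_usd: float,
    cost_per_turn_usd: float,                    # assumed flat estimate per turn
) -> Transcript:
    t = Transcript()
    for _ in range(max_turns):
        # Pre-spend check: stop before a turn that would exceed the ceiling.
        if t.spent_usd + cost_per_turn_usd > budget_usd:
            t.stop_reason = "budget_exceeded"
            return t  # turns recorded so far are kept
        user_msg = persona_next(t.turns)
        reply = model_fn(user_msg)
        t.turns.append((user_msg, reply))
        t.spent_usd += cost_per_turn_usd
        if is_refusal(reply):
            t.stop_reason = "refusal"
            return t
        if goal_reached(t.turns):
            t.stop_reason = "goal_reached"
            return t
    t.stop_reason = "max_turns"
    return t


if __name__ == "__main__":
    result = simulate(
        persona_next=lambda history: f"question {len(history) + 1}",
        model_fn=lambda msg: "answer to " + msg,
        is_refusal=lambda reply: reply.startswith("I can't"),
        goal_reached=lambda history: len(history) >= 2,
        max_turns=5,
        budget_usd=1.0,
        cost_per_turn_usd=0.01,
    )
    print(result.stop_reason, len(result.turns))
```

Checking the estimate before each turn, rather than after, is what lets a cut-off persona return its partial transcript without overspending.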