multivon-eval 0.12.0 — persona simulator + scaled gated generation

siddharthsrivastava released this 12 Jun 22:09

Two features adapted from the synthetic-eval-data space (issues #10/#11), plus the
parts vendors leave out: validation, provenance, and labeled uncertainty.

Added

  • multivon-eval simulate — persona-driven adaptive multi-turn evaluation (#10).
    Static scripts assume a fixed conversation path; the simulator drives one live. A
    persona LLM (profile, goal, success criteria, behavior traits) converses with your
    model_fn, adapting each turn and stopping on goal-reached, refusal, max_turns, or
    budget. Transcripts become conversation-shaped EvalCases, scored by the existing
    conversation evaluators plus a goal-completion judge. Personas come from a JSONL
    file or from propose_personas() (one LLM call, always includes an adversarial persona).
    Honesty contract, test-pinned: every output carries "simulated personas — measures
    behavior under synthetic users, not real traffic"; hard budget_usd ceiling with
    pre-spend estimate (personas cut off carry stop_reason="budget_exceeded", partials
    never lost); judge model/temperature recorded, NO determinism claim. Recorder
    synergy: each conversation binds its case_uid, so --record-prompts during
    simulation yields observed case→site bindings — simulation with provenance.
  • Scaled + gated case generation (#11). bootstrap --n-seed-cases now works up to
    500 (batched ≤30/call, later batches steered away from already-accepted inputs),
    and every generated case passes gates: well-formed (structural), duplicate
    (NFC-loose-normalize identity OR token-Jaccard ≥ 0.85, cross-batch), and — behind
    --validate-cases --baseline-model-file — the 0.8.0 hardness band via
    auto.validate_adversarial_cases. No silent caps: BootstrapResult.generation_report
    and a DISCOVERY_REPORT "Case generation" section print "generated N, accepted M —
    dropped k duplicates, j malformed[, i outside hardness band]"; a skipped hardness
    gate says so. Per-case metadata["generation"] carries batch, gates passed, and
    hardness. New --budget-usd pre-spend ceiling (estimate checked before any LLM call).
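
To make the simulator's stop conditions concrete, here is a minimal sketch of a persona-driven loop with a hard budget ceiling. It is not the library's implementation: `simulate`, `Transcript`, `persona_next`, `assess`, and the flat `cost_per_turn_usd` estimate are hypothetical names chosen for illustration. Only `stop_reason="budget_exceeded"`, `max_turns`, `budget_usd`, and `model_fn` come from the notes above; the other stop-reason strings are assumed spellings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (role, text)


@dataclass
class Transcript:
    turns: List[Turn] = field(default_factory=list)
    stop_reason: str = ""
    spent_usd: float = 0.0


def simulate(
    persona_next: Callable[[List[Turn]], str],  # hypothetical: persona's next message
    assess: Callable[[List[Turn]], str],        # hypothetical: "goal_reached" / "refusal" / "continue"
    model_fn: Callable[[List[Turn]], str],      # the system under test
    max_turns: int,
    budget_usd: float,
    cost_per_turn_usd: float,                   # assumed flat per-exchange estimate
) -> Transcript:
    transcript = Transcript()
    for _ in range(max_turns):
        # Pre-spend check: stop before an exchange that would cross the ceiling.
        # The partial transcript is returned, not discarded.
        if transcript.spent_usd + cost_per_turn_usd > budget_usd:
            transcript.stop_reason = "budget_exceeded"
            return transcript
        transcript.turns.append(("user", persona_next(transcript.turns)))
        transcript.turns.append(("assistant", model_fn(transcript.turns)))
        transcript.spent_usd += cost_per_turn_usd
        verdict = assess(transcript.turns)
        if verdict in ("goal_reached", "refusal"):
            transcript.stop_reason = verdict
            return transcript
    transcript.stop_reason = "max_turns"
    return transcript
```

The budget check runs before each exchange rather than after it, which is one way to keep the ceiling hard: spending never exceeds `budget_usd`, and a persona that is cut off still carries every turn completed so far.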
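
The duplicate gate described above (normalized identity OR token-Jaccard ≥ 0.85, applied across batches) can be sketched as follows. This is an illustration under stated assumptions, not the shipped code: the function names are hypothetical, "loose" normalization is assumed to mean NFC plus case-folding and whitespace collapsing, and tokens are assumed to be whitespace-separated.

```python
import unicodedata
from typing import Iterable, List, Tuple


def loose_normalize(text: str) -> str:
    # Assumption: NFC, then casefold, then collapse runs of whitespace.
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def token_jaccard(a: str, b: str) -> float:
    # Jaccard similarity over sets of whitespace-separated tokens.
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def is_duplicate(candidate: str, accepted: Iterable[str], threshold: float = 0.85) -> bool:
    norm = loose_normalize(candidate)
    for prior in accepted:
        prior_norm = loose_normalize(prior)
        if norm == prior_norm or token_jaccard(norm, prior_norm) >= threshold:
            return True
    return False


def gate_batches(batches: Iterable[Iterable[str]], threshold: float = 0.85) -> Tuple[List[str], int]:
    # Cross-batch: each candidate is compared against everything accepted so far,
    # including cases accepted from earlier batches.
    accepted: List[str] = []
    dropped = 0
    for batch in batches:
        for text in batch:
            if is_duplicate(text, accepted, threshold):
                dropped += 1
            else:
                accepted.append(text)
    return accepted, dropped


if __name__ == "__main__":
    batches = [
        ["Reset my password please", "What is your refund policy?"],
        ["reset  my PASSWORD please", "Cancel my subscription today"],
    ]
    accepted, dropped = gate_batches(batches)
    print(len(accepted), dropped)  # prints "3 1"
```

The pairwise scan is quadratic in the number of accepted cases, which is fine at the 500-case scale mentioned above. Returning the dropped count alongside the accepted list is what makes a "generated N, accepted M" style report possible without silent caps.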

Notes

  • 54 new tests (25 simulator, 29 generation/gates); 1092 green on the tracked suite.
  • Both features verified live before release: a haiku-driven persona reached its goal
    and scored 1.00/0.67/1.00 on the conversation evaluators; a 60-case bootstrap
    produced the batch-accounting report end-to-end.