multivon-eval 0.13.0 — generation toolkit
The generation toolkit (#13): five ways to generate eval data, two of them free. Every generator stamps provenance, routes through the dedupe gates, and reports its rejects.
Added
- mutate_cases: deterministic, zero-LLM robustness mutations (typo/whitespace/case noise, unicode confusables, punctuation strip, conservative negation flip). Each mutant records its transformation and expectation: invariant, or flip with the old label cleared rather than silently kept. Byte-deterministic per seed.
- cases_from_template: parametric grids over named axes; full product (capped 2000) or greedy pairwise covering array. Rows without an expected output are valid for judge evaluators; no label is invented.
- generate_contrast_pairs: a minimally-edited unfaithful twin per case, judge-verified to actually flip before acceptance; rejected twins counted. Twins share a pair_id for genuinely paired comparisons.
- Span-grounded doc-QA: generate_from_text/from_file record each case's source span (offsets + chunk hash); unanswerable_fraction generates refusal-bait questions whose expected behavior is refusal. Unlocatable spans recorded as None and counted.
- simulate --export-cases / results_to_cases: persona transcripts become conversation EvalCases; empty transcripts skipped and counted.
- CLI: multivon-eval generate gains --mutate, --template/--axes/--sample, --contrast/--no-verify, --unanswerable-fraction, --per-case, --seed.
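To make the "full product or greedy pairwise covering array" choice concrete, here is a minimal standalone sketch of both strategies over named axes. It is not the multivon-eval implementation: the function names (full_product, pairs_of, greedy_pairwise), the axis values, and the truncate-at-cap behavior are all assumptions made for illustration.

```python
from itertools import combinations, product


def full_product(axes, cap=2000):
    """Every combination of axis values, as dict rows.

    Assumption: rows past `cap` are simply dropped; the real tool may
    handle the cap differently.
    """
    names = list(axes)
    rows = []
    for values in product(*(axes[name] for name in names)):
        if len(rows) >= cap:
            break
        rows.append(dict(zip(names, values)))
    return rows


def pairs_of(row):
    """All two-way (axis, value) interactions covered by one row."""
    return set(combinations(sorted(row.items()), 2))


def greedy_pairwise(axes):
    """Pick rows one at a time, each covering the most uncovered pairs."""
    candidates = full_product(axes, cap=10**6)
    uncovered = set().union(*(pairs_of(row) for row in candidates))
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda row: len(pairs_of(row) & uncovered))
        chosen.append(best)
        uncovered -= pairs_of(best)
    return chosen


if __name__ == "__main__":
    # Hypothetical axes for a prompt template.
    axes = {
        "tone": ["formal", "casual"],
        "length": ["short", "long"],
        "lang": ["en", "de", "fr"],
    }
    print(len(full_product(axes)), "rows in the full product")
    print(len(greedy_pairwise(axes)), "rows in the pairwise sample")
```

The pairwise sample still exercises every two-way combination of axis values while using fewer rows than the full product, which is the usual reason to prefer it when each row costs a model call.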
Notes
- 73 new tests. Mutation batches dedupe on exact input identity rather than the Jaccard gate (mutants are near-duplicates by construction — the gate would reject the suite it exists to create).
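The dedupe note can be illustrated with a small standalone sketch. It is not the library's code: typo_mutant, jaccard_gate, exact_gate, the token-set similarity, and the 0.7 threshold are hypothetical stand-ins chosen to show why a similarity gate and a mutation suite work against each other.

```python
import random


def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def typo_mutant(text, seed):
    """Swap two adjacent interior characters in one word.

    A private Random(seed) makes the result reproducible per seed.
    """
    rng = random.Random(seed)
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    j = rng.randrange(1, len(w) - 2)
    words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)


def jaccard_gate(base, mutants, threshold=0.7):
    """Keep a mutant only if it is dissimilar to everything kept so far."""
    kept = [base]
    for m in mutants:
        if all(jaccard(m, k) < threshold for k in kept):
            kept.append(m)
    return kept[1:]


def exact_gate(base, mutants):
    """Drop only mutants whose text is identical to one already seen."""
    seen = {base}
    kept = []
    for m in mutants:
        if m not in seen:
            seen.add(m)
            kept.append(m)
    return kept


if __name__ == "__main__":
    base = "What is the refund policy for yearly subscription plans"
    mutants = [typo_mutant(base, seed) for seed in range(5)]
    print("similarity gate keeps:", len(jaccard_gate(base, mutants)))
    print("exact-identity gate keeps:", len(exact_gate(base, mutants)))
```

Each mutant here differs from its source by a single token, so its similarity to the source stays above the assumed threshold and the similarity gate discards the whole batch; exact-identity dedupe removes only true repeats.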