Get genuinely better prompts. Not just different ones.
Mev is an intent-driven prompt-optimization engine that goes beyond what most prompt-tuning tools do. You describe what you want an LLM agent to do in plain English; Mev compiles that into a task spec, synthesizes a stratified evaluation dataset, evolves prompts through a beam search with crossover, scores everything against a held-out generalization set, and locks in the prompt that wins on real metrics — score, cost, latency, and consistency.
Most prompt optimizers do one of two things: (a) hill-climb on a single prompt with a single judge, or (b) ask GPT-4 to "improve this prompt" in a loop. Both overfit to whatever evaluation cases you have, produce noisy rankings, and stop improving quickly.
Mev brings the rigor of modern ML evaluation to prompt engineering:
| Feature | Most tools | Mev |
|---|---|---|
| Train/test split | Same cases for evolution and final score | 70/30 stratified split — final score is on cases the optimizer never saw |
| Search strategy | Single-parent greedy | Beam search (top-K diverse) + crossover between parents |
| Judge reliability | One sample per case | Self-consistency (median of N) with variance-attenuated confidence |
| Few-shot examples | Hand-written or none | Bootstrapped from the prompt's strongest cases automatically |
| Final inference | One pass | Best-of-N for lock-in (configurable) |
| Selection criterion | Train score | Holdout score + score variance + cost + latency (true Pareto) |
| Plateau handling | Stop or keep flailing | Detects via score-band, terminates early to save budget |
Each of these is a known win in the literature (DSPy MIPROv2, OPRO, PromptBreeder, self-consistency). Mev is the first engine to put them all in one no-config pipeline.
1. **Compile intent** (`mev init` → `mev.toml`). Write a paragraph describing the agent. Mev turns it into a structured task spec with success criteria, failure modes, and difficulty axes.
2. **Synthesize cases** with critic gating, then a stratified train/test split. Generates evaluation inputs + reference outputs + rubrics, gates them through a critic, dedupes them, then splits into a train set (used during evolution) and a holdout set (locked away until final scoring).
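The split itself is mechanically simple; here is a minimal sketch of stratified splitting (the `Case` shape and the `difficulty` stratum field are illustrative assumptions, not Mev's actual schema):

```typescript
// Hypothetical case shape; Mev's real schema lives in the synthesized case files.
interface Case {
  id: string;
  difficulty: "easy" | "medium" | "hard"; // stratum key
}

// Stratified split: hold out `fraction` of cases *within each stratum*,
// so easy/medium/hard cases are represented in both train and holdout.
function stratifiedSplit(
  cases: Case[],
  fraction: number,
  rand: () => number = Math.random,
): { train: Case[]; holdout: Case[] } {
  const byStratum = new Map<string, Case[]>();
  for (const c of cases) {
    const bucket = byStratum.get(c.difficulty) ?? [];
    bucket.push(c);
    byStratum.set(c.difficulty, bucket);
  }
  const train: Case[] = [];
  const holdout: Case[] = [];
  for (const bucket of byStratum.values()) {
    const shuffled = [...bucket].sort(() => rand() - 0.5);
    const k = Math.max(1, Math.round(shuffled.length * fraction));
    holdout.push(...shuffled.slice(0, k));
    train.push(...shuffled.slice(k));
  }
  return { train, holdout };
}
```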
3. **Run baseline + evolve.** Scores the starter prompt on train cases. Then runs beam search: each generation, the top-K most diverse high-scorers spawn children via reflector→editor mutation or a crossover that synthesizes a new prompt from two parents. Children inherit bootstrapped few-shot examples from their parent's strongest cases.
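One generation of that loop can be sketched roughly like this. The `mutate` and `crossover` callbacks stand in for Mev's reflector→editor step and two-parent synthesis; all names here are illustrative, not the engine's real API:

```typescript
interface Candidate {
  prompt: string;
  score: number; // mean train score from the judge
}

// One generation of beam search: keep the top-K parents by score, then
// spawn children by single-parent mutation or two-parent crossover.
function nextGeneration(
  population: Candidate[],
  beamWidth: number,
  crossoverRate: number,
  childrenPerGen: number,
  mutate: (parent: Candidate) => string,
  crossover: (a: Candidate, b: Candidate) => string,
  rand: () => number = Math.random,
): string[] {
  // Top-K beam by score (Mev additionally filters the beam for diversity).
  const beam = [...population].sort((a, b) => b.score - a.score).slice(0, beamWidth);
  const children: string[] = [];
  for (let i = 0; i < childrenPerGen; i++) {
    if (beam.length >= 2 && rand() < crossoverRate) {
      const a = beam[i % beam.length];
      const b = beam[(i + 1) % beam.length];
      children.push(crossover(a, b));
    } else {
      children.push(mutate(beam[i % beam.length]));
    }
  }
  return children;
}
```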
4. **Pareto sweep on holdout.** The top candidates are re-scored on the held-out generalization set with self-consistency judging (median of N samples) and optional best-of-N inference. The winner is the prompt that scores best on cases it never trained against.
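Self-consistency here just means sampling the judge N times per case and aggregating, which damps single-sample noise. A minimal sketch; `judgeOnce` is a stand-in for the real judge call, and the variance is the quantity a confidence attenuator could key off:

```typescript
// Sample the judge `n` times and aggregate: the median damps outlier
// samples, and the variance flags cases where the judge is unsure.
function selfConsistentScore(
  judgeOnce: () => number,
  n: number,
): { median: number; variance: number } {
  const samples: number[] = [];
  for (let i = 0; i < n; i++) samples.push(judgeOnce());
  samples.sort((a, b) => a - b);
  const mid = Math.floor(samples.length / 2);
  const median =
    samples.length % 2 === 1 ? samples[mid] : (samples[mid - 1] + samples[mid]) / 2;
  const mean = samples.reduce((s, x) => s + x, 0) / samples.length;
  const variance = samples.reduce((s, x) => s + (x - mean) ** 2, 0) / samples.length;
  return { median, variance };
}
```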
5. **Lock-in & report.** Writes the winning prompt to `prompts/locked.md` with full provenance (train score, holdout score, variance, latency, cost). Generates a rich HTML report showing the Pareto frontier, the evolution timeline (with mutation/crossover labels), and any escalations.
6. **Resume after interruption.** Every phase writes a `checkpoint.json`. Run with `--resume` to pick up where you left off.
You'll need:
- Bun
- Ollama running locally (or API keys for Anthropic / OpenAI / Ollama Cloud)
```
bun install
bun run src/cli.ts init
```

This creates `mev.toml`, plus `prompts/`, `cases/`, and `runs/` directories.
Edit `mev.toml`:
```toml
[[models]]
alias = "qwen-coder"
provider = "ollama-local"
model = "qwen2.5-coder:32b-instruct-q4_K_M"

[judge]
provider = "ollama-local"
model = "gemma4:e4b"

[synthesizer]
provider = "ollama-local"
model = "gemma4:e4b"

[critic]
provider = "ollama-local"
model = "gemma4:e4b"

[optimization]
holdout_fraction = 0.3   # 30% of cases held out for generalization scoring
beam_width = 3           # top-3 diverse beam each generation
judge_samples = 3        # 3-sample self-consistency for the judge (recommended for serious runs)
crossover_rate = 0.3     # 30% of children come from crossover, 70% from mutation
max_examples = 3         # bootstrap up to 3 few-shot examples from strong cases
lockin_best_of_n = 3     # at lock-in, run completion 3x and the judge picks the best
```

Then run the full pipeline:

```
bun run src/cli.ts optimize --yes
```

Output during a run:
```
[A] ✓ Task spec compiled
[B] ✓ 12 / 16 accepted (75%). Filters: 1 schema, 2 dedup, 1 critic, 0 trivial.
[B] ✓ Train/test split: 9 train + 3 holdout (generalization eval).
[Split] train=9 | holdout=3 (held out for generalization)
[C] Running baseline on TRAIN cases...
[C] ✓ Baseline complete (train: 3.42, holdout: 3.50)
[D] Evolving system prompt (beam=3, crossover=0.3, judge_samples=3)...
[E] Evaluating 5 candidates on 3 holdout cases...
[F] Computing Pareto frontier...
[Result] Holdout improvement: 3.50 → 4.67 (+33.4%)
✓ Locked in. Total time: 1923s
```
Resume an interrupted run:

```
bun run src/cli.ts optimize --yes --resume
```

Regression-test the locked prompt:

```
bun run src/cli.ts regress --threshold 1.0
```

| Command | Description |
|---|---|
| `mev init` | Create a new project interactively |
| `mev optimize` | Run the full optimization pipeline |
| `mev optimize --resume` | Resume the latest interrupted run |
| `mev regress` | Regression test the locked prompt |
| `mev models` | List available models |
| `mev diff <runA> <runB>` | Compare two runs |
| `mev report <run>` | Print the HTML report for a run |
After a successful run:
```
mev.toml              # updated config with locked-in model first
prompts/locked.md     # the winning system prompt + provenance header
cases/                # synthesized evaluation cases (with holdout flag)
runs/<timestamp>/     # artifacts:
  checkpoint.json     # resume state
  spec.json           # compiled task spec
  cases/              # snapshot of cases used in this run
  baseline.jsonl      # baseline scores (split=train/holdout)
  evolution.ndjson    # generational history (operator: mutate/crossover)
  sweep.jsonl         # train + holdout scores per (prompt, model) pair
  pareto.json         # final Pareto frontier
  report.html         # rich human-readable report
  SUMMARY.md          # markdown summary with headline metrics
```
- `src/cli.ts` — CLI entrypoint (Clipanion)
- `src/phase/optimize.ts` — main pipeline with checkpoint/resume + train/test split
- `src/phase/synthesize.ts` — case generation, critic gating, stratified holdout split
- `src/phase/compile-intent.ts` — Phase A: intent → task spec
- `src/judge/index.ts` — absolute + pairwise judging with self-consistency and best-of-N
- `src/evolve/index.ts` — beam search evolution with mutation + crossover + few-shot bootstrap
- `src/evolve/archive.ts` — Pareto archive + diverse top-K beam selection (MMR)
- `src/escalation/index.ts` — calibration drift, plateau, variance, critic-rejection-rate
- `src/lockin/index.ts` — lock-in writer with full provenance
- `src/reporting/index.ts` — rich HTML reports
- `src/provider/` — provider wrappers (Anthropic, OpenAI, Ollama local/cloud) with timeouts
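The diverse top-K selection in `src/evolve/archive.ts` is MMR-style: greedily pick the candidate that balances its own score against similarity to what's already in the beam. A sketch under assumed types; the token-overlap similarity here is a crude stand-in for whatever measure the archive actually uses:

```typescript
interface Scored {
  prompt: string;
  score: number;
}

// Jaccard overlap of whitespace tokens: a rough prompt-similarity proxy.
function similarity(a: string, b: string): number {
  const ta = new Set(a.toLowerCase().split(/\s+/));
  const tb = new Set(b.toLowerCase().split(/\s+/));
  const inter = [...ta].filter((t) => tb.has(t)).length;
  return inter / (ta.size + tb.size - inter || 1);
}

// MMR: repeatedly pick the candidate maximizing
//   lambda * score - (1 - lambda) * maxSimilarityToAlreadyChosen
function diverseTopK(candidates: Scored[], k: number, lambda = 0.7): Scored[] {
  const chosen: Scored[] = [];
  const pool = [...candidates];
  while (chosen.length < k && pool.length > 0) {
    let bestIdx = 0;
    let bestVal = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      const maxSim = chosen.length
        ? Math.max(...chosen.map((c) => similarity(pool[i].prompt, c.prompt)))
        : 0;
      const val = lambda * pool[i].score - (1 - lambda) * maxSim;
      if (val > bestVal) {
        bestVal = val;
        bestIdx = i;
      }
    }
    chosen.push(pool.splice(bestIdx, 1)[0]);
  }
  return chosen;
}
```

With a low enough `lambda`, a near-duplicate of the top scorer loses to a weaker but genuinely different prompt, which is exactly why the beam doesn't fill up with clones.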
The `[optimization]` section in `mev.toml` controls the search strategy. Sensible defaults are baked in; turn them up when you need more rigor:
| Knob | Default | When to increase |
|---|---|---|
| `holdout_fraction` | 0.3 | Larger case set → can afford bigger holdout |
| `beam_width` | 3 | Diverse exploration; slower |
| `judge_samples` | 1 | Set to 3+ for noisy judges or close-call decisions |
| `crossover_rate` | 0.3 | More if your beam looks like clones |
| `max_examples` | 3 | Long-context models can handle more |
| `lockin_best_of_n` | 1 | Set to 3-5 for production locked prompts |
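`lockin_best_of_n` is plain best-of-N: sample N completions and keep the judge's favorite. A minimal sketch, with the `generate` and `judge` callbacks standing in for the real model and judge calls:

```typescript
// Best-of-N: generate n candidate completions, score each, keep the top one.
function bestOfN(
  generate: () => string,
  judge: (completion: string) => number,
  n: number,
): { completion: string; score: number } {
  let best = { completion: generate(), score: -Infinity };
  best.score = judge(best.completion);
  for (let i = 1; i < n; i++) {
    const completion = generate();
    const score = judge(completion);
    if (score > best.score) best = { completion, score };
  }
  return best;
}
```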
```
bun test
```

69 tests cover:
- Pareto archive logic + diverse beam selection
- Train/test split helpers and dedup
- Checkpoint round-trips and resume
- Real execution-before-judging
- Escalation triggers (calibration, plateau, rejection rate)
- Provider utils
- Reporting + summary generation
MIT
