git diff for prompts: compare LLM responses across prompt versions.
See token count changes, cost deltas, latency shifts, and a word-level diff of the actual responses — all in one command.
$ llm-diff --a prompt-v1.txt --b prompt-v2.txt --model gpt-4o
llm-diff openai/gpt-4o
tokens 312 → 289 -23 (-7.4%)
input 45 → 38 -7
output 267 → 251 -16
cost $0.0041 → $0.0038 -$0.0003 (-7.3%)
latency 1247ms → 943ms -304ms (-24.4%)
--- prompt A
+++ prompt B
The capital of France is Paris.
- It is located in northern France and has a population of approximately 2.1 million people...
+ Paris, with ~2.1M residents, serves as the political and cultural center of the country...
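The summary lines in the sample output pair an absolute change with a percentage change relative to prompt A. A minimal sketch of that arithmetic follows; the `delta` helper and its one-decimal rounding are assumptions made for illustration, not llm-diff's actual code:

```javascript
// Hypothetical helper: absolute and percentage change from measurement a to b.
function delta(a, b) {
  const diff = b - a;
  // Percentage is relative to the baseline (prompt A), rounded to one decimal.
  // A zero baseline has no meaningful percentage, so return null instead.
  const pct = a === 0 ? null : Math.round((diff / a) * 1000) / 10;
  return { diff, pct };
}

// Values taken from the sample output above.
console.log(delta(312, 289));  // { diff: -23, pct: -7.4 }
console.log(delta(1247, 943)); // { diff: -304, pct: -24.4 }
```

A negative value means prompt B used fewer tokens or responded faster than prompt A.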
npx llm-diff --a v1.txt --b v2.txt --model gpt-4o

Or install globally:

npm install -g llm-diff

1. Set your API key:

export OPENAI_API_KEY=sk-...
# or ANTHROPIC_API_KEY, GEMINI_API_KEY, GROQ_API_KEY

2. Compare two prompts:
# From files
llm-diff --a prompt-v1.txt --b prompt-v2.txt --model gpt-4o
# Inline text
llm-diff -a "Explain gravity" -b "Explain gravity to a child" -m gpt-4o-mini
# With a system prompt
llm-diff -a v1.txt -b v2.txt -m claude-sonnet-4-20250514 -s "You are a science teacher"

llm-diff --a <prompt-a> --b <prompt-b> --model <model> [options]
| Flag | Description |
|---|---|
| --a, -a | Prompt A: file path or inline text |
| --b, -b | Prompt B: file path or inline text |
| --model, -m | Model name (see --models for full list) |
| Flag | Default | Description |
|---|---|---|
| --system, -s | - | System prompt (file path or inline text) |
| --base-url | - | Gateway URL override |
| --max-tokens | 2048 | Max output tokens |
| --temperature | 0 | Temperature |
| --timeout | 60000 | Request timeout (ms) |
| --runs | 1 | Number of runs to average |
| --no-parallel | - | Run A and B sequentially |
| --full | - | Show full inline diff with highlighting |
| --json | - | JSON output for scripting |
| --models | - | List supported models and pricing |
llm-diff --models

gpt-4o · gpt-4o-mini · gpt-4-turbo · gpt-4 · gpt-3.5-turbo · o1 · o1-mini · o3-mini
claude-sonnet-4-20250514 · claude-3.5-haiku · claude-3-opus
gemini-2.0-flash · gemini-2.0-pro · gemini-1.5-pro · gemini-1.5-flash
llama-3.3-70b · llama-3.1-8b · mixtral-8x7b · gemma2-9b
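The grouping of model names above suggests that each model maps to one provider and one API key variable. A minimal sketch of such a lookup, assuming simple prefix matching: the `resolveProvider` helper, the patterns, and every provider label other than `openai` are hypothetical and not taken from llm-diff's source. Only the environment variable names come from this README.

```javascript
// Assumed prefix-to-provider mapping, inferred from the model names listed above.
const PROVIDERS = [
  { provider: 'openai', envVar: 'OPENAI_API_KEY', pattern: /^(gpt-|o\d)/ },
  { provider: 'anthropic', envVar: 'ANTHROPIC_API_KEY', pattern: /^claude-/ },
  { provider: 'gemini', envVar: 'GEMINI_API_KEY', pattern: /^gemini-/ },
  { provider: 'groq', envVar: 'GROQ_API_KEY', pattern: /^(llama-|mixtral-|gemma2-)/ },
];

// Returns the provider label and the API key variable for a model name.
function resolveProvider(model) {
  const match = PROVIDERS.find((p) => p.pattern.test(model));
  if (!match) throw new Error(`Unknown model: ${model}`);
  return { provider: match.provider, envVar: match.envVar };
}

console.log(resolveProvider('claude-sonnet-4-20250514'));
// { provider: 'anthropic', envVar: 'ANTHROPIC_API_KEY' }
```

Run `llm-diff --models` for the authoritative list; the table in this sketch only covers the names shown above.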
Route requests through a custom gateway (like llmhut) instead of direct API calls:
llm-diff --a v1.txt --b v2.txt -m gpt-4o --base-url https://gw.llmhut.com/v1

The gateway handles authentication, so you don't need provider-specific API keys.
LLM responses vary. Average over multiple runs for stable comparisons:
llm-diff --a v1.txt --b v2.txt -m gpt-4o --runs 5

Token counts and latency are averaged across runs. The diff uses the response text from the last run.
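The averaging behavior described above can be sketched as follows. The `summarizeRuns` helper and the shape of a run object (`totalTokens`, `latencyMs`, `text`) are assumptions for illustration; the real internal structures may differ.

```javascript
// Hypothetical shape for one run: { totalTokens, latencyMs, text }.
function summarizeRuns(runs) {
  if (runs.length === 0) throw new Error('need at least one run');
  // Arithmetic mean of a numeric field across all runs.
  const mean = (key) => runs.reduce((sum, r) => sum + r[key], 0) / runs.length;
  return {
    totalTokens: mean('totalTokens'),
    latencyMs: mean('latencyMs'),
    // Only the final run's text is kept for diffing, as described above.
    text: runs[runs.length - 1].text,
  };
}

const summary = summarizeRuns([
  { totalTokens: 310, latencyMs: 1200, text: 'first' },
  { totalTokens: 314, latencyMs: 1300, text: 'second' },
]);
console.log(summary); // { totalTokens: 312, latencyMs: 1250, text: 'second' }
```

Because only one response text is diffed, the numeric deltas are more stable across invocations than the textual diff.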
Pipe results into scripts, dashboards, or eval pipelines:
llm-diff --a v1.txt --b v2.txt -m gpt-4o --json | jq '.delta'

{
  "totalTokens": -23,
  "totalTokensPct": -7.4,
  "cost": -0.000293,
  "costPct": -7.1,
  "latencyMs": -304,
  "latencyPct": -24.4
}

import { runDiff } from 'llm-diff';
const result = await runDiff({
  promptA: 'Explain gravity',
  promptB: 'Explain gravity to a 5-year-old',
  model: 'gpt-4o-mini',
});

console.log(result.delta);
// { totalTokens: -23, cost: -0.0003, latencyMs: -304, ... }

- Resolves the model to its provider, pricing, and API adapter
- Reads prompt A and B (from files or inline text)
- Fires both requests in parallel (or sequentially with --no-parallel)
- Collects token counts, cost, and latency from the API response
- Computes deltas between A and B
- Generates a word-level diff of the response text
- Renders everything to the terminal (or as JSON)
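The word-level diff step in the list above can be illustrated with a longest-common-subsequence (LCS) comparison over whitespace-separated words. This is one common technique and is offered as a sketch only; the README does not say which algorithm llm-diff uses, and the `wordDiff` helper is hypothetical.

```javascript
// Minimal LCS-based word diff. Returns a list of { op, word } entries where
// op is '=' (unchanged), '-' (only in a), or '+' (only in b).
function wordDiff(a, b) {
  const x = a.split(/\s+/).filter(Boolean);
  const y = b.split(/\s+/).filter(Boolean);
  // lcs[i][j] = length of the longest common subsequence of x[i..] and y[j..].
  const lcs = Array.from({ length: x.length + 1 }, () => new Array(y.length + 1).fill(0));
  for (let i = x.length - 1; i >= 0; i--) {
    for (let j = y.length - 1; j >= 0; j--) {
      lcs[i][j] = x[i] === y[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk both word lists, following the table to emit the edit script.
  const ops = [];
  let i = 0;
  let j = 0;
  while (i < x.length && j < y.length) {
    if (x[i] === y[j]) {
      ops.push({ op: '=', word: x[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ op: '-', word: x[i++] });
    } else {
      ops.push({ op: '+', word: y[j++] });
    }
  }
  while (i < x.length) ops.push({ op: '-', word: x[i++] });
  while (j < y.length) ops.push({ op: '+', word: y[j++] });
  return ops;
}

const render = (ops) => ops.map((o) => o.op + o.word).join(' ');
console.log(render(wordDiff('a b c', 'a x c'))); // =a -b +x =c
```

The table costs O(n × m) time and memory for responses of n and m words, which is fine for typical response lengths but worth revisiting for very long outputs.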
Planned features:

- Eval pipeline integration (named experiments, history)
- Side-by-side diff view
- Cross-model comparison (--model-a gpt-4o --model-b claude-sonnet-4-20250514)
- HTML report output
- Config file support (.llm-diff.json)
- Streaming output with live token counting
- Mistral, Cohere, Together AI providers
See CONTRIBUTING.md.
Apache License — see LICENSE.