Stop losing your best prompts.
Lightweight prompt versioning & evaluation tracker for LLM engineers.
One decorator. Automatic versioning. Local SQLite. Beautiful dashboard.
Quick Start • Features • Dashboard • API Reference • Configuration
You iterate on prompts 50 times a day. You had a great system prompt last Tuesday that got 92% accuracy, but you lost it. You changed one word and everything broke, but you can't remember which word.
Your eval scores live in scattered notebooks and print() statements.
PromptTrace fixes this. `pip install prompttrace` → done.
```bash
pip install prompttrace
```

Requirements: Python 3.9+ · Single dependency: `rich`
```python
import openai  # assumes OPENAI_API_KEY is set in the environment
from prompttrace import trace

@trace(experiment="my-chatbot", model="gpt-4o")
def generate(prompt, temperature=0.7):
    response = openai.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

# Every call is now automatically tracked
generate("Explain quantum computing in one sentence.", temperature=0.3)
generate("Explain quantum computing in one sentence.", temperature=0.9)
```

Then launch the dashboard:

```python
from prompttrace import dashboard

dashboard()  # → http://127.0.0.1:8777
```

Or from the terminal:

```bash
prompttrace
```

That's it. Every prompt, output, latency, model, and generation parameter is logged and visualized.
| Feature | Description |
|---|---|
| `@trace` decorator | Wrap any LLM call; auto-logs prompt, output, latency, params |
| `log_call()` function | Manual logging for when you can't use a decorator |
| Auto eval | Pass an `eval_fn` to score outputs automatically |
| Prompt versioning | Every unique prompt gets a hash, so you can see how changes affect results |
| Side-by-side compare | Diff two prompts word by word, see outputs and metrics |
| Web dashboard | Modern UI with animated charts, tables, filters; zero JS deps |
| Local-only | Everything in SQLite. No cloud, no API keys, no telemetry (see the sketch after this table) |
| Rich terminal logs | Colorful, emoji-powered console output via `rich` |
| Real-time updates | Dashboard auto-refreshes every 2 s; no manual reload |
| Experiment management | Delete experiments, filter the dashboard by experiment |
| CSV export | One-click export of all traces for external analysis |
Launch with `prompttrace` or `from prompttrace import dashboard; dashboard()`.
Three views:
| View | What it does |
|---|---|
| Dashboard | Stats cards, latency chart, status donut, model usage; filterable by experiment |
| Traces | Full table of all logged calls with search, filter, delete, and CSV export |
| Compare | Select two prompts; word-level diff highlighting with outputs side-by-side (illustrated below) |
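To give a feel for what the Compare view's word-level diff shows, here is a rough stand-alone illustration using Python's `difflib`; it is not PromptTrace's actual implementation, just the same idea:

```python
import difflib

old = "Explain quantum computing in one sentence."
new = "Explain quantum computing in two sentences."

# Word-level diff: '-' marks words only in the old prompt, '+' only in the new one
for token in difflib.ndiff(old.split(), new.split()):
    if token[:1] in "+-":
        print(token)
```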
Full `@trace` usage:

```python
from prompttrace import trace

@trace(
    experiment="summarizer",        # Group related traces
    model="claude-3-sonnet",        # Model identifier
    tags=["prod", "v2"],            # Optional tags
    description="Q3 summary bot",   # Optional experiment description
)
def summarize(prompt, temperature=0.5, max_tokens=500):
    # Your LLM call here
    return llm_response
```

What gets logged automatically:

Prompt text · Output · Latency · Generation parameters (`temperature`, `top_p`, `max_tokens`, etc.) · Input variables · Status (success/error) · Error messages · Approximate token counts
Return a dict to include token counts:

```python
@trace(experiment="qa", model="gpt-4o")
def answer(prompt):
    resp = openai.chat.completions.create(...)
    return {
        "output": resp.choices[0].message.content,
        "token_count_input": resp.usage.prompt_tokens,
        "token_count_output": resp.usage.completion_tokens,
    }
```

Pass an `eval_fn` to score every output automatically:
```python
def my_eval(prompt, output):
    """Return a dict of metric_name: score."""
    return {
        "relevance": compute_relevance(prompt, output),  # your own scoring function
        "length_ok": 1.0 if 50 < len(output) < 500 else 0.0,
        "has_citation": 1.0 if "[source]" in output else 0.0,
    }

@trace(experiment="research-bot", model="gpt-4o", eval_fn=my_eval)
def research(prompt):
    return call_llm(prompt)
```

Metrics appear in the terminal and the dashboard.
For cases where a decorator doesn't fit, use `log_call()`:
```python
from prompttrace import log_call
import time

start = time.perf_counter()
output = my_llm_pipeline(prompt)
elapsed = (time.perf_counter() - start) * 1000

log_call(
    prompt="Translate to French: Hello world",
    output="Bonjour le monde",
    experiment="translation",
    model="gpt-4o-mini",
    generation_params={"temperature": 0.2},
    latency_ms=elapsed,
    token_count_input=8,
    token_count_output=5,
    tags=["translation", "french"],
    eval_metrics={"bleu": 0.95, "fluency": 0.88},
)
```

Dashboard CLI options:

```bash
# Default (localhost:8777)
prompttrace
# Custom port
prompttrace --port 9000
# Accessible from network
prompttrace --host 0.0.0.0 --port 8777
```

`@trace()` parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `experiment` | `str` | `"default"` | Experiment name for grouping |
| `model` | `str` | `"unknown"` | Model identifier |
| `tags` | `list[str]` | `None` | Optional tags |
| `eval_fn` | `callable` | `None` | `fn(prompt, output) -> dict[str, float]` |
| `description` | `str` | `""` | Experiment description |
`log_call()` parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `prompt` | `str` | required | The prompt template |
| `output` | `str` | required | The LLM output |
| `experiment` | `str` | `"default"` | Experiment name |
| `model` | `str` | `"unknown"` | Model identifier |
| `generation_params` | `dict` | `None` | e.g. `{"temperature": 0.7}` |
| `input_variables` | `dict` | `None` | Template variables |
| `latency_ms` | `float` | `0` | Response time in ms |
| `token_count_input` | `int` | `0` | Input token count |
| `token_count_output` | `int` | `0` | Output token count |
| `status` | `str` | `"success"` | `"success"` or `"error"` |
| `error_message` | `str` | `""` | Error details |
| `tags` | `list[str]` | `None` | Optional tags |
| `eval_metrics` | `dict` | `None` | `{"metric": score}` |
`dashboard()` launches the web UI and blocks until Ctrl+C.
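If blocking is inconvenient (for example in a notebook), one possible workaround is to start it in a daemon thread. This is only a sketch and assumes `dashboard()` is safe to call from a background thread, which the docs don't state:

```python
import threading
from prompttrace import dashboard

# Assumption: dashboard() just serves HTTP, so a daemon thread keeps the
# notebook/REPL usable while the UI runs; the thread exits with the process.
threading.Thread(target=dashboard, daemon=True).start()
```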
By default, traces are stored in `.prompttrace/traces.db` in the current directory.

```bash
# Override via environment variable
export PROMPTTRACE_DB=/path/to/my/traces.db
```

```python
# Override programmatically
from prompttrace import set_db_path

set_db_path("/path/to/my/traces.db")
```

Project structure:

```
your-project/
├── pyproject.toml
├── README.md
├── example.py
└── prompttrace/
    ├── __init__.py       # Public API exports
    ├── core.py           # @trace decorator, log_call, dashboard launcher
    ├── db.py             # SQLite database layer
    ├── server.py         # Built-in HTTP server + JSON API
    ├── cli.py            # CLI entry point
    ├── dashboard.html    # Single-file web dashboard (zero JS deps)
    └── logo.png          # App logo
```
MIT. Use it however you want.