A lightweight, zero-dependency Python framework for scoring and comparing LLM prompt outputs.
pip install git+https://github.com/hoeberigs/prompteval.git

Evaluating LLM outputs shouldn't require heavy ML frameworks or paid APIs. prompteval gives you a fast, composable toolkit to score, compare, and report on prompt performance — all in pure Python.
- Zero core dependencies — runs anywhere Python runs
- 10+ built-in scorers — exact match, regex, JSON validation, sentiment, fluency, and more
- Composable — combine scorers with weights, or write your own in one line
- Model-agnostic — plug in OpenAI, Anthropic, or any LLM via simple runner adapters
- Compare across models — side-by-side benchmarking with variance analysis
- Multiple report formats — ASCII tables, Markdown, JSON export
from prompteval import Evaluation, ExactMatch, LengthRange
ev = Evaluation(scorers=[ExactMatch(case_sensitive=False), LengthRange(1, 50)])
result = ev.run(
prompt="What is the capital of France?",
output="Paris",
expected="paris",
)
print(result.passed) # True
print(result.avg_score)  # 1.0

from prompteval import EvalSuite, ExactMatch, MockRunner, Report
suite = (
EvalSuite("geography")
.add("Capital of France?", expected="Paris")
.add("Capital of Japan?", expected="Tokyo")
.add("Capital of Brazil?", expected="Brasília")
)
runner = MockRunner(responses={
"Capital of France?": "Paris",
"Capital of Japan?": "Tokyo",
"Capital of Brazil?": "São Paulo",
})
results = suite.run(scorers=[ExactMatch()], runner=runner)
print(Report(results).to_table())

+---+----------------------+------+------------+-------+---------+
| # | Prompt | Pass | ExactMatch | Avg | Latency |
+---+----------------------+------+------------+-------+---------+
| 1 | Capital of France? | PASS | 1.00 | 1.000 | 0ms |
| 2 | Capital of Japan? | PASS | 1.00 | 1.000 | 0ms |
| 3 | Capital of Brazil? | FAIL | 0.00 | 0.000 | 0ms |
+---+----------------------+------+------------+-------+---------+
Total: 3 cases | Pass rate: 67% | Avg score: 0.667
from prompteval import Comparator
comp = Comparator()
result = comp.compare({
"gpt-4o": results_gpt4o,
"claude-sonnet": results_claude,
})
print(result.to_table())

| Scorer | What it checks |
|---|---|
| ExactMatch | Output equals expected (case-insensitive option) |
| Contains | Output contains all specified substrings |
| RegexMatch | Output matches a regex pattern |
| LengthRange | Output length within min/max characters |
| JsonValid | Valid JSON with optional required keys |
| SentimentPositive | Positive sentiment via keyword heuristics |
| Fluency | Sentence structure, vocabulary richness |
| CustomScorer | Wrap any (output, expected) -> float function |
| CompositeScorer | Weighted combination of multiple scorers |
from prompteval import CustomScorer
def politeness(output, expected=None, **kwargs):
polite = {"please", "thank", "thanks", "kindly"}
words = set(output.lower().split())
return min(1.0, len(words & polite) / 2)
scorer = CustomScorer(politeness, name="politeness")

# OpenAI
from prompteval import OpenAIRunner
runner = OpenAIRunner(model="gpt-4o-mini", temperature=0)
# Anthropic
from prompteval import AnthropicRunner
runner = AnthropicRunner(model="claude-sonnet-4-20250514")
# Mock (for testing)
from prompteval import MockRunner
runner = MockRunner(responses={"prompt": "response"})

Install provider extras:
pip install "git+https://github.com/hoeberigs/prompteval.git#egg=prompteval[openai]" # OpenAI support
pip install "git+https://github.com/hoeberigs/prompteval.git#egg=prompteval[anthropic]" # Anthropic support
pip install "git+https://github.com/hoeberigs/prompteval.git#egg=prompteval[all]"        # Both

report = Report(results)
report.to_table() # ASCII table
report.to_markdown() # Markdown table
report.to_json("out.json") # JSON file
report.to_dict()       # Python dict

git clone https://github.com/hoeberigs/prompteval.git
cd prompteval
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest tests/ -v

MIT