A lightweight platform for evaluating and comparing LLM prompts using objective metrics. Treat prompts as testable, version-controlled artifacts.
Run without API keys or dependencies:
python demo_standalone.py

This evaluates 3 prompt versions using built-in heuristics and displays a leaderboard.
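If you are curious what heuristic scoring can look like, here is a minimal, self-contained sketch of the idea (a keyword-overlap score plus a sorted leaderboard). It is an illustration only, not the actual logic inside demo_standalone.py, and the sample answers are made up:

```python
import re

def keyword_overlap(answer: str, reference: str) -> float:
    """Fraction of reference words that also appear in the answer (rough heuristic)."""
    ref = set(re.findall(r"[a-z0-9']+", reference.lower()))
    ans = set(re.findall(r"[a-z0-9']+", answer.lower()))
    return len(ref & ans) / len(ref) if ref else 0.0

reference = "Paris is the capital of France."
answers = {  # made-up candidate answers from three prompt versions
    "prompt_v1": "The capital of France is Paris.",
    "prompt_v2": "France's capital city is Paris, on the Seine.",
    "prompt_v3": "It might be Lyon.",
}

# Rank prompt versions by their heuristic score and print a simple leaderboard.
leaderboard = sorted(answers.items(), key=lambda kv: keyword_overlap(kv[1], reference), reverse=True)
for rank, (name, answer) in enumerate(leaderboard, start=1):
    print(f"{rank}. {name}: {keyword_overlap(answer, reference):.2f}")
```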
# 1. Set up the environment
python -m venv venv
.\venv\Scripts\activate      # Windows; on macOS/Linux use: source venv/bin/activate
pip install -r requirements.txt
# 2. Add API key
copy .env.example .env       # Windows; on macOS/Linux use: cp .env.example .env
# Edit .env: OPENAI_API_KEY=your_key_here
# 3. Run CLI or web dashboard
python src/runner.py # CLI mode
python app.py          # Web UI at http://localhost:5000

Prompts are scored on:
- Semantic Similarity - Closeness to reference answers
- Accuracy - Factual correctness
- Faithfulness - Avoids hallucinations
- Completeness - Covers all key points
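The standalone demo approximates these metrics with heuristics, while the full version uses embeddings. As a minimal illustration of the semantic-similarity idea, a cosine similarity over embedding vectors might look like the sketch below; the toy vectors stand in for real embeddings, and the actual metrics.py implementation may differ:

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for embeddings of a model answer and a reference answer.
answer_vec = [0.12, 0.80, 0.33, 0.05]
reference_vec = [0.10, 0.75, 0.40, 0.02]
print(f"semantic similarity ~ {cosine_similarity(answer_vec, reference_vec):.3f}")
```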
Prompt_Eval_Lab/
├── demo_standalone.py        ⭐ Zero-dependency demo
├── app.py                    Web dashboard
├── datasets/qa_test.json     Sample Q&A dataset (15 questions)
├── prompts/                  prompt_v1, v2, v3 to compare
├── src/                      Evaluation engine
│   ├── evaluator.py          Core evaluation logic
│   ├── metrics.py            Scoring functions
│   └── runner.py             CLI runner
├── static/templates/         Web UI
└── tests/                    30+ pytest tests
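To see how these pieces fit together, here is a minimal sketch that loads the sample dataset and fills a prompt template. The field names ("question", "context") and the prompt filename are assumptions for illustration; check datasets/qa_test.json and prompts/ for the actual schema and filenames:

```python
import json
from pathlib import Path

# Assumes the dataset is a JSON list of objects with "question"/"context" fields
# and that prompt files are plain text with {question}/{context} placeholders.
dataset = json.loads(Path("datasets/qa_test.json").read_text(encoding="utf-8"))
template = Path("prompts/prompt_v1.txt").read_text(encoding="utf-8")

item = dataset[0]
prompt = template.format(question=item["question"], context=item.get("context", ""))
print(prompt)
```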
# Quick start
docker-compose up -d
# Production
docker build -t prompt-eval:latest .
docker run -d -p 5000:5000 -e OPENAI_API_KEY=your_key_here prompt-eval:latest

Standalone Demo:
- ✅ Zero setup - works immediately
- ✅ No API costs
- ✅ Heuristic-based scoring
Full Version:
- ✅ Real LLM API integration
- ✅ Embeddings-based similarity
- ✅ GPT-4 as judge
- ✅ Web dashboard
- ✅ Rate limiting & CORS security
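For the GPT-4-as-judge step, a call through the official openai Python client might look roughly like this. The function name, grading rubric, and model string are illustrative and not the repo's actual evaluator.py:

```python
from openai import OpenAI  # needs the openai package and OPENAI_API_KEY set

client = OpenAI()

def judge_answer(question: str, context: str, answer: str, model: str = "gpt-4") -> str:
    """Ask the judge model to grade faithfulness on a 1-5 scale with a short reason."""
    grading_prompt = (
        "Grade the answer for faithfulness to the context.\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}\n"
        "Reply with a score from 1 to 5 and a one-sentence justification."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": grading_prompt}],
    )
    return response.choices[0].message.content
```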
- Create prompts/prompt_v4.txt (see the example below)
- Use placeholders: {question} and {context}
- Run the evaluation to compare: python demo_standalone.py
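A new prompt file might look like the following; the wording is only an example, so adapt it to your use case:

```text
You are a careful assistant. Use only the provided context to answer.

Context:
{context}

Question: {question}

Answer concisely and cite the relevant part of the context.
```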
# Install dev dependencies
pip install -r requirements-dev.txt
# Run tests
pytest tests/ -v --cov=src
# Linting
flake8 src/ app.py
black src/ app.py

| Variable | Required | Default | Description |
|---|---|---|---|
| OPENAI_API_KEY | No | - | Falls back to demo mode without it |
| FLASK_DEBUG | No | False | Enable debug mode |
| FLASK_PORT | No | 5000 | Server port |
| CORS_ORIGINS | No | * | Allowed origins |
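A sketch of how these variables could be read at startup, assuming python-dotenv (suggested by the .env workflow above); the real app.py may differ:

```python
import os
from dotenv import load_dotenv  # python-dotenv; assumed because the project uses a .env file

load_dotenv()  # read .env into the process environment

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")                       # missing -> demo mode
FLASK_DEBUG = os.getenv("FLASK_DEBUG", "False").lower() == "true"
FLASK_PORT = int(os.getenv("FLASK_PORT", "5000"))
CORS_ORIGINS = os.getenv("CORS_ORIGINS", "*")
```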
Most teams judge prompt quality subjectively. This platform provides objective, repeatable measurements to:
- Track improvements over time
- A/B test different approaches
- Catch quality regressions
- Make data-driven decisions
Think of it as unit tests for prompts.
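For example, a regression check can gate a CI build on the leaderboard scores. This is a hypothetical sketch; the threshold, score format, and function are illustrative and not part of this repo:

```python
THRESHOLD = 0.75  # illustrative minimum acceptable score

def check_regression(scores: dict[str, float], threshold: float = THRESHOLD) -> None:
    """Fail the build if even the best-scoring prompt drops below the threshold."""
    best_name, best_score = max(scores.items(), key=lambda kv: kv[1])
    if best_score < threshold:
        raise SystemExit(f"Regression: best prompt {best_name} scored {best_score:.2f} < {threshold}")
    print(f"OK: {best_name} scored {best_score:.2f}")

# Made-up scores for illustration.
check_regression({"prompt_v1": 0.71, "prompt_v2": 0.83, "prompt_v3": 0.78})
```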
MIT - Use freely in your projects!
Try it now: python demo_standalone.py 🚀