🧠 Prompt Evaluation Lab

A lightweight platform for evaluating and comparing LLM prompts using objective metrics. Treat prompts as testable, version-controlled artifacts.

🚀 Quick Start

Option 1: Standalone Demo (No Setup)

Run without API keys or dependencies:

python demo_standalone.py

This evaluates 3 prompt versions using built-in heuristics and displays a leaderboard.

Option 2: Full Version (With OpenAI)

# 1. Set up the environment
python -m venv venv
.\venv\Scripts\activate            # Windows (macOS/Linux: source venv/bin/activate)
pip install -r requirements.txt

# 2. Add your API key
copy .env.example .env             # Windows (macOS/Linux: cp .env.example .env)
# Edit .env: OPENAI_API_KEY=your_key_here

# 3. Run the CLI or web dashboard
python src/runner.py           # CLI mode
python app.py                  # Web UI at http://localhost:5000

📊 What Gets Evaluated

Prompts are scored on four metrics (a heuristic scoring sketch follows this list):

  • Semantic Similarity - Closeness to reference answers
  • Accuracy - Factual correctness
  • Faithfulness - Avoids hallucinations
  • Completeness - Covers all key points
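
The standalone demo scores answers with simple built-in heuristics rather than API calls. A minimal sketch of how such heuristics could work, using illustrative function names (these are not the actual API of src/metrics.py):

# Illustrative heuristics only; the real scoring functions live in src/metrics.py.
def semantic_similarity(answer: str, reference: str) -> float:
    # Rough proxy: word-overlap (Jaccard) between answer and reference
    a, r = set(answer.lower().split()), set(reference.lower().split())
    return len(a & r) / len(a | r) if a | r else 0.0

def completeness(answer: str, key_points: list[str]) -> float:
    # Fraction of expected key points mentioned in the answer
    text = answer.lower()
    return sum(p.lower() in text for p in key_points) / len(key_points) if key_points else 0.0

answer = "Paris is the capital of France."
reference = "The capital of France is Paris."
print(round(semantic_similarity(answer, reference), 2))   # 0.5
print(completeness(answer, ["Paris", "France"]))          # 1.0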

🏗️ Project Structure

Prompt_Eval_Lab/
├── demo_standalone.py       ⭐ Zero-dependency demo
├── app.py                   Web dashboard
├── datasets/qa_test.json    Sample Q&A dataset (15 questions)
├── prompts/                 prompt_v1, v2, v3 to compare
├── src/                     Evaluation engine
│   ├── evaluator.py         Core evaluation logic
│   ├── metrics.py           Scoring functions
│   └── runner.py            CLI runner
├── static/templates/        Web UI
└── tests/                   30+ pytest tests

🔧 Docker Deployment

# Quick start
docker-compose up -d

# Production
docker build -t prompt-eval:latest .
docker run -d -p 5000:5000 -e OPENAI_API_KEY=your_key_here prompt-eval:latest
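
The quick-start command expects a docker-compose.yml at the repository root. As a rough sketch only, a minimal compose file for this kind of service could look like the following (the service name and settings are assumptions, not the project's actual file):

# Minimal sketch; the repository's real docker-compose.yml may differ.
services:
  prompt-eval:
    build: .
    ports:
      - "5000:5000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    restart: unless-stopped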

✨ Features

Standalone Demo:

  • ✅ Zero setup - works immediately
  • ✅ No API costs
  • ✅ Heuristic-based scoring

Full Version:

  • ✅ Real LLM API integration
  • ✅ Embeddings-based similarity (see the sketch after this list)
  • ✅ GPT-4 as judge
  • ✅ Web dashboard
  • ✅ Rate limiting & CORS security

📝 Adding Your Prompts

  1. Create prompts/prompt_v4.txt
  2. Use placeholders: {question} and {context} (see the example after this list)
  3. Run evaluation to compare: python demo_standalone.py
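
For example, prompts/prompt_v4.txt could look like the following (the wording is illustrative; only the {question} and {context} placeholders matter):

You are a careful assistant. Answer using only the information in the context.

Context:
{context}

Question:
{question}

Answer concisely and do not add facts that are not in the context.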

🧪 Testing & Development

# Install dev dependencies
pip install -r requirements-dev.txt

# Run tests
pytest tests/ -v --cov=src

# Linting
flake8 src/ app.py
black src/ app.py
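
New metrics can be covered the same way. A sketch of the test style (the completeness helper below is an inline stand-in, not the actual function exported by src/metrics.py):

# Illustrative test only; import the real metric from src/metrics.py instead.
def completeness(answer: str, key_points: list[str]) -> float:
    # Stand-in heuristic: fraction of key points mentioned in the answer
    text = answer.lower()
    return sum(p.lower() in text for p in key_points) / len(key_points)

def test_all_key_points_present():
    assert completeness("Paris is the capital of France", ["Paris", "France"]) == 1.0

def test_missing_key_point_lowers_score():
    assert completeness("Paris is a city", ["Paris", "France"]) == 0.5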

🔒 Environment Variables

Variable         Required  Default  Description
OPENAI_API_KEY   No        -        Falls back to demo mode without it
FLASK_DEBUG      No        False    Enable debug mode
FLASK_PORT       No        5000     Server port
CORS_ORIGINS     No        *        Allowed origins
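
Putting these together, a local .env might look like this (values are examples only):

OPENAI_API_KEY=your_key_here
FLASK_DEBUG=False
FLASK_PORT=5000
# Restrict this in production; if unset it defaults to *
CORS_ORIGINS=http://localhost:5000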

💡 Why This Matters

Most teams judge prompt quality subjectively. This platform provides objective, repeatable measurements to:

  • Track improvements over time
  • A/B test different approaches
  • Catch quality regressions
  • Make data-driven decisions

Think of it as unit tests for prompts.

📄 License

MIT - Use freely in your projects!


Try it now: python demo_standalone.py 🚀
