# genai_evals

A comprehensive evaluation framework for testing and benchmarking AI systems, including prompts, agents, agentic workflows, RAG pipelines, and generative models.
This project provides modular evaluation tools for different AI system components, from simple prompt testing to complex multi-agent RAG pipelines. Each evaluation module is designed to be independent yet follows consistent patterns for metrics, datasets, and experimentation.
## Project Structure

```
genai_evals/
├── ai/                         # Core AI implementations
│   ├── agents.py               # Agent definitions
│   ├── agentic_workflow.py     # Multi-step workflow logic
│   └── call_llm.py             # LLM invocation utilities
│
├── datasets/                   # Evaluation datasets (CSV format)
│
├── evals_intro/                # Simple evaluation examples
│   ├── simple_prompt_eval_playground.py
│   ├── simple_agent_eval_playground.py
│   └── simple_agentic_workflow_eval_playground.py
│
├── rag_evals/                  # RAG system evaluations
│   ├── context_evals/          # Context retrieval quality
│   ├── generation_evals/       # Response generation quality
│   └── e2e_rag_eval_pipe/      # End-to-end RAG pipeline
│       ├── pdf2image.py
│       ├── image2markdown.py
│       ├── ingest2vectorstore.py
│       └── execute_rag_pipe.py
│
└── experiments/                # Experiment results and configs
```
### Simple Evaluations (`evals_intro`)

Basic evaluation playgrounds for prompts, agents, and workflows using the Ragas framework.
Key Features:
- Binary pass/fail metrics for prompts
- Numerical correctness for computational agents
- LLM-as-judge for workflow quality assessment
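The binary pass/fail style is the simplest of the three. Below is a framework-agnostic sketch of that loop; the playground itself wires the scoring through Ragas, and both `classify` and the dataset path here are hypothetical stand-ins:

```python
# Minimal pass/fail evaluation: compare a model's label against the
# expected label from each CSV row.
import csv

def classify(text: str) -> str:
    """Placeholder for the real LLM call; returns 'positive' or 'negative'."""
    return "positive" if "great" in text.lower() else "negative"

with open("datasets/sentiment.csv", newline="") as f:  # hypothetical file
    rows = list(csv.DictReader(f))

passed = sum(classify(row["text"]) == row["label"] for row in rows)
print(f"pass rate: {passed}/{len(rows)}")
```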
### RAG Evaluations (`rag_evals`)

Comprehensive RAG pipeline evaluation, from document ingestion to response generation.
Components:
- Context Evals: Retrieval quality and relevancy metrics
- Generation Evals: Response quality and hallucination detection
- E2E Pipeline: Complete PDF → Vector Store → Agent → Evaluation workflow
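The first stage of that pipeline renders PDF pages to images for a vision model. Here is a minimal sketch using PyMuPDF (listed in the tech stack below); the input path, output directory, and DPI are illustrative:

```python
# Rasterize each PDF page to a PNG, roughly what a pdf2image step does.
import os
import fitz  # PyMuPDF

os.makedirs("pages", exist_ok=True)
doc = fitz.open("datasets/sample.pdf")
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=150)       # render the page at 150 DPI
    pix.save(f"pages/page_{i:03d}.png")
doc.close()
```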
Key Features:
- PDF document processing with vision models
- Vector store ingestion with semantic chunking (see the ingestion sketch after this list)
- Multi-metric evaluation (context relevancy, answer relevancy, faithfulness)
- Agent performance benchmarking
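As a rough illustration of the ingestion step, the sketch below embeds text chunks with FastEmbed and upserts them into Qdrant. The naive paragraph split stands in for the pipeline's semantic chunking, and the file name, embedding model, URL, and collection name are all illustrative:

```python
# Embed markdown chunks and upsert them into a Qdrant collection.
from fastembed import TextEmbedding
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

chunks = open("doc.md").read().split("\n\n")      # stand-in for semantic chunking
model = TextEmbedding("BAAI/bge-small-en-v1.5")   # 384-dimensional embeddings
vectors = list(model.embed(chunks))

client = QdrantClient(url="http://localhost:6333")
client.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=vec.tolist(), payload={"text": chunk})
        for i, (vec, chunk) in enumerate(zip(vectors, chunks))
    ],
)
```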
### Experiments (`experiments`)

Structured experiment tracking and result storage.
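The exact layout under `experiments/` is up to each module; one plausible pattern (an assumption, not prescribed by the framework) is a timestamped directory per run:

```python
# Persist one run's scores as JSON under a timestamped directory.
import json
import pathlib
import time

result = {"metric": "faithfulness", "score": 0.87}   # illustrative values
run_dir = pathlib.Path("experiments") / time.strftime("%Y%m%d-%H%M%S")
run_dir.mkdir(parents=True, exist_ok=True)
(run_dir / "results.json").write_text(json.dumps(result, indent=2))
```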
📚 Important: Each major module contains its own detailed README.md file with specific setup instructions, usage examples, and API documentation.
Navigate to individual module directories to find:

- `evals_intro/README.md` - Simple evaluation playground documentation
- `rag_evals/e2e_rag_eval_pipe/README.md` - End-to-end RAG pipeline guide
- `rag_evals/context_evals/README.md` - Context evaluation metrics
- `rag_evals/generation_evals/README.md` - Generation quality metrics
## Getting Started

### Installation

```bash
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
```

### Configuration

Create a `.env` file with the required API keys:

```
OPENAI_API_KEY=your_openai_key
ANTHROPIC_API_KEY=your_anthropic_key
GEMINI_API_KEY=your_gemini_key
QDRANT_URL=http://localhost:6333
QDRANT_API_KEY=your_qdrant_key
COLLECTION_NAME=your_collection_name
```
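The scripts read these values from the environment. To load the `.env` file in your own code, python-dotenv is the usual route (an assumption here; check `requirements.txt` for what the project actually uses):

```python
# Load .env values into the process environment at startup.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
qdrant_url = os.environ["QDRANT_URL"]
collection = os.environ["COLLECTION_NAME"]
```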
### Running Evaluations

```bash
# Simple prompt evaluation
python evals_intro/simple_prompt_eval_playground.py
# Agent evaluation
python evals_intro/simple_agent_eval_playground.py
# Workflow evaluation
python evals_intro/simple_agentic_workflow_eval_playground.py
# Complete RAG pipeline
python rag_evals/e2e_rag_eval_pipe/execute_rag_pipe.py
```

## Tech Stack

- Frameworks: Ragas, Agno, DeepEval, LlamaIndex
- LLMs: OpenAI GPT-4, Anthropic Claude, Google Gemini, Ollama
- Vector Stores: Qdrant
- Document Processing: Docling, PyMuPDF
- Embeddings: FastEmbed, Sentence Transformers
## Evaluation Metrics

### Metric Types

- Discrete metrics (pass/fail)
- Numeric metrics (precision-based)
- LLM-as-judge evaluations
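Of the three, LLM-as-judge is the least mechanical. Here is a minimal sketch using the OpenAI Python client; the model name and PASS/FAIL rubric are illustrative, not the framework's own prompts:

```python
# Ask a judge model whether an output meets the pass criteria.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge(output: str, criteria: str) -> bool:
    resp = client.chat.completions.create(
        model="gpt-4",  # illustrative choice
        messages=[{
            "role": "user",
            "content": f"Criteria: {criteria}\n\nOutput:\n{output}\n\n"
                       "Reply with exactly PASS or FAIL.",
        }],
    )
    return resp.choices[0].message.content.strip().upper() == "PASS"
```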
### RAG and Agent Metrics

- Context Relevancy - Are retrieved documents relevant?
- Answer Relevancy - Is the response on-topic?
- Faithfulness - Is the answer grounded in context?
- Tool Call Reliability - Did the agent use tools correctly?
- Performance Benchmarks - Response time and throughput
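The first three map directly onto Ragas metrics. A minimal sketch of scoring a single example (metric names and the `evaluate` API vary across Ragas versions, and a judge LLM must be configured via API key):

```python
# Score one question/answer/context triple with Ragas.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

data = Dataset.from_dict({
    "question": ["What is the capital of France?"],
    "answer": ["Paris is the capital of France."],
    "contexts": [["Paris has been the capital of France since 987."]],
})
print(evaluate(data, metrics=[faithfulness, answer_relevancy]))
```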
## Dataset Formats

All evaluations expect CSV datasets with specific schemas. Examples:
```
# Prompt evaluation
text,label
"This is great!",positive
"This is terrible!",negative
# Agent evaluation
expression,expected
"2 + 2",4
"sqrt(16)",4
# Workflow evaluation
email,pass_criteria
"Subject: Meeting Request...","Professional tone and clear action items"Each module is designed to be extended independently. Refer to individual module READMEs for contribution guidelines specific to that component.
## Contributing

Each module is designed to be extended independently. Refer to the individual module READMEs for contribution guidelines specific to each component.

## License

Apache 2.0 License. See the LICENSE file for details.
## Module Documentation

For detailed documentation on specific modules, refer to the README.md files in each subdirectory:

- 📖 Simple Evaluations → `evals_intro/README.md`
- 📖 RAG Pipeline → `rag_evals/e2e_rag_eval_pipe/README.md`
- 📖 Context Evals → `rag_evals/context_evals/README.md`
- 📖 Generation Evals → `rag_evals/generation_evals/README.md`
Note: This is a modular evaluation framework where each component has comprehensive documentation. Start with the module-specific READMEs for implementation details and best practices.