Eval-Driven Development (EDD) is a methodology for guiding the development of LLM-backed apps via a set of task-specific evals (i.e., a prompt, context, and expected output used as a reference).*
These evals guide prompt engineering, model selection, fine-tuning, and so on. We can then re-run them to quickly measure improvements or regressions as the app changes.
It's Test-Driven Development (TDD) for LLM-backed apps.
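The core loop can be sketched in a few lines of Python. This is a minimal illustration, not any particular framework's API: `call_model` is a hypothetical stand-in for whatever model call the app makes, and exact-match scoring is the simplest possible metric.

```python
def call_model(prompt: str, context: str) -> str:
    # Hypothetical stand-in for a real LLM call; a real app would
    # send `prompt` and `context` to a model and return its output.
    return "Paris" if "France" in prompt else "unknown"

# Each eval pairs a prompt and context with a reference output.
EVALS = [
    {
        "prompt": "What is the capital of France?",
        "context": "France is a country in Europe.",
        "expected": "Paris",
    },
]

def run_evals(evals: list[dict]) -> float:
    """Return the fraction of evals whose output matches the reference."""
    passed = sum(
        call_model(e["prompt"], e["context"]).strip() == e["expected"]
        for e in evals
    )
    return passed / len(evals)
```

Running `run_evals(EVALS)` after every prompt or model change gives a single pass-rate number to track; real frameworks replace exact match with semantic or model-graded scoring.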
| Name | Description |
|---|---|
| Auto Evaluator | Evaluation tool for LLM QA chains |
| DeepEval | Evaluation and unit testing for LLMs |
| Evals | A framework for evaluating LLMs and LLM systems |
| Phoenix | Evaluate, troubleshoot, and fine-tune your LLM in a notebook |
| Ragas | Evaluation framework for your Retrieval-Augmented Generation (RAG) pipelines |
| Uptrain | Your open-source LLM evaluation toolkit |

| Name | Distribution | Maturity | Self-service signup |
|---|---|---|---|
| Freeplay | SaaS | Private Beta | No |
| Patronus AI | SaaS | Released | No |
\* Definition adapted from *Patterns for Building LLM-based Systems & Products* by Eugene Yan.