Welcome to the AI Evaluation Frameworks repository! This project serves as a curated, continuously updated knowledge base for understanding, benchmarking, and governing cutting-edge Artificial Intelligence systems.
As AI shifts from discrete, classical machine learning models to autonomous, multi-agent workflows, the methods for ensuring safety, reasoning accuracy, and regulatory compliance have changed drastically, requiring new toolchains and evaluation paradigms.
The primary resource in this repository is the comprehensive evaluation guide. This continually growing document is organized into four major pillars:
- Agentic AI & LLMs:
- Top open-source evaluation frameworks (Promptfoo, RAGAS, TruLens, DeepEval).
- Agent-specific capability benchmarks (SWE-bench, WebArena, GAIA).
- Classical ML Frameworks:
- MLOps observability heavy-weights (MLflow, Evidently AI, Deepchecks, W&B).
- Fundamental supervised and unsupervised evaluation metrics.
- Top Tier Academia & Research:
- Bleeding-edge preprints and publications across arXiv, ACM, and IEEE.
- Deep dives into Pervasive Computing, IoT edge agents, and seminal PhD theories from MIT Media Lab, CMU, Stanford, and UC Berkeley SKY Lab.
- Enterprise AI Governance (Financial/Regulated AI):
- Insights on safely deploying multi-agent systems in strict compliance environments (deriving insights from Capital One AI Research).
- Understanding the Model Risk Office (MRO), policy-bound runtime checks, conversational observability traces, and Human-in-the-Loop (HITL) integration.
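To make the classical ML pillar concrete, here is a minimal scikit-learn sketch of the supervised and unsupervised metrics the guide covers; the labels and clustering data below are toy placeholders, not examples taken from the guide itself.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Supervised metrics on toy binary predictions (placeholder labels).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))

# Unsupervised metric: silhouette score on synthetic clusters.
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))
```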
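The governance pillar's policy-bound runtime checks and HITL escalation can be sketched in framework-agnostic Python. Every name below (`RuntimeGovernor`, `no_pii_policy`, the escalation queue) is a hypothetical illustration of the pattern, not an API from any vendor or from Capital One AI Research.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A policy inspects an agent's proposed action and returns violation
# messages; an empty list means the action is compliant.
Policy = Callable[[str], List[str]]

@dataclass
class RuntimeGovernor:
    policies: List[Policy]
    escalation_queue: List[str] = field(default_factory=list)  # stand-in for a HITL review system

    def review(self, proposed_action: str) -> bool:
        """Run every policy; escalate to a human reviewer on any violation."""
        violations = [msg for policy in self.policies for msg in policy(proposed_action)]
        if violations:
            # A real deployment would emit an observability trace and route
            # the case to the Model Risk Office / a human reviewer here.
            self.escalation_queue.append(f"{proposed_action!r}: {violations}")
            return False
        return True

def no_pii_policy(action: str) -> List[str]:
    # Toy rule: flag actions that mention an obviously sensitive field.
    return ["mentions SSN"] if "ssn" in action.lower() else []

governor = RuntimeGovernor(policies=[no_pii_policy])
print(governor.review("Send marketing email to opted-in customers"))  # True
print(governor.review("Export customer SSN list to vendor"))          # False, escalated
print(governor.escalation_queue)
```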
This repository is designed for AI researchers, ML engineers, and enterprise architects who need a high-signal reference point for scaling AI. It aims to bridge the gap between building proof-of-concept AI bots and deploying auditable, governed "Agentic Ecosystems" into production.
Repository maintained in the nbajpai-code organization.