Welcome to the AI Evaluation Frameworks repository! This project serves as a curated, continuously updated knowledge base for understanding, benchmarking, and governing cutting-edge Artificial Intelligence systems.
As AI shifts from discrete, classical machine learning models to autonomous, multi-agent workflows, the methods for ensuring safety, reasoning accuracy, and regulatory compliance have changed drastically, requiring new toolchains and evaluation paradigms.
The primary resource in this repository is the comprehensive evaluation guide. This continually growing document is organized into four major pillars:
- Agentic AI & LLMs:
- Top open-source evaluation frameworks (Promptfoo, RAGAS, TruLens, DeepEval).
- Agent-specific capability benchmarks (SWE-bench, WebArena, GAIA).
- Classical ML Frameworks:
- MLOps observability heavy-weights (MLflow, Evidently AI, Deepchecks, W&B).
- Fundamental supervised and unsupervised evaluation metrics.
- Top Tier Academia & Research:
- Bleeding-edge preprints and publications across arXiv, ACM, and IEEE.
- Deep dives into Pervasive Computing, IoT edge agents, and seminal PhD theories from MIT Media Lab, CMU, Stanford, and UC Berkeley SKY Lab.
- Enterprise AI Governance (Financial/Regulated AI):
- Insights on safely deploying multi-agent systems in strict compliance environments (deriving insights from Capital One AI Research).
- Understanding the Model Risk Office (MRO), policy-bound runtime checks, conversational observability traces, and Human-in-the-Loop (HITL) integration.
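To make the classical ML pillar concrete, here is a minimal scikit-learn sketch of the supervised and unsupervised metrics the guide covers; the labels and clustering data below are toy placeholders, not examples taken from the guide itself.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Supervised metrics on toy binary predictions (placeholder labels).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("f1       :", f1_score(y_true, y_pred))

# Unsupervised metric: silhouette score on synthetic clusters.
X, _ = make_blobs(n_samples=200, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print("silhouette:", silhouette_score(X, labels))
```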
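The governance pillar's policy-bound runtime checks and HITL escalation can be sketched in framework-agnostic Python. Every name below (`RuntimeGovernor`, `no_pii_policy`, the escalation queue) is a hypothetical illustration of the pattern, not an API from any vendor or from Capital One AI Research.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A policy inspects an agent's proposed action and returns violation
# messages; an empty list means the action is compliant.
Policy = Callable[[str], List[str]]

@dataclass
class RuntimeGovernor:
    policies: List[Policy]
    escalation_queue: List[str] = field(default_factory=list)  # stand-in for a HITL review system

    def review(self, proposed_action: str) -> bool:
        """Run every policy; escalate to a human reviewer on any violation."""
        violations = [msg for policy in self.policies for msg in policy(proposed_action)]
        if violations:
            # A real deployment would emit an observability trace and route
            # the case to the Model Risk Office / a human reviewer here.
            self.escalation_queue.append(f"{proposed_action!r}: {violations}")
            return False
        return True

def no_pii_policy(action: str) -> List[str]:
    # Toy rule: flag actions that mention an obviously sensitive field.
    return ["mentions SSN"] if "ssn" in action.lower() else []

governor = RuntimeGovernor(policies=[no_pii_policy])
print(governor.review("Send marketing email to opted-in customers"))  # True
print(governor.review("Export customer SSN list to vendor"))          # False, escalated
print(governor.escalation_queue)
```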
This repository is designed for AI researchers, ML engineers, and enterprise architects who need a high-signal reference point for scaling AI. It aims to bridge the gap between building proof-of-concept AI bots and deploying auditable, governed "Agentic Ecosystems" into production.
Repository maintained in the nbajpai-code organization.