AI Evaluation Frameworks

Welcome to the AI Evaluation Frameworks repository! This project serves as a curated, continuously updated knowledge base for understanding, benchmarking, and governing cutting-edge Artificial Intelligence systems.

As AI shifts from discrete, classical machine learning models to autonomous, multi-agent workflows, the methods for ensuring safety, reasoning accuracy, and regulatory compliance have changed substantially, requiring new toolchains and paradigms.

📚 What's Inside

The primary resource in this repository is a comprehensive evaluation guide.

This continually growing document is organized into four major pillars:

  1. Agentic AI & LLMs:
    • Top open-source evaluation frameworks (Promptfoo, RAGAS, TruLens, DeepEval).
    • Agent-specific capability benchmarks (SWE-bench, WebArena, GAIA). A minimal harness sketch follows this list.
  2. Classical ML Frameworks:
    • MLOps observability heavyweights (MLflow, Evidently AI, Deepchecks, W&B).
    • Fundamental supervised and unsupervised evaluation metrics, illustrated after this list.
  3. Top-Tier Academia & Research:
    • Bleeding-edge preprints and publications across arXiv, ACM, and IEEE.
    • Deep dives into Pervasive Computing, IoT edge agents, and seminal PhD theses from MIT Media Lab, CMU, Stanford, and UC Berkeley SKY Lab.
  4. Enterprise AI Governance (Financial/Regulated AI):
    • Guidance on safely deploying multi-agent systems in strict compliance environments, drawing on Capital One AI Research.
    • Understanding the Model Risk Office (MRO), policy-bound runtime checks, conversational observability traces, and Human-in-the-Loop (HITL) integration; see the policy-check sketch after this list.
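
To make the agentic pillar concrete, here is a minimal, framework-agnostic sketch of the harness pattern the tools above share: collect test cases, score each output with a metric, and gate on a threshold. All names in it (EvalCase, exact_match, run_eval) are illustrative and not drawn from any of the listed frameworks.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    """One evaluation example: a prompt, the expected answer, and the captured output."""
    prompt: str
    expected: str
    actual: str

def exact_match(case: EvalCase) -> float:
    """Toy metric: 1.0 on an exact (whitespace-insensitive) match, else 0.0."""
    return 1.0 if case.actual.strip() == case.expected.strip() else 0.0

def run_eval(cases: list[EvalCase],
             metric: Callable[[EvalCase], float],
             threshold: float = 0.8) -> bool:
    """Average the metric over all cases and gate on the pass threshold."""
    score = sum(metric(c) for c in cases) / len(cases)
    print(f"mean score: {score:.2f} (threshold {threshold})")
    return score >= threshold

cases = [
    EvalCase(prompt="2 + 2 = ?", expected="4", actual="4"),
    EvalCase(prompt="Capital of France?", expected="Paris", actual="Paris"),
]
assert run_eval(cases, exact_match)
```

Real frameworks such as DeepEval or RAGAS replace exact_match with LLM-based or retrieval-grounded metrics, but the case/metric/threshold structure is the same.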
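For the classical ML pillar, the fundamental supervised metrics are easy to demonstrate with scikit-learn's metrics API; the labels below are made up purely for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical ground-truth labels and model predictions for a binary task.
y_true = [0, 1, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

print(f"accuracy:  {accuracy_score(y_true, y_pred):.2f}")   # fraction of correct predictions
print(f"precision: {precision_score(y_true, y_pred):.2f}")  # TP / (TP + FP)
print(f"recall:    {recall_score(y_true, y_pred):.2f}")     # TP / (TP + FN)
print(f"f1:        {f1_score(y_true, y_pred):.2f}")         # harmonic mean of precision and recall
```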
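For the governance pillar, a policy-bound runtime check with Human-in-the-Loop escalation can be sketched in a few lines. The blocked-term rule and function names here are hypothetical stand-ins for a real policy engine, not any vendor's API.

```python
# Hypothetical policy-bound runtime check: an agent action executes only if it
# clears the policy rules; otherwise it is escalated to a human reviewer (HITL).
BLOCKED_TERMS = {"wire transfer", "ssn"}  # stand-in for a real policy ruleset

def passes_policy(action_text: str) -> bool:
    """Return True when no blocked term appears in the proposed action."""
    lowered = action_text.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

def execute_with_hitl(action_text: str) -> str:
    """Execute compliant actions; queue everything else for manual review."""
    if passes_policy(action_text):
        return f"executed: {action_text}"
    return f"escalated for human review: {action_text}"

print(execute_with_hitl("summarize last month's account activity"))
print(execute_with_hitl("initiate wire transfer of $10,000"))
```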

🎯 Motivation

This repository is aimed at AI Researchers, ML Engineers, and Enterprise Architects who need a high-signal reference for scaling AI. It bridges the gap between building proof-of-concept AI bots and deploying auditable, governed "Agentic Ecosystems" in production.


Repository maintained in the nbajpai-code organization.
