TypeScript library for evaluating prompts against coding agents (Claude Code, Cursor, etc.) with multi-iteration testing and scoring

pffigueiredo/code-agent-eval

code-agent-eval

Evaluate coding agent prompts (Claude Code, Cursor, etc.) by running them multiple times and scoring outputs. Test reliability, capture changes, measure success rates.

Key Principle: Your codebase stays untouched. All modifications happen in isolated temp directories.
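The isolation principle above can be pictured as "copy the project to a fresh temp directory, let the agent work there, then discard the copy". The following is only a minimal sketch of that idea using Node's standard library; `createIsolatedCopy` and `removeIsolatedCopy` are hypothetical names chosen for illustration, not functions exported by code-agent-eval, and the library's real implementation may differ.

```typescript
import { cpSync, existsSync, mkdtempSync, rmSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper for illustration; not part of code-agent-eval's API.
// Copies projectDir into a fresh temp directory and returns the copy's path.
export function createIsolatedCopy(projectDir: string): string {
  if (!existsSync(projectDir)) {
    throw new Error(`Project directory not found: ${projectDir}`);
  }
  // mkdtempSync appends random characters, so each call yields a unique directory.
  const workDir = mkdtempSync(join(tmpdir(), 'agent-eval-'));
  // Copy the contents of projectDir into workDir; the original is only read.
  cpSync(projectDir, workDir, { recursive: true });
  return workDir;
}

// Hypothetical helper: removes a working copy once its iteration is finished.
export function removeIsolatedCopy(workDir: string): void {
  rmSync(workDir, { recursive: true, force: true });
}
```

Because every call produces a new directory, parallel iterations cannot interfere with each other or with the source project.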

Features

  • 🔄 Multi-iteration runs with aggregate metrics (pass rate, mean/min/max, std dev)
  • ⚡ Sequential, parallel, or rate-limited execution
  • 🔒 Isolated temp directories per iteration
  • ✅ Built-in scorers (build/test/lint) + custom scorer support
  • 📊 Git diff capture + markdown results export
  • 🔧 Environment variable injection (static/dynamic)
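To make the aggregate metrics in the list above concrete, here is a small sketch of how pass rate, mean, min, max, and standard deviation could be derived from per-iteration scores. The `aggregate` function, the `AggregateStats` shape, and the pass threshold of 1 are assumptions made for this example; the library's own field names and its choice between population and sample standard deviation are not documented here.

```typescript
// Hypothetical result shape for illustration; not code-agent-eval's actual type.
export interface AggregateStats {
  passRate: number; // fraction of iterations at or above the threshold
  mean: number;
  min: number;
  max: number;
  stdDev: number; // population standard deviation (an assumption)
}

// Summarises one score per iteration. Scores are assumed to lie in [0, 1].
export function aggregate(scores: number[], passThreshold = 1): AggregateStats {
  if (scores.length === 0) {
    throw new Error('aggregate() needs at least one score');
  }
  const n = scores.length;
  const mean = scores.reduce((sum, s) => sum + s, 0) / n;
  const variance = scores.reduce((sum, s) => sum + (s - mean) ** 2, 0) / n;
  const passed = scores.filter((s) => s >= passThreshold).length;
  return {
    passRate: passed / n,
    mean,
    min: Math.min(...scores),
    max: Math.max(...scores),
    stdDev: Math.sqrt(variance),
  };
}
```

For example, four iterations scoring 1, 0, 1, 1 give a pass rate and mean of 0.75; a high standard deviation across iterations is a hint that the prompt behaves inconsistently.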

Installation

npm install code-agent-eval
# or
pnpm add code-agent-eval
# or
yarn add code-agent-eval
# or
bun add code-agent-eval

Quick Start

import { runClaudeCodeEval, scorers } from 'code-agent-eval';

const result = await runClaudeCodeEval({
  name: 'add-feature',
  prompts: [{ id: 'v1', prompt: 'Add a health check endpoint' }],
  projectDir: './my-app',
  iterations: 10,
  execution: { mode: 'parallel' }, // or 'sequential' (default), 'parallel-limit'
  scorers: [scorers.buildSuccess(), scorers.testSuccess()],
});

// toFixed avoids floating-point noise such as 70.00000000000001%
console.log(`Pass rate: ${(result.aggregateScores._overall.passRate * 100).toFixed(1)}%`);
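The 'parallel-limit' mode named in the Quick Start comment suggests capping how many iterations run at the same time. The option names the library uses to configure that cap are not shown in this README, so the sketch below illustrates only the general technique with a hypothetical `runWithLimit` helper; it is not the library's implementation.

```typescript
// Hypothetical helper for illustration; not part of code-agent-eval's API.
// Runs the given async tasks with at most `limit` in flight at once and
// returns their results in the original task order.
export async function runWithLimit<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array<T>(tasks.length);
  let next = 0; // index of the next task that has not been started yet

  // Each worker repeatedly claims the next unstarted task until none remain.
  const worker = async (): Promise<void> => {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  };

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With `limit` set to 1 this behaves like sequential execution, and with `limit` equal to the number of tasks it behaves like fully parallel execution, which is one way to think about how the three modes relate.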

Development

npm install              # Install dependencies
npm run build            # Build library
npm run test             # Run tests

# Examples
npx tsx examples/phase1-single-run.ts
npx tsx examples/phase2-multi-iteration.ts
npx tsx examples/parallel-execution.ts
npx tsx examples/results-export.ts

Documentation

See CLAUDE.md for the detailed architecture and development guide.

Requirements

  • Node.js 18+
  • Claude Code logged in on the host machine

License

MIT