@hasna/evals

Open source AI evaluation framework — LLM-as-judge + assertion-based evals for any AI app.

CLI (evals) · MCP server (evals-mcp) · TypeScript SDK

Install

bun install -g @hasna/evals
# or
npm install -g @hasna/evals

5-minute quickstart

1. Write a dataset (datasets/smoke.jsonl):

{"id":"q-001","input":"What is 2+2?","assertions":[{"type":"contains","value":"4"}],"judge":{"rubric":"Must answer 4 correctly."}}
{"id":"q-002","input":"Say hello","assertions":[{"type":"min_length","value":2}],"judge":{"rubric":"Should respond with a greeting."}}

2. Run evals against your app:

evals run datasets/smoke.jsonl --adapter http --url http://localhost:3000/api/chat

3. Output:

✓ PASS  q-001    124ms
✓ PASS  q-002     89ms
────────────────────────────────
  2/2 passed (100%)  0.2s  $0.0012

Eval case format

Single-turn

{
  "id": "greeting-001",
  "input": "Hello, what can you do?",
  "expected": "A welcoming response listing capabilities",
  "adapter": { "type": "http", "url": "http://localhost:3000/api/chat" },
  "assertions": [
    { "type": "min_length", "value": 20 },
    { "type": "not_contains", "value": "I cannot" },
    { "type": "max_length", "value": 500 }
  ],
  "judge": {
    "rubric": "Should be welcoming and list 2-3 capabilities. PASS if friendly and informative.",
    "model": "claude-sonnet-4-6"
  },
  "tags": ["smoke", "greeting"]
}

Multi-turn

{
  "id": "refund-flow-001",
  "turns": [
    { "role": "user", "content": "I want a refund." },
    { "role": "assistant", "expected": "asks for order ID" },
    { "role": "user", "content": "Order #1234" },
    { "role": "assistant", "expected": "confirms refund process" }
  ],
  "judge": {
    "rubric": "Should collect order ID before processing. Should not promise instant refund."
  }
}

Pass^k (consistency testing)

{
  "id": "booking-001",
  "input": "Book a flight to Paris",
  "repeat": 5,
  "passThreshold": 0.8,
  "judge": { "rubric": "Should ask for dates and destination confirmation." }
}

Assertion types

Type	What it checks	Example
`contains`	Output contains string	`{"type":"contains","value":"hello"}`
`not_contains`	Output does NOT contain string	`{"type":"not_contains","value":"error"}`
`starts_with` / `ends_with`	Prefix/suffix match	`{"type":"starts_with","value":"Sure"}`
`equals`	Exact match	`{"type":"equals","value":"4"}`
`regex` / `not_regex`	Regex match	`{"type":"regex","value":"\\d{4}"}`
`max_length` / `min_length`	Character count	`{"type":"max_length","value":500}`
`json_valid`	Response is valid JSON	`{"type":"json_valid"}`
`json_schema`	Response matches JSON schema	`{"type":"json_schema","value":{...}}`
`tool_called`	Specific tool was invoked	`{"type":"tool_called","value":"search"}`
`tool_not_called`	Tool was NOT invoked	`{"type":"tool_not_called","value":"delete"}`
`tool_call_count`	Number of tool calls in range	`{"type":"tool_call_count","min":1,"max":3}`
`tool_args_match`	Tool arguments match expected	`{"type":"tool_args_match","value":{"tool":"search","args":{"query":"AI"}}}`
`response_time_ms`	Response under time limit	`{"type":"response_time_ms","max":3000}`
`token_count`	Token count in range	`{"type":"token_count","min":10,"max":500}`
`cost_usd`	Cost under budget	`{"type":"cost_usd","max":0.01}`
`semantic_similarity`	Meaning matches expected	`{"type":"semantic_similarity","value":"acknowledge frustration","threshold":0.8}`

Assertions run cheapest-first — deterministic checks before embeddings. The LLM judge only runs if all assertions pass.

Adapters

Configure which adapter connects the eval runner to your app:

# HTTP (any REST endpoint)
evals run dataset.jsonl --adapter http --url http://localhost:3000/api/chat

# Direct Anthropic API
evals run dataset.jsonl --adapter anthropic --model claude-sonnet-4-6

# Direct OpenAI API (also works with Ollama)
evals run dataset.jsonl --adapter openai --model gpt-4o --url http://localhost:11434

# MCP tool (eval your MCP server directly)
evals run dataset.jsonl --adapter mcp --mcp-command "node dist/mcp/index.js" --tool my_tool

# JS function (fastest, no network)
evals run dataset.jsonl --adapter function --module ./src/handler.js

# CLI command (pipe stdin, capture stdout)
evals run dataset.jsonl --adapter cli --command "my-cli-tool --input '{{input}}'"

LLM judge

PASS / FAIL / UNKNOWN — no numeric scales
Chain-of-thought before verdict — judge always reasons first
temperature=0 — deterministic judgments
Configurable model — default claude-sonnet-4-6, supports any Anthropic or OpenAI model

"judge": {
  "rubric": "Should answer in Romanian. Should reference at least one feature. Under 100 words.",
  "model": "claude-opus-4-6",
  "provider": "anthropic"
}

CLI reference

# Run a dataset
evals run datasets/smoke.jsonl --adapter http --url http://localhost:3000/api/chat

# CI mode — exit 1 on regression
evals ci run datasets/smoke.jsonl --adapter http --url http://localhost:3000/api/chat --baseline main --fail-if-regression 5

# Set baseline for CI comparison
evals ci set-baseline main

# Cost estimate before running (no API calls)
evals estimate datasets/smoke.jsonl --model claude-sonnet-4-6

# Compare two runs
evals compare <run-id-before> <run-id-after>
evals compare main latest --markdown

# One-shot judge
evals judge --input "What is AI?" --output "AI is..." --rubric "Should define AI clearly"

# Generate eval cases from a description
evals generate --description "users asking about refund policies" --count 20 --output datasets/refunds.jsonl

# Calibrate your judge against gold labels
evals calibrate gold-50.jsonl --model claude-sonnet-4-6

# Capture production traffic as eval cases
evals capture --app http://localhost:3000 --rate 0.1 --output datasets/captured.jsonl

# Health check
evals doctor

# Register MCP server with Claude Code / Codex / Gemini
evals mcp register --claude      # Claude Code (~/.claude/mcp.json)
evals mcp register --codex       # Codex (~/.codex/config.json)
evals mcp register --gemini      # Gemini (~/.gemini/settings.json)
evals mcp register --all         # all three at once

CI / GitHub Actions

- name: Run evals
  run: |
    evals ci run datasets/smoke.jsonl \
      --adapter http \
      --url ${{ env.APP_URL }} \
      --baseline main \
      --fail-if-regression 5
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

MCP tools (for agents)

Register with your agent: evals mcp register --claude (or --codex, --gemini, --all)

Tool	Description
`evals_run`	Run a full eval dataset
`evals_run_single`	Judge a single response mid-session
`evals_judge`	One-shot LLM judge call
`evals_list_datasets`	List available datasets
`evals_get_results`	Get past run results
`evals_compare`	Compare two runs
`evals_create_case`	Add a case to a dataset
`evals_generate_cases`	Auto-generate cases from a description

Key agent pattern — self-check before responding:

evals_run_single(
  input: "What is the capital of France?",
  output: "The capital of France is Paris.",
  rubric: "Must correctly identify Paris as the capital."
)
→ PASS — The response correctly identifies Paris.

License

Apache 2.0 — see LICENSE

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
dashboard		dashboard
datasets/examples		datasets/examples
src		src
.gitignore		.gitignore
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md
SECURITY.md		SECURITY.md
package.json		package.json
tsconfig.json		tsconfig.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

@hasna/evals

Install

5-minute quickstart

Eval case format

Single-turn

Multi-turn

Pass^k (consistency testing)

Assertion types

Adapters

LLM judge

CLI reference

CI / GitHub Actions

MCP tools (for agents)

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

@hasna/evals

Install

5-minute quickstart

Eval case format

Single-turn

Multi-turn

Pass^k (consistency testing)

Assertion types

Adapters

LLM judge

CLI reference

CI / GitHub Actions

MCP tools (for agents)

License

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages