feat: AI eval framework for benchmarking MCP servers #2414
brandon-pereira wants to merge 21 commits into
- leverage kv rollup mvs
- allow claude access to read only in temp dir
- tweaks to sysprompt
The runs/ gitignore pattern was matching src/runs/ (source code) in addition to the intended /runs/ (data directory). Anchor the pattern so only the top-level runs/ directory is ignored, and commit the three missing source files: instrument.ts, path.ts, store.ts.
Remove files that are dev-specific or experimental scratch work:
- ablation/ directory (REPORT.md, manifest.tsv)
- scripts/ablation.sh, ablation-report.ts
- scripts/compare-prompt-variants.sh, fast-eval.sh
- MCP_IMPROVEMENTS.md

Also clean up README references and ablation .gitignore patterns.
- Extract spreadTimestamp() and normalizeSeverityText() shared helpers
- Collapse 14 identical phase functions into streamLogPhase() generic
- Compact ground-truth programmatic checks to tuple format [id, weight, pattern, neg?]
- Delete dead checkoutEventLog export, unexport internal logfmtBody/jsonEventBody
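The tuple format `[id, weight, pattern, neg?]` mentioned above could be evaluated roughly as follows. This is a minimal sketch, not the package's actual code: the `Check` type, `runChecks`, the weighted-fraction scoring, and the example check ids are all assumptions made for illustration.

```typescript
// Hypothetical tuple shape modelled on "[id, weight, pattern, neg?]".
type Check = [id: string, weight: number, pattern: RegExp, neg?: boolean];

interface CheckResult {
  id: string;
  weight: number;
  passed: boolean;
}

// Runs every check against the answer and returns the weighted pass fraction.
// Patterns should not carry the `g` flag: RegExp.test() is stateful with it.
function runChecks(
  answer: string,
  checks: readonly Check[],
): { results: CheckResult[]; score: number } {
  const results = checks.map(([id, weight, pattern, neg = false]) => {
    const matched = pattern.test(answer);
    // A negated check passes only when the pattern is absent.
    return { id, weight, passed: neg ? !matched : matched };
  });
  const total = results.reduce((sum, r) => sum + r.weight, 0);
  const earned = results.reduce((sum, r) => sum + (r.passed ? r.weight : 0), 0);
  return { results, score: total === 0 ? 0 : earned / total };
}

// Illustrative checks only; the ids and patterns are invented.
const exampleChecks: Check[] = [
  ["names-service", 2, /checkout/i],
  ["mentions-latency", 1, /latency|p99/i],
  ["no-db-blame", 1, /database outage/i, true],
];
```

The compact tuple keeps each rubric entry on one line, at the cost of positional fields; the optional fourth element defaults to a positive match.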
- Generalize from fixed HyperDX+ClickHouse pair to config-driven MCP registry
- Add dual-slot eval setup docs for A/B branch comparison
- Add baseline+challengers reporting model with delta columns
- Expand README with MCP config reference, field tables, and examples
- Improve viewer with comparison dashboard and drill-down
- Update blinding to handle arbitrary brand terms per MCP
- Add --baseline, --ch-url, --no-grade, --no-judge CLI flags
🦋 Changeset detected
Latest commit: ad7c588
The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
E2E Test Results
✅ All tests passed • 197 passed • 3 skipped • 1361s
Tests ran across 4 shards in parallel.
- Add packages/hdx-eval workspace to knip.json
- Remove unused @hyperdx/common-utils dependency
- Remove export from internal-only symbols (SOURCE_TRACES_TABLE, SOURCE_LOGS_TABLE, DENIED_BUILT_IN_TOOLS_BASE, CONFIG_FILENAME, loadGradedPairs, instrumentRun, HyperdxConnection, MeResponse)
- Remove dead code (getScenarioGroundTruth, ensureConfigDir)
- Add minor changeset for @hyperdx/hdx-eval
- Fix judge error silently penalizing combined score by 60% (grade.ts)
- Fix path traversal in viewer /api/batches/:batch route (server.js)
- Fix off-by-one in background operation selection (latency-spike)
- Fix worker pool crash-on-error losing all in-flight results (cli.ts)
- Fix claudeSpawn timer leaks on spawn error and SIGTERM escalation
- Fix listRunsInBatch including .grade.json/.timing.json sidecars
- Remove unused innerHTML attribute from viewer el() helper (XSS vector)
- Bind viewer server to 127.0.0.1 instead of all interfaces
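For the path traversal fix on the `/api/batches/:batch` route, one common guard is to resolve the requested name against the data root and reject anything that escapes it. The sketch below assumes a hypothetical helper `resolveBatchDir` and a `runsRoot` argument; it is not taken from server.js.

```typescript
import * as path from "node:path";

// Returns the absolute batch directory, or null if `batch` would resolve
// outside `runsRoot` (e.g. "../etc" or an absolute path).
// `resolveBatchDir` and `runsRoot` are illustrative names, not the PR's API.
function resolveBatchDir(runsRoot: string, batch: string): string | null {
  const root = path.resolve(runsRoot);
  const candidate = path.resolve(root, batch);
  const rel = path.relative(root, candidate);
  const escapesRoot =
    rel === "" || // the root itself is not a batch
    rel === ".." ||
    rel.startsWith(".." + path.sep) ||
    path.isAbsolute(rel); // different drive on Windows
  return escapesRoot ? null : candidate;
}
```

A route handler would typically respond with 400 or 404 when this returns `null`, before touching the filesystem.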
…idation, temp cleanup, compression, type safety
- Fix normalizeSeverityText case bug: 'FATAL' now correctly returns 'ERROR'
- Add AbortSignal.timeout to all HyperDX API and health check fetch calls
- Add identifier validation in scenarioSlug to reject unsafe characters
- Clean up temp directories after subprocess exits (leaked API keys)
- Enable ClickHouse request compression for batch inserts
- Replace per-row buildResourceAttrs with pre-built pool in noisy-signals
- Narrow groundTruth type from unknown to Record<string, unknown>
- Update AGENTS.md to list hdx-eval as sixth package
- Remove dead pickSeverity import and ScenarioOutput type alias
- Replace inline require('fs') with top-level import in cli.ts
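The severity case bug in the list above is the kind of issue that goes away when input is upper-cased before any alias lookup. A minimal sketch, assuming a hypothetical alias table: only the `FATAL` to `ERROR` mapping comes from the commit message, and the `WARNING` entry and the function name are illustrative.

```typescript
// Assumed alias table. Only FATAL -> ERROR is stated in the commit message;
// WARNING -> WARN is an illustrative guess.
const SEVERITY_ALIASES: Record<string, string> = {
  FATAL: "ERROR",
  WARNING: "WARN",
};

// Normalizes case first so "fatal", "Fatal" and "FATAL" all take the same path.
function normalizeSeverityTextSketch(raw: string): string {
  const upper = raw.trim().toUpperCase();
  return SEVERITY_ALIASES[upper] ?? upper;
}
```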
- Add anchorTime field to EvalConfig, auto-generated and saved on first run so subsequent runs reuse the same anchor automatically
- Default to skip reseed (old --no-reseed behavior); add --reseed to opt in
- Add --live flag to opt out of saved anchor (wall-clock now, implies --reseed)
- --anchor-time <iso> now overrides and saves to eval.config.json
- Update README with new CLI flags and Anchor Time section
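The flag interactions described above can be summarized as a small decision function. This is a sketch under stated assumptions: the types and `resolveAnchor` are hypothetical, `--live` is assumed to take precedence over `--anchor-time`, and a live anchor is assumed not to be written back to the config.

```typescript
interface AnchorFlags {
  anchorTime?: string; // value of --anchor-time <iso>
  live?: boolean; // --live
  reseed?: boolean; // --reseed
}

interface AnchorDecision {
  anchor: Date;
  reseed: boolean;
  save: boolean; // whether to persist the anchor to the config file
}

function resolveAnchor(
  flags: AnchorFlags,
  savedAnchor: string | undefined,
  now: Date = new Date(),
): AnchorDecision {
  // --live: wall-clock now, implies reseed (not saving is an assumption).
  if (flags.live) return { anchor: now, reseed: true, save: false };

  const reseed = flags.reseed ?? false; // default: skip reseed

  // --anchor-time overrides any saved value and is persisted.
  if (flags.anchorTime !== undefined) {
    const anchor = new Date(flags.anchorTime);
    if (Number.isNaN(anchor.getTime())) {
      throw new Error(`Invalid --anchor-time: ${flags.anchorTime}`);
    }
    return { anchor, reseed, save: true };
  }

  // Reuse the saved anchor when one exists.
  if (savedAnchor !== undefined) {
    return { anchor: new Date(savedAnchor), reseed, save: false };
  }

  // First run: generate an anchor and save it for later runs.
  return { anchor: now, reseed, save: true };
}
```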
- Add scenarioIsSeeded() check (queries traces table for any row)
- run command now checks for existing data before running; auto-seeds if the scenario tables are empty or missing
- Update README to document auto-seed behavior
Lead with export HDX_DEV_SLOT in the Quick Start so eval commands connect to the correct ClickHouse instance regardless of which worktree they are run from. Replace --ch-url examples in the dual-slot seeding section with HDX_DEV_SLOT for consistency. Also clarify that .env.local must be at the monorepo root.
🔴 Tier 4: Critical
Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.
Why this tier:
Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
Stats
Greptile Summary
This PR introduces
Confidence Score: 5/5
This PR adds a self-contained new package with no changes to production application code; the only modified files outside the new package are AGENTS.md and knip.json.
All changes are additive: the new packages/hdx-eval package is fully isolated from the rest of the monorepo. Previously reported correctness issues (LIKE pattern escaping, token accumulation, .timing.json exclusion) have been addressed. The remaining findings are cosmetic: per-tool-call timestamps derived during post-processing rather than from stream events (so individual durationMs values are ~0ms), and an anonymous-label overflow past 26 MCPs. Neither affects scoring, grading logic, or any existing functionality.
No files require special attention; the two findings are in packages/hdx-eval/src/harness/runRun.ts and packages/hdx-eval/src/grading/blind.ts.
Important Files Changed
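One way to avoid the anonymous-label overflow past 26 MCPs noted above is spreadsheet-style labelling (A to Z, then AA, AB, and so on). The function below is an illustrative sketch; `anonymousLabel` is a hypothetical name, and how blind.ts actually assigns labels is not shown in this PR page.

```typescript
// Maps 0 -> "A", 25 -> "Z", 26 -> "AA", 27 -> "AB", ... (bijective base-26).
function anonymousLabel(index: number): string {
  if (!Number.isInteger(index) || index < 0) {
    throw new RangeError(`index must be a non-negative integer, got ${index}`);
  }
  let n = index;
  let label = "";
  do {
    label = String.fromCharCode(65 + (n % 26)) + label;
    n = Math.floor(n / 26) - 1;
  } while (n >= 0);
  return label;
}
```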
Sequence Diagram
sequenceDiagram
participant CLI as hdx-eval CLI
participant CH as ClickHouse
participant Spawn as claudeSpawn
participant Claude as Claude Code (agent)
participant Grader as gradeBatch
participant Judge as LLM Judge
CLI->>CH: seed scenario data (mulberry32 PRNG)
CLI->>Spawn: runCell(scenario, mcp, runIndex)
Spawn->>Spawn: write mcp-config.json + settings.json to tmpdir
Spawn->>Claude: spawn claude -p --mcp-config ... --stream-json
Claude-->>Spawn: stream-json events (tool_use / tool_result / result)
Spawn->>Spawn: rmSync(tempdir) cleanup
Spawn-->>CLI: RunRecord (events, finalAnswer, tokens)
CLI->>CLI: writeRun to runs/batch/scenario/mcp/idx.json
CLI->>CH: query system.query_log (instrumentBatch)
CH-->>CLI: QueryLogRow[]
CLI->>CLI: matchQueriesToCalls to timing.json sidecar
CLI->>Grader: gradeBatch(batchDir)
Grader->>Grader: runProgrammaticChecks(finalAnswer, rubric)
Grader->>Judge: judgeTrajectory (blind answer to Anthropic API)
Judge-->>Grader: JudgeResult (scores, tokens)
Grader->>Grader: writeGradeFile to grade.json
Grader-->>CLI: GradeBatchSummary
CLI->>CLI: buildAggregate to _summary.json
CLI->>CLI: viewer server serves runs/ to browser
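The seeding step in the diagram names mulberry32 as the PRNG. For readers unfamiliar with it, the widely circulated public-domain algorithm looks like the sketch below; how the package wires it into its generators is not shown here, so treat the usage as an assumption.

```typescript
// mulberry32: a small 32-bit seeded PRNG. The same seed always yields the
// same sequence, which is what makes synthetic telemetry reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    // Scale the unsigned 32-bit result into [0, 1).
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Two generators with the same seed stay in lockstep.
const randA = mulberry32(42);
const randB = mulberry32(42);
const sameFirstDraw = randA() === randB();
```

It is not cryptographically secure, which is fine for generating test data.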
Reviews (3): Last reviewed commit: "Exclude .timing.json sidecars from listR..."
- Escape underscores and percent signs in SQL LIKE patterns for query attribution to avoid false matches on table names
- Accumulate token counts from both judge attempts when retry succeeds, fixing understated cost reporting
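The LIKE escaping matters because `_` matches any single character and `%` matches any run of characters, so an unescaped table name such as `otel_logs` would also match `otelXlogs`. A minimal sketch, assuming backslash is the escape character (as in ClickHouse's LIKE) and using a hypothetical helper name:

```typescript
// Escapes LIKE metacharacters so the input matches literally.
// The backslash itself is escaped too, so existing backslashes stay literal.
function escapeLikePattern(literal: string): string {
  return literal.replace(/[\\%_]/g, (ch) => `\\${ch}`);
}

// Builds a "contains" pattern around an escaped literal.
function containsPattern(literal: string): string {
  return `%${escapeLikePattern(literal)}%`;
}
```

The resulting pattern should still be passed as a bound query parameter rather than concatenated into SQL text; this helper only handles LIKE semantics, not SQL string quoting.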
Matches the existing exclusion pattern in listRunsInBatch (store.ts). Without this, timing sidecars are picked up as run records and produce garbage grade files.
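The exclusion can be pictured as a filename filter applied when listing a run directory. This is a sketch only: `isRunRecordFile` and the suffix list are illustrative, based on the two sidecar suffixes named in this PR, and do not reproduce store.ts.

```typescript
// Sidecar suffixes named in the PR; these files sit next to run records
// but are not run records themselves.
const SIDECAR_SUFFIXES = [".grade.json", ".timing.json"] as const;

// True for "0.json", false for "0.grade.json" or "0.timing.json".
function isRunRecordFile(name: string): boolean {
  return (
    name.endsWith(".json") &&
    !SIDECAR_SUFFIXES.some((suffix) => name.endsWith(suffix))
  );
}

const exampleListing = ["0.json", "0.grade.json", "0.timing.json", "1.json"];
const runRecords = exampleListing.filter(isRunRecordFile);
```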
Summary
Adds packages/hdx-eval, an eval framework for benchmarking AI agents against observability MCP servers. The framework generates deterministic synthetic telemetry with planted anomalies, spawns Claude Code as an SRE agent, records full trajectories, and grades answers using programmatic checks + LLM-as-judge.
Key Features
What is included
- packages/hdx-eval/: full eval package (CLI, harness, generators, grading, reports, viewer)
- packages/hdx-eval/README.md: comprehensive docs covering setup, config, scenarios, CLI reference, and scoring
- .opencode/commands/eval-summary.md: eval analysis skill for reviewing results
- AGENTS.md: minor addition documenting common utility locations to check before writing new functions
Usage
See packages/hdx-eval/README.md for full documentation.