feat: AI eval framework for benchmarking MCP servers by brandon-pereira · Pull Request #2414 · hyperdxio/hyperdx

brandon-pereira · 2026-06-03T22:00:49Z

Summary

Adds packages/hdx-eval — an eval framework for benchmarking AI agents against observability MCP servers. The framework generates deterministic synthetic telemetry with planted anomalies, spawns Claude Code as an SRE agent, records full trajectories, and grades answers using programmatic checks + LLM-as-judge.

Key Features

MCP-agnostic — compare any combination of MCPs (HyperDX vs ClickHouse, feature-branch vs main, two HyperDX instances, or any N-way comparison)
5 scenarios covering error root-cause, latency spikes, noisy signal triage, segmented regression, and service health checks
Deterministic seeding — mulberry32 PRNG produces byte-identical data for fair comparison
Blinded LLM judging — brand terms and tool names are redacted so the judge cannot tell which MCP produced the answer
Baseline + challengers reporting model with delta columns
Web viewer for browsing comparison dashboards, per-scenario breakdowns, and individual run trajectories
Dual-slot setup for A/B comparison of two HyperDX branches running simultaneously
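
The deterministic seeding above relies on mulberry32, a small 32-bit PRNG. Below is a minimal sketch of the commonly published algorithm, not code copied from packages/hdx-eval. The `pick` helper and the seed value are illustrative assumptions.

```typescript
// Sketch of the widely published mulberry32 PRNG.
// Returns a function that yields floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed | 0; // keep the state as a 32-bit integer
  return () => {
    state = (state + 0x6d2b79f5) | 0;
    let t = Math.imul(state ^ (state >>> 15), state | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: choose an element using the seeded generator
// instead of Math.random(), so generated telemetry is reproducible.
function pick<T>(rand: () => number, items: readonly T[]): T {
  return items[Math.floor(rand() * items.length)];
}

const genA = mulberry32(42);
const genB = mulberry32(42);
const seqA = Array.from({ length: 5 }, () => genA());
const seqB = Array.from({ length: 5 }, () => genB());

// Two generators with the same seed produce the same sequence.
console.log(seqA.every((value, i) => value === seqB[i]));
console.log(pick(mulberry32(1), ["checkout", "cart", "payments"]));
```

Because every random choice flows from the seed, two MCPs under comparison can be pointed at identical datasets.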

What is included

packages/hdx-eval/ — full eval package (CLI, harness, generators, grading, reports, viewer)
packages/hdx-eval/README.md — comprehensive docs covering setup, config, scenarios, CLI reference, and scoring
.opencode/commands/eval-summary.md — eval analysis skill for reviewing results
AGENTS.md — minor addition documenting common utility locations to check before writing new functions

Usage

yarn workspace @hyperdx/hdx-eval dev setup-hyperdx
yarn workspace @hyperdx/hdx-eval dev seed error-root-cause --volume-factor 0.1
yarn workspace @hyperdx/hdx-eval dev run error-root-cause
yarn workspace @hyperdx/hdx-eval viewer

See packages/hdx-eval/README.md for full documentation.

- leverage kv rollup mvs
- allow claude access to read only in temp dir
- tweaks to sysprompt

The runs/ gitignore pattern was matching src/runs/ (source code) in addition to the intended /runs/ (data directory). Anchor the pattern so only the top-level runs/ directory is ignored, and commit the three missing source files: instrument.ts, path.ts, store.ts.

Remove files that are dev-specific or experimental scratch work:

- ablation/ directory (REPORT.md, manifest.tsv)
- scripts/ablation.sh, ablation-report.ts
- scripts/compare-prompt-variants.sh, fast-eval.sh
- MCP_IMPROVEMENTS.md

Also clean up README references and ablation .gitignore patterns.

- Extract spreadTimestamp() and normalizeSeverityText() shared helpers
- Collapse 14 identical phase functions into streamLogPhase() generic
- Compact ground-truth programmatic checks to tuple format [id, weight, pattern, neg?]
- Delete dead checkoutEventLog export, unexport internal logfmtBody/jsonEventBody
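
As a rough illustration of the `[id, weight, pattern, neg?]` tuple format, the sketch below evaluates a list of such checks against an answer. The `CheckTuple` and `runChecks` names, the example patterns, and the scoring rule (weighted fraction of passing checks) are assumptions, not the package's actual definitions.

```typescript
// Hypothetical shape of a compact programmatic check: [id, weight, pattern, neg?].
// When `neg` is true, the check passes only if the pattern does NOT match.
type CheckTuple = [id: string, weight: number, pattern: RegExp, neg?: boolean];

interface CheckResult {
  id: string;
  weight: number;
  passed: boolean;
}

// Assumed scoring rule: weighted fraction of passing checks, in [0, 1].
function runChecks(
  answer: string,
  checks: readonly CheckTuple[],
): { score: number; results: CheckResult[] } {
  const results = checks.map(([id, weight, pattern, neg = false]) => ({
    id,
    weight,
    // A match passes a positive check; a non-match passes a negated one.
    passed: pattern.test(answer) !== neg,
  }));
  const total = results.reduce((sum, r) => sum + r.weight, 0);
  const earned = results.reduce((sum, r) => sum + (r.passed ? r.weight : 0), 0);
  return { score: total === 0 ? 0 : earned / total, results };
}

// Illustrative checks for an error root-cause style answer.
const exampleChecks: CheckTuple[] = [
  ["names-service", 2, /checkout/i],
  ["names-error", 1, /timeout/i],
  ["no-wrong-blame", 1, /database outage/i, true],
];

console.log(runChecks("Checkout is failing with upstream timeouts.", exampleChecks).score);
```

Keeping each check on one line makes a scenario's rubric easy to scan, at the cost of positional rather than named fields.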

- Generalize from fixed HyperDX+ClickHouse pair to config-driven MCP registry
- Add dual-slot eval setup docs for A/B branch comparison
- Add baseline+challengers reporting model with delta columns
- Expand README with MCP config reference, field tables, and examples
- Improve viewer with comparison dashboard and drill-down
- Update blinding to handle arbitrary brand terms per MCP
- Add --baseline, --ch-url, --no-grade, --no-judge CLI flags
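
The baseline+challengers model with delta columns can be pictured with the sketch below. The `MeanScores` and `Row` types, the `withDeltas` function, and the MCP names are hypothetical; the real report layout may differ.

```typescript
// Hypothetical per-MCP mean scores for one scenario; null means no graded runs.
type MeanScores = Record<string, number | null>;

interface Row {
  mcp: string;
  mean: number | null;
  delta: number | null;
  isBaseline: boolean;
}

// Each row's delta is (mean - baseline mean). The baseline row therefore gets 0.
// If either side is missing, the delta is null rather than NaN.
function withDeltas(scores: MeanScores, baseline: string): Row[] {
  const base = scores[baseline] ?? null;
  return Object.entries(scores).map(([mcp, mean]) => ({
    mcp,
    mean,
    isBaseline: mcp === baseline,
    delta: mean === null || base === null ? null : mean - base,
  }));
}

const rows = withDeltas({ main: 0.75, "feature-branch": 0.5, experimental: null }, "main");
for (const row of rows) {
  console.log(row.mcp, row.mean, row.delta);
}
```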

changeset-bot · 2026-06-03T22:00:55Z

🦋 Changeset detected

Latest commit: ad7c588

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@hyperdx/hdx-eval	Minor

vercel · 2026-06-03T22:00:56Z

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments

Project	Deployment	Actions	Updated (UTC)
hyperdx-oss	Ignored	Preview	Jun 5, 2026 8:22pm
hyperdx-storybook	Skipped		Jun 5, 2026 8:22pm

github-actions · 2026-06-03T22:10:21Z

E2E Test Results

✅ All tests passed • 197 passed • 3 skipped • 1361s

Status	Count
✅ Passed	197
❌ Failed	0
⚠️ Flaky	5
⏭️ Skipped	3

Tests ran across 4 shards in parallel.

- Add packages/hdx-eval workspace to knip.json
- Remove unused @hyperdx/common-utils dependency
- Remove export from internal-only symbols (SOURCE_TRACES_TABLE, SOURCE_LOGS_TABLE, DENIED_BUILT_IN_TOOLS_BASE, CONFIG_FILENAME, loadGradedPairs, instrumentRun, HyperdxConnection, MeResponse)
- Remove dead code (getScenarioGroundTruth, ensureConfigDir)
- Add minor changeset for @hyperdx/hdx-eval

- Fix judge error silently penalizing combined score by 60% (grade.ts)
- Fix path traversal in viewer /api/batches/:batch route (server.js)
- Fix off-by-one in background operation selection (latency-spike)
- Fix worker pool crash-on-error losing all in-flight results (cli.ts)
- Fix claudeSpawn timer leaks on spawn error and SIGTERM escalation
- Fix listRunsInBatch including .grade.json/.timing.json sidecars
- Remove unused innerHTML attribute from viewer el() helper (XSS vector)
- Bind viewer server to 127.0.0.1 instead of all interfaces
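
The judge-error fix at the top of this list is about a failed judge call being counted as a zero. The sketch below shows one way to avoid that. It assumes the combined score is a weighted blend in which the judge carries 60% (inferred from the commit message); the fallback to the programmatic score alone is also an assumption, and all names are illustrative.

```typescript
// Assumed weights: the commit message implies the judge contributes 60%.
const JUDGE_WEIGHT = 0.6;
const PROGRAMMATIC_WEIGHT = 1 - JUDGE_WEIGHT;

interface CombinedScore {
  score: number;
  judgeAvailable: boolean;
}

// A judge failure is represented as null, never as 0, so an API error
// cannot silently drag the combined score down.
function combineScores(programmatic: number, judge: number | null): CombinedScore {
  if (judge === null) {
    // Assumed fallback: report the programmatic score alone and flag the gap.
    return { score: programmatic, judgeAvailable: false };
  }
  return {
    score: PROGRAMMATIC_WEIGHT * programmatic + JUDGE_WEIGHT * judge,
    judgeAvailable: true,
  };
}

console.log(combineScores(0.9, null)); // judge failed: score stays at 0.9
console.log(combineScores(0.9, 0.5)); // judge succeeded: weighted blend
```

Surfacing `judgeAvailable` lets a report distinguish "judge scored it low" from "judge never ran".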

…idation, temp cleanup, compression, type safety

- Fix normalizeSeverityText case bug: 'FATAL' now correctly returns 'ERROR'
- Add AbortSignal.timeout to all HyperDX API and health check fetch calls
- Add identifier validation in scenarioSlug to reject unsafe characters
- Clean up temp directories after subprocess exits (leaked API keys)
- Enable ClickHouse request compression for batch inserts
- Replace per-row buildResourceAttrs with pre-built pool in noisy-signals
- Narrow groundTruth type from unknown to Record<string, unknown>
- Update AGENTS.md to list hdx-eval as sixth package
- Remove dead pickSeverity import and ScenarioOutput type alias
- Replace inline require('fs') with top-level import in cli.ts

- Add anchorTime field to EvalConfig, auto-generated and saved on first run so subsequent runs reuse the same anchor automatically
- Default to skip reseed (old --no-reseed behavior); add --reseed to opt in
- Add --live flag to opt out of saved anchor (wall-clock now, implies --reseed)
- --anchor-time <iso> now overrides and saves to eval.config.json
- Update README with new CLI flags and Anchor Time section
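
The flag precedence described in this commit can be summarized as a small decision function. This is a sketch under stated assumptions: the `AnchorFlags` and `AnchorDecision` shapes and the `resolveAnchor` name are hypothetical, and whether --live skips persisting the anchor is a guess rather than documented behavior.

```typescript
interface AnchorFlags {
  live?: boolean; // --live
  reseed?: boolean; // --reseed
  anchorTime?: string; // --anchor-time <iso>
}

interface AnchorDecision {
  anchorTime: string; // ISO timestamp used to position generated data
  reseed: boolean; // whether to regenerate the scenario data
  save: boolean; // whether to write the anchor back to eval.config.json
}

function resolveAnchor(
  flags: AnchorFlags,
  savedAnchor: string | undefined,
  now: Date,
): AnchorDecision {
  if (flags.live) {
    // --live: ignore the saved anchor, use wall-clock now, and reseed.
    // Not persisting the value is an assumption.
    return { anchorTime: now.toISOString(), reseed: true, save: false };
  }
  if (flags.anchorTime !== undefined) {
    const parsed = new Date(flags.anchorTime);
    if (Number.isNaN(parsed.getTime())) {
      throw new Error(`Invalid --anchor-time: ${flags.anchorTime}`);
    }
    // An explicit anchor overrides the saved one and is persisted.
    return { anchorTime: parsed.toISOString(), reseed: flags.reseed ?? false, save: true };
  }
  if (savedAnchor !== undefined) {
    // Default path: reuse the saved anchor and skip reseeding unless asked.
    return { anchorTime: savedAnchor, reseed: flags.reseed ?? false, save: false };
  }
  // First run: generate an anchor and persist it for later runs.
  return { anchorTime: now.toISOString(), reseed: flags.reseed ?? false, save: true };
}

const fixedNow = new Date("2026-06-01T00:00:00.000Z");
console.log(resolveAnchor({}, undefined, fixedNow));
console.log(resolveAnchor({ live: true }, "2026-05-01T00:00:00.000Z", fixedNow));
```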

- Add scenarioIsSeeded() check (queries traces table for any row)
- run command now checks for existing data before running; auto-seeds if the scenario tables are empty or missing
- Update README to document auto-seed behavior

Lead with export HDX_DEV_SLOT in the Quick Start so eval commands connect to the correct ClickHouse instance regardless of which worktree they are run from. Replace --ch-url examples in the dual-slot seeding section with HDX_DEV_SLOT for consistency. Also clarify that .env.local must be at the monorepo root.

github-actions · 2026-06-05T15:47:36Z

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

Large diff: 11998 production lines changed (threshold: 1000)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats

Production files changed: 59
Production lines changed: 11998 (+ 2479 in test files, excluded from tier calculation)
Branch: brandon/ai-evals
Author: brandon-pereira

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

greptile-apps · 2026-06-05T15:52:02Z

Greptile Summary

This PR introduces packages/hdx-eval, a standalone benchmark framework for evaluating AI agents against observability MCP servers. It generates deterministic synthetic telemetry via a seeded PRNG, spawns Claude Code as an SRE agent in an isolated temp directory, records full trajectories, and grades answers with both programmatic checks and an LLM-as-judge.

New package (packages/hdx-eval) covering five evaluation scenarios, a CLI, ClickHouse seed/schema utilities, a grading pipeline with retry and token accounting, and a local web viewer for browsing results.
Blinded LLM judging redacts MCP tool prefixes and brand terms before showing answers to the judge, and token costs from both the first attempt and any retry are now accumulated correctly.
Instrumentation sidecar derives per-tool-call wall-clock windows from raw stream events and correlates them with ClickHouse system.query_log entries; both .grade.json and .timing.json sidecars are excluded when listing run files for grading.
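
To make the blinding step concrete, here is a minimal sketch of redacting tool prefixes and brand terms before text reaches the judge. It assumes tool names follow an `mcp__<server>__<tool>` pattern; the `blindAnswer` function, the `tool__` replacement, and the `[System]` label are illustrative and not taken from blind.ts.

```typescript
// Escape a literal string for safe use inside a RegExp.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function blindAnswer(text: string, brandTerms: readonly string[], label: string): string {
  // Strip tool prefixes first: the server name inside a prefix is often a
  // brand term, and replacing it in place would break the prefix pattern.
  let out = text.replace(/mcp__[A-Za-z0-9_-]+?__/g, "tool__");

  // Replace longer terms first so "ClickHouse Cloud" is not left as "[System] Cloud".
  const terms = [...brandTerms].sort((x, y) => y.length - x.length);
  for (const term of terms) {
    if (term.length === 0) continue;
    // Case-insensitive match; a replacer function avoids `$` handling in the label.
    out = out.replace(new RegExp(escapeRegExp(term), "gi"), () => label);
  }
  return out;
}

console.log(
  blindAnswer(
    "Using mcp__hyperdx__search_logs, HyperDX shows errors stored in ClickHouse Cloud.",
    ["ClickHouse", "HyperDX", "ClickHouse Cloud"],
    "[System]",
  ),
);
```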

Confidence Score: 5/5

This PR adds a self-contained new package with no changes to production application code; the only modified files outside the new package are AGENTS.md and knip.json.

All changes are additive — the new packages/hdx-eval package is fully isolated from the rest of the monorepo. Previously reported correctness issues (LIKE pattern escaping, token accumulation, .timing.json exclusion) have been addressed. The remaining findings are cosmetic: per-tool-call timestamps derived during post-processing rather than from stream events (so individual durationMs values are ~0ms), and an anonymous-label overflow past 26 MCPs. Neither affects scoring, grading logic, or any existing functionality.

No files require special attention; the two findings are in packages/hdx-eval/src/harness/runRun.ts and packages/hdx-eval/src/grading/blind.ts.
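
The anonymous-label overflow mentioned above is what happens when a label is built as a single character offset from "A": index 26 lands on "[". A hedged sketch of one fix, using spreadsheet-style labels (A..Z, AA, AB, ...), follows. The `anonymousLabel` name is hypothetical and this is not the code in blind.ts.

```typescript
// Naive approach: one character offset from "A". Breaks past index 25.
function naiveLabel(index: number): string {
  return String.fromCharCode(65 + index);
}

// Bijective base-26 labels: 0 -> A, 25 -> Z, 26 -> AA, 27 -> AB, ...
function anonymousLabel(index: number): string {
  if (!Number.isInteger(index) || index < 0) {
    throw new RangeError(`index must be a non-negative integer, got ${index}`);
  }
  let n = index + 1;
  let label = "";
  while (n > 0) {
    const rem = (n - 1) % 26;
    label = String.fromCharCode(65 + rem) + label;
    n = Math.floor((n - 1) / 26);
  }
  return label;
}

console.log(naiveLabel(26)); // "[" (the overflow)
console.log(anonymousLabel(26)); // "AA"
```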

Important Files Changed

Filename	Overview
packages/hdx-eval/src/harness/claudeSpawn.ts	Spawns Claude Code in an isolated temp dir and streams JSONL output; cleanup of credential files is not in a try/finally so they survive if spawn() itself errors (pre-existing comment, not addressed).
packages/hdx-eval/src/harness/runRun.ts	Assembles RunRecord from parsed events; per-tool-call startedAt/endedAt/durationMs are generated with wall-clock new Date() during post-processing rather than from actual stream timestamps, so individual durationMs values are always ~0ms.
packages/hdx-eval/src/grading/judge.ts	LLM judge with retry; token accumulation from both attempts is correctly applied before returning, fixing the previously reported issue.
packages/hdx-eval/src/grading/grade.ts	Batch grading pipeline; correctly excludes .timing.json sidecars from run file enumeration (previously reported issue now addressed).
packages/hdx-eval/src/runs/instrument.ts	Correlates ClickHouse query_log entries with per-tool-call wall-clock windows; LIKE patterns now correctly escape underscores with backslash and ESCAPE clause (previously reported issue now addressed).
packages/hdx-eval/src/grading/blind.ts	Anonymizes MCP brand terms and tool prefixes before judge evaluation; anonymous labels wrap past Z (26+ MCPs) producing non-letter characters.
packages/hdx-eval/src/reports/aggregate.ts	Aggregates run+grade pairs into per-scenario cell summaries with delta computation; null-safe baseline handling is correct.
packages/hdx-eval/scripts/viewer/server.js	Minimal local HTTP server for the eval viewer; path traversal is prevented via safeJoin and the server binds only to 127.0.0.1.
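
The runRun.ts finding in the table is that durations computed during post-processing collapse to roughly 0ms. One possible remedy, sketched below, is to stamp each stream event when it is read and pair each tool_use with its tool_result by id. The `TimedEvent` shape, the field names, and `pairToolCalls` are assumptions for illustration; the real event records may look different.

```typescript
// Hypothetical event shape: each event is stamped when read from the subprocess.
interface TimedEvent {
  kind: "tool_use" | "tool_result";
  toolUseId: string;
  receivedAtMs: number;
}

interface ToolCallTiming {
  toolUseId: string;
  startedAtMs: number;
  endedAtMs: number;
  durationMs: number;
}

// Pair each tool_use with the tool_result carrying the same id.
function pairToolCalls(events: readonly TimedEvent[]): ToolCallTiming[] {
  const starts = new Map<string, number>();
  const timings: ToolCallTiming[] = [];
  for (const ev of events) {
    if (ev.kind === "tool_use") {
      starts.set(ev.toolUseId, ev.receivedAtMs);
      continue;
    }
    const startedAtMs = starts.get(ev.toolUseId);
    if (startedAtMs === undefined) continue; // result without a matching call
    timings.push({
      toolUseId: ev.toolUseId,
      startedAtMs,
      endedAtMs: ev.receivedAtMs,
      durationMs: ev.receivedAtMs - startedAtMs,
    });
    starts.delete(ev.toolUseId);
  }
  return timings;
}

const sampleEvents: TimedEvent[] = [
  { kind: "tool_use", toolUseId: "a", receivedAtMs: 1000 },
  { kind: "tool_use", toolUseId: "b", receivedAtMs: 1010 },
  { kind: "tool_result", toolUseId: "a", receivedAtMs: 1250 },
  { kind: "tool_result", toolUseId: "b", receivedAtMs: 1900 },
];
console.log(pairToolCalls(sampleEvents).map((t) => `${t.toolUseId}: ${t.durationMs}ms`));
```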

Sequence Diagram

sequenceDiagram
    participant CLI as hdx-eval CLI
    participant CH as ClickHouse
    participant Spawn as claudeSpawn
    participant Claude as Claude Code (agent)
    participant Grader as gradeBatch
    participant Judge as LLM Judge

    CLI->>CH: seed scenario data (mulberry32 PRNG)
    CLI->>Spawn: runCell(scenario, mcp, runIndex)
    Spawn->>Spawn: write mcp-config.json + settings.json to tmpdir
    Spawn->>Claude: spawn claude -p --mcp-config ... --stream-json
    Claude-->>Spawn: stream-json events (tool_use / tool_result / result)
    Spawn->>Spawn: rmSync(tempdir) cleanup
    Spawn-->>CLI: RunRecord (events, finalAnswer, tokens)
    CLI->>CLI: writeRun to runs/batch/scenario/mcp/idx.json
    CLI->>CH: query system.query_log (instrumentBatch)
    CH-->>CLI: QueryLogRow[]
    CLI->>CLI: matchQueriesToCalls to timing.json sidecar
    CLI->>Grader: gradeBatch(batchDir)
    Grader->>Grader: runProgrammaticChecks(finalAnswer, rubric)
    Grader->>Judge: judgeTrajectory (blind answer to Anthropic API)
    Judge-->>Grader: JudgeResult (scores, tokens)
    Grader->>Grader: writeGradeFile to grade.json
    Grader-->>CLI: GradeBatchSummary
    CLI->>CLI: buildAggregate to _summary.json
    CLI->>CLI: viewer server serves runs/ to browser

Reviews (3): Last reviewed commit: "Exclude .timing.json sidecars from listR..."

- Escape underscores and percent signs in SQL LIKE patterns for query attribution to avoid false matches on table names
- Accumulate token counts from both judge attempts when retry succeeds, fixing understated cost reporting
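
To show why the LIKE escaping matters: in a LIKE pattern, `_` matches any single character and `%` matches any run of characters, so an unescaped table name can match unrelated queries. The sketch below is illustrative. `escapeLikePattern` and the simplified `likeMatches` emulator are hypothetical helpers, not the code in instrument.ts, and the emulator only models `%`, `_`, and backslash escapes.

```typescript
// Backslash-escape the LIKE metacharacters (and backslash itself).
function escapeLikePattern(value: string): string {
  return value.replace(/[\\%_]/g, "\\$&");
}

function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Simplified LIKE emulation, used here only to demonstrate the difference.
function likeMatches(value: string, pattern: string): boolean {
  let re = "";
  for (let i = 0; i < pattern.length; i++) {
    const ch = pattern[i];
    if (ch === "\\" && i + 1 < pattern.length) {
      re += escapeRegExp(pattern[++i]); // escaped character is literal
    } else if (ch === "%") {
      re += "[\\s\\S]*"; // any run of characters
    } else if (ch === "_") {
      re += "[\\s\\S]"; // exactly one character
    } else {
      re += escapeRegExp(ch);
    }
  }
  return new RegExp(`^${re}$`).test(value);
}

const table = "otel_logs";
const unrelatedQuery = "SELECT count() FROM otelXlogs";

console.log(likeMatches(unrelatedQuery, `%${table}%`)); // unescaped: false match
console.log(likeMatches(unrelatedQuery, `%${escapeLikePattern(table)}%`)); // escaped: no match
```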

Matches the existing exclusion pattern in listRunsInBatch (store.ts). Without this, timing sidecars are picked up as run records and produce garbage grade files.
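
A minimal sketch of the exclusion described above: treat a file as a run record only if it is a .json file that is not a .grade.json or .timing.json sidecar. The `isRunRecordFile` name and the example file names are illustrative; the actual listRunFiles implementation is not shown in this PR page.

```typescript
// Sidecar files live next to run records and must not be graded as runs.
const SIDECAR_SUFFIXES = [".grade.json", ".timing.json"] as const;

function isRunRecordFile(fileName: string): boolean {
  if (!fileName.endsWith(".json")) return false;
  return !SIDECAR_SUFFIXES.some((suffix) => fileName.endsWith(suffix));
}

const directoryListing = ["0.json", "0.grade.json", "0.timing.json", "1.json", "notes.txt"];
console.log(directoryListing.filter(isRunRecordFile)); // ["0.json", "1.json"]
```

Sharing one suffix list between listRunsInBatch and listRunFiles would keep the two exclusion rules from drifting apart again.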

brandon-pereira added 10 commits June 1, 2026 14:15

initial commit

7bd6d60

stream seeding results

d93aa98

general setup fixes

37f9abe

general eval improvements

e07e4c7

improvements to framework

2fdc6d7

add eval skill

019db69

brandon-pereira added 2 commits June 3, 2026 16:26

vercel Bot temporarily deployed to Preview – hyperdx-storybook June 4, 2026 16:14 Inactive

brandon-pereira added 2 commits June 4, 2026 16:19

vercel Bot temporarily deployed to Preview – hyperdx-storybook June 4, 2026 22:44 Inactive

vercel Bot temporarily deployed to Preview – hyperdx-storybook June 5, 2026 14:52 Inactive

chore: fix knip — unexport file-internal helpers in templates.ts

5337706

vercel Bot temporarily deployed to Preview – hyperdx-storybook June 5, 2026 15:14 Inactive

vercel Bot temporarily deployed to Preview – hyperdx-storybook June 5, 2026 15:45 Inactive

brandon-pereira marked this pull request as ready for review June 5, 2026 15:46

Merge branch 'main' into brandon/ai-evals

30ecd94

github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label Jun 5, 2026

vercel Bot deployed to Preview – hyperdx-storybook June 5, 2026 15:49 View deployment

vercel Bot deployed to Preview – hyperdx-oss June 5, 2026 15:51 View deployment

greptile-apps Bot reviewed Jun 5, 2026

Comment thread packages/hdx-eval/src/harness/claudeSpawn.ts

Comment thread packages/hdx-eval/src/runs/instrument.ts

Comment thread packages/hdx-eval/src/harness/runRun.ts

Comment thread packages/hdx-eval/src/grading/judge.ts

Merge branch 'main' into brandon/ai-evals

9c8e626

vercel Bot deployed to Preview – hyperdx-storybook June 5, 2026 20:16 View deployment

vercel Bot deployed to Preview – hyperdx-oss June 5, 2026 20:17 View deployment

greptile-apps Bot reviewed Jun 5, 2026

Comment thread packages/hdx-eval/src/grading/grade.ts

Address PR review feedback (#2414)

adb76ea

vercel Bot temporarily deployed to Preview – hyperdx-storybook June 5, 2026 20:19 Inactive

Exclude .timing.json sidecars from listRunFiles (#2414)

ad7c588

vercel Bot temporarily deployed to Preview – hyperdx-storybook June 5, 2026 20:22 Inactive

brandon-pereira requested review from a team and wrn14897 and removed request for a team June 5, 2026 20:38

brandon-pereira wants to merge 21 commits into main from brandon/ai-evals
