feat(hdx-eval): add dashboard-build eval scenario #2571
brandon-pereira wants to merge 18 commits into
🦋 Changeset detected
Latest commit: 7ba3d30
The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Greptile Summary
Adds a
Confidence Score: 5/5
Safe to merge: this is new eval tooling with no changes to production paths; all issues are in heuristic evidence signals and prompt wording.
All findings are in the eval harness itself: heuristic judge signals and a mild prompt contradiction in the system prompt. None affect production code, correctness of seeded data, or the grading rubric. The scenario generator, API client additions, and test suite are solid.
Important Files Changed
Sequence Diagram
%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
participant CLI as cli.ts (grade)
participant GT as dashboardInspection.ts
participant API as HyperdxApiClient
participant MCP as /mcp endpoint
participant Judge as LLM Judge
CLI->>GT: inspectDashboards(toolCalls, apiUrl, ...)
GT->>GT: extractDashboardIds(toolCalls)
GT->>GT: extractIntendedTileConfigs [save_dashboard only]
GT->>API: login(email, password)
loop for each dashboardId
GT->>API: getDashboardV2(id, accessKey)
API-->>GT: "{ tiles[], containers[] }"
loop for each tile
GT->>MCP: JSON-RPC tools/call clickstack_query_tile
MCP-->>GT: JSON-RPC or SSE response
GT->>GT: extractMcpContent → TileEvidence
end
end
GT->>GT: analyzeDistractorAwareness(result)
GT->>API: deleteDashboard(id) [cleanup]
GT-->>CLI: DashboardInspectionResult
CLI->>CLI: formatDashboardEvidence(result)
CLI->>Judge: evidence + rubric + agent answer
Judge-->>CLI: "{ scores: { criterion: { score, rationale } } }"
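The tile-query step in the diagram sends a JSON-RPC tools/call request to the /mcp endpoint and has to cope with either a plain JSON-RPC body or an SSE-framed one. A minimal sketch of that handling follows, assuming the standard MCP tools/call envelope. The helper names (buildToolCall, parseMcpBody, extractText) and the argument keys in the usage note are hypothetical and are not taken from the PR's extractMcpContent.

```typescript
// Hypothetical sketch: not the PR's actual extractMcpContent implementation.
type JsonRpcRequest = {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
};

type McpContentPart = { type: string; text?: string };

type JsonRpcResponse = {
  jsonrpc: "2.0";
  id: number;
  result?: { content?: McpContentPart[]; isError?: boolean };
  error?: { code: number; message: string };
};

// Build the JSON-RPC envelope for an MCP tool invocation.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>,
): JsonRpcRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

// Accept either a plain JSON body or an SSE stream whose `data:` lines carry JSON.
function parseMcpBody(body: string): JsonRpcResponse {
  const trimmed = body.trim();
  if (trimmed.startsWith("{")) {
    return JSON.parse(trimmed) as JsonRpcResponse;
  }
  const dataLines = trimmed
    .split(/\r?\n/)
    .filter((line) => line.startsWith("data:"))
    .map((line) => line.slice("data:".length).trim());
  // Assumption: the last data event carries the final result.
  const last = dataLines.pop();
  if (last === undefined) {
    throw new Error("No JSON-RPC payload found in response body");
  }
  return JSON.parse(last) as JsonRpcResponse;
}

// Join the text parts of a tool result; protocol errors become exceptions.
function extractText(response: JsonRpcResponse): string {
  if (response.error) {
    throw new Error(`MCP error ${response.error.code}: ${response.error.message}`);
  }
  return (response.result?.content ?? [])
    .filter((part) => part.type === "text" && typeof part.text === "string")
    .map((part) => part.text as string)
    .join("\n");
}
```

A caller would serialize buildToolCall(1, "clickstack_query_tile", { ... }) as the POST body and pass the response text through parseMcpBody and extractText before turning it into tile evidence.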
Reviews (4): Last reviewed commit: "chore: add changeset"
E2E Test Results
✅ All tests passed • 223 passed • 3 skipped • 1424s
Tests ran across 4 shards in parallel.
…mework
Add a new eval scenario that tests programmatic dashboard creation via MCP tools. The agent must build a 12-tile dashboard with containers, tabs, dashboard-level filters, onClick drill-downs, asRatio tiles, numberFormat, raw SQL, heatmap, search, and multi-source (trace + log) tiles.
Scoring (~75% on current branch):
- Programmatic checks (22) on agent's text answer
- LLM judge (6 criteria) evaluates actual dashboard artifact
- Post-run inspection fetches tiles via API and queries each for data
- Tool error penalty for failed MCP calls
- Automatic dashboard cleanup after grading
Framework refactor (scenario hooks replace hardcoded ScenarioKind):
- Scenario.buildSystemPrompt: custom system prompt builder
- Scenario.allowedToolPatterns: selectively unblock denied tools
- Scenario.judgeSystemPreamble: custom LLM judge instructions
- Scenario.postRunInspection: inspect artifacts, collect evidence, cleanup
Adding a new scenario kind (alert-build, saved-search-build) now requires only the scenario files + one import line, with zero framework file changes.
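To make the hook-based design in this commit concrete, here is a rough sketch of how optional scenario hooks could replace a hardcoded scenario kind. Only the four hook names come from the commit message. The type shapes, the default prompt, the tool names, and the helpers resolveSystemPrompt and isToolAllowed are assumptions made for illustration.

```typescript
// Hypothetical shapes: only the four hook names appear in the commit message.
type SystemPromptContext = { scenarioName: string };

interface Scenario {
  name: string;
  buildSystemPrompt?: (ctx: SystemPromptContext) => string;
  allowedToolPatterns?: RegExp[];
  judgeSystemPreamble?: string;
  // Assumed to return evidence text for the judge.
  postRunInspection?: (toolCalls: string[]) => string;
}

const DEFAULT_PROMPT = "You are an observability investigator.";

// Fall back to the default prompt when a scenario does not provide the hook.
function resolveSystemPrompt(scenario: Scenario): string {
  return scenario.buildSystemPrompt?.({ scenarioName: scenario.name }) ?? DEFAULT_PROMPT;
}

// A denied tool is unblocked only when the scenario opts in with a matching pattern.
function isToolAllowed(scenario: Scenario, tool: string, denied: string[]): boolean {
  if (!denied.includes(tool)) return true;
  return (scenario.allowedToolPatterns ?? []).some((pattern) => pattern.test(tool));
}

const dashboardBuild: Scenario = {
  name: "dashboard-build",
  buildSystemPrompt: (ctx) => `Build the dashboards requested for ${ctx.scenarioName}.`,
  allowedToolPatterns: [/save_dashboard$/],
};
```

With this layout the framework code never branches on the scenario name; a new scenario only supplies the hooks it needs.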
…or data, impossible requests
Dashboard eval improvements:
- Vague user prompt: describes desired outcomes, not implementation details (no more 'configType sql', 'asRatio', 'if() expression' hints)
- Impossible requests: asks for CPU/memory metrics that don't exist; agent should report unavailability, not create broken tiles
- Distractor services: 4 noisy internal services (health-checker, cron-scheduler, internal-metrics, debug-proxy) that clutter the data. debug-proxy has misleading 15% error rate (it's debug traffic). Agent should focus on user-facing services.
- Minimal system prompt: 6 lines, no workflow coaching; agent learns everything from MCP tool schemas
- Fixed 7 programmatic regex bugs (parenthetical labels like '(line)')
- V2 API for dashboard inspection (proper tile names + configs)
- Intent extraction from save_dashboard tool calls for judge evidence
- Cross-dashboard onClick validation in inspection hook
- Judge criteria includes data_awareness (distractor handling), impossible request detection, and tool_efficiency
… all services
When the saved anchor time is >12 hours old and the user didn't explicitly set --anchor-time, refresh it to Date.now() and force a reseed. This ensures describe_source's 24-hour lookback window can see the eval data, including distractor services in dashboard scenarios. Without this, distractor services (health-checker, cron-scheduler, etc.) were invisible to the agent because describe_source's value sampling queried a time range that didn't contain the stale anchored data.
…tale anchor
When the config anchor time is stale (>12h old), check if the actual data in ClickHouse is still fresh before triggering a re-seed. If the data's max timestamp is within 12h, just update the config anchor to match and skip the re-seed. This avoids unnecessary 2-minute re-seeds when the user copies a stale backup config (eval.config.branch.json) before each run but the ClickHouse data is already fresh from a recent run.
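The anchor handling described in the two commits above boils down to a small decision table. The sketch below is one possible reading of it, not the code in the PR: the function name resolveAnchor, its parameters, and the choice to adopt the data's max timestamp as the new anchor are assumptions.

```typescript
// Hypothetical sketch of the stale-anchor decision; names are illustrative.
const STALE_MS = 12 * 60 * 60 * 1000; // 12 hours

type AnchorDecision = { anchorMs: number; reseed: boolean };

function resolveAnchor(
  savedAnchorMs: number,
  explicitAnchor: boolean, // true when the user passed --anchor-time
  dataMaxTimestampMs: number | null, // newest row in ClickHouse, if any
  nowMs: number,
): AnchorDecision {
  // An explicitly provided anchor is always respected.
  if (explicitAnchor) return { anchorMs: savedAnchorMs, reseed: false };
  // A fresh saved anchor needs no change.
  if (nowMs - savedAnchorMs <= STALE_MS) return { anchorMs: savedAnchorMs, reseed: false };
  // Stale config but fresh data: align the anchor with the data and skip the re-seed.
  if (dataMaxTimestampMs !== null && nowMs - dataMaxTimestampMs <= STALE_MS) {
    return { anchorMs: dataMaxTimestampMs, reseed: false };
  }
  // Stale config and stale or missing data: move the anchor to now and re-seed.
  return { anchorMs: nowMs, reseed: true };
}
```

Keeping the decision in a pure function like this makes the three outcomes (keep, realign, reseed) easy to unit-test without a running ClickHouse.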
…ocs, add variant to hooks
- Restore removed comments in grade.ts (tool-error penalty math, needsJudge decision, resolveBatchDir path resolution)
- Restore removed comments in systemPrompt.ts (schema reference, anchor time explanation, hypothesis playbook description)
- Fix cleanupIds JSDoc: clarify cleanup is the hook's responsibility via PostRunInspectionContext.cleanup, not a framework step
- Add variant to SystemPromptContext so custom hooks can adapt to hypothesis-mode runs
- Add anchorTimeIso to grade command's inspectionConfig (was missing, hooks received undefined on standalone re-grades)
Co-authored-by: Brandon Pereira <brandon-pereira@users.noreply.github.com>
Co-authored-by: Brandon Pereira <brandon-pereira@users.noreply.github.com>
Co-authored-by: Brandon Pereira <brandon-pereira@users.noreply.github.com>
…(6-10)" This reverts commit 6bc2b49.
This reverts commit c599487.
…-4661)" This reverts commit 1560a9c.
…ading
Revert the tile-height guidance changes from PR 2554 (schemas.ts, content.ts, prompts.test.ts); API changes should not live in the eval PR. Instead, add tile sizing evaluation to the dashboard-build scenario:
- Add w/h layout dimensions to TileEvidence type and evidence formatting so the judge can see actual tile proportions
- Update structure_and_design judge criterion (weight 2→3) to penalize lazy default 12x4 layouts; number tiles should be compact, tables taller, etc.
- The eval now measures whether the agent sizes tiles appropriately, giving signal for PRs like 2554 to improve against
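As an illustration of how w/h dimensions could be surfaced to the judge, the following sketch formats a layout summary and marks tiles left at the 12x4 default. The TileLayout type and formatLayoutEvidence function are hypothetical; the PR's actual TileEvidence fields and evidence format may differ.

```typescript
// Hypothetical sketch: the real TileEvidence type and formatter may differ.
type TileLayout = { name: string; displayType: string; w: number; h: number };

const DEFAULT_W = 12;
const DEFAULT_H = 4;

function isDefaultSize(tile: TileLayout): boolean {
  return tile.w === DEFAULT_W && tile.h === DEFAULT_H;
}

// Produce one line per tile plus a summary line the judge can weigh.
function formatLayoutEvidence(tiles: TileLayout[]): string {
  const lines = tiles.map((tile) => {
    const marker = isDefaultSize(tile) ? " (default size)" : "";
    return `- ${tile.name} [${tile.displayType}] ${tile.w}x${tile.h}${marker}`;
  });
  const defaults = tiles.filter(isDefaultSize).length;
  lines.push(`${defaults}/${tiles.length} tiles use the default ${DEFAULT_W}x${DEFAULT_H} layout`);
  return lines.join("\n");
}
```

The summary line gives the judge a quick signal, while the per-tile lines let it check whether, say, a number tile is compact and a table is tall.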
Eval prompt improvements:
- Add DATA REVIEW instruction to system prompt — nudges the agent to
inspect data before building (count by ServiceName, check
lowCardinalityValues for mixed casing/environments)
- Add "note data quality caveats" to agent prompt — agent now flags
misleading signals, internal services, severity inconsistencies
- Add conciseness instruction — cuts output tokens from 22K to 17K
and shaves ~70s per run
Programmatic rubric fixes:
- Widen has_number_tile regex to match "Number (" and "displayType number"
- Widen two_dashboards regex to match "Service Health Overview...Service Detail"
- Widen has_dashboard_filter regex to match "Service filter/dropdown"
Supporting fixes (pre-existing on branch):
- Fix FATAL severity number to OTel-correct 21 (distinct from ERROR 17)
- Disambiguate tile configs across dashboards in inspection
- Improve API client error handling with status codes
- Add dashboard-build to README scenario table
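The FATAL/ERROR fix above relies on the OpenTelemetry severity-number ranges, where each named level spans four consecutive numbers (ERROR is 17 to 20, FATAL is 21 to 24). A small sketch of that mapping follows; the function name severityTextFromNumber is hypothetical and not part of the PR.

```typescript
// OTel log severity numbers: 1-4 TRACE, 5-8 DEBUG, 9-12 INFO,
// 13-16 WARN, 17-20 ERROR, 21-24 FATAL. Anything else is unspecified.
const SEVERITY_NAMES = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"];

// Hypothetical helper name, for illustration only.
function severityTextFromNumber(severityNumber: number): string {
  if (!Number.isInteger(severityNumber) || severityNumber < 1 || severityNumber > 24) {
    return "UNSPECIFIED";
  }
  // Each level covers a block of four numbers.
  return SEVERITY_NAMES[Math.floor((severityNumber - 1) / 4)] ?? "UNSPECIFIED";
}
```

Seeding FATAL rows with 21 rather than 17 keeps them distinguishable from ERROR rows when a tile groups or filters by severity number.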
- Fix scopesToUserFacing false-negative: drop distractor-name guard that rejected dashboards mentioning distractors in exclusion context (e.g. ServiceName != 'debug-proxy'). filtersOutDistractors already captures that pattern.
- Fix systemPrompt test: replace brittle character count assertion with relative comparison against the investigation prompt length.
- Remove dead indexed keys from extractIntendedTileConfigs; the downstream lookup only uses plain tile names.
- Un-export TileEvidence and ContainerEvidence (only used internally, flagged by knip).
Issue 4 (hardcoded clickstack_query_tile) is not a bug: queryTileWithEvidence calls the MCP server directly via JSON-RPC POST, where the tool name is always clickstack_* (server-side registration). The hyperdx_* prefix only appears in claude CLI's mcp__<server>__* client-side wrapping.
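To show why a distractor name inside an exclusion filter should count in the dashboard's favor, here is a rough sketch of an exclusion heuristic. The function name filtersOutDistractors and the four service names appear in this PR; the regex-based implementation and the mentionsDistractor helper are assumptions and may not match the real code.

```typescript
// Distractor service names as listed in the PR's commit messages.
const DISTRACTORS = ["health-checker", "cron-scheduler", "internal-metrics", "debug-proxy"];

function escapeRegExp(value: string): string {
  return value.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical implementation: true when a filter expression excludes at least
// one distractor, e.g. ServiceName != 'debug-proxy' or NOT IN ('health-checker', ...).
function filtersOutDistractors(where: string): boolean {
  return DISTRACTORS.some((name) => {
    const quoted = `'${escapeRegExp(name)}'`;
    const notEqual = new RegExp(`(!=|<>)\\s*${quoted}`, "i");
    const notIn = new RegExp(`NOT\\s+IN\\s*\\([^)]*${quoted}`, "i");
    return notEqual.test(where) || notIn.test(where);
  });
}

// Merely mentioning a distractor says nothing about intent on its own.
function mentionsDistractor(where: string): boolean {
  return DISTRACTORS.some((name) => where.includes(name));
}
```

A filter such as ServiceName != 'debug-proxy' both mentions a distractor and filters it out, which is why a guard based only on the mention would wrongly reject a well-scoped dashboard.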
a431842 to 1b88cfe
🔴 Tier 4: Critical
Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.
Why this tier:
Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
Stats
Deep Review
✅ No critical issues found. This is an offline eval/test-harness package (
🟡 P2: recommended
🔵 P3 nitpicks (10)
Reviewers (9): correctness, adversarial, reliability, testing, maintainability, project-standards, kieran-typescript, agent-native, learnings-researcher.
Testing gaps:
Summary
Adds a dashboard-build eval scenario that tests an agent's ability to create multi-tile observability dashboards via MCP tools. Unlike the existing investigation scenarios (error-root-cause, latency-spike, etc.), which evaluate text answers, this one evaluates created artifacts: the dashboards themselves are inspected post-run via the API.
The scenario is intentionally hard. The agent must handle vague user prompts, distractor services, impossible metric requests, messy severity data, and cross-dashboard drill-downs, all within a 15-turn budget. Current baseline: 75% combined score (92% programmatic, 78% judge), leaving room for improvement as the MCP tools and prompts evolve.
What the scenario tests
The agent receives a realistic user request for two dashboards ("Service Health Overview" + "Service Detail") and must:
- list_sources/describe_source
- query_tile
What's in this PR
- dashboard-build/generate.ts: seeds 2M traces + 4M logs across 7 services (3 user-facing + 4 distractors) with deliberate data traps
- dashboardInspection.ts: fetches created dashboards via the v2 API, queries every tile, and formats structured evidence for the LLM judge, including heuristic distractor-awareness signals
- ground-truth.json: 24 programmatic regex checks and 7 weighted judge criteria
- api.ts: dashboard CRUD, tile querying, and MCP JSON-RPC for post-run inspection
Uses the scenario hooks framework merged in PR #2547 (buildSystemPrompt, allowedToolPatterns, judgeSystemPreamble, postRunInspection).
Eval results (75% baseline)
data_awareness (2.7/5) is the main gap. The agent partially flags data traps but doesn't consistently scope dashboards to user-facing services or handle all the planted red herrings. This is by design: the scenario is meant to be hard enough that MCP prompt improvements show measurable gains.