feat(hdx-eval): add dashboard-build eval scenario by brandon-pereira · Pull Request #2571 · hyperdxio/hyperdx

brandon-pereira · 2026-07-01T21:51:37Z

Summary

Adds a dashboard-build eval scenario that tests an agent's ability to create multi-tile observability dashboards via MCP tools. Unlike the existing investigation scenarios (error-root-cause, latency-spike, etc.), which evaluate text answers, this scenario evaluates created artifacts: the dashboards themselves are inspected post-run via the API.

The scenario is intentionally hard. The agent must handle vague user prompts, distractor services, impossible metric requests, messy severity data, and cross-dashboard drill-downs — all within a 15-turn budget. Current baseline: 75% combined score (92% programmatic, 78% judge), leaving room for improvement as the MCP tools and prompts evolve.

What the scenario tests

The agent receives a realistic user request for two dashboards ("Service Health Overview" + "Service Detail") and must:

Discover sources and schemas via list_sources / describe_source
Create dashboards with 18-22 tiles across 8 display types (line, stacked_bar, table, number, heatmap, pie, search, markdown, raw SQL)
Wire cross-dashboard onClick drill-downs with ServiceName filters
Handle data traps: 4 of 7 services are internal infrastructure (not user-facing), debug-proxy has a fake 15% error rate, inventory-service has a misleading latency red herring, SeverityText has mixed casing, and CPU/memory metrics don't exist
Verify tiles return data via query_tile
Report what was built and flag data quality caveats

What's in this PR

Scenario generator (dashboard-build/generate.ts) — seeds 2M traces + 4M logs across 7 services (3 user-facing + 4 distractors) with deliberate data traps
Post-run inspection (dashboardInspection.ts) — fetches created dashboards via the v2 API, queries every tile, and formats structured evidence for the LLM judge including heuristic distractor-awareness signals
Ground truth + rubric (ground-truth.json) — 24 programmatic regex checks and 7 weighted judge criteria
HyperDX API client extensions (api.ts) — dashboard CRUD, tile querying, and MCP JSON-RPC for post-run inspection
System prompt — minimal dashboard-building instructions with DATA REVIEW guidance and conciseness constraint
Unit tests — volume targets, determinism, service distribution, error spikes, messy severity, latency red herring
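The determinism and service-distribution tests listed above rely on the generator drawing services from a seeded random source. A minimal sketch of that idea follows. The LCG, the `pickWeighted` and `sampleServices` helpers, and the weights are illustrative assumptions, not the code in generate.ts.

```typescript
// Hypothetical sketch: deterministic weighted service selection.
// The PRNG, helper names, and weights are assumptions for illustration only.

type ServiceWeight = { service: string; weight: number };

// Numerical Recipes linear congruential generator; returns values in [0, 1).
function makeLcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296;
  };
}

// Pick one service with probability proportional to its weight.
function pickWeighted(weights: ServiceWeight[], rand: () => number): string {
  const total = weights.reduce((sum, w) => sum + w.weight, 0);
  let remaining = rand() * total;
  for (const w of weights) {
    remaining -= w.weight;
    if (remaining < 0) return w.service;
  }
  // Floating-point guard: fall back to the last entry.
  return weights[weights.length - 1].service;
}

// Made-up weights over a subset of the service names mentioned in this PR.
const EXAMPLE_WEIGHTS: ServiceWeight[] = [
  { service: "order-service", weight: 0.4 },
  { service: "inventory-service", weight: 0.3 },
  { service: "health-checker", weight: 0.2 },
  { service: "debug-proxy", weight: 0.1 },
];

// The same seed always yields the same sequence of picks.
function sampleServices(seed: number, count: number): string[] {
  const rand = makeLcg(seed);
  const picks: string[] = [];
  for (let i = 0; i < count; i++) {
    picks.push(pickWeighted(EXAMPLE_WEIGHTS, rand));
  }
  return picks;
}
```

Because every draw comes from the seeded generator, a test can call `sampleServices` twice with the same seed and compare the two sequences element by element.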

Uses the scenario hooks framework merged in PR #2547 (buildSystemPrompt, allowedToolPatterns, judgeSystemPreamble, postRunInspection).

Eval results (75% baseline)

Metric	Value
Combined score	75%
Programmatic score	92%
Judge mean	78%
Tool calls (mean)	23
Tool-error penalty	8pp
Duration (mean)	282s
Completion rate	3/3 final_answer

Judge Criterion	Score
tool_efficiency	5.0/5
cross_dashboard_drill	4.7/5
computed_expressions	4.0/5
structure_and_design	4.0/5
verification	4.0/5
tile_correctness	3.7/5
data_awareness	2.7/5

data_awareness (2.7/5) is the main gap — the agent partially flags data traps but doesn't consistently scope dashboards to user-facing services or handle all the planted red herrings. This is by design: the scenario is meant to be hard enough that MCP prompt improvements show measurable gains.

changeset-bot · 2026-07-01T21:51:42Z

🦋 Changeset detected

Latest commit: 7ba3d30

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@hyperdx/hdx-eval	Patch


vercel · 2026-07-01T21:51:43Z

The latest updates on your projects.

2 Skipped Deployments

Project	Deployment	Actions	Updated (UTC)
hyperdx-oss	Ignored	Preview	Jul 1, 2026 10:42pm
hyperdx-storybook	Ignored	Preview	Jul 1, 2026 10:42pm

greptile-apps · 2026-07-01T21:57:49Z

Greptile Summary

Adds a dashboard-build eval scenario that tests an agent's ability to create multi-tile observability dashboards via MCP tools, evaluating created artifacts post-run rather than text answers. The scenario seeds 2M traces + 4M logs across 7 services (3 user-facing + 4 distractor), plants multiple misleading-data traps, and grades the result with 24 programmatic regex checks and 7 weighted LLM judge criteria.

Scenario generator (generate.ts) — seeds traces and logs with deliberate traps: a debug-proxy fake error rate, an inventory-service latency red herring, messy mixed-case SeverityText, and staging traffic blended into production.
Post-run inspection (dashboardInspection.ts) — fetches dashboards via the v2 API, queries every tile, and formats structured heuristic signals (scopesToUserFacing, filtersOutDistractors, etc.) for the LLM judge.
API client extensions (api.ts) — adds dashboard CRUD, getDashboardV2, and queryTileWithEvidence with dual JSON-RPC/SSE response parsing.
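The "dual JSON-RPC/SSE response parsing" mentioned above can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in api.ts: `parseMcpBody` and `JsonRpcEnvelope` are hypothetical names, and the SSE branch assumes each event carries its JSON on a single `data:` line.

```typescript
// Hypothetical sketch: accept either a plain JSON-RPC body or an SSE stream.
// Names and the single-line `data:` assumption are illustrative only.

type JsonRpcEnvelope = {
  jsonrpc: string;
  id?: number | string | null;
  result?: unknown;
  error?: { code: number; message: string };
};

function parseMcpBody(contentType: string, body: string): JsonRpcEnvelope {
  if (contentType.toLowerCase().includes("text/event-stream")) {
    // SSE: collect the payload of every non-empty `data:` line.
    const payloads = body
      .split(/\r?\n/)
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice("data:".length).trim())
      .filter((payload) => payload.length > 0);
    if (payloads.length === 0) {
      throw new Error("SSE body contained no data lines");
    }
    // Use the last payload, which is where a final response would arrive.
    return JSON.parse(payloads[payloads.length - 1]) as JsonRpcEnvelope;
  }
  // Otherwise treat the whole body as one JSON-RPC message.
  return JSON.parse(body) as JsonRpcEnvelope;
}
```

Branching on the response content type keeps the caller indifferent to which transport the server chose for a given request.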

Confidence Score: 5/5

Safe to merge — this is new eval tooling with no changes to production paths; all issues are in heuristic evidence signals and prompt wording.

All findings are in the eval harness itself: heuristic judge signals and a mild prompt contradiction in the system prompt. None affect production code, correctness of seeded data, or the grading rubric. The scenario generator, API client additions, and test suite are solid.

Files needing attention: dashboardInspection.ts for the two heuristic signal issues; generate.ts for the contradictory lowCardinalityValues guidance and the allowedToolPatterns question.

Important Files Changed

Filename	Overview
packages/hdx-eval/src/grading/dashboardInspection.ts	New post-run inspection module; two heuristic signal issues: `extractIntendedTileConfigs` misses `patch_dashboard` configs and the doc-comment overstates coverage, and `scopesToUserFacing`'s second condition is too broad.
packages/hdx-eval/src/scenarios/dashboard-build/generate.ts	New scenario generator and hooks; DATA REVIEW and SAMPLING_CAVEAT_BLOCK give contradictory guidance about `lowCardinalityValues` in the anchored case; `allowedToolPatterns` needs clarification.
packages/hdx-eval/src/hyperdx/api.ts	Adds dashboard CRUD and MCP JSON-RPC tile-query methods with dual JSON-RPC/SSE parsing and proper error status attachment.
packages/hdx-eval/src/generators/logs.ts	Adds `deriveSeverityNumber` fallback for messy severity variants; handles all known OTel aliases correctly.
packages/hdx-eval/src/generators/types.ts	Widens `severityText` to `CanonicalSeverity
packages/hdx-eval/src/tests/dashboard-build.test.ts	Comprehensive unit tests covering volume, determinism, service distribution, error spike, severity variants, and latency red-herring.
packages/hdx-eval/src/harness/systemPrompt.ts	Adds `SAMPLING_CAVEAT_BLOCK` injected alongside anchor time; correctly omitted when no anchor is set.
packages/hdx-eval/src/scenarios/dashboard-build/ground-truth.json	24 programmatic checks and 7 weighted judge criteria covering all tile types, cross-dashboard drill-down, and data-trap awareness.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant CLI as cli.ts (grade)
    participant GT as dashboardInspection.ts
    participant API as HyperdxApiClient
    participant MCP as /mcp endpoint
    participant Judge as LLM Judge

    CLI->>GT: inspectDashboards(toolCalls, apiUrl, ...)
    GT->>GT: extractDashboardIds(toolCalls)
    GT->>GT: extractIntendedTileConfigs [save_dashboard only]
    GT->>API: login(email, password)
    loop for each dashboardId
        GT->>API: getDashboardV2(id, accessKey)
        API-->>GT: "{ tiles[], containers[] }"
        loop for each tile
            GT->>MCP: JSON-RPC tools/call clickstack_query_tile
            MCP-->>GT: JSON-RPC or SSE response
            GT->>GT: extractMcpContent → TileEvidence
        end
    end
    GT->>GT: analyzeDistractorAwareness(result)
    GT->>API: deleteDashboard(id) [cleanup]
    GT-->>CLI: DashboardInspectionResult
    CLI->>CLI: formatDashboardEvidence(result)
    CLI->>Judge: evidence + rubric + agent answer
    Judge-->>CLI: "{ scores: { criterion: { score, rationale } } }"

Reviews (4): Last reviewed commit: "chore: add changeset"

github-actions · 2026-07-01T22:01:30Z

E2E Test Results

✅ All tests passed • 223 passed • 3 skipped • 1424s

Status	Count
✅ Passed	223
❌ Failed	0
⚠️ Flaky	4
⏭️ Skipped	3

Tests ran across 4 shards in parallel.


…mework

Add a new eval scenario that tests programmatic dashboard creation via MCP tools. The agent must build a 12-tile dashboard with containers, tabs, dashboard-level filters, onClick drill-downs, asRatio tiles, numberFormat, raw SQL, heatmap, search, and multi-source (trace + log) tiles.

Scoring (~75% on current branch):
- Programmatic checks (22) on agent's text answer
- LLM judge (6 criteria) evaluates actual dashboard artifact
- Post-run inspection fetches tiles via API and queries each for data
- Tool error penalty for failed MCP calls
- Automatic dashboard cleanup after grading

Framework refactor (scenario hooks replace hardcoded ScenarioKind):
- Scenario.buildSystemPrompt: custom system prompt builder
- Scenario.allowedToolPatterns: selectively unblock denied tools
- Scenario.judgeSystemPreamble: custom LLM judge instructions
- Scenario.postRunInspection: inspect artifacts, collect evidence, cleanup

Adding a new scenario kind (alert-build, saved-search-build) now requires only the scenario files + one import line, with zero framework file changes.

…or data, impossible requests

Dashboard eval improvements:
- Vague user prompt: describes desired outcomes, not implementation details (no more 'configType sql', 'asRatio', 'if() expression' hints)
- Impossible requests: asks for CPU/memory metrics that don't exist; agent should report unavailability, not create broken tiles
- Distractor services: 4 noisy internal services (health-checker, cron-scheduler, internal-metrics, debug-proxy) that clutter the data. debug-proxy has misleading 15% error rate (it's debug traffic). Agent should focus on user-facing services.
- Minimal system prompt: 6 lines, no workflow coaching; agent learns everything from MCP tool schemas
- Fixed 7 programmatic regex bugs (parenthetical labels like '(line)')
- V2 API for dashboard inspection (proper tile names + configs)
- Intent extraction from save_dashboard tool calls for judge evidence
- Cross-dashboard onClick validation in inspection hook
- Judge criteria includes data_awareness (distractor handling), impossible request detection, and tool_efficiency

… all services

When the saved anchor time is >12 hours old and the user didn't explicitly set --anchor-time, refresh it to Date.now() and force a reseed. This ensures describe_source's 24-hour lookback window can see the eval data, including distractor services in dashboard scenarios. Without this, distractor services (health-checker, cron-scheduler, etc.) were invisible to the agent because describe_source's value sampling queried a time range that didn't contain the stale anchored data.

…tale anchor

When the config anchor time is stale (>12h old), check if the actual data in ClickHouse is still fresh before triggering a re-seed. If the data's max timestamp is within 12h, just update the config anchor to match and skip the re-seed. This avoids unnecessary 2-minute re-seeds when the user copies a stale backup config (eval.config.branch.json) before each run but the ClickHouse data is already fresh from a recent run.
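Taken together, the two anchor-time commits above describe a small decision procedure. The sketch below is one way to read it, assuming that "update the config anchor to match" means adopting the data's max timestamp. `resolveAnchor` and its option names are hypothetical, not the harness's API.

```typescript
// Hypothetical sketch of the stale-anchor decision described above.
// Function and field names are assumptions; only the 12h threshold is from the commits.

type AnchorDecision = { anchorMs: number; reseed: boolean };

const STALE_AFTER_MS = 12 * 60 * 60 * 1000; // 12 hours

function resolveAnchor(opts: {
  savedAnchorMs: number; // anchor stored in the eval config
  explicitAnchor: boolean; // true when the user passed --anchor-time
  dataMaxTimestampMs: number | null; // newest row in ClickHouse, if any
  nowMs: number;
}): AnchorDecision {
  const { savedAnchorMs, explicitAnchor, dataMaxTimestampMs, nowMs } = opts;

  // An explicit or still-fresh anchor is kept as-is.
  if (explicitAnchor || nowMs - savedAnchorMs <= STALE_AFTER_MS) {
    return { anchorMs: savedAnchorMs, reseed: false };
  }
  // Stale config but fresh data: adopt the data's timestamp, skip the reseed.
  if (dataMaxTimestampMs !== null && nowMs - dataMaxTimestampMs <= STALE_AFTER_MS) {
    return { anchorMs: dataMaxTimestampMs, reseed: false };
  }
  // Stale config and stale (or missing) data: refresh to now and reseed.
  return { anchorMs: nowMs, reseed: true };
}
```

Passing `nowMs` in as a parameter instead of calling `Date.now()` inside keeps the function pure, which makes each branch easy to exercise in a unit test.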

…ocs, add variant to hooks

- Restore removed comments in grade.ts (tool-error penalty math, needsJudge decision, resolveBatchDir path resolution)
- Restore removed comments in systemPrompt.ts (schema reference, anchor time explanation, hypothesis playbook description)
- Fix cleanupIds JSDoc: clarify cleanup is the hook's responsibility via PostRunInspectionContext.cleanup, not a framework step
- Add variant to SystemPromptContext so custom hooks can adapt to hypothesis-mode runs
- Add anchorTimeIso to grade command's inspectionConfig (was missing, hooks received undefined on standalone re-grades)


…ading

Revert the tile-height guidance changes from PR 2554 (schemas.ts, content.ts, prompts.test.ts); API changes should not live in the eval PR. Instead, add tile sizing evaluation to the dashboard-build scenario:
- Add w/h layout dimensions to TileEvidence type and evidence formatting so the judge can see actual tile proportions
- Update structure_and_design judge criterion (weight 2→3) to penalize lazy default 12x4 layouts; number tiles should be compact, tables taller, etc.
- The eval now measures whether the agent sizes tiles appropriately, giving signal for PRs like 2554 to improve against

Eval prompt improvements:
- Add DATA REVIEW instruction to system prompt: nudges the agent to inspect data before building (count by ServiceName, check lowCardinalityValues for mixed casing/environments)
- Add "note data quality caveats" to agent prompt: agent now flags misleading signals, internal services, severity inconsistencies
- Add conciseness instruction: cuts output tokens from 22K to 17K and shaves ~70s per run

Programmatic rubric fixes:
- Widen has_number_tile regex to match "Number (" and "displayType number"
- Widen two_dashboards regex to match "Service Health Overview...Service Detail"
- Widen has_dashboard_filter regex to match "Service filter/dropdown"

Supporting fixes (pre-existing on branch):
- Fix FATAL severity number to OTel-correct 21 (distinct from ERROR 17)
- Disambiguate tile configs across dashboards in inspection
- Improve API client error handling with status codes
- Add dashboard-build to README scenario table
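The FATAL/ERROR fix above follows the OpenTelemetry log data model, where each severity name starts a block of four numbers (TRACE 1, DEBUG 5, INFO 9, WARN 13, ERROR 17, FATAL 21). A minimal table-driven lookup might look like the sketch below. The table names, the alias list, and `severityNumberFor` are hypothetical and are not the repository's `deriveSeverityNumber`.

```typescript
// Hypothetical sketch: map messy severity text to the base OTel severity number.
// Table and function names are assumptions; the alias list is illustrative.

const BASE_SEVERITY_NUMBER: Record<string, number> = {
  TRACE: 1,
  DEBUG: 5,
  INFO: 9,
  WARN: 13,
  ERROR: 17,
  FATAL: 21,
};

// A couple of common spellings folded onto canonical names.
const SEVERITY_ALIAS: Record<string, string> = {
  WARNING: "WARN",
  ERR: "ERROR",
};

function severityNumberFor(text: string): number | undefined {
  const upper = text.trim().toUpperCase();
  const canonical = SEVERITY_ALIAS[upper] ?? upper;
  // Unknown text yields undefined so the caller can decide on a default.
  return BASE_SEVERITY_NUMBER[canonical];
}
```

Keeping the numbers in a single table means mixed-case variants such as "fatal" or "Warning" resolve through the same lookup as canonical text.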

- Fix scopesToUserFacing false-negative: drop distractor-name guard that rejected dashboards mentioning distractors in exclusion context (e.g. ServiceName != 'debug-proxy'). filtersOutDistractors already captures that pattern.
- Fix systemPrompt test: replace brittle character count assertion with relative comparison against the investigation prompt length.
- Remove dead indexed keys from extractIntendedTileConfigs: the downstream lookup only uses plain tile names.
- Un-export TileEvidence and ContainerEvidence (only used internally, flagged by knip).

Issue 4 (hardcoded clickstack_query_tile) is not a bug: queryTileWithEvidence calls the MCP server directly via JSON-RPC POST, where the tool name is always clickstack_* (server-side registration). The hyperdx_* prefix only appears in claude CLI's mcp__<server>__* client-side wrapping.
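For readers unfamiliar with the direct call mentioned above, an MCP tool invocation is a JSON-RPC 2.0 request with method `tools/call` and the server-side tool name in `params.name`. The sketch below builds such a body. `buildToolCallRequest` is a hypothetical helper, and the argument names and ids are placeholders because the actual `clickstack_query_tile` input schema is not shown in this PR.

```typescript
// Hypothetical sketch: JSON-RPC body for an MCP tools/call request.
// The helper name and the example arguments are assumptions for illustration.

type ToolCallRequest = {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
};

function buildToolCallRequest(
  id: number,
  toolName: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: toolName, arguments: args },
  };
}

// Server-side tool name from the commit message; argument keys are placeholders.
const exampleRequest = buildToolCallRequest(1, "clickstack_query_tile", {
  dashboardId: "dash-123",
  tileId: "tile-456",
});

// This string would be the POST body sent to the /mcp endpoint.
const exampleBody = JSON.stringify(exampleRequest);
```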


github-actions · 2026-07-01T22:38:58Z

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

Large diff: 1758 production lines changed (threshold: 1000)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats

Production files changed: 11
Production lines changed: 1758 (+ 317 in test files, excluded from tier calculation)
Branch: brandon/brandon-dashboard-evals
Author: brandon-pereira

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

github-actions · 2026-07-01T22:53:32Z

Deep Review

✅ No critical issues found. This is an offline eval/test-harness package (packages/hdx-eval), not production request-serving code: there is no auth boundary, no user data, and the only deletion (deleteDashboard) targets dashboards the harness itself created for cleanup. Correctness cross-checked the risky parsing against the real v2 API serializer and MCP tool envelope and confirmed the happy path is sound (inner.result double-parse is correct, deriveSeverityNumber matches the existing table, pickService weights sum to 1.0). The findings below are all scoring-integrity and coverage risks — real, but none are ship-blockers.

🟡 P2 — recommended

packages/hdx-eval/src/grading/dashboardInspection.ts:270 — inspectDashboards never collects dashboard.filters and formatDashboardEvidence emits only containers and tiles, yet the judge is instructed to score the Dashboard-2 ServiceName filter and dashboard-level filters from the artifact, so a core requested feature is graded blind and scopesToUserFacing (which scans tile configs only) misses agents that scope via a dashboard-level allow-list filter.
- Fix: Read dashboard.filters from the v2 response and include filter expressions in the evidence blob, and extend scopesToUserFacing to consider dashboard-level filters.
- Reviewers: correctness, adversarial
packages/hdx-eval/src/grading/dashboardInspection.ts:429 — handlesMessySeverity matches lower()/upper()/IN-list/'fatal'/LIKE but never SeverityNumber >= 17, even though the ground-truth facts and the data_awareness criterion both list that as valid robust handling, so an agent using the recommended numeric filter is signaled to the judge as a failure.
- Fix: Add a SeverityNumber comparison clause (e.g. match severitynumber\s*(>=|>|=)) to the handlesMessySeverity heuristic and update the evidence label.
- Reviewers: correctness
packages/hdx-eval/src/grading/dashboardInspection.ts:436 — latencyBrokenDownByEndpoint is computed over the concatenated blob of every tile's config, so spanname from an unrelated top-endpoints table plus the always-present duration from the required heatmap makes the signal fire true even when the actual latency tile is not broken down by endpoint — the exact failure the trap is meant to catch — inflating data_awareness.
- Fix: Evaluate this signal per latency-type tile rather than over the whole-dashboard concatenated config text.
- Reviewers: adversarial, correctness
packages/hdx-eval/src/grading/dashboardInspection.ts:122 — extractIntendedTileConfigs keys the map by tile name across both dashboards (last-write-wins), and since the scenario requests same-named tiles ("Top error messages", "Recent error logs") in both dashboards, formatDashboardEvidence (which prefers intendedConfig ?? config) shows the judge the wrong dashboard's config for the colliding tile, skewing tile_correctness and computed_expressions; the code comment claiming the judge still sees both configs is inaccurate.
- Fix: Key the intended-config map by a (dashboardId, tileName) pair or match by tile id instead of name alone.
- Reviewers: adversarial
packages/hdx-eval/src/scenarios/dashboard-build/ground-truth.json:135 — Several high-weight programmatic checks match generic tokens in the agent's free-text answer: handles_impossible_request matches the bare phrase "not available", flags_internal_services matches a bare distractor service name (so an agent that wrongly includes it still scores), and flags_misleading_data matches bare "caveat"/"data quality" — terms the prompt itself primes — inflating the programmatic component of the score.
- Fix: Require co-occurrence of the concept and a distractor/impossible-metric term, or move these semantic checks to the artifact-based judge.
- Reviewers: adversarial
packages/hdx-eval/src/grading/dashboardInspection.ts:354 — DISTRACTOR_SERVICES and USER_FACING_SERVICES are re-declared here independently of SERVICE_WEIGHTS/ENDPOINTS in generate.ts with no shared source, so renaming or adding a service in the generator silently desyncs the grading heuristics (signals quietly return false rather than erroring).
- Fix: Export the service-name lists from generate.ts and import them into dashboardInspection.ts so the generator and grader share one definition.
- Reviewers: maintainability
packages/hdx-eval/src/grading/dashboardInspection.ts:81 — The parsing and heuristic-scoring functions that decide the eval outcome (extractDashboardIds, extractIntendedTileConfigs, extractMcpContent, queryTileWithEvidence row/group extraction, getDashboardV2 shape validation, and analyzeDistractorAwareness) have no unit tests, so a regression in any regex/parse branch silently mis-scores runs with nothing to catch it.
- Fix: Export these functions and add table-driven unit tests over realistic tool-output/API/MCP-response fixtures, including a two-dashboard same-tile-name collision and a latency tile with no SpanName groupBy.
- Reviewers: testing, correctness, kieran-typescript, reliability, adversarial
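To illustrate the handlesMessySeverity finding above, here is a rough sketch of a heuristic that accepts either case-normalized text matching or a numeric SeverityNumber comparison. It is not the code in dashboardInspection.ts: the function name is hypothetical, and only two of the listed patterns are modelled.

```typescript
// Hypothetical sketch: credit both case-normalized and numeric severity filters.
// Only a subset of the patterns named in the review is modelled here.

function handlesMessySeveritySketch(tileConfigText: string): boolean {
  const text = tileConfigText.toLowerCase();

  // e.g. lower(SeverityText) = 'error' or upper(SeverityText) IN (...)
  const caseNormalized = /\b(lower|upper)\s*\(\s*severitytext\s*\)/.test(text);

  // e.g. SeverityNumber >= 17, following the regex suggested in the fix
  const numericThreshold = /\bseveritynumber\s*(>=|>|=)\s*\d+/.test(text);

  return caseNormalized || numericThreshold;
}
```

With the numeric clause in place, a tile filtered on `SeverityNumber >= 17` is credited, while a plain `SeverityText = 'ERROR'` comparison, which misses the mixed-case variants, is not.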

🔵 P3 nitpicks (10)

packages/hdx-eval/src/grading/dashboardInspection.ts:424 — scopesToUserFacing's \bin\s*\( also matches inside NOT IN(, and the accompanying every(includes) is satisfied across the whole blob, so per-service tiles that never use a 3-service allow-list can be credited.
- Fix: Require all three names inside one IN(...) group and exclude the NOT IN case.
packages/hdx-eval/src/grading/dashboardInspection.ts:440 — filtersByEnvironment is true whenever deployment.environment appears, so a tile that groups by environment (surfacing the staging/prod blend rather than filtering it) is credited as filtering.
- Fix: Scope the attribute-name branch to require it appear inside a where/filter, not a groupBy.
packages/hdx-eval/src/grading/dashboardInspection.ts:404 — filtersOutDistractors only inspects the first indexOf occurrence of each distractor name, so an exclusion clause on a later occurrence is missed.
- Fix: Scan all occurrences of each distractor name for a nearby exclusion operator.
packages/hdx-eval/src/grading/dashboardInspection.ts:99 — When a save_dashboard payload is not valid JSON, the non-global ID_REGEX returns the first "id" match, which could be a tile/trace id preceding the dashboard id, causing inspection of the wrong id and leaving the real dashboard un-cleaned.
- Fix: Prefer the parsed top-level id path and anchor the regex fallback to a dashboard-scoped key.
packages/hdx-eval/src/hyperdx/api.ts:372 — On inner-JSON parse failure queryTileWithEvidence returns {success:true, hasData: text.length>0}, so a non-JSON payload counts as data; the normal envelope parses fine so this is an edge, but it can inflate tilesWithData.
- Fix: Return hasData:false (or success:false) when the tool result text cannot be parsed.
packages/hdx-eval/src/generators/logs.ts:22 — deriveSeverityNumber re-implements OTel severity numbers via prefix matching (if (u === 'FATAL') return 21) that duplicates SEVERITY_NUMBER_BY_TEXT in templates.ts; the two agree today but are independently maintained and can drift.
- Fix: Delegate deriveSeverityNumber to the exported SEVERITY_NUMBER_BY_TEXT table.
- Reviewers: maintainability, kieran-typescript
packages/hdx-eval/src/hyperdx/api.ts:256 — getDashboardV2 casts the unvalidated response to HyperdxDashboard whose tile id is typed string, so tile.id ?? tile._id is typed string while it can be undefined at runtime and would be silently dropped by JSON.stringify into the MCP call; correctness verified the current v2 serializer always emits id, so it is not reachable today.
- Fix: Narrow the tile type to optional fields or guard/skip tiles with a missing id.
- Reviewers: kieran-typescript, correctness
packages/hdx-eval/src/grading/dashboardInspection.ts:208 — HyperdxDashboard omits the containers field the code depends on, forcing a dashboard as Record<string, unknown> plus a secondary array cast at the call site with no type checking of the container shape.
- Fix: Add an optional containers field to HyperdxDashboard and drop the casts.
- Reviewers: kieran-typescript
packages/hdx-eval/src/grading/dashboardInspection.ts:219 — The tile filter callback re-annotates the already-known tile element type as Record<string, unknown>, discarding structural typing on containerId in the same function that otherwise relies on the real tile shape.
- Fix: Drop the annotation and let TS infer the parameter from dashboard.tiles.
- Reviewers: kieran-typescript
packages/hdx-eval/src/scenarios/dashboard-build/generate.ts:500 — crossDashboardOnClickValid is computed into summary but never included in the evidence string sent to the judge, and it assumes an onClick.target.mode==='id' shape; if the real onClick schema differs it is silently always-false as a persisted metric.
- Fix: Verify the onClick schema and either surface this signal in the judge evidence or document it as diagnostic-only.
- Reviewers: adversarial
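The first nitpick above (scopesToUserFacing crediting `NOT IN(` and names scattered across the blob) could be addressed along these lines. This is a sketch under assumptions: `scopesToAllowList` is a hypothetical name, the service list is passed in by the caller, and a flat `IN (...)` group without nested parentheses is assumed.

```typescript
// Hypothetical sketch: require every user-facing service inside ONE IN(...) group
// and skip NOT IN(...). Assumes no nested parentheses inside the group.

function scopesToAllowList(configText: string, userFacing: string[]): boolean {
  // Group 1 captures an optional leading NOT; group 2 captures the list body.
  const inGroup = /(\bnot\s+)?\bin\s*\(([^)]*)\)/gi;
  let match: RegExpExecArray | null;

  while ((match = inGroup.exec(configText)) !== null) {
    if (match[1]) continue; // NOT IN (...) is an exclusion, not an allow-list

    const body = match[2].toLowerCase();
    const allPresent = userFacing.every((service) =>
      body.includes(service.toLowerCase())
    );
    if (allPresent) return true;
  }
  return false;
}
```

As a usage example, `scopesToAllowList(config, ["order-service", "inventory-service", "checkout-service"])` returns true only when one `IN (...)` list names all three; "checkout-service" is a placeholder here, since the third user-facing service is not named in this PR.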

Reviewers (9): correctness, adversarial, reliability, testing, maintainability, project-standards, kieran-typescript, agent-native, learnings-researcher.

Testing gaps:

No unit coverage for dashboardInspection.ts parsing/heuristic functions or api.ts getDashboardV2/queryTileWithEvidence/extractMcpContent JSON-RPC/SSE branches — the highest-risk new logic.
deriveSeverityNumber's prefix-matching fallback branches are untested; all current callers supply canonical text or an explicit severityNumber.
Generator error-path/red-herring assertions run against a single fixed seed (42); the debug-proxy and order-service-spike error-rate bounds have the tightest statistical margins and would need re-tuning if the RNG or volume factor changes.
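As a concrete starting point for the missing unit coverage, the same-tile-name collision called out in the review can be captured with a composite key and a small fixture. The sketch below is illustrative only: `IntendedTile`, `intendedKey`, and `indexIntendedConfigs` are hypothetical names, not the exports of dashboardInspection.ts.

```typescript
// Hypothetical sketch: key intended tile configs by (dashboardId, tileName)
// so same-named tiles on different dashboards do not overwrite each other.

type IntendedTile = { dashboardId: string; tileName: string; config: unknown };

// JSON-encoding the pair avoids ambiguity if either part contains a separator.
function intendedKey(dashboardId: string, tileName: string): string {
  return JSON.stringify([dashboardId, tileName]);
}

function indexIntendedConfigs(tiles: IntendedTile[]): Map<string, unknown> {
  const byKey = new Map<string, unknown>();
  for (const tile of tiles) {
    byKey.set(intendedKey(tile.dashboardId, tile.tileName), tile.config);
  }
  return byKey;
}

// Fixture: both dashboards contain a tile named "Top error messages".
const collisionFixture: IntendedTile[] = [
  { dashboardId: "overview", tileName: "Top error messages", config: { groupBy: "ServiceName" } },
  { dashboardId: "detail", tileName: "Top error messages", config: { groupBy: "SpanName" } },
];
const indexedFixture = indexIntendedConfigs(collisionFixture);
```

A table-driven test can then assert that looking up each (dashboard, tile) pair returns that dashboard's own config, which is the case a name-only key gets wrong.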

greptile-apps Bot reviewed Jul 1, 2026


Comment thread packages/hdx-eval/src/grading/dashboardInspection.ts

Comment thread packages/hdx-eval/src/__tests__/systemPrompt.test.ts Outdated

Comment thread packages/hdx-eval/src/grading/dashboardInspection.ts

Comment thread packages/hdx-eval/src/hyperdx/api.ts

vercel Bot temporarily deployed to Preview – hyperdx-storybook July 1, 2026 22:02 Inactive

brandon-pereira and others added 16 commits July 1, 2026 16:05

merge: integrate eval-scenario-hooks review fixes into dashboard branch

e215df9

fix(mcp): guide agents to size dashboard tiles correctly (HDX-4661)

89fec2c

Co-authored-by: Brandon Pereira <brandon-pereira@users.noreply.github.com>

fix(mcp): align heatmap tile width guidance and trim changeset

e0f23aa

Co-authored-by: Brandon Pereira <brandon-pereira@users.noreply.github.com>

fix(mcp): reconcile search tile height guidance with rule 14 (6-10)

34a84aa

Co-authored-by: Brandon Pereira <brandon-pereira@users.noreply.github.com>

Revert "fix(mcp): reconcile search tile height guidance with rule 14 …

da4926c

…(6-10)" This reverts commit 6bc2b49.

Revert "fix(mcp): align heatmap tile width guidance and trim changeset"

5b47686

This reverts commit c599487.

Revert "fix(mcp): guide agents to size dashboard tiles correctly (HDX…

764fc61

…-4661)" This reverts commit 1560a9c.

fix(hdx-eval): un-export internal-only types flagged by knip

1b88cfe

brandon-pereira force-pushed the brandon/brandon-dashboard-evals branch from a431842 to 1b88cfe July 1, 2026 22:08

vercel Bot temporarily deployed to Preview – hyperdx-storybook July 1, 2026 22:08 Inactive

fix(hdx-eval): re-export PostRunInspectionContext needed by dashboard…

bd02946

…-build

vercel Bot temporarily deployed to Preview – hyperdx-storybook July 1, 2026 22:22 Inactive

brandon-pereira marked this pull request as ready for review July 1, 2026 22:38

github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label Jul 1, 2026

chore: add changeset

7ba3d30

brandon-pereira requested review from a team and wrn14897 and removed request for a team July 1, 2026 22:49

brandon-pereira wants to merge 18 commits into main from brandon/brandon-dashboard-evals

2 participants
