feat(hdx-eval): add dashboard-build eval scenario #2571

Open
brandon-pereira wants to merge 18 commits into main from brandon/brandon-dashboard-evals
@brandon-pereira brandon-pereira commented Jul 1, 2026

Member

Summary

Adds a dashboard-build eval scenario that tests an agent's ability to create multi-tile observability dashboards via MCP tools. Unlike the existing investigation scenarios (error-root-cause, latency-spike, etc.) which evaluate text answers, this evaluates created artifacts — the dashboards themselves are inspected post-run via the API.

The scenario is intentionally hard. The agent must handle vague user prompts, distractor services, impossible metric requests, messy severity data, and cross-dashboard drill-downs — all within a 15-turn budget. Current baseline: 75% combined score (92% programmatic, 78% judge), leaving room for improvement as the MCP tools and prompts evolve.

What the scenario tests

The agent receives a realistic user request for two dashboards ("Service Health Overview" + "Service Detail") and must:

  • Discover sources and schemas via list_sources / describe_source
  • Create dashboards with 18-22 tiles across 8 display types (line, stacked_bar, table, number, heatmap, pie, search, markdown), plus raw SQL tiles
  • Wire cross-dashboard onClick drill-downs with ServiceName filters
  • Handle data traps: 4 of 7 services are internal infrastructure (not user-facing), debug-proxy has a fake 15% error rate, inventory-service has a misleading latency red herring, SeverityText has mixed casing, and CPU/memory metrics don't exist
  • Verify tiles return data via query_tile
  • Report what was built and flag data quality caveats

What's in this PR

  • Scenario generator (dashboard-build/generate.ts) — seeds 2M traces + 4M logs across 7 services (3 user-facing + 4 distractors) with deliberate data traps
  • Post-run inspection (dashboardInspection.ts) — fetches created dashboards via the v2 API, queries every tile, and formats structured evidence for the LLM judge including heuristic distractor-awareness signals
  • Ground truth + rubric (ground-truth.json) — 24 programmatic regex checks and 7 weighted judge criteria
  • HyperDX API client extensions (api.ts) — dashboard CRUD, tile querying, and MCP JSON-RPC for post-run inspection
  • System prompt — minimal dashboard-building instructions with DATA REVIEW guidance and conciseness constraint
  • Unit tests — volume targets, determinism, service distribution, error spikes, messy severity, latency red herring

Uses the scenario hooks framework merged in PR #2547 (buildSystemPrompt, allowedToolPatterns, judgeSystemPreamble, postRunInspection).
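
As a rough illustration of how the four hooks could fit together, here is a minimal sketch. Only the hook names (buildSystemPrompt, allowedToolPatterns, judgeSystemPreamble, postRunInspection) and the context type names come from this PR; every field, type shape, helper function, and the example tool names are assumptions made for illustration, not the framework's actual definitions.

```typescript
// Hypothetical context shapes; the real ones live in the hooks framework.
interface SystemPromptContext {
  anchorTimeIso?: string;
  variant?: string;
}

interface PostRunInspectionContext {
  toolCalls: { name: string; output: string }[];
}

// Every hook is optional so that existing scenarios keep default behavior.
interface Scenario {
  name: string;
  buildSystemPrompt?: (ctx: SystemPromptContext) => string;
  allowedToolPatterns?: RegExp[];
  judgeSystemPreamble?: string;
  postRunInspection?: (ctx: PostRunInspectionContext) => Promise<string>;
}

// Placeholder default prompt (assumption).
const DEFAULT_PROMPT = "You are investigating an incident.";

// Use the scenario's prompt builder when present, else the default.
function resolveSystemPrompt(s: Scenario, ctx: SystemPromptContext): string {
  return s.buildSystemPrompt ? s.buildSystemPrompt(ctx) : DEFAULT_PROMPT;
}

// A tool that is denied by default is unblocked only if a pattern matches.
function isToolAllowed(s: Scenario, tool: string, deniedByDefault: boolean): boolean {
  if (!deniedByDefault) return true;
  return (s.allowedToolPatterns ?? []).some((p) => p.test(tool));
}

// Illustrative scenario definition; the prompt text is invented.
const dashboardBuild: Scenario = {
  name: "dashboard-build",
  buildSystemPrompt: (ctx) =>
    `Build the requested dashboards. Anchor: ${ctx.anchorTimeIso ?? "now"}`,
  allowedToolPatterns: [/save_dashboard$/],
};
```

With this shape, a scenario that defines no hooks behaves like the existing investigation scenarios, which is one way to read the "one import line, zero framework changes" goal.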

Eval results (75% baseline)

| Metric | Value |
| --- | --- |
| Combined score | 75% |
| Programmatic score | 92% |
| Judge mean | 78% |
| Tool calls (mean) | 23 |
| Tool-error penalty | 8pp |
| Duration (mean) | 282s |
| Completion rate | 3/3 final_answer |

| Judge Criterion | Score |
| --- | --- |
| tool_efficiency | 5.0/5 |
| cross_dashboard_drill | 4.7/5 |
| computed_expressions | 4.0/5 |
| structure_and_design | 4.0/5 |
| verification | 4.0/5 |
| tile_correctness | 3.7/5 |
| data_awareness | 2.7/5 |

data_awareness (2.7/5) is the main gap — the agent partially flags data traps but doesn't consistently scope dashboards to user-facing services or handle all the planted red herrings. This is by design: the scenario is meant to be hard enough that MCP prompt improvements show measurable gains.

@changeset-bot

changeset-bot Bot commented Jul 1, 2026


🦋 Changeset detected

Latest commit: 7ba3d30

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
| --- | --- |
| @hyperdx/hdx-eval | Patch |


@vercel

vercel Bot commented Jul 1, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| hyperdx-oss | Ignored | Preview | Jul 1, 2026 10:42pm |
| hyperdx-storybook | Ignored | Preview | Jul 1, 2026 10:42pm |


@greptile-apps

greptile-apps Bot commented Jul 1, 2026

Contributor

Greptile Summary

Adds a dashboard-build eval scenario that tests an agent's ability to create multi-tile observability dashboards via MCP tools, evaluating created artifacts post-run rather than text answers. The scenario seeds 2M traces + 4M logs across 7 services (3 user-facing + 4 distractor), plants multiple misleading-data traps, and grades the result with 24 programmatic regex checks and 7 weighted LLM judge criteria.

  • Scenario generator (generate.ts) — seeds traces and logs with deliberate traps: a debug-proxy fake error rate, an inventory-service latency red herring, messy mixed-case SeverityText, and staging traffic blended into production.
  • Post-run inspection (dashboardInspection.ts) — fetches dashboards via the v2 API, queries every tile, and formats structured heuristic signals (scopesToUserFacing, filtersOutDistractors, etc.) for the LLM judge.
  • API client extensions (api.ts) — adds dashboard CRUD, getDashboardV2, and queryTileWithEvidence with dual JSON-RPC/SSE response parsing.
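
To make the "dual JSON-RPC/SSE response parsing" concrete, here is a minimal sketch of one way a client could accept either a plain JSON body or a Server-Sent Events stream. The function name parseJsonRpcBody and the JsonRpcResponse shape are hypothetical, and the rule of returning the first event that carries a result or error is an assumption; the PR's actual extractMcpContent may differ.

```typescript
// Minimal JSON-RPC 2.0 response shape (illustrative, not the PR's type).
interface JsonRpcResponse {
  jsonrpc: string;
  id?: number | string | null;
  result?: unknown;
  error?: { code: number; message: string };
}

// Parse a response body that is either plain JSON or an SSE stream.
function parseJsonRpcBody(body: string, contentType: string): JsonRpcResponse {
  if (!contentType.includes("text/event-stream")) {
    return JSON.parse(body) as JsonRpcResponse;
  }
  // SSE events are separated by blank lines; payload lines start with "data:".
  for (const event of body.split(/\r?\n\r?\n/)) {
    const data = event
      .split(/\r?\n/)
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).replace(/^ /, "")) // drop one optional space
      .join("\n");
    if (data.length === 0) continue;
    const msg = JSON.parse(data) as JsonRpcResponse;
    // Assumption: skip notifications and return the first actual response.
    if (msg.result !== undefined || msg.error !== undefined) return msg;
  }
  throw new Error("SSE stream contained no JSON-RPC response");
}
```

A caller would pass the raw response text together with the Content-Type header value, so the same code path serves both transports.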

Confidence Score: 5/5

Safe to merge — this is new eval tooling with no changes to production paths; all issues are in heuristic evidence signals and prompt wording.

All findings are in the eval harness itself: heuristic judge signals and a mild prompt contradiction in the system prompt. None affect production code, correctness of seeded data, or the grading rubric. The scenario generator, API client additions, and test suite are solid.

Files needing attention: dashboardInspection.ts for the two heuristic signal issues; generate.ts for the contradictory lowCardinalityValues guidance and the allowedToolPatterns question.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/hdx-eval/src/grading/dashboardInspection.ts | New post-run inspection module; two heuristic signal issues: extractIntendedTileConfigs misses patch_dashboard configs and the doc-comment overstates coverage, and scopesToUserFacing's second condition is too broad. |
| packages/hdx-eval/src/scenarios/dashboard-build/generate.ts | New scenario generator and hooks; DATA REVIEW and SAMPLING_CAVEAT_BLOCK give contradictory guidance about lowCardinalityValues in the anchored case; allowedToolPatterns needs clarification. |
| packages/hdx-eval/src/hyperdx/api.ts | Adds dashboard CRUD and MCP JSON-RPC tile-query methods with dual JSON-RPC/SSE parsing and proper error status attachment. |
| packages/hdx-eval/src/generators/logs.ts | Adds deriveSeverityNumber fallback for messy severity variants; handles all known OTel aliases correctly. |
| packages/hdx-eval/src/generators/types.ts | Widens severityText to `CanonicalSeverity |
| packages/hdx-eval/src/tests/dashboard-build.test.ts | Comprehensive unit tests covering volume, determinism, service distribution, error spike, severity variants, and latency red-herring. |
| packages/hdx-eval/src/harness/systemPrompt.ts | Adds SAMPLING_CAVEAT_BLOCK injected alongside anchor time; correctly omitted when no anchor is set. |
| packages/hdx-eval/src/scenarios/dashboard-build/ground-truth.json | 24 programmatic checks and 7 weighted judge criteria covering all tile types, cross-dashboard drill-down, and data-trap awareness. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant CLI as cli.ts (grade)
    participant GT as dashboardInspection.ts
    participant API as HyperdxApiClient
    participant MCP as /mcp endpoint
    participant Judge as LLM Judge

    CLI->>GT: inspectDashboards(toolCalls, apiUrl, ...)
    GT->>GT: extractDashboardIds(toolCalls)
    GT->>GT: extractIntendedTileConfigs [save_dashboard only]
    GT->>API: login(email, password)
    loop for each dashboardId
        GT->>API: getDashboardV2(id, accessKey)
        API-->>GT: "{ tiles[], containers[] }"
        loop for each tile
            GT->>MCP: JSON-RPC tools/call clickstack_query_tile
            MCP-->>GT: JSON-RPC or SSE response
            GT->>GT: extractMcpContent → TileEvidence
        end
    end
    GT->>GT: analyzeDistractorAwareness(result)
    GT->>API: deleteDashboard(id) [cleanup]
    GT-->>CLI: DashboardInspectionResult
    CLI->>CLI: formatDashboardEvidence(result)
    CLI->>Judge: evidence + rubric + agent answer
    Judge-->>CLI: "{ scores: { criterion: { score, rationale } } }"

Reviews (4): Last reviewed commit: "chore: add changeset"

Comment thread packages/hdx-eval/src/grading/dashboardInspection.ts
Comment thread packages/hdx-eval/src/__tests__/systemPrompt.test.ts Outdated
Comment thread packages/hdx-eval/src/grading/dashboardInspection.ts
Comment thread packages/hdx-eval/src/hyperdx/api.ts
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

E2E Test Results

All tests passed • 223 passed • 3 skipped • 1424s

| Status | Count |
| --- | --- |
| ✅ Passed | 223 |
| ❌ Failed | 0 |
| ⚠️ Flaky | 4 |
| ⏭️ Skipped | 3 |

Tests ran across 4 shards in parallel.


@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook July 1, 2026 22:02 Inactive
brandon-pereira and others added 16 commits July 1, 2026 16:05
…mework

Add a new eval scenario that tests programmatic dashboard creation via
MCP tools. The agent must build a 12-tile dashboard with containers,
tabs, dashboard-level filters, onClick drill-downs, asRatio tiles,
numberFormat, raw SQL, heatmap, search, and multi-source (trace + log)
tiles.

Scoring (~75% on current branch):
- Programmatic checks (22) on agent's text answer
- LLM judge (6 criteria) evaluates actual dashboard artifact
- Post-run inspection fetches tiles via API and queries each for data
- Tool error penalty for failed MCP calls
- Automatic dashboard cleanup after grading

Framework refactor — scenario hooks replace hardcoded ScenarioKind:
- Scenario.buildSystemPrompt: custom system prompt builder
- Scenario.allowedToolPatterns: selectively unblock denied tools
- Scenario.judgeSystemPreamble: custom LLM judge instructions
- Scenario.postRunInspection: inspect artifacts, collect evidence, cleanup

Adding a new scenario kind (alert-build, saved-search-build) now requires
only the scenario files + one import line — zero framework file changes.
…or data, impossible requests

Dashboard eval improvements:
- Vague user prompt: describes desired outcomes, not implementation details
  (no more 'configType sql', 'asRatio', 'if() expression' hints)
- Impossible requests: asks for CPU/memory metrics that don't exist —
  agent should report unavailability, not create broken tiles
- Distractor services: 4 noisy internal services (health-checker,
  cron-scheduler, internal-metrics, debug-proxy) that clutter the data.
  debug-proxy has misleading 15% error rate (it's debug traffic).
  Agent should focus on user-facing services.
- Minimal system prompt: 6 lines, no workflow coaching — agent learns
  everything from MCP tool schemas
- Fixed 7 programmatic regex bugs (parenthetical labels like '(line)')
- V2 API for dashboard inspection (proper tile names + configs)
- Intent extraction from save_dashboard tool calls for judge evidence
- Cross-dashboard onClick validation in inspection hook
- Judge criteria includes data_awareness (distractor handling),
  impossible request detection, and tool_efficiency
… all services

When the saved anchor time is >12 hours old and the user didn't explicitly
set --anchor-time, refresh it to Date.now() and force a reseed. This
ensures describe_source's 24-hour lookback window can see the eval data,
including distractor services in dashboard scenarios.

Without this, distractor services (health-checker, cron-scheduler, etc.)
were invisible to the agent because describe_source's value sampling
queried a time range that didn't contain the stale anchored data.
…tale anchor

When the config anchor time is stale (>12h old), check if the actual
data in ClickHouse is still fresh before triggering a re-seed. If the
data's max timestamp is within 12h, just update the config anchor to
match and skip the re-seed.

This avoids unnecessary 2-minute re-seeds when the user copies a stale
backup config (eval.config.branch.json) before each run but the
ClickHouse data is already fresh from a recent run.
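
The stale-anchor behavior described in the two commit messages above can be sketched as a pure decision function. This is a minimal illustration under stated assumptions: the function name decideAnchor, the AnchorDecision type, and the option names are hypothetical; only the 12-hour threshold, the --anchor-time override, and the "update the anchor instead of re-seeding when data is fresh" rule come from the commit messages.

```typescript
const STALE_MS = 12 * 60 * 60 * 1000; // 12-hour staleness threshold

type AnchorDecision =
  | { action: "keep"; anchorMs: number }
  | { action: "update-anchor"; anchorMs: number }
  | { action: "reseed"; anchorMs: number };

function decideAnchor(opts: {
  savedAnchorMs: number; // anchor stored in the eval config
  userSetAnchor: boolean; // true when --anchor-time was passed explicitly
  dataMaxTimestampMs: number | null; // newest seeded row, null if no data
  nowMs: number;
}): AnchorDecision {
  const { savedAnchorMs, userSetAnchor, dataMaxTimestampMs, nowMs } = opts;
  // An explicit anchor or a fresh saved anchor is left untouched.
  if (userSetAnchor || nowMs - savedAnchorMs <= STALE_MS) {
    return { action: "keep", anchorMs: savedAnchorMs };
  }
  // Stale config but fresh data: realign the anchor and skip the re-seed.
  if (dataMaxTimestampMs !== null && nowMs - dataMaxTimestampMs <= STALE_MS) {
    return { action: "update-anchor", anchorMs: dataMaxTimestampMs };
  }
  // Both config and data are stale: refresh the anchor and force a re-seed.
  return { action: "reseed", anchorMs: nowMs };
}
```

Keeping the decision separate from the ClickHouse query and the config write makes each branch easy to unit test with fixed timestamps.
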
…ocs, add variant to hooks

- Restore removed comments in grade.ts (tool-error penalty math,
  needsJudge decision, resolveBatchDir path resolution)
- Restore removed comments in systemPrompt.ts (schema reference,
  anchor time explanation, hypothesis playbook description)
- Fix cleanupIds JSDoc: clarify cleanup is the hook's responsibility
  via PostRunInspectionContext.cleanup, not a framework step
- Add variant to SystemPromptContext so custom hooks can adapt to
  hypothesis-mode runs
- Add anchorTimeIso to grade command's inspectionConfig (was missing,
  hooks received undefined on standalone re-grades)
Co-authored-by: Brandon Pereira <brandon-pereira@users.noreply.github.com>
Co-authored-by: Brandon Pereira <brandon-pereira@users.noreply.github.com>
Co-authored-by: Brandon Pereira <brandon-pereira@users.noreply.github.com>
…ading

Revert the tile-height guidance changes from PR 2554 (schemas.ts,
content.ts, prompts.test.ts) — API changes should not live in the eval
PR. Instead, add tile sizing evaluation to the dashboard-build scenario:

- Add w/h layout dimensions to TileEvidence type and evidence formatting
  so the judge can see actual tile proportions
- Update structure_and_design judge criterion (weight 2→3) to penalize
  lazy default 12x4 layouts — number tiles should be compact, tables
  taller, etc.
- The eval now measures whether the agent sizes tiles appropriately,
  giving signal for PRs like 2554 to improve against
Eval prompt improvements:
- Add DATA REVIEW instruction to system prompt — nudges the agent to
  inspect data before building (count by ServiceName, check
  lowCardinalityValues for mixed casing/environments)
- Add "note data quality caveats" to agent prompt — agent now flags
  misleading signals, internal services, severity inconsistencies
- Add conciseness instruction — cuts output tokens from 22K to 17K
  and shaves ~70s per run

Programmatic rubric fixes:
- Widen has_number_tile regex to match "Number (" and "displayType number"
- Widen two_dashboards regex to match "Service Health Overview...Service Detail"
- Widen has_dashboard_filter regex to match "Service filter/dropdown"

Supporting fixes (pre-existing on branch):
- Fix FATAL severity number to OTel-correct 21 (distinct from ERROR 17)
- Disambiguate tile configs across dashboards in inspection
- Improve API client error handling with status codes
- Add dashboard-build to README scenario table
- Fix scopesToUserFacing false-negative: drop distractor-name guard
  that rejected dashboards mentioning distractors in exclusion context
  (e.g. ServiceName != 'debug-proxy'). filtersOutDistractors already
  captures that pattern.
- Fix systemPrompt test: replace brittle character count assertion
  with relative comparison against the investigation prompt length.
- Remove dead indexed keys from extractIntendedTileConfigs — the
  downstream lookup only uses plain tile names.
- Un-export TileEvidence and ContainerEvidence (only used internally,
  flagged by knip).

Issue 4 (hardcoded clickstack_query_tile) is not a bug:
queryTileWithEvidence calls the MCP server directly via JSON-RPC POST,
where the tool name is always clickstack_* (server-side registration).
The hyperdx_* prefix only appears in claude CLI's mcp__<server>__*
client-side wrapping.
@brandon-pereira brandon-pereira force-pushed the brandon/brandon-dashboard-evals branch from a431842 to 1b88cfe Compare July 1, 2026 22:08
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook July 1, 2026 22:08 Inactive
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook July 1, 2026 22:22 Inactive
@brandon-pereira brandon-pereira marked this pull request as ready for review July 1, 2026 22:38
@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label Jul 1, 2026
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Large diff: 1758 production lines changed (threshold: 1000)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 11
  • Production lines changed: 1758 (+ 317 in test files, excluded from tier calculation)
  • Branch: brandon/brandon-dashboard-evals
  • Author: brandon-pereira

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@brandon-pereira brandon-pereira requested review from a team and wrn14897 and removed request for a team July 1, 2026 22:49
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

Deep Review

No critical issues found. This is an offline eval/test-harness package (packages/hdx-eval), not production request-serving code: there is no auth boundary, no user data, and the only deletion (deleteDashboard) targets dashboards the harness itself created for cleanup. Correctness cross-checked the risky parsing against the real v2 API serializer and MCP tool envelope and confirmed the happy path is sound (inner.result double-parse is correct, deriveSeverityNumber matches the existing table, pickService weights sum to 1.0). The findings below are all scoring-integrity and coverage risks — real, but none are ship-blockers.

🟡 P2 — recommended

  • packages/hdx-eval/src/grading/dashboardInspection.ts:270 — inspectDashboards never collects dashboard.filters and formatDashboardEvidence emits only containers and tiles, yet the judge is instructed to score the Dashboard-2 ServiceName filter and dashboard-level filters from the artifact, so a core requested feature is graded blind and scopesToUserFacing (which scans tile configs only) misses agents that scope via a dashboard-level allow-list filter.
    • Fix: Read dashboard.filters from the v2 response and include filter expressions in the evidence blob, and extend scopesToUserFacing to consider dashboard-level filters.
    • correctness, adversarial
  • packages/hdx-eval/src/grading/dashboardInspection.ts:429 — handlesMessySeverity matches lower()/upper()/IN-list/'fatal'/LIKE but never SeverityNumber >= 17, even though the ground-truth facts and the data_awareness criterion both list that as valid robust handling, so an agent using the recommended numeric filter is signaled to the judge as a failure.
    • Fix: Add a SeverityNumber comparison clause (e.g. match severitynumber\s*(>=|>|=)) to the handlesMessySeverity heuristic and update the evidence label.
    • correctness
  • packages/hdx-eval/src/grading/dashboardInspection.ts:436 — latencyBrokenDownByEndpoint is computed over the concatenated blob of every tile's config, so spanname from an unrelated top-endpoints table plus the always-present duration from the required heatmap makes the signal fire true even when the actual latency tile is not broken down by endpoint — the exact failure the trap is meant to catch — inflating data_awareness.
    • Fix: Evaluate this signal per latency-type tile rather than over the whole-dashboard concatenated config text.
    • adversarial, correctness
  • packages/hdx-eval/src/grading/dashboardInspection.ts:122 — extractIntendedTileConfigs keys the map by tile name across both dashboards (last-write-wins), and since the scenario requests same-named tiles ("Top error messages", "Recent error logs") in both dashboards, formatDashboardEvidence (which prefers intendedConfig ?? config) shows the judge the wrong dashboard's config for the colliding tile, skewing tile_correctness and computed_expressions; the code comment claiming the judge still sees both configs is inaccurate.
    • Fix: Key the intended-config map by a (dashboardId, tileName) pair or match by tile id instead of name alone.
    • adversarial
  • packages/hdx-eval/src/scenarios/dashboard-build/ground-truth.json:135 — Several high-weight programmatic checks match generic tokens in the agent's free-text answer: handles_impossible_request matches the bare phrase "not available", flags_internal_services matches a bare distractor service name (so an agent that wrongly includes it still scores), and flags_misleading_data matches bare "caveat"/"data quality" — terms the prompt itself primes — inflating the programmatic component of the score.
    • Fix: Require co-occurrence of the concept and a distractor/impossible-metric term, or move these semantic checks to the artifact-based judge.
    • adversarial
  • packages/hdx-eval/src/grading/dashboardInspection.ts:354 — DISTRACTOR_SERVICES and USER_FACING_SERVICES are re-declared here independently of SERVICE_WEIGHTS/ENDPOINTS in generate.ts with no shared source, so renaming or adding a service in the generator silently desyncs the grading heuristics (signals quietly return false rather than erroring).
    • Fix: Export the service-name lists from generate.ts and import them into dashboardInspection.ts so the generator and grader share one definition.
    • maintainability
  • packages/hdx-eval/src/grading/dashboardInspection.ts:81 — The parsing and heuristic-scoring functions that decide the eval outcome (extractDashboardIds, extractIntendedTileConfigs, extractMcpContent, queryTileWithEvidence row/group extraction, getDashboardV2 shape validation, and analyzeDistractorAwareness) have no unit tests, so a regression in any regex/parse branch silently mis-scores runs with nothing to catch it.
    • Fix: Export these functions and add table-driven unit tests over realistic tool-output/API/MCP-response fixtures, including a two-dashboard same-tile-name collision and a latency tile with no SpanName groupBy.
    • testing, correctness, kieran-typescript, reliability, adversarial
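
As an illustration of the handlesMessySeverity finding above, here is a minimal sketch of a heuristic that also credits a numeric SeverityNumber filter. The first three patterns are hypothetical approximations of the existing checks (the actual regexes in dashboardInspection.ts are not shown in this PR page); the last pattern is the one suggested in the review.

```typescript
// Returns true when a tile config shows some robust handling of
// mixed-case SeverityText. Patterns are illustrative approximations.
function handlesMessySeverity(configText: string): boolean {
  const t = configText.toLowerCase();
  const patterns: RegExp[] = [
    /\b(lower|upper)\s*\(\s*severitytext\s*\)/, // case-normalized comparison
    /severitytext\s+i?like\b/, // LIKE / ILIKE matching
    /severitytext\s+in\s*\(/, // explicit list of casing variants
    /severitynumber\s*(>=|>|=)/, // numeric filter, as suggested in the review
  ];
  return patterns.some((p) => p.test(t));
}
```

For example, a tile filtered with `SeverityNumber >= 17` would now be credited, while a plain `SeverityText = 'error'` comparison would still be flagged as fragile.
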
🔵 P3 nitpicks (10)
  • packages/hdx-eval/src/grading/dashboardInspection.ts:424 — scopesToUserFacing's \bin\s*\( also matches inside NOT IN(, and the accompanying every(includes) is satisfied across the whole blob, so per-service tiles that never use a 3-service allow-list can be credited.
    • Fix: Require all three names inside one IN(...) group and exclude the NOT IN case.
  • packages/hdx-eval/src/grading/dashboardInspection.ts:440 — filtersByEnvironment is true whenever deployment.environment appears, so a tile that groups by environment (surfacing the staging/prod blend rather than filtering it) is credited as filtering.
    • Fix: Scope the attribute-name branch to require it appear inside a where/filter, not a groupBy.
  • packages/hdx-eval/src/grading/dashboardInspection.ts:404 — filtersOutDistractors only inspects the first indexOf occurrence of each distractor name, so an exclusion clause on a later occurrence is missed.
    • Fix: Scan all occurrences of each distractor name for a nearby exclusion operator.
  • packages/hdx-eval/src/grading/dashboardInspection.ts:99 — When a save_dashboard payload is not valid JSON, the non-global ID_REGEX returns the first "id" match, which could be a tile/trace id preceding the dashboard id, causing inspection of the wrong id and leaving the real dashboard un-cleaned.
    • Fix: Prefer the parsed top-level id path and anchor the regex fallback to a dashboard-scoped key.
  • packages/hdx-eval/src/hyperdx/api.ts:372 — On inner-JSON parse failure queryTileWithEvidence returns {success:true, hasData: text.length>0}, so a non-JSON payload counts as data; the normal envelope parses fine so this is an edge, but it can inflate tilesWithData.
    • Fix: Return hasData:false (or success:false) when the tool result text cannot be parsed.
  • packages/hdx-eval/src/generators/logs.ts:22 — deriveSeverityNumber re-implements OTel severity numbers via prefix matching (if (u === 'FATAL') return 21) that duplicates SEVERITY_NUMBER_BY_TEXT in templates.ts; the two agree today but are independently maintained and can drift.
    • Fix: Delegate deriveSeverityNumber to the exported SEVERITY_NUMBER_BY_TEXT table.
    • maintainability, kieran-typescript
  • packages/hdx-eval/src/hyperdx/api.ts:256 — getDashboardV2 casts the unvalidated response to HyperdxDashboard whose tile id is typed string, so tile.id ?? tile._id is typed string while it can be undefined at runtime and would be silently dropped by JSON.stringify into the MCP call; correctness verified the current v2 serializer always emits id, so it is not reachable today.
    • Fix: Narrow the tile type to optional fields or guard/skip tiles with a missing id.
    • kieran-typescript, correctness
  • packages/hdx-eval/src/grading/dashboardInspection.ts:208 — HyperdxDashboard omits the containers field the code depends on, forcing a dashboard as Record<string, unknown> plus a secondary array cast at the call site with no type checking of the container shape.
    • Fix: Add an optional containers field to HyperdxDashboard and drop the casts.
    • kieran-typescript
  • packages/hdx-eval/src/grading/dashboardInspection.ts:219 — The tile filter callback re-annotates the already-known tile element type as Record<string, unknown>, discarding structural typing on containerId in the same function that otherwise relies on the real tile shape.
    • Fix: Drop the annotation and let TS infer the parameter from dashboard.tiles.
    • kieran-typescript
  • packages/hdx-eval/src/scenarios/dashboard-build/generate.ts:500 — crossDashboardOnClickValid is computed into summary but never included in the evidence string sent to the judge, and it assumes an onClick.target.mode==='id' shape; if the real onClick schema differs it is silently always-false as a persisted metric.
    • Fix: Verify the onClick schema and either surface this signal in the judge evidence or document it as diagnostic-only.
    • adversarial
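
The first nitpick above (the scopesToUserFacing allow-list check) can be sketched as follows. This is a minimal illustration, not the PR's code: the function name hasAllowListFor is hypothetical, and the third service name in the usage example is a placeholder because only order-service and inventory-service are named on this page. It requires every service to appear inside a single IN(...) group and skips NOT IN(...) groups.

```typescript
// True only if one IN (...) group (not NOT IN) quotes every given service.
function hasAllowListFor(configText: string, services: readonly string[]): boolean {
  const t = configText.toLowerCase();
  // Group 1 captures an optional leading NOT; group 2 captures the list body.
  const inGroup = /(\bnot\s+)?\bin\s*\(([^)]*)\)/g;
  let m: RegExpExecArray | null;
  while ((m = inGroup.exec(t)) !== null) {
    if (m[1] !== undefined) continue; // skip NOT IN (...)
    const body = m[2] ?? "";
    if (services.every((s) => body.includes(`'${s.toLowerCase()}'`))) return true;
  }
  return false;
}

// Placeholder list: 'checkout-service' is an invented name for illustration.
const EXAMPLE_USER_FACING = ["order-service", "inventory-service", "checkout-service"];
```

Evaluating this per tile config, rather than over the concatenated blob, would also avoid crediting dashboards whose three names are spread across unrelated tiles.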

Reviewers (9): correctness, adversarial, reliability, testing, maintainability, project-standards, kieran-typescript, agent-native, learnings-researcher.

Testing gaps:

  • No unit coverage for dashboardInspection.ts parsing/heuristic functions or api.ts getDashboardV2/queryTileWithEvidence/extractMcpContent JSON-RPC/SSE branches — the highest-risk new logic.
  • deriveSeverityNumber's prefix-matching fallback branches are untested; all current callers supply canonical text or an explicit severityNumber.
  • Generator error-path/red-herring assertions run against a single fixed seed (42); the debug-proxy and order-service-spike error-rate bounds have the tightest statistical margins and would need re-tuning if the RNG or volume factor changes.

