feat(hdx-eval): add scenario hooks for custom prompts, tools, grading, and inspection by brandon-pereira · Pull Request #2547 · hyperdxio/hyperdx

brandon-pereira · 2026-06-29T21:25:57Z

Summary

Adds optional hooks to the Scenario type so new eval scenario kinds can customize harness and grading behavior without modifying framework files. All existing scenarios are unaffected — hooks are optional with defaults preserved.

Hooks

Hook	Purpose
`buildSystemPrompt(ctx)`	Replace the default SRE-investigation system prompt
`allowedToolPatterns`	Selectively unblock denied MCP tools (e.g. dashboard tools)
`judgeSystemPreamble`	Replace the default LLM judge preamble
`postRunInspection(ctx)`	Inspect artifacts post-run, collect evidence for the judge, clean up
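
As a rough illustration of how the four hooks in the table might sit on the Scenario type, here is a minimal sketch. Only the hook names, `SystemPromptContext`, `PostRunInspectionContext`, `variant`, `anchorTimeIso`, and `cleanup` appear in this PR; every field type, the `PostRunInspectionResult` shape, and the example scenarios are assumptions made for illustration, not the actual definitions in types.ts.

```typescript
// Hypothetical shapes. Field types are assumptions, not copied from types.ts.
interface SystemPromptContext {
  anchorTimeIso: string;
  variant: string; // lets a hook adapt to hypothesis-mode runs
}

interface PostRunInspectionContext {
  anchorTimeIso?: string;
  // Cleanup is the hook's responsibility; this signature is assumed.
  cleanup: (ids: string[]) => Promise<void>;
}

interface PostRunInspectionResult {
  summary: unknown; // opaque JSON stored on the grade record
  evidence: string; // text handed to the LLM judge
}

interface Scenario {
  name: string;
  // All four hooks are optional; omitting one keeps the default behavior.
  buildSystemPrompt?: (ctx: SystemPromptContext) => string;
  allowedToolPatterns?: string[];
  judgeSystemPreamble?: string;
  postRunInspection?: (
    ctx: PostRunInspectionContext,
  ) => Promise<PostRunInspectionResult>;
}

// An existing-style scenario: no hooks, so nothing changes for it.
const investigation: Scenario = { name: "investigation-example" };

// A new scenario kind that opts in to every hook (all values are made up).
const dashboardLike: Scenario = {
  name: "dashboard-like-example",
  buildSystemPrompt: (ctx) =>
    `Build a dashboard. Anchor time: ${ctx.anchorTimeIso}. Variant: ${ctx.variant}.`,
  allowedToolPatterns: ["dashboard"],
  judgeSystemPreamble: "You are grading a dashboard-building task.",
  postRunInspection: async (ctx) => {
    const createdIds = ["example-id"]; // placeholder for discovered artifacts
    await ctx.cleanup(createdIds);
    return {
      summary: { created: createdIds.length },
      evidence: `${createdIds.length} dashboard(s) created`,
    };
  },
};
```

Because every hook is an optional property, a scenario object written before this change still type-checks unchanged.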

Why

Adding the dashboard-build eval (separate PR) required changes to 10+ framework files because ScenarioKind was threaded through the entire call stack. With hooks, adding a new scenario kind requires only the scenario files + one import line in index.ts.

Other changes

runCell takes a Scenario object instead of a scenario name string (needed for hook dispatch)
GradeRecord has generic inspectionSummary and inspectionEvidence fields
Re-grade safety: cached inspection summary + evidence are reused on --rerun-judge instead of re-firing the hook (artifacts may have been cleaned up)
SystemPromptContext includes variant so custom hooks can adapt to hypothesis-mode runs
buildInspectionConfig() helper extracted (used by both run and grade commands)
CLI: --email/--password options on run and grade commands (extracted to DEFAULT_EVAL_EMAIL/DEFAULT_EVAL_PASSWORD constants)
.gitignore: ignore eval.config.*.json backup configs

Test plan

149 unit tests pass (16 suites)
TypeScript compiles clean
All existing scenarios produce identical results (hooks are optional, defaults preserved)
New test: buildSystemPrompt hook dispatch verified with a fake scenario

…, and inspection

Add optional hooks to the Scenario type so new scenario kinds can customize harness and grading behavior without modifying framework files.

Hooks:
- buildSystemPrompt(ctx): replace the default investigation system prompt
- allowedToolPatterns: selectively unblock denied MCP tools
- judgeSystemPreamble: replace the default LLM judge preamble
- postRunInspection(ctx): inspect artifacts post-run, collect evidence for the judge, and clean up

When a hook is not provided, the existing investigation behavior is preserved, so all existing scenarios are unaffected.

Adding a new scenario kind (e.g. dashboard-build, alert-build) now requires only the scenario files + one import line in index.ts. Zero changes to harness, grading, CLI, or type files.

Other changes:
- runCell takes a Scenario object instead of a scenario name string
- GradeRecord has a generic inspectionSummary field (opaque JSON)
- CLI: --email/--password options for post-run inspection auth
- CLI: auto-refresh stale anchor time (>12h) with ClickHouse data freshness check to avoid unnecessary re-seeds

…kup configs

- Extract DEFAULT_EVAL_EMAIL and DEFAULT_EVAL_PASSWORD constants in cli.ts (previously hardcoded in 3 places: setup-hyperdx, run, grade)
- Add eval.config.*.json to .gitignore so backup configs (eval.config.branch.json, eval.config.main.json) are not tracked

vercel · 2026-06-29T21:26:04Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
hyperdx-oss	Ready	Preview, Comment	Jul 1, 2026 2:52pm
hyperdx-storybook	Ready	Preview, Comment	Jul 1, 2026 2:52pm

github-actions · 2026-06-29T21:26:22Z

🟡 Tier 3 — Standard

Introduces new logic, modifies core functionality, or touches areas with non-trivial risk.

Why this tier:

Diff size: 391 production lines changed (Tier 2 max: < 250)

Review process: Full human review — logic, architecture, edge cases.
SLA: First-pass feedback within 1 business day.

Stats

Production files changed: 11
Production lines changed: 391 (+ 33 in test files, excluded from tier calculation)
Branch: brandon/eval-scenario-hooks
Author: brandon-pereira

To override this classification, remove the review/tier-3 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

greptile-apps · 2026-06-29T21:30:01Z

Greptile Summary

This PR introduces an optional hook system on the Scenario type (buildSystemPrompt, allowedToolPatterns, judgeSystemPreamble, postRunInspection) so new eval scenario kinds can customize harness and grading behavior without touching framework files. All existing scenarios are unaffected — hooks default to the existing SRE-investigation behavior.

Hook dispatch: systemPrompt.ts, settingsFile.ts, judge.ts, and grade.ts each check for the relevant hook and fall back to the original logic when absent; variant is correctly forwarded to SystemPromptContext and anchorTimeIso is now included in the grade command's inspection config.
Re-grade safety: gradeOne reuses the cached inspectionSummary/inspectionEvidence from an existing grade record before calling the hook, so --rerun-judge works even after artifacts have been cleaned up.
CLI ergonomics: DEFAULT_EVAL_EMAIL/DEFAULT_EVAL_PASSWORD are extracted as constants; --email/--password options are added to both the run and grade commands; buildInspectionConfig is extracted to avoid duplication.
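
The hook dispatch described in the first bullet can be pictured roughly as follows. This is a sketch under assumptions: the `buildSystemPrompt(scenario, anchorTimeIso, variant)` call shape follows the sequence diagram in this review, but the context fields' types and the `defaultInvestigationPrompt` helper (with its placeholder text) are hypothetical stand-ins for whatever systemPrompt.ts actually contains.

```typescript
interface SystemPromptContext {
  anchorTimeIso: string;
  variant: string; // assumed to be a string; the real type may be a union
}

interface Scenario {
  name: string;
  buildSystemPrompt?: (ctx: SystemPromptContext) => string;
}

// Hypothetical stand-in for the default SRE-investigation prompt.
function defaultInvestigationPrompt(ctx: SystemPromptContext): string {
  return `Investigate the incident. Anchor time: ${ctx.anchorTimeIso}.`;
}

// Dispatch: use the scenario's hook when present, otherwise the default.
function buildSystemPrompt(
  scenario: Scenario,
  anchorTimeIso: string,
  variant: string,
): string {
  const ctx: SystemPromptContext = { anchorTimeIso, variant };
  if (scenario.buildSystemPrompt) {
    return scenario.buildSystemPrompt(ctx);
  }
  return defaultInvestigationPrompt(ctx);
}
```

A fake scenario whose hook returns a fixed string is enough to exercise both branches, which is presumably close to what the new unit test in this PR does.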

Confidence Score: 4/5

Safe to merge for existing scenarios; new hook infrastructure works correctly for the baseline case.

The hook dispatch, re-grade cache, and CLI wiring are all correct. The one remaining concern is that formatInspectionLogBit in grade.ts is dashboard-specific and will silently produce empty log output for any future non-dashboard inspection hook, making the console output misleading as the hook system is adopted by more scenario types.

packages/hdx-eval/src/grading/grade.ts — formatInspectionLogBit and the silent skip when hyperdxApi is absent.
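
One possible way to address the logging concern, sketched here purely as an illustration: format whatever top-level primitive fields the opaque summary carries instead of reading dashboard-specific keys. The function name `formatInspectionLogBitGeneric` and the output format are hypothetical and are not part of this PR.

```typescript
// Hypothetical generic formatter: works on any JSON-like summary object.
function formatInspectionLogBitGeneric(summary: unknown): string {
  // No summary (or a non-object summary) means no inspection ran: log nothing.
  if (summary === null || typeof summary !== "object" || Array.isArray(summary)) {
    return "";
  }
  const parts: string[] = [];
  for (const [key, value] of Object.entries(summary as Record<string, unknown>)) {
    // Only primitives are printed; nested values are skipped to keep the line short.
    if (
      typeof value === "string" ||
      typeof value === "number" ||
      typeof value === "boolean"
    ) {
      parts.push(`${key}=${value}`);
    }
  }
  // Still signal that an inspection happened even if nothing was printable.
  return parts.length > 0
    ? ` inspection(${parts.join(", ")})`
    : " inspection(recorded)";
}
```

With this shape, a non-dashboard hook that returns an unfamiliar summary still leaves a visible trace in the console output rather than an empty string.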

Important Files Changed

Filename	Overview
packages/hdx-eval/src/scenarios/types.ts	Adds four optional hook fields to the Scenario type plus supporting context/result types; design is well-documented and backward-compatible.
packages/hdx-eval/src/grading/grade.ts	Adds post-run inspection logic with cache-reuse guard for re-grades; formatInspectionLogBit is coupled to dashboard-specific summary fields.
packages/hdx-eval/src/harness/settingsFile.ts	Adds allowedToolPatterns filtering to deniedToolsFor and buildSettings; correctly applies only to per-MCP denied tools.
packages/hdx-eval/src/cli.ts	Extracts credential constants, adds --email/--password to run and grade commands, and adds buildInspectionConfig helper; anchorTimeIso is no longer dropped in the grade command.
packages/hdx-eval/src/harness/systemPrompt.ts	Dispatches to scenario.buildSystemPrompt hook when present; variant is correctly forwarded to the custom hook via SystemPromptContext.
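
To make the settingsFile.ts row concrete, here is a minimal sketch of how `allowedToolPatterns` could trim a deny list. The matching rule (plain substring match), the helper name `filterDeniedTools`, and the tool names are all assumptions; the real `deniedToolsFor` may use globs or regular expressions.

```typescript
// Hypothetical helper: drop any denied tool that matches an allowed pattern.
function filterDeniedTools(
  denied: readonly string[],
  allowedToolPatterns: readonly string[] = [],
): string[] {
  // No patterns means the deny list is returned unchanged (default behavior).
  if (allowedToolPatterns.length === 0) {
    return [...denied];
  }
  return denied.filter(
    (tool) => !allowedToolPatterns.some((pattern) => tool.includes(pattern)),
  );
}

// Made-up tool names, for illustration only.
const exampleDenied = [
  "mcp__example__create_dashboard",
  "mcp__example__delete_source",
];

// A dashboard-style scenario unblocks only the dashboard tool.
const remainingDenied = filterDeniedTools(exampleDenied, ["dashboard"]);
```

Here `remainingDenied` still contains the source-deletion tool, so a scenario can only unblock what its patterns name.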

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant CLI as cli.ts
    participant RR as runRun.ts
    participant SP as systemPrompt.ts
    participant CS as claudeSpawn.ts
    participant SF as settingsFile.ts
    participant GB as gradeBatch
    participant GO as gradeOne
    participant JT as judge.ts

    CLI->>RR: runCell(scenario, ...)
    RR->>SP: buildSystemPrompt(scenario, anchorTimeIso, variant)
    SP-->>SP: scenario.buildSystemPrompt?(ctx) or default
    SP-->>RR: systemPromptAppend
    RR->>CS: "runClaude({allowedToolPatterns, ...})"
    CS->>SF: deniedToolsFor(variant, def, allowedToolPatterns)
    SF-->>CS: filtered deny list
    CS-->>RR: SpawnResult
    RR-->>CLI: RunRecord

    CLI->>GB: "gradeBatch(batchDir, {inspectionConfig})"
    GB->>GO: gradeOne(record, existing, opts)
    alt existing.inspectionSummary present
        GO-->>GO: reuse cached inspectionResult
    else scenario.postRunInspection and inspectionConfig
        GO->>GO: scenario.postRunInspection(ctx)
    end
    GO->>JT: "judgeTrajectory({judgeSystemPreamble, inspectionEvidence})"
    JT-->>GO: JudgeResult
    GO-->>GB: GradeRecord
    GB-->>CLI: GradeBatchSummary

Reviews (6): Last reviewed commit: "Merge branch 'main' into brandon/eval-sc..."

…ocs, add variant to hooks

- Restore removed comments in grade.ts (tool-error penalty math, needsJudge decision, resolveBatchDir path resolution)
- Restore removed comments in systemPrompt.ts (schema reference, anchor time explanation, hypothesis playbook description)
- Fix cleanupIds JSDoc: clarify cleanup is the hook's responsibility via PostRunInspectionContext.cleanup, not a framework step
- Add variant to SystemPromptContext so custom hooks can adapt to hypothesis-mode runs
- Add anchorTimeIso to grade command's inspectionConfig (was missing, hooks received undefined on standalone re-grades)

changeset-bot · 2026-06-29T21:35:09Z

🦋 Changeset detected

Latest commit: b190e46

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
@hyperdx/hdx-eval	Patch

github-actions · 2026-06-29T21:35:53Z

E2E Test Results

✅ All tests passed • 223 passed • 3 skipped • 1509s

Status	Count
✅ Passed	223
❌ Failed	0
⚠️ Flaky	3
⏭️ Skipped	3

Tests ran across 4 shards in parallel.

…ion on re-grade

- Extract buildInspectionConfig() helper used by both run and grade commands (was inlined in 2 places with inconsistent fields)
- Fix P1: re-grading (e.g. --rerun-judge) no longer re-fires the postRunInspection hook. On first grade, the evidence string and summary are persisted in the grade record. On re-grade, both are reused from the cached record, since artifacts may have been cleaned up and re-running the hook would fail or produce empty evidence.
- Add inspectionEvidence field to GradeRecord so the judge can see the artifact evidence even on re-grades
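
The re-grade fix in this commit can be sketched as a small resolver that prefers the cached record over the hook. The branch order mirrors the sequence diagram in the Greptile summary (cached `inspectionSummary` first, then `postRunInspection` only when an inspection config exists). The helper name `resolveInspection`, the `InspectionResult` shape, and the `InspectionConfig` fields are assumptions, not the code in grade.ts.

```typescript
interface InspectionResult {
  summary: unknown;
  evidence: string;
}

// Only the two inspection fields are modeled; the real GradeRecord has more.
interface GradeRecord {
  inspectionSummary?: unknown;
  inspectionEvidence?: string;
}

interface InspectionConfig {
  anchorTimeIso?: string;
}

interface Scenario {
  postRunInspection?: (ctx: InspectionConfig) => Promise<InspectionResult>;
}

// Hypothetical helper: reuse cached inspection data before firing the hook.
async function resolveInspection(
  scenario: Scenario,
  existing: GradeRecord | undefined,
  inspectionConfig: InspectionConfig | undefined,
): Promise<InspectionResult | undefined> {
  // Re-grade path: artifacts may already be cleaned up, so trust the cache.
  if (existing?.inspectionSummary !== undefined) {
    return {
      summary: existing.inspectionSummary,
      evidence: existing.inspectionEvidence ?? "",
    };
  }
  // First-grade path: run the hook only if the scenario defines one
  // and the caller supplied an inspection config.
  if (scenario.postRunInspection && inspectionConfig) {
    return scenario.postRunInspection(inspectionConfig);
  }
  // Scenarios without the hook are graded exactly as before.
  return undefined;
}
```

Persisting both the summary and the evidence string is what makes the cached branch sufficient: the judge sees the same evidence on a re-grade without the hook touching artifacts that no longer exist.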

brandon-pereira added 2 commits June 29, 2026 15:18

github-actions Bot added the review/tier-3 Standard — full human review required label Jun 29, 2026

greptile-apps Bot reviewed Jun 29, 2026

Comment thread packages/hdx-eval/src/cli.ts Outdated

Comment thread packages/hdx-eval/src/scenarios/types.ts

vercel Bot temporarily deployed to Preview – hyperdx-storybook June 29, 2026 21:35 Inactive

brandon-pereira added 2 commits June 29, 2026 15:36

chore: add changeset for eval scenario hooks

a424f03

fix: remove anchor staleness logic from framework PR

2ec80e2

The anchor staleness auto-refresh (12h check + ClickHouse data freshness query) is dashboard-scenario-specific behavior that belongs in the dashboard-build PR, not the framework hooks PR.

vercel Bot temporarily deployed to Preview – hyperdx-storybook June 29, 2026 21:51 Inactive

greptile-apps Bot reviewed Jun 29, 2026

Comment thread packages/hdx-eval/src/grading/grade.ts Outdated

vercel Bot temporarily deployed to Preview – hyperdx-storybook June 29, 2026 21:59 Inactive

brandon-pereira requested review from a team and fleon and removed request for a team June 30, 2026 20:31

fleon approved these changes Jul 1, 2026

brandon-pereira added the automerge label Jul 1, 2026

Merge branch 'main' into brandon/eval-scenario-hooks

a927a61

vercel Bot deployed to Preview – hyperdx-storybook July 1, 2026 12:44 View deployment

vercel Bot deployed to Preview – hyperdx-oss July 1, 2026 12:46 View deployment

Merge branch 'main' into brandon/eval-scenario-hooks

b190e46

vercel Bot deployed to Preview – hyperdx-storybook July 1, 2026 14:51 View deployment

vercel Bot deployed to Preview – hyperdx-oss July 1, 2026 14:52 View deployment

kodiakhq Bot merged commit 64d0bbe into main Jul 1, 2026
28 of 30 checks passed

kodiakhq Bot deleted the brandon/eval-scenario-hooks branch July 1, 2026 15:30

brandon-pereira mentioned this pull request Jul 1, 2026

feat(hdx-eval): add dashboard-build eval scenario #2571

Open

kodiakhq[bot] merged 8 commits into main from brandon/eval-scenario-hooks
