feat(hdx-eval): add scenario hooks for custom prompts, tools, grading, and inspection #2547

Merged: kodiakhq[bot] merged 8 commits into main from brandon/eval-scenario-hooks on Jul 1, 2026

brandon-pereira (Member) commented Jun 29, 2026

Summary

Adds optional hooks to the Scenario type so that new eval scenario kinds can customize harness and grading behavior without modifying framework files. Existing scenarios are unaffected: every hook is optional, and the default behavior applies whenever a hook is absent.

Hooks

| Hook | Purpose |
| --- | --- |
| buildSystemPrompt(ctx) | Replace the default SRE-investigation system prompt |
| allowedToolPatterns | Selectively unblock denied MCP tools (e.g. dashboard tools) |
| judgeSystemPreamble | Replace the default LLM judge preamble |
| postRunInspection(ctx) | Inspect artifacts post-run, collect evidence for the judge, clean up |
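
To make the table concrete, here is a minimal sketch of how the four hooks might sit on the Scenario type. Only the hook names, their optionality, and the existence of SystemPromptContext, PostRunInspectionContext and its cleanup member come from this PR. Every field shape, the RegExp pattern type, the result type, and both example scenarios are assumptions for illustration, not the actual definitions in types.ts.

```typescript
// Hypothetical shapes: only the four hook names are taken from the PR.
interface SystemPromptContext {
  anchorTimeIso: string;
  variant: string; // assumed to be a plain string here
}

interface PostRunInspectionContext {
  anchorTimeIso?: string;
  // Assumed signature; the PR only says cleanup is the hook's responsibility.
  cleanup: (ids: string[]) => Promise<void>;
}

interface PostRunInspectionResult {
  summary: unknown; // opaque JSON persisted on the grade record
  evidence: string; // text handed to the LLM judge
}

interface Scenario {
  name: string;
  buildSystemPrompt?: (ctx: SystemPromptContext) => string;
  allowedToolPatterns?: RegExp[]; // pattern type is an assumption
  judgeSystemPreamble?: string;
  postRunInspection?: (
    ctx: PostRunInspectionContext,
  ) => Promise<PostRunInspectionResult>;
}

// An existing-style scenario defines no hooks, so defaults apply everywhere.
const investigation: Scenario = { name: "example-investigation" };

// A new scenario kind opts in to each customization point it needs.
const dashboardBuild: Scenario = {
  name: "dashboard-build",
  buildSystemPrompt: (ctx) =>
    `Build the requested dashboard. Data is anchored at ${ctx.anchorTimeIso}.`,
  allowedToolPatterns: [/dashboard/],
  judgeSystemPreamble: "You are grading a dashboard built by an agent.",
  postRunInspection: async (ctx) => {
    const createdIds = ["dash-1"]; // placeholder for IDs found by inspection
    await ctx.cleanup(createdIds); // the hook cleans up after itself
    return {
      summary: { dashboards: createdIds.length },
      evidence: `${createdIds.length} dashboard(s) created`,
    };
  },
};
```

Because every hook is optional, the first object type-checks unchanged, which is the backward-compatibility property the summary describes.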

Why

Adding the dashboard-build eval (separate PR) required changes to 10+ framework files because ScenarioKind was threaded through the entire call stack. With hooks, adding a new scenario kind requires only the scenario files plus one import line in index.ts.

Other changes

  • runCell takes a Scenario object instead of a scenario name string (needed for hook dispatch)
  • GradeRecord has generic inspectionSummary and inspectionEvidence fields
  • Re-grade safety: on --rerun-judge, the cached inspection summary and evidence are reused instead of re-running the hook, because the artifacts may already have been cleaned up
  • SystemPromptContext includes variant so custom hooks can adapt to hypothesis-mode runs
  • buildInspectionConfig() helper extracted (used by both run and grade commands)
  • CLI: --email/--password options on the run and grade commands (the previously hardcoded credentials are extracted into DEFAULT_EVAL_EMAIL/DEFAULT_EVAL_PASSWORD constants)
  • .gitignore: ignore eval.config.*.json backup configs
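
The re-grade safety rule in the list above can be sketched as follows. This is a simplified illustration, not the code in grade.ts: the function name resolveInspection, the trimmed-down types, and the fake scenario are all hypothetical. Only the ordering (use the cached summary and evidence first, otherwise call the hook when both hook and config exist) follows what the PR describes.

```typescript
// Hypothetical, trimmed-down types for illustration only.
interface InspectionResult {
  summary: unknown;
  evidence: string;
}

interface GradeRecord {
  inspectionSummary?: unknown;
  inspectionEvidence?: string;
}

interface InspectionConfig {
  anchorTimeIso?: string;
}

interface Scenario {
  name: string;
  postRunInspection?: (ctx: InspectionConfig) => Promise<InspectionResult>;
}

async function resolveInspection(
  scenario: Scenario,
  existing: GradeRecord | undefined,
  config: InspectionConfig | undefined,
): Promise<InspectionResult | undefined> {
  // Re-grade path: reuse what the first grade persisted, since the
  // artifacts the hook would inspect may already be gone.
  if (existing?.inspectionSummary !== undefined) {
    return {
      summary: existing.inspectionSummary,
      evidence: existing.inspectionEvidence ?? "",
    };
  }
  // First-grade path: call the hook only when both hook and config exist.
  if (scenario.postRunInspection && config) {
    return scenario.postRunInspection(config);
  }
  return undefined;
}

// Usage: the hook runs once; the second call is served from the record.
let hookCalls = 0;
const fakeScenario: Scenario = {
  name: "fake",
  postRunInspection: async () => {
    hookCalls += 1;
    return { summary: { found: 1 }, evidence: "found 1 artifact" };
  },
};

async function demo(): Promise<void> {
  const first = await resolveInspection(fakeScenario, undefined, {});
  const record: GradeRecord = {
    inspectionSummary: first?.summary,
    inspectionEvidence: first?.evidence,
  };
  const second = await resolveInspection(fakeScenario, record, {});
  console.log(hookCalls, second?.evidence);
}
```

Persisting the evidence string alongside the summary is what lets the judge see the same artifact evidence on a re-grade without touching the (possibly deleted) artifacts.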

Test plan

  • 149 unit tests pass (16 suites)
  • TypeScript compiles clean
  • All existing scenarios produce identical results (hooks are optional, defaults preserved)
  • New test: buildSystemPrompt hook dispatch verified with a fake scenario
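
For readers who want to see what hook dispatch with a fake scenario looks like, here is a minimal sketch. The buildSystemPrompt(scenario, anchorTimeIso, variant) call shape follows the sequence diagram in this PR. The default prompt text, the Variant values, and both scenarios are stand-ins invented for the example, and the real test in the suite may be structured differently.

```typescript
// Assumed variant values; the PR only mentions hypothesis-mode runs.
type Variant = "baseline" | "hypothesis";

interface SystemPromptContext {
  anchorTimeIso: string;
  variant: Variant;
}

interface Scenario {
  name: string;
  buildSystemPrompt?: (ctx: SystemPromptContext) => string;
}

// Stand-in for the default SRE-investigation prompt.
function defaultSystemPrompt(ctx: SystemPromptContext): string {
  return `Investigate the incident. Anchor time: ${ctx.anchorTimeIso}.`;
}

// Dispatch: use the scenario's hook when present, otherwise the default.
function buildSystemPrompt(
  scenario: Scenario,
  anchorTimeIso: string,
  variant: Variant,
): string {
  const ctx: SystemPromptContext = { anchorTimeIso, variant };
  return scenario.buildSystemPrompt?.(ctx) ?? defaultSystemPrompt(ctx);
}

// A fake scenario whose hook echoes the variant it received.
const fake: Scenario = {
  name: "fake",
  buildSystemPrompt: (ctx) => `FAKE:${ctx.variant}`,
};
const plain: Scenario = { name: "plain" };

console.log(buildSystemPrompt(fake, "2026-06-29T00:00:00Z", "hypothesis"));
// prints "FAKE:hypothesis"
console.log(buildSystemPrompt(plain, "2026-06-29T00:00:00Z", "baseline"));
// prints "Investigate the incident. Anchor time: 2026-06-29T00:00:00Z."
```

Echoing the variant from the fake hook also checks that variant reaches the custom hook through SystemPromptContext.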

…, and inspection

Add optional hooks to the Scenario type so new scenario kinds can
customize harness and grading behavior without modifying framework files.

Hooks:
- buildSystemPrompt(ctx): replace the default investigation system prompt
- allowedToolPatterns: selectively unblock denied MCP tools
- judgeSystemPreamble: replace the default LLM judge preamble
- postRunInspection(ctx): inspect artifacts post-run, collect evidence
  for the judge, and clean up

When a hook is not provided, the existing investigation behavior is
preserved — all existing scenarios are unaffected.

Adding a new scenario kind (e.g. dashboard-build, alert-build) now
requires only the scenario files + one import line in index.ts. Zero
changes to harness, grading, CLI, or type files.

Other changes:
- runCell takes a Scenario object instead of a scenario name string
- GradeRecord has a generic inspectionSummary field (opaque JSON)
- CLI: --email/--password options for post-run inspection auth
- CLI: auto-refresh stale anchor time (>12h) with ClickHouse data
  freshness check to avoid unnecessary re-seeds
…kup configs

- Extract DEFAULT_EVAL_EMAIL and DEFAULT_EVAL_PASSWORD constants in
  cli.ts — previously hardcoded in 3 places (setup-hyperdx, run, grade)
- Add eval.config.*.json to .gitignore so backup configs
  (eval.config.branch.json, eval.config.main.json) are not tracked

vercel Bot commented Jun 29, 2026

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| hyperdx-oss | Ready | Preview, Comment | Jul 1, 2026 2:52pm |
| hyperdx-storybook | Ready | Preview, Comment | Jul 1, 2026 2:52pm |

@github-actions github-actions Bot added the review/tier-3 Standard — full human review required label Jun 29, 2026

github-actions Bot commented Jun 29, 2026

🟡 Tier 3 — Standard

Introduces new logic, modifies core functionality, or touches areas with non-trivial risk.

Why this tier:

  • Diff size: 391 production lines changed (Tier 2 max: < 250)

Review process: Full human review — logic, architecture, edge cases.
SLA: First-pass feedback within 1 business day.

Stats
  • Production files changed: 11
  • Production lines changed: 391 (+ 33 in test files, excluded from tier calculation)
  • Branch: brandon/eval-scenario-hooks
  • Author: brandon-pereira

To override this classification, remove the review/tier-3 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.


greptile-apps Bot commented Jun 29, 2026

Greptile Summary

This PR introduces an optional hook system on the Scenario type (buildSystemPrompt, allowedToolPatterns, judgeSystemPreamble, postRunInspection) so new eval scenario kinds can customize harness and grading behavior without touching framework files. All existing scenarios are unaffected — hooks default to the existing SRE-investigation behavior.

  • Hook dispatch: systemPrompt.ts, settingsFile.ts, judge.ts, and grade.ts each check for the relevant hook and fall back to the original logic when absent; variant is correctly forwarded to SystemPromptContext and anchorTimeIso is now included in the grade command's inspection config.
  • Re-grade safety: gradeOne reuses the cached inspectionSummary/inspectionEvidence from an existing grade record before calling the hook, so --rerun-judge works even after artifacts have been cleaned up.
  • CLI ergonomics: DEFAULT_EVAL_EMAIL/DEFAULT_EVAL_PASSWORD are extracted as constants; --email/--password options are added to both the run and grade commands; buildInspectionConfig is extracted to avoid duplication.
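
The allowedToolPatterns behavior mentioned in the first bullet can be illustrated with a small filter over a deny list. This is a sketch under assumptions: the helper name filterDeniedTools, the use of RegExp patterns, and the tool names are hypothetical, and the real deniedToolsFor in settingsFile.ts also takes a variant and an MCP definition that are omitted here.

```typescript
// Hypothetical helper: drops deny-list entries matched by an allow pattern.
// Patterns should not use the global flag, since RegExp.test is stateful
// for /g patterns.
function filterDeniedTools(
  denied: readonly string[],
  allowed: readonly RegExp[] = [],
): string[] {
  return denied.filter(
    (tool) => !allowed.some((pattern) => pattern.test(tool)),
  );
}

// Made-up tool names in the mcp__<server>__<tool> style.
const deniedByDefault = [
  "mcp__hyperdx__create_dashboard",
  "mcp__hyperdx__delete_dashboard",
  "mcp__hyperdx__delete_source",
];

// A dashboard scenario unblocks only the dashboard tools.
console.log(filterDeniedTools(deniedByDefault, [/dashboard/]));
// prints [ 'mcp__hyperdx__delete_source' ]

// A scenario without the hook keeps the full deny list.
console.log(filterDeniedTools(deniedByDefault));
```

Treating a missing pattern list as an empty one keeps the default deny list intact for every scenario that does not set the hook.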

Confidence Score: 4/5

Safe to merge for existing scenarios; new hook infrastructure works correctly for the baseline case.

The hook dispatch, re-grade cache, and CLI wiring are all correct. The one remaining concern is that formatInspectionLogBit in grade.ts is dashboard-specific and will silently produce empty log output for any future non-dashboard inspection hook, making the console output misleading as the hook system is adopted by more scenario types.

packages/hdx-eval/src/grading/grade.ts — formatInspectionLogBit and the silent skip when hyperdxApi is absent.
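
One way the logging concern above could be addressed is a formatter that does not depend on dashboard-specific field names. The sketch below is not what this PR implements. The function name formatGenericInspectionLogBit, the output format, and the example summaries are all assumptions, shown only to illustrate how an opaque summary might still produce non-empty log output.

```typescript
// Hypothetical generic formatter for an opaque inspection summary.
// It prints every top-level scalar field instead of looking for
// dashboard-specific keys, so unknown summary shapes still show up in logs.
function formatGenericInspectionLogBit(summary: unknown): string {
  if (summary === null || typeof summary !== "object") {
    return "inspection: (no summary)";
  }
  const parts = Object.entries(summary as Record<string, unknown>)
    .filter(([, value]) =>
      ["string", "number", "boolean"].includes(typeof value),
    )
    .map(([key, value]) => `${key}=${String(value)}`);
  return parts.length > 0
    ? `inspection: ${parts.join(" ")}`
    : "inspection: (no scalar fields)";
}

// A dashboard-style summary and an invented alert-style summary both log.
console.log(formatGenericInspectionLogBit({ dashboards: 2, tiles: 5 }));
// prints "inspection: dashboards=2 tiles=5"
console.log(
  formatGenericInspectionLogBit({ alertsCreated: 1, rules: ["cpu-high"] }),
);
// prints "inspection: alertsCreated=1"
console.log(formatGenericInspectionLogBit(undefined));
// prints "inspection: (no summary)"
```

Printing an explicit placeholder for missing or empty summaries makes a skipped inspection visible in the console rather than silent.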

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/hdx-eval/src/scenarios/types.ts | Adds four optional hook fields to the Scenario type plus supporting context/result types; design is well-documented and backward-compatible. |
| packages/hdx-eval/src/grading/grade.ts | Adds post-run inspection logic with cache-reuse guard for re-grades; formatInspectionLogBit is coupled to dashboard-specific summary fields. |
| packages/hdx-eval/src/harness/settingsFile.ts | Adds allowedToolPatterns filtering to deniedToolsFor and buildSettings; correctly applies only to per-MCP denied tools. |
| packages/hdx-eval/src/cli.ts | Extracts credential constants, adds --email/--password to run and grade commands, and adds buildInspectionConfig helper; anchorTimeIso is no longer dropped in the grade command. |
| packages/hdx-eval/src/harness/systemPrompt.ts | Dispatches to scenario.buildSystemPrompt hook when present; variant is correctly forwarded to the custom hook via SystemPromptContext. |

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant CLI as cli.ts
    participant RR as runRun.ts
    participant SP as systemPrompt.ts
    participant CS as claudeSpawn.ts
    participant SF as settingsFile.ts
    participant GB as gradeBatch
    participant GO as gradeOne
    participant JT as judge.ts

    CLI->>RR: runCell(scenario, ...)
    RR->>SP: buildSystemPrompt(scenario, anchorTimeIso, variant)
    SP-->>SP: scenario.buildSystemPrompt?(ctx) or default
    SP-->>RR: systemPromptAppend
    RR->>CS: "runClaude({allowedToolPatterns, ...})"
    CS->>SF: deniedToolsFor(variant, def, allowedToolPatterns)
    SF-->>CS: filtered deny list
    CS-->>RR: SpawnResult
    RR-->>CLI: RunRecord

    CLI->>GB: "gradeBatch(batchDir, {inspectionConfig})"
    GB->>GO: gradeOne(record, existing, opts)
    alt existing.inspectionSummary present
        GO-->>GO: reuse cached inspectionResult
    else scenario.postRunInspection and inspectionConfig
        GO->>GO: scenario.postRunInspection(ctx)
    end
    GO->>JT: "judgeTrajectory({judgeSystemPreamble, inspectionEvidence})"
    JT-->>GO: JudgeResult
    GO-->>GB: GradeRecord
    GB-->>CLI: GradeBatchSummary

Reviews (6): Last reviewed commit: "Merge branch 'main' into brandon/eval-sc..."

Comment thread packages/hdx-eval/src/cli.ts Outdated
Comment thread packages/hdx-eval/src/scenarios/types.ts
…ocs, add variant to hooks

- Restore removed comments in grade.ts (tool-error penalty math,
  needsJudge decision, resolveBatchDir path resolution)
- Restore removed comments in systemPrompt.ts (schema reference,
  anchor time explanation, hypothesis playbook description)
- Fix cleanupIds JSDoc: clarify cleanup is the hook's responsibility
  via PostRunInspectionContext.cleanup, not a framework step
- Add variant to SystemPromptContext so custom hooks can adapt to
  hypothesis-mode runs
- Add anchorTimeIso to grade command's inspectionConfig (was missing,
  hooks received undefined on standalone re-grades)

changeset-bot Bot commented Jun 29, 2026

🦋 Changeset detected

Latest commit: b190e46

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
| Name | Type |
| --- | --- |
| @hyperdx/hdx-eval | Patch |

@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 29, 2026 21:35 Inactive

github-actions Bot commented Jun 29, 2026

E2E Test Results

All tests passed • 223 passed • 3 skipped • 1509s

| Status | Count |
| --- | --- |
| ✅ Passed | 223 |
| ❌ Failed | 0 |
| ⚠️ Flaky | 3 |
| ⏭️ Skipped | 3 |

Tests ran across 4 shards in parallel.

The anchor staleness auto-refresh (12h check + ClickHouse data freshness
query) is dashboard-scenario-specific behavior that belongs in the
dashboard-build PR, not the framework hooks PR.
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 29, 2026 21:51 Inactive
Comment thread packages/hdx-eval/src/grading/grade.ts Outdated
…ion on re-grade

- Extract buildInspectionConfig() helper used by both run and grade
  commands (was inlined in 2 places with inconsistent fields)
- Fix P1: re-grading (e.g. --rerun-judge) no longer re-fires the
  postRunInspection hook. On first grade, the evidence string and
  summary are persisted in the grade record. On re-grade, both are
  reused from the cached record — artifacts may have been cleaned up
  so re-running the hook would fail or produce empty evidence.
- Add inspectionEvidence field to GradeRecord so the judge can see
  the artifact evidence even on re-grades
@vercel vercel Bot temporarily deployed to Preview – hyperdx-storybook June 29, 2026 21:59 Inactive
@brandon-pereira brandon-pereira requested review from a team and fleon and removed request for a team June 30, 2026 20:31
@kodiakhq kodiakhq Bot merged commit 64d0bbe into main Jul 1, 2026
28 of 30 checks passed
@kodiakhq kodiakhq Bot deleted the brandon/eval-scenario-hooks branch July 1, 2026 15:30
Labels

automerge review/tier-3 Standard — full human review required
