feat(hdx-eval): add scenario hooks for custom prompts, tools, grading, and inspection #2547
…, and inspection

Add optional hooks to the Scenario type so new scenario kinds can customize harness and grading behavior without modifying framework files.

Hooks:
- buildSystemPrompt(ctx): replace the default investigation system prompt
- allowedToolPatterns: selectively unblock denied MCP tools
- judgeSystemPreamble: replace the default LLM judge preamble
- postRunInspection(ctx): inspect artifacts post-run, collect evidence for the judge, and clean up

When a hook is not provided, the existing investigation behavior is preserved, so all existing scenarios are unaffected. Adding a new scenario kind (e.g. dashboard-build, alert-build) now requires only the scenario files plus one import line in index.ts, with zero changes to harness, grading, CLI, or type files.

Other changes:
- runCell takes a Scenario object instead of a scenario name string
- GradeRecord has a generic inspectionSummary field (opaque JSON)
- CLI: --email/--password options for post-run inspection auth
- CLI: auto-refresh stale anchor time (>12h) with ClickHouse data freshness check to avoid unnecessary re-seeds
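As an illustration of how `allowedToolPatterns` could selectively unblock denied tools, here is a minimal sketch. Only the hook name comes from this PR. The pattern syntax (exact name or trailing `*` wildcard), the helper names `matchesPattern` and `isToolDenied`, and the `mcp__example__*` tool names are all assumptions made for the example, not the framework's actual implementation.

```typescript
// Hypothetical, trimmed-down Scenario shape: only `allowedToolPatterns`
// is taken from the PR text; everything else is illustrative.
interface Scenario {
  name: string;
  allowedToolPatterns?: string[];
}

// Assumed pattern syntax: an exact tool name, or a prefix ending in "*".
function matchesPattern(tool: string, pattern: string): boolean {
  return pattern.endsWith("*")
    ? tool.startsWith(pattern.slice(0, -1))
    : tool === pattern;
}

// A tool stays denied only if it matches the deny list and the scenario
// has not explicitly unblocked it. Scenarios without the hook keep the
// default deny behavior.
function isToolDenied(tool: string, denied: string[], scenario: Scenario): boolean {
  const blocked = denied.some((p) => matchesPattern(tool, p));
  const unblocked = (scenario.allowedToolPatterns ?? []).some((p) =>
    matchesPattern(tool, p),
  );
  return blocked && !unblocked;
}

// Usage with made-up tool names.
const defaultDenied = ["mcp__example__*"];
const buildScenario: Scenario = {
  name: "demo-build",
  allowedToolPatterns: ["mcp__example__save_*"],
};
const investigationScenario: Scenario = { name: "demo-investigation" };
```

With these definitions, `buildScenario` can call `mcp__example__save_thing` while `investigationScenario` cannot, which mirrors the "hook absent means existing behavior" rule described above.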
…kup configs

- Extract DEFAULT_EVAL_EMAIL and DEFAULT_EVAL_PASSWORD constants in cli.ts (previously hardcoded in 3 places: setup-hyperdx, run, grade)
- Add eval.config.*.json to .gitignore so backup configs (eval.config.branch.json, eval.config.main.json) are not tracked
🟡 Tier 3 (Standard): Introduces new logic, modifies core functionality, or touches areas with non-trivial risk.
Review process: Full human review of logic, architecture, and edge cases.
…ocs, add variant to hooks

- Restore removed comments in grade.ts (tool-error penalty math, needsJudge decision, resolveBatchDir path resolution)
- Restore removed comments in systemPrompt.ts (schema reference, anchor time explanation, hypothesis playbook description)
- Fix cleanupIds JSDoc: clarify cleanup is the hook's responsibility via PostRunInspectionContext.cleanup, not a framework step
- Add variant to SystemPromptContext so custom hooks can adapt to hypothesis-mode runs
- Add anchorTimeIso to grade command's inspectionConfig (was missing, hooks received undefined on standalone re-grades)
🦋 Changeset detected
Latest commit: b190e46
The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
E2E Test Results
✅ All tests passed • 223 passed • 3 skipped • 1509s
Tests ran across 4 shards in parallel.
The anchor staleness auto-refresh (12h check + ClickHouse data freshness query) is dashboard-scenario-specific behavior that belongs in the dashboard-build PR, not the framework hooks PR.
…ion on re-grade

- Extract buildInspectionConfig() helper used by both run and grade commands (was inlined in 2 places with inconsistent fields)
- Fix P1: re-grading (e.g. --rerun-judge) no longer re-fires the postRunInspection hook. On first grade, the evidence string and summary are persisted in the grade record. On re-grade, both are reused from the cached record, because artifacts may have been cleaned up and re-running the hook would fail or produce empty evidence.
- Add inspectionEvidence field to GradeRecord so the judge can see the artifact evidence even on re-grades
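The re-grade behavior described in this commit can be sketched roughly as follows. The field names `inspectionSummary` and `inspectionEvidence` come from the PR; the `InspectionResult` type, the `resolveInspection` helper, and the synchronous hook signature are assumptions made to keep the example short (the real hook presumably does I/O and is asynchronous).

```typescript
// Only the two inspection fields are taken from the PR; `score` is a
// placeholder for whatever else a grade record holds.
interface GradeRecord {
  score: number;
  inspectionSummary?: unknown; // opaque JSON
  inspectionEvidence?: string;
}

// Hypothetical result shape returned by a postRunInspection hook.
interface InspectionResult {
  evidence: string;
  summary: unknown;
}

// Modelled as synchronous for brevity.
type InspectionHook = () => InspectionResult;

function resolveInspection(
  cached: GradeRecord | undefined,
  hook: InspectionHook | undefined,
): InspectionResult | undefined {
  // Re-grade: a record already exists, so reuse whatever it persisted and
  // never call the hook again (its artifacts may have been cleaned up).
  if (cached !== undefined) {
    return cached.inspectionEvidence === undefined
      ? undefined
      : { evidence: cached.inspectionEvidence, summary: cached.inspectionSummary };
  }
  // First grade: run the hook if the scenario defines one.
  return hook ? hook() : undefined;
}

// Usage: count how often the hook fires across a grade and a re-grade.
let hookCalls = 0;
const countingHook: InspectionHook = () => {
  hookCalls += 1;
  return { evidence: "dashboard has 3 tiles", summary: { tiles: 3 } };
};

const firstGrade = resolveInspection(undefined, countingHook);
const persisted: GradeRecord = {
  score: 1,
  inspectionEvidence: firstGrade?.evidence,
  inspectionSummary: firstGrade?.summary,
};
const reGrade = resolveInspection(persisted, countingHook);
```

After both calls the hook has fired once, and the re-grade sees the same evidence string the first grade produced.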
Summary
Adds optional hooks to the `Scenario` type so new eval scenario kinds can customize harness and grading behavior without modifying framework files. All existing scenarios are unaffected: hooks are optional and the defaults are preserved.

Hooks
- `buildSystemPrompt(ctx)`: replaces the default investigation system prompt
- `allowedToolPatterns`: selectively unblocks denied MCP tools
- `judgeSystemPreamble`: replaces the default LLM judge preamble
- `postRunInspection(ctx)`: inspects artifacts post-run, collects evidence for the judge, and cleans up

Why
Adding the dashboard-build eval (separate PR) required changes to 10+ framework files because `ScenarioKind` was threaded through the entire call stack. With hooks, adding a new scenario kind requires only the scenario files plus one import line in `index.ts`.

Other changes
- `runCell` takes a `Scenario` object instead of a scenario name string (needed for hook dispatch)
- `GradeRecord` has generic `inspectionSummary` and `inspectionEvidence` fields
- Cached inspection evidence and summary are reused on `--rerun-judge` instead of re-firing the hook (artifacts may have been cleaned up)
- `SystemPromptContext` includes `variant` so custom hooks can adapt to hypothesis-mode runs
- `buildInspectionConfig()` helper extracted (used by both `run` and `grade` commands)
- `--email`/`--password` options on `run` and `grade` commands (defaults extracted to `DEFAULT_EVAL_EMAIL`/`DEFAULT_EVAL_PASSWORD` constants)
- `.gitignore`: ignore `eval.config.*.json` backup configs

Test plan
- `buildSystemPrompt` hook dispatch verified with a fake scenario
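
A fake-scenario check of this kind might look like the sketch below. The names `Scenario`, `SystemPromptContext`, `buildSystemPrompt`, and `variant` appear in this PR; the `variant` type, the default prompt text, and the `resolveSystemPrompt` helper are assumptions for illustration and may not match the actual harness code.

```typescript
// The PR adds `variant` to the context; modelling it as a string is an
// assumption.
interface SystemPromptContext {
  variant: string;
}

interface Scenario {
  name: string;
  buildSystemPrompt?: (ctx: SystemPromptContext) => string;
}

// Stand-in for the existing investigation prompt builder.
function defaultInvestigationPrompt(ctx: SystemPromptContext): string {
  return `DEFAULT INVESTIGATION PROMPT (${ctx.variant})`;
}

// Hypothetical dispatch helper: use the hook when present, otherwise keep
// the default investigation behavior.
function resolveSystemPrompt(scenario: Scenario, ctx: SystemPromptContext): string {
  return scenario.buildSystemPrompt
    ? scenario.buildSystemPrompt(ctx)
    : defaultInvestigationPrompt(ctx);
}

// A fake scenario that overrides the prompt, and a plain one that does not.
const fakeScenario: Scenario = {
  name: "fake",
  buildSystemPrompt: (ctx) => `FAKE PROMPT (${ctx.variant})`,
};
const plainScenario: Scenario = { name: "plain" };

const fakePrompt = resolveSystemPrompt(fakeScenario, { variant: "hypothesis" });
const plainPrompt = resolveSystemPrompt(plainScenario, { variant: "hypothesis" });
```

Checking both the overriding and the plain scenario covers the two claims made in the summary: the hook is dispatched when present, and existing scenarios keep the default prompt.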