test(evals): variant filter, structured transcripts, skill nudge #16626
Merged
Switch claude --print to stream-json output, parse NDJSON events, and persist a typed TranscriptEvent[] on EvalResult so the dashboard can show tool calls and tool results in addition to the final message.
Render the new TranscriptEvent[] in the expanded result row with collapsible tool calls, tool results, and thinking blocks. Fall back to the previous plaintext agentLog when no structured events are available (older cache entries or stderr-only failures).
Overview
Improves the evals dashboard and the Claude Code agent runner so we can see, at a glance, whether a run actually exercised the Payload skill, what the agent did, and how much that run cost.
Key Changes
Dashboard filter cleanup
- Removed the `all / users / admins / maintainers` audience filter row from the results table (audience UI remains in the Compare view).
- Added a Variant filter backed by the `getVariant(result)` classifier, with buttons derived from the variants present in the loaded entries (Agent Baseline, Agent Skill, Baseline, Skill).

Structured Claude Code transcripts

- The runner invokes `claude --print --output-format stream-json --verbose` and parses the NDJSON event stream.
- Persists a typed `TranscriptEvent[]` on `CodegenRunnerResult` and `EvalResult` covering text, thinking, tool_use, and tool_result blocks.
- Usage is read from the `result` event and populated on the runner result (no more zeros for agent runs).
- stderr is written to `agentLog` only when non-empty, preserving timeout and spawn-error diagnostics.

Transcript UI in the expanded row

- `TranscriptView` renders the event timeline with collapsible blocks for tool_use, tool_result, and thinking.
- Tool calls render as `→ name`, results as `← result`.
- Falls back to a `<pre>` when only `agentLog` is available (legacy cache entries).

Skill invocation visibility

- A badge shows whether the run called the `Skill` tool (green when invoked, red when not), using v4 design tokens so it renders in both themes.
- A system-prompt directive nudges the agent toward the `payload` skill, gated on `skillInstall === 'embedded'` so the `agent-baseline` lane stays untouched.

Design Decisions
**Plaintext first, structured second.** The first pass at transcripts kept the existing `agentLog` plaintext and rendered it in a `<pre>`. That only showed the final assistant message because `--print` strips tool calls. We then upgraded to stream-json with a typed event union, but kept `agentLog` on the type as an optional fallback so older cache entries still render and stderr has somewhere to go.

**Shared variant classifier.** The dashboard's per-row Variant value uses the `getVariant(result)` helper from `test/evals/variant.ts` rather than re-inferring from `systemPromptKey` / `modelId` in the table. The Variant filter buttons derive their option set from `entries`, so new variants appear without further changes.

**System-prompt nudge over skill description change.** The `payload` skill description was permissive ("Use when working with Payload CMS projects"), so the agent often skipped it. Rather than rewrite the skill description (which would affect production users), the runner appends a `--append-system-prompt` directive on the agent-skill lane only. This preserves a clean A/B with `agent-baseline`.

**Cache invariance.** The cache key already factored in `skillInstall` and `skillHash`. The new directive is included as well, so any future copy change forces a fresh run.

**Audience stripped from list view only.** `EvalEntry.audience` and `audience.ts` remain because `CompareTable` still uses them. Scope was kept to the dashboard list as requested.

Overall Flow
```mermaid
sequenceDiagram
    participant Test as vitest eval case
    participant Runner as runCodegenEval
    participant Claude as claude CLI
    participant Cache as cache.ts
    participant UI as EvalDashboard
    Test->>Runner: run(instruction, fixture, opts)
    Runner->>Cache: codegenKey({ ..., agentSystemPrompt })
    Cache-->>Runner: cached EvalResult?
    alt cache hit
        Runner-->>Test: cached result
    else cache miss (agent-skill)
        Runner->>Claude: --output-format stream-json --append-system-prompt
        Claude-->>Runner: NDJSON (system, assistant, user, result)
        Runner->>Runner: parse to TranscriptEvent[] and usage
        Runner->>Cache: persist EvalResult with transcript, usage, agentLog?
        Runner-->>Test: result
    end
    UI->>UI: read cache, flatten to EvalEntry[]
    UI->>UI: filter by Variant button
    UI->>UI: ExpandedRow renders TranscriptView and SkillInvocationBadge
```
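For reviewers who want a feel for the "parse to TranscriptEvent[] and usage" step, here is a minimal sketch rather than the code in this PR. The `TranscriptEvent` fields, the `Usage` shape, the `input_tokens` / `output_tokens` field names, and the helper names `toEvent`, `parseStream`, and `skillInvoked` are all assumptions for illustration. The only things taken from the description are the event types (system, assistant, user, result) and the block kinds (text, thinking, tool_use, tool_result).

```typescript
// Hypothetical event union; the PR's actual TranscriptEvent type may differ.
type TranscriptEvent =
  | { kind: 'text'; text: string }
  | { kind: 'thinking'; text: string }
  | { kind: 'tool_use'; name: string; input: unknown }
  | { kind: 'tool_result'; content: string }

// Assumed usage shape and field names.
type Usage = { inputTokens: number; outputTokens: number }
type ParsedStream = { transcript: TranscriptEvent[]; usage?: Usage }

// Assumed shape of one content block inside an assistant or user message.
type ContentBlock = {
  type?: string
  text?: string
  thinking?: string
  name?: string
  input?: unknown
  content?: unknown
}

function toEvent(block: ContentBlock): TranscriptEvent | undefined {
  switch (block.type) {
    case 'text':
      return { kind: 'text', text: block.text ?? '' }
    case 'thinking':
      return { kind: 'thinking', text: block.thinking ?? '' }
    case 'tool_use':
      return { kind: 'tool_use', name: block.name ?? '', input: block.input }
    case 'tool_result':
      return {
        kind: 'tool_result',
        content:
          typeof block.content === 'string'
            ? block.content
            : JSON.stringify(block.content ?? null),
      }
    default:
      return undefined // unknown block types are dropped
  }
}

function parseStream(ndjson: string): ParsedStream {
  const transcript: TranscriptEvent[] = []
  let usage: Usage | undefined

  for (const line of ndjson.split('\n')) {
    const trimmed = line.trim()
    if (!trimmed) continue

    let event: any
    try {
      event = JSON.parse(trimmed)
    } catch {
      continue // skip lines that are not valid JSON
    }
    if (typeof event !== 'object' || event === null) continue

    if (event.type === 'assistant' || event.type === 'user') {
      const content = event.message?.content
      const blocks: ContentBlock[] = Array.isArray(content) ? content : []
      for (const block of blocks) {
        const parsed = toEvent(block)
        if (parsed) transcript.push(parsed)
      }
    } else if (event.type === 'result' && event.usage) {
      usage = {
        inputTokens: event.usage.input_tokens ?? 0,
        outputTokens: event.usage.output_tokens ?? 0,
      }
    }
  }
  return { transcript, usage }
}

// What a skill-invocation badge could key off: any tool_use named "Skill".
function skillInvoked(transcript: TranscriptEvent[]): boolean {
  return transcript.some((e) => e.kind === 'tool_use' && e.name === 'Skill')
}

// Made-up sample stream, one JSON object per line.
const sample = [
  JSON.stringify({ type: 'system', subtype: 'init' }),
  JSON.stringify({
    type: 'assistant',
    message: { content: [{ type: 'tool_use', name: 'Skill', input: { skill: 'payload' } }] },
  }),
  JSON.stringify({ type: 'user', message: { content: [{ type: 'tool_result', content: 'ok' }] } }),
  JSON.stringify({ type: 'result', usage: { input_tokens: 10, output_tokens: 5 } }),
].join('\n')

const parsed = parseStream(sample)
```

Skipping unparseable lines instead of throwing means a partially written stream (for example after a timeout) would still yield whatever events arrived, which fits the intent of keeping `agentLog` as a fallback for diagnostics.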
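The cache-invariance and gating decisions can be illustrated the same way. This is a sketch under stated assumptions, not the contents of `cache.ts`. The directive wording, the `KeyInput` fields other than `skillInstall`, `skillHash`, and `agentSystemPrompt`, the `'none'` install value, the SHA-256 hashing, and the helpers `agentSystemPromptFor` and `claudeArgs` are all hypothetical. The CLI flags are the ones named in this description.

```typescript
import { createHash } from 'node:crypto'

// 'embedded' comes from the PR description; 'none' is a placeholder value.
type SkillInstall = 'embedded' | 'none'

// Hypothetical directive copy; the real wording is not shown in the description.
const SKILL_DIRECTIVE = 'Invoke the payload skill before writing any code.'

type KeyInput = {
  instruction: string
  modelId: string
  skillInstall: SkillInstall
  skillHash?: string
  agentSystemPrompt?: string
}

// Only the agent-skill lane gets the nudge; the baseline lane gets nothing.
function agentSystemPromptFor(skillInstall: SkillInstall): string | undefined {
  return skillInstall === 'embedded' ? SKILL_DIRECTIVE : undefined
}

// Hash the fields in a fixed order so equal inputs always produce equal keys.
function codegenKey(input: KeyInput): string {
  const payload = JSON.stringify([
    input.instruction,
    input.modelId,
    input.skillInstall,
    input.skillHash ?? null,
    input.agentSystemPrompt ?? null,
  ])
  return createHash('sha256').update(payload).digest('hex')
}

// Build CLI arguments; the append flag is added only when a directive exists.
function claudeArgs(agentSystemPrompt?: string): string[] {
  const args = ['--print', '--output-format', 'stream-json', '--verbose']
  if (agentSystemPrompt) args.push('--append-system-prompt', agentSystemPrompt)
  return args
}

// Made-up inputs for illustration.
const base = { instruction: 'Add a posts collection', modelId: 'example-model', skillHash: 'abc123' }

const skillKey = codegenKey({
  ...base,
  skillInstall: 'embedded',
  agentSystemPrompt: agentSystemPromptFor('embedded'),
})
const editedKey = codegenKey({
  ...base,
  skillInstall: 'embedded',
  agentSystemPrompt: 'Edited directive copy.',
})
const baselineArgs = claudeArgs(agentSystemPromptFor('none'))
```

Because the directive text feeds the key, `skillKey` and `editedKey` differ, so editing the copy would miss the cache and trigger a fresh run, while `baselineArgs` carries no `--append-system-prompt` flag.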