test(evals): variant filter, structured transcripts, skill nudge by denolfe · Pull Request #16626 · payloadcms/payload

denolfe · 2026-05-14T19:40:10Z

Overview

Improves the evals dashboard and the Claude Code agent runner so we can see, at a glance, whether a run actually exercised the Payload skill, what the agent did, and how much that run cost.

Key Changes

Dashboard filter cleanup
- Removed the legacy all / users / admins / maintainers audience filter row from the results table (audience UI remains in the Compare view).
- Added a Variant filter strip driven by the shared getVariant(result) classifier, with buttons derived from the variants present in the loaded entries (Agent Baseline, Agent Skill, Baseline, Skill).
Structured Claude Code transcripts
- Runner now invokes claude --print --output-format stream-json --verbose and parses the NDJSON event stream.
- Persists a typed TranscriptEvent[] on CodegenRunnerResult and EvalResult covering text, thinking, tool_use, and tool_result blocks.
- Per-event content truncated to ~4k chars; total events capped at 200 (first 100 + last 100 + marker) to bound cache size.
- Token usage is extracted from the stream-json result event and populated on the runner result (no more zeros for agent runs).
- Stderr is captured separately and surfaces in agentLog only when non-empty, preserving timeout and spawn-error diagnostics.
Transcript UI in the expanded row
- New TranscriptView renders the event timeline with collapsible blocks for tool_use, tool_result, and thinking.
- Tool result errors render in red; tool calls show as → name, results as ← result.
- The Transcript section opens by default, as do its inner blocks.
- Falls back to the existing plaintext <pre> when only agentLog is available (legacy cache entries).
Skill invocation visibility
- A badge at the top of each transcript indicates whether the agent invoked the Skill tool (green when invoked, red when not), using v4 design tokens so it renders in both themes.
- Agent runner appends a system prompt directing the agent to invoke the payload skill, gated on skillInstall === 'embedded' so the agent-baseline lane stays untouched.
- That directive is part of the codegen cache key, so future tweaks to it auto-invalidate cached agent-skill results.

Design Decisions

Plaintext first, structured second. The first pass at transcripts kept the existing agentLog plaintext and rendered it in a <pre>. That only showed the final assistant message because --print strips tool calls. We then upgraded to stream-json with a typed event union, but kept agentLog on the type as an optional fallback so older cache entries still render and stderr has somewhere to go.

Shared variant classifier. The dashboard's per-row Variant value uses the getVariant(result) helper from test/evals/variant.ts rather than re-inferring from systemPromptKey/modelId in the table. The Variant filter buttons derive their option set from entries, so new variants appear without further changes.

System-prompt nudge over skill description change. The payload skill description was permissive ("Use when working with Payload CMS projects"), so the agent often skipped it. Rather than rewrite the skill description (which would affect production users), the runner appends a --append-system-prompt directive on the agent-skill lane only. This preserves a clean A/B with agent-baseline.

Cache invariance. The cache key already factored skillInstall and skillHash. The new directive is included as well so any future copy change forces a fresh run.

Audience stripped from list view only. EvalEntry.audience and audience.ts remain because CompareTable still uses them. Scope was kept to the dashboard list as requested.

Overall Flow

sequenceDiagram
    participant Test as vitest eval case
    participant Runner as runCodegenEval
    participant Claude as claude CLI
    participant Cache as cache.ts
    participant UI as EvalDashboard

    Test->>Runner: run(instruction, fixture, opts)
    Runner->>Cache: codegenKey({ ..., agentSystemPrompt })
    Cache-->>Runner: cached EvalResult?
    alt cache hit
        Runner-->>Test: cached result
    else cache miss (agent-skill)
        Runner->>Claude: --output-format stream-json --append-system-prompt
        Claude-->>Runner: NDJSON (system, assistant, user, result)
        Runner->>Runner: parse to TranscriptEvent[] and usage
        Runner->>Cache: persist EvalResult with transcript, usage, agentLog?
        Runner-->>Test: result
    end
    UI->>UI: read cache, flatten to EvalEntry[]
    UI->>UI: filter by Variant button
    UI->>UI: ExpandedRow renders TranscriptView and SkillInvocationBadge

Switch claude --print to stream-json output, parse NDJSON events, and persist a typed TranscriptEvent[] on EvalResult so the dashboard can show tool calls and tool results in addition to the final message.

Render the new TranscriptEvent[] in the expanded result row with collapsible tool calls, tool results, and thinking blocks. Fall back to the previous plaintext agentLog when no structured events are available (older cache entries or stderr-only failures).

…esult event

…em-prompt

…-code runs

…-mode visibility

github-actions · 2026-05-14T19:49:06Z

📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖
This PR introduced no changes to the esbuild bundle! 🙌

denolfe added 16 commits May 14, 2026 14:39

chore(test-evals): remove audience filter from dashboard

eb23c43

feat(test-evals): add variant filter to dashboard

ecb8ab3

chore: rename variantLabel and extract VariantFilter type

f1815ff

feat(test-evals): show claude code transcript in dashboard

755bd9c

feat(test-evals): capture structured transcripts from claude code runner

56eda92

Switch claude --print to stream-json output, parse NDJSON events, and persist a typed TranscriptEvent[] on EvalResult so the dashboard can show tool calls and tool results in addition to the final message.

chore(test-evals): flush trailing stdout buffer on agent exit

4711b4e

feat(test-evals): populate token usage from claude code stream-json r…

c71a1fa

…esult event

feat(test-evals): show skill invocation badge in transcript

6bf2062

chore(test-evals): expand transcript by default

60f81cb

feat(test-evals): nudge agent to invoke payload skill via append-syst…

d62816a

…em-prompt

feat(test-evals): include agent system prompt in cache key for claude…

706cd79

…-code runs

chore(test-evals): color skill invocation badge red when not invoked

19e98e4

chore(test-evals): expand all transcript details by default

f0dbc50

chore(test-evals): use solid color-500 fills for skill badge for dark…

9c0d482

…-mode visibility

fix(test-evals): use v4 design tokens for skill badge background

16a6134

github-actions Bot added the created-by: Payload team label May 14, 2026

denolfe changed the title ~~chore(test-evals): variant filter, structured transcripts, skill nudge~~ test(evals): variant filter, structured transcripts, skill nudge May 14, 2026

denolfe marked this pull request as ready for review May 14, 2026 19:47

denolfe merged commit caa676d into main May 15, 2026
170 of 173 checks passed

denolfe deleted the ai/evals-visibility branch May 15, 2026 16:23

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

test(evals): variant filter, structured transcripts, skill nudge#16626

test(evals): variant filter, structured transcripts, skill nudge#16626
denolfe merged 16 commits into
mainfrom
ai/evals-visibility

denolfe commented May 14, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

denolfe commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Key Changes

Design Decisions

Overall Flow

Uh oh!

github-actions Bot commented May 14, 2026

📦 esbuild Bundle Analysis for payload

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

denolfe commented May 14, 2026 •

edited

Loading