
test(evals): variant filter, structured transcripts, skill nudge #16626

Merged
denolfe merged 16 commits into main from ai/evals-visibility
May 15, 2026

@denolfe (Member) commented May 14, 2026

Overview

Improves the evals dashboard and the Claude Code agent runner so we can see, at a glance, whether a run actually exercised the Payload skill, what the agent did, and how much that run cost.

[Screenshot: CleanShot 2026-05-14 at 15 39 33@2x]

Key Changes

  • Dashboard filter cleanup

    • Removed the legacy all / users / admins / maintainers audience filter row from the results table (audience UI remains in the Compare view).
    • Added a Variant filter strip driven by the shared getVariant(result) classifier, with buttons derived from the variants present in the loaded entries (Agent Baseline, Agent Skill, Baseline, Skill).
  • Structured Claude Code transcripts

    • Runner now invokes claude --print --output-format stream-json --verbose and parses the NDJSON event stream.
    • Persists a typed TranscriptEvent[] on CodegenRunnerResult and EvalResult covering text, thinking, tool_use, and tool_result blocks.
    • Per-event content truncated to ~4k chars; total events capped at 200 (first 100 + last 100 + marker) to bound cache size.
    • Token usage is extracted from the stream-json result event and populated on the runner result (no more zeros for agent runs).
    • Stderr is captured separately and surfaces in agentLog only when non-empty, preserving timeout and spawn-error diagnostics.
  • Transcript UI in the expanded row

    • New TranscriptView renders the event timeline with collapsible blocks for tool_use, tool_result, and thinking.
    • Tool result errors render in red; tool calls show as → name, results as ← result.
    • The Transcript section opens by default, as do its inner blocks.
    • Falls back to the existing plaintext <pre> when only agentLog is available (legacy cache entries).
  • Skill invocation visibility

    • A badge at the top of each transcript indicates whether the agent invoked the Skill tool (green when invoked, red when not), using v4 design tokens so it renders in both themes.
    • Agent runner appends a system prompt directing the agent to invoke the payload skill, gated on skillInstall === 'embedded' so the agent-baseline lane stays untouched.
    • That directive is part of the codegen cache key, so future tweaks to it auto-invalidate cached agent-skill results.
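The transcript bounds listed under "Structured Claude Code transcripts" can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: the flattened `TranscriptEvent` shape, the constant names, the exact 4,000-character limit, and the wording of the truncation suffix and marker are all assumptions, as is the choice to add the marker on top of the 200 kept events.

```typescript
// Hypothetical, flattened event shape; the real TranscriptEvent union is not shown in this PR.
type TranscriptEvent = {
  kind: 'text' | 'thinking' | 'tool_use' | 'tool_result' | 'marker'
  content: string
}

const MAX_CONTENT_CHARS = 4_000 // "~4k chars" in the description; exact value assumed
const HEAD_EVENTS = 100
const TAIL_EVENTS = 100

// Keep the first `max` characters and note how many were dropped.
function truncateContent(content: string, max: number = MAX_CONTENT_CHARS): string {
  if (content.length <= max) return content
  return `${content.slice(0, max)} [truncated ${content.length - max} chars]`
}

// Truncate every event, then keep the first 100 and last 100 with a marker between them.
function capTranscript(events: TranscriptEvent[]): TranscriptEvent[] {
  const trimmed = events.map((e) => ({ ...e, content: truncateContent(e.content) }))
  if (trimmed.length <= HEAD_EVENTS + TAIL_EVENTS) return trimmed

  const omitted = trimmed.length - HEAD_EVENTS - TAIL_EVENTS
  return [
    ...trimmed.slice(0, HEAD_EVENTS),
    { kind: 'marker', content: `[${omitted} events omitted]` },
    ...trimmed.slice(-TAIL_EVENTS),
  ]
}
```

Keeping both ends preserves the agent's opening moves and its final answer, which tend to be the most useful parts when reviewing a long run.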

Design Decisions

Plaintext first, structured second. The first pass at transcripts kept the existing agentLog plaintext and rendered it in a <pre>. That showed only the final assistant message, because plain --print output omits tool calls. We then upgraded to stream-json with a typed event union, but kept agentLog on the type as an optional fallback so older cache entries still render and stderr has somewhere to go.
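To make the "typed event union" concrete, here is a minimal sketch of parsing an NDJSON stream into events plus token usage. The field names on the incoming JSON (`message.content`, `tool_use_id`, `is_error`, `usage.input_tokens`, `usage.output_tokens`) are assumptions about the stream-json shape, and the `TranscriptEvent` and `Usage` definitions here are illustrative rather than the types added in this PR.

```typescript
// Illustrative union; the PR's actual TranscriptEvent type may differ.
type TranscriptEvent =
  | { type: 'text'; text: string }
  | { type: 'thinking'; text: string }
  | { type: 'tool_use'; id: string; name: string; input: unknown }
  | { type: 'tool_result'; toolUseId: string; content: string; isError: boolean }

type Usage = { inputTokens: number; outputTokens: number }

function parseStreamJson(stdout: string): { events: TranscriptEvent[]; usage?: Usage } {
  const events: TranscriptEvent[] = []
  let usage: Usage | undefined

  for (const line of stdout.split('\n')) {
    if (!line.trim()) continue
    let msg: any
    try {
      msg = JSON.parse(line)
    } catch {
      continue // skip lines that are not valid JSON
    }

    // Assumed: the final `result` event carries token usage.
    if (msg.type === 'result' && msg.usage) {
      usage = {
        inputTokens: msg.usage.input_tokens ?? 0,
        outputTokens: msg.usage.output_tokens ?? 0,
      }
      continue
    }
    if (msg.type !== 'assistant' && msg.type !== 'user') continue

    const blocks = msg.message?.content
    if (!Array.isArray(blocks)) continue
    for (const b of blocks) {
      if (b.type === 'text') {
        events.push({ type: 'text', text: String(b.text ?? '') })
      } else if (b.type === 'thinking') {
        events.push({ type: 'thinking', text: String(b.thinking ?? '') })
      } else if (b.type === 'tool_use') {
        events.push({ type: 'tool_use', id: String(b.id ?? ''), name: String(b.name ?? ''), input: b.input })
      } else if (b.type === 'tool_result') {
        const content = typeof b.content === 'string' ? b.content : JSON.stringify(b.content ?? null)
        events.push({ type: 'tool_result', toolUseId: String(b.tool_use_id ?? ''), content, isError: b.is_error === true })
      }
    }
  }
  return { events, usage }
}
```

Unknown event and block types are ignored rather than rejected, so a new block type in the stream would not break the parse.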

Shared variant classifier. The dashboard's per-row Variant value uses the getVariant(result) helper from test/evals/variant.ts rather than re-inferring from systemPromptKey/modelId in the table. The Variant filter buttons derive their option set from entries, so new variants appear without further changes.
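A rough sketch of how a filter strip can derive its buttons from the loaded entries. The four variant labels come from this PR; the `ResultLike` shape and the body of `getVariant` are stand-ins, since the real classifier in test/evals/variant.ts is not shown here.

```typescript
type Variant = 'Agent Baseline' | 'Agent Skill' | 'Baseline' | 'Skill'

// Hypothetical, simplified result shape used only for this illustration.
type ResultLike = { runner: 'agent' | 'llm'; withSkill: boolean }

// Stand-in classifier; the real getVariant(result) rules are not shown in the PR.
function getVariant(result: ResultLike): Variant {
  if (result.runner === 'agent') return result.withSkill ? 'Agent Skill' : 'Agent Baseline'
  return result.withSkill ? 'Skill' : 'Baseline'
}

// Buttons are whatever variants actually occur in the loaded entries, in sorted order.
function variantOptions<T>(entries: T[], classify: (entry: T) => Variant): Variant[] {
  return Array.from(new Set(entries.map(classify))).sort()
}

// A null selection means "no filter".
function filterByVariant<T>(entries: T[], classify: (entry: T) => Variant, selected: Variant | null): T[] {
  return selected === null ? entries : entries.filter((e) => classify(e) === selected)
}
```

Because the option set is computed from the data, a variant that never appears in the cache produces no button, and a new variant returned by the classifier shows up without touching the table.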

System-prompt nudge over skill description change. The payload skill description was permissive ("Use when working with Payload CMS projects"), so the agent often skipped it. Rather than rewrite the skill description (which would affect production users), the runner appends a --append-system-prompt directive on the agent-skill lane only. This preserves a clean A/B with agent-baseline.
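The gating described here might look something like the sketch below. The CLI flags are the ones named in this PR; the directive text, the `'none'` value of `SkillInstall`, the function name, and passing the instruction as a positional argument are assumptions made for illustration.

```typescript
// Only 'embedded' is named in the PR; 'none' is a placeholder for the baseline lane.
type SkillInstall = 'embedded' | 'none'

// Hypothetical wording; the actual directive is not quoted in the PR.
const SKILL_DIRECTIVE = 'Before writing any code, invoke the payload skill.'

function buildClaudeArgs(instruction: string, skillInstall: SkillInstall): string[] {
  const args = ['--print', '--output-format', 'stream-json', '--verbose']

  // Nudge only the agent-skill lane so agent-baseline runs stay untouched.
  if (skillInstall === 'embedded') {
    args.push('--append-system-prompt', SKILL_DIRECTIVE)
  }

  // Assumption: the prompt is passed positionally; the real runner may use stdin instead.
  args.push(instruction)
  return args
}
```

With this shape, the two lanes differ by exactly one flag and its value, which keeps the A/B comparison easy to reason about.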

Cache invariance. The cache key already factored in skillInstall and skillHash. The new directive is now part of the key as well, so any future copy change forces a fresh run.
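As an illustration of why including the directive in the key invalidates old results, here is a minimal hashing sketch. `skillInstall`, `skillHash`, and `agentSystemPrompt` are named in this PR; the other fields, the SHA-256 choice, and the serialization are assumptions, not the actual `codegenKey` in cache.ts.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical input shape; only skillInstall, skillHash and agentSystemPrompt come from the PR.
type CodegenKeyInput = {
  instruction: string
  modelId: string
  skillInstall: string
  skillHash: string
  agentSystemPrompt?: string
}

function codegenKey(input: CodegenKeyInput): string {
  // Serialize fields in a fixed order so the same input always yields the same key.
  const payload = JSON.stringify([
    input.instruction,
    input.modelId,
    input.skillInstall,
    input.skillHash,
    input.agentSystemPrompt ?? '',
  ])
  return createHash('sha256').update(payload).digest('hex')
}
```

Any edit to the directive text changes the serialized payload, so the digest changes and the previous agent-skill entry is simply never looked up again.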

Audience stripped from list view only. EvalEntry.audience and audience.ts remain because CompareTable still uses them. Scope was kept to the dashboard list as requested.

Overall Flow

sequenceDiagram
    participant Test as vitest eval case
    participant Runner as runCodegenEval
    participant Claude as claude CLI
    participant Cache as cache.ts
    participant UI as EvalDashboard

    Test->>Runner: run(instruction, fixture, opts)
    Runner->>Cache: codegenKey({ ..., agentSystemPrompt })
    Cache-->>Runner: cached EvalResult?
    alt cache hit
        Runner-->>Test: cached result
    else cache miss (agent-skill)
        Runner->>Claude: --output-format stream-json --append-system-prompt
        Claude-->>Runner: NDJSON (system, assistant, user, result)
        Runner->>Runner: parse to TranscriptEvent[] and usage
        Runner->>Cache: persist EvalResult with transcript, usage, agentLog?
        Runner-->>Test: result
    end
    UI->>UI: read cache, flatten to EvalEntry[]
    UI->>UI: filter by Variant button
    UI->>UI: ExpandedRow renders TranscriptView and SkillInvocationBadge
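The last step in the diagram, the SkillInvocationBadge, needs only a yes/no answer from the transcript. A minimal sketch of that check, assuming tool calls are recorded as `tool_use` events with a `name` field and that the tool is named `Skill` (the event shape and function names here are illustrative):

```typescript
// Illustrative event union; only the fields needed for the check are modeled.
type TranscriptEvent =
  | { type: 'text'; text: string }
  | { type: 'thinking'; text: string }
  | { type: 'tool_use'; name: string }
  | { type: 'tool_result'; content: string }

// True if any tool call in the transcript targeted the given tool.
function invokedSkill(events: TranscriptEvent[], toolName: string = 'Skill'): boolean {
  return events.some((e) => e.type === 'tool_use' && e.name === toolName)
}

// Hypothetical badge model: green when invoked, red when not, per the PR description.
function skillBadge(events: TranscriptEvent[]): { color: 'green' | 'red'; label: string } {
  return invokedSkill(events)
    ? { color: 'green', label: 'Skill invoked' }
    : { color: 'red', label: 'Skill not invoked' }
}
```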

denolfe added 16 commits May 14, 2026 14:39
Switch claude --print to stream-json output, parse NDJSON events,
and persist a typed TranscriptEvent[] on EvalResult so the dashboard
can show tool calls and tool results in addition to the final message.
Render the new TranscriptEvent[] in the expanded result row with
collapsible tool calls, tool results, and thinking blocks. Fall back
to the previous plaintext agentLog when no structured events are
available (older cache entries or stderr-only failures).
@denolfe denolfe changed the title from "chore(test-evals): variant filter, structured transcripts, skill nudge" to "test(evals): variant filter, structured transcripts, skill nudge" May 14, 2026
@denolfe denolfe marked this pull request as ready for review May 14, 2026 19:47
@github-actions
Contributor

📦 esbuild Bundle Analysis for payload

This analysis was generated by esbuild-bundle-analyzer. 🤖
This PR introduced no changes to the esbuild bundle! 🙌

@denolfe denolfe merged commit caa676d into main May 15, 2026
170 of 173 checks passed
@denolfe denolfe deleted the ai/evals-visibility branch May 15, 2026 16:23