Empirical eval infrastructure for Neo's create-skill: do directive-redirect skills need a benchmark substrate?
#10320
Replies: 2 comments
The Concept
Anthropic's `/skill-creator` ships a complete skill-development laboratory: `evals/evals.json` for test cases, `iteration-N/` workspace for tracked revisions, `benchmark.json` + `benchmark.md` for quantitative metrics (pass rate, time, tokens with mean ± stddev), `eval-viewer/generate_review.py` for human-in-the-loop qualitative review, blind A/B comparison agents (`agents/comparator.md`, `agents/analyzer.md`), and a description-optimization loop (`scripts/run_loop.py`) that splits eval sets into 60/40 train/test and iterates up to 5 times to maximize trigger accuracy without overfitting.
Neo's `/create-skill` (post-PR-#10317) ships a conventions document: directive-redirect pattern, YAML frontmatter with `name` + `description` + `triggers`, `references/` + `assets/` bundling layout, Anchor & Echo discipline. Zero eval infrastructure. Skills are validated through real-world agent invocation; quality is assessed informally via "did the agent do the right thing in the moment."
Proposal under consideration (this Discussion does NOT pre-commit to it): adopt some form of empirical eval substrate for Neo skills, calibrated to Neo's directive-redirect architecture rather than ported wholesale from Anthropic's bundle-of-actions architecture.
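To make the description-optimization idea above concrete, here is a minimal sketch of a 60/40 train/test split used to choose a skill description without overfitting. It is not Anthropic's `run_loop.py`: the `triggers` function is a toy keyword matcher standing in for a real harness call, the eval cases and candidate descriptions are invented for illustration, and "up to 5 iterations" is modeled simply as scoring at most five candidates.

```python
import random

# Hypothetical labeled eval set: (prompt, should_trigger).
CASES = [
    ("please review this pull request", True),
    ("open a pull request for my branch", True),
    ("review my changes", True),
    ("request a code review", True),
    ("pull in reviewers and review", True),
    ("what is the weather", False),
    ("summarize this ticket", False),
    ("run the unit tests", False),
    ("write a blog draft", False),
    ("rename this file", False),
]


def triggers(description: str, prompt: str) -> bool:
    """Toy stand-in for a harness call: fire when prompt and description share a word."""
    words = set(description.lower().split())
    return any(word in words for word in prompt.lower().split())


def accuracy(description, cases):
    """Fraction of (prompt, should_trigger) cases the description gets right."""
    hits = sum(triggers(description, prompt) == expected for prompt, expected in cases)
    return hits / len(cases)


def split_60_40(cases, seed=0):
    """Shuffle deterministically, then cut into 60% train and 40% test."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * 0.6)
    return shuffled[:cut], shuffled[cut:]


def optimize(candidates, cases, max_iters=5, seed=0):
    """Select on the train split only; report held-out test accuracy separately."""
    train, test = split_60_40(cases, seed)
    best = max(candidates[:max_iters], key=lambda d: accuracy(d, train))
    return best, accuracy(best, train), accuracy(best, test)


if __name__ == "__main__":
    print(optimize(["deploy production", "review pull request"], CASES))
```

A large gap between the train and test accuracy returned by `optimize` would suggest the chosen description is fitted to the train prompts rather than to the skill's actual trigger surface.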
The Rationale
Two empirical anchors from this very session
Self-correction #1 (caught pre-post): review of PR #10317 (enhancement(ai): Convert absolute skill paths to workspace-relative to prevent antigravity ide hiccups, #10316). My reflexive Block-level concern "relative paths break Claude Code Read tool per tool docstring" turned out to be wrong; the tool spec was stricter than the implementation. An empirical isolation test (§5.1 of pr-review) caught it before posting.
Self-correction #2 (caught post-post): review of PR #10317. My Required Action "empty PR body violates pull-request §6" was mechanically correct on the body field but missed Gemini's Fat Ticket info already on the comment thread. tobi's relay caught it ~7 min after I posted.
Both corrections flowed from skill-shaped reasoning failing in distinct ways. Neither would have been caught by a pre-merge eval — both fired during real review work. But that's exactly the kind of pattern an eval substrate would detect at the skill design layer if we had it: synthesizing canonical "agent reviews PR with empty body / agent reads relative path from skill directive" test cases against a labeled expected behavior.
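As a rough illustration of what "canonical test cases against a labeled expected behavior" could look like, the sketch below encodes the two scenarios above as labeled cases and summarizes the pass rate across repeated runs as mean ± stddev. Everything here is hypothetical: the `EvalCase` shape, the case names, and the expected-decision labels are assumptions for illustration, not an existing Neo or Anthropic schema.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class EvalCase:
    name: str       # hypothetical case identifier
    scenario: str   # what the agent is shown
    expected: str   # labeled expected decision (hypothetical labels)


CASES = [
    EvalCase(
        name="relative-skill-path",
        scenario="Skill directive references a workspace-relative path",
        expected="no_block",
    ),
    EvalCase(
        name="empty-pr-body",
        scenario="PR body is empty; ticket details are on the comment thread",
        expected="no_required_action",
    ),
]


def pass_rate(cases, decisions):
    """Share of cases where the agent's recorded decision matches the label."""
    passed = sum(decisions.get(case.name) == case.expected for case in cases)
    return passed / len(cases)


def summarize(cases, runs):
    """Mean and sample stddev of the pass rate over repeated runs."""
    rates = [pass_rate(cases, run) for run in runs]
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread


if __name__ == "__main__":
    runs = [
        {"relative-skill-path": "no_block", "empty-pr-body": "no_required_action"},
        {"relative-skill-path": "block", "empty-pr-body": "no_required_action"},
        {"relative-skill-path": "no_block", "empty-pr-body": "no_required_action"},
    ]
    avg, spread = summarize(CASES, runs)
    print(f"pass rate: {avg:.2f} ± {spread:.2f}")
```

The hard part this sketch sidesteps is the one raised in the Open Questions: turning a free-form agent review into a canonical decision label in the first place.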
Skill quality scales with usage
`pull-request`, `pr-review`, `ticket-intake`, `epic-review` are invoked many times per day across both harnesses.
MX (Model Experience) framing
Per #10137 and the recent MX framing memory: "design agent-facing infrastructure with an MX dimension parallel to DX" and "model-friction captured as tickets via Golden Path = production mechanism." An eval substrate at the skill layer is MX as a development-time discipline, not just a post-hoc capture mechanism.
What Anthropic's skill-creator actually validates (and what doesn't carry over)
.docx/.csv/etc as specified)
`pr-review`/`ticket-intake` is the same failure mode
Open Questions
1. What's the right unit of evaluation for a Neo skill? Anthropic skills ship with an evaluation harness because the skill IS the procedure. Neo skills point at procedures (the directive-redirect pattern). The unit of evaluation is the agent decision after consulting the skill, but that's harder to canonicalize than file output.
[OQ_RESOLUTION_PENDING]
2. Which categories of skill benefit? Different Neo skill archetypes have different eval surfaces:
- `pull-request`, `pr-review`, `ticket-intake`, `epic-review`: agent-decision-shape outputs (verdicts, structured comments). High eval ROI.
- `self-repair`, `debugging-antigravity`, `neural-link`: diagnostic-action outputs (commands run, conclusions reached). Medium eval ROI; canonical output is harder to specify.
- `create-skill`, `ticket-create`, `ideation-sandbox`: produce other artifacts (skill files, tickets, discussions). Eval ROI depends on whether we evaluate the produced artifact or the process by which it was produced.
- `memory-mining`, `tech-debt-radar`, `industry-friction-radar`: produce conceptual analyses. Hardest to canonicalize; subjective evaluation may be required.
Should the eval substrate be uniform across categories, or per-category-shaped?
[OQ_RESOLUTION_PENDING]
3. Are eval sets agent-authored or human-curated? Anthropic's flow has the agent author test prompts and assertions in collaboration with the user. For Neo, where the agent IS often the consumer of the skill, having the agent author its own evals risks circular self-justification. Some independent third-party labeling (different model family, or human spot-check) might be needed for honest scoring.
[OQ_RESOLUTION_PENDING]
4. Where do eval artifacts live? Anthropic puts them in `evals/evals.json` + `<skill-name>-workspace/iteration-N/` next to the skill. For Neo, options:
- `.agent/skills/<name>/evals/`
- `.agent/eval/<skill>/`
- Memory Core (MC)
The MC option aligns with substrate-query-first thinking (#10309) but couples eval discipline to MC availability.
[OQ_RESOLUTION_PENDING]
5. Toolchain choice: adapt or invent? Anthropic ships `eval-viewer/generate_review.py`, blind comparison agents, description-optimization scripts. These work out of the box if we copy them, but they assume Anthropic's skill-as-procedure shape. Adapting them to directive-redirect skills may require rewriting the harness substantially. Net: is this a port-wholesale, learn-and-rewrite, or invent-Neo-native effort?
[OQ_RESOLUTION_PENDING]
6. Cost vs current zero-eval status quo. Skills currently work "well enough." Eval infrastructure is heavy. Real cost surface:
- `claude -p` calls × eval count × iterations × baselines = nontrivial token budget
What's the empirical break-even? Probably skill-by-skill: heavy-use workflow skills justify the cost; reasoning skills may not.
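The cost product in OQ 6 can be worked through with a back-of-the-envelope calculation. The sketch below simply multiplies the factors named above; every number in it (eval count, tokens per call, and so on) is a made-up placeholder, not a measurement from Neo or Anthropic.

```python
def eval_budget(eval_count, iterations, baselines, tokens_per_call):
    """Return (total calls, total tokens) for one skill's eval campaign.

    calls = eval count x iterations x baselines, as in the cost surface above;
    tokens_per_call is an assumed average, not a measured figure.
    """
    calls = eval_count * iterations * baselines
    return calls, calls * tokens_per_call


if __name__ == "__main__":
    # Hypothetical inputs: 20 eval cases, 5 iterations, 2 baselines
    # (e.g. with-skill vs without-skill), 15,000 tokens per call.
    calls, tokens = eval_budget(
        eval_count=20, iterations=5, baselines=2, tokens_per_call=15_000
    )
    print(f"{calls} calls, {tokens:,} tokens")  # 200 calls, 3,000,000 tokens
```

Because the factors multiply, trimming any one of them (fewer baselines, fewer iterations) cuts the budget proportionally, which is one way to size a per-skill pilot before committing to a full substrate.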
[OQ_RESOLUTION_PENDING]
7. Trigger optimization specifically. Anthropic's `run_loop.py` 60/40 train/test description-optimization is the most concrete portable piece: it improves a single field (description) using a single metric (trigger rate) with low human-in-the-loop burden. Could be adopted in isolation, ahead of the rest of the eval substrate. Lowest-cost-highest-immediate-value subset to consider first?
[OQ_RESOLUTION_PENDING]
8. Cross-harness eval portability. Skills run on Claude Code AND Antigravity (Gemini CLI). An eval that passes on one harness may fail on the other due to model differences (Gemini-family vs Claude-family failure modes, per `pr-review §7.2`). Eval infrastructure that doesn't account for cross-harness asymmetry will mis-attribute drift.
[OQ_RESOLUTION_PENDING]
9. Relationship to existing rubrics. `pr-review` already ships decile anchors (§3.1); `epic-review` already ships five-stage gating; `ticket-intake` already ships challenge chains. These ARE qualitative rubrics used in production. Is "eval infrastructure" really a parallel substrate, or is it just systematizing what these rubrics already prescribe with empirical measurement loops? Could be a refactor framing rather than a net-new addition.
[OQ_RESOLUTION_PENDING]
10. Sandman handoff integration. If eval results live in Memory Core (OQ 4), they could surface in `sandman_handoff.md` as a skill-quality signal: "skill X has degraded N% on regression test set since last release." Closes a feedback loop the current substrate doesn't have. But couples eval to MC availability and DreamService cycle timing.
[OQ_RESOLUTION_PENDING]
Per-Domain Graduation Criteria
This Discussion graduates to Epic once:
OQs 3, 4, 6, 8, 10 are implementation-shape decisions naturally folded into the Epic body's AC.
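As one possible implementation shape for OQs 8 and 10, the "skill X has degraded N%" signal could be computed per harness, so that a drop on one model family is not mis-attributed to the skill as a whole. This is a sketch under stated assumptions: the harness keys and pass rates are invented, and the function names are hypothetical rather than part of any existing Neo tooling.

```python
def degradation_pct(baseline_rate, current_rate):
    """Relative drop in pass rate, as a percentage of the baseline rate."""
    if baseline_rate == 0:
        return 0.0  # nothing to degrade from
    return (baseline_rate - current_rate) / baseline_rate * 100


def drift_report(baseline, current):
    """Map harness name -> degradation %, for harnesses present in both runs."""
    return {
        harness: degradation_pct(baseline[harness], current[harness])
        for harness in baseline
        if harness in current
    }


if __name__ == "__main__":
    # Hypothetical pass rates on the same regression set, before and after a release.
    baseline = {"claude-code": 0.9, "antigravity": 0.8}
    current = {"claude-code": 0.9, "antigravity": 0.6}
    for harness, pct in drift_report(baseline, current).items():
        print(f"{harness}: degraded {pct:.1f}% since last release")
```

Keeping the report keyed by harness leaves the attribution question open: a drop on only one harness points at model-family behavior, while a drop on both points at the skill text itself.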
Out of Scope
tech-debt-radar-adjacent), not what this Discussion proposes.
`unit-test` / `whitebox-e2e` / Playwright territory.
Avoided Traps
`run_loop.py` measures triggering (does the skill activate when it should?), which is necessary but not sufficient. Output quality (does the skill produce the right response once it activates?) is a separate axis.
[REJECTED_WITH_RATIONALE] and close cleanly. Filing isn't adoption.
Related
- #10283: Pre-Authoring Adjacency Sweep for Meta-Skills (invocation discipline, adjacent scope; this Discussion = skill-quality measurement, that Issue = pre-flight check enforcement)
- #10281: ideation-sandbox refactor (Progressive Disclosure adherence; precedent for skill-quality-as-architectural-concern)
- #10273: pr-review decile anchor rubric (precedent for explicit skill-shaped rubric)
- #10137: MX (Model Experience) framing; this is MX-as-development-discipline at the skill layer
- #10309: substrate-query-first boot (skill-eval results could surface as boot-time orientation if MC-integrated, OQ 10)
- #10074: blog draft articulating "self-improvement is a protocol rather than a model behavior"; eval substrate is one mechanism that protocol could take
- #10317: recent PR-review session that empirically surfaced two skill-shape failure modes (self-corrections #1 and #2 from Rationale)
- `pr-review §3.1` decile anchors, `epic-review §3` five-stage chain, `ticket-intake` challenge chain: existing qualitative rubrics that may be the foundation for empirical eval rather than parallel substrate (OQ 9)
Retrieval Hint: "Neo skill empirical evaluation eval substrate benchmark trigger-optimization Anthropic skill-creator comparison directive-redirect MX skill-quality measurement"
Origin Session ID: b5a17132-7324-46e1-b73e-038825bb4d55