💡 SkillOpt-Style Self-Improving Skills: Eval-Gated, Human-Reviewed Agent Skill Evolution #572

don-petry · 2026-06-11T20:40:39Z

don-petry
Jun 11, 2026
Maintainer

Summary

Stand up an eval-gated, human-reviewed self-improvement loop for our agent skills and prompts, modeled on Microsoft's open-source SkillOpt pattern. Today every skill and prompt in frameworks/, prompts/, and agents/ is hand-authored markdown that only changes when a human writes a commit — there is no feedback loop turning agent run outcomes back into better skills. The proposal: let agents propose bounded edits to their own skill/prompt markdown, validate each candidate against a held-out eval set, and route the winner through the existing pr-review + release-channel gates as a normal PR. Weight-free (no fine-tuning), fully version-controlled, and human-gated — improvement becomes a reviewable diff, never a silent mutation.

Market Signal

Self-evolving agent skills (not weights) became a distinct, productized category in 2025–2026:

Microsoft SkillOpt (open-source) optimizes agent skills without touching model weights: it makes bounded edits to a markdown skill file, gates every candidate behind a strict held-out validation set, and ships only the winner as best_skill.md — author-reported 52/52 wins (microsoft/SkillOpt, MS Research: "Executive Strategy for Self-Evolving Agent Skills", VentureBeat).
GEPA (reflective prompt evolution) beats RL (GRPO) while using up to ~35× fewer rollouts, and ships inside DSPy as a drop-in optimizer (GEPA paper 2507.19457, gepa-ai/gepa, DSPy GEPA). ACE (Agentic Context Engineering) evolves the agent's "playbook" context the same way (2510.04618).
Anthropic's skill-creator explicitly closes the loop with eval + blind A/B testing to measure and refine Agent Skills — which are themselves composable markdown folders, exactly our format (Improving skill-creator, Agent Skills, anthropics/skills).
Research lineage: Reflexion and Voyager's ever-growing skill library established experience-driven self-improvement (Voyager); 2026 production guidance converges on eval gates + prompt versioning + canary in CI (AI-native CI/CD eval gates, prompt versioning & change management, MLflow: evaluating skills).

Hype filter: fully agent-authored skills remain aspirational. What is shippable today is the SkillOpt/skill-creator shape — agent proposes a bounded edit, an offline eval gate decides, a human reviews the PR (GitHub: reviewing agent PRs).

User Signal

This org already has every prerequisite except the loop itself:

Markdown skills under version control: frameworks/bmad-method/ (25+ skills), prompts/*.md (triage, deep-review, synthesize, dev-lead phases), agents/*.md — SkillOpt's exact input format.
A human-gate pipeline that's idle for this: Ideas → idea:approved → initiative-planner → epic/DAG → dev-lead → pr-review. We can run skill improvements through it unchanged.
Guardrail alignment: the Safe Release Strategy (Initiative: Safe Release Strategy for Agentic Workflows (versioning · rings · canary) #495) already blocks agents from moving their own release tags. A weight-free, PR-gated improvement loop respects that boundary by construction — no self-promotion, only proposed diffs.
Conspicuous gap: no feedback loop exists today. Idea approval rates, initiative completion, pr-review human-override rates, and Token Cost Observatory spend are all captured but never fed back into the skills; any of them could seed an eval set.

Shadow-mode dual-run (Idea 566) and health-gated promotion (#501) are the natural enforcement layer for rolling out an improved skill safely.

Technical Opportunity

A minimal, weight-free loop reusing what we already run:

Eval harness (the keystone): a skills/evals/<skill>/cases.jsonl held-out set per high-traffic skill (start with prompts/triage.md and prompts/deep-review.md), scored by a deterministic check or a Haiku-tier LLM-judge. Without a gate there can be no loop, so this lands first.
Proposer: a scheduled agent reads recent run outcomes (override rates, escalations, failures) and emits bounded candidate edits to one skill file — diff-only, size-capped, à la SkillOpt's edit budget.
Validation gate: each candidate runs against the held-out set; it must strictly beat the incumbent (no ties promoted) before anything is opened. Reward-hacking guard: eval cases are versioned separately and never edited by the proposer.
Human gate: the winner is opened as a normal PR (or an Idea here for larger skills), reviewed by pr-review + a human CODEOWNER — identical to any other change.
Safe rollout: merged skill rides shadow-mode dual-run (Idea 566) / health-gated promotion ([Phase 2] Replace PUT-contents clobber deploy with versioned, ring-staged, health-gated promotion #501) before becoming stable.
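To make the eval-harness step concrete, here is a minimal sketch of what a cases.jsonl reader and a deterministic exact-match scorer might look like. Everything specific in it is an assumption for illustration: the field names (id, input, expected), the sample labels, and the run_skill callable that stands in for "invoke the skill on one input" are hypothetical, not an agreed format.

```python
import json
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    input_text: str
    expected_label: str


def parse_cases(lines: Iterable[str]) -> list[EvalCase]:
    """Parse JSONL: one JSON object per non-blank line (hypothetical schema)."""
    cases = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        cases.append(EvalCase(row["id"], row["input"], row["expected"]))
    return cases


def score(cases: list[EvalCase], run_skill: Callable[[str], str]) -> float:
    """Deterministic scorer: fraction of cases whose output matches exactly."""
    if not cases:
        raise ValueError("empty eval set")
    hits = sum(
        1 for c in cases if run_skill(c.input_text).strip() == c.expected_label
    )
    return hits / len(cases)


# Two made-up cases, only to show the shape of the file.
SAMPLE_JSONL = """
{"id": "t1", "input": "CI is red on main", "expected": "bug"}
{"id": "t2", "input": "Add dark mode", "expected": "feature"}
"""

if __name__ == "__main__":
    sample_cases = parse_cases(SAMPLE_JSONL.splitlines())
    # A stub "skill" that always answers "bug" gets one of two cases right.
    print(score(sample_cases, lambda text: "bug"))  # 0.5
```

Exact match only suits skills with a closed label set such as triage; a free-text skill like deep-review would need the LLM-judge scorer mentioned above instead.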

This is additive — no changes to the release-tag ruleset, no agent self-promotion, no model fine-tuning.
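The proposer's "diff-only, size-capped" constraint from the list above could be enforced mechanically before a candidate ever reaches the eval gate. The sketch below uses Python's standard difflib to count changed lines; the function names and the default budget of 10 lines are hypothetical, and SkillOpt's actual edit-budget mechanism may work differently.

```python
import difflib


def changed_line_count(old: str, new: str) -> int:
    """Count lines touched between two versions of a skill file.

    Each non-equal opcode contributes the larger side of the change, so a
    one-line replacement counts as 1 rather than 2.
    """
    a, b = old.splitlines(), new.splitlines()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += max(i2 - i1, j2 - j1)
    return changed


def within_budget(old: str, new: str, max_changed_lines: int = 10) -> bool:
    """Accept only non-empty edits at or under the (assumed) line budget."""
    changed = changed_line_count(old, new)
    return 0 < changed <= max_changed_lines
```

Rejecting zero-change candidates keeps no-op proposals from consuming eval runs.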

Assessment

Dimension	Score	Rationale
Feasibility	med	The hard part is building a trustworthy held-out eval set per skill; the loop plumbing reuses the existing Ideas → initiative → pr-review pipeline and weight-free markdown edits. Start with one skill end-to-end.
Impact	high	Converts unused outcome signals (override rates, approvals, cost) into compounding skill quality; SkillOpt and GEPA both report double-digit gains in their authors' own evaluations, and the loop scales across every agent in the org.
Urgency	med	No fire today, but skill quality is hand-tuned and static; the longer we run agents without a feedback loop, the more captured signal we waste. Sequence after eval-harness groundwork.

Adversarial Review

Strongest objection: This is a reward-hacking and drift magnet. An agent optimizing its own skill against a metric will learn to game the eval rather than genuinely improve, and a self-modifying skill loop is exactly the kind of "agent edits its own infrastructure" behavior the org deliberately fenced off with the release-channel ruleset.

Rebuttal: The design neutralizes both. (1) Reward hacking is bounded because the eval set is version-controlled separately and never writable by the proposer — the agent cannot edit the test, only the skill, and a strict-improvement gate rejects ties/regressions; SkillOpt's held-out-validation discipline and Anthropic's blind-A/B exist precisely for this. (2) Self-modification is not what's proposed — the agent never moves a tag or merges anything; it opens a diff that goes through the identical human + pr-review + promotion gates as any contributor's PR. It is strictly more conservative than today, where a human can hand-edit a skill with no eval gate at all. The genuine cost is building and maintaining honest eval sets — which is why the first deliverable is the eval harness alone, with the loop gated on it proving stable.
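The two guards in the rebuttal (the proposer cannot touch the eval set, and ties or regressions are rejected) are simple enough to sketch. This is an illustration under assumptions, not the planned implementation: EVAL_ROOT uses the skills/evals path from the idea text, which is still an open naming question, and the gate function and its return shape are hypothetical.

```python
from pathlib import PurePosixPath

# Assumed location of the held-out sets; the final directory is undecided.
EVAL_ROOT = PurePosixPath("skills/evals")


def touches_eval_set(changed_paths: list[str]) -> bool:
    """True if any changed file lives under the eval directory."""
    return any(EVAL_ROOT in PurePosixPath(p).parents for p in changed_paths)


def gate(
    incumbent_score: float, candidate_score: float, changed_paths: list[str]
) -> tuple[bool, str]:
    """Decide whether a candidate may be opened as a PR.

    The eval-set check runs first so that a candidate which edits its own
    test is rejected no matter how well it scores.
    """
    if touches_eval_set(changed_paths):
        return False, "candidate modifies the held-out eval set"
    if candidate_score > incumbent_score:
        return True, "strict improvement"
    return False, "tie or regression"
```

A path check like this only covers the diff itself; the stronger guarantee described above comes from the proposer having no write access to the eval repository or directory at all.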

Suggested Next Step

Pilot on a single high-traffic skill (prompts/triage.md). Deliverable 1: a versioned held-out eval set + scorer wired into CI as a non-blocking report (no loop yet). Deliverable 2: a manual "propose → validate → PR" dry run by a human operator to prove the gate rejects regressions. Only then automate the proposer. Gate the rollout layer on Safe Release Strategy Phase 2 (#501) and shadow-mode dual-run (Idea 566).

2026-06-11T22:57:31Z

github-actions[bot]
Bot Jun 11, 2026

📋 Initiative planned by the BMAD Scrum Master (Bob).

Epic #581 — Initiative: Eval-gated, human-reviewed self-improving skills (SkillOpt-style)

6 stories created (inert — labelled initiative, NOT initiative:auto):

#582 (M): [Phase 1] Define held-out eval-case format + seed prompts/triage.md case set
#583 (M): [Phase 1] Build the deterministic eval scorer + offline tests (triage)
#584 (S): [Phase 1] Wire the eval scorer into CI as a non-blocking report
#585 (M): [Phase 2] Extend the harness to prompts/deep-review.md via an LLM-judge scorer
#586 (M): [Phase 2] Strict-improvement gate + manual propose->validate->PR runbook with regression-rejection proof
#587 (L): [Phase 3] Automate the scheduled proposer behind the safe-rollout gates

Open questions for review:

Directory naming: the idea specifies skills/evals/<skill>/cases.jsonl, but this repo's skills live under prompts/ and frameworks/ (there is no skills/ dir today). Confirm the location (skills/evals/, evals/, or prompts/evals/) before Story 1 creates the tree.
The safe-rollout layer depends on shadow-mode dual-run (Discussion #566, "Shadow Mode Dual-Run for Safe Agent Canary Validation"), which is not yet an epic/issue, so Story 6's activation prerequisite is only partially tracked (by #501, "[Phase 2] Replace PUT-contents clobber deploy with versioned, ring-staged, health-gated promotion"). Should #566 be promoted to an initiative before Phase 3 starts?
Held-out set sizing/sourcing: how many cases per skill, and may we seed them from real historical triage/deep-review decisions (and how should we de-identify PR content)? This directly affects how much trust the gate's verdict earns.
This initiative's Phase 3 rollout overlaps the Safe Release Strategy epic (#495) and its Phase 2 story (#501). Confirm Phase 3 should consume that ring/health-gating machinery rather than re-implement any of it.
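On the sizing question above, one rough way to see how case count affects trust in the gate's verdict is to put a confidence interval around an observed pass rate. The sketch below computes a Wilson score interval; the choice of this particular interval and the 95% level (z = 1.96) are assumptions for illustration, not something the thread has settled on.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate of `passes` out of `n` cases."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p = passes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Same 90% pass rate, very different certainty.
    for passes, n in [(18, 20), (180, 200)]:
        low, high = wilson_interval(passes, n)
        print(f"{passes}/{n}: [{low:.3f}, {high:.3f}]")
```

With a small set, the intervals of incumbent and candidate overlap heavily, so a "strict improvement" of one or two cases may be noise; that is an argument for either a larger held-out set or a minimum winning margin.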

Review the epic and its sub-issue DAG, adjust as needed, then add initiative:auto to epic #581 to hand it to initiative-driver for auto-implementation.
