Skill-eval pipeline — operational review (2026-06-26) + proposal: raise pipeline-review improvements as Discussions #951

don-petry · 2026-06-26T22:03:19Z

don-petry
Jun 26, 2026
Maintainer

Operational review of the self-improving-skills pipeline (epic #581, Discussion #572), 14-day window ending 2026-06-26. Report-only: this review informs and does not gate any merge.

TL;DR

The pipeline is operationally healthy with no open regression or scorer error. The latest triage run passed, all trackers are recovered/closed, and the #920 throttle-as-regression gap is now fixed in code. The remaining findings are coverage/hygiene gaps, not skill defects. This Discussion also proposes a small process change: raise pipeline-review improvements here as Discussions, rather than only as transient action_items in the review JSON.

Health summary (last 14 days)

16 runs of the Skill Eval Report workflow, all successful at the workflow level (the workflow is non-blocking by design).
Zero open trackers — all four eval-health issues are closed; no eval-infra issues exist.
Three triage regression trackers were raised and all recovered:

| Tracker | Score | `got` pattern | Classification | Outcome |
| --- | --- | --- | --- | --- |
| #762 | 0/5 | all `null` | infra throttle false alarm | recovered |
| #814 | 0.4 (2/5) | non-null mismatches | real regression (since fixed) | recovered ✅ |
| #911 | 0/5 | all `null`, 25s run | infra throttle false alarm (the #920 live case) | recovered ✅ |

Only #814 was a genuine behavioral regression. #762 and #911 were throttled runs mis-scored as regressions: every `got` value was `null`, and the runs were implausibly short (#911 took 25s against roughly 60–90s for a healthy run).
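The distinction between a throttled run and a real regression can be sketched as a small heuristic. This is a minimal illustration, not the pipeline's actual scorer: the function name, the input shape, and the 45-second floor are all assumptions (the review only states that healthy runs take roughly 60–90s and that #911 ran for 25s).

```python
from typing import Optional, Sequence

# Assumed cut-off: the review reports ~60-90s for healthy runs and 25s for the
# throttled #911 run. 45s is a hypothetical value, not one used by the pipeline.
MIN_PLAUSIBLE_SECONDS = 45.0


def classify_run(gots: Sequence[Optional[str]],
                 expected: Sequence[str],
                 duration_seconds: float) -> str:
    """Label a scored run as 'pass', 'regression', or 'infra-throttle'."""
    if len(gots) != len(expected):
        raise ValueError("gots and expected must have the same length")

    # Every case came back empty: the engine most likely never answered.
    all_null = len(gots) > 0 and all(g is None for g in gots)
    if all_null and duration_seconds < MIN_PLAUSIBLE_SECONDS:
        return "infra-throttle"

    if all(g == e for g, e in zip(gots, expected)):
        return "pass"

    # Conservative default: an all-null run of normal length is still
    # reported as a regression so that it gets a human look.
    return "regression"
```

With these assumptions, a run shaped like #911 (`classify_run([None] * 5, ["a"] * 5, 25.0)`) is labelled `"infra-throttle"`, while a run with non-null mismatches like #814 is labelled `"regression"`.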

Positive: the #920 throttle-misclassification gap is fixed in code. run-eval.sh now captures the engine exit code and, when every case fails for infrastructure reasons, exits 2 (outcome=error); notify-eval-health.sh routes that outcome to a separate eval-infra tracker instead of raising a false eval-health regression.
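The routing described here amounts to a two-step mapping, shown below as a minimal sketch. Only the mapping from exit code 2 to outcome=error and on to the eval-infra tracker comes from the review; the meanings of exit codes 0 and 1, the handling of unknown codes, and both function names are assumptions made for illustration.

```python
from typing import Optional

# Only 2 -> "error" is stated in the review; 0 and 1 are assumed meanings.
EXIT_CODE_TO_OUTCOME = {0: "pass", 1: "regression", 2: "error"}

# Tracker label per outcome; None means no tracker issue is raised.
OUTCOME_TO_LABEL = {
    "pass": None,
    "regression": "eval-health",
    "error": "eval-infra",
}


def outcome_for_exit_code(code: int) -> str:
    # Assumption: unknown exit codes are treated as infrastructure errors so
    # they can never be mis-scored as a skill regression.
    return EXIT_CODE_TO_OUTCOME.get(code, "error")


def tracker_label(outcome: str) -> Optional[str]:
    return OUTCOME_TO_LABEL[outcome]
```

Keeping the two tables separate means a new outcome only needs one new entry in each, and an infrastructure failure cannot reach the eval-health label by accident.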

Improvements identified

| # | Priority | Improvement | Why |
| --- | --- | --- | --- |
| 1 | medium | Score `deep-review` on the schedule (or document it as dispatch-only) | It has a 5-case holdout + a prompt, but `skill-eval-report.yml` hardcodes `SKILL=triage` on cron, so its holdout is never scored on a schedule. |
| 2 | low | Resolve `lsp-pilot`'s definition gap | 6-case holdout exists, but there's no `prompts/lsp-pilot.md` and no scheduled run; wire it and add a prompt, or mark it experimental. |
| 3 | low | Confirm `example-skill` exclusion | It's the `evals/README` fixture (no prompt, no refs); confirm it's intentionally not scored so it isn't mistaken for an unmonitored skill. |
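The three findings in this table follow one pattern: a skill has a holdout but is missing a prompt, a schedule, or an explicit exclusion. A coverage audit along those lines might look like the sketch below. The `Skill` record, the `fixture` flag, and the function name are hypothetical, and the holdout count of 0 for `example-skill` is an assumption (the review does not state it).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Skill:
    name: str
    holdout_cases: int
    has_prompt: bool
    scheduled: bool
    fixture: bool = False  # intentionally unscored, e.g. a README example


def coverage_gaps(skills: List[Skill]) -> List[str]:
    """Return one human-readable line per coverage or definition gap."""
    gaps: List[str] = []
    for s in skills:
        if s.fixture:
            continue  # explicitly excluded, so not an unmonitored skill
        if s.holdout_cases > 0 and not s.has_prompt:
            gaps.append(f"{s.name}: holdout exists but no prompt")
        if s.holdout_cases > 0 and not s.scheduled:
            gaps.append(f"{s.name}: holdout never scored on a schedule")
    return gaps


# Values taken from the review, except example-skill's holdout count (assumed).
SKILLS = [
    Skill("triage", 5, has_prompt=True, scheduled=True),
    Skill("deep-review", 5, has_prompt=True, scheduled=False),
    Skill("lsp-pilot", 6, has_prompt=False, scheduled=False),
    Skill("example-skill", 0, has_prompt=False, scheduled=False, fixture=True),
]
```

Marking the fixture explicitly, rather than inferring it from missing files, is what keeps `example-skill` from showing up as a gap.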

Minor notes (no action): #762 was closed without a recovery comment (manual closure); the strict-improvement gate (gate.sh/review-skill.sh) is present in scripts/evals/ but not yet wired to any workflow.

Proposal: raise pipeline-review improvements as Discussions

Today, improvements surface in two ways: routine eval regressions auto-raise eval-health/eval-infra tracker issues, and the operational review emits action_items in its JSON (each tagged suggested_issue_labels: ["dev-lead"]) for a human to file. The review JSON is transient and not easily browsable over time.

Proposed: each operational-review window posts (or updates) a Discussion like this one as the durable home for its findings and improvement ideas. Concrete skill_fix items still spawn dev-lead issues for execution; the Discussion is the running, browsable record and the place to debate process changes. This keeps epic #581's improvement loop visible without adding merge-blocking noise.

2026-06-27T13:47:51Z

github-actions[bot]
Bot Jun 27, 2026

Enhancement: Operational Review → Discussions — Prior Art, Options & Durability Strategy

Sharpened problem & goal
The proposal correctly identifies a gap: operational-review improvements surface as transient action_items in JSON, invisible after the run ends. The sharper question is: what is the right information architecture so that recurring pipeline findings are durable, browsable, and feed back into the improvement loop — without creating a parallel tracking burden alongside the existing issue system? The risk is not just transience but also sprawl: a new Discussion per review window quickly creates a wall of threads that nobody reads.

Context & existing infrastructure

scripts/evals/notify-eval-health.sh already bridges evals → Issues: it creates/updates de-duplicated tracker issues keyed on eval-health / eval-infra labels with stable per-skill titles. This is the "concrete fix" execution path.
The self-improving-skills pipeline (epic #581, Discussion #572) has a mature eval loop: run-eval.sh (scorer), gate.sh / review-skill.sh (strict-improvement gate), and notify-eval-health.sh (visibility). The gap is specifically at the meta level: "how should we improve the pipeline itself."
The operational review (#951) identifies three improvement categories: coverage gaps (deep-review unscored on cron), definition gaps (lsp-pilot missing prompt), and housekeeping (example-skill exclusion). These are advisory: they need discussion, not a sprint task.

Prior art / competitive signal

| Pattern | Example | Durable? | Threaded? | Automated? |
| --- | --- | --- | --- | --- |
| ADR (Architecture Decision Records) | `docs/decisions/*.md` in repo | Yes (files) | No (PR comments only) | Emerging (AI-generated ADRs) |
| SRE post-mortem automation | Rootly, incident.io, Jeli (acquired by PagerDuty), FireHydrant (acquired by Freshworks Dec 2025) | Yes | Yes (Slack threads → doc) | Yes |
| GitHub Discussions for advisory findings | Issue → Discussion conversion pattern; GitHub natively supports this | Yes | Yes (native threading) | Yes (full GraphQL CRUD) |
| Rolling knowledge page | Confluence / wiki with per-cycle sections | Yes | Limited | Manual typically |
| GitHub Projects for meta-tracking | Board view with review-window columns | Yes | No | Partial |

Key insight from SRE tooling: teams that automate the documentation layer (Rootly, incident.io) free engineers to focus on analysis. This proposal does exactly that — automating the capture of review findings into a durable Discussion.

Options table

| Option | Description | Pros | Cons |
| --- | --- | --- | --- |
| A. One Discussion per review window | Each operational review creates a new Discussion titled "Skill-eval review - YYYY-MM-DD" | Clean separation per cycle; easy to search by date | Sprawl risk at weekly cadence (~50/year); stale threads accumulate |
| B. Rolling Discussion (append-only) | One Discussion per skill (e.g., "triage eval improvements"), updated with a new comment per review window | Fewer threads; history in one place; mirrors the de-dup pattern in `notify-eval-health.sh` | Long threads become hard to navigate; GitHub caps comment rendering |
| C. ADR-style files + Discussion index | Write `docs/pipeline-reviews/YYYY-MM-DD.md` in repo, link from a pinned index Discussion | Fully version-controlled; searchable via grep; review via PR | Highest friction; no native threading; CI step needed to render |

Recommendation: Option B — rolling Discussion per skill — best fits the existing de-dup-by-label pattern already used by notify-eval-health.sh. One Discussion per skill stays browsable; the review automation appends a timestamped comment per cycle. This avoids sprawl while preserving threading for team dialogue on each finding.

Risks / unknowns

| Risk | Severity | Mitigation |
| --- | --- | --- |
| Discoverability (GitHub Discussions search is keyword-only, no `is:discussion` in Advanced Search) | High | Pin the per-skill Discussion; use consistent title convention; cross-link from eval-health issues |
| Discussion sprawl if scoped too broadly | Medium | One Discussion per skill, not per review window; strict category ("Pipeline Review" or "Ops Review") |
| Dual tracking overhead (Discussions + Issues) | Medium | Clear rule: Discussions = findings + context; Issues = assigned work with deadlines. Use Discussion→Issue conversion for items that graduate |
| Notification fatigue | Low | Use Announcement category format (only maintainers can post new threads); comments notify only subscribers |
| Stale findings that never get acted on | Medium | Add a "days since last action" field to the review comment; after 90 days with no linked Issue, recommend archival |
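The stale-findings mitigation reduces to a date difference and a threshold. The sketch below is one way to compute it. The function names are hypothetical, and reading "after 90 days" as "90 days or more" is an assumption.

```python
from datetime import date
from typing import Sequence

ARCHIVE_AFTER_DAYS = 90  # threshold taken from the mitigation column


def days_since(last_action: date, today: date) -> int:
    """Value for the 'days since last action' field in the review comment."""
    return (today - last_action).days


def should_recommend_archival(last_action: date,
                              today: date,
                              linked_issues: Sequence[int]) -> bool:
    # A finding with any linked Issue is considered acted on and is never
    # recommended for archival by this rule.
    if linked_issues:
        return False
    return days_since(last_action, today) >= ARCHIVE_AFTER_DAYS
```

For a finding last touched on 2026-06-26 with no linked Issue, the rule first fires on 2026-09-24, which is exactly 90 days later.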

Phased pilot path

Phase 1 (next review cycle): Manually create one Discussion for triage skill improvements. Post this review's 3 findings as the first comment. Observe team engagement for 2 review cycles.
Phase 2: Automate. Extend the operational-review script to upsert a Discussion per skill via comment_on_discussion (the discussion-mutations.sh helper already exists). Include a structured template: What Changed / Impact / Recommendation / Linked Issues.
Phase 3: Add a "Pipeline Review" Discussion category with Announcement format. Pin the per-skill Discussions. Cross-link from notify-eval-health.sh tracker issues.
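The structured template named in Phase 2 could be rendered along these lines. This is a sketch only: the function name, the heading level, the "Review window ending" line, and the "none yet" placeholder are assumptions. Only the four section names come from the proposal.

```python
from typing import Sequence

# Section names as listed in Phase 2 of the pilot path.
SECTIONS = ("What Changed", "Impact", "Recommendation", "Linked Issues")


def render_review_comment(window_end: str,
                          what_changed: str,
                          impact: str,
                          recommendation: str,
                          linked_issues: Sequence[int]) -> str:
    """Build the body of one per-cycle comment on a skill's rolling Discussion."""
    issues = ", ".join(f"#{n}" for n in linked_issues) or "none yet"
    values = (what_changed, impact, recommendation, issues)

    lines = [f"Review window ending {window_end}", ""]
    for heading, value in zip(SECTIONS, values):
        lines += [f"### {heading}", value, ""]
    return "\n".join(lines).rstrip() + "\n"
```

The returned string is what an automation step would hand to the existing comment helper. How that helper is invoked is outside this sketch.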

Success metrics

Improvement closure rate: ≥60% of identified improvements linked to a closed Issue within 30 days
Time-to-discovery: ≤5 minutes for any team member to find the history of a specific skill's improvement findings
No net increase in total open Issues (Discussions absorb the advisory load that would otherwise become stale Issues)
Team engagement: ≥1 threaded reply per Discussion per quarter (the Discussion is being read, not just written to)
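The first metric is the only one with a precise formula, so it is the easiest to automate. A minimal sketch follows, assuming each improvement is recorded as a pair of dates. The record shape and function names are hypothetical, while the 30-day window and 60% target come from the list above.

```python
from datetime import date
from typing import Optional, Sequence, Tuple

# (identified_on, closed_on); closed_on is None while no linked Issue is closed.
Improvement = Tuple[date, Optional[date]]


def closure_rate(improvements: Sequence[Improvement],
                 window_days: int = 30) -> float:
    """Fraction of improvements whose linked Issue closed within the window."""
    if not improvements:
        return 0.0
    closed_in_time = sum(
        1
        for identified, closed in improvements
        if closed is not None and (closed - identified).days <= window_days
    )
    return closed_in_time / len(improvements)


def meets_target(rate: float, target: float = 0.60) -> bool:
    return rate >= target
```

An improvement closed after the window counts against the rate in the same way as one that is still open, which matches the "within 30 days" wording.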


2026-06-27T15:23:09Z

github-actions[bot]
Bot Jun 27, 2026

Sharpened problem & goal

The proposal is to route pipeline-review improvement findings into durable GitHub Discussions rather than leaving them as transient action items embedded in the operational review body. Worth clarifying before implementing: should these Discussions use the existing Ideas category (so they flow through idea-triage → idea-enhancer → initiative-planner automatically), or does a dedicated "Pipeline Improvements" category better prevent signal pollution and keep the Ideas queue focused on product-facing features?

Context

The org already has a full idea-lifecycle pipeline (idea-triage.yml → idea-enhancer.yml → initiative-planner.yml) that converts Ideas Discussions into planned epics with sub-issue DAGs. Routing eval improvement proposals through Ideas would give them that lifecycle for free. However, scripts/idea-enhancer/gather-candidates.sh skips bot-authored Discussions — so if the eval pipeline auto-posts improvement Discussions, they'd need either a human author or a deliberate opt-in mechanism to enter the enhancement flow.

One important constraint: evals/ is protected by the holdout-guard (holdout-guard.yml / scripts/lib/holdout-guard.sh), which fails any PR by the automated skill-proposer that touches held-out paths. The Discussion-posting approach doesn't touch evals/ directly, but any resulting plan must not allow the automated proposer to route suggestions back through itself in a way that contaminates the holdout.

Posting a Discussion is already a thin gh api graphql createDiscussion call — no new infrastructure needed.
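For reference, the request behind such a call can be assembled without any network access. The sketch below builds the JSON body for GitHub's `createDiscussion` GraphQL mutation. The function name is hypothetical, and the repository and category node IDs passed in are placeholders that a real caller would have to look up first.

```python
import json

# GraphQL document for GitHub's createDiscussion mutation.
CREATE_DISCUSSION = """
mutation($repositoryId: ID!, $categoryId: ID!, $title: String!, $body: String!) {
  createDiscussion(input: {
    repositoryId: $repositoryId,
    categoryId: $categoryId,
    title: $title,
    body: $body
  }) {
    discussion { id url }
  }
}
"""


def build_create_discussion_request(repository_id: str,
                                    category_id: str,
                                    title: str,
                                    body: str) -> str:
    """Return the JSON request body; sending it is left to the caller."""
    return json.dumps({
        "query": CREATE_DISCUSSION,
        "variables": {
            "repositoryId": repository_id,
            "categoryId": category_id,
            "title": title,
            "body": body,
        },
    })
```

The resulting string could then be sent to the GraphQL endpoint, for example by piping it to `gh api graphql --input -`. Keeping construction separate from sending makes the payload easy to test.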

Impact / Effort


| Dimension | Assessment |
| --- | --- |
| Impact | MEDIUM: eval learnings become searchable, linkable, and plan-ready without any process overhead; value compounds over the pipeline's lifetime |
| Effort | S: the posting mechanism already exists; a small `scripts/evals/post-improvement.sh` with a dedup marker is the main deliverable |

Suggested acceptance criteria

Decide and document which Discussion category improvement proposals use and whether they should flow through idea-triage/enhancer or be kept separate (update scripts/evals/README.md or equivalent).
Clarify whether auto-posted improvement Discussions carry a bot-authored marker (to suppress enhancer treatment per gather-candidates.sh's bot-skip logic) or intentionally flow through the pipeline.
scripts/evals/post-improvement.sh (or equivalent) posts one Discussion per actionable finding from an eval run, with an idempotency marker to prevent duplicate posts across re-runs.
At least one existing improvement identified in this review is trialled as a Discussion to validate the full flow end-to-end before the process is formalized.
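The idempotency marker in the third criterion could be as simple as a hidden HTML comment derived from the skill and finding. The sketch below assumes that approach. The marker format, the `eval-improvement` prefix, and both function names are hypothetical, and fetching the existing Discussion bodies is left out.

```python
import hashlib
from typing import Iterable


def idempotency_marker(skill: str, finding_id: str) -> str:
    """Stable marker for one finding; an HTML comment is hidden when rendered."""
    digest = hashlib.sha256(f"{skill}:{finding_id}".encode("utf-8")).hexdigest()
    return f"<!-- eval-improvement:{digest[:16]} -->"


def already_posted(marker: str, existing_bodies: Iterable[str]) -> bool:
    """True if any existing Discussion body already carries the marker."""
    return any(marker in body for body in existing_bodies)
```

A poster would append the marker to the Discussion body on first post and skip posting whenever `already_posted` returns true, so re-running the same eval does not create duplicates.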
