Release v0.3.2 · light-design-tw/a11y-moda

Description-only patch with no code change. Rewrites the description
field of the bundled Claude Code skill for higher trigger accuracy
under real-invocation conditions. Validated by a benchmark of 3 runs ×
30 prompts, using cold-context `claude -p` invocations against a real
skill installed in ~/.claude/skills/a11y-moda/.

Changed

Claude Code skill description rewritten in the frontmatter of
src/a11y_moda/_examples/claude-code-skill/SKILL.md.
To upgrade an existing install, re-run `a11y-moda init claude-code --force`.
The skill body (sections 1-12, REFERENCE.md) is unchanged.
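
A minimal sketch for confirming that the upgrade replaced the description. It assumes the installed file is named SKILL.md inside ~/.claude/skills/a11y-moda/ and that its frontmatter is fenced by `---` lines; the helper names `extract_frontmatter` and `upgrade_and_show` are hypothetical and not part of a11y-moda.

```python
import subprocess
from pathlib import Path
from typing import Optional

# Directory taken from these release notes; the SKILL.md file name inside it
# is an assumption based on the example path in the repository.
INSTALLED_SKILL = Path.home() / ".claude" / "skills" / "a11y-moda" / "SKILL.md"


def extract_frontmatter(text: str) -> Optional[str]:
    """Return the raw text between the leading '---' fences, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return "\n".join(lines[1:i])
    return None  # opening fence without a closing one


def upgrade_and_show() -> None:
    """Re-run the init command from these notes, then print the frontmatter."""
    subprocess.run(["a11y-moda", "init", "claude-code", "--force"], check=True)
    print(extract_frontmatter(INSTALLED_SKILL.read_text(encoding="utf-8")))
```

Calling `upgrade_and_show()` prints the raw frontmatter so the new description can be checked by eye. The sketch deliberately avoids parsing YAML, since the description may span several lines.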

Benchmark (real `claude -p`, n=3 × 30 prompts = 90 runs)

Tier	v0.3.1 stock	v0.3.2 rewrite	Δ
1 — must trigger (10)	6/10 (60%)	9/10 (93%)	+3
2 — should trigger (10)	0/10 (0%)	8/10 (77%)	+8
3 — must NOT trigger (10)	10/10 (0% FP)	10/10 (0% FP)	0
Total	16/30	27/30	+11

Overall pass rate is 27/30 (90%), with no false-positive regression in Tier 3.
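
The totals can be re-derived from the per-tier counts in the table. The sketch below only repeats that arithmetic; the names `TIERS` and `summarize` are illustrative, and the per-run percentages shown in the table are not modeled here.

```python
# Per-tier pass counts copied from the table: (v0.3.1, v0.3.2), each out of 10.
TIERS = {
    "1 - must trigger": (6, 9),
    "2 - should trigger": (0, 8),
    "3 - must NOT trigger": (10, 10),
}
PROMPTS_PER_TIER = 10


def summarize(tiers):
    """Sum pass counts across tiers and compute the overall pass rate."""
    old = sum(before for before, _ in tiers.values())
    new = sum(after for _, after in tiers.values())
    total = PROMPTS_PER_TIER * len(tiers)
    return {
        "old": old,
        "new": new,
        "total": total,
        "delta": new - old,
        "pass_rate": new / total,
    }


summary = summarize(TIERS)
print(f"{summary['old']}/{summary['total']} -> {summary['new']}/{summary['total']} "
      f"(+{summary['delta']}, {summary['pass_rate']:.0%})")
```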

Why

The v0.3.1 description listed trigger phrases but did not cover implicit
a11y pain (keyboard 不到, i.e. "the keyboard can't reach it" / contrast /
dialog ESC), and it did not address Claude's bias to (a) answer
rule-content questions from training memory or (b) bypass the skill and
invoke a11y-moda directly via Bash when the user names the CLI.

The v0.3.2 rewrite addresses both:

- Explicit anti-pattern: "Do NOT answer from memory or run a11y-moda
  directly via Bash"
- Six numbered invoke clauses covering rule_id (with prefix pattern),
  CLI mention by name, a11y pain phrases, MODA / WCAG asks, pre-write
  element list, and vague target phrasings
- "Claude does not know MODA rule content, must look up" forces
  invocation on rule_id queries that previously bypassed the skill (e.g.
  "HM1110100C 怎麼修?", i.e. "How do I fix HM1110100C?")

Notes

- The three remaining benchmark misses (Tier 1 #4; Tier 2 #11 and #19)
  are structural: Claude has a strong direct-answer bias for
  design-feedback phrasings ("設計師說背景跟字太接近", i.e. "the designer
  says the background and the text are too close") and for UX-pattern
  questions framed as general best practice. Adding more keywords
  beyond the v0.3.2 rewrite was estimated to break Tier 3's perfect
  no-false-positive score, with diminishing returns. 90% / 0 FP is the
  ship threshold.
- Methodology and raw data live in the contributor scratch dir
  (.scratch/, gitignored): eval_set.json (30 prompts),
  run_trigger_eval_win.py (Windows-friendly streaming-event
  detection runner), and per-version raw JSON results.
- The benchmark was performed on claude-opus-4-7[1m]; trigger behavior
  may vary on other models. Real-conversation triggering tends to be
  higher than cold `claude -p` because conversation context primes
  skill-relevance scoring. Treat the 93% / 77% as a conservative
  floor.
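
For readers without access to .scratch/, the following is a rough sketch of what streaming-event trigger detection could look like. It is not the project's run_trigger_eval_win.py. It assumes that `claude -p ... --output-format stream-json --verbose` emits one JSON object per line, and that a skill invocation appears as a `tool_use` block named `Skill` whose input mentions the skill name; both the event shape and the helper names `skill_triggered` and `run_prompt` are assumptions to verify against real output.

```python
import json
import subprocess
from typing import Iterable


def skill_triggered(event_lines: Iterable[str], skill_name: str) -> bool:
    """Return True if any assistant event holds a Skill tool call naming skill_name."""
    for raw in event_lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip non-JSON noise in the stream
        if not isinstance(event, dict) or event.get("type") != "assistant":
            continue
        message = event.get("message")
        content = message.get("content", []) if isinstance(message, dict) else []
        if not isinstance(content, list):
            continue
        for block in content:
            if (
                isinstance(block, dict)
                and block.get("type") == "tool_use"
                and block.get("name") == "Skill"  # assumed tool name
                and skill_name in json.dumps(block.get("input", {}))
            ):
                return True
    return False


def run_prompt(prompt: str, skill_name: str = "a11y-moda") -> bool:
    """Run one cold-context prompt and report whether the skill fired."""
    proc = subprocess.run(
        ["claude", "-p", prompt, "--output-format", "stream-json", "--verbose"],
        capture_output=True, text=True, encoding="utf-8", check=False,
    )
    return skill_triggered(proc.stdout.splitlines(), skill_name)
```

Matching on the tool name rather than on the skill name alone matters here: a Bash call that runs a11y-moda directly is exactly the bypass this release targets, so it should count as a miss, not a trigger.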
