Migrated from spboyer/waza#66
Waza Skills Development Platform
This is the tracking issue for the waza platform implementation based on the PRD and Squad Proposal.
Primary Phase (Core Features)
E1: Go CLI Foundation (P0)
- feat: Add --format flag to waza check command #24 - `waza run` command
- test: Add JSON output tests for waza check #25 - `waza init` command
- feat: Implement JSON output for waza check #26 - `waza generate` command
- fix: Standardize emoji spacing in waza check display #27 - `waza compare` command
- feat: Define JSON output structs for waza check #28 - All 8 grader types
- --discover doesn't find eval.yaml in evals/ subdirectory (inconsistent with waza init/new) #29 - Copilot SDK executor
- waza check rejects SKILL.md file path — only accepts directory #30 - Verbose mode
- waza suggest deadlocks — goroutine panic on copilot SDK stdio transport #31 - Transcript logging
- feat: Registry backend evaluation (Git/OCI/Releases/federated) #16 - JSON-RPC server for IDE integration
- Integration tests and documentation for .waza.yaml consolidation #21 - Session event logging and viewer
E2: Sensei Engine (P0)
- v0.11.0 release binary is stale — reports 'waza version 0.9.0', missing v0.10.0 fixes #32 - `waza dev` command (Sensei loop)
- feat: Migrate Squad from .ai-team/ to .squad/ (v0.5.3) #33 - Compliance scoring system
- chore(deps): Bump rollup from 4.57.1 to 4.59.0 in /web #34 - Improvement suggestions engine
- chore: bump version to 0.12.0 #35 - Target score option
- chore: Release v0.12.0 — registry and version sync #36 - Trigger accuracy tests
- fix: make release workflow resilient to enterprise token restrictions #37 - `--skip-integration` flag
- fix: make release workflow resilient to enterprise token restrictions #38 - `--fast` flag
E3: Evaluation Framework (P0)
- Images broken in dashboard page on gh-pages #39 - Multiple model execution
- fix: broken images on gh-pages due to missing base path configuration #40 - Task completion metrics
- Consolidating the keyword and regex graders into a single text grader. #41 - Trigger accuracy metrics
- azd waza run examples/code-explainer/eval.yaml fail with error #42 - Behavior quality metrics
- fix: regression test + changelog for waza suggest deadlock #43 - Trials for statistical confidence
- fix: --discover finds eval.yaml in project-root evals/{name}/ layout #44 - LLM-powered suggestions
- fix: Standardize emoji spacing in waza check display #45 - Parallel task execution
- fix: resolve dashboard image paths using Astro base URL #46 - Task filtering
E4: Token Management (P1)
- Copilot SDK usage display for waza run #47 - `waza tokens count`
- Review init code, make sure we always prompt if data needs to be overwritten #48 - `waza tokens check`
- images on docs are still broken #49 - `--strict` mode
- Fix broken dashboard-explore doc images across deployment bases #50 - `waza tokens suggest`
- Fix Docker build #51 - `waza tokens compare`
- feat: Add trigger heuristic grader #80 - BPE token counter
E5: Waza Skill (P1)
- Waza doesn't understand skills under the .github directory #52 - SKILL.md for microsoft/skills
- Working around an issue in the copilot SDK, Start() and contexts #53 - Guided requirements gathering
- Migrate all uses of copilot.Client to use execution's CopilotEngine instead #54 - Conversational readiness check
- Migrate prompt grader from raw copilot.Client to CopilotEngine #55 - Result interpretation
- fix: make site base path configurable + remove unused workflow #56 - CLI command invocation
Secondary Phase (Integration & Extensions)
E6: CI/CD Integration (P1)
- Fix config.schema.json mismatches + add parity test #57 - GitHub Actions workflow template
- Refactor waza new to use shared FileWriter #58 - CI exit codes
- Invert token limits priority: .waza.yaml first, .token-limits.json as legacy fallback #59 - GitHub PR comment reporter
- fix: discover skills under .github/skills/ directory #60 - microsoft/skills CI compatibility
- feat: add FileWriter for safe scaffold file creation #61 - Evaluation result caching
E7: AZD Extension (P2)
- chore: add MIT LICENSE file #62 - Package as AZD extension
- feat: add FileWriter service and refactor waza init inventory #48 #63 - `azd waza` commands
- feat: invert token limits priority to .waza.yaml first #59 #64 - IntelliSense metadata
- fix: align config.schema.json defaults with Go source of truth #57 #65 - azure.yaml integration (closed — shipped via extension.yaml)
E8: Getting Started Experience (P1)
- [E1] Decouple ExecutionResponse from Copilot SDK + Multi-Agent Engine Support #10 - Getting Started Experience umbrella
- #168 - Getting started documentation
- #169 - Redesign waza init
- #170 - waza new skill scaffolding
- #171 - Retrofit CLI commands for workspace awareness
- #172 - internal/workspace package
E3: Evaluation Framework — Azure ML Evaluator Integration
- #104 - Implement prompt (LLM-as-judge) grader
- #105 - Implement action_sequence grader
- #106 - Port Azure ML tool_call evaluation rubrics
- #107 - Port Azure ML task evaluation rubrics
- #108 - Create example eval YAMLs using new graders
- #109 - Document prompt and action_sequence grader types
- #138 - Multi-model evaluation with recommendation engine
E3: Multi-Skill Evaluation Support
- #142 - Wire skill_directories from eval YAML to Copilot SDK
- #143 - Add required_skills preflight validation
- #144 - Add skill_invocation grader for asserting dependent skill usage
E3: A/B Skill Impact Measurement
- #194 - `--baseline` flag for A/B skill impact comparison
E3: Extended Evaluation Features
- feat: add waza quality command — LLM-as-Judge skill quality scoring #98 - Behavior grader
- chore: add MIT LICENSE file #99 - Diff grader for workspace changes
- #184 - Retry/attempts mechanism
- #185 - Lifecycle hooks
- #186 - Template variable support
- #187 - CSV dataset support
- #188 - Result groupBy/categorization
- #189 - Custom inputs
E9: Competitive Positioning (P1)
- #195 - Multi-agent engine support (assigned: richardpark-msft)
E10: Web UI (P2)
- feat: Map OpenAI Evals YAML format → waza graders #14 - Web UI + Dashboard (competitive analysis)
- #201 - Scaffold React 19 + Vite + Tailwind CSS v4 (PR #212)
- #202 - Dashboard shell layout with DevEx-style dark theme (PR #215)
- #203 - HTTP web server for waza serve (PR #211)
- #204 - Phase 1 REST API endpoints (PR #210)
- #205 - KPI summary cards component (PR #214)
- #206 - Recent Runs sortable table (PR #216)
- #207 - Run Detail drill-down view (PR #217)
- #208 - Playwright E2E test infrastructure (PR #218)
v0.8.0: Advanced Features & MCP Integration (Shipped)
E11: MCP Server Integration (P0)
- #286 - Always-on waza serve with MCP transport
- #316 - MCP scoring validators integration
- #289 - 10 MCP tools (run, init, generate, compare, dev, tokens, serve, new, init-task, help)
E12: LLM-Powered Intelligence (P0)
- #287 - `waza suggest` command for AI-powered eval recommendations
- #309 - `--judge-model` flag for separate judge LLM configuration
E13: Advanced Skill Development (P1)
- #288 - Interactive skill for workflow orchestration
- #319 - Auto-generate trigger tests from skill triggers
- #311 - Skill profile with static token analysis
E14: Evaluation Enhancements (P1)
- #299 - Grader weighting for weighted composite scores
- #308 - Statistical confidence intervals via bootstrap
- #317 - Batch processing with `waza dev` multi-skill support
- #318 - Token budget enforcement with strict comparison mode
E15: Compliance & Validation (P2)
- #314 - agentskills.io specification compliance checks
- #315 - SkillsBench 5 advisory checks
- #312 - JUnit XML reporter for CI pipeline integration
v0.9.0: A/B Testing, Discovery & Competitive Features
E16: A/B Testing & Comparative Evaluation (P0)
- #307 - A/B baseline testing (--baseline flag)
- #310 - Pairwise LLM judging with bias mitigation
- #391 - Tool constraint assertions (expect_tools / reject_tools)
- #392 - Auto skill discovery (--discover flag)
E17: Documentation & Site (P1)
- #383 - Releases page on GitHub Pages site
- #381 - Convert ASCII diagrams to Mermaid
E18: Eval & Grader Registry (P2) — Design Complete
- #385 - Eval & Grader Registry design doc (parent epic)
- #386 - Map OpenAI Evals YAML format to waza graders
- #387 - Go-module-style grader/eval references
- #388 - Registry backend evaluation
- #389 - Composable eval construction
- #390 - Grader plugin extensibility
Summary
| Epic | Total | Done | Open | Priority |
|---|---|---|---|---|
| E1: Go CLI Foundation | 10 | 9 | 1 | P0 |
| E2: Sensei Engine | 7 | 5 | 2 | P0 |
| E3: Evaluation Framework | 23 | 23 | 0 | P0 |
| E4: Token Management | 6 | 6 | 0 | P1 |
| E5: Waza Skill | 5 | 5 | 0 | P1 |
| E6: CI/CD Integration | 5 | 5 | 0 | P1 |
| E7: AZD Extension | 4 | 4 | 0 | P2 |
| E8: Getting Started | 6 | 6 | 0 | P1 |
| E9: Competitive Positioning | 1 | 0 | 1 | P1 |
| E10: Web UI | 9 | 9 | 0 | P2 |
| E11: MCP Server | 3 | 3 | 0 | P0 |
| E12: LLM Intelligence | 2 | 2 | 0 | P0 |
| E13: Skill Development | 3 | 3 | 0 | P1 |
| E14: Evaluation Enhancements | 4 | 4 | 0 | P1 |
| E15: Compliance & Validation | 3 | 3 | 0 | P2 |
| E16: A/B Testing | 4 | 4 | 0 | P0 |
| E17: Documentation & Site | 2 | 2 | 0 | P1 |
| E18: Eval Registry | 6 | 0 | 6 | P2 |
| Total | 103 | 93 | 10 | |
Related Documents