Add CWE ingestion prompt and skill for per-domain taxonomy generation #229
Pull request overview
Adds a new Copilot Chat slash command and matching Copilot CLI skill to run a reproducible, multi-phase workflow for ingesting the official MITRE CWE XML corpus and generating per-domain PromptKit CWE taxonomy artifacts (per Issue #228’s ingestion pipeline concept).
Changes:
- Added the `/ingest-cwe-taxonomies` prompt file, which specifies a 6-phase ingestion → normalization → mapping → taxonomy generation → integration → verification workflow.
- Added a corresponding Copilot CLI skill that delegates execution to the prompt file.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `.github/skills/ingest-cwe-taxonomies/SKILL.md` | New CLI skill entrypoint that directs users/agents to run the CWE ingestion prompt. |
| `.github/prompts/ingest-cwe-taxonomies.prompt.md` | New slash-command prompt implementing the end-to-end CWE ingestion and taxonomy generation pipeline. |
Adds a reusable `/ingest-cwe-taxonomies` slash command (prompt file) and matching CLI skill that implements the CWE taxonomy ingestion pipeline from Issue microsoft#228.

The prompt guides an agent through:
- Downloading and parsing official MITRE CWE XML
- Mapping 900+ CWEs to 13 audit domains using a 4-priority algorithm
- Generating per-domain PromptKit taxonomy files
- Updating `manifest.yaml` and running sanity checks
- Producing version diffs for incremental CWE updates

Closes microsoft#228 (partially; template integration is a follow-up)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Replace hardcoded CWE XML namespace with runtime namespace detection
- Resolve CWE views by name first, use IDs as optional hints only
- Remove Java-to-.NET analogy from core view table (now requires explicit override rule if desired)
- Replace underspecified Priority 3 context analysis with a deterministic keyword-based scoring heuristic with exact term lists and thresholds
- Scope `.gitignore` pattern from `*.xml.zip` to `cwec_*.xml.zip`

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
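As an illustration of the runtime namespace detection mentioned in this commit, here is a minimal sketch using the standard library's `xml.etree.ElementTree`. The helper names, the sample document, and its namespace URI are made up for the example; the real script may detect the namespace differently.

```python
import xml.etree.ElementTree as ET


def detect_namespace(root):
    """Return the namespace URI of the root element, or '' if it has none."""
    # ElementTree stores qualified tags as '{uri}localname'.
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


def find_weaknesses(root):
    """Find all Weakness elements using whatever namespace the file declares."""
    ns = detect_namespace(root)
    prefix = f"{{{ns}}}" if ns else ""
    return root.findall(f".//{prefix}Weakness")


# Hypothetical, heavily trimmed stand-in for a CWE catalog file.
SAMPLE = """<Weakness_Catalog xmlns="http://example.invalid/cwe">
  <Weaknesses>
    <Weakness ID="79" Name="Example A"/>
    <Weakness ID="89" Name="Example B"/>
  </Weaknesses>
</Weakness_Catalog>"""

root = ET.fromstring(SAMPLE)
ids = [w.get("ID") for w in find_weaknesses(root)]
print(detect_namespace(root), ids)
```

Reading the namespace from the root tag means the same code keeps working if a future CWE release changes the schema namespace.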
- Replace 'header comment or frontmatter' with provenance metadata guidance: top-level `_meta` object for JSON, comment/frontmatter for Markdown/YAML
- Drop Rarely/Undetermined prevalence as secondary/optional entries; simply ignore them for domain assignment (deterministic, no schema ambiguity)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
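A rough sketch of the two rules in this commit, assuming a simple dict-based representation: the field names inside `_meta`, the `entries` key, and the platform record layout are all hypothetical and not taken from the actual script.

```python
import json

# Prevalence values that are ignored for domain assignment, per the commit message.
IGNORED_PREVALENCE = {"Rarely", "Undetermined"}


def platforms_for_assignment(platforms):
    """Keep platform names whose prevalence should influence domain mapping.

    Records without a 'prevalence' key are kept (an assumption of this sketch).
    """
    return [p["name"] for p in platforms if p.get("prevalence") not in IGNORED_PREVALENCE]


def with_provenance(entries, cwe_version, source):
    """Wrap entries in a JSON document with a top-level _meta provenance object."""
    return {
        "_meta": {
            "cwe_version": cwe_version,
            "source": source,
            "entry_count": len(entries),
        },
        "entries": entries,
    }


# Illustrative input only; not real CWE data.
platforms = [
    {"name": "C", "prevalence": "Often"},
    {"name": "Java", "prevalence": "Undetermined"},
    {"name": "Python", "prevalence": "Rarely"},
]
doc = with_provenance(
    [{"id": "0", "platforms": platforms_for_assignment(platforms)}],
    cwe_version="X.Y",
    source="cwec_latest.xml.zip",
)
print(json.dumps(doc, indent=2))
```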
Phase 2 now instructs the agent to run the existing script first, with the full parsing/mapping specification preserved as a reference for validation and regeneration. Phase 3 is reframed as algorithm documentation rather than imperative instructions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ad0d505 to 4714930
description: 'Ingest CWE XML from MITRE and generate per-domain security audit taxonomies for PromptKit'
agent: 'agent'
tools: ['search/codebase', 'edit']
argument-hint: 'Path to CWE XML file, or "latest" to download cwec_latest.xml.zip from MITRE'
The frontmatter limits tools to `['search/codebase', 'edit']`, but the workflow supports `latest` (download from MITRE) and instructs the agent to run `python scripts/ingest-cwe.py` and extract a zip. With these tools, an agent typically can't fetch remote content or execute shell commands, so the `latest` path (and the "run the script" steps) may be non-actionable. Consider either (a) adding the appropriate toolset(s) (e.g., web/fetch) and explicitly describing how to handle zip extraction and execution, or (b) removing or softening the `latest` automation and requiring the user to provide a local XML path and run the commands themselves.
2. **Update `.gitignore`.** Add an entry to exclude the raw downloaded
   CWE XML file (but NOT the `data/cwe/` processed artifacts):

   ```
   # Raw CWE XML downloads (large, available from MITRE)
   cwec_*.xml.zip
   cwec_*.xml
   ```
The `.gitignore` instruction says to "add" `cwec_*.xml.zip` / `cwec_*.xml`, but this repo already has a broad `*.xml.zip` ignore (see `.gitignore:8`). A user who follows these steps will likely end up with both patterns, so all `*.xml.zip` artifacts will still be ignored. Consider updating the instruction to replace any existing `*.xml.zip` rule with the CWE-scoped patterns, so unrelated XML zip files are not ignored unintentionally.
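To make the suggested "replace rather than add" behavior concrete, here is a minimal sketch that rewrites `.gitignore` text. The function name is hypothetical, and it only handles an exact `*.xml.zip` line; negations or path-qualified variants would need more care.

```python
BROAD_RULE = "*.xml.zip"
SCOPED_RULES = ["cwec_*.xml.zip", "cwec_*.xml"]


def scope_xml_zip_rule(gitignore_text):
    """Drop the broad '*.xml.zip' rule and ensure the CWE-scoped rules exist."""
    lines = gitignore_text.splitlines()
    # Remove only lines that are exactly the broad pattern.
    kept = [ln for ln in lines if ln.strip() != BROAD_RULE]
    present = {ln.strip() for ln in kept}
    # Append scoped rules that are not already present, preserving order.
    kept.extend(rule for rule in SCOPED_RULES if rule not in present)
    return "\n".join(kept) + "\n"


before = "node_modules/\n*.xml.zip\n"
after = scope_xml_zip_rule(before)
print(after)
```

Running the function a second time leaves the text unchanged, so the step is safe to repeat on re-runs.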
- **CWE source**: A local path to a CWE XML file, or `latest` to download
  the current release from `https://cwe.mitre.org/data/xml/cwec_latest.xml.zip`.
- Any overrides to the domain registry or mapping rules the user specifies.

## Output

- Per-domain taxonomy files at `taxonomies/cwe-<domain>.md` (13 domains)
- Normalized CWE data at `data/cwe/<version>/`
- Updated `manifest.yaml` with new taxonomy entries
- Reusable ingestion script at `scripts/ingest-cwe.py`
- Diff report if a previous CWE version exists
This skill advertises a `latest` mode that downloads from cwe.mitre.org and produces multiple repo modifications (new/updated taxonomy files, `data/cwe/...`, `manifest.yaml`, and `scripts/ingest-cwe.py`). It would help to explicitly call out the required permissions and capabilities (network access for the download, filesystem write access, and the ability to run Python) so users aren't surprised when invoking the skill in constrained environments.
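One way the skill could surface these requirements up front is a small preflight helper. The sketch below is an assumption about how that might look, not something in the PR: the function names are hypothetical, and it deliberately does not probe the network, only reports that network access is needed for `latest`.

```python
import os
import shutil
import sys


def required_capabilities(source):
    """List the capabilities a run needs, given the CWE source argument."""
    caps = ["filesystem write access", "ability to run Python"]
    if source.strip().lower() == "latest":
        # Only the 'latest' mode downloads from MITRE.
        caps.insert(0, "network access to cwe.mitre.org")
    return caps


def local_problems(repo_root):
    """Return locally checkable problems; an empty list means none were found."""
    problems = []
    if not os.access(repo_root, os.W_OK):
        problems.append(f"no write access to {repo_root}")
    if not (sys.executable or shutil.which("python3") or shutil.which("python")):
        problems.append("no Python interpreter found")
    return problems


print(required_capabilities("latest"))
print(local_problems(os.getcwd()))
```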
Summary

Adds a reusable `/ingest-cwe-taxonomies` slash command (prompt file) and matching CLI skill that implements the CWE taxonomy ingestion pipeline from #228.

Files Added

- `.github/prompts/ingest-cwe-taxonomies.prompt.md` (`/ingest-cwe-taxonomies`)
- `.github/skills/ingest-cwe-taxonomies/SKILL.md`

What the Prompt Does

When invoked, the prompt guides an agent through a 6-phase pipeline:

- … `scripts/ingest-cwe.py` to parse XML into structured JSON
- … `Applicable_Platforms` field matching
- … `taxonomies/cwe-<domain>.md` files in PromptKit format
- … `manifest.yaml`, generates version diffs

Domain Registry (13 domains)

`kernel-mode-c-cpp`, `native-user-mode-c-cpp`, `managed-dotnet`, `web-js-ts`, `web-backend`, `cloud-service`, `iac`, `firmware-embedded`, `crypto-protocols`, `data-processing`, `cli-tools`, `mobile-app`, `container-k8s`

Components Assembled

- `security-auditor` (CWE/vulnerability expertise)
- `anti-hallucination` + `self-verification` + `operational-constraints` (all verbatim)
- `multi-artifact` (adapted for taxonomy + metadata output)

Re-run Workflow

Download new CWE XML → invoke `/ingest-cwe-taxonomies` → review diff report → commit.

Testing

- `python tests/validate-manifest.py` passes ✅
- … (`�oot.prompt.md`, `promptkit/SKILL.md`)

Partially addresses #228. Template integration (`domain` parameter in security audit templates) is tracked as a follow-up.
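For the diff report reviewed in the re-run workflow, a version comparison could look roughly like the sketch below. It assumes each CWE version has been normalized to a mapping from numeric ID strings to names; that layout, the function name, and the sample entries are illustrative only.

```python
def diff_versions(old, new):
    """Compare two {cwe_id: name} mappings and report added, removed, and renamed IDs."""
    old_ids, new_ids = set(old), set(new)
    return {
        "added": sorted(new_ids - old_ids, key=int),
        "removed": sorted(old_ids - new_ids, key=int),
        # IDs present in both versions whose name changed.
        "renamed": sorted((i for i in old_ids & new_ids if old[i] != new[i]), key=int),
    }


# Made-up entries for illustration; not real CWE names.
previous = {"10": "Alpha", "20": "Beta", "30": "Gamma"}
current = {"10": "Alpha", "20": "Beta (revised)", "100": "Delta"}

report = diff_versions(previous, current)
print(report)
```

Sorting by integer value keeps the report in CWE ID order rather than lexical string order.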