Compile unstructured knowledge into validated, installable AI agent swarms.
Translating loose knowledge (notes, API docs, specs, scratch files) into structured AI agent configurations is manual, error-prone, and non-reproducible. The resulting agent scaffolds drift from source material, contain unstated assumptions, break when switching model providers, and have no evidence trail proving correctness.
SwarmMaker automates this as a two-stage compiler:
graph LR
A["Loose Docs<br/>notes, specs,<br/>API docs, OpenAPI"] -->|Stage 1<br/>LLM-backed| B[".tasks/ ledger<br/>9 ledger files<br/>7 IR artifacts<br/>evidence.json"]
B --> C{"Validation<br/>Programmatic<br/>Pre-screen<br/>Adversarial review<br/>Revision (max 3)<br/>Parity check"}
C -->|pass| D["Stage 2: Render<br/>.claude/ .codex/ .gemini/<br/>.agents/skills/ + mcp_tool.json<br/>custom platforms"]
C -->|fail| E["validation-report.md"]
Stage 1 uses LLM calls to decompose source material into a shared .tasks/ ledger with per-claim citations. Stage 2 is a deterministic, LLM-free render from that ledger into platform-specific output trees. The expensive work happens once; rendering to N targets is a transform.
| Phase | LLM Calls | Input | Output | Failure Mode |
|---|---|---|---|---|
| 1. Ingest | 0 | Source folder | Evidence manifest, complexity analysis, OpenAPI parsing | Missing/unreadable files recorded, not hidden; OpenAPI specs parsed into structured endpoints; basic sanity gate rejects empty dirs |
| 2. IR Emit | 0 | Ingestion output + routing | 7 JSON artifacts under .tasks/ir/ |
Contract validation rejects malformed schemas |
| 2.5 Pre-flight | 1 | Summary + complexity metrics | SUFFICIENT/INSUFFICIENT verdict | Rejects material too thin for skill decomposition (~$0.01) |
| 3. Generate | 9 (two-phase) | Compiled prompts + source | 9 .tasks/ ledger files |
Phase A (context+tasks) then Phase B (7 dependent files with ledger context); per-task retry with backoff |
| 4. Validate | 1-10 | Generated ledger | Validation report | Multi-round: programmatic → pre-screen → adversarial review → revision (up to 3 rounds) → post-screen |
| 5. Render | 0 | Validated ledger | Platform trees + MCP tool defs + README + installer | Atomic staged write; MCP tool JSON per skill; parity check across targets; optional custom platform output |
The pipeline starts by walking the input folder and recording evidence for every file decision (read, skipped as binary, hidden, oversized, noise directory, symlink, or unreadable). It detects source code files and infers their language and purpose for tool integration. OpenAPI and Swagger specs (.yaml, .yml, .json) are detected automatically and parsed into structured endpoint summaries with methods, parameters, and schemas — giving the LLM accurate API context instead of raw spec text. It then scans PATH for installed LLM CLIs (claude, codex, gemini, ollama) and probes their capabilities and versions. The routing module assigns generator, critic, and renderer roles based on user flags and available providers, logging any fallback (e.g., same-model critique when only one provider is installed).
The pipeline then emits an Intermediate Representation (IR), a set of seven versioned JSON artifacts written to .tasks/ir/. The IR captures everything the pipeline knows before any LLM generation begins. It serves three purposes: it gives every downstream prompt a consistent, typed view of the project; it provides a complete audit trail of the decisions made during ingestion and routing; and it allows any step in the pipeline to be reproduced or debugged independently. The seven artifacts are:
| Artifact | What it captures |
|---|---|
product-definition.json |
Project name, target output formats, generator/critic/renderer providers |
source-ir.json |
Every ingested file with path, type, size, and content digest |
provider-capabilities.json |
Discovered LLM CLIs, their versions, and supported capabilities |
routing-decision.json |
Which provider was assigned to each role and any fallback events |
output-tree-spec.json |
The target platform tree structure (paths, required files, metadata) |
tool-synthesis-request.json |
Whether helper tools are needed, which languages were detected, and what evidence supports the decision |
prompt-ir.json |
The prompt metadata passed to every LLM call, with source material redacted for secrets |
Each artifact is validated against its schema contract before being written. A SHA-256 digest is computed per artifact and recorded in a manifest (ir/manifest.json) so downstream consumers can verify integrity.
Before generation, a pre-flight LLM call (~$0.01) evaluates whether the source material is rich enough to decompose into skills. If insufficient, the run exits with a specific explanation of what is missing, saving 9+ expensive generation calls.
Generation runs in two phases. Phase A produces the foundational files (context.md, tasks.md) first. Phase B then generates the remaining 7 files with a summary of Phase A output injected into their prompts, ensuring cross-file consistency from the start rather than catching contradictions only in review. Tasks within each phase run concurrently (or serially for same-provider) with round-robin assignment across available LLMs.
After generation, the validation pipeline runs (see below). After each revision round, a citation path repair step fixes near-miss path hallucinations where the LLM mangled a directory path but kept the filename correct. On success, the renderer compiles the validated ledger into platform-specific output trees plus the cross-platform .agents/skills/ standard path.
graph TD
SRC["Source Folder<br/>docs, code, OpenAPI specs"] --> INGEST["Ingest + Discover<br/>Walk files, parse OpenAPI,<br/>scan PATH for LLM CLIs"]
INGEST --> SANITY{"Sanity Check"}
SANITY -->|empty dir| REJECT["Reject"]
SANITY -->|ok| IR["IR Emit + Route<br/>7 JSON artifacts,<br/>assign generator/critic roles"]
IR --> PREFLIGHT{"Pre-flight<br/>LLM Validation"}
PREFLIGHT -->|insufficient| REJECT
PREFLIGHT -->|sufficient| PHASE_A["Phase A: Generate<br/>context.md + tasks.md"]
PHASE_A --> PHASE_B["Phase B: Generate<br/>7 dependent files<br/>with ledger context"]
PHASE_B --> VALIDATE{"Validate<br/>Programmatic checks<br/>Pre-screen heuristics<br/>Adversarial LLM review"}
VALIDATE -->|revise| REPAIR["Citation Repair +<br/>Targeted Revision"]
REPAIR --> VALIDATE
VALIDATE -->|pass| RENDER["Render<br/>.claude/ .codex/ .gemini/<br/>.agents/skills/ + mcp_tool.json<br/>custom platforms"]
VALIDATE -->|fail| REPORT["validation-report.md<br/>+ cost breakdown"]
Generated agents follow the OODA loop (Observe-Orient-Decide-Act):
| Role | Responsibility | Example |
|---|---|---|
| Observe | Gather and normalize input, preserve evidence | Alert ingestion, file inventory |
| Orient | Analyze, correlate, decompose into structure | Alert correlation, dependency mapping |
| Decide | Apply rules, validate constraints, choose paths | Priority classification, routing decisions |
| Act | Execute workflows, produce outputs | Runbook generation, notification delivery |
Multiple agents may share an OODA role when the domain requires distinct execution concerns within a phase. The agent count is the minimum required to cover all source-backed responsibilities.
These are not guidelines. They are enforced by the pipeline and tested:
| Invariant | Enforcement |
|---|---|
| No silent defaults | Missing required facts become UNKNOWN; dependent decisions are blocked |
| No hidden fallbacks | Every fallback is counted, recorded in evidence, and visible in the validation report |
| No fabrication | Pre-screen detects fabrication patterns; adversarial review checks source fidelity |
| No partial output | Atomic staged writes; incomplete runs leave no artifacts in the output directory |
| No untracked provider routing | Routing decisions are persisted as machine-readable JSON with fallback accounting |
| No success without evidence | Validation report is mandatory on both success and failure paths |
| No cross-target drift | Parity validation checks skill/agent/metadata consistency across all selected platforms |
| No stale citations | Programmatic link checker + template leak detector run before and after revision |
Measured on a real 4-file input (10KB source material) producing a codex skill bundle:
| Metric | Claude CLI | Codex CLI | Mixed (Claude gen + Codex critic) |
|---|---|---|---|
| Generation (9 tasks) | ~18 min (serial) | ~18 min (serial) | ~7 min (concurrent) |
| Adversarial review | ~2 min | ~2 min | N/A (uses critic) |
| Revision (per file) | ~2 min | ~2 min | N/A (uses generator) |
| Total (with revision) | ~35 min | ~35 min | ~25 min |
| Output size (skills.md) | ~62 KB | ~88 KB | Varies by generator |
| Skill count | ~11 | ~10 | Varies by generator |
Codex uses model_reasoning_effort=medium to avoid multi-minute agentic loops. Claude uses -p for direct prompt-to-response. Both produce operational-depth skills with numbered process steps, inline schemas, and Required/Prohibited constraints. The validation report includes a per-task cost breakdown table covering every LLM call in the pipeline.
Scaling: generation time is O(N) in task count (currently fixed at 9). Prompt size is O(S) in source material size. Revision rounds are bounded at 3 with regression detection.
The full production workflow is: generate → validate → iterate.
# 1. Generate the skill bundle
swarm-maker --input ./notes --model codex --output-swarm claude -o ./SKILL
# 2. Validate skills are discoverable and triggerable
swarm-maker validate --bundle ./SKILL --target claude
# 3. Iterate on individual skills without re-running the full pipeline
swarm-maker regen --skill hash-hunt --input ./notes --model codex -o ./SKILL
# 4. Re-validate after changes
swarm-maker validate --bundle ./SKILL --target claude| Approach | SwarmMaker | LangChain/CrewAI/AutoGen | Manual prompt engineering |
|---|---|---|---|
| When it runs | Build time (offline) | Runtime (online) | Human time |
| Output | Static reviewable files | Running processes | Prompts in code |
| Provider lock-in | None (renders to any target) | Framework-specific | Provider-specific |
| Validation | Adversarial + programmatic | Unit tests on chains | Manual review |
| Evidence trail | Full (evidence.json, IR, report) | Logs | None |
| Reproducibility | Same input → same structure | Non-deterministic | Depends on author |
| Source fidelity | Per-claim citations required | No citation contract | No citation contract |
SwarmMaker does not replace runtime agent frameworks. It produces the knowledge artifacts those frameworks consume.
Point SwarmMaker at a repository and it generates skills that know how to use the tools in that codebase. The pipeline detects source files, infers languages and entry points, and produces skills with exact invocation commands.
Example: A Python threat hunting toolkit with hash2processarg.py, multikeyword_search.py, and an AMP API client:
swarm-maker --input ./amp-toolkit --model codex --output-swarm claude -o ./SKILLGenerated skill hash-ioc-process-arguments includes:
### Process
1. Validate every supplied line as SHA256 using the pattern ^[a-fA-F0-9]{64}$.
Source: [amp_client/utils/validators.py](...)
2. Run `python3 hash2processarg.py -c config.txt hashset/windows-binaries/cmd.exe.txt`
for single-source mode or `python3 hash2processarg.py -c config.txt hashes.txt --csv output.csv`
for batch mode. Source: [readme.md](...)
3. CHECKPOINT: capture every returned execution record with both the process SHA256
and child SHA256. If no child SHA256 is returned, store UNKNOWN explicitly.The agent can now execute the actual tool with the actual flags from the actual source. No invented commands.
A security team has runbooks, hash databases, keyword files, API clients, and an OpenAPI spec for their endpoint protection platform. SwarmMaker compiles these into validated skills covering the full hunting workflow.
Example: From a folder containing AMP client code, SHA256 hash sets, IOC keyword files, and an API spec, SwarmMaker generated 10 skills:
| Skill | What the agent can do | MCP Params |
|---|---|---|
credential-theft-hunt |
Combine tool hashes + LSASS keywords to detect credential dumping | 4 |
keyword-ioc-sweep |
Multi-type IOC search (hash, IP, string) with type classification | 6 |
lateral-movement-detection |
Detect lateral movement via network + process correlation | 6 |
network-connection-hunting |
Hunt malicious connections by hash or URL pattern | 6 |
persistence-mechanism-hunt |
Detect registry, service, and scheduled task persistence | 3 |
timeline-analysis |
Reconstruct per-endpoint event timelines | 5 |
hash-ioc-process-arguments |
Map SHA256 hashes to process execution records with CLI args | 3 |
event-extraction-by-type |
Extract specific AMP event types for analysis | 4 |
statistics-anomaly-reporting |
Statistical anomaly detection across endpoint fleet | 3 |
vulnerability-triage |
Triage CVE exposure across monitored endpoints | 3 |
Each skill includes exact CLI commands from the source (python3 hash2processarg.py -c config.txt hashset/hacking-tools/mimikatz.txt --csv mimikatz_hits.csv), MITRE ATT&CK technique mappings, and an MCP tool definition with typed input parameters.
Feed SwarmMaker your API documentation (REST endpoints, SDKs, OpenAPI specs) and it produces agents that know how to call your APIs with correct parameters, authentication, and error handling.
Example: An internal platform with a REST API, Python SDK, and Swagger spec:
# API docs + SDK source + OpenAPI spec in one folder
swarm-maker --input ./platform-docs --model claude --output-swarm codex -o ./SKILLSwarmMaker auto-detects the OpenAPI spec and parses it into structured endpoints:
### GET /v1/computers
List monitored endpoints
**Parameters**: `limit` (query) [integer], `hostname` (query) [string]
### GET /v1/events
Get security events
**Parameters**: `start_date` (query) [string] *required*, `event_type` (query) [integer]
### POST /v1/file_lists/application_blocking
Add SHA256 to blocking list
**Parameters**: `sha256` (query) [string] *required*The generated skills reference these endpoints with correct parameter types instead of hallucinating API contracts. The MCP tool definition includes a JSON Schema matching the API parameters, so MCP-compatible tool servers can call the API directly.
A team has accumulated institutional knowledge across dozens of documents. New team members take weeks to absorb it. SwarmMaker compiles the documentation into structured skill files that an AI assistant can load immediately.
Example: An SRE team with runbooks, architecture diagrams, incident response procedures, and monitoring tool configs:
swarm-maker --input ./sre-knowledge --model codex --output-swarm all -o ./SKILLThe output installs into Claude, Codex, and Gemini simultaneously. A new engineer asks Claude "how do I investigate a disk pressure alert on the payment cluster?" and the agent responds with the exact procedure from the team's runbook, citing the source document, not generating a generic answer.
An organization uses Claude for engineering, Codex for data teams, and Gemini for operations. SwarmMaker compiles once and renders to all three from the same intermediate representation.
# One compilation, three platform outputs
swarm-maker --input ./docs --model codex --output-swarm all -o ./SKILL
# Output:
# ./SKILL/.claude/skills/alert-triage/SKILL.md
# ./SKILL/.codex/instructions/alert-triage.md
# ./SKILL/.gemini/playbooks/alert-triage.md
# ./SKILL/.agents/skills/alert-triage/SKILL.md (cross-platform)
# ./SKILL/.agents/skills/alert-triage/mcp_tool.json (MCP definition)Render parity validation ensures all three platform trees contain identical skills and metadata. If a skill exists in .claude/ but is missing from .codex/, the pipeline fails.
After generating a full bundle (~25 min), a domain expert reviews the output and identifies one skill that needs work. Instead of re-running the full pipeline, they target that single skill:
# Regenerate one skill (2 min instead of 25 min)
swarm-maker regen --skill credential-theft-hunt --input ./docs --model codex -o ./SKILL
# Verify it triggers correctly
swarm-maker validate --bundle ./SKILL --target claude
# Result: 10/10 skills validatedThe regen command atomically updates the SKILL.md, mcp_tool.json, all platform-specific trees, references, and the source ledger. Sibling skills are untouched.
Regulated industries need provenance for automated decisions. SwarmMaker provides a complete evidence chain:
Evidence chain for any agent instruction:
Skill process step → Source: citation → Original document
│
├── .tasks/evidence.json (every ingestion decision logged)
├── .tasks/ir/manifest.json (SHA-256 digests for all IR artifacts)
├── .tasks/validation-report.md (what was checked, what passed/failed)
└── Per-task cost breakdown (every LLM call with tokens and USD)
When an auditor asks "why does this agent classify alerts using P0-P3 rules?", the answer traces through: skill process step 3 → Source: [ops-runbook.md section 2.1] → the actual text in the original document.
A team uses an internal framework with its own skill format. They define a YAML spec:
platform: internal-orchestrator
skill_path: "workflows/{slug}/config.md"
frontmatter: true
frontmatter_fields: [name, description, version]
sections: [summary, process, constraints]swarm-maker --input ./docs --model codex --output-swarm claude,custom:internal.yaml -o ./SKILL
# Output includes standard Claude tree AND custom platform:
# ./SKILL/workflows/alert-triage/config.md
# ./SKILL/workflows/hash-hunt/config.md
# ...No code changes. No rebuild. The renderer reads the spec and produces the custom output tree alongside standard platforms.
make build
# Binary at ./build/swarm-makerInstall to ~/.local/bin:
make installAt least one LLM CLI:
Check availability: swarm-maker discover
swarm-maker --input <dir> --model <provider> --output-swarm <format> [flags]
| Flag | Required | Description |
|---|---|---|
--input <dir> |
Yes | Source documentation folder |
--model <provider> |
Yes | Generator LLM: codex, claude, gemini, or ollama |
--output-swarm <format> |
Yes | Target(s): claude, codex, gemini, all, comma-separated, or custom:<spec.yaml> |
-o, --output-folder <dir> |
No | Output folder (default: .) |
--critique <provider> |
No | Critic LLM (auto-detected if omitted) |
-n, --name <name> |
No | Project name (derived from folder if omitted) |
--model-primary <model> |
No | Specific model override for generator |
--model-critic <model> |
No | Specific model override for critic |
--prompt-pack <path> |
No | Custom prompt pack JSON |
--dry-run |
No | Preview without LLM calls |
--force |
No | Overwrite existing output |
-v, --verbose |
No | Show full LLM interactions |
# Claude generates, codex reviews, output as codex skill bundle
swarm-maker --input ./notes --model claude --critique codex --output-swarm codex -o ./SKILL
# All platforms from one run
swarm-maker --input ./notes --model claude --output-swarm all -o ./SKILL
# Local LLM via Ollama (offline, free)
swarm-maker --input ./notes --model ollama --output-swarm claude -o ./SKILL
# Regenerate a single skill without re-running the full pipeline (~18 min saved)
swarm-maker regen --skill hunt-hashes --input ./notes --model codex -o ./SKILL
# Custom output platform via YAML spec
swarm-maker --input ./notes --model claude --output-swarm claude,custom:my-platform.yaml -o ./SKILL
# Validate generated skills against a live LLM
swarm-maker validate --bundle ./SKILL --target claude
# Custom prompt pack
swarm-maker prompt-pack export -o ./pack.json # export, edit, then:
swarm-maker --input ./notes --model claude --output-swarm codex --prompt-pack ./pack.json -o ./SKILL| Command | Description |
|---|---|
swarm-maker discover |
Discover available LLM CLI tools on your system |
swarm-maker regen --skill <slug> --input <dir> --model <provider> |
Regenerate a single skill without re-running the full pipeline |
swarm-maker validate --bundle <dir> --target <provider> |
Validate skill discoverability and triggerability against a live LLM CLI |
swarm-maker prompt-pack export -o <file> |
Export the default prompt pack for customization |
swarm-maker version |
Print swarm-maker version |
The regen subcommand re-generates a single skill by slug without re-running the full pipeline. It reads the existing .tasks/ ledger, compiles a focused prompt for the target skill (injecting sibling skill slugs as context), runs one LLM call, and atomically updates all artifacts for that skill:
.agents/skills/<slug>/SKILL.md— cross-platform skill definition.agents/skills/<slug>/mcp_tool.json— MCP tool definition with updated input schema.agents/skills/<slug>/references/— progressive disclosure split (if applicable).claude/skills/<slug>/SKILL.md,.codex/instructions/<slug>.md,.gemini/playbooks/<slug>.md— platform-specific trees.tasks/skills.md— ledger source of truth (skill block replaced in-place)
This saves ~18 minutes per iteration compared to a full pipeline run. Sibling skills are preserved unchanged.
swarm-maker regen --skill hunt-hashes --input ./notes --model codex -o ./SKILL<output>/
.tasks/ # Stage 1: shared build ledger
context.md # Source context with per-claim citations
tasks.md # Task decomposition from source goals
skills.md # Skill definitions (renderer input)
agents.md # Agent definitions (renderer input)
todo.md # Delivery queue with OODA phases
prompts/{product,technical, # Domain-specific compiled prompts
tools,deployment}.md
ir/ # 7 versioned JSON artifacts
evidence.json # Ingestion + generation evidence
manifest.json # Build manifest with digests
validation-report.md # Full PASS/FAIL report + cost breakdown
.agents/ # Cross-platform standard path
skills/
<skill-slug>/
SKILL.md # Frontmatter + body (platform-agnostic)
mcp_tool.json # MCP-compatible tool definition (JSON Schema)
.claude/ # Platform-specific: Claude
SKILL.md # Skill router
README.md # Skill bundle readme
skills/<skill-slug>/SKILL.md # Per-skill instruction files
.codex/ # Platform-specific: Codex
AGENTS.md # Agent router with OODA roles
README.md # Skill bundle readme
instructions/ # Per-skill instruction files
index.md
<skill-slug>.md ...
.gemini/ # Platform-specific: Gemini
GEMINI.md # Playbook router
README.md # Skill bundle readme
playbooks/ # Per-skill playbook files
index.md
<skill-slug>.md ...
plugins/{slug}/prompt.md ... # Custom platform output (if --output-swarm custom:spec.yaml)
README.md # Bundle readme
REVIEW_CHECKLIST.md # Human review checklist before deploying
install.sh # Installer (--target, --global)
.gitignore # Excludes debug artifacts from git
.agents/skills/ is the cross-platform standard path. Every skill is emitted here in addition to the platform-specific tree, using YAML frontmatter (name, description) for discovery by skill loaders. Each skill also gets an mcp_tool.json file containing an MCP-compatible tool definition. The input schema is generated by the LLM as a fenced JSON Schema block (mandated by the skill compiler contract's "MCP Input Schema" section), not regex-parsed from prose — ensuring accurate parameter types, descriptions, and required fields.
During ingestion, OpenAPI and Swagger specs (.yaml, .yml, .json files containing openapi: or swagger: keys) are detected automatically and parsed into structured endpoint summaries with methods, parameters, and schemas. This gives the LLM accurate API context instead of raw spec text, producing skills with correct endpoint references and parameter types.
In addition to the built-in Claude, Codex, and Gemini output trees, custom output platforms can be defined via YAML config files:
platform: my-agent-framework
skill_path: "plugins/{slug}/prompt.md"
frontmatter: true
frontmatter_fields: [name, description, version]
sections: [summary, process, constraints]Use with --output-swarm claude,custom:my-platform.yaml. Custom output is supplemental — at least one standard format is still required.
Every generated ledger passes through six validation layers before output is written. No layer can be skipped.
Source material is scanned for prompt injection patterns (e.g., "ignore previous instructions") during ingestion. Detected patterns are recorded as evidence events and flagged in the validation report for human review; content is never silently modified.
The pipeline starts with zero-LLM-cost programmatic checks: file existence, minimum sizes, markdown link integrity, template leak detection (16 known patterns that LLMs might copy from prompt instructions), and meta-commentary filtering (rejecting outputs that describe what they did instead of producing the artifact).
Next, a pre-screen gate runs depth-adaptive heuristics. Shallow sources get lenient citation checks. Deep sources (like our alert triage example with 33 sections) require higher citation density using a sub-linear formula, dimension coverage verification, and amplification ratio checks. Fabrication patterns and boilerplate injection are checked regardless of depth. The pre-screen produces per-file flags classified as concrete or advisory:
- Concrete flags (e.g., "low citation density: 24 citations in 20K chars, expect 25") are specific, measurable violations. They block the PASS verdict until resolved through revision.
- Advisory flags (e.g., "excessive ALL-CAPS: 22 non-standard uppercase words") are style and quality signals. They are reported in the validation report for human review but do not block the PASS verdict. This prevents the pipeline from failing on cosmetic issues after substantive problems have been fixed.
If concrete flags exist, they are forwarded to the adversarial LLM review, a separate call to the critic provider that evaluates cross-file consistency, source fidelity, coverage gaps, and UNKNOWN gate enforcement. The reviewer returns APPROVE or REVISE with per-file findings. Critically, concrete pre-screen findings block approval even if the reviewer says APPROVE because the programmatic layer has veto power over the LLM.
When the verdict is REVISE, only flagged files are regenerated in targeted revision rounds. After each round, a post-revision re-screen checks whether the revision improved things. If the flag count decreased, another round runs (up to 3 total). If the count didn't decrease (regression), the loop stops immediately to avoid wasting LLM calls on revisions that aren't helping.
Finally, when multiple output formats are selected, a render parity check verifies that all platform trees contain the same skills, agent roles, metadata, and source references. Drift between platforms is a hard failure.
The validation report at .tasks/validation-report.md is written on both success and failure paths. It includes:
- Cost Breakdown: A per-task table showing input tokens, output tokens, and estimated USD for every LLM call — preflight, all 9 generation tasks, adversarial review, and each revision round. Token counts are propagated from the executor (not re-estimated), and the total row is the verified sum of per-task entries.
- Risk Analysis: Counts total process steps across all skills and computes compound reliability estimates at 95% and 99% per-step accuracy, surfacing compounding error risk for long pipelines.
If the report file cannot be written, it is dumped to stderr as a last resort.
graph TD
LEDGER["Generated Ledger"] --> PROG["Programmatic Checks<br/>Links, leaks, meta-commentary"]
PROG --> PRE["Pre-Screen<br/>Citations, fabrication,<br/>boilerplate, depth-adaptive"]
PRE --> REVIEW["Adversarial LLM Review<br/>Cross-file consistency,<br/>source fidelity"]
REVIEW -->|approve| PARITY
REVIEW -->|revise| REVISE["Targeted Revision<br/>+ Citation Path Repair"]
REVISE --> RECHECK["Re-Screen<br/>Stops if no improvement<br/>Max 3 rounds"]
RECHECK -->|concrete flags remain| REVISE
RECHECK -->|clear| PARITY
PARITY["Render Parity Check"] --> RESULT["PASS / FAIL"]
After the pipeline renders output, swarm-maker validate smoke-tests the installed skill bundle against a live LLM CLI to verify that skills are both discoverable and triggerable.
# Validate against Claude
swarm-maker validate --bundle ./SKILL --target claude
# Validate against Codex
swarm-maker validate --bundle ./SKILL --target codexThe validate command runs two phases:
Phase 1: Skill Discovery. All skill slugs, names, and descriptions are loaded from .agents/skills/*/SKILL.md frontmatter and injected into a single discovery prompt. The LLM is asked to list every installed skill. Each skill that appears in the response is marked FOUND.
Phase 2: Skill Triggering. For each skill, the full skill catalog is injected alongside a use-case prompt derived from the skill's description (e.g., "I need to hunt for credential dumping activity..."). The LLM must respond with the correct skill slug. This tests whether the skill's description is specific enough to be selected over other skills when a matching request arrives.
Skill Validation Report
[PASS] credential-theft-hunt: FOUND in skill list, TRIGGERED on test prompt
[PASS] hash-ioc-process-arguments: FOUND in skill list, TRIGGERED on test prompt
[PASS] keyword-ioc-sweep: FOUND in skill list, TRIGGERED on test prompt
...
Result: 10/10 skills validated
Validation requires a working LLM CLI. A skill that fails triggering indicates its frontmatter description is too generic or too similar to another skill's description — the skill content may be correct but the routing metadata needs refinement.
make build # Compile to ./build/swarm-maker
make test # All tests with -race
make fmt # gofmt
make lint # golangci-lint
make all # fmt + lint + test + build
make release # Cross-compile (linux/darwin/windows, amd64/arm64)Source code lives in src/swarmmaker/. The root Makefile delegates all Go commands there.
GoReleaser builds for linux/darwin/windows on amd64/arm64. Version injected via -X main.version={{.Version}}.
The pipeline validates source material in two stages before running the 9-task generation swarm:
Stage 1: Basic sanity check (zero LLM cost). The pipeline requires at least 1 readable text file with non-empty content. Empty directories are rejected immediately without any LLM calls.
Stage 2: Pre-flight source validation (1 LLM call, ~$0.01). After ingestion, discovery, and routing complete, one short LLM call evaluates whether the source material contains enough domain concepts, requirements, constraints, or data structures to decompose into at least one agent skill with concrete process steps. If the LLM judges the material INSUFFICIENT, the run exits with a specific explanation of what is missing. This costs one cheap call to potentially save 9+ expensive generation calls ($1-5) on input that cannot produce useful output.
Both rejections record an evidence event in evidence.json for auditability.
- LLM output is non-deterministic. Two runs with the same input produce structurally similar but textually different ledgers. The validation pipeline catches drift but cannot guarantee identical output.
- Tool synthesis is planning-only. The tool synthesis module decides whether tools are needed and what language they should use, but does not generate executable code. Source code files are now detected and referenced in generated skills, but executable tool code is not synthesized.
- Citation density heuristic. The pre-screen uses a sub-linear formula for expected citation count. Very long documents (>50K chars) may trigger false positives.
- Ollama quality varies by model. Local models via Ollama produce lower quality output than frontier cloud models. Smaller models may fail validation checks that larger models pass. Use Ollama for iteration and drafting; use cloud providers for final production runs.
SwarmMaker's architecture is informed by current agent engineering research. See docs/agent-engineering-reference.md for the design principles, cognitive limits, tool design patterns, and anti-patterns that shaped the pipeline.
See LICENSE.