Corpus-to-dataset pipeline for AI training data curation
15 skills, 7 agents, 5 format adapters, 3 decontamination modes, 6 default benchmark targets, 485 research REFs. Agentic surface that works out of the box + optional Python runtime for scale.
```shell
/plugin install training@aiwg   # Claude Code plugin install
aiwg use training               # or via AIWG CLI
```
aiwg-training is a marketplace plugin for AIWG that turns any corpus — research papers, code repositories, conversation logs, documentation sites — into training-ready datasets for fine-tuning language models. It produces datasets suitable for SFT, DPO, KTO, ORPO, SimPO, and GRPO training workflows, with full provenance, license inheritance, benchmark decontamination, and byte-for-byte reproducibility.
If you have tried to build a fine-tuning dataset and ended up with ad-hoc scripts, manually curated JSONL files, mystery licenses, and hope-this-doesn't-contaminate-the-eval vibes, aiwg-training is the missing infrastructure layer. It implements every published best practice from dataset methodology research (Self-Instruct, Evol-Instruct, Orca, PersonaHub, STaR), preference-optimization research (DPO, KTO, ORPO, SimPO), governance standards (Datasheets for Datasets, Model Cards, Data Statements, ML Reproducibility Checklist), and safety research (Benchmark Contamination, Model Collapse, Llama Guard) behind a single cohesive framework.
Unlike HuggingFace datasets (storage format) or Axolotl (training orchestrator), aiwg-training is a curation pipeline. It ingests, assesses, synthesizes, filters, formats, decontaminates, versions, and documents — the work that happens before you invoke trainer.train() and the part that determines whether your fine-tune actually learns anything useful.
Building a fine-tuning dataset is hard in ways that don't show up in tutorials. Four failure modes dominate:
Typical dataset scripts produce JSONL files with no record of where each example came from, what license governs it, what transformations were applied, or how to rebuild the same dataset again next week. When something goes wrong — a model overfits a biased subsample, a source is later retracted, a license changes — there's no way to trace or fix it.
Without aiwg-training: 70%+ of published fine-tuning datasets fail the ML Reproducibility Checklist (Pineau et al. 2020). Lineage from raw source to trained model is almost always missing.
With aiwg-training: Every example traces back to its source via W3C PROV (REF-062). Every dataset version ships with a SHA-256 fixity manifest + deterministic seed + reproduction recipe. aiwg-training dataset reproduce byte-reproduces any prior version.
Most fine-tuning datasets accidentally include examples from the benchmarks you'll later use to evaluate the model. Your "HumanEval 67.2%" score is meaningless if 40% of HumanEval was in your training data. Published papers have been retracted over this.
Without aiwg-training: Benchmark leakage is detected post-hoc, if ever. REF-442 (Sainz et al. 2023) shows ChatGPT reproduces CoNLL-2003 verbatim — pervasive contamination across major benchmarks.
With aiwg-training: Decontamination is a first-class pipeline stage that blocks publication. Three detection modes (exact 13-gram per REF-442, fuzzy edit-distance, semantic embedding similarity). Six default targets (MMLU, GSM8K, HumanEval, HELM, MT-Bench, AlpacaEval) extensible to any benchmark. The decontamination-gate lint rule makes override explicit with triple audit trail (manifest + activity log + report appendix).
"MIT output derived from GPL-3.0 sources" is license laundering. The training run is legally exposed — often not discovered until the derived model ships commercially. Mixed-license corpora are worse: combining GPL-2.0-only and GPL-3.0-only produces no valid outbound license at all.
Without aiwg-training: Manual license tracking breaks down past ~10 sources. Most datasets declare a single license for all examples, which is often wrong.
With aiwg-training: Every source declares an SPDX identifier at acquisition time (--allow-unlicensed override requires explicit audit). Every derived example inherits the most-restrictive license from its sources. The license-check rule catches laundering (example license weaker than source), incompatible combinations (GPL-2/3 mix), and commercial-incompatible mixes (CC-BY-NC derivatives in commercial exports). Share-alike obligations propagate automatically.
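The most-restrictive-wins rule can be sketched in a few lines. The restrictiveness ranking and incompatibility table below are simplified illustrations, not the plugin's actual SPDX compatibility logic, and the function name is hypothetical:

```python
# Illustrative ranking: higher = more restrictive. Not a full SPDX matrix.
RESTRICTIVENESS = {
    "MIT": 0,
    "Apache-2.0": 1,
    "CC-BY-4.0": 2,
    "GPL-2.0-only": 3,
    "GPL-3.0-only": 3,
    "CC-BY-NC-4.0": 4,
}
# Known-bad mixes that admit no valid outbound license.
INCOMPATIBLE = {frozenset({"GPL-2.0-only", "GPL-3.0-only"})}

def resolve_outbound_license(source_licenses: list[str]) -> str:
    """Return the most restrictive source license, or fail on incompatible mixes."""
    present = set(source_licenses)
    for combo in INCOMPATIBLE:
        if combo <= present:
            raise ValueError(f"no valid outbound license for mix: {sorted(combo)}")
    return max(source_licenses, key=RESTRICTIVENESS.__getitem__)
```

The GPL-2/3 case is the one called out above: neither license can absorb the other, so resolution fails loudly instead of silently picking one.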
Training a model on synthetic data generated by that same model family is a known failure mode: progressive tail loss, variance collapse, converging to mediocre medians (REF-446, Shumailov 2024). Recursive synthesis makes this worse with each generation.
Without aiwg-training: Synthetic data pipelines drift into recursive generation without realizing it. By the time collapse is detected, the damage is in the weights.
With aiwg-training: Every example tracks metadata.synthetic_depth. The synthetic-data-generator enforces max recursion depth = 1 per ADR-022 D10 — attempting depth > 1 raises ModelCollapseGuardError unless --allow-recursive-synthetic is set explicitly (and logs the override to provenance). Synthetic and human examples live in separate derivedPages categories so the mixing ratio is always measurable.
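A minimal sketch of that depth guard, assuming a dict-shaped example record. Only `ModelCollapseGuardError` and the depth-1 cap come from the text (ADR-022 D10); the function name and record shape are illustrative:

```python
MAX_SYNTHETIC_DEPTH = 1  # per ADR-022 D10

class ModelCollapseGuardError(RuntimeError):
    """Synthesis would exceed the allowed recursion depth."""

def check_synthesis_depth(seed_example: dict, allow_recursive: bool = False) -> int:
    """Return the depth of an example derived from seed_example, or raise."""
    depth = seed_example.get("metadata", {}).get("synthetic_depth", 0) + 1
    if depth > MAX_SYNTHETIC_DEPTH and not allow_recursive:
        raise ModelCollapseGuardError(
            f"depth {depth} exceeds max {MAX_SYNTHETIC_DEPTH}; "
            "set allow_recursive to override (override is logged to provenance)"
        )
    return depth
```

Synthesizing from a human example yields depth 1; synthesizing from an already-synthetic example trips the guard unless the override is explicit.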
aiwg-training's pipeline maps each known failure mode to a dedicated stage. Each stage is an invocable skill + agent pair; the flow-dataset-build orchestrator chains them with human-authorization gates.
acquire-training-source ingests from filesystems, URLs, git repositories, or existing AIWG research REFs. Every source gets an SPDX license declaration, a SHA-256 fixity manifest, a W3C PROV entity, and a format classification (code / docs / papers / dialogues / mixed). Sources without licenses are rejected unless --allow-unlicensed is passed (which writes the exception to the activity log).
example-quality-assess adapts the GRADE evidence framework (REF-060, used in medicine for evaluating study quality) to per-example training data. Source-level GRADE sets a baseline; 11 factors adjust individual example grades (5 upgrades: clear reasoning, task diversity, cross-source corroboration, verifiable output, human-written; 6 downgrades: hallucinated citation, out-of-distribution, ambiguous prompt, truncated output, unsafe content per REF-443 Llama Guard, synthetic recursion beyond depth 1). Aggregate reports break quality down by domain, synthetic-vs-human, and identify the 10 worst offenders.
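As a toy sketch of that adjustment: start from the source's GRADE baseline, move up one level per applicable upgrade factor and down one per downgrade, clamped to the scale. The one-level-per-factor arithmetic and factor names are illustrative, not the skill's exact scoring:

```python
GRADE_SCALE = ["VERY_LOW", "LOW", "MODERATE", "HIGH"]

def adjust_grade(baseline: str, upgrades: list[str], downgrades: list[str]) -> str:
    """Shift the baseline grade by the net factor count, clamped to the scale."""
    idx = GRADE_SCALE.index(baseline) + len(upgrades) - len(downgrades)
    return GRADE_SCALE[max(0, min(len(GRADE_SCALE) - 1, idx))]
```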
example-synthesizer implements four synthesis patterns from the research literature:
- Self-Instruct (REF-375, Wang et al. 2022): bootstrap from seed instructions
- Evol-Instruct: depth (complexity) + breadth (variation) evolution of existing instructions
- SQuAD-style (REF-454, Rajpurkar et al. 2016): extract Q&A pairs from document passages
- STaR (REF-445, Zelikman et al. 2022): augment with chain-of-thought reasoning traces
Per-example provenance records the seeds used, the generator model, temperature, and pattern. Synthesized examples go to derivedPages.synthesizedExamples with synthetic: true, synthetic_depth: 1 — strictly segregated from human-sourced examples.
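Concretely, a synthesized record might carry provenance like the following. Only the fields named in the text (`synthetic`, `synthetic_depth`, seeds, generator model, temperature, pattern) are grounded; the rest of the shape is an illustrative guess at the canonical schema, not the actual one:

```python
# Hypothetical example of a synthesized record's provenance metadata.
synthesized_example = {
    "id": "ex-000123",                 # illustrative field, not from the schema
    "input": "What does this function return when the list is empty?",
    "output": "It returns None, because the loop body never executes.",
    "synthetic": True,
    "metadata": {
        "synthetic_depth": 1,          # hard-capped at 1 per ADR-022 D10
        "pattern": "squad",            # which synthesis pattern produced it
        "generator": {
            "model": "example-generator-model",  # placeholder name
            "temperature": 0.7,
            "seeds": ["ex-000041"],    # seed example(s) used
        },
    },
}
```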
preference-generator produces preference pairs in three modes:
- LLM-judge (Opus for ambiguous judgments, Sonnet default): model evaluates two candidates with a rubric (correctness, clarity, completeness, safety)
- Rule-based (5 heuristics, capped 0.8 confidence): shorter-when-correct, cites-source, reasoning-trace-present, no-hallucination, coherent-not-truncated
- Human (interactive): AskUserQuestion-style prompting for each pair
Preferences are written as graph edges in the consumer's memory (Fortemi preferred, aiwg index fallback). Four export formats ship: DPO ({prompt, chosen, rejected}), KTO ({prompt, completion, label: bool}), ORPO (+ ratio metadata), SimPO (+ length-normalized hint metadata). Rationale notes from LLM judgments are captured as separate analysis pages linked via rationale_note_id.
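The DPO and KTO record shapes named above can be sketched as follows. The input `pair` dict is an assumed internal shape; only the output fields (`{prompt, chosen, rejected}` and `{prompt, completion, label}`) come from the text:

```python
def to_dpo(pair: dict) -> dict:
    """DPO export: one paired record per preference judgment."""
    return {
        "prompt": pair["prompt"],
        "chosen": pair["chosen"],
        "rejected": pair["rejected"],
    }

def to_kto(pair: dict) -> list[dict]:
    """KTO export: unpaired binary feedback, so one pair becomes two records."""
    return [
        {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True},
        {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False},
    ]
```

The KTO split reflects the format's design (REF-391): it consumes independent thumbs-up/thumbs-down signals rather than head-to-head pairs.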
Canonical internal format + five adapters:
| Target | Format | Notes |
|---|---|---|
| Alpaca | `{instruction, input, output}` JSONL | Reasoning traces → sidecar; rejects preference records |
| ShareGPT | `conversations: [{from, value}]` | Axolotl-native; multi-turn preserved; tool calls as tool turns |
| ChatML | OpenAI `messages: [{role, content}]` | Native `tool_calls` field (lossless for tool-use) |
| JSONL | Canonical record-per-line | Identity adapter; reference implementation |
| Parquet | Apache Arrow + Parquet | Columnar; `--shard-size N` for large exports; HuggingFace Datasets native |
Each adapter validates round-trip invariants (per schemas/example-record.yaml): fields that cannot be expressed in the target format land in <output>.metadata.yaml sidecars so reconstruction is lossless.
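The invariant for a lossy target like Alpaca can be sketched as below: fields the target cannot express go to a sidecar, and `(core, sidecar)` together must reconstruct the original record exactly. The field set and function names are illustrative, not the adapters' actual API:

```python
ALPACA_FIELDS = {"instruction", "input", "output"}

def to_alpaca(record: dict) -> tuple[dict, dict]:
    """Split a record into the Alpaca-expressible core and a metadata sidecar."""
    core = {k: v for k, v in record.items() if k in ALPACA_FIELDS}
    sidecar = {k: v for k, v in record.items() if k not in ALPACA_FIELDS}
    return core, sidecar

def from_alpaca(core: dict, sidecar: dict) -> dict:
    """Reassemble the original record from core + sidecar."""
    return {**core, **sidecar}

def validate_round_trip(record: dict) -> bool:
    """True iff converting out and back loses nothing."""
    return from_alpaca(*to_alpaca(record)) == record
```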
decontamination-check scans the dataset against eval benchmarks with three modes:
- Exact n-gram (default N=13 per REF-442)
- Fuzzy (edit distance — catches near-duplicates and light paraphrasing)
- Semantic (sentence-transformers embedding similarity ≥ 0.95 — catches translations, deep paraphrases)
Six default targets ship: MMLU, GSM8K, HumanEval, HELM, MT-Bench, AlpacaEval. Per-target configuration supports custom n-gram sizes (HumanEval uses 8 for code) and per-target detection-mode lists. User-declared targets are unioned with the defaults when `override_defaults: false` is set.
The decontamination-gate rule blocks dataset-version from publishing unless the report is fresh and all targets pass threshold. Override only via --acknowledge-contamination with explicit justification written to manifest.ethical_considerations, activity.log, and the report appendix.
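The default exact mode reduces to a set intersection over token n-grams. This is a minimal sketch (whitespace tokenization is a simplification; real scans would normalize text and stream large benchmarks):

```python
def ngrams(text: str, n: int = 13) -> set:
    """All contiguous n-token windows of text, as joined strings."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(example: str, benchmark_texts: list[str], n: int = 13) -> bool:
    """True if the example shares any exact n-gram with any benchmark text."""
    ex_grams = ngrams(example, n)
    return any(ex_grams & ngrams(b, n) for b in benchmark_texts)
```

Dropping `n` to 8, as the HumanEval config does, makes the check more sensitive at the cost of more false positives on boilerplate-heavy text like code.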
dataset-version runs a 9-step atomic operation: gate validation → split computation (deterministic seed) → most-restrictive-wins license resolution → synthetic ratio computation → SHA-256 fixity manifest → W3C PROV record → storage snapshot → YAML manifest + auto-exported JSON sibling → dataset-version event logged. Any failure triggers rollback — partial outputs are cleaned up from .staging/.
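One way to make split computation deterministic at a declared seed is to hash the (seed, example id) pair and map it to a unit interval. This is an illustrative technique for seed-stable splits, not necessarily the plugin's exact algorithm:

```python
import hashlib

def assign_split(example_id: str, seed: int, ratios: dict = None) -> str:
    """Deterministically assign an example to a split from (seed, id) alone."""
    ratios = ratios or {"train": 0.8, "validation": 0.1, "test": 0.1}
    digest = hashlib.sha256(f"{seed}:{example_id}".encode()).hexdigest()
    u = int(digest[:12], 16) / float(16 ** 12)  # uniform in [0, 1)
    cumulative = 0.0
    for name, ratio in ratios.items():
        cumulative += ratio
        if u < cumulative:
            return name
    return list(ratios)[-1]  # guard against float rounding at the boundary
```

Because assignment depends only on the seed and the example's identity, adding or removing other examples never reshuffles existing ones, which is what makes byte-for-byte reproduction possible.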
dataset-docs generates three compliance documents:
- Datasheet for Datasets (REF-451, Gebru et al. 2021) — 57 questions across 7 sections
- Model Card (REF-452, Mitchell et al. 2019) — 9-section model documentation
- Data Statement for NLP (REF-453, Bender & Friedman 2018) — 9-component linguistic documentation
≥60% of fields auto-populate from the dataset manifest + example metadata + quality reports. HUMAN FILL markers flag fields requiring judgment; interactive mode prompts for each; LLM-assisted mode offers suggestions.
Building a code-review training dataset from open-source repos. The pipeline runs on a single command once the config is written; the time breakdown is where humans add value, not where the machine churns.
```yaml
# pipeline.yaml
sources:
  - uri: "git:https://github.com/rust-lang/rust"
    license: "Apache-2.0 OR MIT"
    format_hint: code
    quality_grade: HIGH
  - uri: "git:https://github.com/torvalds/linux"
    license: "GPL-2.0-only"
    format_hint: code
    quality_grade: HIGH
  - uri: "ref:REF-238"     # CodeBERT paper from research corpus
    format_hint: papers
    quality_grade: HIGH
synthesis:
  enabled: true
  pattern: squad           # Extract Q&A pairs from code review discussions
  count: 2000
  temperature: 0.7
preference_generation:
  enabled: true
  mode: llm-judge          # Opus for ambiguous code quality judgments
  pair_count: 500
  target_format: dpo
format_exports: [jsonl, alpaca, parquet]
decontamination:
  mode: exact-ngram
  ngram_size: 8            # Shorter n-gram for code overlap detection
  threshold: 0
publish:
  version: "2026.4.0"
  name: "code-review-gold-v1"
  description: "Code review examples curated from open-source projects with expert annotations"
  seed: 42
  split_ratios:
    train: 0.8
    validation: 0.1
    test: 0.1
  target_model: "claude-sonnet-4-6-finetune"
  intended_use: "Fine-tune code review models for educational feedback"
```

```shell
aiwg-training flow build pipeline.yaml --interactive
```

With `--interactive`, the pipeline pauses after stage 3 (license-check) for human review and after stage 9 (decontamination-gate) before publishing. You see each stage's output as it completes.
```
.aiwg/training/
├── raw/                    # Original source files with SHA-256 fixity
├── examples/raw/           # Canonical examples ingested from sources
├── examples/synthesized/   # Synthesis outputs (2,000 SQuAD-style Q&A pairs)
├── preferences/            # 500 DPO triples
├── exports/
│   ├── alpaca/2026.4.0.jsonl
│   ├── jsonl/2026.4.0.jsonl
│   └── parquet/2026.4.0.parquet
├── reports/
│   ├── decontamination-2026.4.0.md    # Per-target overlap report (PASS)
│   ├── pipeline-2026.4.0-<run-id>.md  # Full stage-by-stage report
│   └── quality-<timestamp>.md         # GRADE distribution + worst offenders
├── provenance/dataset-2026.4.0.jsonld # W3C PROV bundle
└── datasets/
    ├── 2026.4.0.yaml                  # Dataset manifest (source of truth)
    ├── 2026.4.0.json                  # Auto-exported JSON sibling
    ├── 2026.4.0-CHECKSUMS.sha256      # Fixity manifest
    ├── 2026.4.0-datasheet.md          # Gebru datasheet
    ├── 2026.4.0-model-card.md         # Mitchell model card
    └── 2026.4.0-data-statement.md     # Bender & Friedman data statement
```
The Parquet file loads directly into HuggingFace Datasets:
```python
from datasets import load_dataset

ds = load_dataset(
    "parquet",
    data_files=".aiwg/training/exports/parquet/2026.4.0.parquet",
)
```

Or into Axolotl via the ShareGPT export, or into TRL (REF-477) via the DPO JSONL. The dataset manifest ID and provenance bundle travel with the data — downstream eval results (via matric-eval) link back to it via `trained_on_dataset_version`.
Six months later, someone else wants to regenerate the exact same dataset:
```shell
aiwg-training dataset reproduce datasets/2026.4.0.yaml --compare-fixity
```

If sources haven't drifted and the synthesis config produces deterministic output at the declared seed, the SHA-256 fixity matches byte-for-byte. Where non-determinism exists (model API drift, GPU floating point), the report enumerates what differs so you know whether to trust the rebuild.
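The fixity comparison itself is simple to sketch: parse the sha256sum-compatible manifest ("hex digest, two spaces, path" per line) and rehash each file. This is an illustrative reading of that format, with hypothetical function names, not the plugin's actual implementation:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_fixity(manifest_path: Path) -> list[str]:
    """Return paths whose current hash differs from the recorded digest."""
    mismatches = []
    root = manifest_path.parent
    for line in manifest_path.read_text().splitlines():
        if not line.strip():
            continue
        digest, _, rel = line.partition("  ")
        if sha256_file(root / rel.strip()) != digest.strip():
            mismatches.append(rel.strip())
    return mismatches
```

An empty mismatch list means the rebuild is byte-identical; a non-empty one tells you exactly which artifacts drifted.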
| Claim | Basis |
|---|---|
| 144 tests passing, 0 failing | pytest tests/ after Phase 4 commit |
| 60%+ auto-population rate on datasheets | REF-451 standard benchmark |
| 0% benchmark contamination at publication | Gate rule blocks any overlap > threshold |
| <5 minutes for 10K-example decontamination scan | Phase 2 acceptance criteria |
| ~$7,200 total cost for 1M-example pipeline | RLM cost model: $400 Haiku screening + $6,000 Sonnet synthesis + $800 Haiku formatting |
| 14,320 lines of Python runtime | 49 files, measurable via wc -l |
| 485 research REFs | roctinam/research-papers corpus as of 2026-04-14 |
- Fine-tuning a Claude/GPT/Llama family model on your domain corpus
- Building a preference dataset for DPO/KTO/ORPO training
- Needing to publish a dataset with full provenance for research compliance
- Turning a research corpus (papers, docs, code) into training examples
- Preparing datasets for benchmark-safe evaluation (no contamination)
- Mixing human-curated + synthetic examples with explicit ratio control
- Working across licenses that need inheritance tracking
- You already have a clean, versioned, decontaminated JSONL and just need to train — use TRL/Axolotl directly
- You need a pre-training dataset at web scale (billions of tokens) — use Dolma or RedPajama construction pipelines
- You want a single-purpose tool for one transformation (e.g., just format conversion) — the CLI works for this but simpler tools exist
- You need eval execution (running benchmarks against trained models) — this delegates to matric-eval
aiwg-training optimizes for correctness and provenance at human-scale (tens of thousands of examples, human-reviewable quality gates, cross-session durability). It doesn't optimize for web-scale pre-training corpora (billions of documents, streaming throughput). The sweet spot is project-scale fine-tuning where you need auditability more than raw throughput.
aiwg-training ships as a dual-stack framework. Both layers implement the same operations; the Python layer is installed on request when batch scale demands it.
```
┌─────────────────────────────────────────────────────────────────┐
│ Agentic Surface (15 SKILL.md + 7 agents)                        │
│                                                                 │
│ - Works out of the box in any AIWG install                      │
│ - AI agents read specs and execute them in-context              │
│ - Human-authorization gates at license + decontamination        │
│ - Cost-effective for exploration, small datasets                │
│                                                                 │
│ IF AIWG_TRAINING_AVAILABLE=1:                                   │
│   → skills delegate batch work via `aiwg-training <subcommand>` │
│ ELSE:                                                           │
│   → agent handles it in-context (slower but works everywhere)   │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Python Runtime (optional, in lib/)                              │
│                                                                 │
│ - Installed by hooks/post-install.js on user approval           │
│ - Requires Python 3.10+                                         │
│ - Registers .aiwg/training/runtime.json marker                  │
│ - hooks/pre-skill.js exposes AIWG_TRAINING_BIN to each skill    │
│                                                                 │
│ Scales to:                                                      │
│ - 1M-example format conversions                                 │
│ - 10K-example decontamination scans                             │
│ - GPU-free bulk synthesis via Haiku batching                    │
│ - Byte-for-byte reproducibility                                 │
└─────────────────────────────────────────────────────────────────┘
```
The agentic layer works for everyone; the Python layer is there when you need it. Nothing forces one or the other — skills degrade gracefully.
```shell
# Via AIWG plugin marketplace (recommended)
/plugin install training@aiwg

# Or via AIWG CLI
npm i -g aiwg
aiwg use training

# The post-install hook will detect Python 3.10+ and ask whether
# to install the optional runtime backend. Decline for agent-only mode.
```

```shell
# Create a minimal pipeline config
cat > pipeline.yaml <<'EOF'
sources:
  - uri: "file:./docs/"
    license: "MIT"
    format_hint: docs
    quality_grade: MODERATE
format_exports: [jsonl]
decontamination:
  mode: exact-ngram
publish:
  version: "0.1.0"
  name: "docs-dataset"
  description: "Documentation extracted for Q&A training"
  seed: 42
EOF

# Run end-to-end (agent-only path): just ask your AIWG-enabled
# assistant to "build training dataset from pipeline.yaml".
# Or use the CLI if the Python runtime is installed:
aiwg-training flow build pipeline.yaml
```

```shell
cd lib/
python3 -m venv .venv
.venv/bin/pip install -e .[all]

# Format conversion
.venv/bin/aiwg-training format convert examples.jsonl --target alpaca --output alpaca.jsonl --validate-round-trip

# Decontamination check
.venv/bin/aiwg-training decontamination check dataset.jsonl --mode exact-ngram

# End-to-end pipeline
.venv/bin/aiwg-training flow build pipeline.yaml --version 2026.4.0 --interactive
```

Ingest + Quality (3): acquire-training-source, example-quality-assess, license-check (lint rule)
Synthesis + Preferences (3): example-synthesizer, preference-generator, synthetic-data-generator
Formats (5): format-adapter-alpaca, -sharegpt, -chatml, -jsonl, -parquet
Decontamination (2): decontamination-check, decontamination-gate (lint rule)
Publication (4): dataset-version, dataset-reproduce, dataset-docs, flow-dataset-build (orchestrator)
Per RLM cost guidance (haiku for bulk mechanical, sonnet default, opus for ambiguous judgments):
- `source-curator-agent` (sonnet) — decides which sources are admitted to the corpus
- `example-synthesizer-agent` (sonnet) — SFT generation
- `preference-generator-agent` (opus) — ambiguous preference judgments
- `format-converter-agent` (haiku) — bulk adapter dispatch
- `decontamination-agent` (sonnet) — contamination check + hard gate
- `dataset-evaluator-agent` (sonnet) — metrics + matric-eval handoff
- `dataset-publication-agent` (sonnet) — versioning + publication
Each agent has ≤7 responsibilities per the god-session rule, cites its authority boundaries (can-do / requires-approval / never), and references ADR-022.
- `example-record.yaml` — canonical training example (12 task types, input/output/metadata)
- `dataset-manifest.yaml` — version-level metadata with reproducibility fields
- `synthetic-generator-config.yaml` — config for large-batch synthesis patterns
- `decontamination-targets.yaml` — eval target config with per-target n-gram sizes
- `datasheet-for-datasets.md` — 57-question Gebru 2021
- `model-card.md` — 9-section Mitchell 2019
- `data-statement.md` — 9-component Bender & Friedman 2018
- `decontamination-report.md` — per-target overlap samples + reproducibility block
- `license-check.md` — C1–C5 with most-restrictive-wins resolution + incompatible-combo catalog
- `decontamination-gate.md` — C1–C5 publication gate with `--acknowledge-contamination` override
- `aiwg_training.schemas` — Pydantic models mirroring the YAML schemas
- `aiwg_training.core` — topology, fixity, W3C PROV, log writer
- `aiwg_training.formats` — 5 adapters with round-trip validation
- `aiwg_training.decontamination` — ngram, fuzzy, semantic + report
- `aiwg_training.quality` — LicenseChecker, QualityAssessor
- `aiwg_training.synthesis` — LLM client, example/preference/synthetic generators
- `aiwg_training.publication` — DatasetVersioner, DatasetReproducer, DatasetDocsGenerator, FlowDatasetBuild
- `aiwg_training.cli` — Click-based CLI (12 subcommand groups)
See ADR-022 for the full architectural record. Ten locked decisions (D1–D10):
| # | Decision |
|---|---|
| D1 | Framework name: training-complete |
| D2 | Topology: skills + flow-orchestrated |
| D3 | Storage: filesystem + Fortemi (preferred) + aiwg index (fallback) |
| D4 | Granularity: 1 example = 1 Fortemi note |
| D5 | Preferences: Fortemi graph edges with metadata |
| D6 | Dataset versioning: Fortemi collections + YAML dataset manifests |
| D7 | Canonical format: JSONL + 5 adapters |
| D8 | Decontamination: first-class pipeline stage + lint rule + matric-eval delegation |
| D9 | Provenance: W3C PROV-O |
| D10 | Synthetic recursion: max depth 1 (Model Collapse guard) |
Three-tier storage per ADR-022 D3:
| Tier | Purpose | Lifetime |
|---|---|---|
| Filesystem (`.aiwg/training/raw/`) | Raw sources before ingestion | Immutable reference |
| Fortemi (via MCP) | Durable, relationship-rich, cross-session | Persistent |
| `aiwg index` | Graph fallback when Fortemi unavailable; always used for session cache | Session or persistent |
Fortemi excels at preference-edge storage + multi-hop retrieval for pair synthesis. aiwg index serves as the graph fallback with multi-backend support (json / graphology / sqlite).
Grounded in the roctinam/research-papers corpus (485 REFs as of 2026-04-14). Organized by what each group of papers informs:
- REF-376 DPO (Rafailov et al. 2023) — replaces RM+PPO with single classification loss
- REF-391 KTO (Ethayarajh et al. 2024) — Kahneman-Tversky; unpaired binary feedback
- REF-392 ORPO (Hong et al. 2024) — single-stage SFT+alignment with odds ratio
- REF-393 SimPO (Meng et al. 2024) — length-normalized avg log-prob; reference-free
- REF-394 GRPO (Shao et al. 2024) — group relative policy optimization (DeepSeek-R1 basis)
- REF-395 IPO (Azar et al. 2023) — general Ψ-PO framework; DPO overfit fix
- REF-396 RLAIF (Lee et al. 2023) — LLM-generated preferences at 1/10 cost of human
- REF-438 UltraFeedback (Cui et al. 2024) — 256K conversations × 1M GPT-4 annotations
- REF-375 Self-Instruct (Wang et al. 2022) — bootstrap from seeds, 52K instructions
- REF-470 Orca (Mukherjee et al. 2023) — progressive learning from GPT-4 explanation traces
- REF-435 Orca 2 (Mitra et al. 2023) — reasoning strategy selection + prompt erasing
- REF-436 Phi-1 (Gunasekar et al. 2023) — textbook-quality data thesis origin
- REF-437 Phi-3 (Abdin et al. 2024) — heavily filtered web + synthetic; runs on iPhone
- REF-445 STaR (Zelikman et al. 2022) — iterative rationale bootstrapping
- REF-457 V-STaR (Hosseini et al. 2024) — DPO-trained verifier on correct+incorrect rationales
- REF-448 PersonaHub (Ge et al. 2024) — 1B personas for synthetic diversity
- REF-456 ReST (Gulcehre et al. 2023) — grow/improve offline RL loop (legitimate recursion case)
- REF-458 LIMA (Zhou et al. 2023) — superficial alignment hypothesis; 1K curated examples
- REF-454 SQuAD (Rajpurkar et al. 2016) — paragraph→question→answer workflow
- REF-442 Benchmark Contamination (Sainz et al. 2023) — 3-level taxonomy; contamination registry
- REF-443 Llama Guard (Inan et al. 2023) — 6-category unsafe taxonomy
- REF-444 Sleeper Agents (Hubinger et al. 2024) — backdoored LLMs persist through training
- REF-446 Model Collapse (Shumailov et al. 2024) — recursive synthetic causes tail loss
- REF-060 GRADE Handbook — evidence quality methodology adapted for examples
- REF-451 Datasheets for Datasets (Gebru et al. 2021) — 57-question documentation template
- REF-452 Model Cards (Mitchell et al. 2019) — 9-section model documentation
- REF-453 Data Statements (Bender & Friedman 2018) — 9-component NLP-specific documentation
- REF-475 ML Reproducibility Checklist (Pineau et al. 2020) — 17-question NeurIPS checklist
- REF-474 Dataset Versioning (DVC / LakeFS / HuggingFace Hub) — comparison study
- REF-056 FAIR Principles — findability, accessibility, interoperability, reusability
- REF-062 W3C PROV-O — entity-activity-agent provenance model
- REF-471 HuggingFace Datasets (Lhoest et al. 2021) — Apache Arrow-backed; 650+ datasets
- REF-472 ChatML / ShareGPT / OpenAI messages — conversation format specifications
- REF-473 Arrow + Parquet — in-memory + on-disk columnar (HF ecosystem backbone)
- REF-455 LAION-5B (Schuhmann et al. 2022) — 5.85B pairs; WebDataset sharding
- REF-476 T0 (Sanh et al. 2021) — 2,073 PromptSource templates; 11B model competitive with GPT-3
- REF-377 LoRA (Hu et al. 2021) — low-rank adaptation; 10,000× param reduction
- REF-378 QLoRA (Dettmers et al. 2023) — 4-bit NormalFloat; fine-tune 65B on single GPU
- REF-379 DoRA (Liu et al. 2024) — weight decomposition; bridges LoRA-FT gap
- REF-477 TRL (von Werra et al. 2020–2026) — HuggingFace post-training library (SFT/DPO/PPO/KTO/ORPO)
- REF-478 Axolotl / LLaMA-Factory / Unsloth — fine-tuning orchestration frameworks
- REF-449 LM Evaluation Harness (Gao / EleutherAI) — 60+ benchmarks; HF Open LLM Leaderboard
- REF-450 AlpacaEval (Dubois et al. 2024) — length-controlled win rates; 0.98 Spearman with Arena
- REF-063 HELM, REF-064 BigBench — holistic evaluation
| Standard | How aiwg-training Conforms |
|---|---|
| SPDX License Identifiers | Every source + derived example declares an SPDX ID; inheritance computed via most-restrictive-wins |
| W3C PROV-O | Entity-Activity-Agent chain per example (source → ingest → synthesize → quality-score → format-convert → export → version) |
| OAIS Fixity Information | SHA-256 manifests generated per dataset version with self-verifying headers (sha256sum-compatible format) |
| Datasheets for Datasets (REF-451) | All 57 questions addressed in auto-populated template; 60%+ fields machine-filled |
| Model Cards (REF-452) | All 9 sections; training data section fully auto-populated from manifest |
| Data Statements for NLP (REF-453) | All 9 components; speaker/annotator demographics surfaced as HUMAN FILL |
| ML Reproducibility Checklist (REF-475) | Version + seed + sources + provenance + fixity + reproduction recipe all captured in dataset manifest |
```
aiwg-training/
├── .claude-plugin/
│   └── plugin.json          # Marketplace discovery metadata
├── manifest.json            # AIWG framework manifest (memory.topology)
├── README.md                # This file
├── hooks/                   # JS hooks for plugin lifecycle
│   ├── post-install.js      # Optional Python runtime installer
│   └── pre-skill.js         # Exposes AIWG_TRAINING_BIN to skills
├── agents/                  # 7 domain agent definitions
├── skills/                  # 15 SKILL.md files
├── templates/               # 4 template files
├── schemas/                 # 4 YAML schemas
├── rules/                   # 2 lint rules
├── docs/                    # matric-eval integration
└── lib/                     # Python runtime (optional)
    ├── pyproject.toml
    ├── aiwg_training/       # 14K lines Python across 49 files
    └── tests/               # 144 tests
```
See the main AIWG repository for contribution guidelines. Training-specific contributions:
- New synthesis patterns: add a generator class under `lib/aiwg_training/synthesis/` + a SKILL.md stub
- New format adapters: extend the `FormatAdapter` base class; add a round-trip test fixture
- New decontamination targets: add to `schemas/decontamination-targets.yaml` with per-target config
- New quality factors: extend `UPGRADE_FACTORS` or `DOWNGRADE_FACTORS` in `quality/example_quality.py`
- Research references: file induction issues in `roctinam/research-papers` first; cite REF IDs in skill docs
All changes must pass .venv/bin/pytest tests/ with 0 regressions and respect the god-session + human-authorization rules from aiwg-utils.
- Source (Gitea): https://github.com/jmagly/aiwg-training
- Mirror (GitHub): https://github.com/jmagly/aiwg-training
- AIWG (parent project): https://github.com/jmagly/aiwg
- Research corpus: https://github.com/jmagly/research-papers
- Evaluation sister project: https://github.com/jmagly/matric-eval
- Discord: https://discord.gg/BuAusFMxdA
- Telegram: https://t.me/+oJg9w2lE6A5lOGFh
| Need | Where to file |
|---|---|
| Training-complete bugs (data, decontamination, manifest shape) | roctinam/aiwg with label training-complete |
| Eval execution (benchmarks, trained model evaluation) | roctinam/matric-eval |
| Research corpus additions | roctinam/research-papers |
MIT. See LICENSE.
Built on 485 research papers from the AIWG research corpus. Special recognition to the authors whose work grounds this framework: Rafailov (DPO), Ethayarajh (KTO), Hong (ORPO), Meng (SimPO), Shao (GRPO), Wang (Self-Instruct), Mukherjee (Orca), Gunasekar (Phi), Zelikman (STaR), Ge (PersonaHub), Gulcehre (ReST), Shumailov (Model Collapse), Sainz (Benchmark Contamination), Gebru (Datasheets), Mitchell (Model Cards), Bender & Friedman (Data Statements), Pineau (ML Repro Checklist), and the HuggingFace team (Datasets, TRL).