Corpus-to-dataset pipeline for AI training data curation
15 skills, 7 agents, 5 format adapters, 3 decontamination modes, 6 default benchmark targets, 485 research REFs. Agentic surface that works out of the box + optional Python runtime for scale.
```shell
/plugin install training@aiwg   # Claude Code plugin install
aiwg use training               # or via AIWG CLI
```
aiwg-training is a marketplace plugin for AIWG that turns any corpus — research papers, code repositories, conversation logs, documentation sites — into training-ready datasets for fine-tuning language models. It produces datasets suitable for SFT, DPO, KTO, ORPO, SimPO, and GRPO training workflows, with full provenance, license inheritance, benchmark decontamination, and byte-for-byte reproducibility.
If you have tried to build a fine-tuning dataset and ended up with ad-hoc scripts, manually curated JSONL files, mystery licenses, and hope-this-doesn't-contaminate-the-eval vibes, aiwg-training is the missing infrastructure layer. It implements every published best practice from dataset methodology research (Self-Instruct, Evol-Instruct, Orca, PersonaHub, STaR), preference-optimization research (DPO, KTO, ORPO, SimPO), governance standards (Datasheets for Datasets, Model Cards, Data Statements, ML Reproducibility Checklist), and safety research (Benchmark Contamination, Model Collapse, Llama Guard) behind a single cohesive framework.
Unlike HuggingFace datasets (storage format) or Axolotl (training orchestrator), aiwg-training is a curation pipeline. It ingests, assesses, synthesizes, filters, formats, decontaminates, versions, and documents — the work that happens before you invoke trainer.train() and the part that determines whether your fine-tune actually learns anything useful.
Building a fine-tuning dataset is hard in ways that don't show up in tutorials. Four failure modes dominate:
Typical dataset scripts produce JSONL files with no record of where each example came from, what license governs it, what transformations were applied, or how to rebuild the same dataset again next week. When something goes wrong — a model overfits a biased subsample, a source is later retracted, a license changes — there's no way to trace or fix it.
Without aiwg-training: 70%+ of published fine-tuning datasets fail the ML Reproducibility Checklist (Pineau et al. 2020). Lineage from raw source to trained model is almost always missing.
With aiwg-training: Every example traces back to its source via W3C PROV (REF-062). Every dataset version ships with a SHA-256 fixity manifest + deterministic seed + reproduction recipe. aiwg-training dataset reproduce byte-reproduces any prior version.
Most fine-tuning datasets accidentally include examples from the benchmarks you'll later use to evaluate the model. Your "HumanEval 67.2%" score is meaningless if 40% of HumanEval was in your training data. Published papers have been retracted over this.
Without aiwg-training: Benchmark leakage is detected post-hoc, if ever. REF-442 (Sainz et al. 2023) shows ChatGPT reproduces CoNLL-2003 verbatim — pervasive contamination across major benchmarks.
With aiwg-training: Decontamination is a first-class pipeline stage that blocks publication. Three detection modes (exact 13-gram per REF-442, fuzzy edit-distance, semantic embedding similarity). Six default targets (MMLU, GSM8K, HumanEval, HELM, MT-Bench, AlpacaEval) extensible to any benchmark. The decontamination-gate lint rule makes override explicit with triple audit trail (manifest + activity log + report appendix).
"MIT output derived from GPL-3.0 sources" is license laundering. The training run is legally exposed — often not discovered until the derived model ships commercially. Mixed-license corpora are worse: combining GPL-2.0-only and GPL-3.0-only produces no valid outbound license at all.
Without aiwg-training: Manual license tracking breaks down past ~10 sources. Most datasets declare a single license for all examples, which is often wrong.
With aiwg-training: Every source declares an SPDX identifier at acquisition time (--allow-unlicensed override requires explicit audit). Every derived example inherits the most-restrictive license from its sources. The license-check rule catches laundering (example license weaker than source), incompatible combinations (GPL-2/3 mix), and commercial-incompatible mixes (CC-BY-NC derivatives in commercial exports). Share-alike obligations propagate automatically.
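The most-restrictive-wins rule can be sketched in a few lines. The restrictiveness ranking and incompatibility table below are simplified illustrations, not the plugin's actual SPDX compatibility logic, and the function name is hypothetical:

```python
# Illustrative ranking: higher = more restrictive. Not a full SPDX matrix.
RESTRICTIVENESS = {
    "MIT": 0,
    "Apache-2.0": 1,
    "CC-BY-4.0": 2,
    "GPL-2.0-only": 3,
    "GPL-3.0-only": 3,
    "CC-BY-NC-4.0": 4,
}
# Known-bad mixes that admit no valid outbound license.
INCOMPATIBLE = {frozenset({"GPL-2.0-only", "GPL-3.0-only"})}

def resolve_outbound_license(source_licenses: list[str]) -> str:
    """Return the most restrictive source license, or fail on incompatible mixes."""
    present = set(source_licenses)
    for combo in INCOMPATIBLE:
        if combo <= present:
            raise ValueError(f"no valid outbound license for mix: {sorted(combo)}")
    return max(source_licenses, key=RESTRICTIVENESS.__getitem__)
```

The GPL-2/3 case is the one called out above: neither license can absorb the other, so resolution fails loudly instead of silently picking one.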
Training a model on synthetic data generated by that same model family is a known failure mode: progressive tail loss, variance collapse, converging to mediocre medians (REF-446, Shumailov 2024). Recursive synthesis makes this worse with each generation.
Without aiwg-training: Synthetic data pipelines drift into recursive generation without realizing it. By the time collapse is detected, the damage is in the weights.
With aiwg-training: Every example tracks metadata.synthetic_depth. The synthetic-data-generator enforces max recursion depth = 1 per ADR-022 D10 — attempting depth > 1 raises ModelCollapseGuardError unless --allow-recursive-synthetic is set explicitly (and logs the override to provenance). Synthetic and human examples live in separate derivedPages categories so the mixing ratio is always measurable.
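A minimal sketch of that depth guard, assuming a dict-shaped example record. Only `ModelCollapseGuardError` and the depth-1 cap come from the text (ADR-022 D10); the function name and record shape are illustrative:

```python
MAX_SYNTHETIC_DEPTH = 1  # per ADR-022 D10

class ModelCollapseGuardError(RuntimeError):
    """Synthesis would exceed the allowed recursion depth."""

def check_synthesis_depth(seed_example: dict, allow_recursive: bool = False) -> int:
    """Return the depth of an example derived from seed_example, or raise."""
    depth = seed_example.get("metadata", {}).get("synthetic_depth", 0) + 1
    if depth > MAX_SYNTHETIC_DEPTH and not allow_recursive:
        raise ModelCollapseGuardError(
            f"depth {depth} exceeds max {MAX_SYNTHETIC_DEPTH}; "
            "set allow_recursive to override (override is logged to provenance)"
        )
    return depth
```

Synthesizing from a human example yields depth 1; synthesizing from an already-synthetic example trips the guard unless the override is explicit.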
aiwg-training's pipeline maps each known failure mode to a dedicated stage. Each stage is an invocable skill + agent pair; the flow-dataset-build orchestrator chains them with human-authorization gates.
acquire-training-source ingests from filesystems, URLs, git repositories, or existing AIWG research REFs. Every source gets an SPDX license declaration, a SHA-256 fixity manifest, a W3C PROV entity, and a format classification (code / docs / papers / dialogues / mixed). Sources without licenses are rejected unless --allow-unlicensed is passed (which writes the exception to the activity log).
example-quality-assess adapts the GRADE evidence framework (REF-060, used in medicine for evaluating study quality) to per-example training data. Source-level GRADE sets a baseline; 11 factors adjust individual example grades (5 upgrades: clear reasoning, task diversity, cross-source corroboration, verifiable output, human-written; 6 downgrades: hallucinated citation, out-of-distribution, ambiguous prompt, truncated output, unsafe content per REF-443 Llama Guard, synthetic recursion beyond depth 1). Aggregate reports break quality down by domain, synthetic-vs-human, and identify the 10 worst offenders.
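As a toy sketch of that adjustment: start from the source's GRADE baseline, move up one level per applicable upgrade factor and down one per downgrade, clamped to the scale. The one-level-per-factor arithmetic and factor names are illustrative, not the skill's exact scoring:

```python
GRADE_SCALE = ["VERY_LOW", "LOW", "MODERATE", "HIGH"]

def adjust_grade(baseline: str, upgrades: list[str], downgrades: list[str]) -> str:
    """Shift the baseline grade by the net factor count, clamped to the scale."""
    idx = GRADE_SCALE.index(baseline) + len(upgrades) - len(downgrades)
    return GRADE_SCALE[max(0, min(len(GRADE_SCALE) - 1, idx))]
```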
example-synthesizer implements four synthesis patterns from the research literature:
- Self-Instruct (REF-375, Wang et al. 2022): bootstrap from seed instructions
- Evol-Instruct: depth (complexity) + breadth (variation) evolution of existing instructions
- SQuAD-style (REF-454, Rajpurkar et al. 2016): extract Q&A pairs from document passages
- STaR (REF-445, Zelikman et al. 2022): augment with chain-of-thought reasoning traces
Per-example provenance records the seeds used, the generator model, temperature, and pattern. Synthesized examples go to derivedPages.synthesizedExamples with synthetic: true, synthetic_depth: 1 — strictly segregated from human-sourced examples.
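Concretely, a synthesized record might carry provenance like the following. Only the fields named in the text (`synthetic`, `synthetic_depth`, seeds, generator model, temperature, pattern) are grounded; the rest of the shape is an illustrative guess at the canonical schema, not the actual one:

```python
# Hypothetical example of a synthesized record's provenance metadata.
synthesized_example = {
    "id": "ex-000123",                 # illustrative field, not from the schema
    "input": "What does this function return when the list is empty?",
    "output": "It returns None, because the loop body never executes.",
    "synthetic": True,
    "metadata": {
        "synthetic_depth": 1,          # hard-capped at 1 per ADR-022 D10
        "pattern": "squad",            # which synthesis pattern produced it
        "generator": {
            "model": "example-generator-model",  # placeholder name
            "temperature": 0.7,
            "seeds": ["ex-000041"],    # seed example(s) used
        },
    },
}
```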
preference-generator produces preference pairs in three modes:
- LLM-judge (Opus for ambiguous judgments, Sonnet default): model evaluates two candidates with a rubric (correctness, clarity, completeness, safety)
- Rule-based (5 heuristics, capped 0.8 confidence): shorter-when-correct, cites-source, reasoning-trace-present, no-hallucination, coherent-not-truncated
- Human (interactive): AskUserQuestion-style prompting for each pair
Preferences are written as graph edges in the consumer's memory (Fortemi preferred, aiwg index fallback). Four export formats ship: DPO ({prompt, chosen, rejected}), KTO ({prompt, completion, label: bool}), ORPO (+ ratio metadata), SimPO (+ length-normalized hint metadata). Rationale notes from LLM judgments are captured as separate analysis pages linked via rationale_note_id.
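The DPO and KTO record shapes named above can be sketched as follows. The input `pair` dict is an assumed internal shape; only the output fields (`{prompt, chosen, rejected}` and `{prompt, completion, label}`) come from the text:

```python
def to_dpo(pair: dict) -> dict:
    """DPO export: one paired record per preference judgment."""
    return {
        "prompt": pair["prompt"],
        "chosen": pair["chosen"],
        "rejected": pair["rejected"],
    }

def to_kto(pair: dict) -> list[dict]:
    """KTO export: unpaired binary feedback, so one pair becomes two records."""
    return [
        {"prompt": pair["prompt"], "completion": pair["chosen"], "label": True},
        {"prompt": pair["prompt"], "completion": pair["rejected"], "label": False},
    ]
```

The KTO split reflects the format's design (REF-391): it consumes independent thumbs-up/thumbs-down signals rather than head-to-head pairs.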
Canonical internal format + five adapters:
| Target | Format | Notes |
|---|---|---|
| Alpaca | `{instruction, input, output}` JSONL | Reasoning traces → sidecar; rejects preference records |
| ShareGPT | `conversations: [{from, value}]` | Axolotl-native; multi-turn preserved; tool calls as tool turns |
| ChatML | OpenAI `messages: [{role, content}]` | Native `tool_calls` field (lossless for tool-use) |
| JSONL | Canonical record-per-line | Identity adapter; reference implementation |
| Parquet | Apache Arrow + Parquet | Columnar; `--shard-size N` for large exports; HuggingFace Datasets native |
Each adapter validates round-trip invariants (per schemas/example-record.yaml): fields that cannot be expressed in the target format land in <output>.metadata.yaml sidecars so reconstruction is lossless.
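The invariant for a lossy target like Alpaca can be sketched as below: fields the target cannot express go to a sidecar, and `(core, sidecar)` together must reconstruct the original record exactly. The field set and function names are illustrative, not the adapters' actual API:

```python
ALPACA_FIELDS = {"instruction", "input", "output"}

def to_alpaca(record: dict) -> tuple[dict, dict]:
    """Split a record into the Alpaca-expressible core and a metadata sidecar."""
    core = {k: v for k, v in record.items() if k in ALPACA_FIELDS}
    sidecar = {k: v for k, v in record.items() if k not in ALPACA_FIELDS}
    return core, sidecar

def from_alpaca(core: dict, sidecar: dict) -> dict:
    """Reassemble the original record from core + sidecar."""
    return {**core, **sidecar}

def validate_round_trip(record: dict) -> bool:
    """True iff converting out and back loses nothing."""
    return from_alpaca(*to_alpaca(record)) == record
```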
decontamination-check scans the dataset against eval benchmarks with three modes:
- Exact n-gram (default N=13 per REF-442)
- Fuzzy (edit distance — catches near-duplicates and light paraphrasing)
- Semantic (sentence-transformers embedding similarity ≥ 0.95 — catches translations, deep paraphrases)
Six default targets ship: MMLU, GSM8K, HumanEval, HELM, MT-Bench, AlpacaEval. Per-target configuration supports custom n-gram sizes (HumanEval uses 8 for code) and per-target detection-mode lists. User-declared targets are unioned with the defaults when `override_defaults: false` is set.
The decontamination-gate rule blocks dataset-version from publishing unless the report is fresh and all targets pass threshold. Override only via --acknowledge-contamination with explicit justification written to manifest.ethical_considerations, activity.log, and the report appendix.
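The default exact mode reduces to a set intersection over token n-grams. This is a minimal sketch (whitespace tokenization is a simplification; real scans would normalize text and stream large benchmarks):

```python
def ngrams(text: str, n: int = 13) -> set:
    """All contiguous n-token windows of text, as joined strings."""
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(example: str, benchmark_texts: list[str], n: int = 13) -> bool:
    """True if the example shares any exact n-gram with any benchmark text."""
    ex_grams = ngrams(example, n)
    return any(ex_grams & ngrams(b, n) for b in benchmark_texts)
```

Dropping `n` to 8, as the HumanEval config does, makes the check more sensitive at the cost of more false positives on boilerplate-heavy text like code.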
dataset-version runs a 9-step atomic operation: gate validation → split computation (deterministic seed) → most-restrictive-wins license resolution → synthetic ratio computation → SHA-256 fixity manifest → W3C PROV record → storage snapshot → YAML manifest + auto-exported JSON sibling → dataset-version event logged. Any failure triggers rollback — partial outputs are cleaned up from .staging/.
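One way to make split computation deterministic at a declared seed is to hash the (seed, example id) pair and map it to a unit interval. This is an illustrative technique for seed-stable splits, not necessarily the plugin's exact algorithm:

```python
import hashlib

def assign_split(example_id: str, seed: int, ratios: dict = None) -> str:
    """Deterministically assign an example to a split from (seed, id) alone."""
    ratios = ratios or {"train": 0.8, "validation": 0.1, "test": 0.1}
    digest = hashlib.sha256(f"{seed}:{example_id}".encode()).hexdigest()
    u = int(digest[:12], 16) / float(16 ** 12)  # uniform in [0, 1)
    cumulative = 0.0
    for name, ratio in ratios.items():
        cumulative += ratio
        if u < cumulative:
            return name
    return list(ratios)[-1]  # guard against float rounding at the boundary
```

Because assignment depends only on the seed and the example's identity, adding or removing other examples never reshuffles existing ones, which is what makes byte-for-byte reproduction possible.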
dataset-docs generates three compliance documents:
- Datasheet for Datasets (REF-451, Gebru et al. 2021) — 57 questions across 7 sections
- Model Card (REF-452, Mitchell et al. 2019) — 9-section model documentation
- Data Statement for NLP (REF-453, Bender & Friedman 2018) — 9-component linguistic documentation
≥60% of fields auto-populate from the dataset manifest + example metadata + quality reports. HUMAN FILL markers flag fields requiring judgment; interactive mode prompts for each; LLM-assisted mode offers suggestions.
Building a code-review training dataset from open-source repos. The pipeline runs on a single command once the config is written; the time breakdown is where humans add value, not where the machine churns.
```yaml
# pipeline.yaml
sources:
  - uri: "git:https://github.com/rust-lang/rust"
    license: "Apache-2.0 OR MIT"
    format_hint: code
    quality_grade: HIGH
  - uri: "git:https://github.com/torvalds/linux"
    license: "GPL-2.0-only"
    format_hint: code
    quality_grade: HIGH
  - uri: "ref:REF-238"     # CodeBERT paper from research corpus
    format_hint: papers
    quality_grade: HIGH
synthesis:
  enabled: true
  pattern: squad           # Extract Q&A pairs from code review discussions
  count: 2000
  temperature: 0.7
preference_generation:
  enabled: true
  mode: llm-judge          # Opus for ambiguous code quality judgments
  pair_count: 500
  target_format: dpo
format_exports: [jsonl, alpaca, parquet]
decontamination:
  mode: exact-ngram
  ngram_size: 8            # Shorter n-gram for code overlap detection
  threshold: 0
publish:
  version: "2026.4.0"
  name: "code-review-gold-v1"
  description: "Code review examples curated from open-source projects with expert annotations"
  seed: 42
  split_ratios:
    train: 0.8
    validation: 0.1
    test: 0.1
  target_model: "claude-sonnet-4-6-finetune"
  intended_use: "Fine-tune code review models for educational feedback"
```

```shell
aiwg-training flow build pipeline.yaml --interactive
```

With `--interactive`, the pipeline pauses after stage 3 (license-check) for human review and after stage 9 (decontamination-gate) before publishing. You see each stage's output as it completes.
```
.aiwg/training/
├── raw/                    # Original source files with SHA-256 fixity
├── examples/raw/           # Canonical examples ingested from sources
├── examples/synthesized/   # Synthesis outputs (2,000 SQuAD-style Q&A pairs)
├── preferences/            # 500 DPO triples
├── exports/
│   ├── alpaca/2026.4.0.jsonl
│   ├── jsonl/2026.4.0.jsonl
│   └── parquet/2026.4.0.parquet
├── reports/
│   ├── decontamination-2026.4.0.md    # Per-target overlap report (PASS)
│   ├── pipeline-2026.4.0-<run-id>.md  # Full stage-by-stage report
│   └── quality-<timestamp>.md         # GRADE distribution + worst offenders
├── provenance/dataset-2026.4.0.jsonld # W3C PROV bundle
└── datasets/
    ├── 2026.4.0.yaml                  # Dataset manifest (source of truth)
    ├── 2026.4.0.json                  # Auto-exported JSON sibling
    ├── 2026.4.0-CHECKSUMS.sha256      # Fixity manifest
    ├── 2026.4.0-datasheet.md          # Gebru datasheet
    ├── 2026.4.0-model-card.md         # Mitchell model card
    └── 2026.4.0-data-statement.md     # Bender & Friedman data statement
```
The Parquet file loads directly into HuggingFace Datasets:
```python
from datasets import load_dataset

ds = load_dataset(
    "parquet",
    data_files=".aiwg/training/exports/parquet/2026.4.0.parquet",
)
```

Or into Axolotl via the ShareGPT export, or into TRL (REF-477) via the DPO JSONL. The dataset manifest ID and provenance bundle travel with the data — downstream eval results (via matric-eval) link back to it via `trained_on_dataset_version`.
Six months later, someone else wants to regenerate the exact same dataset:
```shell
aiwg-training dataset reproduce datasets/2026.4.0.yaml --compare-fixity
```

If sources haven't drifted and the synthesis config produces deterministic output at the declared seed, the SHA-256 fixity matches byte-for-byte. Where non-determinism exists (model API drift, GPU floating point), the report enumerates what differs so you know whether to trust the rebuild.
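The fixity comparison itself is simple to sketch: parse the sha256sum-compatible manifest ("hex digest, two spaces, path" per line) and rehash each file. This is an illustrative reading of that format, with hypothetical function names, not the plugin's actual implementation:

```python
import hashlib
from pathlib import Path

def sha256_file(path: Path) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_fixity(manifest_path: Path) -> list[str]:
    """Return paths whose current hash differs from the recorded digest."""
    mismatches = []
    root = manifest_path.parent
    for line in manifest_path.read_text().splitlines():
        if not line.strip():
            continue
        digest, _, rel = line.partition("  ")
        if sha256_file(root / rel.strip()) != digest.strip():
            mismatches.append(rel.strip())
    return mismatches
```

An empty mismatch list means the rebuild is byte-identical; a non-empty one tells you exactly which artifacts drifted.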
| Claim | Basis |
|---|---|
| 144 tests passing, 0 failing | pytest tests/ after Phase 4 commit |
| 60%+ auto-population rate on datasheets | REF-451 standard benchmark |
| 0% benchmark contamination at publication | Gate rule blocks any overlap > threshold |
| <5 minutes for 10K-example decontamination scan | Phase 2 acceptance criteria |
| ~$7,200 total cost for 1M-example pipeline | RLM cost model: $400 Haiku screening + $6,000 Sonnet synthesis + $800 Haiku formatting |
| 14,320 lines of Python runtime | 49 files, measurable via wc -l |
| 485 research REFs | roctinam/research-papers corpus as of 2026-04-14 |
- Fine-tuning a Claude/GPT/Llama family model on your domain corpus
- Building a preference dataset for DPO/KTO/ORPO training
- Needing to publish a dataset with full provenance for research compliance
- Turning a research corpus (papers, docs, code) into training examples
- Preparing datasets for benchmark-safe evaluation (no contamination)
- Mixing human-curated + synthetic examples with explicit ratio control
- Working across licenses that need inheritance tracking
- You already have a clean, versioned, decontaminated JSONL and just need to train — use TRL/Axolotl directly
- You need a pre-training dataset at web scale (billions of tokens) — use Dolma or RedPajama construction pipelines
- You want a single-purpose tool for one transformation (e.g., just format conversion) — the CLI works for this but simpler tools exist
- You need eval execution (running benchmarks against trained models) — this delegates to matric-eval
aiwg-training optimizes for correctness and provenance at human-scale (tens of thousands of examples, human-reviewable quality gates, cross-session durability). It doesn't optimize for web-scale pre-training corpora (billions of documents, streaming throughput). The sweet spot is project-scale fine-tuning where you need auditability more than raw throughput.
aiwg-training ships as a dual-stack framework. Both layers implement the same operations; the Python layer is installed on request when batch scale demands it.
```
┌─────────────────────────────────────────────────────────────────┐
│ Agentic Surface (15 SKILL.md + 7 agents)                        │
│                                                                 │
│ - Works out of the box in any AIWG install                      │
│ - AI agents read specs and execute them in-context              │
│ - Human-authorization gates at license + decontamination        │
│ - Cost-effective for exploration, small datasets                │
│                                                                 │
│ IF AIWG_TRAINING_AVAILABLE=1:                                   │
│   → skills delegate batch work via `aiwg-training <subcommand>` │
│ ELSE:                                                           │
│   → agent handles it in-context (slower but works everywhere)   │
└─────────────────────────────────────────────────────────────────┘
                              ↓
┌─────────────────────────────────────────────────────────────────┐
│ Python Runtime (optional, in lib/)                              │
│                                                                 │
│ - Installed by hooks/post-install.js on user approval           │
│ - Requires Python 3.10+                                         │
│ - Registers .aiwg/training/runtime.json marker                  │
│ - hooks/pre-skill.js exposes AIWG_TRAINING_BIN to each skill    │
│                                                                 │
│ Scales to:                                                      │
│ - 1M-example format conversions                                 │
│ - 10K-example decontamination scans                             │
│ - GPU-free bulk synthesis via Haiku batching                    │
│ - Byte-for-byte reproducibility                                 │
└─────────────────────────────────────────────────────────────────┘
```
The agentic layer works for everyone; the Python layer is there when you need it. Nothing forces one or the other — skills degrade gracefully.
```shell
# Via AIWG plugin marketplace (recommended)
/plugin install training@aiwg

# Or via AIWG CLI
npm i -g aiwg
aiwg use training

# The post-install hook will detect Python 3.10+ and ask whether
# to install the optional runtime backend. Decline for agent-only mode.
```

```shell
# Create a minimal pipeline config
cat > pipeline.yaml <<'EOF'
sources:
  - uri: "file:./docs/"
    license: "MIT"
    format_hint: docs
    quality_grade: MODERATE
format_exports: [jsonl]
decontamination:
  mode: exact-ngram
publish:
  version: "0.1.0"
  name: "docs-dataset"
  description: "Documentation extracted for Q&A training"
  seed: 42
EOF

# Run end-to-end (agent-only path): just ask your AIWG-enabled
# assistant to "build training dataset from pipeline.yaml".
# Or use the CLI if the Python runtime is installed:
aiwg-training flow build pipeline.yaml
```

```shell
cd lib/
python3 -m venv .venv
.venv/bin/pip install -e .[all]

# Format conversion
.venv/bin/aiwg-training format convert examples.jsonl --target alpaca --output alpaca.jsonl --validate-round-trip

# Decontamination check
.venv/bin/aiwg-training decontamination check dataset.jsonl --mode exact-ngram

# End-to-end pipeline
.venv/bin/aiwg-training flow build pipeline.yaml --version 2026.4.0 --interactive
```

Ingest + Quality (3): acquire-training-source, example-quality-assess, license-check (lint rule)
Synthesis + Preferences (3): example-synthesizer, preference-generator, synthetic-data-generator
Formats (5): format-adapter-alpaca, -sharegpt, -chatml, -jsonl, -parquet
Decontamination (2): decontamination-check, decontamination-gate (lint rule)
Publication (4): dataset-version, dataset-reproduce, dataset-docs, flow-dataset-build (orchestrator)
Per RLM cost guidance (haiku for bulk mechanical, sonnet default, opus for ambiguous judgments):
- `source-curator-agent` (sonnet) — decides which sources are admitted to the corpus
- `example-synthesizer-agent` (sonnet) — SFT generation
- `preference-generator-agent` (opus) — ambiguous preference judgments
- `format-converter-agent` (haiku) — bulk adapter dispatch
- `decontamination-agent` (sonnet) — contamination check + hard gate
- `dataset-evaluator-agent` (sonnet) — metrics + matric-eval handoff
- `dataset-publication-agent` (sonnet) — versioning + publication
Each agent has ≤7 responsibilities per the god-session rule, cites its authority boundaries (can-do / requires-approval / never), and references ADR-022.
- `example-record.yaml` — canonical training example (12 task types, input/output/metadata)
- `dataset-manifest.yaml` — version-level metadata with reproducibility fields
- `synthetic-generator-config.yaml` — config for large-batch synthesis patterns
- `decontamination-targets.yaml` — eval target config with per-target n-gram sizes
- `datasheet-for-datasets.md` — 57-question Gebru 2021
- `model-card.md` — 9-section Mitchell 2019
- `data-statement.md` — 9-component Bender & Friedman 2018
- `decontamination-report.md` — per-target overlap samples + reproducibility block
- `license-check.md` — C1–C5 with most-restrictive-wins resolution + incompatible-combo catalog
- `decontamination-gate.md` — C1–C5 publication gate with `--acknowledge-contamination` override
- `aiwg_training.schemas` — Pydantic models mirroring the YAML schemas
- `aiwg_training.core` — topology, fixity, W3C PROV, log writer
- `aiwg_training.formats` — 5 adapters with round-trip validation
- `aiwg_training.decontamination` — ngram, fuzzy, semantic + report
- `aiwg_training.quality` — LicenseChecker, QualityAssessor
- `aiwg_training.synthesis` — LLM client, example/preference/synthetic generators
- `aiwg_training.publication` — DatasetVersioner, DatasetReproducer, DatasetDocsGenerator, FlowDatasetBuild
- `aiwg_training.cli` — Click-based CLI (12 subcommand groups)
See ADR-022 for the full architectural record. Ten locked decisions (D1–D10):
| # | Decision |
|---|---|
| D1 | Framework name: training-complete |
| D2 | Topology: skills + flow-orchestrated |
| D3 | Storage: filesystem + Fortemi (preferred) + aiwg index (fallback) |
| D4 | Granularity: 1 example = 1 Fortemi note |
| D5 | Preferences: Fortemi graph edges with metadata |
| D6 | Dataset versioning: Fortemi collections + YAML dataset manifests |
| D7 | Canonical format: JSONL + 5 adapters |
| D8 | Decontamination: first-class pipeline stage + lint rule + matric-eval delegation |
| D9 | Provenance: W3C PROV-O |
| D10 | Synthetic recursion: max depth 1 (Model Collapse guard) |
Three-tier storage per ADR-022 D3:
| Tier | Purpose | Lifetime |
|---|---|---|
| Filesystem (`.aiwg/training/raw/`) | Raw sources before ingestion | Immutable reference |
| Fortemi (via MCP) | Durable, relationship-rich, cross-session | Persistent |
| `aiwg index` | Graph fallback when Fortemi unavailable; always used for session cache | Session or persistent |
Fortemi excels at preference-edge storage + multi-hop retrieval for pair synthesis. aiwg index serves as the graph fallback with multi-backend support (json / graphology / sqlite).
Grounded in the roctinam/research-papers corpus (485 REFs as of 2026-04-14). Organized by what each group of papers informs:
- REF-376 DPO (Rafailov et al. 2023) — replaces RM+PPO with single classification loss
- REF-391 KTO (Ethayarajh et al. 2024) — Kahneman-Tversky; unpaired binary feedback
- REF-392 ORPO (Hong et al. 2024) — single-stage SFT+alignment with odds ratio
- REF-393 SimPO (Meng et al. 2024) — length-normalized avg log-prob; reference-free
- REF-394 GRPO (Shao et al. 2024) — group relative policy optimization (DeepSeek-R1 basis)
- REF-395 IPO (Azar et al. 2023) — general Ψ-PO framework; DPO overfit fix
- REF-396 RLAIF (Lee et al. 2023) — LLM-generated preferences at 1/10 cost of human
- REF-438 UltraFeedback (Cui et al. 2024) — 256K conversations × 1M GPT-4 annotations
- REF-375 Self-Instruct (Wang et al. 2022) — bootstrap from seeds, 52K instructions
- REF-470 Orca (Mukherjee et al. 2023) — progressive learning from GPT-4 explanation traces
- REF-435 Orca 2 (Mitra et al. 2023) — reasoning strategy selection + prompt erasing
- REF-436 Phi-1 (Gunasekar et al. 2023) — textbook-quality data thesis origin
- REF-437 Phi-3 (Abdin et al. 2024) — heavily filtered web + synthetic; runs on iPhone
- REF-445 STaR (Zelikman et al. 2022) — iterative rationale bootstrapping
- REF-457 V-STaR (Hosseini et al. 2024) — DPO-trained verifier on correct+incorrect rationales
- REF-448 PersonaHub (Ge et al. 2024) — 1B personas for synthetic diversity
- REF-456 ReST (Gulcehre et al. 2023) — grow/improve offline RL loop (legitimate recursion case)
- REF-458 LIMA (Zhou et al. 2023) — superficial alignment hypothesis; 1K curated examples
- REF-454 SQuAD (Rajpurkar et al. 2016) — paragraph→question→answer workflow
- REF-442 Benchmark Contamination (Sainz et al. 2023) — 3-level taxonomy; contamination registry
- REF-443 Llama Guard (Inan et al. 2023) — 6-category unsafe taxonomy
- REF-444 Sleeper Agents (Hubinger et al. 2024) — backdoored LLMs persist through training
- REF-446 Model Collapse (Shumailov et al. 2024) — recursive synthetic causes tail loss
- REF-060 GRADE Handbook — evidence quality methodology adapted for examples
- REF-451 Datasheets for Datasets (Gebru et al. 2021) — 57-question documentation template
- REF-452 Model Cards (Mitchell et al. 2019) — 9-section model documentation
- REF-453 Data Statements (Bender & Friedman 2018) — 9-component NLP-specific documentation
- REF-475 ML Reproducibility Checklist (Pineau et al. 2020) — 17-question NeurIPS checklist
- REF-474 Dataset Versioning (DVC / LakeFS / HuggingFace Hub) — comparison study
- REF-056 FAIR Principles — findability, accessibility, interoperability, reusability
- REF-062 W3C PROV-O — entity-activity-agent provenance model
- REF-471 HuggingFace Datasets (Lhoest et al. 2021) — Apache Arrow-backed; 650+ datasets
- REF-472 ChatML / ShareGPT / OpenAI messages — conversation format specifications
- REF-473 Arrow + Parquet — in-memory + on-disk columnar (HF ecosystem backbone)
- REF-455 LAION-5B (Schuhmann et al. 2022) — 5.85B pairs; WebDataset sharding
- REF-476 T0 (Sanh et al. 2021) — 2,073 PromptSource templates; 11B model competitive with GPT-3
- REF-377 LoRA (Hu et al. 2021) — low-rank adaptation; 10,000× param reduction
- REF-378 QLoRA (Dettmers et al. 2023) — 4-bit NormalFloat; fine-tune 65B on single GPU
- REF-379 DoRA (Liu et al. 2024) — weight decomposition; bridges LoRA-FT gap
- REF-477 TRL (von Werra et al. 2020–2026) — HuggingFace post-training library (SFT/DPO/PPO/KTO/ORPO)
- REF-478 Axolotl / LLaMA-Factory / Unsloth — fine-tuning orchestration frameworks
- REF-449 LM Evaluation Harness (Gao / EleutherAI) — 60+ benchmarks; HF Open LLM Leaderboard
- REF-450 AlpacaEval (Dubois et al. 2024) — length-controlled win rates; 0.98 Spearman with Arena
- REF-063 HELM, REF-064 BigBench — holistic evaluation
| Standard | How aiwg-training Conforms |
|---|---|
| SPDX License Identifiers | Every source + derived example declares an SPDX ID; inheritance computed via most-restrictive-wins |
| W3C PROV-O | Entity-Activity-Agent chain per example (source → ingest → synthesize → quality-score → format-convert → export → version) |
| OAIS Fixity Information | SHA-256 manifests generated per dataset version with self-verifying headers (sha256sum-compatible format) |
| Datasheets for Datasets (REF-451) | All 57 questions addressed in auto-populated template; 60%+ fields machine-filled |
| Model Cards (REF-452) | All 9 sections; training data section fully auto-populated from manifest |
| Data Statements for NLP (REF-453) | All 9 components; speaker/annotator demographics surfaced as HUMAN FILL |
| ML Reproducibility Checklist (REF-475) | Version + seed + sources + provenance + fixity + reproduction recipe all captured in dataset manifest |
```
aiwg-training/
├── .claude-plugin/
│   └── plugin.json          # Marketplace discovery metadata
├── manifest.json            # AIWG framework manifest (memory.topology)
├── README.md                # This file
├── hooks/                   # JS hooks for plugin lifecycle
│   ├── post-install.js      # Optional Python runtime installer
│   └── pre-skill.js         # Exposes AIWG_TRAINING_BIN to skills
├── agents/                  # 7 domain agent definitions
├── skills/                  # 15 SKILL.md files
├── templates/               # 4 template files
├── schemas/                 # 4 YAML schemas
├── rules/                   # 2 lint rules
├── docs/                    # matric-eval integration
└── lib/                     # Python runtime (optional)
    ├── pyproject.toml
    ├── aiwg_training/       # 14K lines Python across 49 files
    └── tests/               # 144 tests
```
See the main AIWG repository for contribution guidelines. Training-specific contributions:
- New synthesis patterns: add a generator class under `lib/aiwg_training/synthesis/` + a SKILL.md stub
- New format adapters: extend the `FormatAdapter` base class; add a round-trip test fixture
- New decontamination targets: add to `schemas/decontamination-targets.yaml` with per-target config
- New quality factors: extend `UPGRADE_FACTORS` or `DOWNGRADE_FACTORS` in `quality/example_quality.py`
- Research references: file induction issues in `roctinam/research-papers` first; cite REF IDs in skill docs
All changes must pass .venv/bin/pytest tests/ with 0 regressions and respect the god-session + human-authorization rules from aiwg-utils.
- Source (Gitea): https://github.com/jmagly/aiwg-training
- Mirror (GitHub): https://github.com/jmagly/aiwg-training
- AIWG (parent project): https://github.com/jmagly/aiwg
- Research corpus: https://github.com/jmagly/research-papers
- Evaluation sister project: https://github.com/jmagly/matric-eval
- Discord: https://discord.gg/BuAusFMxdA
- Telegram: https://t.me/+oJg9w2lE6A5lOGFh
| Need | Where to file |
|---|---|
| Training-complete bugs (data, decontamination, manifest shape) | roctinam/aiwg with label training-complete |
| Eval execution (benchmarks, trained model evaluation) | roctinam/matric-eval |
| Research corpus additions | roctinam/research-papers |
MIT. See LICENSE.
Built on 485 research papers from the AIWG research corpus. Special recognition to the authors whose work grounds this framework: Rafailov (DPO), Ethayarajh (KTO), Hong (ORPO), Meng (SimPO), Shao (GRPO), Wang (Self-Instruct), Mukherjee (Orca), Gunasekar (Phi), Zelikman (STaR), Ge (PersonaHub), Gulcehre (ReST), Shumailov (Model Collapse), Sainz (Benchmark Contamination), Gebru (Datasheets), Mitchell (Model Cards), Bender & Friedman (Data Statements), Pineau (ML Repro Checklist), and the HuggingFace team (Datasets, TRL).