Thesis: Given a tiny model from HuggingFace, every format conversion and runtime engine in the Sovereign AI Stack must produce token-identical greedy outputs (or bounded quantization drift). A single failure proves a bug.
Current status: 0/139 checks passing (but inference now works for SmolLM/Qwen2).
Exercises 23 of 40+ apr subcommands across 20 check suites.
Twelve issues filed against aprender/realizar — see Filed Issues.
See CLAIMS.md for pre-registered falsifiable claims and design rationale (ADRs).
- Hardware: Any x86_64 or ARM64 machine with ≥4GB RAM. CPU-only (no GPU required).
- Software: `apr` (v0.2.16+), `uv` (v0.5+), Python 3.11+
- Disk: ~5GB for models directory

```bash
# Docker (fully reproducible)
docker build -t tmgt . && docker run tmgt

# Native (requires Rust toolchain)
bash ci/setup.sh   # Install apr
uv sync            # Install Python deps (locked via uv.lock)
make pull          # Download 3 tiny models (~1.5GB)
make convert       # Import to APR (Int4/Int8) + export GGUF
make check         # Run all parity checks (actually invokes apr inference)
make ticket        # Generate GitHub issue markdown for failures
```

Run a single check suite or model:

```bash
uv run python scripts/parity_check.py --check canary
uv run python scripts/parity_check.py --model smollm-135m
uv run python scripts/parity_check.py --check token --model qwen2-0.5b
```

Run tests:

```bash
make test          # Unit + property tests (no apr/model deps)
make test-parity   # Integration parity tests (requires apr + models)
make coverage      # Coverage report
```

| Model | APR Int8 | APR Int4 | GGUF Roundtrip | PPL |
|---|---|---|---|---|
| SmolLM-135M | RUNS but --json broken (#240) | RUNS garbled output | BLOCKED (#240) | BLOCKED (#242) |
| Qwen2-0.5B | RUNS "Paris" correct (#240) | RUNS "Paris" correct | BLOCKED (#240) | BLOCKED (#242) |
| GPT-2 124M | FAIL qkv_weight zeros (#241) | FAIL qkv_weight zeros | FAIL (#241) | BLOCKED (#242) |

Current blockers:

- `apr run --json` outputs human-readable text instead of JSON (#240), which blocks programmatic checks
- GPT-2 fused qkv_weight not dequantized (#241), which blocks all GPT-2 inference
- `apr eval` rejects APR format, GGUF fused_matmul type error (#242), which blocks PPL checks

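
Because the first blocker means a JSON-mode run may still print plain text, a checker benefits from failing loudly rather than silently mis-parsing. The following is a minimal sketch of that guard, not the repo's actual code; the function name `parse_tool_json` is hypothetical and it operates on any captured stdout string.

```python
import json


def parse_tool_json(stdout: str) -> dict:
    """Parse captured stdout as a JSON object, failing loudly if it is not JSON."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError as exc:
        # Human-readable text ends up here; keep a short preview for the bug report.
        preview = stdout.strip()[:80]
        raise ValueError(f"expected JSON, got text: {preview!r}") from exc
    if not isinstance(payload, dict):
        raise ValueError(f"expected a JSON object, got {type(payload).__name__}")
    return payload
```

With a guard like this, a text-only reply such as `The capital of France is Paris` surfaces as a clear error instead of a confusing downstream failure.
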
| Issue | Bug | Status |
|---|---|---|
| #231 | Int8 embedding NaN/Inf + shape mismatch | Fixed |
| #232 | Int4 all-zero embedding tensors | Fixed |
| #233 | GPT-2 tensor name mapping + wpe density | Fixed |
| #234 | lm_head not excluded from quantization | Fixed |
| #235 | GPT-2 hidden_dim=64 instead of 768 | Fixed |
| #236 | GPT-2 GGUF export hidden_dim=0 | Fixed |
| #237 | Quantization write pipeline broken for all tensors | Fixed |
| #239 | realizar loader reads Q8/Q4 bytes as f32 | Fixed |
| #240 | `apr run --json` flag ignored | Open (blocks programmatic checks) |
| #241 | GPT-2 fused qkv_weight not dequantized | Open (blocks GPT-2) |
| #242 | `apr eval` rejects APR + GGUF type error | Open (blocks PPL) |

| Suite | Checks | What it tests |
|---|---|---|
| `check-canary` | 12 | Golden output regression: Int8 text must exactly match oracle |
| `check-token` | 24 | Int4/Int8 token mismatch bounds (≤5/32 and ≤3/32) |
| `check-drift` | 12 | Int8 mismatches ≤ Int4 mismatches + 1 |
| `check-roundtrip` | 6 | APR → GGUF → reimport produces identical tokens |
| `check-ppl` | 9 | PPL within model-specific bounds, Int4/Int8 diff < 0.5 |

| Suite | Checks | What it tests |
|---|---|---|
| `check-inspect` | 6 | Metadata matches expected arch params (Claim 9) |
| `check-validate` | 6 | Magic/header/version integrity (Claim 10) |
| `check-tensors` | 6 | Tensor count and dtypes match quant level (Claim 11) |
| `check-lint` | 6 | No critical lint violations (Claim 12) |
| `check-selftest` | 6 | ≥7/10 `apr check` pipeline stages pass (Claim 13) |
| `check-diff` | 3 | Int4 vs Int8 differ only in dtype (Claim 14) |
| `check-tree` | 6 | Layer/tensor count matches expected (Claim 15) |
| `check-oracle-id` | 6 | `apr oracle` identifies correct architecture (Claim 16) |
| `check-hex` | 6 | Tensor statistics non-zero (Claim 17) |
| `check-debug` | 6 | `apr debug` reports health=OK (Claim 18) |
| `check-bench` | 6 | `apr bench` produces >0 tok/s (Claim 19, expected fail) |
| `check-qa` | 6 | ≥3 QA gates execute without critical failure (Claim 20) |

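
To illustrate what "differ only in dtype" (Claim 14) could mean in practice, here is a rough sketch that compares two tensor listings. The input layout (tensor name mapped to a dict with `dtype` and `shape`) and the function name are hypothetical, chosen only for illustration; the real `apr diff` output may be structured differently.

```python
def non_dtype_differences(meta_a: dict, meta_b: dict) -> dict:
    """Report tensors that differ in anything other than their 'dtype' field.

    Both arguments map tensor name -> {"dtype": ..., "shape": ...}.
    This layout is an assumption made for the example.
    """
    problems = {}
    for name in sorted(set(meta_a) | set(meta_b)):
        a, b = meta_a.get(name), meta_b.get(name)
        if a is None or b is None:
            problems[name] = "missing in one file"
            continue
        # Collect every field except dtype whose value is not the same on both sides.
        changed = sorted(
            key for key in set(a) | set(b)
            if key != "dtype" and a.get(key) != b.get(key)
        )
        if changed:
            problems[name] = changed
    return problems
```

An empty result means the two files agree on everything except quantization dtype, which is the prediction the claim makes.
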
| Suite | Checks | What it tests |
|---|---|---|
| `check-list` | 1 | `apr list` cache inventory succeeds |
| `check-rosetta-diff` | 3 | No layout mismatches between Int8/Int4 |
| `check-parity-gpu` | 3 | GPU/CPU produce identical results (GGUF) |

Total: 139 checks across 20 suites.

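
As a quick arithmetic check, the per-suite counts from the three suite tables can be summed. The snippet only restates those counts; the name `SUITE_CHECKS` is illustrative.

```python
# Check counts copied from the three suite tables.
SUITE_CHECKS = {
    "check-canary": 12, "check-token": 24, "check-drift": 12,
    "check-roundtrip": 6, "check-ppl": 9,
    "check-inspect": 6, "check-validate": 6, "check-tensors": 6,
    "check-lint": 6, "check-selftest": 6, "check-diff": 3,
    "check-tree": 6, "check-oracle-id": 6, "check-hex": 6,
    "check-debug": 6, "check-bench": 6, "check-qa": 6,
    "check-list": 1, "check-rosetta-diff": 3, "check-parity-gpu": 3,
}

total_checks = sum(SUITE_CHECKS.values())
print(len(SUITE_CHECKS), total_checks)  # 20 139
```
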
23 of 40+ apr subcommands exercised:
| Category | Subcommands |
|---|---|
| Pipeline | pull, import, export, run, eval |
| Inspection | inspect, validate, tensors, lint, tree, hex, debug |
| Analysis | check, diff, oracle, bench, qa, parity |
| Rosetta | rosetta diff-tensors |
| Utility | list, --version |

Deferred (broken or inapplicable): `compare-hf` (ZSTD error), `flow` (ZSTD error), `canary` (audio-only), `serve` (server), `tui`/`chat`/`cbtop` (interactive), `rm`/`publish` (destructive), `rosetta fingerprint` (Q8 bug).

This repo uses Popperian falsification: we attempt to disprove parity rather than prove it. Each test encodes a specific falsifiable prediction. A single failure constitutes evidence of a bug in the format conversion or runtime engine.
- Source: HuggingFace `transformers` (v5.1.0) with PyTorch (v2.10.0)
- Precision: float32, CPU-only, `do_sample=False` (deterministic greedy)
- Random seeds: Not applicable, since greedy decoding is fully deterministic (see ADR-001)
- Max tokens: 32 per prompt
- Output: JSON with token IDs, decoded text, model metadata
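
The oracle settings listed above can be sanity-checked on a loaded oracle record before it is used as ground truth. The sketch below assumes the record is a plain dict using the field names from the oracle JSON (`do_sample`, `format`, `max_new_tokens`, `tokens`, `token_count`); the function name and the assumption that `token_count` equals `len(tokens)` are mine, not the repo's.

```python
def oracle_problems(record: dict, max_new_tokens: int = 32) -> list[str]:
    """Return a list of reasons why an oracle record violates the stated settings."""
    problems = []
    if record.get("do_sample") is not False:
        problems.append("do_sample must be false (greedy decoding)")
    if record.get("format") != "float32":
        problems.append("format must be float32")
    if record.get("max_new_tokens") != max_new_tokens:
        problems.append(f"max_new_tokens must be {max_new_tokens}")

    tokens = record.get("tokens")
    if not isinstance(tokens, list) or not all(isinstance(t, int) for t in tokens):
        problems.append("tokens must be a list of integer IDs")
    elif len(tokens) > max_new_tokens:
        problems.append("more tokens than max_new_tokens")
    elif record.get("token_count") != len(tokens):
        # Assumption: token_count is meant to equal the length of the tokens list.
        problems.append("token_count does not match len(tokens)")
    return problems
```

An empty list means the record is consistent with deterministic greedy decoding at float32 and the 32-token cap.
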
| Comparison | Threshold | Effect Size | n |
|---|---|---|---|
| Int4 vs oracle | ≤5/32 mismatches | 15.6% | 12 |
| Int8 vs oracle | ≤3/32 mismatches | 9.4% | 12 |
| Int8 vs Int4 drift | Int8 ≤ Int4+1 | ≤1 token | 12 |
| Cross-runtime | Exact text match | 0% | 12 |
| PPL Int4 vs Int8 | Diff < 0.5 | <0.5 PPL | 3 |
| Canary (text) | Exact text match | 0% | 12 |
All checks: CI=100% (deterministic greedy decoding).
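
The mismatch thresholds in the table reduce to a position-wise token comparison. The following is a minimal sketch under two assumptions that the text does not state: tokens are compared index by index, and any length difference counts as additional mismatches. All names are illustrative.

```python
# Maximum mismatching tokens out of 32, taken from the thresholds table.
MAX_MISMATCHES = {"int4": 5, "int8": 3}


def count_mismatches(oracle: list[int], candidate: list[int]) -> int:
    """Count differing positions; a length difference adds to the count (assumption)."""
    differing = sum(1 for a, b in zip(oracle, candidate) if a != b)
    return differing + abs(len(oracle) - len(candidate))


def within_bound(oracle: list[int], candidate: list[int], quant: str) -> bool:
    """True if the candidate stays within the mismatch budget for its quant level."""
    return count_mismatches(oracle, candidate) <= MAX_MISMATCHES[quant]


def drift_ok(int8_mismatches: int, int4_mismatches: int) -> bool:
    """Drift rule from the table: Int8 may not exceed Int4 by more than one token."""
    return int8_mismatches <= int4_mismatches + 1
```

For example, an output with 4 differing tokens out of 32 would pass the Int4 bound but fail the Int8 bound.
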
- Total sample size: n = 139 parity checks (exhaustive cross-product across 20 suites), plus pytest tests including property-based tests with n = 100 iterations via hypothesis.
- Standard deviation: σ = 0 for all parity checks. Greedy decoding (temperature=0) is fully deterministic. Outputs are bit-for-bit identical across runs. Uncertainty is ±0.
- Confidence interval: [exact, exact] for all checks. 95% and 99% CIs are not applicable (variance = 0). The CI is trivially 100%.
- Sample size justification: 4 prompts × 3 models = n = 12 per claim. Prompts cover 4 categories (arithmetic, NLP, code, social). 3 models cover 3 architectures (LLaMA, Qwen/GQA, GPT-2). Exhaustive over the roster. Total: n = 139 (20 suites).
- Effect size: 5/32 = 15.6% ±0 for Int4, 3/32 = 9.4% ±0 for Int8. Cohen's d: large (Int4), medium (Int8).
- PPL bounds: Model-specific ceilings (SmolLM: 20.0, Qwen2: 15.0, GPT-2: 30.0) from published benchmarks, with 2× headroom (σ_headroom ≈ 2× base PPL).
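
The PPL rules above (a per-model ceiling plus an Int4/Int8 gap below 0.5) can be expressed as a small check. This is a sketch only: the ceilings come from the bullet above, while the dictionary keys and function name are hypothetical (`smollm-135m` and `qwen2-0.5b` mirror the `--model` values shown earlier; `gpt2` is a guess).

```python
# Model-specific PPL ceilings from the text; the key spellings are assumptions.
PPL_CEILING = {"smollm-135m": 20.0, "qwen2-0.5b": 15.0, "gpt2": 30.0}
MAX_QUANT_GAP = 0.5


def ppl_failures(model: str, ppl_int4: float, ppl_int8: float) -> list[str]:
    """Return human-readable failures for one model's Int4 and Int8 perplexities."""
    ceiling = PPL_CEILING[model]
    failures = []
    for label, ppl in (("int4", ppl_int4), ("int8", ppl_int8)):
        # 'not ppl <= ceiling' also rejects NaN, which compares False to everything.
        if not ppl <= ceiling:
            failures.append(f"{label} PPL {ppl} exceeds ceiling {ceiling}")
    if not abs(ppl_int4 - ppl_int8) < MAX_QUANT_GAP:
        failures.append(f"Int4/Int8 PPL gap is not below {MAX_QUANT_GAP}")
    return failures
```

Writing the comparisons in negated form is a deliberate choice: a NaN perplexity fails the check instead of slipping through.
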
| Model | HF ID | Parameters | Architecture | License |
|---|---|---|---|---|
| SmolLM-135M | `HuggingFaceTB/SmolLM-135M` | 135M | LLaMA-style (30 layers, 9 heads) | Apache 2.0 |
| Qwen2-0.5B | `Qwen/Qwen2-0.5B` | 500M | Qwen (GQA, 24 layers, 14 heads) | Apache 2.0 |
| GPT-2 | `openai-community/gpt2` | 124M | GPT-2 (12 layers, 12 heads) | MIT |

| Prompt | Category | Text | Purpose |
|---|---|---|---|
| arithmetic | Math | `What is 2+2? Answer:` | Tests numerical token generation |
| completion | NLP | `The capital of France is` | Tests factual continuation |
| code | Programming | `def fibonacci(n):` | Tests code token patterns |
| greeting | Social | `Hello, my name is` | Tests natural language patterns |

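
The per-claim sample size follows from crossing the prompt table with the model roster. A small sketch of that enumeration is shown here; the prompt texts are copied from the table, while the model keys are assumptions (`gpt2` in particular is a guess at the spelling).

```python
from itertools import product

MODELS = ["smollm-135m", "qwen2-0.5b", "gpt2"]  # key spellings are assumptions
PROMPTS = {
    "arithmetic": "What is 2+2? Answer:",
    "completion": "The capital of France is",
    "code": "def fibonacci(n):",
    "greeting": "Hello, my name is",
}

# One case per (model, prompt) pair; iterating a dict yields its keys.
cases = list(product(MODELS, PROMPTS))
print(len(cases))  # 12

# Adding the two quantization levels doubles the grid.
token_cases = list(product(MODELS, PROMPTS, ["int4", "int8"]))
print(len(token_cases))  # 24
```

The 12 and 24 are consistent with the check counts listed for the canary and token suites, though the repo may enumerate its cases differently.
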
Each oracle JSON file contains:
```json
{
  "model": "HuggingFace model ID",
  "prompt": "input text",
  "runtime": "transformers",
  "format": "float32",
  "transformers_version": "5.1.0",
  "torch_version": "2.10.0",
  "tokens": [token_id_1, token_id_2, ...],
  "text": "decoded output text",
  "token_count": 32,
  "max_new_tokens": 32,
  "do_sample": false
}
```

```
Python (uv)                         Parity Checker
───────────                         ──────────────
gen_oracle.py ──► oracle/*.json ◄── parity_check.py
(rare, manual)                             │
                                           ▼
                          subprocess: apr {subcommand} --json
                          23 subcommands: run, eval, inspect,
                          validate, tensors, lint, check, diff,
                          tree, oracle, hex, debug, bench, qa,
                          list, parity, rosetta diff-tensors,
                          pull, import, export, --version
                                           │
                                           ▼
                          compare output vs oracle / MODEL_METADATA
                          generate GitHub issue markdown
```

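
The final step in the diagram, turning failed comparisons into GitHub issue markdown, might look roughly like the sketch below. The `CheckResult` record and the table layout are hypothetical; the actual output of `make ticket` is not described here.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    """One parity check outcome (hypothetical record shape)."""
    suite: str
    model: str
    passed: bool
    detail: str


def render_issue(results: list[CheckResult]) -> str:
    """Render failed checks as a markdown table; return '' when nothing failed."""
    failures = [r for r in results if not r.passed]
    if not failures:
        return ""
    lines = [
        f"## Parity failures ({len(failures)} of {len(results)} checks)",
        "",
        "| Suite | Model | Detail |",
        "|---|---|---|",
    ]
    for r in failures:
        # Escape pipes so free-form detail text cannot break the table.
        detail = r.detail.replace("|", "\\|")
        lines.append(f"| {r.suite} | {r.model} | {detail} |")
    return "\n".join(lines)
```

Returning an empty string when every check passes lets a caller skip filing a ticket without special-casing.
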
- Lock files: `uv.lock` pins all Python dependencies. Rust tools installed via `cargo install --locked`.
- Docker: `Dockerfile` provides fully reproducible environment.
- CI: GitHub Actions runs weekly (see `ci/parity.yml`).
- Determinism: All inference uses greedy decoding. No random seeds needed (ADR-001).
- Versioning: Oracle JSON includes `transformers_version` and `torch_version` for provenance.

- All work happens on `master`; there are no feature branches.
- Commit format: `feat|fix|test|docs: message (Refs TMGT-XXX)`.
- `make test` must pass (330+ tests, <25s) before committing.
- `pmat quality-gate` must report 0 violations.
- File bugs against aprender when parity checks fail.

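
The commit format above lends itself to a simple pattern check, for instance in a local hook. This is a sketch under the assumption that `XXX` stands for a numeric ticket ID; the regex and function name are illustrative and not part of the repo.

```python
import re

# Assumes the XXX in "TMGT-XXX" is one or more digits.
COMMIT_RE = re.compile(r"^(feat|fix|test|docs): .+ \(Refs TMGT-\d+\)$")


def is_valid_commit(subject: str) -> bool:
    """True if a commit subject line follows the documented format."""
    return COMMIT_RE.match(subject) is not None
```

For instance, `fix: handle fused qkv weights (Refs TMGT-012)` would pass, while a subject with a different prefix or without the ticket reference would not.
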
MIT. See LICENSE.