tiny-model-ground-truth

Thesis: Given a tiny model from HuggingFace, every format conversion and runtime engine in the Sovereign AI Stack must produce token-identical greedy outputs (or bounded quantization drift). A single failure proves a bug.

Current status: 0/139 checks passing, although inference now runs for SmolLM and Qwen2. The harness exercises 23 of 40+ apr subcommands across 20 check suites. Twelve issues have been filed against aprender/realizar; see Filed Issues.

See CLAIMS.md for pre-registered falsifiable claims and design rationale (ADRs).

Installation

Prerequisites

Hardware: Any x86_64 or ARM64 machine with ≥4GB RAM. CPU-only (no GPU required).
Software: apr (v0.2.16+), uv (v0.5+), Python 3.11+
Disk: ~5GB for models directory

Setup

# Docker (fully reproducible)
docker build -t tmgt . && docker run tmgt

# Native (requires Rust toolchain)
bash ci/setup.sh   # Install apr
uv sync            # Install Python deps (locked via uv.lock)

Usage

make pull      # Download 3 tiny models (~1.5GB)
make convert   # Import to APR (Int4/Int8) + export GGUF
make check     # Run all parity checks (actually invokes apr inference)
make ticket    # Generate GitHub issue markdown for failures

Run a single check suite or model:

uv run python scripts/parity_check.py --check canary
uv run python scripts/parity_check.py --model smollm-135m
uv run python scripts/parity_check.py --check token --model qwen2-0.5b

Run tests:

make test          # Unit + property tests (no apr/model deps)
make test-parity   # Integration parity tests (requires apr + models)
make coverage      # Coverage report

Parity Matrix (Current Results)

Model	APR Int8	APR Int4	GGUF Roundtrip	PPL
SmolLM-135M	RUNS but --json broken (#240)	RUNS garbled output	BLOCKED (#240)	BLOCKED (#242)
Qwen2-0.5B	RUNS "Paris" correct (#240)	RUNS "Paris" correct	BLOCKED (#240)	BLOCKED (#242)
GPT-2 124M	FAIL qkv_weight zeros (#241)	FAIL qkv_weight zeros	FAIL (#241)	BLOCKED (#242)

Current blockers:

apr run --json outputs human-readable text instead of JSON (#240) — blocks programmatic checks
GPT-2 fused qkv_weight not dequantized (#241) — blocks all GPT-2 inference
apr eval rejects APR format, GGUF fused_matmul type error (#242) — blocks PPL checks
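Blocker #240 means a checker cannot assume that stdout is JSON. A minimal sketch of a defensive parser follows; the function name and the sample payload fields (`text`, `tokens`) are assumptions for illustration, not the actual output schema of apr.

```python
import json


def parse_json_stdout(stdout: str):
    """Return the decoded JSON object, or None if stdout is not a JSON object."""
    try:
        payload = json.loads(stdout)
    except json.JSONDecodeError:
        # Human-readable text (the symptom described in #240) lands here.
        return None
    # A bare string or number is valid JSON but not a usable result record.
    return payload if isinstance(payload, dict) else None


# Hypothetical outputs, for illustration only.
good = parse_json_stdout('{"text": "Paris", "tokens": [1, 2]}')
bad = parse_json_stdout("The capital of France is Paris")
print(good, bad)
```

Returning None instead of raising lets the caller record the check as blocked rather than crash the whole suite.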

Filed Issues

Issue	Bug	Status
#231	Int8 embedding NaN/Inf + shape mismatch	Fixed
#232	Int4 all-zero embedding tensors	Fixed
#233	GPT-2 tensor name mapping + wpe density	Fixed
#234	lm_head not excluded from quantization	Fixed
#235	GPT-2 hidden_dim=64 instead of 768	Fixed
#236	GPT-2 GGUF export hidden_dim=0	Fixed
#237	Quantization write pipeline broken for all tensors	Fixed
#239	realizar loader reads Q8/Q4 bytes as f32	Fixed
#240	`apr run --json` flag ignored	Open — blocks programmatic checks
#241	GPT-2 fused qkv_weight not dequantized	Open — blocks GPT-2
#242	`apr eval` rejects APR + GGUF type error	Open — blocks PPL

Check Suites

Inference Checks (blocked by #240 JSON output)

Suite	Checks	What it tests
`check-canary`	12	Golden output regression — Int8 text must exactly match oracle
`check-token`	24	Int4/Int8 token mismatch bounds (≤5/32 and ≤3/32)
`check-drift`	12	Int8 mismatches ≤ Int4 mismatches + 1
`check-roundtrip`	6	APR → GGUF → reimport produces identical tokens
`check-ppl`	9	PPL within model-specific bounds, Int4/Int8 diff < 0.5

Model Structure Checks (independent of inference)

Suite	Checks	What it tests
`check-inspect`	6	Metadata matches expected arch params (Claim 9)
`check-validate`	6	Magic/header/version integrity (Claim 10)
`check-tensors`	6	Tensor count and dtypes match quant level (Claim 11)
`check-lint`	6	No critical lint violations (Claim 12)
`check-selftest`	6	≥7/10 `apr check` pipeline stages pass (Claim 13)
`check-diff`	3	Int4 vs Int8 differ only in dtype (Claim 14)
`check-tree`	6	Layer/tensor count matches expected (Claim 15)
`check-oracle-id`	6	`apr oracle` identifies correct architecture (Claim 16)
`check-hex`	6	Tensor statistics non-zero (Claim 17)
`check-debug`	6	`apr debug` reports health=OK (Claim 18)
`check-bench`	6	`apr bench` produces >0 tok/s (Claim 19, expected fail)
`check-qa`	6	≥3 QA gates execute without critical failure (Claim 20)

Global Checks

Suite	Checks	What it tests
`check-list`	1	`apr list` cache inventory succeeds
`check-rosetta-diff`	3	No layout mismatches between Int8/Int4
`check-parity-gpu`	3	GPU/CPU produce identical results (GGUF)

Total: 139 checks across 20 suites
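The totals can be re-derived from the three suite tables above. The snippet below simply transcribes the per-suite check counts and sums them; the dictionary layout is illustrative only.

```python
# Check counts transcribed from the three suite tables above.
INFERENCE = {
    "check-canary": 12, "check-token": 24, "check-drift": 12,
    "check-roundtrip": 6, "check-ppl": 9,
}
STRUCTURE = {
    "check-inspect": 6, "check-validate": 6, "check-tensors": 6,
    "check-lint": 6, "check-selftest": 6, "check-diff": 3,
    "check-tree": 6, "check-oracle-id": 6, "check-hex": 6,
    "check-debug": 6, "check-bench": 6, "check-qa": 6,
}
GLOBAL = {"check-list": 1, "check-rosetta-diff": 3, "check-parity-gpu": 3}

all_suites = {**INFERENCE, **STRUCTURE, **GLOBAL}
suite_count = len(all_suites)
total_checks = sum(all_suites.values())

print(f"{total_checks} checks across {suite_count} suites")
```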

Subcommand Coverage

23 of 40+ apr subcommands exercised:

Category	Subcommands
Pipeline	`pull`, `import`, `export`, `run`, `eval`
Inspection	`inspect`, `validate`, `tensors`, `lint`, `tree`, `hex`, `debug`
Analysis	`check`, `diff`, `oracle`, `bench`, `qa`, `parity`
Rosetta	`rosetta diff-tensors`
Utility	`list`, `--version`

Deferred (broken or inapplicable): compare-hf (ZSTD error), flow (ZSTD error), canary (audio-only), serve (server), tui/chat/cbtop (interactive), rm/publish (destructive), rosetta fingerprint (Q8 bug).

Methodology

This repo uses Popperian falsification: we attempt to disprove parity rather than prove it. Each test encodes a specific falsifiable prediction. A single failure constitutes evidence of a bug in the format conversion or runtime engine.

Oracle Generation

Source: HuggingFace transformers (v5.1.0) with PyTorch (v2.10.0)
Precision: float32, CPU-only, do_sample=False (deterministic greedy)
Random seeds: Not applicable — greedy decoding is fully deterministic (see ADR-001)
Max tokens: 32 per prompt
Output: JSON with token IDs, decoded text, model metadata
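The settings above could be realized roughly as follows. This is a sketch, not the contents of gen_oracle.py: the function name is hypothetical, and it assumes the oracle stores only the newly generated tokens rather than the prompt tokens.

```python
GENERATION_KWARGS = {"max_new_tokens": 32, "do_sample": False}


def generate_oracle(model_id: str, prompt: str) -> dict:
    """Greedy-decode one prompt on CPU in float32 and return an oracle record."""
    # Imported lazily so this module loads without the heavy dependencies.
    import torch
    import transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # .float() forces float32 regardless of the checkpoint's stored dtype.
    model = AutoModelForCausalLM.from_pretrained(model_id).float().eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, **GENERATION_KWARGS)

    # Assumption: keep only the tokens generated after the prompt.
    prompt_len = inputs["input_ids"].shape[1]
    new_tokens = output[0, prompt_len:].tolist()

    return {
        "model": model_id,
        "prompt": prompt,
        "runtime": "transformers",
        "format": "float32",
        "transformers_version": transformers.__version__,
        "torch_version": torch.__version__,
        "tokens": new_tokens,
        "text": tokenizer.decode(new_tokens),
        "token_count": len(new_tokens),
        **GENERATION_KWARGS,
    }
```

A call such as `generate_oracle("HuggingFaceTB/SmolLM-135M", "The capital of France is")` would download the model on first use, so it is left out of the block itself.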

Tolerance Thresholds

Comparison	Threshold	Effect Size	n
Int4 vs oracle	≤5/32 mismatches	15.6%	12
Int8 vs oracle	≤3/32 mismatches	9.4%	12
Int8 vs Int4 drift	Int8 ≤ Int4+1	≤1 token	12
Cross-runtime	Exact text match	0%	12
PPL Int4 vs Int8	Diff < 0.5	<0.5 PPL	3
Canary (text)	Exact text match	0%	12

All checks are deterministic under greedy decoding, so each result is exact rather than an interval estimate (reported as CI = 100%).
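The token thresholds in the table can be expressed as a small comparison routine. This is a sketch under one stated assumption: a difference in sequence length counts as one mismatch per missing or extra token. The function names are illustrative and may not match parity_check.py.

```python
INT4_MAX_MISMATCHES = 5  # out of 32 tokens
INT8_MAX_MISMATCHES = 3  # out of 32 tokens


def count_mismatches(candidate, oracle):
    """Count positions where token IDs differ.

    Assumption: each missing or extra token also counts as one mismatch.
    """
    paired = sum(1 for a, b in zip(candidate, oracle) if a != b)
    return paired + abs(len(candidate) - len(oracle))


def check_token_bounds(int4_tokens, int8_tokens, oracle_tokens):
    """Evaluate the Int4, Int8, and drift rules for one (model, prompt) pair."""
    m4 = count_mismatches(int4_tokens, oracle_tokens)
    m8 = count_mismatches(int8_tokens, oracle_tokens)
    return {
        "int4_within_bound": m4 <= INT4_MAX_MISMATCHES,
        "int8_within_bound": m8 <= INT8_MAX_MISMATCHES,
        "drift_ok": m8 <= m4 + 1,  # Int8 may be at most one token worse than Int4
    }


# Synthetic token IDs, for illustration only.
oracle = list(range(32))
int8 = oracle[:31] + [999]      # 1 mismatch
int4 = [999] * 4 + oracle[4:]   # 4 mismatches
print(check_token_bounds(int4, int8, oracle))
```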

Statistical Notes

Total sample size: n = 139 parity checks (exhaustive cross-product across 20 suites), plus pytest tests including property-based tests with n = 100 iterations via hypothesis.
Standard deviation: σ = 0 for all parity checks. Greedy decoding (temperature=0) is fully deterministic. Outputs are bit-for-bit identical across runs. Uncertainty is ±0.
Confidence interval: [exact, exact] for all checks. 95% and 99% CIs are not applicable (variance = 0). The CI is trivially 100%.
Sample size justification: 4 prompts × 3 models = n = 12 per claim. Prompts cover 4 categories (arithmetic, NLP, code, social). 3 models cover 3 architectures (LLaMA, Qwen/GQA, GPT-2). Exhaustive over the roster. Total: n = 139 (20 suites).
Effect size: the maximum tolerated mismatch rate is 5/32 = 15.6% ±0 for Int4 and 3/32 = 9.4% ±0 for Int8. Cohen's d: large (Int4), medium (Int8).
PPL bounds: Model-specific ceilings (SmolLM: 20.0, Qwen2: 15.0, GPT-2: 30.0) from published benchmarks, with 2× headroom (σ_headroom ≈ 2× base PPL).
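The arithmetic in these notes can be reproduced directly. In the sketch below the `gpt2` slug and the helper names are assumptions, the PPL inputs in the usage note are hypothetical, and the ceiling is assumed to be inclusive.

```python
from itertools import product

PROMPTS = ["arithmetic", "completion", "code", "greeting"]
MODELS = ["smollm-135m", "qwen2-0.5b", "gpt2"]  # "gpt2" slug is an assumption

# Exhaustive roster: every model paired with every prompt.
CELLS = list(product(MODELS, PROMPTS))

PPL_CEILING = {"smollm-135m": 20.0, "qwen2-0.5b": 15.0, "gpt2": 30.0}
PPL_MAX_DIFF = 0.5


def mismatch_rate_pct(mismatches, total=32):
    """Mismatch threshold expressed as a percentage of generated tokens."""
    return round(100 * mismatches / total, 1)


def ppl_ok(model, ppl_int4, ppl_int8):
    """Both perplexities under the model ceiling, and within 0.5 of each other."""
    ceiling = PPL_CEILING[model]
    under_ceiling = ppl_int4 <= ceiling and ppl_int8 <= ceiling
    return under_ceiling and abs(ppl_int4 - ppl_int8) < PPL_MAX_DIFF


print(len(CELLS), mismatch_rate_pct(5), mismatch_rate_pct(3))
```

For example, `ppl_ok("qwen2-0.5b", 12.1, 11.9)` passes, while `ppl_ok("smollm-135m", 19.0, 18.4)` fails on the 0.5 difference rule even though both values sit under the ceiling.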

Dataset Documentation

Models

Model	HF ID	Parameters	Architecture	License
SmolLM-135M	`HuggingFaceTB/SmolLM-135M`	135M	LLaMA-style (30 layers, 9 heads)	Apache 2.0
Qwen2-0.5B	`Qwen/Qwen2-0.5B`	500M	Qwen (GQA, 24 layers, 14 heads)	Apache 2.0
GPT-2	`openai-community/gpt2`	124M	GPT-2 (12 layers, 12 heads)	MIT

Prompts

Prompt	Category	Text	Purpose
arithmetic	Math	`What is 2+2? Answer:`	Tests numerical token generation
completion	NLP	`The capital of France is`	Tests factual continuation
code	Programming	`def fibonacci(n):`	Tests code token patterns
greeting	Social	`Hello, my name is`	Tests natural language patterns

Oracle Format

Each oracle JSON file contains:

{
  "model": "HuggingFace model ID",
  "prompt": "input text",
  "runtime": "transformers",
  "format": "float32",
  "transformers_version": "5.1.0",
  "torch_version": "2.10.0",
  "tokens": [token_id_1, token_id_2, ...],
  "text": "decoded output text",
  "token_count": 32,
  "max_new_tokens": 32,
  "do_sample": false
}
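A loader might sanity-check each oracle file against this layout before using it. The sketch below is one possible validator; the consistency rules (token_count equals len(tokens), token_count does not exceed max_new_tokens) are assumptions inferred from the field names, and the sample record uses made-up token IDs.

```python
# Field names and JSON types taken from the layout above.
REQUIRED_FIELDS = {
    "model": str, "prompt": str, "runtime": str, "format": str,
    "transformers_version": str, "torch_version": str,
    "tokens": list, "text": str, "token_count": int,
    "max_new_tokens": int, "do_sample": bool,
}


def validate_oracle(record: dict) -> list:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    for key, expected in REQUIRED_FIELDS.items():
        if key not in record:
            problems.append(f"missing field: {key}")
        elif not isinstance(record[key], expected):
            problems.append(f"{key}: expected {expected.__name__}")
    if problems:
        return problems

    # Assumed consistency rules, inferred from the field names.
    if record["token_count"] != len(record["tokens"]):
        problems.append("token_count does not match len(tokens)")
    if record["token_count"] > record["max_new_tokens"]:
        problems.append("token_count exceeds max_new_tokens")
    if record["do_sample"]:
        problems.append("do_sample must be false for greedy decoding")
    return problems


# Made-up record for illustration; token IDs are not real model output.
sample = {
    "model": "Qwen/Qwen2-0.5B", "prompt": "The capital of France is",
    "runtime": "transformers", "format": "float32",
    "transformers_version": "5.1.0", "torch_version": "2.10.0",
    "tokens": [1, 2, 3], "text": "...", "token_count": 3,
    "max_new_tokens": 32, "do_sample": False,
}
print(validate_oracle(sample))
```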

Architecture

Python (uv)                    Parity Checker
───────────                    ──────────────
gen_oracle.py ──► oracle/*.json ◄── parity_check.py
  (rare, manual)                      │
                                      ▼
                              subprocess: apr {subcommand} --json
                              23 subcommands: run, eval, inspect,
                              validate, tensors, lint, check, diff,
                              tree, oracle, hex, debug, bench, qa,
                              list, parity, rosetta diff-tensors,
                              pull, import, export, --version
                                      │
                                      ▼
                              compare output vs oracle / MODEL_METADATA
                              generate GitHub issue markdown
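The last stage of the diagram turns failed comparisons into GitHub issue markdown. A minimal sketch of that step is shown below; the `Failure` record and the table layout are hypothetical and are not taken from the repository's ticket generator.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    """One failed check (hypothetical record shape)."""
    suite: str
    model: str
    detail: str


def render_ticket(failures):
    """Render failed checks as a markdown table suitable for an issue body."""
    lines = [
        f"## Parity failures ({len(failures)})",
        "",
        "| Suite | Model | Detail |",
        "| --- | --- | --- |",
    ]
    for failure in failures:
        # Escape pipes so free-text details cannot break the table.
        detail = failure.detail.replace("|", "\\|")
        lines.append(f"| {failure.suite} | {failure.model} | {detail} |")
    return "\n".join(lines)


# Illustrative failure, not a recorded result.
ticket = render_ticket(
    [Failure("check-canary", "smollm-135m", "text differs from oracle")]
)
print(ticket)
```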

Reproducibility

Lock files: uv.lock pins all Python dependencies. Rust tools are installed via cargo install --locked.
Docker: The Dockerfile provides a fully reproducible environment.
CI: GitHub Actions runs weekly (see ci/parity.yml).
Determinism: All inference uses greedy decoding. No random seeds needed (ADR-001).
Versioning: Oracle JSON includes transformers_version and torch_version for provenance.

Contributing

All work happens on master — no feature branches.
Commit format: feat|fix|test|docs: message (Refs TMGT-XXX).
make test must pass (330+ tests, <25s) before committing.
pmat quality-gate must report 0 violations.
File bugs against aprender when parity checks fail.
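The commit format above can be checked mechanically. The regular expression below is a sketch that assumes XXX stands for a numeric ticket ID; the repository's own commitlint configuration is the authority on what is accepted.

```python
import re

# Assumption: "XXX" in "TMGT-XXX" is one or more digits.
COMMIT_RE = re.compile(r"^(feat|fix|test|docs): .+ \(Refs TMGT-\d+\)$")


def is_valid_commit_subject(subject: str) -> bool:
    """True if the subject line follows 'type: message (Refs TMGT-NNN)'."""
    return COMMIT_RE.match(subject) is not None


# Illustrative subject lines, not real commits.
print(is_valid_commit_subject("fix: handle non-JSON stdout (Refs TMGT-042)"))
print(is_valid_commit_subject("chore: bump deps"))
```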

License

MIT. See LICENSE.
