# TQ3_1S: tiered weight quantization for large language models
Compress 30B+ parameter MoE models to ~3.5 bits per weight using Walsh-Hadamard rotation + Lloyd-Max optimal scalar quantization + dual FP16 half-block scales. Produces GGUF files ready for llama.cpp inference.
TQ3_1S implements Google's TurboQuant algorithm (ICLR 2026) for weight quantization:
- Walsh-Hadamard rotation — O(d log d) transform that decorrelates weight coordinates, making them approximately iid Gaussian
- Lloyd-Max 3-bit quantization — 8 optimal centroids for Gaussian distribution, precomputed (no calibration data needed)
- Dual FP16 half-block scales — two scales per 32-element block for fine-grained dynamic range recovery
Combined with tiered assignment (experts at 3.5-bit, attention at 4-bit, routers at FP16), this achieves 3.46x compression with >0.99 cosine similarity to original weights.
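The first two steps can be sketched in numpy. This is a minimal illustration, not the package's API: `fwht` and `quantize3` are hypothetical names, and the 8-level Lloyd-Max centroids for N(0,1) are standard published values quoted to three decimals.

```python
import numpy as np

def fwht(x):
    # Orthonormal Fast Walsh-Hadamard Transform, O(d log d); len(x) must be a power of two.
    # With the 1/sqrt(d) scaling the transform is its own inverse.
    x = x.copy()
    d = len(x)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(d)

# Approximate 8-level (3-bit) Lloyd-Max centroids for a unit Gaussian.
CENTROIDS = np.array([-2.152, -1.344, -0.756, -0.245,
                       0.245,  0.756,  1.344,  2.152])

def quantize3(x):
    # Nearest-centroid assignment -> 3-bit codes in [0, 8).
    return np.abs(x[:, None] - CENTROIDS[None, :]).argmin(axis=1).astype(np.uint8)

w = np.random.randn(32).astype(np.float64)
rotated = fwht(w)                 # decorrelate; coordinates ~ iid Gaussian
codes = quantize3(rotated)        # 3-bit codes
recon = fwht(CENTROIDS[codes])    # dequantize, then invert the rotation
```

Because the rotation is orthonormal, quantization error introduced in the rotated domain carries back to the weight domain with the same norm.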
| Metric | BF16 | TQ3_1S | Change |
|---|---|---|---|
| Model size | 64.4 GB | 18.62 GB | 71% smaller |
| Min GPU | 4x A100 | 1x RTX 4090 | 75% fewer GPUs |
| Expert cosine sim | 1.000 | 0.993 | -0.7% |
| Relative MSE | 0% | 2.5% | — |
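The similarity rows in the table follow standard definitions, sketched below; the package's evaluation code may aggregate them differently (e.g. per expert tensor).

```python
import numpy as np

def cosine_sim(orig, quant):
    # Cosine similarity between flattened original and dequantized weights.
    orig, quant = orig.ravel(), quant.ravel()
    return float(orig @ quant / (np.linalg.norm(orig) * np.linalg.norm(quant)))

def relative_mse(orig, quant):
    # Mean squared error normalized by the original weights' mean power.
    return float(np.mean((orig - quant) ** 2) / np.mean(orig ** 2))

orig = np.random.randn(1024).astype(np.float32)
quant = orig + 0.05 * np.random.randn(1024).astype(np.float32)  # toy 5%-noise stand-in
```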
Quantized model: VibeStudio/sarvam-30b-TQ3_1S-GGUF
```shell
pip install -e ".[all]"

tqllm profile  --model sarvamai/sarvam-30b --tier-config tiers.yaml
tqllm quantize --model sarvamai/sarvam-30b --output ./sarvam-tq3/ --tier-config tiers.yaml
tqllm export   --input ./sarvam-tq3/ --output sarvam-30b-tq3.gguf
tqllm validate --bench expert-sim
```

```
src/tqllm/
├── fwht.py          Fast Walsh-Hadamard Transform
├── codebook.py      Lloyd-Max codebook for Gaussian N(0,1)
├── packing.py       3-bit code packing (16 bytes per 32 weights)
├── quantizer.py     TQ3_1S encode/decode
├── q4_quantizer.py  Q4_0 for attention/embedding tiers
├── tiered.py        Orchestrator: param name → quantization tier
├── model_loader.py  Memory-efficient HF safetensors loader
├── gguf/            GGUF v3 writer + reader
├── inference/       TQ3_1SLinear / Q4Linear nn.Module replacements
├── eval/            Perplexity, expert similarity, NIAH benchmarks
└── cli.py           CLI entry point
```
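The 16-bytes-per-32-weights layout handled by `packing.py` works out to 12 bytes of 3-bit codes (32 × 3 = 96 bits) plus two FP16 half-block scales (4 bytes). A sketch, using hypothetical helper names rather than the package's actual functions:

```python
import numpy as np

def pack_block(codes, scale_lo, scale_hi):
    # 32 codes x 3 bits = 96 bits = 12 bytes, packed little-endian.
    bits = 0
    for i, c in enumerate(codes):
        bits |= int(c) << (3 * i)
    packed = bits.to_bytes(12, "little")
    # Two FP16 scales, one per 16-element half block -> 4 more bytes.
    scales = np.array([scale_lo, scale_hi], dtype=np.float16).tobytes()
    return packed + scales  # 16 bytes per 32-weight block

def unpack_codes(block):
    # Recover the 32 3-bit codes from the first 12 bytes.
    bits = int.from_bytes(block[:12], "little")
    return np.array([(bits >> (3 * i)) & 0x7 for i in range(32)], dtype=np.uint8)

codes = np.arange(32, dtype=np.uint8) % 8
block = pack_block(codes, 0.021, 0.019)
```

This is the arithmetic behind the 3.5-bit figure: 16 bytes / 32 weights = 4 bits per weight at the block level, with the effective expert-tier average landing near 3.5 once shared metadata is amortized.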
Different model components have different sensitivity to quantization:
| Component | Why this tier | Bits |
|---|---|---|
| Routed experts (90% of params) | MoE isolation: error in one expert only affects tokens routed to it | 3.5 |
| Shared experts | Small, always-active | 3.5 |
| Attention | Every token passes through; GQA amplifies errors | 4.0 |
| Embeddings + LM head | Non-uniform distribution needs more centroids | 4.0 |
| Routers + norms | Tiny but catastrophic if wrong (discrete routing decisions) | 16.0 |
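The name-to-tier dispatch in `tiered.py` can be sketched with ordered pattern rules; the regexes and tier labels below are illustrative assumptions, not the package's actual configuration. Rule order matters: routers and norms must match before the broad expert rule.

```python
import re

# Hypothetical rules mirroring the tier table above, checked in order.
TIER_RULES = [
    (r"router|norm", "fp16"),        # tiny but catastrophic if wrong
    (r"embed|lm_head", "q4_0"),      # non-uniform distributions
    (r"self_attn", "q4_0"),          # every token passes through
    (r"experts?\.", "tq3_1s"),       # routed + shared experts, 90% of params
]

def tier_for(name):
    for pattern, tier in TIER_RULES:
        if re.search(pattern, name):
            return tier
    return "tq3_1s"  # default to the cheapest tier

print(tier_for("model.layers.0.mlp.experts.3.up_proj.weight"))  # tq3_1s
print(tier_for("model.layers.0.self_attn.q_proj.weight"))       # q4_0
print(tier_for("model.layers.0.mlp.router.weight"))             # fp16
```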
```shell
pytest tests/ -v
# 164 tests, all passing
```

- TurboQuant (ICLR 2026) — Walsh-Hadamard rotation + Lloyd-Max codebooks
- Sarvam 30B — Base model (Apache 2.0)
MIT