tqllm

TQ3_1S tiered weight quantization for large language models.

Compress 30B+ parameter MoE models to ~3.5 bits per weight using Walsh-Hadamard rotation + Lloyd-Max optimal scalar quantization + dual FP16 half-block scales. Produces GGUF files ready for llama.cpp inference.

What is TQ3_1S?

TQ3_1S implements Google's TurboQuant algorithm (ICLR 2026) for weight quantization:

  1. Walsh-Hadamard rotation — O(d log d) transform that decorrelates weight coordinates, making them approximately iid Gaussian
  2. Lloyd-Max 3-bit quantization — 8 optimal centroids for Gaussian distribution, precomputed (no calibration data needed)
  3. Dual FP16 half-block scales — two scales per 32-element block for fine-grained dynamic range recovery
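The rotation in step 1 is the textbook in-place butterfly. A minimal NumPy sketch of the O(d log d) structure (illustrative only, not the repo's `fwht.py`):

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal Fast Walsh-Hadamard Transform in O(d log d).

    len(x) must be a power of two. The orthonormal transform is its
    own inverse, so applying it twice recovers the input.
    """
    y = x.astype(np.float64).copy()
    d = len(y)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b          # butterfly: sum
            y[i + h:i + 2 * h] = a - b  # butterfly: difference
        h *= 2
    return y / np.sqrt(d)  # orthonormal scaling preserves norms

w = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(fwht(fwht(w)), w)  # involution check
```

Because the transform is orthonormal, quantization error introduced in the rotated domain maps back to the same error norm in the original weights.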

Combined with tiered assignment (experts at 3.5-bit, attention at 4-bit, routers at FP16), this achieves 3.46x compression with >0.99 cosine similarity to original weights.
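Steps 2–3 combine per 32-element block roughly as follows. This is a hedged sketch, not the repo's `quantizer.py`: the codebook entries are the standard 8-level Lloyd-Max centroids for N(0,1) (approximate values), and max-abs matching is just one plausible way to pick the two half-block scales.

```python
import numpy as np

# Approximate 8-level Lloyd-Max centroids for N(0,1) (assumed values).
CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245,
                      0.245,  0.756,  1.344,  2.152])

def encode_block(w: np.ndarray):
    """Quantize one 32-weight block into 3-bit codes + two FP16 scales."""
    assert w.shape == (32,)
    codes = np.empty(32, dtype=np.uint8)
    scales = np.empty(2, dtype=np.float16)
    for h in range(2):                      # one scale per 16-weight half
        half = w[16 * h:16 * (h + 1)]
        s = np.abs(half).max() / np.abs(CODEBOOK).max()
        s = s if s > 0 else 1.0             # guard all-zero halves
        scales[h] = s
        # nearest-centroid assignment after rescaling to codebook range
        codes[16 * h:16 * (h + 1)] = np.abs(
            half[:, None] / s - CODEBOOK[None, :]).argmin(axis=1)
    return codes, scales

def decode_block(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    out = CODEBOOK[codes].astype(np.float32)
    out[:16] *= float(scales[0])
    out[16:] *= float(scales[1])
    return out
```

On Gaussian-looking blocks this round-trips with high cosine similarity, which is exactly what the Hadamard rotation in step 1 is there to guarantee.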

Results: Sarvam 30B

| Metric | BF16 | TQ3_1S | Change |
|---|---|---|---|
| Model size | 64.4 GB | 18.62 GB | 71% smaller |
| Min GPU | 4x A100 | 1x RTX 4090 | 75% fewer GPUs |
| Expert cosine sim | 1.000 | 0.993 | -0.7% |
| Relative MSE | 0% | 2.5% | |

Quantized model: VibeStudio/sarvam-30b-TQ3_1S-GGUF

Installation

pip install -e ".[all]"

Usage

Profile a model

tqllm profile --model sarvamai/sarvam-30b --tier-config tiers.yaml

Quantize

tqllm quantize --model sarvamai/sarvam-30b --output ./sarvam-tq3/ --tier-config tiers.yaml

Export to GGUF

tqllm export --input ./sarvam-tq3/ --output sarvam-30b-tq3.gguf

Validate

tqllm validate --bench expert-sim

Architecture

src/tqllm/
  fwht.py          — Fast Walsh-Hadamard Transform
  codebook.py      — Lloyd-Max codebook for Gaussian N(0,1)
  packing.py       — 3-bit code packing (16 bytes per 32 weights)
  quantizer.py     — TQ3_1S encode/decode
  q4_quantizer.py  — Q4_0 for attention/embedding tiers
  tiered.py        — Orchestrator: param name → quantization tier
  model_loader.py  — Memory-efficient HF safetensors loader
  gguf/            — GGUF v3 writer + reader
  inference/       — TQ3_1SLinear / Q4Linear nn.Module replacements
  eval/            — Perplexity, expert similarity, NIAH benchmarks
  cli.py           — CLI entry point
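The `packing.py` layout can be illustrated with plain integer bit-twiddling. This sketch assumes an LSB-first, 3-bits-per-code layout; the actual byte order in the repo may differ. The arithmetic behind "16 bytes per 32 weights": 32 codes x 3 bits = 12 bytes, plus two FP16 half-block scales = 4 bytes, totaling 16.

```python
import numpy as np

def pack_3bit(codes: np.ndarray) -> bytes:
    """Pack 32 3-bit codes (values 0..7) into 12 bytes, LSB-first."""
    assert codes.shape == (32,) and codes.max() <= 7
    bits = 0
    for i, c in enumerate(codes):
        bits |= int(c) << (3 * i)
    return bits.to_bytes(12, "little")

def unpack_3bit(buf: bytes) -> np.ndarray:
    """Inverse of pack_3bit: recover the 32 codes from 12 bytes."""
    assert len(buf) == 12
    bits = int.from_bytes(buf, "little")
    return np.array([(bits >> (3 * i)) & 0b111 for i in range(32)],
                    dtype=np.uint8)
```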

Tiered Quantization

Different model components have different sensitivity to quantization:

| Component | Why this tier | Bits |
|---|---|---|
| Routed experts (90% of params) | MoE isolation: error in one expert only affects tokens routed to it | 3.5 |
| Shared experts | Small, always-active | 3.5 |
| Attention | Every token passes through; GQA amplifies errors | 4.0 |
| Embeddings + LM head | Non-uniform distribution needs more centroids | 4.0 |
| Routers + norms | Tiny but catastrophic if wrong (discrete routing decisions) | 16.0 |
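The `tiered.py` mapping amounts to first-match rules over parameter names. A sketch with hypothetical regexes (the actual patterns depend on the model's module naming):

```python
import re

# First match wins, so the experts rule shadows `gate` inside an expert.
TIER_RULES = [
    (re.compile(r"\.experts\.\d+\."),     "TQ3_1S"),  # routed experts
    (re.compile(r"shared_expert"),        "TQ3_1S"),
    (re.compile(r"self_attn|attention"),  "Q4_0"),
    (re.compile(r"embed_tokens|lm_head"), "Q4_0"),
    (re.compile(r"gate|router|norm"),     "F16"),     # routers + norms
]

def tier_for(name: str, default: str = "TQ3_1S") -> str:
    """Return the quantization tier for a parameter name."""
    for pattern, tier in TIER_RULES:
        if pattern.search(name):
            return tier
    return default
```

For example, `model.layers.3.mlp.experts.7.down_proj.weight` falls in the 3.5-bit expert tier, while `model.layers.3.mlp.gate.weight` stays at FP16.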

Tests

pytest tests/ -v
# 164 tests, all passing

License

MIT
