tqllm

TQ3_1S tiered weight quantization for large language models.

Compress 30B+ parameter MoE models to ~3.5 bits per weight using Walsh-Hadamard rotation + Lloyd-Max optimal scalar quantization + dual FP16 half-block scales. Produces GGUF files ready for llama.cpp inference.

What is TQ3_1S?

TQ3_1S implements Google's TurboQuant algorithm (ICLR 2026) for weight quantization:

  1. Walsh-Hadamard rotation — O(d log d) transform that decorrelates weight coordinates, making them approximately iid Gaussian
  2. Lloyd-Max 3-bit quantization — 8 optimal centroids for Gaussian distribution, precomputed (no calibration data needed)
  3. Dual FP16 half-block scales — two scales per 32-element block for fine-grained dynamic range recovery
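The rotation in step 1 is the textbook in-place butterfly. A minimal NumPy sketch of the O(d log d) structure (illustrative only, not the repo's `fwht.py`):

```python
import numpy as np

def fwht(x: np.ndarray) -> np.ndarray:
    """Orthonormal Fast Walsh-Hadamard Transform in O(d log d).

    len(x) must be a power of two. The orthonormal transform is its
    own inverse, so applying it twice recovers the input.
    """
    y = x.astype(np.float64).copy()
    d = len(y)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = y[i:i + h].copy()
            b = y[i + h:i + 2 * h].copy()
            y[i:i + h] = a + b          # butterfly: sum
            y[i + h:i + 2 * h] = a - b  # butterfly: difference
        h *= 2
    return y / np.sqrt(d)  # orthonormal scaling preserves norms

w = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(fwht(fwht(w)), w)  # involution check
```

Because the transform is orthonormal, quantization error introduced in the rotated domain maps back to the same error norm in the original weights.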

Combined with tiered assignment (experts at 3.5-bit, attention at 4-bit, routers at FP16), this achieves 3.46x compression with >0.99 cosine similarity to original weights.
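Steps 2–3 combine per 32-element block roughly as follows. This is a hedged sketch, not the repo's `quantizer.py`: the codebook entries are the standard 8-level Lloyd-Max centroids for N(0,1) (approximate values), and max-abs matching is just one plausible way to pick the two half-block scales.

```python
import numpy as np

# Approximate 8-level Lloyd-Max centroids for N(0,1) (assumed values).
CODEBOOK = np.array([-2.152, -1.344, -0.756, -0.245,
                      0.245,  0.756,  1.344,  2.152])

def encode_block(w: np.ndarray):
    """Quantize one 32-weight block into 3-bit codes + two FP16 scales."""
    assert w.shape == (32,)
    codes = np.empty(32, dtype=np.uint8)
    scales = np.empty(2, dtype=np.float16)
    for h in range(2):                      # one scale per 16-weight half
        half = w[16 * h:16 * (h + 1)]
        s = np.abs(half).max() / np.abs(CODEBOOK).max()
        s = s if s > 0 else 1.0             # guard all-zero halves
        scales[h] = s
        # nearest-centroid assignment after rescaling to codebook range
        codes[16 * h:16 * (h + 1)] = np.abs(
            half[:, None] / s - CODEBOOK[None, :]).argmin(axis=1)
    return codes, scales

def decode_block(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    out = CODEBOOK[codes].astype(np.float32)
    out[:16] *= float(scales[0])
    out[16:] *= float(scales[1])
    return out
```

On Gaussian-looking blocks this round-trips with high cosine similarity, which is exactly what the Hadamard rotation in step 1 is there to guarantee.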

Results: Sarvam 30B

| Metric | BF16 | TQ3_1S | Change |
|---|---|---|---|
| Model size | 64.4 GB | 18.62 GB | 71% smaller |
| Min GPU | 4x A100 | 1x RTX 4090 | 75% fewer GPUs |
| Expert cosine sim | 1.000 | 0.993 | -0.7% |
| Relative MSE | 0% | 2.5% | |

Quantized model: VibeStudio/sarvam-30b-TQ3_1S-GGUF

Installation

pip install -e ".[all]"

Usage

Profile a model

tqllm profile --model sarvamai/sarvam-30b --tier-config tiers.yaml

Quantize

tqllm quantize --model sarvamai/sarvam-30b --output ./sarvam-tq3/ --tier-config tiers.yaml

Export to GGUF

tqllm export --input ./sarvam-tq3/ --output sarvam-30b-tq3.gguf

Validate

tqllm validate --bench expert-sim

Architecture

src/tqllm/
  fwht.py          — Fast Walsh-Hadamard Transform
  codebook.py      — Lloyd-Max codebook for Gaussian N(0,1)
  packing.py       — 3-bit code packing (16 bytes per 32 weights)
  quantizer.py     — TQ3_1S encode/decode
  q4_quantizer.py  — Q4_0 for attention/embedding tiers
  tiered.py        — Orchestrator: param name → quantization tier
  model_loader.py  — Memory-efficient HF safetensors loader
  gguf/            — GGUF v3 writer + reader
  inference/       — TQ3_1SLinear / Q4Linear nn.Module replacements
  eval/            — Perplexity, expert similarity, NIAH benchmarks
  cli.py           — CLI entry point
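The `packing.py` layout can be illustrated with plain integer bit-twiddling. This sketch assumes an LSB-first, 3-bits-per-code layout; the actual byte order in the repo may differ. The arithmetic behind "16 bytes per 32 weights": 32 codes x 3 bits = 12 bytes, plus two FP16 half-block scales = 4 bytes, totaling 16.

```python
import numpy as np

def pack_3bit(codes: np.ndarray) -> bytes:
    """Pack 32 3-bit codes (values 0..7) into 12 bytes, LSB-first."""
    assert codes.shape == (32,) and codes.max() <= 7
    bits = 0
    for i, c in enumerate(codes):
        bits |= int(c) << (3 * i)
    return bits.to_bytes(12, "little")

def unpack_3bit(buf: bytes) -> np.ndarray:
    """Inverse of pack_3bit: recover the 32 codes from 12 bytes."""
    assert len(buf) == 12
    bits = int.from_bytes(buf, "little")
    return np.array([(bits >> (3 * i)) & 0b111 for i in range(32)],
                    dtype=np.uint8)
```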

Tiered Quantization

Different model components have different sensitivity to quantization:

| Component | Why this tier | Bits |
|---|---|---|
| Routed experts (90% of params) | MoE isolation: error in one expert only affects tokens routed to it | 3.5 |
| Shared experts | Small, always-active | 3.5 |
| Attention | Every token passes through; GQA amplifies errors | 4.0 |
| Embeddings + LM head | Non-uniform distribution needs more centroids | 4.0 |
| Routers + norms | Tiny but catastrophic if wrong (discrete routing decisions) | 16.0 |
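The `tiered.py` mapping amounts to first-match rules over parameter names. A sketch with hypothetical regexes (the actual patterns depend on the model's module naming):

```python
import re

# First match wins, so the experts rule shadows `gate` inside an expert.
TIER_RULES = [
    (re.compile(r"\.experts\.\d+\."),     "TQ3_1S"),  # routed experts
    (re.compile(r"shared_expert"),        "TQ3_1S"),
    (re.compile(r"self_attn|attention"),  "Q4_0"),
    (re.compile(r"embed_tokens|lm_head"), "Q4_0"),
    (re.compile(r"gate|router|norm"),     "F16"),     # routers + norms
]

def tier_for(name: str, default: str = "TQ3_1S") -> str:
    """Return the quantization tier for a parameter name."""
    for pattern, tier in TIER_RULES:
        if pattern.search(name):
            return tier
    return default
```

For example, `model.layers.3.mlp.experts.7.down_proj.weight` falls in the 3.5-bit expert tier, while `model.layers.3.mlp.gate.weight` stays at FP16.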

Tests

pytest tests/ -v
# 164 tests, all passing

License

MIT
