MagicQuant

Evolutionary Tensor Search for Optimal LLM GGUF Hybrid Quantization

A Python implementation of the MagicQuant framework — an evolutionary search algorithm that discovers optimal per-group quantization configurations for LLM GGUF files. Instead of applying one quantization scheme globally, MagicQuant assigns different schemes to different tensor groups (embeddings, attention, FFN) based on measured sensitivity, producing models that break the standard size/quality/speed Pareto frontier.

Origin & Credit

This project is an implementation of the MagicQuant methodology created by magiccodingman. The original research, empirical findings, and the "MXFP4 Anomaly" discovery are documented in the MagicQuant Wiki. Published hybrid models are available on HuggingFace.

The core insight from the original research: the vast majority of parameters in a transformer (FFN layers) can tolerate aggressive MXFP4 compression, while a small set of "brain" layers (embeddings, attention output, LM head) need protection at higher precision. This "Carbon Fiber Body, Ferrari Engine" pattern consistently produces models that are smaller than Q4 but retain Q5/Q6 quality.

How It Works

BF16 Source Model
     |
     v
[1. Sensitivity Probing] — Quantize one group at a time, measure PPL impact
     |
     v
[2. Evolutionary Search] — Discover optimal hybrid configs per compression tier
     |                      Protector/Crusher mutations, epsilon-greedy exploration
     v
[3. Tiered Generation]  — Output best Q4, Q5, Q6 hybrid GGUFs
     |                     Each with per-tensor quantization via GGUF writer
     v
  Q4: E:BF16 H:Q8 Q:Q6K K:Q8 O:BF16 U:MXFP4 D:MXFP4  (24 GB from 60 GB)
  Q5: E:BF16 H:BF16 Q:Q8 K:BF16 O:BF16 U:MXFP4 D:MXFP4 (29 GB from 60 GB)
  Q6: E:BF16 H:BF16 Q:Q6K K:Q8 O:BF16 U:BF16 D:Q8       (44 GB from 60 GB)
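The Protector/Crusher mutations in step 2 can be sketched roughly as follows. This is an illustrative toy, not the project's `evolution/survival.py` API: the function names, scheme ordering, and probabilities are assumptions. A "Protector" move bumps one tensor group up a precision level; a "Crusher" move pushes it down; with probability epsilon the search explores a random scheme instead.

```python
import random

# Schemes ordered from highest to lowest precision (illustrative subset)
SCHEMES = ["BF16", "Q8_0", "Q6_K", "Q5_K", "IQ4_NL", "MXFP4"]

def mutate(config, epsilon=0.2):
    """One epsilon-greedy mutation over a {group: scheme} dict."""
    child = dict(config)
    group = random.choice(list(child))
    idx = SCHEMES.index(child[group])
    if random.random() < epsilon:
        # Explore: jump to a random scheme for this group
        child[group] = random.choice(SCHEMES)
    elif random.random() < 0.5 and idx > 0:
        # "Protector": raise this group's precision by one step
        child[group] = SCHEMES[idx - 1]
    elif idx < len(SCHEMES) - 1:
        # "Crusher": lower this group's precision by one step
        child[group] = SCHEMES[idx + 1]
    return child
```

Each generation, candidates produced this way would be scored on predicted loss/size/speed and the best survivors carried forward.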

Tensor Groups

| Group | Role | Typical Sensitivity | Default Scheme |
|-------|------|---------------------|----------------|
| E | Token Embeddings | Very High | BF16 |
| H | LM Head | Very High | BF16 |
| O | Attention Output | High | Q8_0 / BF16 |
| Q | Attention Query | Moderate | Q6_K / IQ4_NL |
| K | Attention Key/Value | Moderate | Q8_0 / Q6_K |
| U | FFN Up/Gate | Low (robust) | MXFP4 |
| D | FFN Down | Low (robust) | MXFP4 |
| X | MoE Experts | Very Low | MXFP4 |
| R | MoE Router | High | Q8_0 |
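Group classification amounts to matching GGUF tensor names against name patterns. The sketch below uses patterns modeled on common llama.cpp tensor names (`token_embd`, `attn_q`, `ffn_down`, ...); the project's `tensor_groups.py` may use different rules, so treat the regexes as assumptions. Order matters: the MoE router (`ffn_gate_inp`) and expert (`_exps`) patterns must be tried before the generic FFN ones they would otherwise shadow.

```python
import re

# Hypothetical name -> group rules; most specific patterns first
GROUP_PATTERNS = [
    (r"token_embd", "E"),           # token embeddings
    (r"^output\.weight$|lm_head", "H"),  # LM head
    (r"attn_output", "O"),          # attention output projection
    (r"attn_q", "Q"),               # attention query
    (r"attn_k|attn_v", "K"),        # attention key/value
    (r"ffn_gate_inp", "R"),         # MoE router (before generic ffn_gate)
    (r"ffn_.*_exps", "X"),          # MoE experts (before generic ffn_*)
    (r"ffn_up|ffn_gate", "U"),      # FFN up/gate
    (r"ffn_down", "D"),             # FFN down
]

def classify(tensor_name):
    """Map a GGUF tensor name to its MagicQuant group letter, or None."""
    for pattern, group in GROUP_PATTERNS:
        if re.search(pattern, tensor_name):
            return group
    return None
```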

Supported Quantization Schemes

| Scheme | Type | bpw | Noise | Best For |
|--------|------|-----|-------|----------|
| BF16 | Float | 16.0 | 0.0 | Brain layers (E, H, O) |
| Q8_0 | Integer | 8.5 | 1.0 | Near-lossless protection |
| Q6_K | K-quant | 6.56 | 2.2 | High-quality attention |
| Q5_K | K-quant | 5.5 | 3.0 | Balanced attention |
| IQ4_NL | Non-linear | 4.5 | 3.8 | Best ~4-bit quality |
| MXFP4 | FP4 (E2M1) | 4.25 | 4.0 | FFN / MoE experts |
| Q4_K_M | K-quant | 4.5 | 4.5 | Fallback integer 4-bit |
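The bpw column lets you estimate a hybrid model's file size before building it: multiply each group's parameter count by the bits-per-weight of its assigned scheme. A minimal sketch (function name and inputs are hypothetical, not the project's predictor API):

```python
# Bits per weight, from the scheme table above
BPW = {"BF16": 16.0, "Q8_0": 8.5, "Q6_K": 6.56, "Q5_K": 5.5,
       "IQ4_NL": 4.5, "MXFP4": 4.25, "Q4_K_M": 4.5}

def estimate_size_gb(param_counts, config):
    """Estimate file size (GB) from per-group parameter counts
    and a {group: scheme} assignment."""
    bits = sum(n * BPW[config[g]] for g, n in param_counts.items())
    return bits / 8 / 1e9
```

For example, 18B FFN parameters at MXFP4 plus 1B embedding parameters at BF16 come to about 11.6 GB, which is how most of a model's bulk can be crushed while the "brain" layers stay at full precision.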

MXFP4 implements the OCP MX Microscaling FP4 format (E2M1 values with shared E8M0 exponent). Its non-uniform quantization levels (0, 0.5, 1, 1.5, 2, 3, 4, 6) are denser near zero, naturally matching the Gaussian-like weight distribution of transformers — producing lower noise than integer Q4 at better compression.
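The block structure described above can be illustrated with a toy NumPy round-trip: pick a shared power-of-two scale for the block, then snap each scaled value to the nearest signed E2M1 level. This is a simplified sketch of the format's idea, assuming nearest-level rounding; ggml's actual encoder in `quant/converters.py` handles packing and edge cases differently.

```python
import numpy as np

# E2M1 representable magnitudes (sign handled separately)
FP4_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize_block(block):
    """Fake-quantize a block of floats to MXFP4:
    one shared power-of-two (E8M0-style) scale + E2M1 values."""
    block = np.asarray(block, dtype=np.float64)
    amax = np.abs(block).max()
    if amax == 0:
        return np.zeros_like(block)
    # Shared scale: smallest power of two so the block max fits within level 6
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    scaled = block / scale
    # Snap each value to the nearest level of matching sign
    candidates = np.sign(scaled)[:, None] * FP4_LEVELS
    idx = np.abs(scaled[:, None] - candidates).argmin(axis=1)
    return np.sign(scaled) * FP4_LEVELS[idx] * scale
```

Values that already sit on a scaled level (e.g. 6, 3, 1.5 with scale 1) survive the round-trip exactly; everything else lands on the nearest level, with finer spacing near zero.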

Installation

git clone https://github.com/lucasmcoleman/MagicQuant.git
cd MagicQuant
pip install -e .

Requires Python 3.9+ and NumPy. Optional: llama.cpp for real perplexity measurement during probing.

Usage

Full Pipeline

# 1. Analyze model structure
magicquant analyze model-bf16.gguf

# 2. Run sensitivity probing (uses heuristics if llama.cpp not available)
magicquant probe model-bf16.gguf --output-dir ./output

# 3. Run evolutionary search
magicquant search model-bf16.gguf --output-dir ./output --generations 50

# 4. Generate best Q4, Q5, Q6 hybrid GGUFs
magicquant generate model-bf16.gguf --output-dir ./output --tiers Q4,Q5,Q6

Manual Hybrid from YAML Config

# config.yaml
model:
  name: Qwen3-30B-A3B
  source: ./Qwen3-30B-A3B-BF16.gguf
quantization:
  base: MXFP4_MOE
  groups:
    E: BF16
    H: BF16
    O: Q8_0
    Q: IQ4_NL
    K: IQ4_NL

magicquant hybrid config.yaml --output-dir ./output

Python API

from magicquant.gguf.writer import create_hybrid_gguf

create_hybrid_gguf(
    output_path="model-hybrid.gguf",
    base_model_path="model-bf16.gguf",
    quant_config={
        "base": "MXFP4_MOE",
        "groups": {
            "E": "BF16",
            "H": "BF16",
            "O": "Q8_0",
            "Q": "IQ4_NL",
            "K": "IQ4_NL",
        }
    }
)

Architecture

magicquant/
  gguf/
    reader.py          — GGUF binary parser
    writer.py          — Hybrid GGUF writer (two-pass streaming)
    tensor_groups.py   — Tensor group classification (E, H, Q, K, O, U, D, X, R)
  quant/
    schemes.py         — Quantization scheme definitions
    converters.py      — Vectorized ggml block encoders (single source of truth)
  evolution/
    probing.py         — Sensitivity measurement (real or heuristic)
    predictor.py       — Loss/size/speed prediction with collapse penalties
    survival.py        — Evolutionary search with Protector/Crusher mutations
  utils/
    llamacpp.py        — llama.cpp integration for perplexity measurement
    naming.py          — Hybrid model filename generation
  orchestrator.py      — Pipeline coordination
  __main__.py          — CLI entry point

Configuration via Environment

All settings can be provided via environment variables with MAGICQUANT_ prefix or a .env file:

| Variable | Default | Description |
|----------|---------|-------------|
| MAGICQUANT_SOURCE_MODEL_PATH | (required) | Path to source model |
| MAGICQUANT_OUTPUT_DIR | ./output | Output directory |
| MAGICQUANT_LLAMACPP_PATH | auto-detect | Path to llama.cpp |
| MAGICQUANT_TARGET_BASE_QUANT | MXFP4_MOE | Base quantization scheme |
| MAGICQUANT_SEARCH_GENERATIONS | 30 | Generations per round |
| MAGICQUANT_POPULATION_SIZE | 80 | Candidates per generation |
| MAGICQUANT_MEASUREMENT_ROUNDS | 3 | Build-measure-learn cycles |
| MAGICQUANT_TIERS | ["Q4","Q5","Q6"] | Compression tiers |
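A `.env` file using these variables might look like the following (the values are example settings, not project defaults beyond those listed in the table):

```shell
# .env — example MagicQuant configuration
MAGICQUANT_SOURCE_MODEL_PATH=./model-bf16.gguf
MAGICQUANT_OUTPUT_DIR=./output
MAGICQUANT_TARGET_BASE_QUANT=MXFP4_MOE
MAGICQUANT_SEARCH_GENERATIONS=50
MAGICQUANT_POPULATION_SIZE=80
```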

Docker

docker build -t magicquant:latest .
docker run --rm -v ./output:/app/output magicquant:latest search /data/model.gguf

Development

pip install -e ".[dev]"
make test    # Run pytest suite
make lint    # Syntax check
make clean   # Remove build artifacts

Known Limitations

  • K-quant encoders use simple min/max with RMSE optimization, not llama.cpp's full importance-matrix-weighted quantization
  • Tokenizer reading only handles BPE (tokenizer.json); SentencePiece (.model) is not supported
  • Source models must be BF16/F16/F32 — pre-quantized sources are rejected with a clear error

License

MIT. The MagicQuant methodology and research are credited to magiccodingman.
