Evolutionary Tensor Search for Optimal LLM GGUF Hybrid Quantization
A Python implementation of the MagicQuant framework — an evolutionary search algorithm that discovers optimal per-group quantization configurations for LLM GGUF files. Instead of applying one quantization scheme globally, MagicQuant assigns different schemes to different tensor groups (embeddings, attention, FFN) based on measured sensitivity, producing models that break the standard size/quality/speed Pareto frontier.
This project is an implementation of the MagicQuant methodology created by magiccodingman. The original research, empirical findings, and the "MXFP4 Anomaly" discovery are documented in the MagicQuant Wiki. Published hybrid models are available on HuggingFace:
- Magic Quant Collection — Verified best-of-the-best hybrid quants
- MXFP4 Hybrid GGUF Collection — Experimental MXFP4 hybrids
The core insight from the original research: the vast majority of parameters in a transformer (FFN layers) can tolerate aggressive MXFP4 compression, while a small set of "brain" layers (embeddings, attention output, LM head) need protection at higher precision. This "Carbon Fiber Body, Ferrari Engine" pattern consistently produces models that are smaller than Q4 but retain Q5/Q6 quality.
```
BF16 Source Model
        |
        v
[1. Sensitivity Probing] — Quantize one group at a time, measure PPL impact
        |
        v
[2. Evolutionary Search] — Discover optimal hybrid configs per compression tier
        |   Protector/Crusher mutations, epsilon-greedy exploration
        v
[3. Tiered Generation] — Output best Q4, Q5, Q6 hybrid GGUFs
        |   Each with per-tensor quantization via GGUF writer
        v
Q4: E:BF16 H:Q8   Q:Q6K K:Q8   O:BF16 U:MXFP4 D:MXFP4   (24 GB from 60 GB)
Q5: E:BF16 H:BF16 Q:Q8  K:BF16 O:BF16 U:MXFP4 D:MXFP4   (29 GB from 60 GB)
Q6: E:BF16 H:BF16 Q:Q6K K:Q8   O:BF16 U:BF16  D:Q8      (44 GB from 60 GB)
```
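The Protector/Crusher mutation step in stage 2 can be sketched as follows. This is a minimal illustration assuming a simple precision ladder; `mutate`, `LADDER`, and the epsilon value here are illustrative, not the project's actual API (which lives in `evolution/survival.py`):

```python
import random

# Precision ladder from most aggressive to most protective scheme.
LADDER = ["MXFP4", "IQ4_NL", "Q5_K", "Q6_K", "Q8_0", "BF16"]

def mutate(config, epsilon=0.2):
    """One mutation step: with probability epsilon, explore a random
    scheme for a random group; otherwise apply a Protector move (one
    rung up the ladder) or a Crusher move (one rung down)."""
    child = dict(config)
    group = random.choice(sorted(child))
    if random.random() < epsilon:
        child[group] = random.choice(LADDER)  # epsilon-greedy exploration
    elif random.random() < 0.5:
        idx = LADDER.index(child[group])      # Protector: raise precision
        child[group] = LADDER[min(idx + 1, len(LADDER) - 1)]
    else:
        idx = LADDER.index(child[group])      # Crusher: lower precision
        child[group] = LADDER[max(idx - 1, 0)]
    return child

seed = {"E": "BF16", "Q": "Q6_K", "K": "Q8_0", "U": "MXFP4", "D": "MXFP4"}
child = mutate(seed)
```

Each mutation touches exactly one group, so the search walks the configuration space one precision rung at a time, with occasional random jumps to escape local optima.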
MagicQuant classifies every tensor into one of nine groups:

| Group | Role | Typical Sensitivity | Default Scheme |
|---|---|---|---|
| E | Token Embeddings | Very High | BF16 |
| H | LM Head | Very High | BF16 |
| O | Attention Output | High | Q8_0 / BF16 |
| Q | Attention Query | Moderate | Q6_K / IQ4_NL |
| K | Attention Key/Value | Moderate | Q8_0 / Q6_K |
| U | FFN Up/Gate | Low (robust) | MXFP4 |
| D | FFN Down | Low (robust) | MXFP4 |
| X | MoE Experts | Very Low | MXFP4 |
| R | MoE Router | High | Q8_0 |
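Classification keys off GGUF tensor names. The following sketch shows how such a mapping could look; the patterns are guesses based on common GGUF naming conventions, and the project's real rules live in `gguf/tensor_groups.py`:

```python
import re

# Illustrative name-to-group rules; order matters so that more specific
# patterns (attention output, MoE router/experts) win over generic ones.
GROUP_PATTERNS = [
    (r"token_embd", "E"),       # token embeddings
    (r"attn_output", "O"),      # attention output projection (before H)
    (r"output\.weight$", "H"),  # LM head
    (r"attn_q", "Q"),           # query projection
    (r"attn_(k|v)", "K"),       # key/value projections
    (r"ffn_gate_inp", "R"),     # MoE router (before generic FFN rules)
    (r"exps", "X"),             # MoE expert tensors
    (r"ffn_(up|gate)", "U"),    # FFN up/gate
    (r"ffn_down", "D"),         # FFN down
]

def classify(name: str) -> str:
    """Map a GGUF tensor name to a MagicQuant group letter."""
    for pattern, group in GROUP_PATTERNS:
        if re.search(pattern, name):
            return group
    return "U"  # simplification: treat unmatched weights as FFN-like
```

For example, `classify("blk.7.ffn_down_exps.weight")` lands in `X` rather than `D`, because the expert pattern is checked first.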
Each scheme trades bits per weight (bpw) against a relative noise score (lower is better):

| Scheme | Type | bpw | Noise | Best For |
|---|---|---|---|---|
| BF16 | Float | 16.0 | 0.0 | Brain layers (E, H, O) |
| Q8_0 | Integer | 8.5 | 1.0 | Near-lossless protection |
| Q6_K | K-quant | 6.56 | 2.2 | High-quality attention |
| Q5_K | K-quant | 5.5 | 3.0 | Balanced attention |
| IQ4_NL | Non-linear | 4.5 | 3.8 | Best ~4-bit quality |
| MXFP4 | FP4 (E2M1) | 4.25 | 4.0 | FFN / MoE experts |
| Q4_K_M | K-quant | 4.5 | 4.5 | Fallback integer 4-bit |
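Given the bpw column, the on-disk cost of any hybrid configuration follows directly. A quick sketch — `estimate_gb` and the per-group parameter counts below are illustrative, not measured from a real model:

```python
# bpw values from the scheme table above.
BPW = {"BF16": 16.0, "Q8_0": 8.5, "Q6_K": 6.56, "Q5_K": 5.5,
       "IQ4_NL": 4.5, "MXFP4": 4.25, "Q4_K_M": 4.5}

def estimate_gb(param_counts, config):
    """Predicted file size in GB: sum over groups of params * bpw / 8."""
    bits = sum(param_counts[g] * BPW[config[g]] for g in param_counts)
    return bits / 8 / 1e9

# Hypothetical split of a 30B-parameter model across groups: the FFN
# groups (U, D) dominate, which is why crushing them to MXFP4 pays off.
params = {"E": 0.5e9, "H": 0.5e9, "Q": 1.5e9, "K": 1.5e9,
          "O": 1.5e9, "U": 16.0e9, "D": 8.5e9}
q4 = {"E": "BF16", "H": "Q8_0", "Q": "Q6_K", "K": "Q8_0",
      "O": "BF16", "U": "MXFP4", "D": "MXFP4"}
hybrid_gb = estimate_gb(params, q4)   # roughly 20 GB vs. 60 GB at BF16
```

Protecting the small E/H/O groups at BF16 costs only a few GB, while compressing the 24B FFN parameters to 4.25 bpw saves tens of GB.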
MXFP4 implements the OCP MX Microscaling FP4 format (E2M1 values with shared E8M0 exponent). Its non-uniform quantization levels (0, 0.5, 1, 1.5, 2, 3, 4, 6) are denser near zero, naturally matching the Gaussian-like weight distribution of transformers — producing lower noise than integer Q4 at better compression.
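A minimal sketch of that rounding behavior, assuming one shared power-of-two scale per block; this is a simplification for illustration, and the project's real block-structured encoder lives in `quant/converters.py`:

```python
import numpy as np

# The eight non-negative E2M1 magnitudes; sign is a separate bit.
E2M1_LEVELS = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_quantize(block: np.ndarray) -> np.ndarray:
    """Round a 1-D block of weights to MXFP4 values sharing one
    power-of-two (E8M0-style) scale. Simplified sketch: no fixed
    block size, no bit packing."""
    amax = np.abs(block).max()
    if amax == 0.0:
        return np.zeros_like(block)
    # Power-of-two scale that maps the largest magnitude into (3, 6].
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
    scaled = np.abs(block) / scale
    # Snap each magnitude to the nearest representable level.
    idx = np.abs(scaled[:, None] - E2M1_LEVELS[None, :]).argmin(axis=1)
    return np.sign(block) * E2M1_LEVELS[idx] * scale
```

Because half of the levels lie at or below 1.5, the small weights that make up the bulk of a Gaussian-like distribution are rounded on a finer grid than a uniform 4-bit quantizer would allow.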
```shell
git clone https://github.com/lucasmcoleman/MagicQuant.git
cd MagicQuant
pip install -e .
```

Requires Python 3.9+ and NumPy. Optional: llama.cpp for real perplexity measurement during probing.
```shell
# 1. Analyze model structure
magicquant analyze model-bf16.gguf

# 2. Run sensitivity probing (uses heuristics if llama.cpp is not available)
magicquant probe model-bf16.gguf --output-dir ./output

# 3. Run evolutionary search
magicquant search model-bf16.gguf --output-dir ./output --generations 50

# 4. Generate best Q4, Q5, Q6 hybrid GGUFs
magicquant generate model-bf16.gguf --output-dir ./output --tiers Q4,Q5,Q6
```

A single hybrid build can also be driven from a YAML config:

```yaml
# config.yaml
model:
  name: Qwen3-30B-A3B
  source: ./Qwen3-30B-A3B-BF16.gguf
quantization:
  base: MXFP4_MOE
  groups:
    E: BF16
    H: BF16
    O: Q8_0
    Q: IQ4_NL
    K: IQ4_NL
```

```shell
magicquant hybrid config.yaml --output-dir ./output
```

The same writer is available as a Python API:

```python
from magicquant.gguf.writer import create_hybrid_gguf

create_hybrid_gguf(
    output_path="model-hybrid.gguf",
    base_model_path="model-bf16.gguf",
    quant_config={
        "base": "MXFP4_MOE",
        "groups": {
            "E": "BF16",
            "H": "BF16",
            "O": "Q8_0",
            "Q": "IQ4_NL",
            "K": "IQ4_NL",
        },
    },
)
```

```
magicquant/
  gguf/
    reader.py         — GGUF binary parser
    writer.py         — Hybrid GGUF writer (two-pass streaming)
    tensor_groups.py  — Tensor group classification (E, H, Q, K, O, U, D, X, R)
  quant/
    schemes.py        — Quantization scheme definitions
    converters.py     — Vectorized ggml block encoders (single source of truth)
  evolution/
    probing.py        — Sensitivity measurement (real or heuristic)
    predictor.py      — Loss/size/speed prediction with collapse penalties
    survival.py       — Evolutionary search with Protector/Crusher mutations
  utils/
    llamacpp.py       — llama.cpp integration for perplexity measurement
    naming.py         — Hybrid model filename generation
  orchestrator.py     — Pipeline coordination
  __main__.py         — CLI entry point
```
All settings can be provided via environment variables with the `MAGICQUANT_` prefix or a `.env` file:

| Variable | Default | Description |
|---|---|---|
| `MAGICQUANT_SOURCE_MODEL_PATH` | (required) | Path to source model |
| `MAGICQUANT_OUTPUT_DIR` | `./output` | Output directory |
| `MAGICQUANT_LLAMACPP_PATH` | auto-detect | Path to llama.cpp |
| `MAGICQUANT_TARGET_BASE_QUANT` | `MXFP4_MOE` | Base quantization scheme |
| `MAGICQUANT_SEARCH_GENERATIONS` | `30` | Generations per round |
| `MAGICQUANT_POPULATION_SIZE` | `80` | Candidates per generation |
| `MAGICQUANT_MEASUREMENT_ROUNDS` | `3` | Build-measure-learn cycles |
| `MAGICQUANT_TIERS` | `["Q4","Q5","Q6"]` | Compression tiers |
```shell
docker build -t magicquant:latest .
docker run --rm -v ./output:/app/output magicquant:latest search /data/model.gguf
```

For development:

```shell
pip install -e ".[dev]"
make test   # Run pytest suite
make lint   # Syntax check
make clean  # Remove build artifacts
```

Known limitations:

- K-quant encoders use simple min/max with RMSE optimization, not llama.cpp's full importance-matrix-weighted quantization
- Tokenizer reading only handles BPE (tokenizer.json); SentencePiece (.model) is not supported
- Source models must be BF16/F16/F32 — pre-quantized sources are rejected with a clear error
MIT. The MagicQuant methodology and research are credited to magiccodingman.