This is the official repository for the paper Advancing SVD-based LLM Compression via Layer-Wise Error Model Search. Moritz Thoma, Maximilian Groezinger, Maximilian ForstenhΓ€usler, Emad Aghajanzadeh, Manoj-Rohit Vemparala, Christos Anagnostopoulos, Pierpaolo Mori, Nael Fasfous, Alexander Frickenstein, Daniel Mueller-Gritschneder, Ulf Schlichtmann. ICML 2026
Low-rank SVD-based compression offers a powerful strategy to reduce the computational costs of Large Language Models (LLMs). However, existing methods commonly encounter two recurring obstacles: (i) global rank allocation, where uncalibrated error proxies fail to account for complex error propagation, and (ii) decomposition quality, where Fisher-based estimators suffer from severe rank collapse.
In this work, we address these limitations by presenting Layer-wise Error Modeling Search (LEMS) and KFAC-SVD.
- LEMS advances rank allocation by introducing a layer-wise error surrogate that integrates both local and global layer importance alongside a propagation bias, allowing us to determine global rank configurations efficiently as an Integer Linear Program (ILP).
- KFAC-SVD improves decomposition quality by utilizing token-wise statistics, preventing the rank deficiency observed in prior Fisher-based SVD.
We demonstrate across Mistral, Qwen3, and Llama-3 families that KFAC-SVD achieves average perplexity improvements of 15%, while LEMS consistently outperforms existing search strategies, delivering significant zero-shot accuracy improvements of up to 4.8 p.p. that generalize to scales of 70B parameters.
You can directly download and test our pre-compressed models from HuggingFace without running the compression pipeline locally.
Models available on HuggingFace
| Model | Search Type | Ratio | Hugging Face Name |
|---|---|---|---|
| Llama-3-8B | LEMS | 0.9 | MoritzMo123/kfac-svd_lems_llama-3-8b_0.9 |
| LEMS | 0.8 | MoritzMo123/kfac-svd_lems_llama-3-8b_0.8 |
|
| LEMS | 0.7 | MoritzMo123/kfac-svd_lems_llama-3-8b_0.7 |
|
| LEMS | 0.6 | MoritzMo123/kfac-svd_lems_llama-3-8b_0.6 |
|
| Uniform | 0.9 | MoritzMo123/kfac-svd_uniform_llama-3-8b_0.9 |
|
| Uniform | 0.8 | MoritzMo123/kfac-svd_uniform_llama-3-8b_0.8 |
|
| Uniform | 0.7 | MoritzMo123/kfac-svd_uniform_llama-3-8b_0.7 |
|
| Uniform | 0.6 | MoritzMo123/kfac-svd_uniform_llama-3-8b_0.6 |
|
| Mistral-7B | LEMS | 0.9 | MoritzMo123/kfac-svd_lems_mistral-7b_0.9 |
| LEMS | 0.8 | MoritzMo123/kfac-svd_lems_mistral-7b_0.8 |
|
| LEMS | 0.7 | MoritzMo123/kfac-svd_lems_mistral-7b_0.7 |
|
| LEMS | 0.6 | MoritzMo123/kfac-svd_lems_mistral-7b_0.6 |
|
| Uniform | 0.9 | MoritzMo123/kfac-svd_uniform_mistral-7b_0.9 |
|
| Uniform | 0.8 | MoritzMo123/kfac-svd_uniform_mistral-7b_0.8 |
|
| Uniform | 0.7 | MoritzMo123/kfac-svd_uniform_mistral-7b_0.7 |
|
| Uniform | 0.6 | MoritzMo123/kfac-svd_uniform_mistral-7b_0.6 |
|
| Qwen3-8B | LEMS | 0.9 | MoritzMo123/kfac-svd_lems_Qwen3-8B_0.9 |
| LEMS | 0.8 | MoritzMo123/kfac-svd_lems_Qwen3-8B_0.8 |
|
| LEMS | 0.7 | MoritzMo123/kfac-svd_lems_Qwen3-8B_0.7 |
|
| LEMS | 0.6 | MoritzMo123/kfac-svd_lems_Qwen3-8B_0.6 |
|
| Uniform | 0.9 | MoritzMo123/kfac-svd_uniform_Qwen3-8B_0.9 |
|
| Uniform | 0.8 | MoritzMo123/kfac-svd_uniform_Qwen3-8B_0.8 |
|
| Uniform | 0.7 | MoritzMo123/kfac-svd_uniform_Qwen3-8B_0.7 |
|
| Uniform | 0.6 | MoritzMo123/kfac-svd_uniform_Qwen3-8B_0.6 |
from transformers import AutoModelForCausalLM, AutoTokenizer
model_string="MoritzMo123/kfac-svd_lems_llama-3-8b_0.8"
model = AutoModelForCausalLM.from_pretrained(
model_string,
trust_remote_code=True,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_string)To compress a model from scratch and recreate our core results, use the following template (replace mistral_7b with llama3_8b or qwen3_8b as needed).
Use eval=extended for full zero-shot task sweeps or eval=quick for Wikitext-only evaluation.
Note: For very large models (e.g., 70B), set the environment variable
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:Trueto avoid OOM.
python run.py model=mistral_7b svd=kfac_svd search=lems compression_target=0.8 eval=extendedKFAC-SVD/
βββ run.py # Main entry point (Hydra-based)
βββ finetune_LLM.py # Post-compression fine-tuning
βββ configs/ # Hydra configuration groups
β βββ config.yaml # Root config (merges all groups)
β βββ model/ # Model definitions (llama3_8b, mistral_7b, β¦)
β βββ svd/ # Decomposition methods (kfac_svd, svd_llm, β¦)
β β βββ _common_svd.yaml # Shared defaults inherited by all SVD configs
β βββ search/ # Search / rank-allocation methods
β βββ data/ # Calibration dataset settings
β βββ eval/ # Evaluation profiles (quick / extended)
β βββ export/ # HuggingFace export settings
β βββ preset/ # Named presets combining svd + search + tuned params
βββ compression/
β βββ svd_core.py # Orchestrator: factorize β search β compress
β βββ factorization/ # SVD / decomposition implementations
β βββ search/ # Search / rank-allocation implementations
βββ all_utils/ # Shared utilities (data, evaluation, model I/O, β¦)
βββ local_datasets/ # Local dataset loaders (C4, MathQA, β¦)
βββ docker/ # Dockerfile and compose files
This project relies heavily on Hydra for configuration. The root configuration at configs/config.yaml manages six main groups (model, svd, search, data, eval, export).
You can override any parameter directly via the command line:
python run.py model=llama3_8b svd=kfac_svd search=lems compression_target=0.8 \
search.enforce_rank_multiples_of=16 svd.do_post_calibration=FalseYou can also use presets to combine multiple configuration groups quickly:
python run.py +preset=kfac_lems model=qwen3_8b compression_target=0.9We strongly encourage using the dockerized environment for reproduction to avoid dependency conflicts.
cd docker
docker build -t lems:torch2.2.1 .
# Mount your HF cache or dataset folder as needed
docker run --gpus 'all' -it --name="LLM_COMPRESS" -v ~/.cache/huggingface:/root/.cache/huggingface -v ./:/workspace lems:torch2.2.1To evaluate using the C4 dataset, additional manual setup is necessary. Please refer to local_datasets/c4/readme.md to make the dataset available. Note: Without this step, running with eval=extended will fail!
All LEMS results in the paper were obtained using Gurobi as the ILP solver.
- Gurobi (Requires License): Free for educational institutions. Place your license in
gurobi_license.jsonformatted as:{"WLSACCESSID": "your id", "WLSSECRET": "your secret", "LICENSEID": your_id}. - PuLP/CBC (No License Required): We provide an experimental open-source alternative by passing
search.solver=cbc. To simplify the problem for faster solving with CBC, increasesearch.enforce_rank_multiples_ofto a higher number (e.g., 64). Note: We do not guarantee CBC will perfectly match Gurobi results.
To compress a model, you can mix and match different search and decomposition (SVD) configurations. The provided code is highly flexible and works out of the box for most HuggingFace models.
| Config | Method | Key Parameters |
|---|---|---|
lems |
LEMS (Ours) | solver (gurobi/cbc), crosslayer_term, halpha, hgamma, enforce_rank_multiples_of |
uniform |
Uniform | enforce_rank_multiples_of |
asvd |
ASVD threshold | sensitivity_loss, target_metric, min_ratio, max_ratio |
asvd_plus |
ASVD+ (bias + Optuna) | Same as ASVD + crosslayer_term, halpha, hgamma, optuna_trials |
memvit |
MRCS (greedy) | sensitivity_loss, target_metric, lower_bound, enforce_rank_multiples_of |
memvit_plus |
MRCS+ (bias + Optuna) | Same as MRCS + crosslayer_term, halpha, hgamma, optuna_trials |
svd_llmv2 |
SVD-LLMv2 | sensitivity_loss |
atp |
ATP | beta |
loadconfig |
Load from JSON | layer_compression_json_path |
| Config | Method | Description |
|---|---|---|
kfac_svd |
KFAC-SVD (Ours) | Token-wise Fisher-based whitened SVD |
svd_llm |
SVD-LLM | Cholesky-whitened SVD |
svd_llmv2 |
SVD-LLM (SVD whitening) | SVD-based whitening variant |
svd_llm_large |
SVD-LLM (memory efficient) | CPU-offloaded whitening for 70B+ models |
svd |
Vanilla SVD | Plain truncated SVD (no activation awareness) |
asvd |
ASVD | Activation-scaled SVD |
fwsvd |
FWSVD | Fisher-weighted SVD |
gfwsvd |
GFWSVD | Gradient Fisher-weighted SVD |
dobi_svd |
DOBI-SVD | Double-sided bi-orthogonal SVD |
After compression, you have the option to fine-tune the compressed model to recover further performance using finetune_LLM.py:
python finetune_LLM.py \
--model_path outputs/runs/<run_name>/checkpoint.pt \
--tuning_strategy peft_lora \
--num_epochs 1 \
--learning_rate 2e-5Available Tuning Strategies:
peft_lora: Standard PEFT LoRA adapters.tune_svd: Unfreezes only the decomposed parameters.custom_lora: Native LoRA applied specifically to the decomposed layers.
For a complete breakdown of zero-shot accuracy, perplexity generalizability, and scaling laws, please refer to the main paper and our project page.
Below are the primary results comparing LEMS to baseline search and SVD methods. Lower Wiki (Perplexity) is better; Higher Acc (Accuracy) is better.
| Ratio | Search Method | Mistral-7B Wiki / Acc |
Llama3-8B Wiki / Acc |
Qwen3-8B Wiki / Acc |
|---|---|---|---|---|
| - | Baseline | 5.25 / 63.95 | 6.14 / 63.34 | 9.71 / 62.03 |
| 0.8 | Uniform | 7.14 / 52.35 | 11.44 / 47.69 | 12.52 / 53.56 |
| ASVD | 7.20 / 47.81 | 12.92 / 45.95 | 15.70 / 47.72 | |
| SVD-LLMv2 | 7.13 / 52.68 | 11.40 / 47.70 | 12.52 / 53.36 | |
| MRCS | 7.11 / 51.75 | 11.99 / 46.63 | 14.52 / 52.04 | |
| ATP | 7.14 / 52.35 | 11.07 / 51.56 | 12.52 / 53.56 | |
| LEMS (Ours) | 5.98 / 57.67 | 8.16 / 55.99 | 10.38 / 59.28 | |
| 0.6 | Uniform | 14.38 / 39.10 | 48.56 / 34.39 | 21.68 / 39.63 |
| ASVD | 16.78 / 35.56 | 75.21 / 33.98 | 29.22 / 36.09 | |
| SVD-LLMv2 | 14.90 / 39.29 | 47.53 / 34.43 | 21.42 / 39.74 | |
| MRCS | 14.21 / 37.44 | 67.95 / 33.41 | 38.72 / 36.42 | |
| ATP | 13.59 / 40.24 | 25.14 / 39.95 | 19.47 / 40.20 | |
| LEMS (Ours) | 10.58 / 45.60 | 17.86 / 43.09 | 15.70 / 45.51 |
| Ratio | Search Method | Mistral-7B Wiki / Acc |
Llama3-8B Wiki / Acc |
Qwen3-8B Wiki / Acc |
|---|---|---|---|---|
| - | Baseline | 5.25 / 63.96 | 6.14 / 63.33 | 9.71 / 62.04 |
| 0.9 | FWSVD | 9.47 / 56.93 | 42.11 / 48.88 | 16.95 / 53.82 |
| ASVD | 9.14 / 57.30 | 65.09 / 47.89 | 20.36 / 51.19 | |
| SVD-LLM | 6.46 / 56.81 | 10.14 / 52.27 | 12.52 / 55.96 | |
| SVD-LLMv2 | 6.46 / 56.78 | 10.18 / 52.17 | 12.52 / 55.80 | |
| DOBI-SVD | 7.11 / 55.17 | 11.22 / 52.98 | 13.46 / 55.48 | |
| GFWSVD | 31.66 / 40.73 | 2569.75 / 33.40 | 171.49 / 46.66 | |
| KFAC-SVD (Ours) | 6.22 / 56.92 | 8.84 / 54.45 | 11.51 / 57.07 | |
| + LEMS (Full) | 5.37 / 62.77 | 6.58 / 61.99 | 9.85 / 62.89 | |
| 0.7 | FWSVD | 34.84 / 44.24 | 716.39 / 33.40 | 41.37 / 42.68 |
| ASVD | 28.16 / 46.09 | 10989.41 / 33.03 | 70.66 / 43.10 | |
| SVD-LLM | 10.96 / 44.39 | 34.64 / 37.69 | 17.11 / 45.76 | |
| SVD-LLMv2 | 10.96 / 44.35 | 34.98 / 37.70 | 17.15 / 45.62 | |
| DOBI-SVD | 12.37 / 42.94 | 31.54 / 38.48 | 18.54 / 46.24 | |
| GFWSVD | 2549.75 / 32.29 | 41798.89 / 31.77 | 21684.98 / 34.06 | |
| KFAC-SVD (Ours) | 9.29 / 45.76 | 19.43 / 40.73 | 14.69 / 46.78 | |
| + LEMS (Full) | 7.42 / 51.52 | 11.07 / 49.39 | 11.81 / 53.80 |
If our work or code helps your research, please consider citing our paper:
@inproceedings{thoma_advancing_2026,
location = {Seoul, South Korea},
title = {Advancing {SVD}-based {LLM} Compression via Layer-Wise Error Model Search},
url = {https://openreview.net/forum?id=IjIgNPFuCt},
abstract = {Low-rank {SVD}-based compression offers a powerful strategy to reduce the computational costs of Large language models ({LLMs}); however, existing methods commonly encounter two recurring obstacles: (i) global rank allocation, where uncalibrated error proxies fail to account for complex error propagation, and (ii) decomposition quality, where Fisher-based estimators suffer from severe rank collapse. In this work, we address these limitations by presenting Layer-wise Error Modeling Search ({LEMS}) and {KFAC}-{SVD}. {LEMS} advances rank allocation by introducing a layer-wise error surrogate that integrates both local and global layer importance alongside a propagation bias, allowing us to determine global rank configurations efficiently as an Integer Linear Program ({ILP}). Simultaneously, {KFAC}-{SVD} improves decomposition quality by utilizing token-wise statistics, preventing the rank deficiency observed in prior Fisher-based {SVD}. We demonstrate across Mistral, Qwen3, and Llama-3 families that {KFAC}-{SVD} achieves an average perplexity improvements of 15\%, while {LEMS} consistently outperforms existing search strategies, delivering significant zero-shot accuracy improvements of up to 4.7 p.p. that generalize to scales of 70B parameters. Code is made available in the Supplement.},
eventtitle = {Forty-third International Conference on Machine Learning},
author = {Thoma, Moritz and Groezinger, Maximilian and ForstenhΓ€usler, Maximilian and Aghajanzadeh, Emad and Vemparala, Manoj Rohit and Anagnostopoulos, Christos and Mori, Pierpaolo and Fasfous, Nael and Frickenstein, Alexander and Mueller-Gritschneder, Daniel and Schlichtmann, Ulf},
urldate = {2026-05-27},
date = {2026-04-30},
langid = {english},
}
