GEMQ is a post-training quantization framework for Mixture-of-Experts (MoE) LLMs that enables extreme low-bit quantization (down to 1.5 bits per expert) with minimal accuracy degradation. It works by:
- automatically assigning different bit-widths to experts based on their importance;
- fine-tuning the routers so they can better work with quantized experts;
- optionally using progressive quantization to refine the bit allocation.
The repository includes:
- An ILP solver for global expert-level bit allocation
- GPTQ-based quantization and router fine-tuning pipelines
- Efficient low-bit MoE Triton kernels for real quantized inference
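To make the allocation idea concrete, here is a minimal sketch of expert-level bit allocation as a budget-constrained selection problem. It is not GEMQ's ILP formulation: the error table, the candidate bit-widths, and the function name `allocate_bits` are hypothetical, and a small exact dynamic program stands in for the Gurobi solver so the example runs with the standard library only.

```python
from typing import Dict, List

def allocate_bits(errors: List[Dict[int, float]], avg_bits: float) -> List[int]:
    """Pick one bit-width per expert, minimizing total error under a bit budget.

    errors[e][b] is a hypothetical cost of quantizing expert e to b bits.
    """
    budget = int(round(avg_bits * len(errors)))  # total bits across all experts
    # best[used_bits] = (total_error, choices) for the experts processed so far
    best = {0: (0.0, [])}
    for table in errors:
        nxt = {}
        for used, (err, picks) in best.items():
            for bits, cost in table.items():
                total = used + bits
                if total > budget:
                    continue  # this choice would exceed the budget
                cand = (err + cost, picks + [bits])
                if total not in nxt or cand[0] < nxt[total][0]:
                    nxt[total] = cand
        best = nxt
    if not best:
        raise ValueError("budget too small for the candidate bit-widths")
    return min(best.values(), key=lambda t: t[0])[1]

# Toy data (made up): expert 0 is sensitive to quantization, experts 1-3 are not.
toy_errors = [
    {1: 9.0, 2: 4.0, 3: 1.0},
    {1: 1.0, 2: 0.5, 3: 0.2},
    {1: 1.2, 2: 0.6, 3: 0.3},
    {1: 0.8, 2: 0.4, 3: 0.1},
]
allocation = allocate_bits(toy_errors, avg_bits=2.0)
print(allocation)
```

With this toy table the sensitive expert receives the widest bit-width while the average stays at 2 bits per expert. A real MoE model has far more experts and layers, which is where an ILP solver becomes the practical choice.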
conda create -n gemq python=3.10 -y
conda activate gemq
git clone https://github.com/jndeng/GEMQ
cd GEMQ
pip install -e .
Note
This project currently uses gurobipy as the integer linear programming (ILP) solver for bit allocation. A Gurobi license may be required for certain MoE models with a large number of experts, such as the DeepSeek and Qwen series.
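Before regenerating bit allocations, it can help to confirm that gurobipy is importable in the active environment. The snippet below is a small sketch using only the standard library; the helper name `has_gurobipy` is hypothetical, and the check says nothing about whether a valid license is present.

```python
import importlib.util

def has_gurobipy() -> bool:
    """Return True if the gurobipy package can be found in this environment."""
    return importlib.util.find_spec("gurobipy") is not None

# Only checks that the package is installed, not that a license is available.
print("gurobipy available:", has_gurobipy())
```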
All scripts for Mixtral-8×7B and DeepSeek-V2-Lite are provided in scripts.
Note
We provide pre-generated bit allocation configs under configs, which can be used directly for quantization. You may skip this section if you do not want to regenerate them.
To generate the configs from scratch, follow the steps below.
- Download the first shard of the C4 training dataset (c4-train.00000-of-01024.json) from allenai/c4 and save it under ./data.
- Run scripts/compute_stats_<model>.sh to compute model statistics on the calibration dataset. The resulting statistics (gradients and perturbation errors) will be saved under cache.
- Run scripts/allocate_<model>.sh to solve the ILP for bit allocation using the generated model statistics. The allocation results (bit configs) will be saved under configs.
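As a quick sanity check on the downloaded shard, the sketch below samples a few calibration documents from it. It assumes the file is in JSON-lines form with a `text` field per record; the function name `sample_calibration_texts` and its parameters are hypothetical and are not part of the GEMQ scripts, which handle calibration data themselves.

```python
import json
import random
from pathlib import Path
from typing import List

def sample_calibration_texts(path: str, n_samples: int = 128, seed: int = 0,
                             max_docs: int = 10000) -> List[str]:
    """Read up to max_docs JSON-lines records and sample n_samples 'text' fields."""
    texts = []
    with Path(path).open("r", encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_docs:
                break  # avoid reading the whole shard into memory
            line = line.strip()
            if line:
                # Assumption: each record is a JSON object with a "text" key.
                texts.append(json.loads(line)["text"])
    rng = random.Random(seed)  # fixed seed for a reproducible sample
    return rng.sample(texts, min(n_samples, len(texts)))

# Example call (path taken from the step above):
# texts = sample_calibration_texts("./data/c4-train.00000-of-01024.json")
```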
Run scripts/quantize_<model>.sh to quantize the model. See the script itself for the available options.
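For readers new to low-bit quantization, the sketch below shows what mapping a group of weights to a given integer bit-width looks like. It uses plain round-to-nearest uniform quantization, which is a simpler technique than the GPTQ-based pipeline GEMQ runs, and it only covers integer bit-widths. The function names and the toy weight group are hypothetical.

```python
from typing import List, Tuple

def quantize_rtn(weights: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Asymmetric uniform round-to-nearest quantization of one weight group."""
    levels = (1 << bits) - 1                      # highest integer code
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    # Map each weight to the nearest code and clamp to the valid range.
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min

def dequantize(codes: List[int], scale: float, zero: float) -> List[float]:
    """Reconstruct approximate weights from integer codes."""
    return [c * scale + zero for c in codes]

group = [-0.8, -0.1, 0.05, 0.3, 0.9]              # made-up weight group
for bits in (2, 3, 4):
    codes, scale, zero = quantize_rtn(group, bits)
    recon = dequantize(codes, scale, zero)
    max_err = max(abs(a - b) for a, b in zip(group, recon))
    print(f"{bits}-bit codes={codes} max_abs_err={max_err:.3f}")
```

The reconstruction error is bounded by half a quantization step, so each bit removed roughly doubles it. That trade-off is why assigning wider bit-widths to important experts and narrower ones to the rest can preserve accuracy at a low average bit-width.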
The evaluation code runs automatically after quantization. If you want to evaluate the model on downstream tasks, please ensure that lm-evaluation-harness is installed.
Quantized models will be saved under results.
Use scripts/bench_generate_<model>.sh to run inference demos and benchmark the real quantized models.
This repository builds upon several excellent open-source projects, including MC-MoE, GPTQ, HQQ, GemLite, and gpt-fast. We sincerely thank the authors and contributors for making their code publicly available.
If you find GEMQ useful for your research or project, please consider citing our paper.