GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

GEMQ is a post-training quantization framework for Mixture-of-Experts (MoE) LLMs that enables extreme low-bit quantization (down to 1.5 bits per expert) with minimal accuracy degradation. It works by:

automatically assigning different bit-widths to experts based on their importance;
fine-tuning the routers so they can better work with quantized experts;
optionally using progressive quantization to refine the bit allocation.

What's in this repo

An ILP solver for global expert-level bit allocation
GPTQ-based quantization and router fine-tuning pipelines
Efficient low-bit MoE Triton kernels for inference with actually quantized weights

Installation

conda create -n gemq python=3.10 -y
conda activate gemq
git clone https://github.com/jndeng/GEMQ
cd GEMQ
pip install -e .

Note

This project currently uses gurobipy as the integer linear programming (ILP) solver for bit allocation. The pip-installed gurobipy ships with a size-limited license, so a full Gurobi license may be required for MoE models with a large number of experts, such as the DeepSeek and Qwen series.
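To get a rough sense of when the license matters, the size of the allocation problem can be estimated ahead of time. The sketch below is only an illustration under stated assumptions: it assumes one binary variable per (expert, candidate bit-width) pair, one "pick exactly one bit-width" constraint per expert, and a single global budget constraint, which may differ from the formulation GEMQ actually uses. The function name `ilp_size`, the three candidate bit-widths, the layer and expert counts, and the 2000-variable limit of the restricted license are all assumptions to verify against the model configs and the Gurobi documentation.

```python
# Rough ILP size estimate for expert-level bit allocation.
# Assumed formulation (hypothetical, not taken from the GEMQ code):
#   - one binary variable per (expert, candidate bit-width) pair
#   - one "choose exactly one bit-width" constraint per expert
#   - one global average-bit budget constraint

def ilp_size(num_moe_layers: int, experts_per_layer: int, num_bit_options: int):
    """Return (num_variables, num_constraints) under the assumed formulation."""
    num_experts = num_moe_layers * experts_per_layer
    num_variables = num_experts * num_bit_options
    num_constraints = num_experts + 1
    return num_variables, num_constraints


# Assumption: size limit of the restricted license bundled with gurobipy.
RESTRICTED_LIMIT = 2000

# Assumed model shapes (check against the model configs before relying on them).
MODELS = {
    "Mixtral-8x7B": (32, 8),        # 32 MoE layers, 8 experts each
    "DeepSeek-V2-Lite": (26, 64),   # 26 MoE layers, 64 routed experts each
}

sizes = {name: ilp_size(layers, experts, num_bit_options=3)
         for name, (layers, experts) in MODELS.items()}

for name, (n_vars, n_cons) in sizes.items():
    fits = n_vars <= RESTRICTED_LIMIT and n_cons <= RESTRICTED_LIMIT
    print(f"{name}: {n_vars} variables, {n_cons} constraints, "
          f"fits restricted license: {fits}")
```

Under these assumptions, the Mixtral-sized problem stays well below the limit, while the DeepSeek-sized problem exceeds it on the variable count alone.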

Usage

All scripts for Mixtral-8×7B and DeepSeek-V2-Lite are provided in the scripts directory.

1. Bit Allocation

Note

We provide pre-generated bit allocation configs under configs, which can be used directly for quantization. You may skip this section if you do not want to regenerate them.

To generate the configs from scratch, follow the steps below.

Download the first shard of the C4 training dataset (c4-train.00000-of-01024.json) from allenai/c4 and save it under ./data.
Run scripts/compute_stats_<model>.sh to compute model statistics on the calibration dataset. The resulting statistics (gradients and perturbation errors) will be saved under cache.
Run scripts/allocate_<model>.sh to solve the ILP for bit allocation using the generated model statistics. The allocation results (bit configs) will be saved under configs.
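The allocation step can be pictured as a multiple-choice selection problem: each expert gets exactly one bit-width, the total must stay within a bit budget, and the summed error estimate should be as small as possible. The sketch below solves a toy instance with plain dynamic programming instead of gurobipy, so it runs without a solver license. It is not the GEMQ implementation: the function `allocate_bits`, the candidate bit-widths, and the cost table are hypothetical, and the real objective is built from the gradient and perturbation statistics saved under cache.

```python
# Toy expert-level bit allocation, solved by dynamic programming.
# This stands in for the ILP; all names and numbers are illustrative.

def allocate_bits(costs, bit_options, total_budget):
    """Pick one bit-width per expert, minimizing total cost within the budget.

    costs[e][k] is the (hypothetical) error of expert e at bit_options[k].
    total_budget is the maximum sum of bit-widths over all experts.
    Returns (total_cost, bits_per_expert).
    """
    # best[used_bits] = (lowest cost so far, bit-width chosen for each expert)
    best = {0: (0.0, [])}
    for expert_costs in costs:
        nxt = {}
        for used, (cost, picks) in best.items():
            for bits, c in zip(bit_options, expert_costs):
                new_used = used + bits
                if new_used > total_budget:
                    continue  # this choice would exceed the budget
                candidate = (cost + c, picks + [bits])
                if new_used not in nxt or candidate[0] < nxt[new_used][0]:
                    nxt[new_used] = candidate
        best = nxt
    if not best:
        raise ValueError("budget too small for the given bit options")
    return min(best.values(), key=lambda entry: entry[0])


bit_options = [1, 2, 3]
# Rows: experts. Columns: error at 1, 2, and 3 bits (made-up values).
costs = [
    [9.0, 3.0, 1.0],   # sensitive expert
    [2.0, 1.5, 1.2],   # robust expert
    [8.0, 2.0, 0.5],   # sensitive expert
    [1.0, 0.8, 0.7],   # robust expert
]
# Average of 1.5 bits over 4 experts gives a total budget of 6 bits.
total_cost, picks = allocate_bits(costs, bit_options, total_budget=6)
average_bits = sum(picks) / len(picks)
print(picks, total_cost, average_bits)
```

In this toy instance, the two sensitive experts receive 2 bits and the two robust experts receive 1 bit, which meets the 1.5-bit average. A real run replaces the cost table with measured statistics and solves all layers jointly, which is where an ILP solver becomes necessary.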

2. Mixed-Precision Quantization

Run scripts/quantize_<model>.sh to quantize the model. See the script for the available options.

The evaluation code runs automatically after quantization. If you want to evaluate the model on downstream tasks, please ensure that lm-evaluation-harness is installed.

Quantized models will be saved under results.

3. Inference

Use scripts/bench_generate_<model>.sh to run inference demos and benchmark the real quantized models.

Acknowledgements

This repository builds upon several excellent open-source projects, including MC-MoE, GPTQ, HQQ, GemLite, and gpt-fast. We sincerely thank the authors and contributors for making their code publicly available.

Citation

If you find GEMQ useful for your research or project, please consider citing our paper.
