Highlights
- Support one cpu/xpu backend by @Zhenzhong1 in #1723, @luoyu-intel in #1806, @Copilot in #1819
- Support model_free WOQ quantization by @xin3he in #1699, @xin3he in #1809
- Expanded support for new model architectures across LLMs, VLMs, diffusion, audio, and MoE models, including Gemma4 (#1655), MiMo-V2-Flash (#1718), BAGEL-7B-MoT (#1633), WAN2.2 (#1678), Ovis-Image-7B (#1616), Qwen-TTS and MiMo-Audio (#1810), and Qwen3.6-35B-A3B (#1705).
- Add compressed-tensors format export support for W4A16 and W8A16 by @thuang6 in #1669
- New architecture for auto_round by @n1ck-guo in #1542, #1761, #1781, #1796, in #1807, #1765, #1808, #1817, #1832
- Support AWQ based smoothing algorithm by @WeiweiZhang1 in #1749
Bug fixes and Improvements
- Reduce XPU memory usage in CI by @xin3he in #1812
- Reduce flux CUDA CI tuning memory from 30G to 5GB by @xin3he in #1694
- Support MXINT4 scheme by @mengniwang95 in #1666
- Remove IPEX related code, doc, and test by @xin3he in #1787
- Enhance dataset preprocessing memory management and fix hash failure by @xin3he in #1621
- Enhance quantization configuration support for mixed precision and schemes in utils and tests by @xin3he in #1643
- Force auto-scheme low_gpu to True in CLI by @wenhuach21 in #1653
- Enhance performance test by @XuehaoSun in #1610
- Fix auto-scheme accuracy drop bug w/o low_gpu, add CI test by @WeiweiZhang1 in #1658
- Add gptqmodel 6.0 compatible change and MTP quantization skip warning by @xin3he in #1663
- Import gptqmodel in conftest to avoid error in transformers by @xin3he in #1668
- Fix missing extra_config export for unsupported ignore_layers like mlp.gate by @lvliang-intel in #1660
- Update unit test requirements for compressed-tensors and transformers… by @XuehaoSun in #1670
- Fix omni model test CI issue by @lvliang-intel in #1667
- Refactor module access to use PyTorch get_submodule / set_submodule by @scopophobic in #1590
- Support longcat_next by @xin3he in #1637
- Stabilize Qwen3-Omni MoE weight fidelity test for matching NaNs by @lvliang-intel in #1681
- Fix Hadamard transform weight dtype, using float32 as default and in-place transformed weight by @lkk12014402 in #1665
- fp8_block bug fix by @mengniwang95 in #1693
- Support diffusion model AIDC-AI/Ovis-Image-7B quantization by @lvliang-intel in #1616
- Gate large FP4 packing test by GPU memory by @yiliu30 in #1696
- Add Claude skills for AutoRound by @lvliang-intel in #1686
- Add missing run_mllm entry point alias by @JGSphaela in #1695
- Rename scheme INT8_W8A8 to INT8 by @thuang6 in #1687
- Update mtp quant for special cases by @xin3he in #1691
- Update gaudi-docker to v1.24.0 & fix CUDA UT by @XuehaoSun in #1708
- Add support for gemma4 model by @n1ck-guo in #1655
- Ignore mtp.fc for qwen3_5 due to vllm failure by @xin3he in #1710
- Introduce INT4 support at the algorithm level by @wenhuach21 in #1641
- Refine int4 doc by @wenhuach21 in #1720
- Revert "ignore mtp.fc for qwen3_5 due to vllm failure (#1710)" by @xin3he in #1730
- Skip quantizing mtp.fc since vLLM doesn't support by @xin3he in #1731
- Create model_support_request.yml by @xin3he in #1738
- Remove threaded packing from exporters by @yiliu30 in #1719
- Reduce XPU memory usage with patch_xpu_sdpa_drop_causal_mask by @xin3he in #1716
- Add MLX format export support and AutoScheme for vlm support by @wenhuach21 in #1732
- Add warnings for lm_head activation scale fallback by @n1ck-guo in #1728
- Add support for MiMo-V2-Flash by @n1ck-guo in #1718
- Fix vllm CUDA CI by @XuehaoSun in #1750
- Fix hpu error by @n1ck-guo in #1766
- MTP split gate_up_proj and fix accu gap in rtn quantization by @xin3he in #1758
- Support gptqmodel 7.0.0 and fix bug in CI by @xin3he in #1772
- Optimize CUDA CI and Code Scan workflows by @XuehaoSun in #1770
- Fix accuracy regression and check it in CUDA CI by @xin3he in #1785
- Fix amp by @wenhuach21 in #1768, @wenhuach21 in #1767
- Fix incompatible weight names by @mengniwang95 in #1759
- Fix CT export metadata for KV cache and attention by @yiliu30 in #1752, @yiliu30 in #1861
- Enhance AutoRound Lib test workflow by @chensuyue in #1805, @chensuyue in #1801
- Fix mixed-precision accuracy regression when AutoScheme runs with CPU offloading and Hadamard rotation enabled by @lvliang-intel in #1753
- Add shared agent config layout by @yiliu30 in #1700
- Support ByteDance-Seed/BAGEL-7B-MoT quantization in w4a16 format by @lvliang-intel in #1633
- Adjust gguf tuning algorithm by @wenhuach21 in #1649, @wenhuach21 in #1824
- Feats: Quantize/save/evaluate the Wan-AI/WAN2.2 models in w4a16 format by @lvliang-intel in #1678
- Fix layer name mismatch of VLM(qwen3.5-2B) in hf loading by @xin3he in #1823
- Update dependencies in CI and installation scripts for cuda compatibility by @XuehaoSun in #1825
- Reduce RAM usage of quantizing VLM models and fix some issues of quantizing gemma4 by @lvliang-intel in #1791
- Add mimo-audio, Qwen-TTS model backbone quantization by @WeiweiZhang1 in #1810
- Reduce VRAM usage of quantizing VLM models by @lvliang-intel in #1777
- Support quarot/spinquant rotation before quantization by @lkk12014402 in #1797
- Support Exporting Block-Wise FP8 AR Format by @Zhenzhong1 in #1798
- Fix Gemma4 KeyError sliding_attention issue by @lvliang-intel in #1839
- Fix gpt-j-6b RTN RuntimeError by @lvliang-intel in #1848
- Sage fast sfm by @luoyu-intel in #1843
- Security: HTTP requests are performed without timeout safeguards by @tomaioo in #1683
- Dynamic map checkpoint naming based on model objective by @xin3he in #1840
- Refine/fix gptq format by @wenhuach21 in #1853
- Fix save_quantized log conflict by @WeiweiZhang1 in #1845
- Fix Qwen Omni quantization model issue for long form audio generation by @lvliang-intel in #1698
- Fix bug of qwen and gguf export by @n1ck-guo in #1846
- Refactor quarot/spinquant rotation to simplify the code by @lkk12014402 in #1849
- Fix gemma4 crash issue during quantizing by @lvliang-intel in #1860
- Fix SDPA bug by @luoyu-intel in #1862
- Fix special-model predefined ignore layer filtering by @lvliang-intel in #1863
- Fix packing format in quantization config and update variable assignment in tests by @xin3he in #1857
- Fix unsupported dtype by @wenhuach21 in #1868
- Support WOQ model input, such as kimi2.5 by @xin3he in #1642
- Enable low_cpu_mem_usage for mxfp/nvfp by @Kaihui-intel in #1648
New Contributors
- @JGSphaela made their first contribution in #1695
- @tomaioo made their first contribution in #1683
Full Changelog: v0.12.3...v0.13.0