V0.12.0 release
Highlights
- Support hadamard transform for mxfp4 with RTN or AutoRound method by @lkk12014402 in #1515, @lkk12014402 in #1607
- Support transformers 5.0 by @n1ck-guo in #1474, @lvliang-intel in #1475, @lvliang-intel in #1535, @XuehaoSun in #1604, @xin3he in #1627, @xin3he in #1626
- Support Qwen3 and Qwen2.5 Omni model quantization by @lvliang-intel in #1404, @lvliang-intel in #1568, @xin3he in #1598
- Support GLM-Image model quantization by @lvliang-intel in #1512
- Support block-wise fp8 quant by @mengniwang95 in #1487
- Support MTP module WOQ quantization and step3p5 model quantization by @xin3he in #1526, @xin3he in #1554
- Enable CUDA CI Test by @XuehaoSun in #1473
- Enable Performance CI Test by @XuehaoSun in #1561
Bug fixes and Improvements
- Refactor evaluation in tests to use evaluate_accuracy function by @xin3he in #1402
- Remove require_intel_extension_for_pytorch and fix TypeError: unhashable type: 'set' by @xin3he in #1425
- Fix cuda model ut [glm4][Molmo] by @Kaihui-intel in #1428
- Update BackendInfos for AutoGPTQ based on transformer version and save g_idx for gptqmodel backend by @xin3he in #1429
- Refine global scale calculation to blockwise by @WeiweiZhang1 in #1421
- Support GPTQ_FORMAT for "gptqmodel:exllamav2" backend by @xin3he in #1434
- Update diffusion README by @mengniwang95 in #1439
- Enable new cpu pool for CI test by @chensuyue in #1435
- Update auto-round-lib README.md by @chensuyue in #1437
- Update logic for compatibility with gptqmodel:exllamav2 backend by @xin3he in #1438
- Update readme for Intel xpu usage by @chensuyue in #1441
- Fix bug of evaluate_accuracy by @xin3he in #1430
- Attach act_max_hook for FP8 model by @yiliu30 in #1447
- [Regression] fix FP8_STATIC loading by @xin3he in #1452
- Fix requirements packaging for source distribution by @timkpaine in #1455
- Fix: preserve classmethod descriptor in from_pretrained monkey patch by @yiliu30 in #1460
- Support loading FP8 model on HPU by @yiliu30 in #1449
- Update compatibility test by @XuehaoSun in #1463
- Support multiple device evaluation for activation quantized model by @wenhuach21 in #1394
- Fix KeyError when GPU is missing from accelerate max_memory by @lvliang-intel in #1457
- Support glm5 by @wenhuach21 in #1466
- Fix cuda ut workflow by @XuehaoSun in #1472
- Fix: qwen3-next NVFP4 quantization returns act_max incorrectly by @xin3he in #1470
- Reduce AutoGPTQ priority and add transformer requirement by @xin3he in #1467
- Add envs.py for environment checks and update imports in test files by @xin3he in #1432
- Fix HPU CI: apply temporary version restriction on compressed_tensors by @XuehaoSun in #1479
- Fix typo: remove extra backtick in comment by @04cb in #1481
- Support qwen3_5 moe by @wenhuach21 in #1476
- Add qwen35 ut by @wenhuach21 in #1482
- Grep core dumped issue in UT test by @chensuyue in #1484
- Update torch to 2.10.0 in CPU CI by @XuehaoSun in #1492
- Update MOE linear_loop implementation for speedup and matching name in vLLM by @xin3he in #1478
- Add handle_generation_config function to manage model generation_config saving failure by @xin3he in #1448
- Force ignore mlp.gate by @xin3he in #1501
- Update README.md: Add nightly installation instructions for auto-round by @chensuyue in #1505
- Fix FP8 quantizer for Transformers v4 by @yiliu30 in #1504
- Support minimax_m2 ignore layer: block_sparse_moe.gate by @xin3he in #1508
- Reduce RAM & VRAM usage for VLM calibration stage by @WeiweiZhang1 in #1488
- [BUG] update moe check logic and make .gate ignore general by @xin3he in #1517
- Fix low_gpu default value, refine doc by @WeiweiZhang1 in #1520
- Release CUDA memory in WeightConverter and avoid meaningless print by @xin3he in #1498
- Fix CUDA UT and HPU UT by @XuehaoSun in #1514
- Fix bug of quantizing Z-image by @xin3he in #1516
- Enable more xpu instances for xpu CI by @chensuyue in #1521
- Fix "Inference tensors do not track version counter" error by @mengniwang95 in #1502
- Refactor build process to use 'uv build' instead of 'python setup.py' by @XuehaoSun in #1495
- Enable --eval for diffusion model by @xin3he in #1522
- Support directly loading gpt-oss mxfp4 by @xin3he in #1401
- Support diffusion model saving by @mengniwang95 in #1519
- Optimize CPU RAM peak memory during quantization by @lvliang-intel in #1386
- Fix dynamic int8 w8a8 export issue with tuning by @thuang6 in #1525
- Fix missing trust_remote_code in unfused_moe by @xin3he in #1532
- Fix sglang ut by @mengniwang95 in #1541
- Fix alg_ext torch compile issue by @wenhuach21 in #1540
- Add markdown-link-check by @XuehaoSun in #1531
- Fix google/gemma-3-4b-it by @xin3he in #1547
- Fix wrong match of ignore_layers and use warning instead of error for mismatch by @xin3he in #1553
- Use AR_WORK_SPACE/offload/ as base dir for offload temp files by @yiliu30 in #1552
- Fix shard finalization issue to prevent skipping non-LLM tail layers by @lvliang-intel in #1548
- Fix missing change for compress_layer_names and add UTs by @xin3he in #1556
- Fix MoE for llama/gpt-oss by @yiliu30 in #1557
- Fix reload issue for lm_head quantization and update CI to cover it by @xin3he in #1563
- Refine model save folder name by @Kaihui-intel in #1549
- Add gptqmodel:awq backend support and keep autoawq WQLinear_GEMM for compatibility by @WeiweiZhang1 in #1545
- Refine CPU RAM offload strategy by @lvliang-intel in #1544
- Fix deadlock issue when quantizing on HPU by @lkk12014402 in #1571
- Fix compatibility with latest compressed_tensors compressor refactor by @yiliu30 in #1576
- [Regression] fix copy missing tensors for llama4 by @xin3he in #1579
- Disable torch compile for gguf with alg_ext by @wenhuach21 in #1582
- Update awq backend infos, refine UT by @WeiweiZhang1 in #1581
- Fix CI test FP8 static backend matching failure by @lvliang-intel in #1589
- Fix packing nvfp/mxfp max_workers & extend xpu ut by @Kaihui-intel in #1555
- Fix contiguous issue by @xin3he in #1594
- Mark ipex backend as deprecated by @xin3he in #1599
- Fix CUDA UT for 0.12.0 by @xin3he in #1600
- Fix gguf bug for qwen3.5 moe and some small models by @n1ck-guo in #1575
- Support compressed-tensors refactor by @xin3he in #1595
- Support HPU memory dump in MemoryMonitor by @xin3he in #1605
- Add workflow to close stale issues and pull requests by @chensuyue in #1612
- Fix diffusion model tuning issue by @mengniwang95 in #1617
- Revert awq dtype change for vllm inference limitation by @WeiweiZhang1 in #1613
- Fix fp8 accuracy drop issue by @mengniwang95 in #1611
- Fix clean offload clearing of non-checkpoint tensors by @lvliang-intel in #1583
- Fix rtn bug by @wenhuach21 in #1622
- Allow disabling copy_missing_tensors_from_source via env var, default enabled by @xin3he in #1623
- Fix AutoScheme low memory flag propagation from CLI by @lvliang-intel in #1596
- Update torch to 2.11.0 in CI by @XuehaoSun in #1606
- Fix group_size issue by @mengniwang95 in #1638
New Contributors
- @timkpaine made their first contribution in #1455
- @04cb made their first contribution in #1481
- @thuang6 made their first contribution in #1525
Full Changelog: v0.10.2...v0.12.0