V0.12.0 release · intel/auto-round

Highlights

Support Hadamard transform for mxfp4 with RTN or AutoRound method by @lkk12014402 in #1515, @lkk12014402 in #1607
Support transformers 5.0 by @n1ck-guo in #1474, @lvliang-intel in #1475, @lvliang-intel in #1535, @XuehaoSun in #1604, @xin3he in #1627, @xin3he in #1626
Support Qwen3 and Qwen2.5 Omni model quantization by @lvliang-intel in #1404, @lvliang-intel in #1568, @xin3he in #1598
Support GLM-Image model quantization by @lvliang-intel in #1512
Support block-wise fp8 quant by @mengniwang95 in #1487
Support MTP module WOQ quantization and step3p5 model quantization by @xin3he in #1526, @xin3he in #1554
Enable CUDA CI Test by @XuehaoSun in #1473
Enable Performance CI Test by @XuehaoSun in #1561
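
For readers unfamiliar with the Hadamard transform mentioned in the first highlight, here is a minimal sketch of the general idea in plain Python. It is an illustration only and not auto-round's implementation: the function name `fwht` and the 8-element block are assumptions made for this example. An orthonormal Hadamard rotation spreads a single large outlier across the block, which tends to make low-bit formats such as mxfp4 easier to fit, and the rotation can be undone exactly.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list.

    Hypothetical helper for illustration; not part of the auto-round API.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine pairs that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    # Scale by 1/sqrt(n) so the transform is orthonormal (and its own inverse).
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


# A block of 8 weights with one large outlier (made-up values).
block = [0.1, -0.2, 0.05, 8.0, 0.0, 0.15, -0.1, 0.2]
rotated = fwht(block)     # outlier energy is spread over all 8 entries
restored = fwht(rotated)  # applying the transform again recovers the block

print("max |w| before:", max(abs(v) for v in block))
print("max |w| after: ", max(abs(v) for v in rotated))
```

Because the rotation is orthonormal, the total energy of the block is unchanged while its largest magnitude shrinks, so a shared per-block scale wastes less range on a single value.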

Bug fixes and Improvements

Refactor evaluation in tests to use evaluate_accuracy function by @xin3he in #1402
Remove require_intel_extension_for_pytorch and fix TypeError: unhashable type: 'set' by @xin3he in #1425
Fix cuda model ut [glm4][Molmo] by @Kaihui-intel in #1428
Update BackendInfos for AutoGPTQ based on transformer version and save g_idx for gptqmodel backend by @xin3he in #1429
Refine global scale calculation to blockwise by @WeiweiZhang1 in #1421
Support GPTQ_FORMAT for "gptqmodel:exllamav2" backend by @xin3he in #1434
Update diffusion README by @mengniwang95 in #1439
Enable new cpu pool for CI test by @chensuyue in #1435
Update auto-round-lib README.md by @chensuyue in #1437
Update logic for compatibility with gptqmodel:exllamav2 backend by @xin3he in #1438
Update readme for Intel xpu usage by @chensuyue in #1441
Fix bug of evaluate_accuracy by @xin3he in #1430
Attach act_max_hook for FP8 model by @yiliu30 in #1447
[Regression] fix FP8_STATIC loading by @xin3he in #1452
Fix requirements packaging for source distribution by @timkpaine in #1455
Fix: preserve classmethod descriptor in from_pretrained monkey patch by @yiliu30 in #1460
Support loading FP8 models on HPU by @yiliu30 in #1449
Update compatibility test by @XuehaoSun in #1463
Support multiple device evaluation for activation quantized model by @wenhuach21 in #1394
Fix KeyError when GPU is missing from accelerate max_memory by @lvliang-intel in #1457
Support glm5 by @wenhuach21 in #1466
Fix cuda ut workflow by @XuehaoSun in #1472
Fix: qwen3-next NVFP4 quantization returns act_max incorrectly by @xin3he in #1470
Reduce AutoGPTQ priority and add transformer requirement by @xin3he in #1467
Add envs.py for environment checks and update imports in test files by @xin3he in #1432
Fix HPU CI: apply temporary version restriction on compressed_tensors by @XuehaoSun in #1479
Fix typo: remove extra backtick in comment by @04cb in #1481
Support qwen3_5 moe by @wenhuach21 in #1476
Add qwen35 ut by @wenhuach21 in #1482
Grep core dumped issue in UT test by @chensuyue in #1484
Update torch to 2.10.0 in CPU CI by @XuehaoSun in #1492
Update MOE linear_loop implementation for speedup and matching name in vLLM by @xin3he in #1478
Add handle_generation_config function to manage model generation_config saving failure by @xin3he in #1448
Force ignore mlp.gate by @xin3he in #1501
Update README.md: Add nightly installation instructions for auto-round by @chensuyue in #1505
Fix FP8 quantizer for Transformers v4 by @yiliu30 in #1504
Support minimax_m2 ignore layer: block_sparse_moe.gate by @xin3he in #1508
Reduce ram&vram usage for vlm calib stage by @WeiweiZhang1 in #1488
[BUG] update moe check logic and make .gate ignore general by @xin3he in #1517
Fix low_gpu default value, refine doc by @WeiweiZhang1 in #1520
Release CUDA memory in WeightConverter and avoid meaningless print by @xin3he in #1498
Fix CUDA UT and HPU UT by @XuehaoSun in #1514
Fix bug of quantizing Z-image by @xin3he in #1516
Enable more xpu instances for xpu CI by @chensuyue in #1521
Fix "Inference tensors do not track version counter" error by @mengniwang95 in #1502
Refactor build process to use 'uv build' instead of 'python setup.py' by @XuehaoSun in #1495
Enable --eval for diffusion model by @xin3he in #1522
Support loading gpt-oss mxfp4 directly by @xin3he in #1401
Support diffusion model saving by @mengniwang95 in #1519
Optimize CPU RAM peak memory during quantization by @lvliang-intel in #1386
Fix dynamic int8 w8a8 export issue with tuning by @thuang6 in #1525
Fix missing trust_remote_code in unfused_moe by @xin3he in #1532
Fix sglang ut by @mengniwang95 in #1541
Fix alg_ext torch compile issue by @wenhuach21 in #1540
Add markdown-link-check by @XuehaoSun in #1531
Fix google/gemma-3-4b-it by @xin3he in #1547
Fix wrong match of ignore_layers and use warning instead of error for mismatch by @xin3he in #1553
Use AR_WORK_SPACE/offload/ as base dir for offload temp files by @yiliu30 in #1552
Fix shard finalization issue to prevent skipping non-LLM tail layers by @lvliang-intel in #1548
Fix missing change for compress_layer_names and add UTs by @xin3he in #1556
Fix MoE for llama/gpt-oss by @yiliu30 in #1557
Fix reload issue for lm_head quantization and update CI to cover it by @xin3he in #1563
Refine model save folder name by @Kaihui-intel in #1549
Add gptqmodel:awq backend support and keep autoawq WQLinear_GEMM for compatibility by @WeiweiZhang1 in #1545
Refine CPU RAM offload strategy by @lvliang-intel in #1544
Fix deadlock issue when quantizing on hpu by @lkk12014402 in #1571
Fix compatibility with latest compressed_tensors compressor refactor by @yiliu30 in #1576
[Regression] fix copy missing tensors for llama4 by @xin3he in #1579
Disable torch compile for gguf with alg_ext by @wenhuach21 in #1582
Update awq backend infos, refine UT by @WeiweiZhang1 in #1581
Fix CI test FP8 static backend matching failure by @lvliang-intel in #1589
Fix packing nvfp/mxfp max_workers & extend xpu ut by @Kaihui-intel in #1555
Fix contiguous issue by @xin3he in #1594
Mark ipex backend as deprecated by @xin3he in #1599
Fix CUDA UT for 0.12.0 by @xin3he in #1600
Fix gguf bug for qwen3.5 moe and some small models by @n1ck-guo in #1575
Support compressed-tensors refactor by @xin3he in #1595
Support HPU memory dump in MemoryMonitor by @xin3he in #1605
Add workflow to close stale issues and pull requests by @chensuyue in #1612
Fix diffusion model tuning issue by @mengniwang95 in #1617
Revert awq dtype change for vllm inference limitation by @WeiweiZhang1 in #1613
Fix fp8 accuracy drop issue by @mengniwang95 in #1611
Fix clean offload clearing of non-checkpoint tensors by @lvliang-intel in #1583
Fix rtn bug by @wenhuach21 in #1622
Allow disabling copy_missing_tensors_from_source via env var, default enabled by @xin3he in #1623
Fix AutoScheme low memory flag propagation from CLI by @lvliang-intel in #1596
Update torch to 2.11.0 in CI by @XuehaoSun in #1606
Fix group_size issue by @mengniwang95 in #1638

New Contributors

@timkpaine made their first contribution in #1455
@04cb made their first contribution in #1481
@thuang6 made their first contribution in #1525

Full Changelog: v0.10.2...v0.12.0
