v0.9.3 Release
Highlights
- Added ark backend by @Zhenzhong1 in #1075
- Reduce VRAM usage for optimized RTN mode by @wenhuach21 in #1043
- Support alg_ext on windows by @chensuyue in #1082
- Adjust 2-bit and 3-bit hyperparameters for auto-round-best by @wenhuach21 in #1081
- Fix bug where layers outside blocks did not use best_params by @n1ck-guo in #1128
- Fix NVFP4 and FP8 packing RAM issue and refine RAM release of the original layer for all export formats by @yiliu30 in #1129
What's Changed
- fix accuracy regression by @wenhuach21 in #1041
- add environment.md and remove AutoRoundMLLM usage in readme by @xin3he in #1042
- add memory monitor and import auto-scheme on demand by @wenhuach21 in #1049
- Update get_block_names func by @mengniwang95 in #1047
- set add_bos_token=True for llama model by @n1ck-guo in #1046
- Announce llmc integration by @yiliu30 in #1055
- [High Risk]reduce vram usage for optimized RTN mode by @wenhuach21 in #1043
- Add static FP8 attention support by @yiliu30 in #1045
- Enhance tokenizer saving by @mengniwang95 in #1057
- Revert "Add static FP8 attention support" by @yiliu30 in #1060
- refine readme by @wenhuach21 in #1063
- Fix transformers==4.57.1 in CI by @XuehaoSun in #1066
- Add static FP8 attention support by @yiliu30 in #1061
- Remove tbb by @yiliu30 in #1069
- support for gguf mixed q2_k_s by @n1ck-guo in #1059
- Add compatibility test for ARM by @XuehaoSun in #1073
- Optimize CPU CI pipelines by @XuehaoSun in #1071
- Export KV Scheme in LLMC config by @yiliu30 in #1068
- Add LLMC integration test by @yiliu30 in #1053
- Support transformers loading quantized moe model by @mengniwang95 in #1067
- update alg_ext and add ut by @n1ck-guo in #1064
- simplify what's new and add publication_list by @xin3he in #1070
- Add MXFP8 MOE/Linear and MXFP4 Linear by @yiliu30 in #1034
- improve accuracy for 2bit with auto-round-best by @wenhuach21 in #1078
- Support mxfp nvfp lmhead quant by @WeiweiZhang1 in #1051
- refine sampler by @wenhuach21 in #1077
- fix bf16 option in AutoScheme by @wenhuach21 in #1079
- adjust 3 bits hyperparameters at auto-round-best by @wenhuach21 in #1081
- Support alg_ext on windows by @chensuyue in #1082
- [vLLM Ext]Fix MXFP4 Quant by @yiliu30 in #1088
- remove numpy restriction for gptq kernel by @wenhuach21 in #1084
- Fix MXFP/NVFP + FP8 Attn/KV by @yiliu30 in #1086
- Remove accelerate version limitation by @chensuyue in #1090
- bump version to v0.9.3 by @chensuyue in #1091
- add system checker in backend by @wenhuach21 in #1097
- Refactor input normalization by replaying inputs for consistent preprocessing by @yiliu30 in #1094
- fix gguf acc and oom bug when iters > 0 by @n1ck-guo in #1098
- Add Python 3.14 to compatibility test pipeline by @XuehaoSun in #1096
- Fix typo in README.md by @xin3he in #1102
- refine lmhead ut by @WeiweiZhang1 in #1106
- Move packed res to cpu by @yiliu30 in #1104
- remove non-essential requirements by @n1ck-guo in #1103
- fix auto-scheme/alg-ext multiple devices issue by @wenhuach21 in #1107
- fix release version cuda ut fail by @n1ck-guo in #1110
- fix gguf packing device by @n1ck-guo in #1105
- fix quant fp8 model with iters=0 and scheme=nvfp4 by @n1ck-guo in #1114
- Move quantized block to cpu by @yiliu30 in #1115
- fix gguf extension bug by @wenhuach21 in #1116
- fix bug of triton pow wrong data_type when enable torch compile by @n1ck-guo in #1120
- Use modelscope cache in CPU UT by @XuehaoSun in #1124
- fix bug of outside block layers do not use best_params by @n1ck-guo in #1128
- Fix nvfp4&fp8 packing ram issue, refine all exporting format ram release for origin layer by @yiliu30 in #1129
- fix regression by @xin3he in #1135
- Upgrade llmc to main and add cuda UT by @yiliu30 in #1111
- Enable load MXFP4/MXFP8 + FP8 KV by @yiliu30 in #1095
- Remove duplicate packages by @XuehaoSun in #1139
- fix bug of auto scheme with user layer config by @n1ck-guo in #1133
- Add AutoRound binary build and publish workflow by @chensuyue in #1132
- update readme by @wenhuach21 in #1141
- update document for eval by @xin3he in #1140
- update windows binary for alg_ext by @chensuyue in #1142
- Fix device mismatch of nvfp fuse scale by @WeiweiZhang1 in #1143
- add low_cpu_mem_usage in cli by @n1ck-guo in #1146
- fix cuda ut fail by @n1ck-guo in #1144
- add llmc for cuda ut by @yiliu30 in #1145
- Added ark backend by @Zhenzhong1 in #1075
New Contributors
- @Zhenzhong1 made their first contribution in #1075
Full Changelog: v0.9.2...v0.9.3