GPT-QModel v5.8.0
Notable Changes
- Transformers 5.3.0 compatibility.
Video Quantization Support
- Added support for video input during quantization.
MoE & Model Support
- Added support for Qwen 3.5 and Qwen 3.5 MoE.
- Expanded compatibility for Qwen 3 variants including MoE / VL / Omni / Next.
- Added support for LLada2 block diffusion LLM models.
- Improved compatibility for Mixtral, Phi-4, Nemotron Ultra, BaiChuan, ChatGLM, Yi, and GLM4V.
- Fixed multiple MoE-specific AWQ and multi-GPU issues, including routing, module tree, position embeddings, and device mismatches.
AWQ / GPTQ Kernels
- Added CPU fused AWQ kernels for torch_fused and hf_kernel.
- Added torch_int8 AWQ kernel.
- Added BitBLAS AWQ kernel.
- Ported Intel int8 GPTQ/AWQ kernels.
- Updated kernel selection to prefer HF kernels where they provide the best performance and compatibility.
- Added BitBLAS fallback protection and fixed BitBLAS accuracy and qzero remap regressions.
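The kernel-selection and fallback items above can be pictured as a priority-ordered probe over candidate backends. The following is only a rough sketch under stated assumptions: `KernelCandidate`, `select_kernel`, the priority numbers, and the probe functions are hypothetical names made up for illustration and are not GPT-QModel's actual selection code or API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class KernelCandidate:
    """Hypothetical description of one backend (not a GPT-QModel class)."""
    name: str
    priority: int                   # higher values are tried first
    is_usable: Callable[[], bool]   # probe for hardware / dtype support


def select_kernel(candidates: List[KernelCandidate]) -> Optional[str]:
    """Return the name of the highest-priority usable kernel, or None."""
    for cand in sorted(candidates, key=lambda c: c.priority, reverse=True):
        try:
            if cand.is_usable():
                return cand.name
        except Exception:
            # A probe that raises is treated as "not usable" so that
            # selection falls through to the next candidate.
            continue
    return None


def failing_probe() -> bool:
    # Stand-in for a backend that cannot run on the current architecture.
    raise RuntimeError("unsupported architecture")


candidates = [
    KernelCandidate("torch", 0, lambda: True),
    KernelCandidate("bitblas", 10, failing_probe),
    KernelCandidate("hf_kernel", 20, lambda: False),
]
chosen = select_kernel(candidates)
print(chosen)  # prints torch
```

In this toy setup the preferred candidate reports itself unusable and the second one fails its probe, so selection falls back to the lowest-priority entry instead of raising.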
Quantization Improvements
- Replaced greedy search with ternary search in SmoothBSE.
- Fixed SmoothMAD overly aggressive clipping.
- Added layer-level dynamic skip for fast quantization.
- Added early stop when all remaining layers are skipped during quantization.
- Fixed AWQ OOM and dequantization-related issues.
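For readers unfamiliar with the ternary search mentioned in the SmoothBSE item, here is a minimal generic sketch. It assumes the objective is unimodal over the search interval; the notes do not describe SmoothBSE's actual objective or search range, so `toy_loss`, the interval, and the tolerance below are illustrative assumptions only.

```python
from typing import Callable


def ternary_search_min(
    loss: Callable[[float], float],
    lo: float,
    hi: float,
    tol: float = 1e-6,
    max_iters: int = 200,
) -> float:
    """Approximate the minimizer of a unimodal loss on [lo, hi]."""
    for _ in range(max_iters):
        if hi - lo <= tol:
            break
        third = (hi - lo) / 3.0
        m1 = lo + third
        m2 = hi - third
        if loss(m1) < loss(m2):
            hi = m2  # the minimum cannot lie in (m2, hi]
        else:
            lo = m1  # the minimum cannot lie in [lo, m1)
    return (lo + hi) / 2.0


def toy_loss(scale: float) -> float:
    # Hypothetical stand-in for a quantization-error curve.
    return (scale - 0.7) ** 2 + 0.1


best = ternary_search_min(toy_loss, 0.0, 1.0)
print(round(best, 3))  # prints 0.7
```

Each iteration discards a third of the interval, so the number of loss evaluations grows logarithmically with the requested precision, whereas a fixed-step scan grows linearly with it.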
Runtime & Dequantization
- Added optional CPU int64 g_idx cache for TorchQuantLinear dequantization.
- Improved TorchFused dequantization and fp32 dtype support.
- Removed unnecessary symmetric handling in dequantize_gemm.
- Fixed rotary embedding device mismatch by storing per-device rotary copies.
- Added warmup protection for threaded timing.
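The "per-device rotary copies" fix above follows a common pattern: keep one copy of a shared object per device and hand back the copy that matches the caller's device. The sketch below illustrates that pattern only; `PerDeviceCache` and `fake_move` are hypothetical names, and this is not the code GPT-QModel uses.

```python
from typing import Callable, Dict, Generic, List, Tuple, TypeVar

T = TypeVar("T")


class PerDeviceCache(Generic[T]):
    """Lazily keeps one copy of `source` per device string (illustrative)."""

    def __init__(self, source: T, move: Callable[[T, str], T]) -> None:
        self._source = source
        self._move = move               # how to place `source` on a device
        self._copies: Dict[str, T] = {}

    def get(self, device: str) -> T:
        # Create the copy on first use, then reuse it for later calls.
        if device not in self._copies:
            self._copies[device] = self._move(self._source, device)
        return self._copies[device]


moves: List[str] = []


def fake_move(obj: str, device: str) -> Tuple[str, str]:
    # Stand-in for a real device transfer; records each transfer request.
    moves.append(device)
    return (obj, device)


cache = PerDeviceCache("rotary", fake_move)
a = cache.get("cuda:0")
b = cache.get("cuda:0")   # reuses the first copy, no second transfer
c = cache.get("cuda:1")
print(moves)  # prints ['cuda:0', 'cuda:1']
```

With PyTorch tensors or modules, the `move` callable could plausibly be `lambda t, d: t.to(d)`, so that a layer running on a given GPU always receives a copy that already lives on that GPU.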
Defuser Integration
- Integrated defuser.convert_hf_model().
- Integrated defuser.materialize_model().
- Integrated defuser.replace_fused_blocks().
- Improved defuser meta/offload compatibility and fused block handling.
Compatibility Fixes
- Improved compatibility with older and newer Hugging Face Transformers / Optimum versions.
- Fixed import compatibility issues in models/utils.
- Fixed rotary / embedding config compatibility with older HF and model variants.
- Improved tokenizer and model compatibility updates related to tokenicer.
- Fixed OSS compatibility issues.
Kernel / Backend Changes
- Hard deprecated ExLLaMA v1 kernel.
- Exposed the Triton patcher as an externally callable API.
What's Changed
- support video input for quantization by @techshoww in #2386
- feat: moe-router-bypass-batch-size by @avtc in #2349
- [CI] use UV as python manager by @CSY-ModelCloud in #2415
- [CI] fix deps installation & gpu service api path by @CSY-ModelCloud in #2416
- [CI] auto release GPU if job has sth wrong or unrecoverable by @CSY-ModelCloud in #2417
- [CI] save log to disk & fix deps installation by @CSY-ModelCloud in #2418
- Replace Greedy with Ternary Search for SmoothBSE by @namgyu-youn in #2419
- Feature/LLada2 support: Block Diffusion LLM by @blazingbhavneek in #2422
- Bump the github-actions group with 2 updates by @dependabot[bot] in #2426
- [MODEL] supports qwen3_5 by @ZX-ModelCloud in #2427
- [FIX] eval bug for qwen3_5 quantized model by @ZX-ModelCloud in #2428
- [MODEL] supports qwen3_5_moe by @ZX-ModelCloud in #2433
- Update tokenicer dependency version to 0.0.7 by @Qubitium in #2434
- Optional CPU g_idx int64 cache for TorchQuantLinear dequant path by @Qubitium in #2431
- fix import compat issues for models/utils that is locked to higher ve… by @Qubitium in #2436
- call defuser.convert_hf_model() by @ZX-ModelCloud in #2437
- Update defuser dependency version to 0.0.3 by @Qubitium in #2439
- quantize mlp experts module for qwen3_5_moe by @ZX-ModelCloud in #2443
- Fix typo in setup.py causing wheel build failure (sys.abiflag -> sys.abiflags) by @beomchan0 in #2444
- call defuser's materialize_model() by @ZX-ModelCloud in #2446
- Update defuser dependency version to 0.0.4 by @Qubitium in #2447
- port intel's int8 gptq/awq kernel over by @Qubitium in #2438
- expose triton patcher as externally callable by @Qubitium in #2448
- docs by @Qubitium in #2449
- Add AWQ support for CPU fused kernels (torch_fused & hf_kernel) by @jiqing-feng in #2445
- Cleanupx by @Qubitium in #2450
- Make HF kernels for gptq/awq highest priority as they are the highest… by @Qubitium in #2451
- rm sym in dequantize_gemm by @jiqing-feng in #2452
- fix awq rotary device mismatch. store per-device copy of rotary by @Qubitium in #2453
- add torch_int8 awq kernel by @Qubitium in #2454
- [CI] move check log to a new step by @CSY-ModelCloud in #2455
- cleanup hf kernel gptq/awq post_init loading by @Qubitium in #2457
- fix SmoothMAD overly-aggressive clipping by @Qubitium in #2459
- upgrade defuser version to 0.0.5 by @ZX-ModelCloud in #2460
- [FIX] test_qwen3_5_moe by @ZX-ModelCloud in #2461
- Update defuser dependency version to 0.0.6 by @Qubitium in #2462
- fix awq oom by @CSY-ModelCloud in #2458
- [CI] CUDA 131 + Torch 2.10.0 + Python 3.13 by @CSY-ModelCloud in #2463
- Fix the module_tree in Qwen3_5_Moe to correctly support AWQ by @ZX-ModelCloud in #2464
- [CI] fix git link cannot be installed by uv by @CSY-ModelCloud in #2465
- [FIX] GEMM can't pack by @ZX-ModelCloud in #2466
- [CI] add peft for test_asym_gptq_v1 & check log after test by @CSY-ModelCloud in #2467
- [CI] get path error from log & install pre-compiled bitblas by @CSY-ModelCloud in #2468
- [CI] fix log files were saved with wrong runid by @CSY-ModelCloud in #2469
- [FIX] where qwen3_5_moe got incorrect position_embeddings during AWQ quantization by @ZX-ModelCloud in #2470
- Update pypcre version to 0.2.13 by @CSY-ModelCloud in #2471
- read dependencies from requirements.txt by @CSY-ModelCloud in #2472
- add setuptools to requirements.txt by @CSY-ModelCloud in #2474
- set minimum setuptools version to 78.1.1 by @CSY-ModelCloud in #2475
- [FIX] device mismatch issue that occurred during multi-GPU AWQ quantization in moe Model by @ZX-ModelCloud in #2476
- [CI] auto uninstall unneeded pkgs by @CSY-ModelCloud in #2478
- fix ci failed tests by @ZX-ModelCloud in #2477
- update mixtral's module_tree by @ZX-ModelCloud in #2480
- Fix CI by @Qubitium in #2481
- [CI] add pypi as backup by @CSY-ModelCloud in #2482
- Ci fixes 2 by @Qubitium in #2483
- CI Tests Fix 3 by @Qubitium in #2484
- [CI] fix old models need old transformers by @CSY-ModelCloud in #2485
- fix failed test by @ZX-ModelCloud in #2486
- [CI] install latest bitblas & fix missing pkgs by @CSY-ModelCloud in #2487
- BaiChuan fix by @Qubitium in #2488
- Ci fix 5 by @Qubitium in #2489
- Shell/Src module buffer registration mismatch + Qwen 2.5 VL patch by @Qubitium in #2490
- [CI] install latest evalplus wheel by @CSY-ModelCloud in #2492
- [CI] throw error for fast check by @CSY-ModelCloud in #2493
- [FIX] test_post_quant_eora by @ZX-ModelCloud in #2494
- llama 3.2 gptq + marlin score update by @Qubitium in #2495
- [CI] fix pkg was installed twice if common list has one by @CSY-ModelCloud in #2497
- fix rotary/embedding config hf compat with older models and older hf by @Qubitium in #2496
- Update HFKernelLinear selection by @jiqing-feng in #2498
- fix awq llama test scores by @Qubitium in #2499
- fix ci test import/env by @Qubitium in #2500
- fix ci test by @ZX-ModelCloud in #2501
- Ci test bloom by @Qubitium in #2502
- CI: Use generate_stable_with_limit by @Qubitium in #2503
- [CI] test_multi_gpu_inference will get 5 GPUs by @CSY-ModelCloud in #2504
- hard deprecate exllama v1 kernel by @Qubitium in #2505
- [CI] fix multi-gpu by @CSY-ModelCloud in #2506
- fix qwen3-vl compat by @Qubitium in #2507
- Oss compat fix by @Qubitium in #2508
- missing arg by @Qubitium in #2509
- threaded timing needs warmup protection by @Qubitium in #2510
- bitblas arch fallback protection by @Qubitium in #2511
- baichuan fix: 1) rotary compat 2) buffers not correctly registered by @Qubitium in #2512
- add missing by @Qubitium in #2513
- refactor awq weight mean cpu/cuda paths by @Qubitium in #2514
- Qwen 3 [MoE/VL/Omni/Next] Compat by @Qubitium in #2515
- Ci qwen2 compat by @Qubitium in #2516
- defuser and meta/offload fixes by @Qubitium in #2517
- hf transformer/optimum compat by @Qubitium in #2518
- Ci fix 11 by @Qubitium in #2519
- fix mmlupro compat by @Qubitium in #2520
- tokenicer update for model compat by @Qubitium in #2521
- add fast/slow mode for ci testing by @Qubitium in #2522
- fix bitblas kernel accuracy by enabling fp32 accumulation by @Qubitium in #2523
- [CI] Clean up transformers dependencies in deps.yaml by @CSY-ModelCloud in #2525
- [CI] max-parallel jobs -> 6 by @CSY-ModelCloud in #2527
- fix ci test by @ZX-ModelCloud in #2526
- add bitblas awq kernel by @Qubitium in #2524
- [CI] print env after all installations by @CSY-ModelCloud in #2528
- use layer-level dynamic skip for fast quant by @ZX-ModelCloud in #2529
- TorchFused: Add fp32 dtype support & fix AWQ dequantize by @jiqing-feng in #2531
- [CI] install bitblas for test_awq_bitblas by @CSY-ModelCloud in #2532
- [CI] fix pipeline exit code was covered by @CSY-ModelCloud in #2533
- [CI] add missing deps & delete unneeded test by @CSY-ModelCloud in #2534
- use local model path & [CI] add HF_TOKEN for CI by @CSY-ModelCloud in #2535
- Bitblas gptq bf16 fix by @Qubitium in #2530
- Stop quantization early if all subsequent layers are skipped by @ZX-ModelCloud in #2536
- [CI] set MAX_JOBS to nproc by @CSY-ModelCloud in #2539
- fix model's path by @CSY-ModelCloud in #2540
- Fix bitblas qzero remap regression by @Qubitium in #2541
- granite non-standard parameter turtle/shell sync issue by @Qubitium in #2542
- phi4 compat fix by @Qubitium in #2543
- nemotron ultra compat by @Qubitium in #2544
- call defuser.replace_fused_blocks() by @ZX-ModelCloud in #2545
- auto force fp16 for awq when working kernel does not support bf16 by @Qubitium in #2546
- bump defuser to 0.0.12 by @CSY-ModelCloud in #2547
- [FIX] replace random-based string generator with secrets by @ZX-ModelCloud in #2549
- [CI] fix bitblas cache invalid ELF header by @CSY-ModelCloud in #2548
- [FIX] quant_log.csv was not saved upon completion of AWQ quantization by @ZX-ModelCloud in #2550
- [CI] remove unused parameters by @CSY-ModelCloud in #2551
- Ci fix ovis 1 6 by @Qubitium in #2552
- [CI] compile with python build module & fix git files failed by @CSY-ModelCloud in #2554
- patch yi and glm4v by @Qubitium in #2553
- Update requirements.txt by @Qubitium in #2555
- upgrade defuser to 0.0.15 by @ZX-ModelCloud in #2557
- Compat regression fix by @Qubitium in #2556
- update marlin score slow by @Qubitium in #2558
- [CI] revert build by @CSY-ModelCloud in #2561
- update require defuser version to 0.0.15 by @ZX-ModelCloud in #2562
- [CI] use python -m build -w --no-isolation by @CSY-ModelCloud in #2563
- update module_tree for phimoe by @ZX-ModelCloud in #2564
- [CI] remove torch 2.9.1 build by @CSY-ModelCloud in #2565
- [CI] fix vllm test in test_eval by @CSY-ModelCloud in #2566
- update readme by @Qubitium in #2567
- fix ci test by @ZX-ModelCloud in #2568
- [FIX] test_chatglm by @ZX-ModelCloud in #2569
- fix test_chatglm_test by @ZX-ModelCloud in #2570
- Emit version by @Qubitium in #2571
- add torch 2.10 wheels by @Qubitium in #2572
- update logbar by @Qubitium in #2574
New Contributors
- @blazingbhavneek made their first contribution in #2422
- @beomchan0 made their first contribution in #2444
Full Changelog: v5.7.0...v5.8.0