Release GPT-QModel v5.8.0 · ModelCloud/GPTQModel

Notable Changes

Transformers 5.3.0 compatibility.
Video Quantization Support
- Added support for video input during quantization.
MoE & Model Support
- Added support for Qwen 3.5 and Qwen 3.5 MoE.
- Expanded compatibility for Qwen 3 variants including MoE / VL / Omni / Next.
- Added support for LLaDA2 block diffusion LLMs.
- Improved compatibility for Mixtral, Phi-4, Nemotron Ultra, BaiChuan, ChatGLM, Yi, and GLM4V.
- Fixed multiple MoE-specific AWQ and multi-GPU issues, including routing, module tree, position embeddings, and device mismatches.
AWQ / GPTQ Kernels
- Added CPU fused AWQ kernels for torch_fused and hf_kernel.
- Added torch_int8 AWQ kernel.
- Added BitBLAS AWQ kernel.
- Ported Intel int8 GPTQ/AWQ kernels.
- Updated kernel selection to prefer HF kernels where they provide the best performance and compatibility.
- Added BitBLAS fallback protection and fixed BitBLAS accuracy and qzero remap regressions.
Quantization Improvements
- Replaced greedy search with ternary search in SmoothBSE.
- Fixed SmoothMAD overly aggressive clipping.
- Added layer-level dynamic skip for fast quantization.
- Added early stop when all remaining layers are skipped during quantization.
- Fixed AWQ OOM and dequantization-related issues.
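
As a rough illustration of the greedy-to-ternary change noted above, the sketch below shows a generic ternary search that minimizes a one-dimensional unimodal loss. It is not the SmoothBSE code from this release: the function name `ternary_search_min`, its parameters, and the quadratic toy loss are all hypothetical, and it assumes the objective has a single minimum inside the search interval.

```python
from typing import Callable


def ternary_search_min(
    loss: Callable[[float], float],
    lo: float,
    hi: float,
    tol: float = 1e-6,
    max_iters: int = 200,
) -> float:
    """Return an approximate minimizer of a unimodal `loss` on [lo, hi].

    Hypothetical helper for illustration only; not part of GPTQModel.
    """
    for _ in range(max_iters):
        if hi - lo <= tol:
            break
        # Probe the interval at its one-third and two-thirds points.
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loss(m1) < loss(m2):
            # The minimum cannot lie in (m2, hi], so drop that third.
            hi = m2
        else:
            # The minimum cannot lie in [lo, m1), so drop that third.
            lo = m1
    return (lo + hi) / 2.0


# Toy objective standing in for a quantization error curve (assumption).
best = ternary_search_min(lambda x: (x - 0.3) ** 2, 0.0, 1.0)
# best is close to 0.3
```

Each iteration discards one third of the interval, so the number of loss evaluations grows logarithmically with the required precision. The trade-off is that the result is only reliable when the loss is unimodal over the interval.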
Runtime & Dequantization
- Added optional CPU int64 g_idx cache for TorchQuantLinear dequantization.
- Improved TorchFused dequantization and fp32 dtype support.
- Removed unnecessary symmetric handling in dequantize_gemm.
- Fixed rotary embedding device mismatch by storing per-device rotary copies.
- Added warmup protection for threaded timing.
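
The per-device rotary fix above follows a common pattern: keep one lazily created copy of a shared object per device, so a module running on one GPU never receives a buffer that lives on another. The following is a minimal sketch of that pattern, not the release's actual code. `PerDeviceCache`, `fake_move`, and the string device keys are hypothetical; in a PyTorch setting the `move_to` callable would typically deep-copy the module and call `.to(device)` on the copy.

```python
from typing import Any, Callable, Dict, List, Tuple


class PerDeviceCache:
    """Keeps one copy of a shared object per device key (illustrative only)."""

    def __init__(self, source: Any, move_to: Callable[[Any, str], Any]) -> None:
        self._source = source
        self._move_to = move_to
        self._copies: Dict[str, Any] = {}

    def get(self, device: str) -> Any:
        # Create the copy on first request, then reuse it for that device.
        if device not in self._copies:
            self._copies[device] = self._move_to(self._source, device)
        return self._copies[device]


# Stand-in for "copy the module and move it to the device" (assumption).
calls: List[str] = []


def fake_move(obj: Any, device: str) -> Tuple[Any, str]:
    calls.append(device)
    return (obj, device)


cache = PerDeviceCache("rotary_emb", fake_move)
a = cache.get("cuda:0")
b = cache.get("cuda:1")
c = cache.get("cuda:0")  # served from the cache, no second move
```

The copy is made once per device instead of on every forward pass, at the cost of holding one extra copy of the object in memory for each device in use.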
Defuser Integration
- Integrated defuser.convert_hf_model().
- Integrated defuser.materialize_model().
- Integrated defuser.replace_fused_blocks().
- Improved defuser meta/offload compatibility and fused block handling.
Compatibility Fixes
- Improved compatibility with older and newer Hugging Face Transformers / Optimum versions.
- Fixed import compatibility issues in models/utils.
- Fixed rotary / embedding config compatibility with older HF and model variants.
- Updated tokenicer to improve tokenizer and model compatibility.
- Fixed OSS compatibility issues.
Kernel / Backend Changes
- Hard-deprecated the ExLlama v1 kernel.
- Exposed the Triton patcher as an externally callable API.

What's Changed

support video input for quantization by @techshoww in #2386
feat: moe-router-bypass-batch-size by @avtc in #2349
[CI] use UV as python manager by @CSY-ModelCloud in #2415
[CI] fix deps installation & gpu service api path by @CSY-ModelCloud in #2416
[CI] auto release GPU if job has sth wrong or unrecoverable by @CSY-ModelCloud in #2417
[CI] save log to disk & fix deps installation by @CSY-ModelCloud in #2418
Replace Greedy with Ternary Search for SmoothBSE by @namgyu-youn in #2419
Feature/LLada2 support: Block Diffusion LLM by @blazingbhavneek in #2422
Bump the github-actions group with 2 updates by @dependabot[bot] in #2426
[MODEL] supports qwen3_5 by @ZX-ModelCloud in #2427
[FIX] eval bug for qwen3_5 quantized model by @ZX-ModelCloud in #2428
[MODEL] supports qwen3_5_moe by @ZX-ModelCloud in #2433
Update tokenicer dependency version to 0.0.7 by @Qubitium in #2434
Optional CPU g_idx int64 cache for TorchQuantLinear dequant path by @Qubitium in #2431
fix import compat issues for models/utils that is locked to higher ve… by @Qubitium in #2436
call defuser.convert_hf_model() by @ZX-ModelCloud in #2437
Update defuser dependency version to 0.0.3 by @Qubitium in #2439
quantize mlp experts module for qwen3_5_moe by @ZX-ModelCloud in #2443
Fix typo in setup.py causing wheel build failure (sys.abiflag -> sys.abiflags) by @beomchan0 in #2444
call defuser's materialize_model() by @ZX-ModelCloud in #2446
Update defuser dependency version to 0.0.4 by @Qubitium in #2447
port intel's int8 gptq/awq kernel over by @Qubitium in #2438
expose triton patcher as externally callable by @Qubitium in #2448
docs by @Qubitium in #2449
Add AWQ support for CPU fused kernels (torch_fused & hf_kernel) by @jiqing-feng in #2445
Cleanupx by @Qubitium in #2450
Make HF kernels for gptq/awq highest priority as they are the highest… by @Qubitium in #2451
rm sym in dequantize_gemm by @jiqing-feng in #2452
fix awq rotary device mismatch. store per-device copy of rotary by @Qubitium in #2453
add torch_int8 awq kernel by @Qubitium in #2454
[CI] move check log to a new step by @CSY-ModelCloud in #2455
cleanup hf kernel gptq/awq post_init loading by @Qubitium in #2457
fix SmoothMAD overly-aggressive clipping by @Qubitium in #2459
upgrade defuser version to 0.0.5 by @ZX-ModelCloud in #2460
[FIX] test_qwen3_5_moe by @ZX-ModelCloud in #2461
Update defuser dependency version to 0.0.6 by @Qubitium in #2462
fix awq oom by @CSY-ModelCloud in #2458
[CI] CUDA 131 + Torch 2.10.0 + Python 3.13 by @CSY-ModelCloud in #2463
Fix the module_tree in Qwen3_5_Moe to correctly support AWQ by @ZX-ModelCloud in #2464
[CI] fix git link cannot be installed by uv by @CSY-ModelCloud in #2465
[FIX] GEMM can't pack by @ZX-ModelCloud in #2466
[CI] add peft for test_asym_gptq_v1 & check log after test by @CSY-ModelCloud in #2467
[CI] get path error from log & install pre-compiled bitblas by @CSY-ModelCloud in #2468
[CI] fix log files were saved with wrong runid by @CSY-ModelCloud in #2469
[FIX] where qwen3_5_moe got incorrect position_embeddings during AWQ quantization by @ZX-ModelCloud in #2470
Update pypcre version to 0.2.13 by @CSY-ModelCloud in #2471
read dependencies from requirements.txt by @CSY-ModelCloud in #2472
add setuptools to requirements.txt by @CSY-ModelCloud in #2474
set minimum setuptools version to 78.1.1 by @CSY-ModelCloud in #2475
[FIX] device mismatch issue that occurred during multi-GPU AWQ quantization in moe Model by @ZX-ModelCloud in #2476
[CI] auto uninstall unneeded pkgs by @CSY-ModelCloud in #2478
fix ci failed tests by @ZX-ModelCloud in #2477
update mixtral's module_tree by @ZX-ModelCloud in #2480
Fix CI by @Qubitium in #2481
[CI] add pypi as backup by @CSY-ModelCloud in #2482
Ci fixes 2 by @Qubitium in #2483
CI Tests Fix 3 by @Qubitium in #2484
[CI] fix old models need old transformers by @CSY-ModelCloud in #2485
fix failed test by @ZX-ModelCloud in #2486
[CI] install latest bitblas & fix missing pkgs by @CSY-ModelCloud in #2487
BaiChuan fix by @Qubitium in #2488
Ci fix 5 by @Qubitium in #2489
Shell/Src module buffer registration mismatch + Qwen 2.5 VL patch by @Qubitium in #2490
[CI] install latest evalplus wheel by @CSY-ModelCloud in #2492
[CI] throw error for fast check by @CSY-ModelCloud in #2493
[FIX] test_post_quant_eora by @ZX-ModelCloud in #2494
llama 3.2 gptq + marlin score update by @Qubitium in #2495
[CI] fix pkg was installed twice if common list has one by @CSY-ModelCloud in #2497
fix rotary/embedding config hf compat with older models and older hf by @Qubitium in #2496
Update HFKernelLinear selection by @jiqing-feng in #2498
fix awq llama test scores by @Qubitium in #2499
fix ci test import/env by @Qubitium in #2500
fix ci test by @ZX-ModelCloud in #2501
Ci test bloom by @Qubitium in #2502
CI: Use generate_stable_with_limit by @Qubitium in #2503
[CI] test_multi_gpu_inference will get 5 GPUs by @CSY-ModelCloud in #2504
hard deprecate exllama v1 kernel by @Qubitium in #2505
[CI] fix multi-gpu by @CSY-ModelCloud in #2506
fix qwen3-vl compat by @Qubitium in #2507
Oss compat fix by @Qubitium in #2508
missing arg by @Qubitium in #2509
threaded timing needs warmup protection by @Qubitium in #2510
bitblas arch fallback protection by @Qubitium in #2511
baichuan fix: 1) rotary compat 2) buffers not correctly registered by @Qubitium in #2512
add missing by @Qubitium in #2513
refactor awq weight mean cpu/cuda paths by @Qubitium in #2514
Qwen 3 [MoE/VL/Omni/Next] Compat by @Qubitium in #2515
Ci qwen2 compat by @Qubitium in #2516
defuser and meta/offload fixes by @Qubitium in #2517
hf transformer/optimum compat by @Qubitium in #2518
Ci fix 11 by @Qubitium in #2519
fix mmlupro compat by @Qubitium in #2520
tokenicer update for model compat by @Qubitium in #2521
add fast/slow mode for ci testing by @Qubitium in #2522
fix bitblas kernel accuracy by enabling fp32 accumulation by @Qubitium in #2523
[CI] Clean up transformers dependencies in deps.yaml by @CSY-ModelCloud in #2525
[CI] max-parallel jobs -> 6 by @CSY-ModelCloud in #2527
fix ci test by @ZX-ModelCloud in #2526
add bitblas awq kernel by @Qubitium in #2524
[CI] print env after all installations by @CSY-ModelCloud in #2528
use layer-level dynamic skip for fast quant by @ZX-ModelCloud in #2529
TorchFused: Add fp32 dtype support & fix AWQ dequantize by @jiqing-feng in #2531
[CI] install bitblas for test_awq_bitblas by @CSY-ModelCloud in #2532
[CI] fix pipeline exit code was covered by @CSY-ModelCloud in #2533
[CI] add missing deps & delete unneeded test by @CSY-ModelCloud in #2534
use local model path & [CI] add HF_TOKEN for CI by @CSY-ModelCloud in #2535
Bitblas gptq bf16 fix by @Qubitium in #2530
Stop quantization early if all subsequent layers are skipped by @ZX-ModelCloud in #2536
[CI] set MAX_JOBS to nproc by @CSY-ModelCloud in #2539
fix model's path by @CSY-ModelCloud in #2540
Fix bitblas qzero remap regression by @Qubitium in #2541
granite non-standard parameter turtle/shell sync issue by @Qubitium in #2542
phi4 compat fix by @Qubitium in #2543
nemotron ultra compat by @Qubitium in #2544
call defuser.replace_fused_blocks() by @ZX-ModelCloud in #2545
auto force fp16 for awq when working kernel does not support bf16 by @Qubitium in #2546
bump defuser to 0.0.12 by @CSY-ModelCloud in #2547
[FIX] replace random-based string generator with secrets by @ZX-ModelCloud in #2549
[CI] fix bitblas cache invalid ELF header by @CSY-ModelCloud in #2548
[FIX] quant_log.csv was not saved upon completion of AWQ quantization by @ZX-ModelCloud in #2550
[CI] remove unused parameters by @CSY-ModelCloud in #2551
Ci fix ovis 1 6 by @Qubitium in #2552
[CI] compile with python build module & fix git files failed by @CSY-ModelCloud in #2554
patch yi and glm4v by @Qubitium in #2553
Update requirements.txt by @Qubitium in #2555
upgrade defuser to 0.0.15 by @ZX-ModelCloud in #2557
Compat regression fix by @Qubitium in #2556
update marin score slow by @Qubitium in #2558
[CI] revert build by @CSY-ModelCloud in #2561
update require defuser version to 0.0.15 by @ZX-ModelCloud in #2562
[CI] use python -m build -w --no-isolation by @CSY-ModelCloud in #2563
update module_tree for phimoe by @ZX-ModelCloud in #2564
[CI] remove torch 2.9.1 build by @CSY-ModelCloud in #2565
[CI] fix vllm test in test_eval by @CSY-ModelCloud in #2566
update readme by @Qubitium in #2567
fix ci test by @ZX-ModelCloud in #2568
[FIX] test_chatglm by @ZX-ModelCloud in #2569
fix test_chatglm_test by @ZX-ModelCloud in #2570
Emit version by @Qubitium in #2571
add torch 2.10 wheels by @Qubitium in #2572
update logbar by @Qubitium in #2574

New Contributors

@blazingbhavneek made their first contribution in #2422
@beomchan0 made their first contribution in #2444

Full Changelog: v5.7.0...v5.8.0
