Release 🚀 GPTQModel v7.0.0 · ModelCloud/GPTQModel

🔥 Major

New Huawei Ascend NPU quantization support, with Torch-based kernels for inference
All CUDA/ROCm compiled kernels are now JIT (just-in-time) compiled on first use
Pip/UV install no longer requires the --no-build-isolation flag

🧠 New model support and compatibility wins

Added support for GLM 5/5.1, GLM OCR, GLM ASR, Gemma 3n, Falcon Mamba, and InternVL Chat.
Extended OpenVINO GPTQ patching to understand GPTQModel's newer kernels.
Fixed Qwen3 dtype handling, Qwen3.5 MoE module-tree assertions, Qwen2-VL calibration input capture, and Qwen 3.6 MoE regressions.
Fixed Llama4Router replacement behavior, Phi-3 defused MLP module mapping, Phi-4 runtime requirements, Instella rope-scaling compatibility, Ling compatibility, Mixtral MoE checkpoint module names, Brumby thread safety, Baichuan compatibility, and Gemma 3 saving.
Fixed exllamav3_torch import under meta-device context.

⚡ Kernels, JIT, and hardware acceleration

Moved all kernels that require compilation to JIT compilation on first use, and cleaned up Marlin import probing, CUDA header handling, nvcc flag checks, and Torch/CUDA mismatch handling.
Synced Marlin/Machete kernels with upstream and added hardware-specific Marlin boost paths.
Guarded CUTLASS version mismatches and fixed generated-kernel staleness.
Added global kernel rebuild support for CI and safer shared extension locks.
Added Ascend NPU support.
Fixed AWQ JIT cache invalidation, illegal memory access, SM120 execution, GEMM_Fast shared-memory launch, and BF16 bias validation.
Fixed BACKEND.MARLIN loading for gptq_v2 format and added Marlin import coverage.
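
As a rough illustration of the "shared extension locks" item above, the sketch below guards every JIT build with a single process-wide lock rather than one lock per extension. All names here (`_EXTENSION_LOCK`, `_LOADED`, `load_extension_once`, `build_fn`) are hypothetical and chosen for this example only; this is not GPTQModel's internal API, only a minimal sketch of the pattern.

```python
import threading

# Hypothetical names: an illustration only, not GPTQModel's internal API.
_EXTENSION_LOCK = threading.Lock()  # one lock shared by every extension
_LOADED = {}                        # extension name -> built object


def load_extension_once(name, build_fn):
    """Build `name` at most once, even when many threads request it."""
    cached = _LOADED.get(name)
    if cached is not None:
        return cached
    with _EXTENSION_LOCK:
        # Re-check inside the lock: another thread may have finished the build
        # while this one was waiting.
        if name not in _LOADED:
            _LOADED[name] = build_fn()
        return _LOADED[name]
```

A single lock serializes all builds, which trades some parallelism for simpler reasoning about lock ordering. That trade-off is an assumption of this sketch, not a statement about the project's rationale.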

🔥 Quantization, AWQ, FP8, and dequant

Added FP8/FP4 CPU dequant and DeepSeek FP8 .scale dequant export.
Added dtype auto-decoding and decode path updates.
Reduced AWQ scale-search activation memory and split AWQ integration tests for cleaner coverage.
Now fails fast on unsupported act-group-aware GPTQ shapes instead of continuing into invalid layouts.
Fixed INT3 qzero format conversion, GAR width compatibility, and GPTQ batched keep-mask handling.
Improved AWQ W4A8 and BF16 validation paths, plus post-quant MoE routing behavior.
Used loader device selection for EoRA adapter generation.
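
To make the FP8 CPU dequant item above more concrete, here is a minimal pure-Python sketch that decodes FP8 E4M3FN bytes and multiplies them by a per-block scale. The function names, the block size, and the scale layout are assumptions for illustration. It does not reproduce GPTQModel's implementation, and it takes no position on whether a given checkpoint stores a scale or its inverse.

```python
def decode_e4m3fn(byte):
    """Decode one FP8 E4M3FN byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0x0F
    mant = byte & 0x07
    if exp == 0x0F and mant == 0x07:
        return float("nan")  # E4M3FN has NaN but no infinities
    if exp == 0:
        return sign * (mant / 8.0) * 2.0 ** -6  # subnormal range
    return sign * (1.0 + mant / 8.0) * 2.0 ** (exp - 7)


def dequant_blockwise(codes, scales, block=2):
    """Scale each decoded value by the scale of the tile it falls in.

    `codes` is a 2D list of FP8 bytes. `scales[i][j]` is assumed to cover the
    `block` x `block` tile at row-block i, column-block j.
    """
    return [
        [decode_e4m3fn(b) * scales[r // block][c // block] for c, b in enumerate(row)]
        for r, row in enumerate(codes)
    ]
```

For example, `decode_e4m3fn(0x38)` is `1.0` and `decode_e4m3fn(0x7E)` is `448.0`, the largest finite E4M3FN value.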

🐢 LazyTurtle, loading, and model plumbing

Refactored input capture into BaseQModel and model-specific QModels for cleaner replay and calibration flows.
Renamed and hardened the turtle path into LazyTurtle, with stricter materialization failures and better expected-skip handling.
Fixed LazyTurtle materialization for non-square fused experts, PhiMoE, nested HF weight renames, reversed WeightRenaming semantics, and non-Safetensors checkpoints.
Improved out-of-model tensor handling for MTP prefix/files paths.
Removed BaseModel.loader_requires_dtype and normalized config dtype handling through get_hf_config_dtype().
Fixed multi-GPU replay output retention, GPTQ finalizer overlap, and quantization OOMs from retained callable cache keys.

🧰 CI, packaging, and developer workflow

Cleaned up CI shell logic, environment setup, UV cache handling, reusable Torch tests, CPU-only grouping, runner selection, retry behavior, and offload temp paths.
Kept CI and Torch CUDA versions aligned, moved to newer Docker images, and surfaced real exit codes and GPU names.
Removed lm-eval, deprecated tests, deprecated artifact IDs, pause UI lifecycle code, and tabulate from CI/test paths.
Migrated more regex usage to pcre/pcre2.
Replaced temp path helpers with tempfile.TemporaryDirectory() for automatic cleanup.
Updated requirements, dependencies, setuptools compatibility, and install-with-Torch validation.
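
The move to `tempfile.TemporaryDirectory()` noted above relies on standard-library behavior: the directory and its contents are removed when the context manager exits. A minimal sketch follows; `write_scratch_file` and the `gptqmodel_test_` prefix are hypothetical names used only for this example.

```python
import os
import tempfile


def write_scratch_file(payload: bytes) -> str:
    """Write `payload` to a scratch file and return the directory that held it."""
    with tempfile.TemporaryDirectory(prefix="gptqmodel_test_") as tmp_dir:
        path = os.path.join(tmp_dir, "scratch.bin")
        with open(path, "wb") as f:
            f.write(payload)
        assert os.path.getsize(path) == len(payload)
    # Leaving the `with` block deletes the directory and everything in it.
    return tmp_dir
```

After the call returns, the returned path no longer exists on disk, so tests do not need a separate cleanup step.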

💥 Breaking and removed

Kernel loading behavior has shifted heavily toward JIT compilation, so custom deployment environments should verify compiler/CUDA compatibility.
lm-eval references were removed from CI and test/docs paths.
Deprecated tests, artifact handling, and pause UI lifecycle code were removed.
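
Since kernels are now JIT-compiled on first use, a deployment may want to check its toolchain before the first model load. The sketch below compares a local `nvcc` release against the CUDA version Torch reports. The helper names are hypothetical, and requiring an exact major.minor match is a conservative assumption of this example, not a documented GPTQModel requirement.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(version_text):
    """Extract 'MAJOR.MINOR' from `nvcc --version` output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def local_nvcc_release():
    """Return the local nvcc release string, or None if nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    out = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_nvcc_release(out.stdout)


def cuda_versions_match(torch_cuda, nvcc_release):
    """Compare the CUDA version Torch was built with against local nvcc."""
    if torch_cuda is None or nvcc_release is None:
        return False
    return torch_cuda.split(".")[:2] == nvcc_release.split(".")[:2]
```

In a real environment one would call `cuda_versions_match(torch.version.cuda, local_nvcc_release())`. `torch.version.cuda` is `None` on CPU-only Torch builds, which this sketch treats as a mismatch.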

Full Changelog:

Refactor input capture flow into BaseQModel and model-specific QModels by @ZX-ModelCloud in #2666
[CI] adjust venv logic by @CSY-ModelCloud in #2667
[CI] remove verbose log flag for build by @CSY-ModelCloud in #2669
Move more kernels to JIT compile path by @Qubitium in #2668
Kernels migrate to jit compile by @Qubitium in #2670
remove hf kernels dependency for cpu by @Qubitium in #2671
fix marlin import paths probe by @Qubitium in #2673
fix failed test by @Qubitium in #2674
Extend OpenVINO's GPTQ patcher to understand GPTQModel new kernels. by @ZX-ModelCloud in #2675
[CI] use same cuda version for CI & torch by @CSY-ModelCloud in #2676
Handle mtp prefix/files in out_of_model_tensors by @ZX-ModelCloud in #2677
bonsai refactor by @Qubitium in #2672
glm 5/5.1 support by @Qubitium in #2680
Normalize config dtype to torch.dtype in get_hf_config_dtype() by @ZX-ModelCloud in #2681
[FIX] Qwen3ForCausalLM does not require the dtype argument. by @ZX-ModelCloud in #2682
fix jit error because torch's cuda mismatches local nvcc version by @CSY-ModelCloud in #2683
fix: rotary_embed init by @Qubitium in #2684
remove BaseModel.loader_requires_dtype by @ZX-ModelCloud in #2686
[CI] no need build step by @CSY-ModelCloud in #2688
refactor turtle to lazy by @Qubitium in #2687
[CI] fix jobs are skipped by @CSY-ModelCloud in #2689
[FIX] multi-GPU replay output retention OOM by @ZX-ModelCloud in #2692
fp8/fp4 cpu dequant by @Qubitium in #2691
refactor all monkeypatches to use same lock by @ZX-ModelCloud in #2693
[CI] decrease max parallel jobs to 4 by @CSY-ModelCloud in #2695
[FIX] All cpp extensions should share the same lock instead of using a map of locks by @ZX-ModelCloud in #2696
dtype auto decoder by @Qubitium in #2690
Decode update by @Qubitium in #2698
refactor processors by @Qubitium in #2697
fix cuda header path conflict by @Qubitium in #2701
[CI] add prefix for env name by @CSY-ModelCloud in #2704
ignore .codex by @CSY-ModelCloud in #2703
fix: stabilize baichuan compat test by @Qubitium in #2702
Fix Qwen3.5 MoE module tree assertion by @Qubitium in #2705
[CI] remove UV_INDEX_URL by @CSY-ModelCloud in #2706
Fix: LazyTurtle materialization for non-square fused experts by @ZX-ModelCloud in #2707
[CI] uv won't R/W /monster now by @CSY-ModelCloud in #2708
Fix AWQ JIT cache invalidation by @Qubitium in #2709
Split AWQ integration tests by @Qubitium in #2710
Fix CI to install ModelCloud deps from git by @Qubitium in #2711
migrate stdlib.re to pcre2 by @Qubitium in #2712
[CI] show real exit code by @CSY-ModelCloud in #2713
Remove lm-eval from CI and test/docs references by @Qubitium in #2714
Remove pause UI controller lifecycle by @Qubitium in #2715
[CI] re-mount /monster for uv by @CSY-ModelCloud in #2718
Sync Marlin/Machete Kernel with upstream by @Qubitium in #2717
Fix GPU CI allocation and streaming regressions by @Qubitium in #2719
Guard CUTLASS version mismatches by @Qubitium in #2720
Fix Marlin generated kernel staleness by @Qubitium in #2721
Fix balanced MoE vram usage by @Qubitium in #2716
Fix bias dtype and validate AWQ bf16 ops by @Qubitium in #2722
Raise on LazyTurtle materialization failures and silence expected skips by @ZX-ModelCloud in #2723
HW specific boost for Marlin by @Qubitium in #2724
Update requirements.txt by @Qubitium in #2725
[CI] share common venvs & add lock when installing pkgs by @CSY-ModelCloud in #2726
[CI] set uv cache for different envs by @CSY-ModelCloud in #2728
[FIX] Mixtral MoE checkpoint module name may not match modeling code by @ZX-ModelCloud in #2727
Delete redundant AWQ logs. by @ZX-ModelCloud in #2731
[CI] clean sh codes, simplify logic by @CSY-ModelCloud in #2730
Sync setuptools by @Qubitium in #2732
[CI] test install with torch by @CSY-ModelCloud in #2733
[CI] update release CI to latest docker by @CSY-ModelCloud in #2734
[CI] check setuptools compatibility by @CSY-ModelCloud in #2735
[CI] fix compat test by @CSY-ModelCloud in #2736
[CI] remove unused params by @CSY-ModelCloud in #2737
Disable MoE routing config in post-quant forward by @ZX-ModelCloud in #2738
file cleanup by @Qubitium in #2740
[CI] mount ci host path as uv cache dir by @CSY-ModelCloud in #2741
marlin: cleanup by @Qubitium in #2742
Fix nvcc flag check by @Qubitium in #2745
[FIX] Avoid replacing modules like Llama4Router with HookedLinear by @ZX-ModelCloud in #2744
Update depends by @Qubitium in #2746
[CI] show gpu name on response by @CSY-ModelCloud in #2747
[CI] no sn by @CSY-ModelCloud in #2748
fix brumby compat, thread safety by @Qubitium in #2749
[CI] always start with a new clean env by @CSY-ModelCloud in #2750
Add Qwen 3.6 MoE quantization regressions by @Qubitium in #2752
[FIX] phimoe quantization error with LazyTurtle by @ZX-ModelCloud in #2754
[CI] refactor CI & mark no gpu tests by @CSY-ModelCloud in #2755
[CI] no gpu tests first by @CSY-ModelCloud in #2756
fix model def tree execution order ground truth by @Qubitium in #2753
[CI] fix CI cannot get runner group, use cpu model instead by @CSY-ModelCloud in #2758
[CI] fix no arg for this func by @CSY-ModelCloud in #2759
Looper fix by @Qubitium in #2757
[CI] late import device_smi, so no need to install it in pre env by @CSY-ModelCloud in #2760
Fail fast for unsupported act-group-aware GPTQ shapes by @ZX-ModelCloud in #2761
fix gar width compat by @Qubitium in #2762
fix int3 qzero format conversion by @Qubitium in #2763
Fix Phi-3 defused MLP module mapping by @Qubitium in #2765
Fix ci exposed compat issues by @Qubitium in #2766
Fix ci regressions by @Qubitium in #2767
[CI] install required pkgs for tests by @CSY-ModelCloud in #2768
[FIX] test_instella by @ZX-ModelCloud in #2769
[CI] add compute cap requirement for special tests by @CSY-ModelCloud in #2770
[CI] allow CI skip test & clean ci logic by @CSY-ModelCloud in #2771
[CI] install scipy for test_phi_4 by @CSY-ModelCloud in #2773
fix mmlu ValueError by @CSY-ModelCloud in #2774
Reduce AWQ scale-search activation memory by @Qubitium in #2775
[REFACTOR] replace checkpoint_path_aliases with HF_CONVERSION_MAP_REVERSED by @ZX-ModelCloud in #2776
[CI] add log to ensure old env is removed by @CSY-ModelCloud in #2777
Fix thread-local CUDA linalg warmup in threadx by @Qubitium in #2778
Fix CI offload temp path handling by @Qubitium in #2779
[CI] retry if failed by @CSY-ModelCloud in #2780
[CI] check time instead of checking retry count by @CSY-ModelCloud in #2781
Add global kernel rebuild flag for CI by @Qubitium in #2782
Use thread-and-device linalg warmups by @Qubitium in #2783
Add Phi-4 runtime dependency requirements by @Qubitium in #2784
Fix Qwen2-VL calibration input capture by @Qubitium in #2785
Fix Ling compat by @Qubitium in #2786
Remove tabulate from CI tests and migrate regex to pcre by @Qubitium in #2787
Fix CI bootstrap imports and env activation by @Qubitium in #2788
Refactor CI by @Qubitium in #2789
Fix CI bootstrap regex dependency by @Qubitium in #2790
use tempfile to replace /tmp by @CSY-ModelCloud in #2791
[CI] fix matrix over 255 & make torch tests reusable & separate cpu only tests to a new group. by @CSY-ModelCloud in #2792
Align LazyTurtle HF key resolution with reversed WeightRenaming semantics by @ZX-ModelCloud in #2793
[CI] clean uv cache for CI by @CSY-ModelCloud in #2794
replace makeTmp() with tempfile.TemporaryDirectory(), auto clean by @CSY-ModelCloud in #2795
LazyTurtle now supports loading non-Safetensors models by @ZX-ModelCloud in #2796
fix awq jit illegal memory access by @CSY-ModelCloud in #2798
Fix LLMAWQ execution on SM120 GPUs by @CSY-ModelCloud in #2799
Use loader device selection for EoRA adapter generation by @Qubitium in #2800
Fix GEMM_Fast AWQ decode kernel shared memory launch by @Qubitium in #2801
[CI] don't clean whl by @CSY-ModelCloud in #2802
fix marlin jit error & update CI to latest image by @CSY-ModelCloud in #2803
remove deprecated test by @CSY-ModelCloud in #2806
[MODEL] support glm_ocr and glm_asr model by @ZX-ModelCloud in #2807
[CI] add attn_gym for test_hymba by @CSY-ModelCloud in #2811
split into 2 tests & fix fast mode didn't work by @CSY-ModelCloud in #2812
Fix multi-GPU GPTQ finalizer overlap by @Qubitium in #2808
[CI] remove deprecated artifact_id by @CSY-ModelCloud in #2816
[FIX] multi-GPU quantization OOM by canonicalizing get_supported_kwargs cache keys by @ZX-ModelCloud in #2815
remove deprecated tests. lazy_load_kernel no longer calls get_kernel by @CSY-ModelCloud in #2814
[MODEL] support gemma3n by @ZX-ModelCloud in #2817
[CI] install missing pkgs for tests & test_llama3_2_fp8 need 5090 & update .gitignore by @CSY-ModelCloud in #2819
[MODEL] support falcon_mamba by @ZX-ModelCloud in #2820
[FIX] an error occurring during Gemma 3 saving. by @ZX-ModelCloud in #2822
fix batched calibration masking by @CSY-ModelCloud in #2823
fix test_mmlupro by @CSY-ModelCloud in #2821
[MODEL] support internvl_chat by @ZX-ModelCloud in #2826
Support DeepSeek FP8 .scale dequant export by @CSY-ModelCloud in #2827
[FIX] BACKEND.MARLIN now correctly loads gptq_v2 format by @ZX-ModelCloud in #2828
skip tests, until todo fixed by @CSY-ModelCloud in #2829
Fix exllamav3_torch import under meta-device context by @dblundell in #2825
Add Ascend NPU support by @Qubitium in #2831
add marlin import test by @CSY-ModelCloud in #2832
[FIX] LazyTurtle tensor-key alias resolution for nested HF weight renames by @ZX-ModelCloud in #2835

New Contributors

@dblundell made their first contribution in #2825

Full Changelog: v6.0.3...v7.0.0
