🚀 GPTQModel v7.0.0
🔥 Major
- New Huawei Ascend NPU quantization support with Torch-based kernels for inference
- All CUDA/ROCm compiled kernels are now JIT (just-in-time) compiled on first use
- Pip/UV install no longer requires the `--no-build-isolation` flag
🧠 New model support and compatibility wins
- Added support for GLM 5/5.1, GLM OCR, GLM ASR, Gemma 3n, Falcon Mamba, and InternVL Chat.
- Extended OpenVINO GPTQ patching to understand GPTQModel's newer kernels.
- Fixed Qwen3 dtype handling, Qwen3.5 MoE module-tree assertions, Qwen2-VL calibration input capture, and Qwen 3.6 MoE regressions.
- Fixed Llama4Router replacement behavior, Phi-3 defused MLP module mapping, Phi-4 runtime requirements, Instella rope-scaling compatibility, Ling compatibility, Mixtral MoE checkpoint module names, Brumby thread safety, Baichuan compatibility, and Gemma 3 saving.
- Fixed `exllamav3_torch` import under meta-device context.
⚡ Kernels, JIT, and hardware acceleration
- Moved all kernels that require compilation to JIT compilation on first use and cleaned up Marlin import probing, CUDA header handling, nvcc flag checks, and Torch/CUDA mismatch handling.
- Synced Marlin/Machete kernels with upstream and added hardware-specific Marlin boost paths.
- Guarded CUTLASS version mismatches and fixed generated-kernel staleness.
- Added global kernel rebuild support for CI and safer shared extension locks.
- Added Ascend NPU support.
- Fixed AWQ JIT cache invalidation, illegal memory access, SM120 execution, GEMM_Fast shared-memory launch, and BF16 bias validation.
- Fixed `BACKEND.MARLIN` loading for `gptq_v2` format and added Marlin import coverage.
🔥 Quantization, AWQ, FP8, and dequant
- Added FP8/FP4 CPU dequant and DeepSeek FP8 `.scale` dequant export.
- Added dtype auto-decoding and decode path updates.
- Reduced AWQ scale-search activation memory and split AWQ integration tests for cleaner coverage.
- Fail fast on unsupported act-group-aware GPTQ shapes instead of continuing into invalid layouts.
- Fixed INT3 qzero format conversion, GAR width compatibility, and GPTQ batched keep-mask handling.
- Improved AWQ W4A8 and BF16 validation paths, plus post-quant MoE routing behavior.
- Used loader device selection for EoRA adapter generation.
🐢 LazyTurtle, loading, and model plumbing
- Refactored input capture into `BaseQModel` and model-specific QModels for cleaner replay and calibration flows.
- Renamed and hardened the turtle path into LazyTurtle, with stricter materialization failures and better expected-skip handling.
- Fixed LazyTurtle materialization for non-square fused experts, PhiMoE, nested HF weight renames, reversed `WeightRenaming` semantics, and non-Safetensors checkpoints.
- Improved out-of-model tensor handling for MTP `prefix/files` paths.
- Removed `BaseModel.loader_requires_dtype` and normalized config dtype handling through `get_hf_config_dtype()`.
- Fixed multi-GPU replay output retention, GPTQ finalizer overlap, and quantization OOMs from retained callable cache keys.
🧰 CI, packaging, and developer workflow
- Cleaned up CI shell logic, environment setup, UV cache handling, reusable Torch tests, CPU-only grouping, runner selection, retry behavior, and offload temp paths.
- Kept CI and Torch CUDA versions aligned, moved to newer Docker images, and surfaced real exit codes and GPU names.
- Removed `lm-eval`, deprecated tests, deprecated artifact IDs, pause UI lifecycle code, and tabulate from CI/test paths.
- Migrated more regex usage to pcre/pcre2.
- Replaced temp path helpers with `tempfile.TemporaryDirectory()` for automatic cleanup.
- Updated requirements, dependencies, setuptools compatibility, and install-with-Torch validation.
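The move to `tempfile.TemporaryDirectory()` relies on the context manager deleting the directory when the block exits. A minimal sketch of that pattern using only the standard library; the `write_scratch_file` helper and the file name are hypothetical and not taken from the GPTQModel test suite:

```python
import os
import tempfile


def write_scratch_file(payload: bytes) -> tuple[str, bool]:
    """Write payload into a temporary directory and report where it lived.

    Hypothetical helper for illustration; not part of GPTQModel.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "scratch.bin")
        with open(path, "wb") as handle:
            handle.write(payload)
        existed_inside = os.path.exists(path)
    # Leaving the with-block removes the directory and everything in it.
    return tmp_dir, existed_inside


tmp_dir, existed_inside = write_scratch_file(b"calibration-cache")
print(existed_inside, os.path.exists(tmp_dir))  # prints: True False
```

Because cleanup is tied to the `with` block, a test that fails partway through still leaves no stale directory behind.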
💥 Breaking and removed
- Kernel loading behavior has shifted heavily toward JIT compilation, so custom deployment environments should verify compiler/CUDA compatibility.
- `lm-eval` references were removed from CI and test/docs paths.
- Deprecated tests, artifact handling, and pause UI lifecycle code were removed.
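Since kernels are now built on first use, a deployment can check ahead of time that the local `nvcc` matches the CUDA version Torch was built against. A rough sketch, assuming `nvcc --version` prints its usual `release X.Y` line and that comparing major.minor is a sufficient check; the helper names are hypothetical and this is not GPTQModel's own validation logic:

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_nvcc_release(version_output: str) -> Optional[str]:
    """Extract 'X.Y' from `nvcc --version` output, or None if absent."""
    match = re.search(r"release\s+(\d+)\.(\d+)", version_output)
    return f"{match.group(1)}.{match.group(2)}" if match else None


def local_nvcc_release() -> Optional[str]:
    """Return the local nvcc release, or None when nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run(
        [nvcc, "--version"], capture_output=True, text=True, check=False
    )
    return parse_nvcc_release(result.stdout)


def cuda_versions_match(
    torch_cuda: Optional[str], nvcc_release: Optional[str]
) -> bool:
    """Compare major.minor of Torch's CUDA build and the local nvcc."""
    if not torch_cuda or not nvcc_release:
        return False
    return torch_cuda.split(".")[:2] == nvcc_release.split(".")[:2]


if __name__ == "__main__":
    try:
        import torch  # optional: only needed for the live check

        torch_cuda = torch.version.cuda  # None on CPU-only or ROCm builds
    except ImportError:
        torch_cuda = None
    nvcc_release = local_nvcc_release()
    print(
        f"torch CUDA: {torch_cuda}, nvcc: {nvcc_release}, "
        f"match: {cuda_versions_match(torch_cuda, nvcc_release)}"
    )
```

Running this in the target image before the first model load surfaces a missing compiler or a version mismatch early, rather than during the first JIT build.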
Full Changelog:
- Refactor input capture flow into BaseQModel and model-specific QModels by @ZX-ModelCloud in #2666
- [CI] adjust venv logic by @CSY-ModelCloud in #2667
- [CI] remove verbose log flag for build by @CSY-ModelCloud in #2669
- Move more kernels to JIT compile path by @Qubitium in #2668
- Kernels migrate to jit compile by @Qubitium in #2670
- remove hf kernels dependency for cpu by @Qubitium in #2671
- fix marlin import paths probe by @Qubitium in #2673
- fix failed test by @Qubitium in #2674
- Extend OpenVINO's GPTQ patcher to understand GPTQModel new kernels. by @ZX-ModelCloud in #2675
- [CI] use same cuda version for CI & torch by @CSY-ModelCloud in #2676
- Handle mtp `prefix/files` in `out_of_model_tensors` by @ZX-ModelCloud in #2677
- bonsai refractor by @Qubitium in #2672
- glm 5/5.1 support by @Qubitium in #2680
- Normalize config `dtype` to `torch.dtype` in `get_hf_config_dtype()` by @ZX-ModelCloud in #2681
- [FIX] Qwen3ForCausalLM does not require the `dtype` argument. by @ZX-ModelCloud in #2682
- fix jit error because torch's cuda mismatchs local nvcc version by @CSY-ModelCloud in #2683
- fix: rotary_embed init by @Qubitium in #2684
- remove BaseModel.loader_requires_dtype by @ZX-ModelCloud in #2686
- [CI] no need build step by @CSY-ModelCloud in #2688
- refractor turtle to lazy by @Qubitium in #2687
- [CI] fix jobs are skipped by @CSY-ModelCloud in #2689
- [FIX] multi-GPU replay output retention OOM by @ZX-ModelCloud in #2692
- fp8/fp4 cpu dequant by @Qubitium in #2691
- refactor all monekypatches to use same lock by @ZX-ModelCloud in #2693
- [CI] decrease max parallel jobs to 4 by @CSY-ModelCloud in #2695
- [FIX] All cpp extensions should share the same lock instead of using a map of locks by @ZX-ModelCloud in #2696
- dtype auto decoder by @Qubitium in #2690
- Decode update by @Qubitium in #2698
- refractor processors by @Qubitium in #2697
- fix cuda header path conflict by @Qubitium in #2701
- [CI] add prefix for env name by @CSY-ModelCloud in #2704
- ignore .codex by @CSY-ModelCloud in #2703
- fix: stabilize baichuan compat test by @Qubitium in #2702
- Fix Qwen3.5 MoE module tree assertion by @Qubitium in #2705
- [CI] remove UV_INDEX_URL by @CSY-ModelCloud in #2706
- Fix: LazyTurtle materialization for non-square fused experts by @ZX-ModelCloud in #2707
- [CI] uv won't R/W /monster now by @CSY-ModelCloud in #2708
- Fix AWQ JIT cache invalidation by @Qubitium in #2709
- Split AWQ integration tests by @Qubitium in #2710
- Fix CI to install ModelCloud deps from git by @Qubitium in #2711
- migrate stdlib.re to pcre2 by @Qubitium in #2712
- [CI] show real exit code by @CSY-ModelCloud in #2713
- Remove lm-eval from CI and test/docs references by @Qubitium in #2714
- Remove pause UI controller lifecycle by @Qubitium in #2715
- [CI] re-mount /monster for uv by @CSY-ModelCloud in #2718
- Sync Marlin/Machete Kernel with upstream by @Qubitium in #2717
- Fix GPU CI allocation and streaming regressions by @Qubitium in #2719
- Guard CUTLASS version mismatches by @Qubitium in #2720
- Fix Marlin generated kernel staleness by @Qubitium in #2721
- Fix balanced MoE vram usage by @Qubitium in #2716
- Fix bias dtype and validate AWQ bf16 ops by @Qubitium in #2722
- Raise on LazyTurtle materialization failures and silence expected skips by @ZX-ModelCloud in #2723
- HW specific boost for Marlin by @Qubitium in #2724
- Update requirements.txt by @Qubitium in #2725
- [CI] share common venvs & add lock when installing pkgs by @CSY-ModelCloud in #2726
- [CI] set uv cache for differrent envs by @CSY-ModelCloud in #2728
- [FIX] Mixtral MoE checkpoint module name may not match modeling code by @ZX-ModelCloud in #2727
- Delete redundant AWQ logs. by @ZX-ModelCloud in #2731
- [CI] clean sh codes, simpilify logic by @CSY-ModelCloud in #2730
- Sync setuptools by @Qubitium in #2732
- [CI] test install with torch by @CSY-ModelCloud in #2733
- [CI] update release CI to latest docker by @CSY-ModelCloud in #2734
- [CI] check setuptools compatibility by @CSY-ModelCloud in #2735
- [CI] fix compat test by @CSY-ModelCloud in #2736
- [CI] remvoe unused params by @CSY-ModelCloud in #2737
- Disable MoE routing config in post-quant forawrd by @ZX-ModelCloud in #2738
- file cleanup by @Qubitium in #2740
- [CI] mount ci host path as uv cache dir by @CSY-ModelCloud in #2741
- marlin: cleanup by @Qubitium in #2742
- Fix nvcc flag check by @Qubitium in #2745
- [FIX] Avoid replacing modules like Llama4Router with HookedLinear by @ZX-ModelCloud in #2744
- Update depends by @Qubitium in #2746
- [CI] show gpu name on response by @CSY-ModelCloud in #2747
- [CI] no sn by @CSY-ModelCloud in #2748
- fix brumby compat, thread safety by @Qubitium in #2749
- [CI] always start with a new clean env by @CSY-ModelCloud in #2750
- Add Qwen 3.6 MoE quantization regressions by @Qubitium in #2752
- [FIX] phimoe quantization error with LazyTurtle by @ZX-ModelCloud in #2754
- [CI] refactor CI & mark no gpu tests by @CSY-ModelCloud in #2755
- [CI] no gpu tests first by @CSY-ModelCloud in #2756
- fix model def tree execution order ground truth by @Qubitium in #2753
- [CI] fix CI cannot get runner group, use cpu model instead by @CSY-ModelCloud in #2758
- [CI] fix no arg for ths func by @CSY-ModelCloud in #2759
- Looper fix by @Qubitium in #2757
- [CI] late import device_smi, so no need to install it in pre env by @CSY-ModelCloud in #2760
- Fail fast for unsupported act-group-aware GPTQ shapes by @ZX-ModelCloud in #2761
- fix gar width compat by @Qubitium in #2762
- fix int3 qzero format conversion by @Qubitium in #2763
- Fix Phi-3 defused MLP module mapping by @Qubitium in #2765
- Fix ci exposed compat issues by @Qubitium in #2766
- Fix ci regressions by @Qubitium in #2767
- [CI] install required pkgs for tests by @CSY-ModelCloud in #2768
- [FIX] test_instella by @ZX-ModelCloud in #2769
- [CI] add compute cap requirement for special tests by @CSY-ModelCloud in #2770
- [CI] allow CI skip test & clean ci logic by @CSY-ModelCloud in #2771
- [CI] install scipy for test_phi_4 by @CSY-ModelCloud in #2773
- fix mmlu ValueError by @CSY-ModelCloud in #2774
- Reduce AWQ scale-search activation memory by @Qubitium in #2775
- [REFACTOR] replace `checkpoint_path_aliases` with `HF_CONVERSION_MAP_REVERSED` by @ZX-ModelCloud in #2776
- [CI] add log to ensure old env is removed by @CSY-ModelCloud in #2777
- Fix thread-local CUDA linalg warmup in threadx by @Qubitium in #2778
- Fix CI offload temp path handling by @Qubitium in #2779
- [CI] retry if failed by @CSY-ModelCloud in #2780
- [CI] check time instead of checking retry count by @CSY-ModelCloud in #2781
- Add global kernel rebuild flag for CI by @Qubitium in #2782
- Use thread-and-device linalg warmups by @Qubitium in #2783
- Add Phi-4 runtime dependency requirements by @Qubitium in #2784
- Fix Qwen2-VL calibration input capture by @Qubitium in #2785
- Fix Ling compat by @Qubitium in #2786
- Remove tabulate from CI tests and migrate regex to pcre by @Qubitium in #2787
- Fix CI bootstrap imports and env activation by @Qubitium in #2788
- Refractor CI by @Qubitium in #2789
- Fix CI bootstrap regex dependency by @Qubitium in #2790
- use tempfile to replace /tmp by @CSY-ModelCloud in #2791
- [CI] fix matrix over 255 & make torch tests reusable & separate cpu only tests to a new group. by @CSY-ModelCloud in #2792
- Align LazyTurtle HF key resolution with reversed WeightRenaming semantics by @ZX-ModelCloud in #2793
- [CI} clean uv cache for CI by @CSY-ModelCloud in #2794
- replace makeTmp() with tempfile.TemporaryDirectory(), auto clean by @CSY-ModelCloud in #2795
- LazyTurtle now supports loading non-Safetensors models by @ZX-ModelCloud in #2796
- fix awq jit illegal memory access by @CSY-ModelCloud in #2798
- Fix LLMAWQ execution on SM120 GPUs by @CSY-ModelCloud in #2799
- Use loader device selection for EoRA adapter generation by @Qubitium in #2800
- Fix GEMM_Fast AWQ decode kernel shared memory launch by @Qubitium in #2801
- [CI] don't clean whl by @CSY-ModelCloud in #2802
- fix marlin jit error & update CI to latest image by @CSY-ModelCloud in #2803
- remove deprecated test by @CSY-ModelCloud in #2806
- [MODEL] support `glm_ocr` and `glm_asr` model by @ZX-ModelCloud in #2807
- [CI] add attn_gym for test_hymba by @CSY-ModelCloud in #2811
- split into 2 tests & fix fast mode didn't work by @CSY-ModelCloud in #2812
- Fix multi-GPU GPTQ finalizer overlap by @Qubitium in #2808
- [CI] remove deperacated artifact_id by @CSY-ModelCloud in #2816
- [FIX] multi-GPU quantization OOM by canonicalizing get_supported_kwargs cache keys by @ZX-ModelCloud in #2815
- remove deperacated tests. lazy_load_kernel no longer calls get_kernel by @CSY-ModelCloud in #2814
- [MODEL] support gemma3n by @ZX-ModelCloud in #2817
- [CI] install missing pkgs for tests & test_llama3_2_fp8 need 5090 & update .gitignore by @CSY-ModelCloud in #2819
- [MODEL] support `falcon_mamba` by @ZX-ModelCloud in #2820
- [FIX] an error occurring during Gemma 3 saving. by @ZX-ModelCloud in #2822
- fix batched calibration masking by @CSY-ModelCloud in #2823
- fix test_mmlupro by @CSY-ModelCloud in #2821
- [MODEL] support `internvl_chat` by @ZX-ModelCloud in #2826
- Support DeepSeek FP8 .scale dequant export by @CSY-ModelCloud in #2827
- [FIX] `BACKEND.MARLIN` Currently correct load `gptq_v2` format by @ZX-ModelCloud in #2828
- skip tests, until todo fixed by @CSY-ModelCloud in #2829
- Fix exllamav3_torch import under meta-device context by @dblundell in #2825
- Add Ascend NPU support by @Qubitium in #2831
- add marlin import test by @CSY-ModelCloud in #2832
- [FIX] LazyTurtle tensor-key alias resolution for nested HF weight renames by @ZX-ModelCloud in #2835
New Contributors
- @dblundell made their first contribution in #2825
Full Changelog: v6.0.3...v7.0.0