🚀 GPTQModel v7.0.0

@Qubitium Qubitium released this 28 Apr 20:37
· 73 commits to main since this release
f731429

🔥 Major

  • New Huawei Ascend NPU quantization support, with Torch-based kernels for inference
  • All CUDA/ROCm compiled kernels are now JIT (just-in-time) compiled on first use
  • Pip/UV install no longer requires the --no-build-isolation flag

🧠 New model support and compatibility wins

  • Added support for GLM 5/5.1, GLM OCR, GLM ASR, Gemma 3n, Falcon Mamba, and InternVL Chat.
  • Extended OpenVINO GPTQ patching to understand GPTQModel's newer kernels.
  • Fixed Qwen3 dtype handling, Qwen3.5 MoE module-tree assertions, Qwen2-VL calibration input capture, and Qwen 3.6 MoE regressions.
  • Fixed Llama4Router replacement behavior, Phi-3 defused MLP module mapping, Phi-4 runtime requirements, Instella rope-scaling compatibility, Ling compatibility, Mixtral MoE checkpoint module names, Brumby thread safety, Baichuan compatibility, and Gemma 3 saving.
  • Fixed exllamav3_torch import under meta-device context.

⚡ Kernels, JIT, and hardware acceleration

  • Moved all kernels that require compilation to JIT compilation on first use, and cleaned up Marlin import probing, CUDA header handling, nvcc flag checks, and Torch/CUDA mismatch handling.
  • Synced Marlin/Machete kernels with upstream and added hardware-specific Marlin boost paths.
  • Guarded CUTLASS version mismatches and fixed generated-kernel staleness.
  • Added global kernel rebuild support for CI and safer shared extension locks.
  • Added Ascend NPU support.
  • Fixed AWQ JIT cache invalidation, illegal memory access, SM120 execution, GEMM_Fast shared-memory launch, and BF16 bias validation.
  • Fixed BACKEND.MARLIN loading for gptq_v2 format and added Marlin import coverage.

🔥 Quantization, AWQ, FP8, and dequant

  • Added FP8/FP4 CPU dequant and DeepSeek FP8 .scale dequant export.
  • Added dtype auto-decoding and decode path updates.
  • Reduced AWQ scale-search activation memory and split AWQ integration tests for cleaner coverage.
  • GPTQ now fails fast on unsupported act-group-aware shapes instead of continuing into invalid layouts.
  • Fixed INT3 qzero format conversion, GAR width compatibility, and GPTQ batched keep-mask handling.
  • Improved AWQ W4A8 and BF16 validation paths, plus post-quant MoE routing behavior.
  • Used loader device selection for EoRA adapter generation.
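For readers unfamiliar with what CPU-side FP8 dequantization involves, here is a minimal pure-Python sketch. It assumes the finite-only E4M3 encoding (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits) and a single per-tensor scale; the function names are hypothetical, and GPTQModel's actual dequant path, scale layout, and supported FP8 variants may differ.

```python
import math


def decode_fp8_e4m3(byte: int) -> float:
    """Decode one E4M3 (finite-only) byte into a Python float."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07
    # In the finite-only variant, S.1111.111 is the only NaN pattern.
    if exponent == 0x0F and mantissa == 0x07:
        return math.nan
    # Subnormals: no implicit leading 1, fixed exponent of 2**-6.
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6
    # Normals: implicit leading 1, exponent bias of 7.
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize_fp8(raw: bytes, scale: float) -> list[float]:
    """Hypothetical helper: decode each byte, then apply one per-tensor scale."""
    return [decode_fp8_e4m3(b) * scale for b in raw]


if __name__ == "__main__":
    print(dequantize_fp8(bytes([0x38, 0xB8, 0x7E]), scale=0.5))
```

Checkpoints that store block-wise scales would need a scale lookup per block rather than the single multiplier used here.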

🐢 LazyTurtle, loading, and model plumbing

  • Refactored input capture into BaseQModel and model-specific QModels for cleaner replay and calibration flows.
  • Renamed and hardened the turtle path into LazyTurtle, with stricter materialization failures and better expected-skip handling.
  • Fixed LazyTurtle materialization for non-square fused experts, PhiMoE, nested HF weight renames, reversed WeightRenaming semantics, and non-Safetensors checkpoints.
  • Improved out-of-model tensor handling for MTP prefix/files paths.
  • Removed BaseModel.loader_requires_dtype and normalized config dtype handling through get_hf_config_dtype().
  • Fixed multi-GPU replay output retention, GPTQ finalizer overlap, and quantization OOMs from retained callable cache keys.

🧰 CI, packaging, and developer workflow

  • Cleaned up CI shell logic, environment setup, UV cache handling, reusable Torch tests, CPU-only grouping, runner selection, retry behavior, and offload temp paths.
  • Kept CI and Torch CUDA versions aligned, moved to newer Docker images, and surfaced real exit codes and GPU names.
  • Removed lm-eval, deprecated tests, deprecated artifact IDs, pause UI lifecycle code, and tabulate from CI/test paths.
  • Migrated more regex usage to pcre/pcre2.
  • Replaced temp path helpers with tempfile.TemporaryDirectory() for automatic cleanup.
  • Updated requirements, dependencies, setuptools compatibility, and install-with-Torch validation.
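The move to tempfile.TemporaryDirectory() mentioned above relies on the standard library removing the directory when the context manager exits. A minimal sketch of that pattern follows; the function and file names are hypothetical and not taken from the GPTQModel test suite.

```python
import tempfile
from pathlib import Path


def write_scratch_file(payload: bytes) -> Path:
    """Write payload into a temporary directory and return the (now stale) path."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        scratch = Path(tmp_dir) / "offload.bin"  # hypothetical file name
        scratch.write_bytes(payload)
        # Inside the block the file exists and can be read back.
        assert scratch.read_bytes() == payload
    # On exit the directory and everything in it has been deleted.
    return scratch


if __name__ == "__main__":
    stale = write_scratch_file(b"abc")
    print(stale.exists())
```

Because cleanup is tied to the with block, a test that raises midway still leaves no temp files behind.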

💥 Breaking and removed

  • Kernel loading behavior has shifted heavily toward JIT compilation, so custom deployment environments should verify compiler/CUDA compatibility.
  • lm-eval references were removed from CI and test/docs paths.
  • Deprecated tests, artifact handling, and pause UI lifecycle code were removed.
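Since kernels now compile on first use, a deployment image may want a quick pre-flight check of its toolchain. The sketch below is one possible approach and not part of GPTQModel: it only looks for nvcc on PATH and, if PyTorch is installed, reports the CUDA version PyTorch was built against. The function name is hypothetical, and ROCm toolchains are not covered.

```python
import shutil
import subprocess


def describe_build_env() -> dict:
    """Collect basic facts about the local CUDA compiler and PyTorch build."""
    info = {
        "nvcc_path": shutil.which("nvcc"),  # None when nvcc is not on PATH
        "nvcc_version": None,
        "torch_cuda": None,
        "cuda_available": False,
    }
    if info["nvcc_path"]:
        result = subprocess.run(
            [info["nvcc_path"], "--version"],
            capture_output=True, text=True, check=False, timeout=30,
        )
        lines = [line.strip() for line in result.stdout.splitlines() if line.strip()]
        # Keep the last non-empty line of the version banner as a summary.
        info["nvcc_version"] = lines[-1] if lines else None
    try:
        import torch
    except ImportError:
        return info  # PyTorch not installed; report compiler facts only
    info["torch_cuda"] = torch.version.cuda  # None on CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    return info


if __name__ == "__main__":
    for key, value in describe_build_env().items():
        print(f"{key}: {value}")
```

Comparing the reported nvcc release with torch_cuda by eye is usually enough to spot the kind of Torch/CUDA mismatch that would break a first-use build.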

Full Changelog: v6.0.3...v7.0.0