GPT-QModel v5.8.0

@Qubitium Qubitium released this 19 Mar 16:35
· 303 commits to main since this release
9980f01

Notable Changes

  • Transformers 5.3.0 compatibility.

  • Video Quantization Support

    • Added support for video input during quantization.
  • MoE & Model Support

    • Added support for Qwen 3.5 and Qwen 3.5 MoE.
    • Expanded compatibility for Qwen 3 variants including MoE / VL / Omni / Next.
    • Added support for LLaDA2 block-diffusion LLMs.
    • Improved compatibility for Mixtral, Phi-4, Nemotron Ultra, BaiChuan, ChatGLM, Yi, and GLM4V.
    • Fixed multiple MoE-specific AWQ and multi-GPU issues, including routing, module tree, position embeddings, and device mismatches.
  • AWQ / GPTQ Kernels

    • Added CPU fused AWQ kernels for torch_fused and hf_kernel.
    • Added torch_int8 AWQ kernel.
    • Added BitBLAS AWQ kernel.
    • Ported Intel int8 GPTQ/AWQ kernels.
    • Updated kernel selection to prefer HF kernels where they provide the best performance and compatibility.
    • Added BitBLAS fallback protection and fixed BitBLAS accuracy and qzero remap regressions.
  • Quantization Improvements

    • Replaced greedy search with ternary search in SmoothBSE.
    • Fixed overly aggressive clipping in SmoothMAD.
    • Added layer-level dynamic skip for fast quantization.
    • Added early stop when all remaining layers are skipped during quantization.
    • Fixed AWQ OOM and dequantization-related issues.
  • Runtime & Dequantization

    • Added optional CPU int64 g_idx cache for TorchQuantLinear dequantization.
    • Improved TorchFused dequantization and fp32 dtype support.
    • Removed unnecessary symmetric handling in dequantize_gemm.
    • Fixed rotary embedding device mismatch by storing per-device rotary copies.
    • Added warmup protection for threaded timing.
  • Defuser Integration

    • Integrated defuser.convert_hf_model().
    • Integrated defuser.materialize_model().
    • Integrated defuser.replace_fused_blocks().
    • Improved defuser meta/offload compatibility and fused block handling.
  • Compatibility Fixes

    • Improved compatibility with older and newer Hugging Face Transformers / Optimum versions.
    • Fixed import compatibility issues in models/utils.
    • Fixed rotary / embedding config compatibility with older HF and model variants.
    • Improved tokenizer and model compatibility related to tokenicer.
    • Fixed OSS compatibility issues.
  • Kernel / Backend Changes

    • Hard-deprecated the ExLLaMA v1 kernel.
    • Exposed the Triton patcher as an externally callable API.
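
The SmoothBSE entry under Quantization Improvements mentions replacing a greedy search with a ternary search. As a rough illustration of the general technique only, and not GPT-QModel's actual implementation, the sketch below minimizes a unimodal function over a closed interval. The name `ternary_search_min`, the toy loss, and the tolerance values are hypothetical and chosen purely for demonstration.

```python
from typing import Callable


def ternary_search_min(
    loss: Callable[[float], float],
    lo: float,
    hi: float,
    tol: float = 1e-6,
    max_iters: int = 200,
) -> float:
    """Return an approximate minimizer of a unimodal `loss` on [lo, hi]."""
    for _ in range(max_iters):
        if hi - lo <= tol:
            break
        third = (hi - lo) / 3.0
        m1 = lo + third
        m2 = hi - third
        if loss(m1) < loss(m2):
            hi = m2  # the minimum cannot lie in (m2, hi]
        else:
            lo = m1  # the minimum cannot lie in [lo, m1)
    return (lo + hi) / 2.0


# Hypothetical stand-in for a quantization-error curve over a clipping ratio.
def toy_loss(x: float) -> float:
    return (x - 0.7) ** 2


best = ternary_search_min(toy_loss, 0.0, 1.0)
```

Each iteration discards one third of the interval, so the number of loss evaluations grows only logarithmically with the requested precision. The method assumes the loss has a single minimum on the interval; on a curve with several local minima it may settle on the wrong one.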

Full Changelog: v5.7.0...v5.8.0