
Releases: mobiusml/hqq

v0.1.7.post3

28 May 07:48

HQQ v0.1.7.post3

  • Enable CPU quantization and runtime (see the sketch after this list)
  • Fix _load_state_dict
  • Fix extra_repr in HQQLinear
  • Fix from_quantized bugs
  • Fix `|` union type hints
  • Fix 3-bit axis=1 slicing bug
  • Add 5/6-bit support for testing
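
A minimal sketch of the new CPU path, assuming the HQQLinear / BaseQuantizeConfig interface documented in the project README around this release; the exact keyword names (compute_dtype, device) are assumptions and may differ slightly between versions.

```python
import torch
import torch.nn as nn

from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Toy layer, just for illustration.
linear = nn.Linear(4096, 4096, bias=False)

# Standard 4-bit config; nbits/group_size are documented BaseQuantizeConfig options.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# device='cpu' quantizes the layer on the CPU and keeps the dequant/matmul
# runtime there as well (the feature enabled in this release).
hqq_layer = HQQLinear(linear, quant_config=quant_config,
                      compute_dtype=torch.float32, device='cpu')

out = hqq_layer(torch.randn(2, 4096, dtype=torch.float32))
```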

v0.1.7.post2

06 May 16:41

HQQ v0.1.7.post2

  • Various bug fixes, especially with AutoHQQHFModel and the patching logic, to make it work with any transformers model (see the sketch after this list).
  • README refactoring.
  • Whisper example.
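
A rough illustration of the AutoHQQHFModel path on a non-LLM transformers model (Whisper, matching the example added in this release). The method signatures and the local save-directory name are assumptions based on the project README.

```python
import torch
from transformers import WhisperForConditionalGeneration

from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

# Any transformers model should work after the patching-logic fixes.
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=torch.float16
)

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Quantizes all linear layers in place.
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")

# Saving / reloading the quantized model (directory name is arbitrary).
AutoHQQHFModel.save_quantized(model, "whisper-large-v2-hqq-4bit")
model = AutoHQQHFModel.from_quantized("whisper-large-v2-hqq-4bit")
```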

v0.1.7

24 Apr 08:59

HQQ v0.1.7

  • Faster inference with torchao / marlin 4-bit kernels (see the sketch after this list)
  • Multi-GPU support for model.quantize()
  • Custom HF generator
  • Various bug fixes/improvements
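
A hedged sketch of how the faster 4-bit inference path is typically enabled, assuming `model` is an already-quantized HQQ model (as in the sketches above). HQQBackend / set_backend are long-standing parts of hqq.core.quantize; the prepare_for_inference helper and the backend strings are taken from the project README and may not exist under these exact names in this particular tag.

```python
from hqq.core.quantize import HQQLinear, HQQBackend

# Default dequantization backend for all HQQLinear layers
# (PYTORCH, PYTORCH_COMPILE or ATEN).
HQQLinear.set_backend(HQQBackend.ATEN)

# Patch an already-quantized 4-bit model to use the optimized kernels.
# NOTE: helper name and backend strings assumed from the README, not verified for this tag.
from hqq.utils.patching import prepare_for_inference

prepare_for_inference(model, backend="torchao_int4")   # or backend="marlin"
```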

v0.1.6.post2

19 Mar 18:24

HQQ v0.1.6.post2

Same as v0.1.6 with setup.py fixes:

  • find_packages fix: #25
  • Auto-build CUDA kernels via PyPI package: #26

v0.1.6.post1

19 Mar 15:16

HQQ v0.1.6.post1

Same as v0.1.6 with a find_packages fix (#25).

v0.1.6

19 Mar 13:35

HQQ v0.1.6

Use v0.1.6.post1 instead, unless you clone the repo first and then install from source.

Features

  • Quantize on the target device.
  • Meta-offloading uses pinned memory for faster/async transfers.
  • Loading saved LoRA weights automatically adds the LoRA modules if they are not already present (see the sketch after this list).
  • pip install now automatically compiles the CUDA kernels.
  • The CUDA backend is automatically detected and used when available.
  • You can quantize any HF model automatically via AutoHQQHFModel.
  • Faster meta-offloading with CUDA streams (experimental).
  • Int8 matmul (experimental).
  • Shared-memory CUDA kernels (experimental).
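
For the LoRA-weights loading behavior mentioned above, a hedged sketch using hqq.core.peft.PeftUtils; the parameter-dictionary layout and method names follow the project README of this era and should be treated as assumptions, with `model` being an HQQ-quantized Llama-style model as in the sketches above.

```python
import torch
from hqq.core.peft import PeftUtils

# Per-module LoRA settings (keys follow the README; adjust to your model).
base_lora_params = {"lora_type": "default", "r": 32, "lora_alpha": 64,
                    "dropout": 0.05, "train_dtype": torch.float32}
lora_params = {
    "self_attn.q_proj": base_lora_params,
    "self_attn.k_proj": base_lora_params,
    "self_attn.v_proj": base_lora_params,
    "self_attn.o_proj": base_lora_params,
    "mlp.gate_proj": None,   # None skips LoRA for these modules
    "mlp.up_proj": None,
    "mlp.down_proj": None,
}

# Loading saved LoRA weights now adds the LoRA modules automatically if missing;
# explicitly adding them first is still fine.
PeftUtils.add_lora(model, lora_params)
PeftUtils.load_lora_weights(model, "lora_weights.pt")
```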

Bugs

  • Fixed PEFT bias dtype.
  • Removed the automatic backend setting in LoRA.
  • All HQQLinear dtype/device-related overloads now return self, which should resolve a couple of issues.

Other

  • Refactored the backends (backprop backends are now used by default).
  • Added typing.
  • Ran Ruff fixes and reformatted all Python files.
  • Refactored ATEN for reference tensors.

Issues

  • Using CUDA streams for offloading is faster but uses more memory (~700MB extra with Llama2-7B at 2-bit / group-size 16). In fact, it is sometimes almost as fast as keeping the data on the GPU, so this is worth looking into.
  • Shared-memory CUDA kernels are a bit slower than the non-shared-memory version for some reason.
  • The block-size setting doesn't have much influence on speed.
  • Int8 matmul is slower than fp16 with the current "placeholder" implementation; it should be done on the ATen/CUDA side.

v0.1.5

01 Mar 10:50

HQQ v0.1.5

New features

  • Added support for multi-GPU FSDP QLoRA training (#17).

Issues

  • torch.compile and the PYTORCH_COMPILE backend break with view_as_float=True. No known solution for the moment.
  • Inference is a bit slower with view_as_float=True. Solution: after training, revert back to int bit-packing (see the sketch below).
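
A small sketch of the flag involved, assuming view_as_float is exposed through BaseQuantizeConfig as in the FSDP QLoRA integration (#17); treat the exact keyword as an assumption for this tag.

```python
from hqq.core.quantize import BaseQuantizeConfig

# view_as_float=True stores the bit-packed integer buffers viewed as floats,
# which is what FSDP needs in order to shard the quantized weights for QLoRA.
# Revert to the default (False) after training to recover full inference speed.
train_config = BaseQuantizeConfig(nbits=4, group_size=64, view_as_float=True)
```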

v0.1.4

28 Feb 09:55

HQQ v0.1.4

New features

  • Added 1-bit support with CUDA dequant kernels.
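
Indicatively, 1-bit quantization is selected through the usual config; the group_size value below is illustrative rather than a recommendation from the release notes.

```python
from hqq.core.quantize import BaseQuantizeConfig

# nbits=1 enables the new 1-bit mode; extreme low-bit settings usually pair
# with a smaller group_size to limit the quality loss.
quant_config = BaseQuantizeConfig(nbits=1, group_size=8)
```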

v0.1.3.post1

20 Feb 16:41

HQQ v0.1.3.post1

New features

  • meta_offloading support: allows offloading the quantization meta-data to the CPU, hence achieving true n-bit storage on the GPU (see the sketch below).
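
Assuming meta-offloading is toggled through BaseQuantizeConfig as in later versions of the README (the keyword name is an assumption for this tag):

```python
from hqq.core.quantize import BaseQuantizeConfig

# offload_meta=True keeps the quantization meta-data (scales/zeros) on the CPU,
# so only the n-bit packed weights live on the GPU.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
```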

v0.1.3

12 Feb 16:58

HQQ v0.1.3

New features

  • Added CUDA kernels for dequantization (up to 2-3x inference speed-up vs. PyTorch)
  • Added support for the compute_dtype parameter (useful for float32/bfloat16 LoRA training); see the sketch below
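
A short sketch of the compute_dtype parameter as exposed on HQQLinear; the keyword placement is assumed from later documentation and may differ in this tag.

```python
import torch
import torch.nn as nn

from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(1024, 1024, bias=False)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# compute_dtype sets the dtype used for dequantization and the matmul,
# e.g. bfloat16 or float32 for LoRA training.
hqq_layer = HQQLinear(linear, quant_config=quant_config, compute_dtype=torch.bfloat16)
```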