
Releases: mobiusml/hqq

v0.1.7.post3

28 May 07:48

HQQ v0.1.7.post3

  • Enable CPU quantization and runtime (see the sketch after this list)
  • Fix _load_state_dict
  • Fix extra_repr in HQQLinear
  • Fix from_quantized bugs
  • Fix `|` union type hints
  • Fix 3-bit axis=1 slicing bug
  • Add 5/6-bit support for testing
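
A minimal sketch of the new CPU path, assuming the HQQLinear / BaseQuantizeConfig interface documented in the project README around this release; the exact keyword names (compute_dtype, device) are assumptions and may differ slightly between versions.

```python
import torch
import torch.nn as nn

from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

# Toy layer, just for illustration.
linear = nn.Linear(4096, 4096, bias=False)

# Standard 4-bit config; nbits/group_size are documented BaseQuantizeConfig options.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# device='cpu' quantizes the layer on the CPU and keeps the dequant/matmul
# runtime there as well (the feature enabled in this release).
hqq_layer = HQQLinear(linear, quant_config=quant_config,
                      compute_dtype=torch.float32, device='cpu')

out = hqq_layer(torch.randn(2, 4096, dtype=torch.float32))
```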

v0.1.7.post2

06 May 16:41

HQQ v0.1.7.post2

  • Various bug fixes, especially with AutoHQQHFModel and the patching logic, to make it work with any transformers model (see the sketch after this list).
  • README refactoring.
  • Whisper example.
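
A rough illustration of the AutoHQQHFModel path on a non-LLM transformers model (Whisper, matching the example added in this release). The method signatures and the local save-directory name are assumptions based on the project README.

```python
import torch
from transformers import WhisperForConditionalGeneration

from hqq.core.quantize import BaseQuantizeConfig
from hqq.models.hf.base import AutoHQQHFModel

# Any transformers model should work after the patching-logic fixes.
model = WhisperForConditionalGeneration.from_pretrained(
    "openai/whisper-large-v2", torch_dtype=torch.float16
)

quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# Quantizes all linear layers in place.
AutoHQQHFModel.quantize_model(model, quant_config=quant_config,
                              compute_dtype=torch.float16, device="cuda")

# Saving / reloading the quantized model (directory name is arbitrary).
AutoHQQHFModel.save_quantized(model, "whisper-large-v2-hqq-4bit")
model = AutoHQQHFModel.from_quantized("whisper-large-v2-hqq-4bit")
```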

v0.1.7

24 Apr 08:59

HQQ v0.1.7

  • Faster inference with torchao / marlin 4-bit kernels (see the sketch after this list)
  • Multi-GPU support for model.quantize()
  • Custom HF generator
  • Various bug fixes/improvements
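
A hedged sketch of how the faster 4-bit inference path is typically enabled, assuming `model` is an already-quantized HQQ model (as in the sketches above). HQQBackend / set_backend are long-standing parts of hqq.core.quantize; the prepare_for_inference helper and the backend strings are taken from the project README and may not exist under these exact names in this particular tag.

```python
from hqq.core.quantize import HQQLinear, HQQBackend

# Default dequantization backend for all HQQLinear layers
# (PYTORCH, PYTORCH_COMPILE or ATEN).
HQQLinear.set_backend(HQQBackend.ATEN)

# Patch an already-quantized 4-bit model to use the optimized kernels.
# NOTE: helper name and backend strings assumed from the README, not verified for this tag.
from hqq.utils.patching import prepare_for_inference

prepare_for_inference(model, backend="torchao_int4")   # or backend="marlin"
```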

v0.1.6.post2

19 Mar 18:24

HQQ v0.1.6.post2

Same as v0.1.6 with setup.py fixes:

  • find_packages fix: #25
  • Auto-build CUDA kernels via PyPI package: #26

v0.1.6.post1

19 Mar 15:16

HQQ v0.1.6.post1

Same as v0.1.6 with a find_packages fix (#25).

v0.1.6

19 Mar 13:35

HQQ v0.1.6

Use v0.1.6.post1 instead, unless you clone the repo first and then install from source.

Features

  • Quantize on the target device.
  • Meta-offloading uses pinned memory for faster/async transfers.
  • Loading saved LoRA weights automatically adds the LoRA modules if they are not already present (see the sketch after this list).
  • pip install now automatically compiles the CUDA kernels.
  • The CUDA backend is automatically detected and used when available.
  • You can quantize any HF model automatically via AutoHQQHFModel.
  • Faster meta-offloading with CUDA streams (experimental).
  • Int8 matmul (experimental).
  • Shared-memory CUDA kernels (experimental).
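
For the LoRA-weights loading behavior mentioned above, a hedged sketch using hqq.core.peft.PeftUtils; the parameter-dictionary layout and method names follow the project README of this era and should be treated as assumptions, with `model` being an HQQ-quantized Llama-style model as in the sketches above.

```python
import torch
from hqq.core.peft import PeftUtils

# Per-module LoRA settings (keys follow the README; adjust to your model).
base_lora_params = {"lora_type": "default", "r": 32, "lora_alpha": 64,
                    "dropout": 0.05, "train_dtype": torch.float32}
lora_params = {
    "self_attn.q_proj": base_lora_params,
    "self_attn.k_proj": base_lora_params,
    "self_attn.v_proj": base_lora_params,
    "self_attn.o_proj": base_lora_params,
    "mlp.gate_proj": None,   # None skips LoRA for these modules
    "mlp.up_proj": None,
    "mlp.down_proj": None,
}

# Loading saved LoRA weights now adds the LoRA modules automatically if missing;
# explicitly adding them first is still fine.
PeftUtils.add_lora(model, lora_params)
PeftUtils.load_lora_weights(model, "lora_weights.pt")
```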

Bugs

  • Fixed PEFT bias dtype.
  • Removed the automatic backend setting in LoRA.
  • All HQQLinear dtype/device-related overloads now return self, which should resolve a couple of issues.

Other

  • Refactored the backends (backprop backends are now used by default).
  • Added typing.
  • Ran Ruff fixes and reformatted all Python files.
  • Refactored ATEN for reference tensors.

Issues

  • Using CUDA streams for offloading is faster but uses more memory (~700MB extra with Llama2-7B at 2-bit / group-size 16). In fact, it is sometimes almost as fast as keeping the data on the GPU, so this is worth looking into.
  • Shared-memory CUDA kernels are a bit slower than the non-shared-memory version for some reason.
  • The block-size setting doesn't have much influence on speed.
  • Int8 matmul is slower than fp16 with the current "placeholder" implementation; it should be done on the ATen/CUDA side.

v0.1.5

01 Mar 10:50

HQQ v0.1.5

New features

  • Added support for multi-GPU FSDP QLoRA training (#17).

Issues

  • torch.compile and the PYTORCH_COMPILE backend break with view_as_float=True. No known solution for the moment.
  • Inference is a bit slower with view_as_float=True. Solution: after training, revert back to int bit-packing (see the sketch below).
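
A small sketch of the flag involved, assuming view_as_float is exposed through BaseQuantizeConfig as in the FSDP QLoRA integration (#17); treat the exact keyword as an assumption for this tag.

```python
from hqq.core.quantize import BaseQuantizeConfig

# view_as_float=True stores the bit-packed integer buffers viewed as floats,
# which is what FSDP needs in order to shard the quantized weights for QLoRA.
# Revert to the default (False) after training to recover full inference speed.
train_config = BaseQuantizeConfig(nbits=4, group_size=64, view_as_float=True)
```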

v0.1.4

28 Feb 09:55

HQQ v0.1.4

New features

  • Added 1-bit support with CUDA dequant kernels.
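
Indicatively, 1-bit quantization is selected through the usual config; the group_size value below is illustrative rather than a recommendation from the release notes.

```python
from hqq.core.quantize import BaseQuantizeConfig

# nbits=1 enables the new 1-bit mode; extreme low-bit settings usually pair
# with a smaller group_size to limit the quality loss.
quant_config = BaseQuantizeConfig(nbits=1, group_size=8)
```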

v0.1.3.post1

20 Feb 16:41

HQQ v0.1.3.post1

New features

  • meta_offloading support: allows offloading the quantization meta-data to the CPU, hence achieving true n-bit storage on the GPU (see the sketch below).
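
Assuming meta-offloading is toggled through BaseQuantizeConfig as in later versions of the README (the keyword name is an assumption for this tag):

```python
from hqq.core.quantize import BaseQuantizeConfig

# offload_meta=True keeps the quantization meta-data (scales/zeros) on the CPU,
# so only the n-bit packed weights live on the GPU.
quant_config = BaseQuantizeConfig(nbits=4, group_size=64, offload_meta=True)
```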

v0.1.3

12 Feb 16:58

HQQ v0.1.3

New features

  • Added CUDA kernels for dequantization (up to 2-3x inference speed-up vs. PyTorch)
  • Added support for the compute_dtype parameter (useful for float32/bfloat16 LoRA training); see the sketch below
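
A short sketch of the compute_dtype parameter as exposed on HQQLinear; the keyword placement is assumed from later documentation and may differ in this tag.

```python
import torch
import torch.nn as nn

from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

linear = nn.Linear(1024, 1024, bias=False)
quant_config = BaseQuantizeConfig(nbits=4, group_size=64)

# compute_dtype sets the dtype used for dequantization and the matmul,
# e.g. bfloat16 or float32 for LoRA training.
hqq_layer = HQQLinear(linear, quant_config=quant_config, compute_dtype=torch.bfloat16)
```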