Releases · mobiusml/hqq
HQQ v0.1.7.post3
- Enable CPU quantization and runtime
- Fix `_load_state_dict`
- Fix `extra_repr` in `HQQLinear`
- Fix `from_quantized` bugs
- Fix `|` typing
- Fix 3-bit `axis=1` slicing bug
- Add 5/6-bit for testing
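One of the fixes above concerns slicing 3-bit packed data. As a rough illustration of why slicing sub-byte-packed data is error-prone (a pure-Python sketch, not HQQ's actual packing scheme, which works on GPU tensors): ten 3-bit values share one 32-bit word, so element boundaries do not line up with word boundaries.

```python
# Pure-Python sketch of 3-bit packing/unpacking (illustrative only,
# not HQQ's implementation). Ten 3-bit values fit in one 32-bit word.

def pack_3bit(values):
    """Pack a list of 3-bit ints (0..7) into 32-bit words, 10 per word."""
    words = []
    for i in range(0, len(values), 10):
        word = 0
        for j, v in enumerate(values[i:i + 10]):
            word |= (v & 0b111) << (3 * j)
        words.append(word)
    return words

def unpack_3bit(words, n):
    """Recover the first n 3-bit values from the packed 32-bit words."""
    out = []
    for word in words:
        for j in range(10):
            out.append((word >> (3 * j)) & 0b111)
    return out[:n]  # drop the padding values in the last word

vals = [5, 3, 7, 0, 1, 6, 2, 4, 7, 1, 3, 3]
packed = pack_3bit(vals)
assert unpack_3bit(packed, len(vals)) == vals  # lossless round-trip
```

Because a row of the original matrix can start mid-word, a naive `axis=1` slice of the packed buffer lands inside a word, which is the kind of boundary bug the fix addresses.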
HQQ v0.1.7.post2
- Various bug fixes, especially with `AutoHQQHFModel` and the patching logic, to make it work with any transformers model.
- Readme refactoring.
- Whisper example.
v0.1.7
v0.1.6.post2
v0.1.6.post1
HQQ v0.1.6
Use v0.1.6.post1 instead, unless you first clone the repo and then install.
Features
- Quantize on target device.
- Meta-offloading uses pinned memory for faster/async transfers.
- Loading saved LoRA weights automatically adds LoRA modules if not already present.
- `pip install` now automatically compiles the CUDA kernels.
- CUDA backend automatically detected and used when available.
- You can quantize any HF model automatically via `AutoHQQHFModel`.
- Faster meta-offloading with CUDA streams (experimental).
- Int8 matmul (experimental).
- Shared memory CUDA kernels (experimental).
Bugs
- Fix Peft bias dtype.
- Removed auto backend setting in LoRA.
- All `HQQLinear` dtype/device-related overloads now return `self`, which should solve a couple of issues.
Other
- Refactor backends (using backprop backends by default now).
- Added typing.
- Ruff fix and reformat all Python files.
- Refactor ATEN for reference tensors.
Issues
- Using CUDA streams for offloading is faster but uses more memory (~+700MB with Llama2-7B 2-bit/gs=16). In fact, it is sometimes almost as fast as keeping the data on the GPU, so this is worth looking into.
- Shared memory CUDA kernels are a bit slower than without for some reason.
- The block size setting doesn't have much influence on the speed.
- Int8 matmul is slower than fp16 with the current "placeholder" implementation; it should be done on the ATen/CUDA side.
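The int8 matmul mentioned above is experimental and currently a placeholder. The underlying idea can be sketched in pure Python; note this uses simple symmetric per-tensor quantization for illustration, which may differ from HQQ's actual scheme:

```python
# Sketch of int8 matmul: quantize both operands to int8, multiply in
# integer arithmetic, then rescale. Illustration only; a real
# implementation belongs on the ATen/CUDA side, as the notes point out.

def quantize_int8(mat):
    """Symmetric per-tensor quantization: float matrix -> (int8 values, scale)."""
    amax = max(abs(v) for row in mat for v in row) or 1.0
    scale = amax / 127.0
    q = [[max(-128, min(127, round(v / scale))) for v in row] for row in mat]
    return q, scale

def int8_matmul(a, b):
    """Quantize both operands, accumulate in integer, dequantize the result."""
    qa, sa = quantize_int8(a)
    qb, sb = quantize_int8(b)
    n, k, m = len(qa), len(qb), len(qb[0])
    return [[sum(qa[i][t] * qb[t][j] for t in range(k)) * sa * sb
             for j in range(m)] for i in range(n)]

a = [[0.5, -1.0], [2.0, 0.25]]
b = [[1.0, 0.0], [0.5, -0.5]]
approx = int8_matmul(a, b)  # close to the exact fp result [[0.0, 0.5], [2.125, -0.125]]
```

The speed advantage of int8 only materializes when the integer accumulation runs on hardware int8 paths (e.g. tensor cores), which is why a Python- or fp16-emulated version ends up slower than plain fp16.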
HQQ v0.1.5
New features
- Added support for multi-gpu FSDP QLoRA training (#17)
Issues
- `torch.compile` and the `PYTORCH_COMPILE` backend break with `view_as_float=True`. No known solution for the moment.
- A bit slower inference with `view_as_float=True`. Solution: after training, the user can revert to int bitpacking.
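The idea behind `view_as_float` can be illustrated with a bit-level reinterpretation: the packed integer payload is stored under a floating-point view (FSDP shards float parameters), without converting the values. This is a pure-Python sketch using `struct`; HQQ would do the equivalent with a tensor dtype view rather than per-scalar packing:

```python
import struct

# Sketch of "viewing" packed int data as float: reinterpret the raw
# bits, do not convert the value. Illustration only, not HQQ's code.

def int32_as_float32(x):
    """Reinterpret the bits of an int32 as a float32."""
    return struct.unpack("<f", struct.pack("<i", x))[0]

def float32_as_int32(f):
    """Reinterpret the bits of a float32 back as an int32."""
    return struct.unpack("<i", struct.pack("<f", f))[0]

packed = 0x1B2C3D4E                 # some packed quantized payload
as_float = int32_as_float32(packed) # same 32 bits, float-typed view
assert float32_as_int32(as_float) == packed  # round-trip is lossless
```

This also hints at why the float view costs a bit of inference speed: the kernels must reinterpret the bits back to integers before unpacking, hence the suggestion to revert to plain int bitpacking after training.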
v0.1.4
HQQ v0.1.3.post1
New features
- `meta_offloading` support: allows offloading meta-data to the CPU, hence achieving true n-bit storage on the GPU.
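A hedged sketch of how meta-offloading is typically enabled through the quantization config. The `offload_meta` flag and import path follow HQQ's documented usage, but both should be checked against the installed version:

```python
# Hedged config sketch (verify against your hqq version):
# offload scale/zero meta-data to the CPU so only the n-bit
# packed weights occupy GPU memory.
from hqq.core.quantize import BaseQuantizeConfig, HQQLinear

quant_config = BaseQuantizeConfig(
    nbits=2,            # weight bit-width
    group_size=16,      # quantization group size
    offload_meta=True,  # keep meta-data (scale/zero) on the CPU
)
# hqq_layer = HQQLinear(float_linear_layer, quant_config=quant_config)
```
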