# [Feature request] Add support for Int4 data-type #5776
Edit: Original message: While implementing, I'm trying to remove this restriction, allowing int4 tensors of any shape, but I've encountered a few issues and I'm not sure of the best way to resolve them. For an even innermost dimension, the computation is simple:
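A minimal sketch of the even case, assuming `unpacked_dims` is the tensor's shape as a Python list:

```python
# With an even innermost dimension, two int4 values fit in each uint8 byte,
# so halving the last dimension is lossless.
packed_dims = unpacked_dims[:-1] + [unpacked_dims[-1] // 2]
```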
But for an odd innermost dimension this computation loses data: `unpacked_dims[-1] // 2` rounds down. I can think of a few solutions, but none is ideal:

1. Keep the restriction on the tensor shape (innermost dimension has to be even). Weights typically have even dimensions, so that's unlikely to pose a real-world usage problem, but it adds a restriction to the standard that's rooted in an implementation detail.
2. Store every int4 within an int8 element (unpacked). That is quite straightforward and allows any dimension to be used, but means that we will not benefit from the compression of model weights in ONNX.
3. Add padding to make the innermost dimension even, and pass this information through the custom data-type's description. When packing we can set the data-type according to the parity of the innermost dimension, and when unpacking we can check the data-type. It's functional, but it's hacky (see the sketch after this list).
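A numpy sketch of option 3's padding step; the helper name and the out-of-band size record are assumptions for illustration, not part of the actual proposal:

```python
import numpy as np

def pad_and_pack(unpacked: np.ndarray) -> tuple[np.ndarray, int]:
    """Pad an odd innermost dimension with a zero nibble, then pack."""
    orig_last = unpacked.shape[-1]
    if orig_last % 2 == 1:
        pad = [(0, 0)] * (unpacked.ndim - 1) + [(0, 1)]
        unpacked = np.pad(unpacked, pad)  # one extra zero element per row
    nib = unpacked.astype(np.uint8) & 0x0F
    packed = (nib[..., 0::2] << 4) | nib[..., 1::2]  # i_0 high, i_1 low
    # orig_last must travel with the tensor (e.g. in the data-type's
    # description) so the unpacker knows to drop the padding nibble.
    return packed, orig_last
```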
Another thing to consider is how this mechanism will be extended when adding future sub-byte data-types with different widths (e.g. int3, int6). Creative ideas would be appreciated!
If I understand correctly, this is essentially option 2 from your list (focusing specifically on the question of the numpy representation of an int4 tensor, not the TensorProto representation, etc.)?
Option 2 referred to saving the data uncompressed, in the TensorProto as well. The obvious downside here is that the benefit of compression is lost.
As you stated earlier, I think we can break this into two separate questions:
For numpy, keeping the int4 values uncompressed is reasonable from a pure specification perspective (e.g., in the reference implementation). However, I think sooner or later we might need utility methods that deal with the numpy representation and are used in production tools, where a compressed numpy representation would be beneficial too. In that situation, I think your option 3 might be a good choice. This is just for the sake of discussion (whether we add this now or later is a different question).
### Description

- Add INT4 and UINT4 quantized data types
- Support for packing and unpacking int4x2->byte
- Implementation of operators: Cast, CastLike, DequantizeLinear, QuantizeLinear
- Type support for non-compute operators: Constant, ConstantOfShape, Identity, Reshape, Shape, Size, If, Loop, Scan, Flatten, Pad, Squeeze, Unsqueeze, Transpose

### Motivation and Context

See details in issue #5776

Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>
Signed-off-by: galagam <ghubaraagam@nvidia.com>
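For illustration, a minimal sketch of exercising the new type support, assuming onnx >= 1.16 with opset 21; the tensor names and shapes are arbitrary:

```python
from onnx import TensorProto, helper

# A small graph that dequantizes an INT4 weight to float.
# TensorProto.INT4/UINT4 are assumed to exist (introduced by this change).
w = helper.make_tensor_value_info("w", TensorProto.INT4, [256, 256])
scale = helper.make_tensor_value_info("scale", TensorProto.FLOAT, [])
w_dq = helper.make_tensor_value_info("w_dq", TensorProto.FLOAT, [256, 256])

node = helper.make_node("DequantizeLinear", ["w", "scale"], ["w_dq"])
graph = helper.make_graph([node], "int4_dequant", [w, scale], [w_dq])
model = helper.make_model(graph, opset_imports=[helper.make_opsetid("", 21)])
```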
Implemented in #5811. Closing.
### System information

onnx v1.16.0 (main, top-of-tree)
### What is the problem that this feature solves?
LLMs dominate DL research and development today. Recent networks require tens of GBs of memory, and the main inference bottleneck is memory access (capacity and bandwidth). Recent papers show promising results using sub-byte data types, specifically weight-only quantization using int4.
The motivation behind weight-only quantization is improving performance (larger batches improve MMA utilization; reduced memory traffic benefits memory-bound generation) as well as enabling large models that would otherwise not fit on a single GPU.
By quantizing data to 4 bits, we can both reduce the model size and significantly accelerate memory-bound inference use cases.
### Alternatives considered
N/A
### Describe the feature
Every two int4 elements (i_0, i_1) will be packed into a single uint8, as follows:

```
buffer = (i_0 << 4) | (i_1 & 0x0F)
```

Tensors with an odd number of elements are not supported.
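A minimal numpy sketch of this packing scheme; the function names are illustrative, not part of the proposal:

```python
import numpy as np

def pack_int4(flat: np.ndarray) -> np.ndarray:
    """Pack an even-length array of int4 values (held in int8) into uint8."""
    assert flat.size % 2 == 0, "odd element counts are not supported"
    nib = flat.astype(np.uint8) & 0x0F
    return (nib[0::2] << 4) | nib[1::2]  # buffer = (i_0 << 4) | (i_1 & 0x0F)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Recover sign-extended int4 values into int8."""
    i0 = packed >> 4
    i1 = packed & 0x0F
    flat = np.stack([i0, i1], axis=-1).reshape(-1).astype(np.int8)
    return np.where(flat > 7, flat - 16, flat)  # 4-bit two's complement
```

For example, `pack_int4(np.array([1, -2], dtype=np.int8))` yields `0x1E`, and unpacking restores `[1, -2]`.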
### Will this influence the current API (Y/N)?
Yes
An additional data type will be available.
Adding int4 to a subset of the operators, including QuantizeLinear and DequantizeLinear (optionally: Shape, Size, Transpose, Reshape, Constant).
### Feature Area
data-types, operators
### Are you willing to contribute it (Y/N)?
Yes
### Notes
Relevant papers using int4 quantization for LLMs:

- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers