
[Feature request] Add support for Int4 data-type #5776

Closed
galagam opened this issue Nov 29, 2023 · 5 comments
Labels
enhancement Request for new feature or operator spec

Comments

@galagam
Contributor

galagam commented Nov 29, 2023

System information

onnx v1.16.0
main top-of-tree

What is the problem that this feature solves?

LLMs dominate deep-learning research and development today. Recent networks require tens of GBs of weights, and the main inference bottleneck is memory access (both capacity and bandwidth). Recent papers show promising results with sub-byte data types, specifically weight-only quantization using int4.
The motivation behind weight-only quantization is to improve performance (larger batches improve MMA utilization; lower memory-bandwidth demand benefits memory-bound generation) and to enable large models that otherwise could not execute on a single GPU.
By quantizing data to 4 bits, we can both reduce the model size and significantly accelerate memory-bound inference use cases.

Alternatives considered

N/A

Describe the feature

Every two int4 elements (i_0, i_1) will be packed into a single uint8, as follows:
buffer = (i_0 << 4) | (i_1 & 0x0F)

Tensors with an odd number of elements are not supported.
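For illustration, a minimal sketch of this packing scheme in plain Python (the function names are hypothetical, not the proposed ONNX helper API). Both nibbles are masked so that negative values fit in a single byte, and unpacking sign-extends each nibble back to a signed value:

def pack_int4_pair(i_0: int, i_1: int) -> int:
    # Pack two signed int4 values (range [-8, 7]) into one uint8, i_0 in the high nibble.
    return ((i_0 & 0x0F) << 4) | (i_1 & 0x0F)

def unpack_int4_pair(byte: int) -> tuple:
    # Unpack one uint8 into two sign-extended signed values.
    def sign_extend(nibble):
        return nibble - 16 if nibble & 0x08 else nibble
    return sign_extend((byte >> 4) & 0x0F), sign_extend(byte & 0x0F)

# Round trip: (-3, 5) -> 0xD5 -> (-3, 5)
assert unpack_int4_pair(pack_int4_pair(-3, 5)) == (-3, 5)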

  • Add data-type TensorProto.INT4
  • Add helper functions to pack and unpack int4
  • Add support in QuantizeLinear, DequantizeLinear and some shape ops

Will this influence the current API (Y/N)?

Yes
A new data type will be available.
Int4 support will be added to a subset of the operators, including QuantizeLinear and DequantizeLinear (optionally: Shape, Size, Transpose, Reshape, Constant).

Feature Area

data-types, operators

Are you willing to contribute it (Y/N)

Yes

Notes

Relevant papers using int4 quantization for LLMs:
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

@galagam galagam added the enhancement Request for new feature or operator label Nov 29, 2023
@galagam
Contributor Author

galagam commented Dec 4, 2023

Edit:
Found a way around this. Int4 will be packed when creating a tensor and unpacked when converting to a numpy array.
ReferenceEvaluator will use unpacked int4 (each int4 element stored as an np.int8 - sign extended).

Original message:
When I initially suggested this change, I stated that the innermost dimension has to be even. This restriction comes from the way int4 is stored: two int4 values (int4x2) packed into a single uint8.

While implementing, I'm trying to remove this restriction and allow int4 tensors of any shape, but I've encountered a few issues and I'm not sure what the best way to resolve them is.
Ideally, when packing data to int4, we'd store a flat buffer (adding a single element if the total number of elements is odd), and when unpacking we'd drop the padding element and reshape according to the original shape.
I can do just that with TensorProto, because it allows setting the dimensions independently of the size of the data buffer. However, ONNX's ReferenceEvaluator uses numpy.ndarray as input and output, so I'm looking for a solution that can be extended to NumPy ndarrays.
In numpy.ndarray, the shape and size attributes are not writeable, which means we need to preserve the original shape information somehow within the ndarray.

For an even innermost dimension, that's simple:

packing: packed_dims = unpacked_dims[:-1] + [unpacked_dims[-1] // 2]
unpacking: unpacked_dims = packed_dims[:-1] + [packed_dims[-1] * 2]

But for an odd innermost dimension this computation loses information: packing must round up, and unpacking then yields an even innermost dimension, so the original odd length cannot be recovered from the packed shape alone.
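To make the lossy case concrete, a small illustration (plain Python, illustrative only):

def pack_shape(unpacked_dims):
    # one byte holds two int4 values, so ceil-divide the innermost dimension by 2
    return list(unpacked_dims[:-1]) + [(unpacked_dims[-1] + 1) // 2]

def unpack_shape(packed_dims):
    # the inverse doubles the innermost dimension, which is therefore always even
    return list(packed_dims[:-1]) + [packed_dims[-1] * 2]

print(unpack_shape(pack_shape([4, 6])))  # [4, 6]  round trip preserved
print(unpack_shape(pack_shape([4, 5])))  # [4, 6]  the original odd length 5 is lost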

I can think of a few solutions, but none of them is ideal:

First option: keep the restriction on the tensor shape (the innermost dimension has to be even). Weights typically have even dimensions, so this is unlikely to pose a real-world usage problem, but it adds a restriction to the standard that is rooted in an implementation detail.

Second option: store every int4 within an int8 element (unpacked). That is quite straightforward and allows any shape, but it means we would not benefit from the compression of model weights in ONNX.

Third option: add padding to make the innermost dimension even, and pass this information through the custom data-type's description. When packing we can set the data-type according to the parity of the innermost dimension, and when unpacking we can check the data-type. It's functional, but hacky.
In onnx/reference/custom_element_types.py:

import numpy as np

bfloat16 = np.dtype((np.uint16, {"bfloat16": (np.uint16, 0)}))
float8e4m3fn = np.dtype((np.uint8, {"e4m3fn": (np.uint8, 0)}))
float8e4m3fnuz = np.dtype((np.uint8, {"e4m3fnuz": (np.uint8, 0)}))
float8e5m2 = np.dtype((np.uint8, {"e5m2": (np.uint8, 0)}))
float8e5m2fnuz = np.dtype((np.uint8, {"e5m2fnuz": (np.uint8, 0)}))
+ int4 = np.dtype((np.uint8, {"int4": (np.uint8, 0)}))
+ int4_padded = np.dtype((np.uint8, {"int4_padded": (np.uint8, 0)}))
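As a rough sketch of how option 3 could round-trip an odd innermost dimension using two dtypes like the ones above (the helper names below are hypothetical and only illustrate the idea; the dtypes are passed in as parameters):

import numpy as np

def pack_with_padding(unpacked, int4_dtype, int4_padded_dtype):
    # unpacked: int8 ndarray of sign-extended int4 values
    flat = unpacked.astype(np.int8).ravel()
    odd = flat.size % 2 == 1
    if odd:
        flat = np.append(flat, np.int8(0))  # pad with one dummy element
    hi = (flat[0::2].astype(np.uint8) & 0x0F) << 4
    lo = flat[1::2].astype(np.uint8) & 0x0F
    packed = (hi | lo).astype(np.uint8)
    # tag the buffer so unpacking knows whether a pad element must be dropped
    return packed.view(int4_padded_dtype if odd else int4_dtype)

def unpack_with_padding(packed, shape, int4_padded_dtype):
    raw = packed.view(np.uint8)
    hi = ((raw >> 4) & 0x0F).astype(np.int8)
    lo = (raw & 0x0F).astype(np.int8)
    # sign-extend the 4-bit values
    hi = np.where(hi > 7, hi - 16, hi).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo).astype(np.int8)
    flat = np.stack([hi, lo], axis=-1).ravel()
    if packed.dtype == int4_padded_dtype:
        flat = flat[:-1]  # drop the pad element
    return flat.reshape(shape)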

Another thing to consider is how this mechanism would be extended when adding future sub-byte data-types with different widths (e.g. int3, int6).

Creative ideas would be appreciated!

@gramalingam
Contributor

Found a way around this. Int4 will be packed when creating a tensor and unpacked when converting to a numpy array.
ReferenceEvaluator will use unpacked int4 (each int4 element stored as an np.int8 - sign extended).

If I understand correctly, this is essentially option 2 from your list (focusing specifically on the question of the numpy-representation of an int4 tensor, not the TensorProto representation etc.)?

@galagam
Contributor Author

galagam commented Dec 7, 2023

Found a way around this. Int4 will be packed when creating a tensor and unpacked when converting to a numpy array.
ReferenceEvaluator will use unpacked int4 (each int4 element stored as an np.int8 - sign extended).

If I understand correctly, this is essentially option 2 from your list (focusing specifically on the question of the numpy-representation of an int4 tensor, not the TensorProto representation etc.)?

Option 2 referred to saving the data uncompressed in the TensorProto as well. The obvious downside there is that the benefit of compression is lost.
However, I see now that the TensorProto representation is decoupled from the numpy representation, which is only used for operator reference testing. Therefore, keeping the int4 values uncompressed in numpy seems like an ideal solution.

@gramalingam
Contributor

As you stated earlier, I think we can break this into two separate questions:

  • The representation in a TensorProto
  • The numpy representation

For numpy, keeping the int4 values uncompressed is reasonable from a pure specification perspective (e.g., in the reference implementation). However, I think sooner or later we might need utility methods that deal with the numpy representation and are used in production tools, where a compressed numpy representation would be beneficial too. In that situation, I think your option 3 might be a good choice. This is just for the sake of discussion (whether we add this now or later is a different question).

github-merge-queue bot pushed a commit that referenced this issue Jan 8, 2024
### Description
- Add INT4 and UINT4 quantized data types
- Support for packing and unpacking int4x2->byte
- Implementation of Operators: Cast, CastLike, DequantizeLinear, QuantizeLinear
- Type support for non-compute operators: Constant, ConstantOfShape, Identity, Reshape, Shape, Size, If, Loop, Scan, Flatten, Pad, Squeeze, Unsqueeze, Transpose.

### Motivation and Context
See details in issue #5776

---------

Signed-off-by: Gal Hubara Agam <ghubaraagam@nvidia.com>
Signed-off-by: galagam <ghubaraagam@nvidia.com>
@galagam
Contributor Author

galagam commented Apr 8, 2024

Implemented in #5811. Closing.

@galagam galagam closed this as completed Apr 8, 2024